11 Mar

Python – Processing JSON Data

A JSON file stores data as text in a human-readable format. JSON stands for JavaScript Object Notation. Pandas can read JSON files using the read_json function.

Input Data

Create a JSON file by copying the data below into a text editor such as Notepad. Save the file with a .json extension, choosing All Files (*.*) as the file type.

{
"ID":["1","2","3","4","5","6","7","8" ],
"Name":["Raj","Sham","Tusar","Ram","Jaggu","Karishma","Pranab","Shami" ],
"Salary":["520.5","625.2","411.2","821","900.25","178.2","532.5","822.4" ],
"StartDate":[ "2/2/2012","10/24/2011","12/13/2019","6/4/2004","2/22/2017","6/21/2014",
"8/20/2003","1/18/2019"],
"Dept":[ "Software","Operations","Management","HR","Finance","Software","Operations","Finance"]
}
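
As an alternative to typing the file by hand, the same data can be written from Python. The following is a minimal sketch using the standard json module; the file name input.json and the use of a temporary directory are assumptions made for illustration, not part of the original steps.

```python
import json
import os
import tempfile

# Column-oriented data, matching the sample file
records = {
    "ID": ["1", "2", "3", "4", "5", "6", "7", "8"],
    "Name": ["Raj", "Sham", "Tusar", "Ram", "Jaggu", "Karishma", "Pranab", "Shami"],
    "Salary": ["520.5", "625.2", "411.2", "821", "900.25", "178.2", "532.5", "822.4"],
    "StartDate": ["2/2/2012", "10/24/2011", "12/13/2019", "6/4/2004",
                  "2/22/2017", "6/21/2014", "8/20/2003", "1/18/2019"],
    "Dept": ["Software", "Operations", "Management", "HR",
             "Finance", "Software", "Operations", "Finance"],
}

# Hypothetical location: a temporary directory instead of a fixed path
json_path = os.path.join(tempfile.mkdtemp(), "input.json")
with open(json_path, "w", encoding="utf-8") as fh:
    json.dump(records, fh, indent=2)

# Reading the file back with json.load fails loudly on syntax errors
with open(json_path, encoding="utf-8") as fh:
    loaded = json.load(fh)

# Every column should have the same number of entries
lengths = {name: len(values) for name, values in loaded.items()}
print(lengths)
```

Writing the file with json.dump rules out syntax slips such as a missing comma between two keys, which would otherwise only surface when the file is read.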

Read the JSON File

The read_json function of the pandas library can be used to read the JSON file into a pandas DataFrame.

import pandas as pd
D = pd.read_json('C:/programFiles/path/input.json')
print(D)

When we execute the above code, it produces the following result. The column order may differ between pandas versions.

         Dept  ID      Name  Salary   StartDate
0    Software   1       Raj  520.50    2/2/2012
1  Operations   2      Sham  625.20  10/24/2011
2  Management   3     Tusar  411.20  12/13/2019
3          HR   4       Ram  821.00    6/4/2004
4     Finance   5     Jaggu  900.25   2/22/2017
5    Software   6  Karishma  178.20   6/21/2014
6  Operations   7    Pranab  532.50   8/20/2003
7     Finance   8     Shami  822.40   1/18/2019
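
Every value in the sample file is a quoted string, so it can help to control how the columns are converted. The sketch below reads a shortened copy of the data from an in-memory string (an assumption, so that no file path is needed), switches off automatic type inference, and converts each column explicitly.

```python
import io
import pandas as pd

# A shortened, in-memory copy of the sample data (assumption for illustration)
raw = io.StringIO("""
{
  "ID": ["1", "2", "3"],
  "Name": ["Raj", "Sham", "Tusar"],
  "Salary": ["520.5", "625.2", "411.2"],
  "StartDate": ["2/2/2012", "10/24/2011", "12/13/2019"]
}
""")

# dtype=False and convert_dates=False switch off automatic conversion
df = pd.read_json(raw, dtype=False, convert_dates=False)

# Convert each column explicitly so the resulting types are predictable
df["ID"] = df["ID"].astype(int)
df["Salary"] = pd.to_numeric(df["Salary"])
df["StartDate"] = pd.to_datetime(df["StartDate"], format="%m/%d/%Y")

print(df.dtypes)
```

Stating the date format explicitly avoids any ambiguity between month/day and day/month order.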

Reading Specific Columns and Rows

As with the CSV file in the previous chapter, once the JSON file has been read into a DataFrame we can select specific columns and rows from it.
We use the multi-axes indexer .loc for this purpose. Here we display the Salary and Name columns for some of the rows.

import pandas as pd
data = pd.read_json('path/input.json')
# Use the multi-axes indexer to pick rows 1, 3 and 5 and two columns
print(data.loc[[1,3,5],['Salary','Name']])

When we execute the above code, it produces the following result.

   Salary      Name
1   625.2      Sham
3   821.0       Ram
5   178.2  Karishma
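
The difference between label-based and position-based selection is easy to miss when the index is the default 0, 1, 2 and so on. The sketch below builds a small DataFrame in memory and contrasts .loc with .iloc; the re-labelled index is a hypothetical choice made only to show the difference.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Raj", "Sham", "Tusar", "Ram"],
        "Salary": [520.5, 625.2, 411.2, 821.0],
    },
    index=[10, 11, 12, 13],  # hypothetical non-default row labels
)

# .loc selects by row label and column name
by_label = df.loc[[11, 13], ["Salary", "Name"]]

# .iloc selects by integer position, regardless of the labels
by_position = df.iloc[[1, 3], [1, 0]]

print(by_label)
print(by_label.equals(by_position))
```

Both selections return the rows for Sham and Ram, because labels 11 and 13 sit at positions 1 and 3.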

Reading JSON file as Records

We can also apply the to_json function with the orient='records' and lines=True parameters to write the DataFrame content out as individual records, one JSON object per line.

import pandas as pd
D = pd.read_json('C:/programfiles/path/input.json')
print(D.to_json(orient='records', lines=True))

When we execute the above code, it produces the following result. Note that to_json escapes each forward slash as \/, which is still valid JSON.

{"Dept":"Software","ID":1,"Name":"Raj","Salary":520.5,"StartDate":"2\/2\/2012"}
{"Dept":"Operations","ID":2,"Name":"Sham","Salary":625.2,"StartDate":"10\/24\/2011"}
{"Dept":"Management","ID":3,"Name":"Tusar","Salary":411.2,"StartDate":"12\/13\/2019"}
{"Dept":"HR","ID":4,"Name":"Ram","Salary":821.0,"StartDate":"6\/4\/2004"}
{"Dept":"Finance","ID":5,"Name":"Jaggu","Salary":900.25,"StartDate":"2\/22\/2017"}
{"Dept":"Software","ID":6,"Name":"Karishma","Salary":178.2,"StartDate":"6\/21\/2014"}
{"Dept":"Operations","ID":7,"Name":"Pranab","Salary":532.5,"StartDate":"8\/20\/2003"}
{"Dept":"Finance","ID":8,"Name":"Shami","Salary":822.4,"StartDate":"1\/18\/2019"}
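
Output written with orient='records' and lines=True can be read back by passing lines=True to read_json. The following is a minimal round-trip sketch on a small in-memory DataFrame; the three sample rows are an assumption for illustration and no file is involved.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Name": ["Raj", "Sham", "Tusar"],
        "Salary": [520.5, 625.2, 411.2],
    }
)

# One JSON object per line, one line per row
json_lines = df.to_json(orient="records", lines=True)
print(json_lines)

# lines=True tells read_json to parse each line as a separate record
restored = pd.read_json(io.StringIO(json_lines), lines=True)
print(restored)
```

This line-per-record layout is convenient for log-style files, because records can be appended without rewriting the whole document.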

Python - Processing XLS Data

Microsoft Excel is a very widely used spreadsheet program. Its user-friendliness and appealing features make it a very frequently used tool in data science.
The pandas library provides features with which we can read an Excel file in full or in part, for only a selected group of data. We can also read an Excel file with multiple sheets in it. We use the read_excel function to read the data from it.

Input as Excel File

We create an Excel file with multiple sheets on Windows. The data in the different sheets is shown below.
You can create this file using the Excel program. Save the file as input.xlsx.

Data in Sheet1

id,name,salary,start_date,dept
1,Raj,520.5,2/2/2012,Software
2,Sham,625.2,10/24/2011,Operations
3,Tusar,411.2,12/13/2019,Management
4,Ram,821,6/4/2004,HR
5,Jaggu,900.25,2/22/2017,Finance
6,Karishma,178.2,8/20/2003,Software
7,Pranab,532.5,1/18/2019,Operations
8,Shami,822.4,6/17/2014,Finance

Data in Sheet2

id name zipcode
1 Raj 602225
2 Sham 341255
3 Tusar 602326
4 Ram 222568
5 Jaggu 438700
6 Karishma 224556
7 Pranab 341211
8 Shami 669875

Reading an Excel File

The read_excel function of the pandas library is used to read the content of an Excel file into the Python environment as a pandas DataFrame. The function can read the file from the OS given the proper path to it. By default, the function reads the first sheet, here Sheet1.

import pandas as pd
D = pd.read_excel('C:/localDisc/path/input.xlsx')
print(D)

When we execute the above code, it produces the following result. Note that the function has created an additional index column starting at zero.

   id      name  salary  start_date        dept
0   1       Raj  520.50    2/2/2012    Software
1   2      Sham  625.20  10/24/2011  Operations
2   3     Tusar  411.20  12/13/2019  Management
3   4       Ram  821.00    6/4/2004          HR
4   5     Jaggu  900.25   2/22/2017     Finance
5   6  Karishma  178.20   8/20/2003    Software
6   7    Pranab  532.50   1/18/2019  Operations
7   8     Shami  822.40   6/17/2014     Finance

Reading Specific Columns and Rows

As with the CSV file in the previous chapter, once the Excel file has been read into a DataFrame we can select specific columns and rows from it.
We use the multi-axes indexer .loc for this purpose. Here we display the salary and name columns for some of the rows.

import pandas as pd
D = pd.read_excel('C:/localDisc/path/input.xlsx')
# Use the multi-axes indexer to pick rows 1, 3 and 5 and two columns
print(D.loc[[1,3,5],['salary','name']])

When we execute the above code, it produces the following result.

   salary      name
1   625.2      Sham
3   821.0       Ram
5   178.2  Karishma

Reading Multiple Excel Sheets

Multiple sheets with different data formats can also be read by using the read_excel function with the help of a wrapper class named ExcelFile. The file is then read into memory only once.
In the example below we read Sheet1 and Sheet2 into two DataFrames and print the first five rows of one column from each.

import pandas as pd
with pd.ExcelFile('C:/Users/Rasmi/Documents/pydatasci/input.xlsx') as xls:
    df1 = pd.read_excel(xls, sheet_name='Sheet1')
    df2 = pd.read_excel(xls, sheet_name='Sheet2')
print("Result Sheet 1")
print(df1[0:5]['salary'])
print("")
print("Result Sheet 2")
print(df2[0:5]['zipcode'])

When we execute the above code, it produces the following result.

Result Sheet 1
0    520.50
1    625.20
2    411.20
3    821.00
4    900.25
Name: salary, dtype: float64

Result Sheet 2
0    602225
1    341255
2    602326
3    222568
4    438700
Name: zipcode, dtype: int64

Python - Relational Databases

We can connect to relational databases to analyse data using the pandas library together with an additional library that implements database connectivity.
This package is named SQLAlchemy; it provides full SQL language functionality for use in Python.

Installing SQLAlchemy

The installation is straightforward using Anaconda, which we discussed in the chapter Data Science Environment. Assuming you have installed Anaconda as described in that chapter,
run the following command in the Anaconda Prompt window to install the SQLAlchemy package.

conda install sqlalchemy

Reading Relational Tables

We will use SQLite as our relational database as it is very lightweight and easy to use, though the SQLAlchemy library can connect to a variety of relational sources including MySQL, Oracle, PostgreSQL and MS SQL Server.
We first create a database engine with the create_engine function of the SQLAlchemy library and then pass that engine to pandas.
In the example below we create the relational table by calling the to_sql function on a DataFrame already created by reading a CSV file.
Then we use the read_sql_query function from pandas to execute various SQL queries and capture their results.

from sqlalchemy import create_engine
import pandas as pd
D = pd.read_csv('C:/programfiles/path/input.csv')
# Create the db engine
eng = create_engine('sqlite:///:memory:')
# Store the dataframe as a table
D.to_sql('data_table', eng)
# Query 1 on the relational table
result1 = pd.read_sql_query('SELECT * FROM data_table', eng)
print('Result 1')
print(result1)
print('')
# Query 2 on the relational table
result2 = pd.read_sql_query('SELECT dept,sum(salary) FROM data_table group by dept', eng)
print('Result 2')
print(result2)

When we execute the above code, it produces the following result.

Result 1
   index  id      name  salary  start_date        dept
0      0   1       Raj  520.50    2/2/2012    Software
1      1   2      Sham  625.20  10/24/2011  Operations
2      2   3     Tusar  411.20  12/13/2019  Management
3      3   4       Ram  821.00    6/4/2004          HR
4      4   5     Jaggu  900.25   2/22/2017     Finance
5      5   6  Karishma  578.96  2013-05-21    Software
6      6   7    Pranab  532.50   1/18/2019  Operations
7      7   8     Shami  822.40   6/17/2014     Finance

Result 2
         dept  sum(salary)
0     Finance      1722.65
1          HR       821.00
2  Management       411.20
3  Operations      1157.70
4    Software      1099.46
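
pandas can also talk to SQLite through the sqlite3 module of the Python standard library, without SQLAlchemy. The sketch below repeats the grouping query on a small in-memory DataFrame. The four sample rows and the total_salary alias are assumptions for illustration, and index=False is used so that the DataFrame index is not stored as an extra column.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "name": ["Raj", "Sham", "Pranab", "Shami"],
        "salary": [520.5, 625.2, 532.5, 822.4],
        "dept": ["Software", "Operations", "Operations", "Finance"],
    }
)

# An in-memory SQLite database opened with the standard library
conn = sqlite3.connect(":memory:")

# index=False avoids storing the DataFrame index as an extra column
df.to_sql("data_table", conn, index=False)

# Sum the salaries per department, sorted by department name
totals = pd.read_sql_query(
    "SELECT dept, SUM(salary) AS total_salary "
    "FROM data_table GROUP BY dept ORDER BY dept",
    conn,
)
print(totals)
conn.close()
```

Giving the aggregate an alias yields a column name that is easier to use in later pandas code than sum(salary).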

Inserting Data to Relational Tables

We can also insert data into relational tables using the sql.execute function available in pandas. In the code below we use the previous CSV file as the input data set, store it in a relational table and then
insert another record using sql.execute.

from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd
D = pd.read_csv('C:/Admin/Karishma/Documents/pydatasci/input.csv')
eng = create_engine('sqlite:///:memory:')
# Store the Data in a relational table
D.to_sql('data_table', eng)
# Insert another row; the first value fills the index column added by to_sql
sql.execute('INSERT INTO data_table VALUES(?,?,?,?,?,?)', eng, params=[(8,9,'Rubika',628.78,'2014-01-17','Software')])
# Read from the relational table
result = pd.read_sql_query('SELECT id,dept,name,salary,start_date FROM data_table', eng)
print(result)

When we execute the above code, it produces the following result.

   id        dept      name  salary  start_date
0   1    Software       Raj  520.50    2/2/2012
1   2  Operations      Sham  625.20  10/24/2011
2   3  Management     Tusar  411.20  12/13/2019
3   4          HR       Ram  821.00    6/4/2004
4   5     Finance     Jaggu  900.25   2/22/2017
5   6    Software  Karishma  578.96  2013-05-21
6   7  Operations    Pranab  532.50   1/18/2019
7   8     Finance     Shami  822.40   6/17/2014
8   9    Software    Rubika  628.78  2014-01-17
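
Naming the target columns in the INSERT statement avoids having to supply a value for every column in table order. The following is a minimal sketch using only the sqlite3 module of the standard library; the table layout is a simplified assumption and not the one produced by to_sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_table "
    "(id INTEGER, name TEXT, salary REAL, start_date TEXT, dept TEXT)"
)
conn.execute(
    "INSERT INTO data_table VALUES (?, ?, ?, ?, ?)",
    (8, "Shami", 822.4, "6/17/2014", "Finance"),
)

# Named columns plus ? placeholders: the driver binds the values safely
new_row = (9, "Rubika", 628.78, "2014-01-17", "Software")
conn.execute(
    "INSERT INTO data_table (id, name, salary, start_date, dept) "
    "VALUES (?, ?, ?, ?, ?)",
    new_row,
)
conn.commit()

rows = conn.execute("SELECT id, name FROM data_table ORDER BY id").fetchall()
print(rows)
conn.close()
```

Placeholders keep the values separate from the SQL text, which guards against quoting mistakes and SQL injection.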

Deleting Data from Relational Tables

We can also delete data from relational tables using the sql.execute function available in pandas. The code below deletes a row based on the given input condition.

from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd
D = pd.read_csv('C:/Admin/Karishma/Documents/pydatasci/input.csv')
eng = create_engine('sqlite:///:memory:')
D.to_sql('data_table', eng)
# Delete the row whose name matches the parameter
sql.execute('DELETE FROM data_table WHERE name = (?)', eng, params=[('Pranab',)])
result = pd.read_sql_query('SELECT id,dept,name,salary,start_date FROM data_table', eng)
print(result)

When we execute the above code, it produces the following result.

   id        dept      name  salary  start_date
0   1    Software       Raj  520.50    2/2/2012
1   2  Operations      Sham  625.20  10/24/2011
2   3  Management     Tusar  411.20  12/13/2019
3   4          HR       Ram  821.00    6/4/2004
4   5     Finance     Jaggu  900.25   2/22/2017
5   6    Software  Karishma  578.96  2013-05-21
6   8     Finance     Shami  822.40   6/17/2014
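
When deleting, it is useful to check how many rows were actually removed, since a condition that matches nothing raises no error. The following is a minimal sketch using the sqlite3 module of the standard library, with a simplified table that is assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO data_table VALUES (?, ?)",
    [(6, "Karishma"), (7, "Pranab"), (8, "Shami")],
)

# A one-element tuple needs the trailing comma: ("Pranab",)
cursor = conn.execute("DELETE FROM data_table WHERE name = ?", ("Pranab",))
deleted = cursor.rowcount  # number of rows removed by the statement

# A name that is not in the table deletes nothing and raises no error
missing = conn.execute(
    "DELETE FROM data_table WHERE name = ?", ("Gary",)
).rowcount
conn.commit()

remaining = [
    name for (name,) in conn.execute("SELECT name FROM data_table ORDER BY id")
]
print(deleted, missing, remaining)
conn.close()
```

Checking rowcount after a DELETE is a cheap way to notice a misspelled name or a condition that silently matched no rows.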