Python - Data Wrangling

11 Mar

Data wrangling is the process of transforming raw data, through operations such as merging, grouping, and concatenating, so that it is ready for analysis or for use alongside another data set.
In Python these operations are provided by the pandas library. This chapter looks at a few examples of each.

Merging Data

The pandas library provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects −

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False)

Let us now create two DataFrames that can serve as the left and right inputs of a merge.

# import the pandas library
import pandas as pd

# two frames that share the column names id, Name and subject_id
left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Ram', 'Sham', 'Aryan', 'Ayush', 'Shatrugun'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Balram', 'Krishna', 'Riddhi', 'Siddhi', 'Ganesh'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# print() is a function in Python 3
print(left)
print(right)

Its output is as follows −

   id       Name subject_id
0   1        Ram       sub1
1   2       Sham       sub2
2   3      Aryan       sub4
3   4      Ayush       sub6
4   5  Shatrugun       sub5
   id     Name subject_id
0   1   Balram       sub2
1   2  Krishna       sub4
2   3   Riddhi       sub3
3   4   Siddhi       sub6
4   5   Ganesh       sub5
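
The two frames can then be passed to merge. The following is a minimal sketch rather than a definitive recipe: the variable names by_id, by_id_and_subject and left_join are illustrative, and the choice of join keys is an assumption made for demonstration. It shows an inner join on one key, an inner join on two keys, and a left join.

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Ram', 'Sham', 'Aryan', 'Ayush', 'Shatrugun'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Balram', 'Krishna', 'Riddhi', 'Siddhi', 'Ganesh'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Inner join on a single key. Every id occurs in both frames, so all five
# rows survive; overlapping non-key columns receive _x / _y suffixes.
by_id = pd.merge(left, right, on='id')

# Inner join on two keys: a row is kept only when both id and subject_id
# agree between the two frames.
by_id_and_subject = pd.merge(left, right, on=['id', 'subject_id'])

# Left join on subject_id: every row of `left` is kept, and rows without a
# partner in `right` are filled with NaN.
left_join = pd.merge(left, right, on='subject_id', how='left')

print(by_id)
print(by_id_and_subject)
print(left_join)
```

The how parameter also accepts 'right' and 'outer', which keep all rows of the right frame or of both frames respectively.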

Grouping Data

Grouping is a frequent need in data analysis, where results must be reported for each group present in the data set. Pandas has built-in methods that split the data into such groups.
In the example below we group the data by year and then retrieve the rows for a specific year.

# import the pandas library
import pandas as pd

ipl_data = {
    'Team': ['Royals', 'Royals', 'Riders', 'Devils', 'Riders', 'Kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [856, 781, 847, 669, 701, 899, 724, 763, 612, 756, 859, 699]}
database = pd.DataFrame(ipl_data)

# split the rows into one group per distinct Year value
grouped = database.groupby('Year')

# select only the rows that belong to the 2014 group
print(grouped.get_group(2014))

Its output is as follows −

     Team  Rank  Year  Points
0  Royals     1  2014     856
2  Riders     2  2014     847
4  Riders     3  2014     701
9  Royals     4  2014     756
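
Beyond selecting a single group, a grouped object is usually reduced to one summary value per group. The sketch below reuses the sample data (with 'Kings' capitalised consistently) and is only illustrative: the names mean_points and summary are assumptions, as is the choice of mean, sum and max as the statistics of interest.

```python
import pandas as pd

ipl_data = {
    'Team': ['Royals', 'Royals', 'Riders', 'Devils', 'Riders', 'Kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [856, 781, 847, 669, 701, 899, 724, 763, 612, 756, 859, 699]}
database = pd.DataFrame(ipl_data)
grouped = database.groupby('Year')

# Iterating a GroupBy yields (key, sub-frame) pairs, one per year.
for year, group in grouped:
    print(f"{year}: {len(group)} rows")

# Reduce one column to a single value per group.
mean_points = grouped['Points'].mean()

# Apply several aggregations at once; the result has one column per statistic.
summary = grouped['Points'].agg(['sum', 'max'])

print(mean_points)
print(summary)
```

Passing a list of column names to groupby, for example ['Year', 'Team'], groups by each distinct combination of those columns instead.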

Concatenating Data

Pandas provides various facilities for combining Series and DataFrame objects. In the example below, the concat function joins objects along an axis (rows by default). Let us create two DataFrames and concatenate them.

import pandas as pd

one = pd.DataFrame({
    'Name': ['Ram', 'Sham', 'Aryan', 'Ayush', 'Shatrugun'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
    'Marks_scored': [100, 98, 81, 79, 96]},
    index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
    'Name': ['Balram', 'Krishna', 'Riddhi', 'Siddhi', 'Ganesh'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
    'Marks_scored': [96, 100, 94, 95, 99]},
    index=[1, 2, 3, 4, 5])

# stack the rows of `two` below the rows of `one`; index labels are kept as-is
print(pd.concat([one, two]))

Its output is as follows −

        Name subject_id  Marks_scored
1        Ram       sub1           100
2       Sham       sub2            98
3      Aryan       sub4            81
4      Ayush       sub6            79
5  Shatrugun       sub5            96
1     Balram       sub2            96
2    Krishna       sub4           100
3     Riddhi       sub3            94
4     Siddhi       sub6            95
5     Ganesh       sub5            99
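
Note that the index labels 1 to 5 appear twice in the result. The sketch below shows three common variations on the same call; the variable names stacked, labelled and side_by_side are illustrative assumptions, and which variation is appropriate depends on how the result will be used.

```python
import pandas as pd

one = pd.DataFrame({
    'Name': ['Ram', 'Sham', 'Aryan', 'Ayush', 'Shatrugun'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
    'Marks_scored': [100, 98, 81, 79, 96]},
    index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
    'Name': ['Balram', 'Krishna', 'Riddhi', 'Siddhi', 'Ganesh'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
    'Marks_scored': [96, 100, 94, 95, 99]},
    index=[1, 2, 3, 4, 5])

# ignore_index=True discards the original labels and numbers the rows 0..9.
stacked = pd.concat([one, two], ignore_index=True)

# keys adds an outer index level recording which input each row came from.
labelled = pd.concat([one, two], keys=['one', 'two'])

# axis=1 places the frames side by side, aligning rows on the index labels.
side_by_side = pd.concat([one, two], axis=1)

print(stacked)
print(labelled)
print(side_by_side)
```

With keys set, the rows of a single input can be recovered later, for example with labelled.loc['two'].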