Python - Processing CSV Data

11 Mar

Python – Processing CSV Data

Reading data from CSV(comma separated values) is a fundamental necessity in Data Science. Often, we get data from various sources which can get exported to CSV format so that they can be used by other systems. The Panadas library provides features using which we can read the CSV file in full as well as in parts for only a selected group of columns and rows.

Input as CSV File

The csv file is a text file in which the values in the columns are separated by a comma. Let's consider the following data present in the file named input.csv.
You can create this file using windows notepad by copying and pasting this data. Save the file as input.csv using the save As All files(.) option in notepad.

id,name,salary,start_date,dept
1,Raj,597.28,2012-08-01,Software
2,Sham,614.36,2015-09-23,Operations
3,Tusar,589.65,2013-01-16,Management
4,Ram,478.23,2013-04-11,HR
5,Jaggu,987.15,2018-02-27,Finance
6,Karishma,678.31,2021-06-11,Software
7,Pranab,699.89 ,2011-06-30,Operations
8,Shami,822.47,2011-08-11,Finance

Reading a CSV File

The read_csv function of the pandas library is used read the content of a CSV file into the python environment as a pandas DataFrame. The function can read the files from the OS by using proper path to the file.

import pandas as pd
D = pd.read_csv('C:/path/input.csv')
print (D)

When we execute the above code, it produces the following result. Please note how an additional column starting with zero as a index has been created by the function.

id    name  salary  start_date        dept
0   1    Raj  597.28  2012-08-01          Software
1   2     Sham  614.36  2015-09-23 Operations
2   3   Tusar  589.65  2013-01-16         Management
3   4    Ram  478.23  2013-04-11         HR
4   5    Jaggu  987.15  2018-02-27     Finance
5   6   Karishma  678.31  2021-06-11          Software
6   7  Pranab  699.89  2011-06-30  Operations
7   8    Shami  822.47  2011-08-11     Finance

Reading Specific Rows

The read_csv function of the pandas library can also be used to read some specific rows for a given column. We slice the result from the read_csv function using the code shown below for first 5 rows for the column named salary.

import pandas as pd
D = pd.read_csv('C:/programFiles/path/input.csv')
# Slice the result for first 5 rows
print (D[0:5]['salary'])

When we execute the above code, it produces the following result.

0    597.28
1    614.36
2    589.65
3    478.23
4    987.15
Name: salary, dtype: float64

Reading Specific Columns

The read_csv function of the pandas library can also be used to read some specific columns.
We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for all the rows.

import pandas as pd
D = pd.read_csv('C:/programFiles/path/input.csv')
# Use the multi-axes indexing funtion
print (D.loc[:,['salary','name']])

When we execute the above code, it produces the following result.

salary    name
0  597.28    Raj
1  614.36     Sham
2  589.65   Tusar
3  478.23    Ram
4  987.15    Jaggu
5  678.31   Karishma
6  699.89  Pranab
7  822.47    Shami

Reading Specific Columns and Rows

The read_csv function of the pandas library can also be used to read some specific columns and specific rows.
We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for some of the rows.

import pandas as pd
D = pd.read_csv('C:/programFiles/path/input.csv')
# Use the multi-axes indexing funtion
print (D.loc[[1,3,5],['salary','name']])

When we execute the above code, it produces the following result.

salary   name
1   614.36    Sham
3   478.23   Ram
5   678.31  Karishma

Reading Specific Columns for a Range of Rows

The read_csv function of the pandas library can also be used to read some specific columns and a range of rows.
We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for some of the rows.

import pandas as pd
D = pd.read_csv('C:/programFiles/path/input.csv')
# Use the multi-axes indexing funtion
print (D.loc[2:6,['salary','name']])

When we execute the above code, it produces the following result.

salary    name
2  589.65   Tusar
3  478.23    Ram
4  987.15    Jaggu
5  678.31   Karishma
6  699.89  Pranab