Python - Data Aggregation

11 Mar

Python offers several ways to aggregate data, most commonly through the pandas and NumPy libraries. The data must already be in a DataFrame, or be converted to
one, before the aggregation functions can be applied.
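As a minimal sketch of what an aggregation looks like, the snippet below reduces each column of a small DataFrame to its sum and mean. The `sales` data and its column names are hypothetical, chosen only so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical sample data, not taken from the article
sales = pd.DataFrame({"units": [1, 2, 3, 4],
                      "price": [10.0, 20.0, 30.0, 40.0]})

# Apply two aggregation functions to every column at once;
# the result has one row per function and one column per input column
summary = sales.agg(["sum", "mean"])
print(summary)
```

Passing function names as strings lets pandas pick its own built-in implementations of each aggregation.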

Applying Aggregations on a DataFrame

Let us create a DataFrame and a rolling window over it. The aggregations in the following sections are applied to that window.

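import pandas as pd
import numpy as np

# 10 rows of random values, indexed by consecutive days
database = pd.DataFrame(np.random.randn(10, 4),
                        index=pd.date_range('1/1/2020', periods=10),
                        columns=['A', 'B', 'C', 'D'])
print(database)

# Rolling window over 3 rows; min_periods=1 lets the first rows use fewer
r = database.rolling(window=3, min_periods=1)
print(r)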

Its output is as follows (the values are random, so every run prints different numbers):

                    A           B           C           D
2020-01-01   1.946546   -0.032165   -2.566516   -0.606506
2020-01-02   0.065644   -0.698798   -0.065460    0.654066
2020-01-03  -0.890656   -0.065168    0.964065   -2.650465
2020-01-04   1.897906    1.984065   -0.984066    1.987684
2020-01-05   0.987065   -0.065654    0.897650   -0.890646
2020-01-06   0.897406    0.031664   -1.984650    0.121640
2020-01-07   0.984650   -0.987106   -1.987065    0.804650
2020-01-08   0.984006   -1.894065    0.984065   -1.065064
2020-01-09   1.980656   -0.056497    0.650652   -0.894056
2020-01-10   0.260569    1.984065    0.206054   -1.894560
Rolling [window=3,min_periods=1,center=False,axis=0]
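To make the effect of `window` and `min_periods` concrete, here is a small sketch on fixed values rather than random ones. The Series `values` is an assumption introduced for illustration; only the `rolling` parameters come from the example above.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# min_periods=1: a result is produced as soon as one value is in the window,
# so the first two sums cover fewer than three values
partial = values.rolling(window=3, min_periods=1).sum()

# Without min_periods, an integer window requires a full window,
# so the first two positions are NaN
strict = values.rolling(window=3).sum()

print(partial.tolist())
print(strict.tolist())
```

With `min_periods=1` every row gets a value, which is why the rolling results in this article have no missing entries at the top.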

We can aggregate by passing a function to the whole rolling object, or first select one or more columns with the standard indexing syntax, for example r['A'].

Apply Aggregation on a Whole DataFrame

import pandas as pd
import numpy as np

database = pd.DataFrame(np.random.randn(10, 4),
                        index=pd.date_range('1/1/2020', periods=10),
                        columns=['A', 'B', 'C', 'D'])
print(database)

# Rolling sum of every column over a 3-row window
r = database.rolling(window=3, min_periods=1)
print(r.aggregate(np.sum))

Its output is as follows. The first table is the original data and the second is the rolling sum:

                    A           B           C           D
2020-01-01   1.946546   -0.032165   -2.566516   -0.606506
2020-01-02   0.065644   -0.698798   -0.065460    0.654066
2020-01-03  -0.890656   -0.065168    0.964065   -2.650465
2020-01-04   1.897906    1.984065   -0.984066    1.987684
2020-01-05   0.987065   -0.065654    0.897650   -0.890646
2020-01-06   0.897406    0.031664   -1.984650    0.121640
2020-01-07   0.984650   -0.987106   -1.987065    0.804650
2020-01-08   0.984006   -1.894065    0.984065   -1.065064
2020-01-09   1.980656   -0.056497    0.650652   -0.894056
2020-01-10   0.260569    1.984065    0.206054   -1.894560
                    A           B           C           D
2020-01-01   1.946546   -0.032165   -2.566516   -0.606506
2020-01-02   2.012190   -0.730963   -2.631976    0.047560
2020-01-03   1.121534   -0.796131   -1.667911   -2.602905
2020-01-04   1.072894    1.220099   -0.085461   -0.008715
2020-01-05   1.994315    1.853243    0.877649   -1.553427
2020-01-06   3.782377    1.950075   -2.071066    1.218678
2020-01-07   2.869121   -1.021096   -3.074065    0.035644
2020-01-08   2.866062   -2.849507   -2.987650   -0.138774
2020-01-09   3.949312   -2.937668   -0.352348   -1.154470
2020-01-10   3.225231    0.033503    1.840771   -3.853680
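One way to check what the rolling sum computes is to rebuild it by hand. The sketch below uses a small fixed frame (hypothetical data, not the random frame above) and compares the rolling result with the current row plus the two rows before it.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the sums can be verified by eye
frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2020-01-01", periods=4),
)

rolled = frame.rolling(window=3, min_periods=1).aggregate("sum")

# Same result built by hand: current row plus the two previous rows,
# treating rows before the start of the data as zero
manual = frame + frame.shift(1).fillna(0) + frame.shift(2).fillna(0)

print(rolled)
print(np.allclose(rolled, manual))
```

Each output row is the sum of at most three input rows, so the first row always equals the first input row when `min_periods=1`.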

Apply Aggregation on a Single Column of a DataFrame

import pandas as pd
import numpy as np

database = pd.DataFrame(np.random.randn(10, 4),
                        index=pd.date_range('1/1/2020', periods=10),
                        columns=['A', 'B', 'C', 'D'])
print(database)

# Select column A from the rolling object, then aggregate it
r = database.rolling(window=3, min_periods=1)
print(r['A'].aggregate(np.sum))

Its output is as follows. The table is the original data, followed by the rolling sum of column A:

                    A           B           C           D
2020-01-01   1.946546   -0.032165   -2.566516   -0.606506
2020-01-02   0.065644   -0.698798   -0.065460    0.654066
2020-01-03  -0.890656   -0.065168    0.964065   -2.650465
2020-01-04   1.897906    1.984065   -0.984066    1.987684
2020-01-05   0.987065   -0.065654    0.897650   -0.890646
2020-01-06   0.897406    0.031664   -1.984650    0.121640
2020-01-07   0.984650   -0.987106   -1.987065    0.804650
2020-01-08   0.984006   -1.894065    0.984065   -1.065064
2020-01-09   1.980656   -0.056497    0.650652   -0.894056
2020-01-10   0.260569    1.984065    0.206054   -1.894560
2020-01-01   1.946546
2020-01-02   2.012190
2020-01-03   1.121534
2020-01-04   1.072894
2020-01-05   1.994315
2020-01-06   3.782377
2020-01-07   2.869121
2020-01-08   2.866062
2020-01-09   3.949312
2020-01-10   3.225231
Freq: D, Name: A, dtype: float64
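A selected column can also be aggregated with more than one function at a time. The following is a sketch under assumed data: the frame, the window size of 2, and the choice of `sum` and `mean` are illustrative and not part of the example above.

```python
import pandas as pd

# Hypothetical fixed data
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                      "B": [4.0, 3.0, 2.0, 1.0]})
r = frame.rolling(window=2, min_periods=1)

# A list of function names yields one result column per function
stats = r["A"].aggregate(["sum", "mean"])
print(stats)
```

The result is a DataFrame whose columns are named after the functions, so the rolling sum and rolling mean of `A` sit side by side.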

Apply Aggregation on Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

database = pd.DataFrame(np.random.randn(10, 4),
                        index=pd.date_range('1/1/2020', periods=10),
                        columns=['A', 'B', 'C', 'D'])
print(database)

# Select columns A and B with a list, then aggregate both
r = database.rolling(window=3, min_periods=1)
print(r[['A', 'B']].aggregate(np.sum))

Its output is as follows. The first table is the original data and the second is the rolling sum for the selected columns:

                    A           B           C           D
2020-01-01   1.946546   -0.032165   -2.566516   -0.606506
2020-01-02   0.065644   -0.698798   -0.065460    0.654066
2020-01-03  -0.890656   -0.065168    0.964065   -2.650465
2020-01-04   1.897906    1.984065   -0.984066    1.987684
2020-01-05   0.987065   -0.065654    0.897650   -0.890646
2020-01-06   0.897406    0.031664   -1.984650    0.121640
2020-01-07   0.984650   -0.987106   -1.987065    0.804650
2020-01-08   0.984006   -1.894065    0.984065   -1.065064
2020-01-09   1.980656   -0.056497    0.650652   -0.894056
2020-01-10   0.260569    1.984065    0.206054   -1.894560
                    A           B
2020-01-01   1.946546   -0.032165
2020-01-02   2.012190   -0.730963
2020-01-03   1.121534   -0.796131
2020-01-04   1.072894    1.220099
2020-01-05   1.994315    1.853243
2020-01-06   3.782377    1.950075
2020-01-07   2.869121   -1.021096
2020-01-08   2.866062   -2.849507
2020-01-09   3.949312   -2.937668
2020-01-10   3.225231    0.033503
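When several columns are involved, each one can be given its own function by passing a dictionary to `aggregate`. This is a minimal sketch on assumed data: the frame, the window size of 2, and the column-to-function mapping are hypothetical.

```python
import pandas as pd

# Hypothetical fixed data
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                      "B": [4.0, 3.0, 2.0, 1.0]})
r = frame.rolling(window=2, min_periods=1)

# Keys are column names, values are the function to apply to that column
mixed = r.aggregate({"A": "sum", "B": "mean"})
print(mixed)
```

Only the columns named in the dictionary appear in the result, each reduced with the function assigned to it.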