Time Series – Data Processing and Visualization

A time series is a sequence of observations indexed at equally spaced time intervals. Hence, the order and continuity of the observations should be maintained in any time series.
The dataset we will be using is a multivariate time series with hourly air quality data, covering approximately one year, from a significantly polluted Italian city. The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/air+quality.
It is necessary to make sure that −

  • The time series is equally spaced, and
  • There are no redundant values or gaps in it.
    In case the time series is not continuous, we can upsample or downsample it.
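The checks above can be sketched on a small series. The timestamps and values below are hypothetical and are not taken from the dataset; the sketch assumes pandas is available and uses its `duplicated`, `reindex`, and `resample` methods to find redundant values and gaps and then to downsample or upsample.

```python
import pandas as pd

# Hypothetical hourly series with one duplicated timestamp and one missing hour.
idx = pd.to_datetime([
    "2004-03-10 18:00", "2004-03-10 19:00", "2004-03-10 19:00",
    "2004-03-10 21:00", "2004-03-10 22:00",
])
s = pd.Series([2.6, 2.0, 2.0, 2.2, 1.6], index=idx)

# Redundant values: keep only the first observation for each timestamp.
s = s[~s.index.duplicated(keep="first")]

# Gaps: compare against the full hourly range that the series should cover.
full_range = pd.date_range(s.index.min(), s.index.max(), freq="h")
missing = full_range.difference(s.index)
print(missing)

# Re-index onto the regular hourly grid; the missing hour becomes NaN.
regular = s.reindex(full_range)

# Downsample to 2-hour means, or upsample to 30-minute steps with
# linear interpolation between the known points.
down = regular.resample("2h").mean()
up = regular.resample("30min").interpolate()
```

Whether interpolation is an appropriate way to fill the upsampled points depends on the variable; it is used here only to show the mechanics.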

    Loading the data and showing database.head()

    In [122]:

    import pandas

    In [123]:

    database = pandas.read_csv("AirQualityUCI.csv", sep = ";", decimal = ",")
    database = database.iloc[ : , 0:14]

    In [124]:

    len(database)

    Out[124]:

    9471

    In [125]:

    database.head()

    Out[125]:
    To preprocess the time series, we make sure there are no NaN (NULL) values in the dataset; if there are, we can replace them with 0, the average, or the preceding or succeeding value. Replacing is preferred over dropping so that the continuity of the time series is maintained. However, in our dataset the NULL values appear to be confined to the last few rows, so dropping them will not affect continuity.
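The four replacement strategies mentioned above can be sketched as follows. The series `t` and its values are hypothetical stand-ins for a column such as `T`; the calls used are the standard pandas `fillna`, `ffill`, and `bfill` methods.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperature readings with two missing values.
idx = pd.date_range("2004-03-10 18:00", periods=6, freq="h")
t = pd.Series([13.6, np.nan, 11.9, 11.0, np.nan, 11.2], index=idx)

filled_zero = t.fillna(0)         # replace with 0
filled_mean = t.fillna(t.mean())  # replace with the average of the known values
filled_prev = t.ffill()           # carry the preceding value forward
filled_next = t.bfill()           # use the succeeding value

print(pd.DataFrame({
    "original": t,
    "zero": filled_zero,
    "mean": filled_mean,
    "ffill": filled_prev,
    "bfill": filled_next,
}))
```

For a slowly varying quantity such as temperature, the preceding or succeeding value is often a closer guess than 0, which would show up as an artificial spike in a line plot.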

    Dropping NaN (Not-a-Number)

    In [126]:

    database.isna().sum()
    Out[126]:
    Date             114
    Time             114
    CO(GT)           114
    PT08.S1(CO)      114
    NMHC(GT)         114
    C6H6(GT)         114
    PT08.S2(NMHC)    114
    NOx(GT)          114
    PT08.S3(NOx)     114
    NO2(GT)          114
    PT08.S4(NO2)     114
    PT08.S5(O3)      114
    T                114
    RH               114
    dtype: int64

    In [127]:

    # Keep only rows that have a Date; copy so later column assignments do not act on a view.
    database = database[database['Date'].notnull()].copy()

    In [128]:

    database.isna().sum()

    Out[128]:

    Date             0
    Time             0
    CO(GT)           0
    PT08.S1(CO)      0
    NMHC(GT)         0
    C6H6(GT)         0
    PT08.S2(NMHC)    0
    NOx(GT)          0
    PT08.S3(NOx)     0
    NO2(GT)          0
    PT08.S4(NO2)     0
    PT08.S5(O3)      0
    T                0
    RH               0
    dtype: int64

    Time series are usually plotted as line graphs against time. For that, we will now combine the Date and Time columns and convert the resulting strings into datetime objects. This can be accomplished using the datetime library.

    Conversion to datetime object

    In [129]:

    database['DateTime'] = (database.Date) + ' ' + (database.Time)
    print (type(database.DateTime[0]))

    In [130]:

    import datetime
    database.DateTime = database.DateTime.apply(lambda x: datetime.datetime.strptime(x, '%d/%m/%Y %H.%M.%S'))
    print (type(database.DateTime[0]))
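As an alternative to applying `datetime.datetime.strptime` row by row, the same conversion can be sketched with `pandas.to_datetime`, which parses the whole column in one call. The two-row `sample` frame below is hypothetical; it only mimics the string layout of the Date and Time columns.

```python
import pandas as pd

# Hypothetical rows in the same string layout as the Date and Time columns.
sample = pd.DataFrame({
    "Date": ["10/03/2004", "10/03/2004"],
    "Time": ["18.00.00", "19.00.00"],
})

# Parse the combined strings in one vectorized call, using the same
# format string as the strptime version.
sample["DateTime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"],
    format="%d/%m/%Y %H.%M.%S",
)
print(sample["DateTime"])
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates.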

    Let us see how a variable such as temperature changes over time.

    Displaying plots

    In [131]:

    database.index = database.DateTime

    In [132]:

    import matplotlib.pyplot as plt
    plt.plot(database['T'])

    Out[132]:

    [<matplotlib.lines.Line2D at 0x1eaad67f780>]

    In [208]:

    plt.plot(database['C6H6(GT)'])

    Out[208]:

    [<matplotlib.lines.Line2D at 0x1eaaeedff28>]
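The plots above can be made easier to read by labelling the axes and overlaying a coarser average. The following is a minimal sketch under stated assumptions: the series `temp` is synthetic (a hypothetical daily cycle standing in for `database['T']`), and the daily mean is just one possible summary.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical hourly temperatures for three days, standing in for database['T'].
idx = pd.date_range("2004-03-10", periods=72, freq="h")
temp = pd.Series(12 + 4 * np.sin(np.arange(72) * 2 * np.pi / 24), index=idx)

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(temp.index, temp.values, label="hourly")

# A daily mean makes the slower trend easier to see.
daily = temp.resample("D").mean()
ax.plot(daily.index, daily.values, marker="o", label="daily mean")

ax.set_xlabel("DateTime")
ax.set_ylabel("T")
ax.legend()
fig.autofmt_xdate()  # slant the date labels so they do not overlap
```

With the real data, `temp` would be replaced by `database['T']` once the DateTime index has been set.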

    Box plots are another useful kind of graph that condenses a lot of information about a dataset into a single figure. A box plot shows the median, the 25% and 75% quartiles, and the outliers of one or more variables. When the outliers are few and lie very far from the mean, we can eliminate them by setting them to the mean value or the 75% quartile value.

    Displaying Boxplots

    In [134]:

    plt.boxplot(database[['T','C6H6(GT)']].values)

    Out[134]:

    {'whiskers': [<matplotlib.lines.Line2D at 0x1eaac16de80>,
    <matplotlib.lines.Line2D at 0x1eaac16d908>,
    <matplotlib.lines.Line2D at 0x1eaac177a58>,
    <matplotlib.lines.Line2D at 0x1eaac177cf8>],
    'caps': [<matplotlib.lines.Line2D at 0x1eaac16d2b0>,
    <matplotlib.lines.Line2D at 0x1eaac16d588>,
    <matplotlib.lines.Line2D at 0x1eaac1a69e8>,
    <matplotlib.lines.Line2D at 0x1eaac1a64a8>],
    'boxes': [<matplotlib.lines.Line2D at 0x1eaac16dc50>,
    <matplotlib.lines.Line2D at 0x1eaac1779b0>],
    'medians': [<matplotlib.lines.Line2D at 0x1eaac16d4a8>,
    <matplotlib.lines.Line2D at 0x1eaac1a6c50>],
    'fliers': [<matplotlib.lines.Line2D at 0x1eaac177dd8>,
    <matplotlib.lines.Line2D at 0x1eaac1a6c18>],'means': []
    }
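The outlier replacement described before the box plot can be sketched as follows. The readings in `values` are hypothetical, and the 1.5 × IQR threshold is an assumption: it is the default rule matplotlib's `boxplot` uses to draw points as fliers, not something the text prescribes.

```python
import pandas as pd

# Hypothetical readings with one extreme value.
values = pd.Series([4.0, 5.0, 6.0, 7.0, 8.0, 60.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Replace the flagged points with the 75% quartile value.
cleaned = values.mask(is_outlier, q3)
print(cleaned)
```

To use the mean instead, the replacement value could be `values[~is_outlier].mean()`, so that the outliers themselves do not inflate the mean they are replaced with.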