Time Series Analysis

A time series is a series of data points indexed (or listed or graphed) in time order. Time series analysis comprises methods for extracting meaningful statistics and patterns from time series data, and it is most commonly used for forecasting.

Learning Objectives


  1. Understand the uses of time series analysis.
  2. Understand the pros and cons of various TSA methods, including differentiating between linear and non-linear methods.
  3. Apply the Facebook Prophet model and analyze the results on the given Rossmann store data.

Problem Description


We will use the Rossmann Store Sales dataset for this notebook. The following description of the data is from the competition website:

"Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied."

In [1]:
# Library Imports
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, pearsonr
from math import sqrt
from sklearn.metrics import mean_squared_error
import fbprophet

# matplotlib parameters
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

%config InlineBackend.figure_format = 'retina' 
%matplotlib inline

Exploratory Data Analysis (Train)


We start by seeing what our data consists of: which variables are continuous and which are categorical. After exploring some of the data, we see that we can engineer a feature: sales divided by customers gives us a good metric for average sales per customer. We can also assume that missing values in this column correspond to days with 0 customers. Since customers drive sales, we elect to remove all of those rows.

Notice the order in which the data is listed: from most recent date to oldest. This may cause a problem when we look to develop our model.
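
The cell that loads the data (In [2]) was omitted from this export; below is a minimal reconstruction, mirroring the read_csv call that is reused in In [19] later in this notebook.

In [2]:
# Read the train CSV with the Date column parsed as a DatetimeIndex
train = pd.read_csv("https://raw.githubusercontent.com/RPI-DATA/tutorials-intro/master/rossmann-store-sales/rossmann-store-sales/train.csv", parse_dates=True, low_memory=False, index_col='Date')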

In [3]:
train.head()

Out[3]:
            Store  DayOfWeek  Sales  Customers  Open  Promo  StateHoliday  SchoolHoliday
Date
2015-07-31      1          5   5263        555     1      1             0              1
2015-07-31      2          5   6064        625     1      1             0              1
2015-07-31      3          5   8314        821     1      1             0              1
2015-07-31      4          5  13995       1498     1      1             0              1
2015-07-31      5          5   4822        559     1      1             0              1

In [5]:
train.shape

Out[5]:
(1017209, 8)

Next, we use the .describe() method, which summarizes the main statistics of each numeric column.

In [6]:
train.describe()

Out[6]:
              Store     DayOfWeek         Sales     Customers          Open         Promo  SchoolHoliday
count  1.017209e+06  1.017209e+06  1.017209e+06  1.017209e+06  1.017209e+06  1.017209e+06   1.017209e+06
mean   5.584297e+02  3.998341e+00  5.773819e+03  6.331459e+02  8.301067e-01  3.815145e-01   1.786467e-01
std    3.219087e+02  1.997391e+00  3.849926e+03  4.644117e+02  3.755392e-01  4.857586e-01   3.830564e-01
min    1.000000e+00  1.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   0.000000e+00
25%    2.800000e+02  2.000000e+00  3.727000e+03  4.050000e+02  1.000000e+00  0.000000e+00   0.000000e+00
50%    5.580000e+02  4.000000e+00  5.744000e+03  6.090000e+02  1.000000e+00  0.000000e+00   0.000000e+00
75%    8.380000e+02  6.000000e+00  7.856000e+03  8.370000e+02  1.000000e+00  1.000000e+00   0.000000e+00
max    1.115000e+03  7.000000e+00  4.155100e+04  7.388000e+03  1.000000e+00  1.000000e+00   1.000000e+00

Next, we check for missing values in each column.

In [7]:
missing = train.isnull().sum()
missing.sort_values(ascending=False)

Out[7]:
SchoolHoliday    0
StateHoliday     0
Promo            0
Open             0
Customers        0
Sales            0
DayOfWeek        0
Store            0
dtype: int64

Next, we create a new metric to see average sales per customer.

In [8]:
train['SalesPerCustomer'] = train['Sales']/train['Customers']
train['SalesPerCustomer'].head()

Out[8]:
Date
2015-07-31     9.482883
2015-07-31     9.702400
2015-07-31    10.126675
2015-07-31     9.342457
2015-07-31     8.626118
Name: SalesPerCustomer, dtype: float64

We are going to check whether our new metric has any missing values and drop them. SalesPerCustomer is null only when both Sales and Customers are zero (0/0 produces NaN, whereas a nonzero value divided by zero would produce inf instead).
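
A quick toy demonstration of these pandas division semantics (a sketch, not part of the original notebook):

# 0/0 yields NaN, while nonzero/0 yields inf in pandas element-wise division
demo = pd.DataFrame({'Sales': [0, 100], 'Customers': [0, 0]})
print(demo['Sales'] / demo['Customers'])   # row 0: NaN, row 1: inf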

In [9]:
missing = train.isnull().sum()
missing.sort_values(ascending=False)

Out[9]:
SalesPerCustomer    172869
SchoolHoliday            0
StateHoliday             0
Promo                    0
Open                     0
Customers                0
Sales                    0
DayOfWeek                0
Store                    0
dtype: int64

In [10]:
# Preview the data with the null rows dropped (not yet persisted)
train.dropna().head()

Out[10]:
            Store  DayOfWeek  Sales  Customers  Open  Promo  StateHoliday  SchoolHoliday  SalesPerCustomer
Date
2015-07-31      1          5   5263        555     1      1             0              1          9.482883
2015-07-31      2          5   6064        625     1      1             0              1          9.702400
2015-07-31      3          5   8314        821     1      1             0              1         10.126675
2015-07-31      4          5  13995       1498     1      1             0              1          9.342457
2015-07-31      5          5   4822        559     1      1             0              1          8.626118
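
Note that In [10] only previews the result; .dropna() returns a new DataFrame rather than modifying train in place. To persist the drop (as is done later in In [21]):

# Reassign to actually remove the null rows
train = train.dropna()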

Exploratory Data Analysis (Store Data)


We do the same as we did for our training set. Exploring the data, we see that there are only 3 missing values in CompetitionDistance. Because this is such a small number, we elect to replace them with the mean of the column. The other missing values all depend on Promo2: those values are missing because Promo2 equals 0, so we can replace the nulls with 0.

In [11]:
store.head()

Out[11]:
   Store StoreType Assortment  CompetitionDistance  CompetitionOpenSinceMonth  CompetitionOpenSinceYear  Promo2  Promo2SinceWeek  Promo2SinceYear    PromoInterval
0      1         c          a               1270.0                        9.0                    2008.0       0              NaN              NaN              NaN
1      2         a          a                570.0                       11.0                    2007.0       1             13.0           2010.0  Jan,Apr,Jul,Oct
2      3         a          a              14130.0                       12.0                    2006.0       1             14.0           2011.0  Jan,Apr,Jul,Oct
3      4         c          c                620.0                        9.0                    2009.0       0              NaN              NaN              NaN
4      5         a          a              29910.0                        4.0                    2015.0       0              NaN              NaN              NaN

In [12]:
store.shape

Out[12]:
(1115, 10)

In [13]:
store.isnull().sum()

Out[13]:
Store                          0
StoreType                      0
Assortment                     0
CompetitionDistance            3
CompetitionOpenSinceMonth    354
CompetitionOpenSinceYear     354
Promo2                         0
Promo2SinceWeek              544
Promo2SinceYear              544
PromoInterval                544
dtype: int64

Since CompetitionDistance has only 3 missing values, we fill them with the mean of the column.

In [14]:
store['CompetitionDistance'].fillna(store['CompetitionDistance'].mean(), inplace = True)

For the rows that do not have Promo2, we can fill the remaining null values with 0.

In [15]:
store.fillna(0, inplace = True)

In [16]:
store.head()

Out[16]:
   Store StoreType Assortment  CompetitionDistance  CompetitionOpenSinceMonth  CompetitionOpenSinceYear  Promo2  Promo2SinceWeek  Promo2SinceYear    PromoInterval
0      1         c          a               1270.0                        9.0                    2008.0       0              0.0              0.0                0
1      2         a          a                570.0                       11.0                    2007.0       1             13.0           2010.0  Jan,Apr,Jul,Oct
2      3         a          a              14130.0                       12.0                    2006.0       1             14.0           2011.0  Jan,Apr,Jul,Oct
3      4         c          c                620.0                        9.0                    2009.0       0              0.0              0.0                0
4      5         a          a              29910.0                        4.0                    2015.0       0              0.0              0.0                0

Join the store metadata onto the training data on the Store column. We use a left join so that every row of train keeps its matching store information.

In [17]:
train = train.merge(right=store, on='Store', how='left')

Moving-Average Model (Naive Model)


We are going to use a moving-average model on the Rossmann sales data as our baseline. The moving-average model takes the average of different "windows" of time to come up with its forecast.
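
A toy illustration of the idea (made-up numbers, not the Rossmann data): a rolling window averages the last k observations, and the naive forecast simply repeats the final window average.

# Average each 3-step window; the last window mean serves as a flat forecast
s = pd.Series([10, 12, 11, 13, 15, 14])
rolling = s.rolling(window=3).mean()   # [NaN, NaN, 11.0, 12.0, 13.0, 14.0]
flat_forecast = rolling.iloc[-1]       # 14.0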

We reload the data because now we have a sense of how we want to manipulate it for our model. After doing the same data manipulation as before, we start to look at the trend of our sales.

In [19]:
train = pd.read_csv("https://raw.githubusercontent.com/RPI-DATA/tutorials-intro/master/rossmann-store-sales/rossmann-store-sales/train.csv", parse_dates = True, low_memory = False, index_col = 'Date')
train = train.sort_index(ascending = True)

In [20]:
train['SalesPerCustomer'] = train['Sales']/train['Customers']
train['SalesPerCustomer'].head()

Out[20]:
Date
2013-01-01   NaN
2013-01-01   NaN
2013-01-01   NaN
2013-01-01   NaN
2013-01-01   NaN
Name: SalesPerCustomer, dtype: float64

In [21]:
train = train.dropna()

Here, we simply graph the sales we have. As you can see, there are so many data points that the plot looks like solid blue shading; even so, it gives us a sense of how our sales are distributed.

In [22]:
plt.plot(train.index, train['Sales'])
plt.title('Rossmann Sales')
plt.ylabel('Sales')
plt.show()

To clean up our graph, we form a new column that captures only the year of each sale.

In [23]:
train['Year'] = train.index.year

# Take Dates from index and move to Date column
train.reset_index(level=0, inplace=True)
train['sales'] = 0

Split the data into a train and a test set. We use an 80/20 split. Then we can start our modeling.

test_store is the held-out portion we will forecast.

In [24]:
train_store=train[0:675472]
test_store=train[675472:]

In [25]:
train_store.Date = pd.to_datetime(train_store.Date, format="%Y-%m-%d")
train_store.index = train_store.Date
test_store.Date = pd.to_datetime(test_store.Date, format="%Y-%m-%d")
test_store.index = test_store.Date

train_store = train_store.resample('D').mean()
train_store = train_store.interpolate(method='linear')

test_store = test_store.resample('D').mean()
test_store = test_store.interpolate(method='linear')

/Users/josefigueroa/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py:4405: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
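
The warning above appears because train_store and test_store are slices of train. One standard way to avoid it (a sketch of the usual pandas idiom, not an original cell) is to take explicit copies when splitting:

# .copy() makes the splits independent DataFrames,
# so later assignments no longer trigger SettingWithCopyWarning
train_store = train[0:675472].copy()
test_store = train[675472:].copy()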

In [26]:
train_store.Sales.plot(figsize=(25,10), title='daily sales', fontsize=25)
test_store.Sales.plot()

Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1c3f5b00>

In [27]:
y_hat_avg_moving = test_store.copy()
# Use the mean of the last 90 days of training sales as a flat forecast
y_hat_avg_moving['moving_avg_forecast'] = train_store.Sales.rolling(90).mean().iloc[-1]
plt.figure(figsize=(25,10))
plt.plot(train_store['Sales'], label='Train')
plt.plot(test_store['Sales'], label='Test')
plt.plot(y_hat_avg_moving['moving_avg_forecast'], label='Moving Average Forecast')
plt.legend(loc='best')
plt.title('Moving Average Forecast')

Out[27]:
Text(0.5, 1.0, 'Moving Average Forecast')
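
A natural way to score this baseline (a sketch using the mean_squared_error and sqrt imports from In [1]; not an original cell) is the RMSE of the flat forecast over the test period:

# Root-mean-squared error of the flat moving-average forecast
rmse = sqrt(mean_squared_error(test_store['Sales'], y_hat_avg_moving['moving_avg_forecast']))
print('Moving-average baseline RMSE:', round(rmse, 2))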