Washington DC Biking data | Hourly Bike Count Prediction

3. Modeling for WorkingDays

MBD O-1-5

Holidays and Working days were modelled separately and later their predictions were combined. Initially it was decided to go with the same approach for both datasets, which is that a simple model is applied with a gridsearch to get the best possible parameters and see which parameters behave better. However, a further step was taken to improve the score. Stacking ensemble was used to understand errors of the first models and build on them. Expected results were obtained for the working days dataset but not for the holidays.

In order to stack ensemble, third quarter of year 2 was left aside and level 0 models were trained on the one year and a half. Predictions for third quarter were obtained from level 0 models and joined with explanatory variables. Thereafter, an ensemble was trained on third quarter in order to predict for the fourth quarter (test set).

In [3]:
# To automatically reload the function file 
%load_ext autoreload
%aimport My_Functions
%run My_Functions.py
%autoreload 1
%matplotlib inline
warnings.simplefilter(action='ignore')
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [36]:
# To automatically reload the function file 
%load_ext autoreload
%aimport My_Functions
%run My_Functions.py
%autoreload 1
In [3]:
# Load datasets
wd_h=pd.read_csv('./workingdays_data_prepared.csv')
wd_h['dteday']=pd.to_datetime(wd_h['dteday'], format='%Y-%m-%d')

Subset 2011 -> 2012Q3 Data

Dividing the dataset in one year and a half and Q3 in order to be able to train level 0 Models AND PREDICT

In [4]:
X_Train_2011_2012Q2 = wd_h[wd_h['dteday'] < datetime.datetime(2012, 7, 5, 0, 0)].drop(['cnt','casual','registered','dteday'],axis=1)## NONE OF THst_EM IN DATA
Y_cnt_train_2011_2012Q2 =wd_h['cnt'][wd_h['dteday'] < datetime.datetime(2012, 7, 5, 0, 0)]

X_Test_2012Q3 = wd_h[(wd_h['dteday'] >= datetime.datetime(2012, 7, 5, 0, 0)) & (wd_h['dteday'] <= datetime.datetime(2012, 9, 30, 0, 0))].drop(['cnt','casual','registered','dteday'],axis=1)## NONE OF THEM IN DATA
Y_cnt_test_2012Q3 =wd_h['cnt'][(wd_h['dteday'] >= datetime.datetime(2012, 7, 5, 0, 0)) & (wd_h['dteday'] <= datetime.datetime(2012, 9, 30, 0, 0))]

Linear regression w/ GS for Weekdays for 2011 -> 2012Q3

Initially, predictions for Q3 2012 were computed. The score is bound to be better than the score of Q4 2012 because bikers' pattern in winter is very different compared to other seasons. The EDA shows that fall, spring, and summer behaved similarly but pattern in winter is different. Therefore, it is reasonable to say that the model learned from 5 seasons with similar patterns interms of bikers' traffic.

In [5]:
##Linear Regression
lm_parameters = {'fit_intercept':[True,False], 'normalize':[True,False], 'copy_X':[True], "n_jobs": [-1]}

lm = GridSearchCV(LinearRegression(),
                                 param_grid=lm_parameters,
                                 cv=tscv ,return_train_score=True)

lm.fit(X_Train_2011_2012Q2, Y_cnt_train_2011_2012Q2)
lm.cv_results_
lm_predictions = lm.predict(X_Test_2012Q3)
lm.score(X_Test_2012Q3, Y_cnt_test_2012Q3)
Out[5]:
0.941401943285146

Random Forest w/ GS for Weekdays for 2011 -> 2012Q3

After trying linear regressor, random forest was tried with a comprehensive grid search in order to make sure that the best possible parameters are used. It was expected that Random Forest would produce a higher score than linear model, but that was not the case. This means that there is a linear relationship between the data at hand and the target, and that the assumption that a random forest would always give a better result is not true specially in a regression problem.

In [6]:
RF_parameters = {'n_estimators': [10, 30 ,100],
                                             'bootstrap': [True],
                                             'max_depth': [80, 100 ],
                                             'max_features': ['sqrt',16 ,32],
                                             'min_samples_leaf': [2,  5 , 8],
                                             'min_samples_split': [ 10 , 8 , 15],
                                            'random_state':[random_seed],
                                             "n_jobs": [-1],
                                            'criterion':['mse']}
rf = GridSearchCV(RandomForestRegressor(),
                                 param_grid= RF_parameters,
                                 cv=tscv)

rf.fit(X_Train_2011_2012Q2, Y_cnt_train_2011_2012Q2)
rf.cv_results_
rf_predictions = lm.predict(X_Test_2012Q3)
rf.score(X_Test_2012Q3, Y_cnt_test_2012Q3)
Out[6]:
0.9289808963621973

K Nearest Neighbors regressor w/ GS for Holidays for 2011 -> 2012Q3

A simple implementation of KNN regression is to calculate the average of the numerical target of the K nearest neighbors. This model was tried because EDA showed that there are multiple ways to cluster different hours, for instance, by weather, hour, season and day.

In [7]:
knn_parameters = {'n_neighbors':[5,10,15],'weights':['uniform'], 'algorithm':['auto'],'leaf_size':[10, 20,30],
                  'p':[1,2,3],'metric':['minkowski'],"n_jobs": [-1]}
knn = GridSearchCV(KNeighborsRegressor(),
                                 param_grid=knn_parameters,
                                 cv=tscv,return_train_score=True)
knn.fit(X_Train_2011_2012Q2, Y_cnt_train_2011_2012Q2)
# rf.cv_results_
knn_predictions = knn.predict(X_Test_2012Q3)
knn.score(X_Test_2012Q3, Y_cnt_test_2012Q3)
Out[7]:
0.9211277733616634

Support Vector regressor w/ GS for Holidays for 2011 -> 2012Q3

In [8]:
svr = GridSearchCV(SVR(kernel='rbf', gamma=0.1), cv=tscv,
                   param_grid={"C": [1e0, 1e1, 1e2, 1e3],
                               "gamma": np.logspace(-2, 2, 5)})

svr.fit(X_Train_2011_2012Q2, Y_cnt_train_2011_2012Q2)
# rf.cv_results_
svr_predictions = svr.predict(X_Test_2012Q3)
svr.score(X_Test_2012Q3, Y_cnt_test_2012Q3)
Out[8]:
0.7455187270287198

XGBoost regressor w/ GS for Holidays for 2011 -> 2012Q3

XGBoost works on errors of a tree algorithm and improves on them. A comprehensive Grid Search produced the best score compared to all the other models

In [9]:
param_grid = {'learning_rate': [0.01, 0.1], 
          'max_depth': [4,8,12],
          'min_child_weight': [3,5,10,20,35,50],
          'subsample': [0.5, 0.75],
          'colsample_bytree': [0.5, 0.75],
          'n_estimators': [100, 300], 'random_state':[random_seed]
}

model = xgb.XGBRegressor()

xg = GridSearchCV(model,
                    param_grid = param_grid,
                    cv = tscv,
                    n_jobs = -1,
                    scoring = 'r2',
                    verbose=True)
xg.fit(X_Train_2011_2012Q2, Y_cnt_train_2011_2012Q2)
xg_predictions = xg.predict(X_Test_2012Q3)
xg.score(X_Test_2012Q3, Y_cnt_test_2012Q3)
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   32.7s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  9.5min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 17.4min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed: 29.8min
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 37.2min finished
Out[9]:
0.934419033223012

Taking predictions from LR and Random Forest and xgboost to test on 2012Q4

In Level 1 modelling, the predictions of top 3 level 0 algortihms (Random Forest, Linear Regression, and XGBoost) on Q3 were attached to Q3 data for training an algorithm

In [10]:
# LR and Random Forest predictions
combinedPredictions = pd.DataFrame({'lm_predictions':lm_predictions,'rf_predictions':rf_predictions,'xg_predictions':xg_predictions })

# Concat original data
X_Test_2012Q3.reset_index(drop=True,inplace=True)
combinedPredictions = pd.concat([combinedPredictions,X_Test_2012Q3], axis=1)

# Target data
combinedPredictionsTarget = pd.DataFrame({'target':Y_cnt_test_2012Q3})

#Reset indices
combinedPredictions.reset_index(drop=True,inplace=True)
combinedPredictionsTarget.reset_index(drop=True,inplace=True)

Subset Q4 data and concat it with its level-0 predctions

Q4 data is added to predictions of Level 0 models for Q4 2012

In [11]:
#Get Q4 data
X_Test_Q4 = wd_h[(wd_h['dteday'] >= datetime.datetime(2012, 10, 1, 0, 0)) & (wd_h['dteday'] <= datetime.datetime(2012, 12, 31, 0, 0))].drop(['cnt','casual','registered','dteday'],axis=1)## NONE OF THEM IN DATA
Y_cnt_test_Q4 =wd_h['cnt'][(wd_h['dteday'] >= datetime.datetime(2012, 10, 1, 0, 0)) & (wd_h['dteday'] <= datetime.datetime(2012, 12, 31, 0, 0))]
oringal_cnt = Y_cnt_test_Q4

#get level-0 predictions for Q4 data
Q4_lm_predictions = pd.DataFrame(lm.predict(X_Test_Q4), columns=["lm_predictions"])
Q4_rf_predictions = pd.DataFrame(rf.predict(X_Test_Q4), columns=["rf_predictions"])
Q4_xg_predictions = pd.DataFrame(xg.predict(X_Test_Q4), columns=["xg_predictions"])

X_Test_Q4.reset_index(drop=True,inplace=True)
Q4_lm_predictions.reset_index(drop=True,inplace=True)
Q4_rf_predictions.reset_index(drop=True,inplace=True)
Q4_xg_predictions.reset_index(drop=True,inplace=True)
Y_cnt_test_Q4.reset_index(drop=True,inplace=True)

#concat Q4 with level-0 predictions 
X_Test_Q4_with_predictions = pd.concat([Q4_lm_predictions, Q4_rf_predictions,Q4_xg_predictions, X_Test_Q4 ], axis=1)
Y_cnt_test_Q4 = pd.DataFrame({"target": Y_cnt_test_Q4})

Q4 scores

Scores of Level 0 algorithms for Q4 2012

Linear Regression :

In [12]:
lm.score(X_Test_Q4, Y_cnt_test_Q4)
Out[12]:
0.8839966859540709

Random Forest :

In [13]:
rf.score(X_Test_Q4, Y_cnt_test_Q4)
Out[13]:
0.8715926918607073

XGBoost :

In [14]:
xg.score(X_Test_Q4, Y_cnt_test_Q4)
Out[14]:
0.8980765759281988

K Nearest Neighbors:

In [15]:
knn.score(X_Test_Q4, Y_cnt_test_Q4)
Out[15]:
0.8340477314724115

Support Vector Regressor :

In [16]:
svr.score(X_Test_Q4, Y_cnt_test_Q4)
Out[16]:
0.7077025873956615

Train and test Q4

XGBoost regressor w/ GS for Level 1 Holidays for 2011 -> 2012Q3

Level 1 XGBoost is trained on Q3 2012 data with predictions of the three best level 0 models. Level 1 XGBoost will predict for Q4 2012

In [19]:
param_grid = {'learning_rate': [0.01, 0.1], 
          'max_depth': [4,8,12],
          'min_child_weight': [3,5,10,20,35,50],
          'subsample': [0.5, 0.75],
          'colsample_bytree': [0.5, 0.75],
           "n_jobs": [-1],
          'n_estimators': [100, 300], 'random_state':[random_seed]
}

model = xgb.XGBRegressor()

xg1 = GridSearchCV(model,
                    param_grid = param_grid,
                    cv = tscv,
                    n_jobs = -1,
                    scoring = 'explained_variance',
                    verbose=True)

xg1.fit(combinedPredictions, combinedPredictionsTarget)
xg1_predictions = xg1.predict(X_Test_Q4_with_predictions)
xg1.score(X_Test_Q4_with_predictions, Y_cnt_test_Q4)
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  51 tasks      | elapsed:    7.8s
[Parallel(n_jobs=-1)]: Done 204 tasks      | elapsed:   36.6s
[Parallel(n_jobs=-1)]: Done 454 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 804 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 1254 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed:  5.9min finished
Out[19]:
0.9010145680993351

This is the final score for the predictions of Q4 2012 workingdays : 0.9022288970285075

Steps that could have further increased the score:

- Choosing the right amount of features for every algorithms, however it was computationaly heavy.
- Increasing the gridsearch or using Bayesian Optimization 

Steps that were thought to increase the score but did not work:

- Running level 1 algorithms only on the predictions without the initial data 
- Stacking in a way that would use predictions of TimeSeriesCross Validation and then retrain level 0 algorithms with more data      

Save Predictions for Weekdays

In [43]:
work_predictions  = pd.concat( (pd.DataFrame(xg1.predict(X_Test_Q4_with_predictions)), Y_cnt_test_Q4), axis =1 )
dates = wd_h['dteday'][(wd_h['dteday'] >= datetime.datetime(2012, 9, 30, 0, 0)) & (wd_h['dteday'] <= datetime.datetime(2012, 12, 31, 0, 0))]
dates=pd.to_datetime(dates, format='%Y-%m-%d')
dates.reset_index(drop=True,inplace=True)
work_predictions.reset_index(drop=True,inplace=True)
full_wd=pd.concat([dates,work_predictions],axis=1)
full_wd.to_csv("pred_working.csv", index=False)