Holidays and Working days were modelled separately and later their predictions were combined. Initially it was decided to go with the same approach for both datasets, which is that a simple model is applied with a gridsearch to get the best possible parameters and see which parameters behave better. However, a further step was taken to improve the score. Stacking ensemble was used to understand errors of the first models and build on them. Expected results were obtained for the working days dataset but not for the holidays.
In order to stack ensemble, third quarter of year 2 was left aside and level 0 models were trained on the one year and a half. Predictions for third quarter were obtained from level 0 models and joined with explanatory variables. Thereafter, an ensemble was trained on third quarter in order to predict for the fourth quarter (test set).
# To automatically reload the function file
%load_ext autoreload
%aimport My_Functions
%run My_Functions.py
%autoreload 1
%matplotlib inline
warnings.simplefilter(action='ignore')
# To automatically reload the function file
%load_ext autoreload
%aimport My_Functions
%run My_Functions.py
%autoreload 1
# Load datasets
wd_h=pd.read_csv('./workingdays_data_prepared.csv')
wd_h['dteday']=pd.to_datetime(wd_h['dteday'], format='%Y-%m-%d')
X_Train_2011_2012Q2 = wd_h[wd_h['dteday'] < datetime.datetime(2012, 7, 5, 0, 0)].drop(['cnt','casual','registered','dteday'],axis=1)## NONE OF THst_EM IN DATA
Y_cnt_train_2011_2012Q2 =wd_h['cnt'][wd_h['dteday'] < datetime.datetime(2012, 7, 5, 0, 0)]
X_Test_2012Q3 = wd_h[(wd_h['dteday'] >= datetime.datetime(2012, 7, 5, 0, 0)) & (wd_h['dteday'] <= datetime.datetime(2012, 9, 30, 0, 0))].drop(['cnt','casual','registered','dteday'],axis=1)## NONE OF THEM IN DATA
Y_cnt_test_2012Q3 =wd_h['cnt'][(wd_h['dteday'] >= datetime.datetime(2012, 7, 5, 0, 0)) & (wd_h['dteday'] <= datetime.datetime(2012, 9, 30, 0, 0))]
Initially, predictions for Q3 2012 were computed. The score is bound to be better than the score of Q4 2012 because bikers' pattern in winter is very different compared to other seasons. The EDA shows that fall, spring, and summer behaved similarly but pattern in winter is different. Therefore, it is reasonable to say that the model learned from 5 seasons with similar patterns interms of bikers' traffic.
##Linear Regression
lm_parameters = {'fit_intercept':[True,False], 'normalize':[True,False], 'copy_X':[True], "n_jobs": [-1]}
lm = GridSearchCV(LinearRegression(),
param_grid=lm_parameters,
cv=tscv ,return_train_score=True)
lm.fit(X_Train_2011_2012Q2, Y_cnt_train_2011_2012Q2)
lm.cv_results_
lm_predictions = lm.predict(X_Test_2012Q3)
lm.score(X_Test_2012Q3, Y_cnt_test_2012Q3)
After trying linear regressor, random forest was tried with a comprehensive grid search in order to make sure that the best possible parameters are used. It was expected that Random Forest would produce a higher score than linear model, but that was not the case. This means that there is a linear relationship between the data at hand and the target, and that the assumption that a random forest would always give a better result is not true specially in a regression problem.
RF_parameters = {'n_estimators': [10, 30 ,100],
'bootstrap': [True],
'max_depth': [80, 100 ],
'max_features': ['sqrt',16 ,32],
'min_samples_leaf': [2, 5 , 8],
'min_samples_split': [ 10 , 8 , 15],
'random_state':[random_seed],
"n_jobs": [-1],
'criterion':['mse']}
rf = GridSearchCV(RandomForestRegressor(),
param_grid= RF_parameters,
cv=tscv)
rf.fit(X_Train_2011_2012Q2, Y_cnt_train_2011_2012Q2)
rf.cv_results_
rf_predictions = lm.predict(X_Test_2012Q3)
rf.score(X_Test_2012Q3, Y_cnt_test_2012Q3)
A simple implementation of KNN regression is to calculate the average of the numerical target of the K nearest neighbors. This model was tried because EDA showed that there are multiple ways to cluster different hours, for instance, by weather, hour, season and day.
knn_parameters = {'n_neighbors':[5,10,15],'weights':['uniform'], 'algorithm':['auto'],'leaf_size':[10, 20,30],
'p':[1,2,3],'metric':['minkowski'],"n_jobs": [-1]}
knn = GridSearchCV(KNeighborsRegressor(),
param_grid=knn_parameters,
cv=tscv,return_train_score=True)
knn.fit(X_Train_2011_2012Q2, Y_cnt_train_2011_2012Q2)
# rf.cv_results_
knn_predictions = knn.predict(X_Test_2012Q3)
knn.score(X_Test_2012Q3, Y_cnt_test_2012Q3)
svr = GridSearchCV(SVR(kernel='rbf', gamma=0.1), cv=tscv,
param_grid={"C": [1e0, 1e1, 1e2, 1e3],
"gamma": np.logspace(-2, 2, 5)})
svr.fit(X_Train_2011_2012Q2, Y_cnt_train_2011_2012Q2)
# rf.cv_results_
svr_predictions = svr.predict(X_Test_2012Q3)
svr.score(X_Test_2012Q3, Y_cnt_test_2012Q3)
XGBoost works on errors of a tree algorithm and improves on them. A comprehensive Grid Search produced the best score compared to all the other models
param_grid = {'learning_rate': [0.01, 0.1],
'max_depth': [4,8,12],
'min_child_weight': [3,5,10,20,35,50],
'subsample': [0.5, 0.75],
'colsample_bytree': [0.5, 0.75],
'n_estimators': [100, 300], 'random_state':[random_seed]
}
model = xgb.XGBRegressor()
xg = GridSearchCV(model,
param_grid = param_grid,
cv = tscv,
n_jobs = -1,
scoring = 'r2',
verbose=True)
xg.fit(X_Train_2011_2012Q2, Y_cnt_train_2011_2012Q2)
xg_predictions = xg.predict(X_Test_2012Q3)
xg.score(X_Test_2012Q3, Y_cnt_test_2012Q3)
In Level 1 modelling, the predictions of top 3 level 0 algortihms (Random Forest, Linear Regression, and XGBoost) on Q3 were attached to Q3 data for training an algorithm
# LR and Random Forest predictions
combinedPredictions = pd.DataFrame({'lm_predictions':lm_predictions,'rf_predictions':rf_predictions,'xg_predictions':xg_predictions })
# Concat original data
X_Test_2012Q3.reset_index(drop=True,inplace=True)
combinedPredictions = pd.concat([combinedPredictions,X_Test_2012Q3], axis=1)
# Target data
combinedPredictionsTarget = pd.DataFrame({'target':Y_cnt_test_2012Q3})
#Reset indices
combinedPredictions.reset_index(drop=True,inplace=True)
combinedPredictionsTarget.reset_index(drop=True,inplace=True)
Q4 data is added to predictions of Level 0 models for Q4 2012
#Get Q4 data
X_Test_Q4 = wd_h[(wd_h['dteday'] >= datetime.datetime(2012, 10, 1, 0, 0)) & (wd_h['dteday'] <= datetime.datetime(2012, 12, 31, 0, 0))].drop(['cnt','casual','registered','dteday'],axis=1)## NONE OF THEM IN DATA
Y_cnt_test_Q4 =wd_h['cnt'][(wd_h['dteday'] >= datetime.datetime(2012, 10, 1, 0, 0)) & (wd_h['dteday'] <= datetime.datetime(2012, 12, 31, 0, 0))]
oringal_cnt = Y_cnt_test_Q4
#get level-0 predictions for Q4 data
Q4_lm_predictions = pd.DataFrame(lm.predict(X_Test_Q4), columns=["lm_predictions"])
Q4_rf_predictions = pd.DataFrame(rf.predict(X_Test_Q4), columns=["rf_predictions"])
Q4_xg_predictions = pd.DataFrame(xg.predict(X_Test_Q4), columns=["xg_predictions"])
X_Test_Q4.reset_index(drop=True,inplace=True)
Q4_lm_predictions.reset_index(drop=True,inplace=True)
Q4_rf_predictions.reset_index(drop=True,inplace=True)
Q4_xg_predictions.reset_index(drop=True,inplace=True)
Y_cnt_test_Q4.reset_index(drop=True,inplace=True)
#concat Q4 with level-0 predictions
X_Test_Q4_with_predictions = pd.concat([Q4_lm_predictions, Q4_rf_predictions,Q4_xg_predictions, X_Test_Q4 ], axis=1)
Y_cnt_test_Q4 = pd.DataFrame({"target": Y_cnt_test_Q4})
Linear Regression :
lm.score(X_Test_Q4, Y_cnt_test_Q4)
Random Forest :
rf.score(X_Test_Q4, Y_cnt_test_Q4)
XGBoost :
xg.score(X_Test_Q4, Y_cnt_test_Q4)
K Nearest Neighbors:
knn.score(X_Test_Q4, Y_cnt_test_Q4)
Support Vector Regressor :
svr.score(X_Test_Q4, Y_cnt_test_Q4)
Level 1 XGBoost is trained on Q3 2012 data with predictions of the three best level 0 models. Level 1 XGBoost will predict for Q4 2012
param_grid = {'learning_rate': [0.01, 0.1],
'max_depth': [4,8,12],
'min_child_weight': [3,5,10,20,35,50],
'subsample': [0.5, 0.75],
'colsample_bytree': [0.5, 0.75],
"n_jobs": [-1],
'n_estimators': [100, 300], 'random_state':[random_seed]
}
model = xgb.XGBRegressor()
xg1 = GridSearchCV(model,
param_grid = param_grid,
cv = tscv,
n_jobs = -1,
scoring = 'explained_variance',
verbose=True)
xg1.fit(combinedPredictions, combinedPredictionsTarget)
xg1_predictions = xg1.predict(X_Test_Q4_with_predictions)
xg1.score(X_Test_Q4_with_predictions, Y_cnt_test_Q4)
This is the final score for the predictions of Q4 2012 workingdays : 0.9022288970285075
Steps that could have further increased the score:
- Choosing the right amount of features for every algorithms, however it was computationaly heavy.
- Increasing the gridsearch or using Bayesian Optimization
Steps that were thought to increase the score but did not work:
- Running level 1 algorithms only on the predictions without the initial data
- Stacking in a way that would use predictions of TimeSeriesCross Validation and then retrain level 0 algorithms with more data
work_predictions = pd.concat( (pd.DataFrame(xg1.predict(X_Test_Q4_with_predictions)), Y_cnt_test_Q4), axis =1 )
dates = wd_h['dteday'][(wd_h['dteday'] >= datetime.datetime(2012, 9, 30, 0, 0)) & (wd_h['dteday'] <= datetime.datetime(2012, 12, 31, 0, 0))]
dates=pd.to_datetime(dates, format='%Y-%m-%d')
dates.reset_index(drop=True,inplace=True)
work_predictions.reset_index(drop=True,inplace=True)
full_wd=pd.concat([dates,work_predictions],axis=1)
full_wd.to_csv("pred_working.csv", index=False)