Washington DC Biking data | Hourly Bike Count Prediction¶

MBD O-1-5

Executive Summary¶

The report details steps taken to predict number of total bikers on hourly basis for the last quarter of 2012 for Washington D.C. bike sharing system. The dataset comprises of hourly data and daily data from 2011 to 2012, daily data is merely an aggregation of hourly data. For the purpose of analysis, only hourly data was used. Two-step approach was followed to compute predictions, first, level 0 models were used to predict for third quarter of 2012 and then stacked ensemble was used to predict for fourth quarter of 2012. Linear forest, XGBoost, Random forest regressor, and K nearest neighbor were used to compute Level 0 predictions, which were added to explanatory data. Further, XGBoost was trained on quarter 3 explanatory variables, which contained predictions of level 0 models, to predict for the test set. The final accuracy score is 0.897. As the model is able to predict number of bike users with almost 90% accuracy, it can be used by Washington D.C. bike sharing system to forecast demand and manage inventory levels.

1. EDA + Baseline¶

Notebook preperation¶

# To automatically reload the function file 
%load_ext autoreload
%aimport My_Functions
%run My_Functions.py
%autoreload 1
%matplotlib inline
warnings.simplefilter(action='ignore')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

# Data Import
hourly_raw_data = pd.read_csv('hour.csv')

EDA¶

Data understanding is the key component of carrying out a prediction task. Therefore, naturally, interactive and statics plots are utilized to understand the dataset. Dataset is divided into two files, hourly data and daily data, the latter is merely an aggregation of hourly data. Hence, exploratory data analysis was carried out using hourly data.

# Making a copy of hourly data for graphs
graph_data_hourly=hourly_raw_data.copy()
graph_data_hourly["dteday"]=graph_data_hourly["dteday"].map(lambda x:  pd.to_datetime(x))

Interactive time series plot shows that there are seasonal variations in data. Number of total bikers increase Despite seasonal variations, there is a positive year-on- year trend in the number of total bikers.

mpl_fig = plt.figure()
ax= plt.axes()
ax.plot(graph_data_hourly['dteday'],graph_data_hourly['cnt'])
ax.plot(graph_data_hourly['dteday'],graph_data_hourly['registered'])
ax.plot(graph_data_hourly['dteday'],graph_data_hourly['casual'])
#ax.legend(['cnt', 'registered', 'casual'], loc='upper left')
#ax.figure(figsize=(20,10))
# locator=MaxNLocator(prune='both',nbins=8)
# ax.xaxis.set_major_locator(locator)


plotly_fig=tls.mpl_to_plotly(mpl_fig)
plotly_fig['layout']['xaxis'] = {
   'tickmode': 'auto',
   'nticks': 10
    }
plotly_fig['layout']['showlegend'] = True
plotly_fig['layout'] = {'width':800}
plotly_fig['layout']['title'] = 'Hourly Bike Users'
py.iplot(plotly_fig,filename='Hourly_Bike_Users')

The draw time for this plot will be slow for clients without much RAM.

The distributions of casual, registered, and total bikers show outliers. This finding proves that in the span of dataset, there were times when number of bikers were much greater than the overall trend. In order to enable the model to capture unexpected increase in bikers, multiple features are created, which will be detailed in the feature engineering part.

The boxplots for seasonal variables, namely, temp, atemp, hum and wind speed, show outliers for hum and wind speed. This finding, also, fed into feature creation for seasonal variables in order for the model to be powerful enough to capture such outliers.

graph_data_hourly.boxplot(column = ['casual','registered','cnt'])
plt.title('Hourly')
plt.show()

graph_data_hourly.boxplot(column = ['temp','atemp','hum','windspeed'])
plt.title('Hourly')
plt.show()

num_bins = 'auto'
to_hist = ['atemp','temp','hum','windspeed']
graph_data_hourly[to_hist].hist(bins=num_bins, figsize=(20, 15))

plt.show()

The line charts show that number of total bikers are highly affected by weather. The seasonal variations for number of total bikers and weather follow an identical trend. Therefore, it is reasonable to expect that weather will play an important part in predicting number of total bikers.

plt.figure(figsize=(20,10))
plt.subplot(2,2, 1)
plt.plot(graph_data_hourly['dteday'],graph_data_hourly['cnt'])
plt.title('Total Bikers per Hour')
plt.subplot(2, 2, 2)
plt.plot(graph_data_hourly['dteday'],graph_data_hourly['temp'])
plt.title('Temperature per Hour')
plt.subplot(2, 2, 3)
plt.plot(graph_data_hourly['dteday'],graph_data_hourly['atemp'])
plt.title('Feels like Temperature per Hour')
plt.subplot(2, 2, 4)
plt.plot(graph_data_hourly['dteday'],graph_data_hourly['hum'])
plt.title('Humidity per Hour')

Text(0.5, 1.0, 'Humidity per Hour')

Another important finding is that the total bikers are much higher on weekends compared to weekdays. The difference in number of total bikers on weekends vs. working days justifies that the data should be split and modeled separately. However, graph for total bikers on working days vs. holidays is a proof that number of total bikers do not only increase on weekends, but also on holidays that are during the weekdays. Therefore, data is split between working days and holidays. Predictions for the two datasets are computed separately.

# Option 1 for weekend/weekday graph
fig,ax = plt.subplots()
sns.pointplot(data=graph_data_hourly[['hr',
                           'cnt',
                           'weekday']],
              x='hr',
              y='cnt',
              hue='weekday',
              ax=ax)
ax.set(title="Weekday wise hourly distribution of Total Bikers")

[Text(0.5, 1.0, 'Weekday wise hourly distribution of Total Bikers')]

# option 2 for weekend/weekday graphs
g = sns.lmplot(x="hr", y="cnt", hue="weekday", col="weekday",data=graph_data_hourly, height=6, aspect=.4, x_jitter=.1)

# workingday/holiday graph
fig,ax = plt.subplots()
sns.pointplot(data=graph_data_hourly[['hr',
                           'cnt',
                           'workingday']],
              x='hr',
              y='cnt',
              hue='workingday',
              ax=ax)
ax.set(title="Weekday wise hourly distribution of Total Bikers")

[Text(0.5, 1.0, 'Weekday wise hourly distribution of Total Bikers')]

EDA after splitting (Working Data / Holiday Days)¶

workingdays = num_name(hourly_raw_data.loc[(hourly_raw_data['workingday'].isin([1]) )])
holidays = num_name(hourly_raw_data.loc[(~hourly_raw_data['workingday'].isin([1]) )])

workingdays.boxplot(column = ['casual','registered','cnt'])
plt.title('Casual, Registered, and Total Bikers on Working Days')
plt.show()
holidays.boxplot(column = ['casual','registered','cnt'])
plt.title('Casual, Registered, and Total Bikers on Holidays')
plt.show()

workingdays.boxplot(column = ['temp','atemp','hum','windspeed'])
plt.title('Temperature, Feels like Temperate, Humidity, and Windspeed on Weekdays')
plt.show()
holidays.boxplot(column = ['temp','atemp','hum','windspeed'])
plt.title('Temperature, Feels like Temperate, Humidity, and Windspeed on Weekends')
plt.show()

num_bins = 'auto'
to_hist = ['atemp','temp','hum','windspeed']
workingdays[to_hist].hist(bins=num_bins, figsize=(20, 15))

plt.show()

num_bins = 'auto'
to_hist = ['atemp','temp','hum','windspeed']
holidays[to_hist].hist(bins=num_bins, figsize=(20, 15))

plt.show()

plt.figure(figsize=(20,10))
plt.subplot(2,2, 1)
plt.plot(holidays['dteday'],holidays['cnt'])
plt.title('Total Bikers on Weekend')
plt.subplot(2, 2, 2)
plt.plot(holidays['dteday'],holidays['temp'])
plt.title('Temperature on Weekend')
plt.subplot(2, 2, 3)
plt.plot(holidays['dteday'],holidays['atemp'])
plt.title('Feels Like Temperature on Weekend')
plt.subplot(2, 2, 4)
plt.plot(holidays['dteday'],holidays['hum'])
plt.title('Humidity on Weekend')

Text(0.5, 1.0, 'Humidity on Weekend')

fig,ax = plt.subplots(1,2,sharey=True,sharex=True,figsize=(15,5))
#plt.figure(figsize=(1000,1000))
sns.pointplot(data=holidays[['hr',
                           'cnt',
                           'season']],
              x='hr',
              y='cnt',
              hue='season',
              ax=ax[0])
sns.pointplot(data=workingdays[['hr',
                           'cnt',
                           'season']],
              x='hr',
              y='cnt',
              hue='season',
              ax=ax[1])
ax[0].set(title="Weekends")
ax[1].set(title="Weekdays")
plt.show()
fig.savefig('test2')

hey = hourly_raw_data.corr()['cnt'][:]
hey.sort_values()

#Hourly
plt.rcParams["figure.figsize"] = (10,10)
corr = hourly_raw_data.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,len(hourly_raw_data.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(hourly_raw_data.columns)
ax.set_yticklabels(hourly_raw_data.columns)
plt.show()

Modeling (Baseline)¶

Two data cleaning steps are taken before running a baseline model:

Converting Date to datetime64 format
One-hot encoding of categorical variables

Converting `dteday` to date¶

hourly_raw_data['dteday']=pd.to_datetime(hourly_raw_data['dteday'], format='%Y-%m-%d')
hourly_raw_data['dteday'].head()

0   2011-01-01
1   2011-01-01
2   2011-01-01
3   2011-01-01
4   2011-01-01
Name: dteday, dtype: datetime64[ns]

One hot Encoding¶

For season, weathersit, mnth,weekday,hr

category  = ['season', 'weathersit', 'mnth','weekday','hr']

hourly_raw_data = onehot_encode(hourly_raw_data,category)

Final data before Baseline¶

hourly_raw_data.head()

Splitting train test¶

The dataset is split into train and test set. The train set comprises of data from the first quarter of 2011 till the end of second quarter of 2012 and the test set comprises of third quarter of 2012. Predictions are made for the third quarter of 2012 in the baseline step. The explanatory variables neither include casual users not registered users. The target variable is total users.

X_Train = hourly_raw_data[hourly_raw_data['dteday'] < datetime.datetime(2012, 7, 1, 0, 0)].drop(['cnt','casual','registered','dteday'],axis=1)## NONE OF THst_EM IN DATA
X_Test = hourly_raw_data[(hourly_raw_data['dteday'] >= datetime.datetime(2012, 7, 1, 0, 0)) & (hourly_raw_data['dteday'] <= datetime.datetime(2012, 9, 30, 0, 0))].drop(['cnt','casual','registered','dteday'],axis=1)## NONE OF THEM IN DATA
Y_cnt_test =hourly_raw_data['cnt'][(hourly_raw_data['dteday'] >= datetime.datetime(2012, 7, 1, 0, 0)) & (hourly_raw_data['dteday'] <= '2012-09-30')]
Y_cnt_train =hourly_raw_data['cnt'][hourly_raw_data['dteday'] < datetime.datetime(2012, 7, 1, 0, 0)]

Linear Regression¶

A simple linear regression is used to predict hourly bikers for the third quarter of 2012. Baseline model predicts the number of total bikers with an accuracy of 60%.

regr = LinearRegression(fit_intercept =True)
regr.fit(X_Train, Y_cnt_train)
y_pred = regr.predict(X_Test)
r2_score(Y_cnt_test,Y_cnt_test + y_pred)
regr.score(X_Test, Y_cnt_test)

0.6006646365424165

	casual	instant	temp	dteday	registered	hum	atemp	...
0	3	1	0.24	2011-01-01	13	0.81	0.2879	...
1	8	2	0.22	2011-01-01	32	0.80	0.2727	...
2	5	3	0.22	2011-01-01	27	0.80	0.2727	...
3	3	4	0.24	2011-01-01	10	0.75	0.2879	...
4	0	5	0.24	2011-01-01	1	0.75	0.2879	...