Washington DC Biking data | Hourly Bike Count Prediction

2. Data Preparation & Feature Engineering

MBD O-1-5

Notebook preparation

In [1]:
%matplotlib inline

# Automatically reload the function file when it changes
%load_ext autoreload
%aimport My_Functions
%run My_Functions.py
%autoreload 1

# Libraries used directly in this notebook
import numpy as np
import pandas as pd
In [2]:
# Data Import
hourly_raw_data = pd.read_csv('hour.csv')

Feature Engineering

Converting dteday to date

In [3]:
hourly_raw_data['dteday']=pd.to_datetime(hourly_raw_data['dteday'], format='%Y-%m-%d')

Add isDaylight and isNoon for hourly data

The Astral module is used to calculate the daylight and noon flags.

A custom function classifies each row as daylight or not. If the hour of a record is later than the hour of sunrise and earlier than the hour of sunset in Washington DC, the row is flagged as daylight; otherwise it is flagged as not daylight. A noon flag is created in the same way: if the hour of a record equals the hour of noon in Washington DC, the row is flagged as noon; otherwise it is flagged as not noon.

In [4]:
# Initialize the flag columns, then set them row by row with the custom function
hourly_raw_data['isDaylight'] = 0
hourly_raw_data['isNoon'] = 0

hourly_raw_data = hourly_raw_data.apply(isDaylight, axis=1)
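
The flag logic described above can be sketched as follows. This is only an illustration, not the isDaylight function from My_Functions, which is not shown in this notebook. The helper name daylight_noon_flags and the fixed sunrise, noon, and sunset hours are hypothetical stand-ins for the per-date values that the Astral module would supply.

```python
import pandas as pd

def daylight_noon_flags(row, sunrise_hr, noon_hr, sunset_hr):
    """Set isDaylight and isNoon on one hourly record (hypothetical helper)."""
    row = row.copy()
    # Daylight: strictly after the sunrise hour and strictly before the sunset hour
    row['isDaylight'] = int(sunrise_hr < row['hr'] < sunset_hr)
    # Noon: the record's hour equals the noon hour
    row['isNoon'] = int(row['hr'] == noon_hr)
    return row

# Toy data; the hours 7, 12, and 17 are assumed values, not Astral output
sample = pd.DataFrame({'hr': [5, 12, 20]})
sample['isDaylight'] = 0
sample['isNoon'] = 0
sample = sample.apply(daylight_noon_flags, axis=1, args=(7, 12, 17))
print(sample)
```

In a real run the three hours would be looked up per row from the row's date rather than passed as constants.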

Adding temp, atemp, windspeed, and hum relative to the last 7 days

Exploratory data analysis revealed outliers in the seasonal variables. To make the model robust enough to predict these outliers, relative variables are created for temp, atemp, hum, and windspeed. For temp, for example, the mean of temp over the last seven days is subtracted from the current temp value, and the result is divided by the standard deviation of temp over the same seven days.

In [5]:
to_relative = ['temp', 'atemp', 'hum', 'windspeed']
hourly_raw_data = relative_values(hourly_raw_data, to_relative)
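
The standardization described above could look roughly like the sketch below. It is not the relative_values function from My_Functions, which is not shown here. The function name relative_to_window is hypothetical, and two choices are assumptions: the window is taken as 7 * 24 hourly rows, and the current row is excluded from its own window.

```python
import pandas as pd

def relative_to_window(df, columns, window=7 * 24):
    """Add relative_<col>: (value - rolling mean) / rolling std (hypothetical helper)."""
    df = df.copy()
    for col in columns:
        # Shift by one row so the window holds only past values (assumption)
        past = df[col].shift(1)
        mean = past.rolling(window, min_periods=2).mean()
        std = past.rolling(window, min_periods=2).std()
        # Rows without enough history, or with zero std, yield NaN or inf
        df['relative_' + col] = (df[col] - mean) / std
    return df

# Toy data with a 2-row window so the result is easy to check by hand
toy = pd.DataFrame({'temp': [1.0, 3.0, 5.0, 7.0]})
out = relative_to_window(toy, ['temp'], window=2)
print(out)
```

The first rows have no complete history and come out as NaN, so they would need to be dropped or filled before modeling.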

Adding RushHour-High & RushHour-Med & RushHour-Low

The interactive time series shows that the numbers of casual, registered, and total bikers vary over the course of a day. This observation led to the creation of rush hour flags. The logic for these flags is as follows:

Working Day:
10:00 AM to 6:00 PM is flagged as high rush hour. 7:00 PM to midnight, 8:00 AM, and 9:00 AM are flagged as medium rush hour. All other hours are flagged as low rush hour.
Holiday:
7:00 AM to 9:00 AM and 4:00 PM to 8:00 PM are flagged as high rush hour. 6:00 AM, 10:00 AM to 1:00 PM, 3:00 PM, and 9:00 PM to 11:00 PM are flagged as medium rush hour. All other hours are flagged as low rush hour.

In [6]:
# Initialize the three flag columns, then set them row by row
hourly_raw_data['RushHour-High'] = 0
hourly_raw_data['RushHour-Med'] = 0
hourly_raw_data['RushHour-Low'] = 0

hourly_raw_data = hourly_raw_data.apply(addRushHourFlags, axis=1)
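
The rules listed above can be written as a small lookup, sketched below. This is not the addRushHourFlags function from My_Functions, which is not shown here. The names rush_hour_level and add_rush_hour_flags are hypothetical, and the sketch assumes that every range includes both endpoints and that "7:00 PM to midnight" means hours 19 through 23.

```python
import pandas as pd

def rush_hour_level(hr, workingday):
    """Return 'High', 'Med', or 'Low' for an hour 0-23 (hypothetical helper)."""
    if workingday == 1:
        if 10 <= hr <= 18:
            return 'High'
        if hr in (8, 9) or 19 <= hr <= 23:
            return 'Med'
        return 'Low'
    # Non-working days (weekends and holidays)
    if 7 <= hr <= 9 or 16 <= hr <= 20:
        return 'High'
    if hr == 6 or 10 <= hr <= 13 or hr == 15 or 21 <= hr <= 23:
        return 'Med'
    return 'Low'

def add_rush_hour_flags(row):
    """Set exactly one of the three RushHour-* columns to 1."""
    row = row.copy()
    level = rush_hour_level(row['hr'], row['workingday'])
    for name in ('High', 'Med', 'Low'):
        row['RushHour-' + name] = int(level == name)
    return row

toy = pd.DataFrame({'hr': [12, 8, 3, 17], 'workingday': [1, 1, 1, 0]})
flagged = toy.apply(add_rush_hour_flags, axis=1)
print(flagged)
```

Keeping the hour-to-level mapping in its own function makes the rules easy to check hour by hour before applying them to the full dataset.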

Splitting Data

In [7]:
workingdays = num_name(hourly_raw_data.loc[hourly_raw_data['workingday'].isin([1])])
holidays = num_name(hourly_raw_data.loc[~hourly_raw_data['workingday'].isin([1])])

Mean of the past 3 weeks during the same hour

Exploratory data analysis highlighted outliers in total bikers. To make the model robust enough to predict these outliers, a new variable is created from total bikers: the mean of total bikers over the last three weeks at the same hour as the current row is computed and added to the dataset. This variable is created separately for working days and holidays because the two show different patterns.

In [8]:
workingdays = mean_per_hour_3weeks(workingdays)
holidays = mean_per_hour_3weeks(holidays)
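
One way to compute such a same-hour mean is sketched below. It is not the mean_per_hour_3weeks function from My_Functions, which is not shown here. The function name mean_same_hour is hypothetical, and the sketch assumes a 21 calendar day window, a datetime dteday column, a unique row index, and that the current row is excluded from its own mean. Whether the original function excludes the current row is not stated.

```python
import pandas as pd

def mean_same_hour(df, days=21):
    """Add mean_per_hour: mean cnt at the same hr over the previous `days` days."""
    df = df.sort_values(['dteday', 'hr']).copy()
    result = pd.Series(index=df.index, dtype=float)
    for _, group in df.groupby('hr'):
        counts = group.set_index('dteday')['cnt']
        # closed='left' keeps the current day out of its own window (assumption)
        rolled = counts.rolling(f'{days}D', closed='left').mean()
        result.loc[group.index] = rolled.to_numpy()
    df['mean_per_hour'] = result
    return df

# Toy data: three days, two hours per day
dates = pd.to_datetime(['2011-01-03', '2011-01-04', '2011-01-05'])
toy = pd.DataFrame({
    'dteday': dates.repeat(2),
    'hr': [8, 9] * 3,
    'cnt': [10, 100, 20, 200, 30, 300],
})
out = mean_same_hour(toy)
print(out)
```

Excluding the current row keeps the target value out of its own feature; the first day of each hour has no history and comes out as NaN.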

One-hot Encoding | 2x for split datasets

For season, weathersit, mnth, weekday, and hr

In [9]:
category = ['season', 'weathersit', 'mnth', 'weekday', 'hr']

workingdays = onehot_encode(workingdays, category)
workingdays = workingdays.drop('instant', axis=1)

holidays = onehot_encode(holidays, category)
holidays = holidays.drop('instant', axis=1)
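
For readers without access to My_Functions, one-hot encoding of this kind can be sketched with pandas alone. The function name onehot_sketch is hypothetical, and the actual onehot_encode helper may name or order its columns differently.

```python
import pandas as pd

def onehot_sketch(df, columns):
    """Replace each listed column with one 0/1 indicator column per category."""
    return pd.get_dummies(df, columns=columns, dtype=int)

toy = pd.DataFrame({'season': [1, 2, 1], 'cnt': [5, 6, 7]})
encoded = onehot_sketch(toy, ['season'])
print(encoded)
```

Each category value becomes its own column (here season_1 and season_2), and the original season column is dropped.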

Genetic Programming | 2x for split datasets

Genetic programming is a supervised algorithm that combines simple mathematical operations such as addition, multiplication, and square root to find relationships between the existing features and the target. It tries many combinations of these operations, and its candidates improve over the number of generations it is set to run. This step added 15 features each to the working days and holidays datasets.
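
The idea can be illustrated with a deliberately tiny evolutionary search, sketched below. It is not the Genetic_P function from My_Functions, which is not shown here, and it is far simpler than a real genetic programming library: each candidate is a single operation on two feature columns, new candidates come from mutation only, and fitness is assumed to be the absolute Pearson correlation with the target. All names in the sketch are hypothetical.

```python
import numpy as np

# Primitive operations applied element-wise to two feature columns
OPS = {
    'add': np.add,
    'sub': np.subtract,
    'mul': np.multiply,
    'sqrt_abs_sum': lambda a, b: np.sqrt(np.abs(a + b)),
}

def evaluate(program, X):
    """Compute the feature described by program = (op name, column i, column j)."""
    op, i, j = program
    return OPS[op](X[:, i], X[:, j])

def fitness(program, X, y):
    """Absolute Pearson correlation between the candidate feature and the target."""
    values = evaluate(program, X)
    if np.std(values) == 0:
        return 0.0  # constant features carry no signal
    return abs(np.corrcoef(values, y)[0, 1])

def mutate(program, n_features, rng):
    """Randomly redraw one of the three parts of a program."""
    op, i, j = program
    part = rng.integers(3)
    if part == 0:
        op = str(rng.choice(list(OPS)))
    elif part == 1:
        i = int(rng.integers(n_features))
    else:
        j = int(rng.integers(n_features))
    return (op, i, j)

def random_program(n_features, rng):
    return (str(rng.choice(list(OPS))),
            int(rng.integers(n_features)),
            int(rng.integers(n_features)))

def evolve(X, y, population_size=60, generations=20, seed=0):
    """Keep the fitter half each generation and refill with mutated copies."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    population = [random_program(n_features, rng) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: fitness(p, X, y), reverse=True)
        parents = ranked[:population_size // 2]
        children = [mutate(parents[int(rng.integers(len(parents)))], n_features, rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=lambda p: fitness(p, X, y))

# Toy problem: the target is the product of the first two features
data_rng = np.random.default_rng(1)
X_toy = data_rng.normal(size=(200, 3))
y_toy = X_toy[:, 0] * X_toy[:, 1]
best = evolve(X_toy, y_toy)
print(best, fitness(best, X_toy, y_toy))
```

Full genetic programming evolves whole expression trees and also recombines them through crossover, which is why the logged program lengths in the cells below grow over generations.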

In [10]:
dates = workingdays['dteday']
registered = workingdays['registered']
casual = workingdays['casual']
workingdays = Genetic_P(workingdays.drop(['registered','casual','dteday'],axis=1),'cnt')
workingdays['dteday'] = dates
workingdays['registered'] = registered
workingdays['casual'] = casual
    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0     8.18 0.10366531211608211       27 0.6803303846010792 0.7018874357779318      1.87m
   1      6.5 0.354709964903118        9 0.7366881993886866 0.7220164378148297      1.59m
   2     6.89 0.5383568806607192        8 0.7476740463886652 0.7490166380713607      1.46m
   3    11.18 0.5057637866501863       13 0.7778487551207786 0.7775870072920257      1.41m
   4     8.73 0.622810154591345       13 0.819772281327269 0.8099643779653568      1.37m
   5    11.37 0.6314295341422707       22 0.8341533887833049 0.8291747736647705      1.31m
   6    15.23 0.6718916100606667       35 0.8493650990505909 0.8556537990469637      1.29m
   7    20.42 0.7157770457788527       38 0.8516395715562576 0.8432751909933697      1.28m
   8    26.64 0.7457576320292357       35 0.8529354308031448 0.8435685600942456      1.26m
   9    31.88 0.7561574160941632       35 0.8529372931229201 0.8419409692500015      1.25m
  10    35.55 0.75590612292432       35 0.8596101056645172 0.8496629159262548      1.21m
  11     35.6 0.7577516298268924       67 0.8617915000801162 0.866364667818459      1.17m
  12     35.5 0.7490225667436534       41 0.8639209008635413 0.8567762898059553      1.09m
  13    36.28 0.7617784293206373       41 0.8658754318718078 0.8359655065101648     57.87s
  14    36.68 0.7745503765323003       41 0.8657213212865869 0.8388066646581145     53.90s
  15    36.95 0.7840683142516855       55 0.8655231255414028 0.841156242650006     45.01s
  16    37.29 0.778035154073355       44 0.8654437783339207 0.8610710800427817     34.89s
  17     38.0 0.7709353855584047       35 0.8701513438893017 0.8841831081116486     23.66s
  18    38.16 0.7818389341782349       36 0.8803818527600898 0.8770770058039786     11.94s
  19    38.61 0.7719413115228365       36 0.8817286785333577 0.8648330383636277      0.00s
Number of features created out of genetic programing: (11841, 15)
In [11]:
dates = holidays['dteday']
registered = holidays['registered']
casual = holidays['casual']
holidays = Genetic_P(holidays.drop(['registered','casual','dteday'],axis=1),'cnt')
holidays['dteday'] = dates
holidays['registered'] = registered
holidays['casual'] = casual
    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0     8.19 0.1279387555910619       27 0.7391447702124135 0.6790842209716381     32.27s
   1     6.83 0.40810662483711346        9 0.7767287166819173 0.796930939034333     43.26s
   2     7.32 0.5711801232866154        8 0.7954894833264207 0.8168501449209555     54.58s
   3    12.02 0.539867688621446       10 0.8138789870687457 0.7884004724625453     59.12s
   4     8.44 0.6675797536118186       12 0.8152968051845696 0.8020240119831885      1.01m
   5    10.46 0.6643969869918799       12 0.8266288755576687 0.8170749732941051     59.83s
   6    13.07 0.6887941841707027       12 0.826024400928929 0.8230713618188912     56.76s
   7    13.13 0.6906523444934862       21 0.8284314229676841 0.8229788724006125     52.35s
   8    14.16 0.7015206001227491       26 0.8366117284867484 0.819554035283775     49.04s
   9    15.96 0.7084049330430291       25 0.8390484509561364 0.825133321104512     44.81s
  10     16.5 0.705023798370279       11 0.8466324501592138 0.8659658979106714     40.85s
  11    18.27 0.7173043817476921       26 0.8525790315332034 0.8344118793758166     36.60s
  12    19.52 0.7172017349380975       27 0.8553979009033416 0.845739195259092     32.49s
  13     17.7 0.6981848544616095       32 0.8584283560865943 0.8722454230352821     28.25s
  14    15.68 0.7168589286196093       46 0.8611864943227328 0.8677935742356144     23.76s
  15    13.16 0.7145666344774058       28 0.8590710866887687 0.8505271936935428     19.13s
  16    10.82 0.7175209255580778       33 0.8590579564960812 0.8387145525030414     14.38s
  17     10.0 0.7297152324631035       26 0.8542884836679081 0.8522640068202044      9.59s
  18    10.14 0.7283766858123347       10 0.8540261614815664 0.8062306386257975      4.78s
  19    10.14 0.7304842495237451       10 0.8542395463656118 0.8124039207191525      0.00s
Number of features created out of genetic programing: (5484, 15)

Final Datasets

In [12]:
holidays[np.arange(1,15)].head()
Out[12]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
29 0.974798 0.846545 0.779781 0.676480 0.974798 0.861633 0.974798 0.975218 0.662575 0.974798 0.974798 0.732373 0.974798 0.561336
30 0.977095 0.847556 0.780356 0.675562 0.977095 0.862450 0.977095 0.977443 0.661817 0.977095 0.977095 0.731917 0.977095 0.559433
31 0.925130 0.824357 0.767373 0.696062 0.925130 0.844015 0.925130 0.928794 0.679029 0.925130 0.925130 0.742426 0.925130 0.601731
32 0.931994 0.827456 0.769083 0.693390 0.931994 0.846441 0.931994 0.935023 0.676751 0.931994 0.931994 0.741017 0.931994 0.596234
33 0.790369 0.762529 0.734913 0.745258 0.790369 0.797924 0.790369 0.818138 0.723091 0.790369 0.790369 0.770767 0.790369 0.703583
In [13]:
holidays.head()
Out[13]:
yr RushHour-Med workingday atemp relative_hum isDaylight relative_atemp mean_per_hour relative_windspeed RushHour-Low ... 8 9 10 11 12 13 14 dteday registered casual
29 0 0 0 0.4242 -0.775524 0 0.618955 2.0 0.971523 1 ... 0.975218 0.662575 0.974798 0.974798 0.732373 0.974798 0.561336 2011-01-02 2 0
30 0 0 0 0.4091 -0.887625 0 0.401594 3.0 0.117373 1 ... 0.977443 0.661817 0.977095 0.977095 0.731917 0.977095 0.559433 2011-01-02 1 0
31 0 1 0 0.4091 -1.527457 1 0.394223 8.0 0.355008 0 ... 0.928794 0.679029 0.925130 0.925130 0.742426 0.925130 0.601731 2011-01-02 8 0
32 0 1 0 0.3939 -0.798609 1 0.176721 14.0 0.348889 0 ... 0.935023 0.676751 0.931994 0.931994 0.741017 0.931994 0.596234 2011-01-02 19 1
33 0 0 0 0.3485 -0.123220 1 -0.464531 36.0 0.343078 0 ... 0.818138 0.723091 0.790369 0.790369 0.770767 0.790369 0.703583 2011-01-02 46 7

5 rows × 87 columns

Save both datasets

In [14]:
workingdays.to_csv("workingdays_data_prepared.csv", index=False)
holidays.to_csv("weekends_holi_data_prepared.csv", index=False)