%matplotlib inline
# To automatically reload the function file
%load_ext autoreload
%aimport My_Functions
%run My_Functions.py
%autoreload 1
# Data Import
hourly_raw_data = pd.read_csv('hour.csv')
dteday to date
hourly_raw_data['dteday'] = pd.to_datetime(hourly_raw_data['dteday'], format='%Y-%m-%d')
isDaylight and isNoon for hourly data
The Astral module is used to calculate flags for daylight and noon.
A custom function classifies each row as daylight or not. If the hour of a record is later than the hour of sunrise and earlier than the hour of sunset in Washington, DC, the record is flagged as daylight; otherwise it is flagged as not daylight. The noon flag is also created with a custom function. If the hour of a record equals the hour of noon in Washington, DC, the record is flagged as noon; otherwise it is flagged as not noon.
hourly_raw_data['isDaylight'] = 0
hourly_raw_data['isNoon'] = 0
hourly_raw_data = hourly_raw_data.apply(lambda x: isDaylight(x), axis=1)
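The daylight and noon rules can be illustrated with a small standalone function. This is only a sketch: the name daylight_noon_flags and the sun times passed to it are hypothetical, and in the notebook those times come from Astral rather than from hard-coded integers.

```python
def daylight_noon_flags(hour, sunrise_hour, noon_hour, sunset_hour):
    """Return (isDaylight, isNoon) as 0/1 flags for one hourly record."""
    # Daylight: strictly after the sunrise hour and strictly before the sunset hour
    is_daylight = int(sunrise_hour < hour < sunset_hour)
    # Noon: the record's hour equals the hour of noon
    is_noon = int(hour == noon_hour)
    return is_daylight, is_noon


# Made-up sun times for a single day: sunrise at 7, noon at 12, sunset at 17
flags = {hr: daylight_noon_flags(hr, 7, 12, 17) for hr in (6, 7, 12, 17)}
```

With strict comparisons, the sunrise and sunset hours themselves are not counted as daylight, which matches the "less than" and "more than" wording of the rule.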
Exploratory data analysis found outliers in the seasonal variables. To make a robust model that is able to predict outliers, new relative variables are created for temp, atemp, hum, and windspeed. For temp, for example, the mean of temp over the last seven days is subtracted from the current temp value, and the result is divided by the standard deviation of temp over the last seven days.
to_relative = ['temp', 'atemp', 'hum','windspeed']
hourly_raw_data = relative_values(hourly_raw_data, to_relative)
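The relative value described here is a rolling z-score. The sketch below shows one way to compute it with pandas; it is not the relative_values function from My_Functions. The function name relative_value, the toy data, and the choice to include the current row in the seven-day window are all assumptions.

```python
import numpy as np
import pandas as pd


def relative_value(series, window="7D"):
    """Rolling z-score: (value - rolling mean) / rolling std over a time window.

    Assumes the series has a sorted DatetimeIndex. The window includes the
    current observation (pandas' default for time-based windows).
    """
    rolling = series.rolling(window)
    return (series - rolling.mean()) / rolling.std()


# Toy hourly "temp" series covering ten days (values are made up)
n_hours = 24 * 10
index = pd.to_datetime("2011-01-01") + pd.to_timedelta(np.arange(n_hours), unit="h")
temp = pd.Series(np.sin(np.arange(n_hours) / 5.0), index=index)

temp_relative = relative_value(temp)
```

The first value is NaN because a standard deviation cannot be computed from a single observation.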
RushHour-High & RushHour-Med & RushHour-Low
The interactive time series shows that the numbers of casual, registered, and total bikers vary over the course of a day. This observation led to the creation of rush hour flags. The logic for these flags is as follows:
Working Day:
10:00 AM to 6:00 PM is flagged as high rush hour. 7:00 PM to midnight, 8:00 AM, and 9:00 AM are flagged as medium rush hour. All other hours are flagged as low rush hour.
Holiday:
7:00 AM to 9:00 AM and 4:00 PM to 8:00 PM are flagged as high rush hour. 6:00 AM, 10:00 AM to 1:00 PM, 3:00 PM, and 9:00 PM to 11:00 PM are flagged as medium rush hour. All other hours are flagged as low rush hour.
hourly_raw_data['RushHour-High'] = 0
hourly_raw_data['RushHour-Med'] = 0
hourly_raw_data['RushHour-Low'] = 0
hourly_raw_data = hourly_raw_data.apply(lambda x: addRushHourFlags(x), axis=1)
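The rush hour rules can be written as a lookup on the hour and the working day indicator. This is a sketch, not the addRushHourFlags function itself. It assumes hr is an integer from 0 to 23, that each range includes both of its endpoints, and that "7:00 PM to midnight" means hours 19 through 23; the name rush_hour_level and the set constants are hypothetical.

```python
# Hours (0-23) per level; range ends are treated as inclusive (assumption)
WORKINGDAY_HIGH = set(range(10, 19))                            # 10 AM to 6 PM
WORKINGDAY_MED = {8, 9} | set(range(19, 24))                    # 8 AM, 9 AM, 7 PM to 11 PM
HOLIDAY_HIGH = set(range(7, 10)) | set(range(16, 21))           # 7-9 AM, 4-8 PM
HOLIDAY_MED = {6, 15} | set(range(10, 14)) | set(range(21, 24))  # 6 AM, 10 AM-1 PM, 3 PM, 9-11 PM


def rush_hour_level(hr, workingday):
    """Return 'High', 'Med', or 'Low' for an hour and a 0/1 workingday flag."""
    if workingday:
        high, med = WORKINGDAY_HIGH, WORKINGDAY_MED
    else:
        high, med = HOLIDAY_HIGH, HOLIDAY_MED
    if hr in high:
        return "High"
    if hr in med:
        return "Med"
    return "Low"


levels = [rush_hour_level(hr, workingday=1) for hr in range(24)]
```

The returned label maps directly onto the three flag columns: exactly one of RushHour-High, RushHour-Med, and RushHour-Low would be set to 1 for each row.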
workingdays = num_name(hourly_raw_data.loc[(hourly_raw_data['workingday'].isin([1]) )])
holidays = num_name(hourly_raw_data.loc[(~hourly_raw_data['workingday'].isin([1]) )])
Exploratory data analysis also highlighted outliers in total bikers. To make a robust model that is able to predict outliers, a new variable is created for total bikers: the mean of total bikers over the last three weeks, computed for the same hour of day as the current row, is added to the dataset. This variable is created separately for working days and holidays because they show different patterns.
workingdays = mean_per_hour_3weeks(workingdays)
holidays = mean_per_hour_3weeks(holidays)
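One possible way to compute a same-hour mean over the previous three weeks is shown below. This is a sketch, not the mean_per_hour_3weeks function from My_Functions. It assumes columns named dteday, hr, and cnt, uses a 21-day calendar window, and excludes the current row from its own mean; the function name mean_same_hour and the output column cnt_mean_3w are hypothetical.

```python
import pandas as pd


def mean_same_hour(df, window="21D"):
    """Add the mean of cnt over the preceding window for the same hour of day."""
    out = df.sort_values("dteday").copy()
    parts = []
    for _, group in out.groupby("hr"):
        by_date = group.set_index("dteday")["cnt"]
        # closed="left" keeps the current day's count out of its own mean
        past_mean = by_date.rolling(window, closed="left").mean()
        parts.append(pd.Series(past_mean.to_numpy(), index=group.index))
    out["cnt_mean_3w"] = pd.concat(parts)
    return out


# Toy data: two hours of the day observed on 30 consecutive days (made-up counts)
days = pd.date_range("2011-01-01", periods=30, freq="D")
toy = pd.DataFrame({
    "dteday": list(days) * 2,
    "hr": [8] * 30 + [9] * 30,
    "cnt": list(range(30)) + list(range(100, 130)),
})
result = mean_same_hour(toy)
```

Rows with no earlier observation for the same hour get NaN and would need to be dropped or filled before modelling.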
One-hot encoding for season, weathersit, mnth, weekday, and hr
category = ['season', 'weathersit', 'mnth','weekday','hr']
workingdays = onehot_encode(workingdays,category)
workingdays = workingdays.drop('instant',axis=1)
holidays = onehot_encode(holidays,category)
holidays = holidays.drop('instant',axis=1)
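For reference, one-hot encoding of categorical columns can be done with pandas' get_dummies. The sketch below is an assumption about what onehot_encode does and is not its actual implementation; the function name onehot_sketch and the toy data are hypothetical.

```python
import pandas as pd


def onehot_sketch(df, columns):
    """Replace each listed column with one 0/1 indicator column per category."""
    return pd.get_dummies(df, columns=columns, prefix=columns)


# Toy frame with two categorical columns and the target (made-up values)
toy = pd.DataFrame({
    "season": [1, 2, 1],
    "weathersit": [1, 1, 3],
    "cnt": [16, 40, 32],
})
encoded = onehot_sketch(toy, ["season", "weathersit"])
```

Each category value becomes its own column (for example season_1 and season_2), and the original categorical columns are removed.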
Genetic programming (the Genetic_P function) is a supervised algorithm that combines simple mathematical operations such as addition, multiplication, and square root to find relationships between the existing features and the target. It tries multiple combinations of these operations and improves them over the number of generations it is set to run. This function added 15 features to each of the working days and holidays datasets.
dates = workingdays['dteday']
registered = workingdays['registered']
casual = workingdays['casual']
workingdays = Genetic_P(workingdays.drop(['registered','casual','dteday'],axis=1),'cnt')
workingdays['dteday'] = dates
workingdays['registered'] = registered
workingdays['casual'] = casual
dates = holidays['dteday']
registered = holidays['registered']
casual = holidays['casual']
holidays = Genetic_P(holidays.drop(['registered','casual','dteday'],axis=1),'cnt')
holidays['dteday'] = dates
holidays['registered'] = registered
holidays['casual'] = casual
holidays[np.arange(1,15)].head()
holidays.head()
workingdays.to_csv("workingdays_data_prepared.csv", index=False)
holidays.to_csv("weekends_holi_data_prepared.csv", index=False)