Kaggle: Bike Sharing Demand Part 1

I finally had a chance to play around with a Kaggle challenge, and Bike Sharing Demand seems to be the easiest to tackle: it requires little or no domain expertise.

This is Part 1, where I'll go over how I apply some basic statistical knowledge to extract features. Well, extraction and selection to be exact. I fed these features into a Random Forest (from Scikit-Learn), which will hopefully be covered in Part 2. The result I have so far is mediocre; I plan to play around with the features and try different algorithms later when I have the time. My final target is a score below 0.5, or around 0.4 at best, before I give up :( If some of the reasoning is wrong, let me know. Although I have a background in Actuarial Science, it has been quite a while.

Feature selection

I will be using R to toy around with the data.

Since this is a regression problem (predicting a continuous value), I would like to start simple with linear regression to see how things look first.

> str(bike)
'data.frame':	10886 obs. of  14 variables:
 $ datetime  : Factor w/ 10886 levels "2011-01-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
 $ time      : Factor w/ 24 levels "00","01","02",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ day       : Factor w/ 7 levels "Friday","Monday",..: 3 3 3 3 3 3 3 3 3 3 ...

time and day are variables that I extracted from datetime. I could do the same for month as well. There is no statistical basis behind this; it comes purely from intuition and general knowledge. But we will let the stats decide whether these features are worthwhile.
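As a rough sketch of how time and day could be derived, the snippet below parses a few hand-written timestamps in the same format as the datetime column. The toy data frame and the `stamp` variable are made up for illustration, and this is only one possible way to do the extraction, not necessarily how the columns above were built.

```r
# Toy stand-in for the training data; only the datetime column matters here.
toy <- data.frame(
  datetime = factor(c("2011-01-01 00:00:00",
                      "2011-01-01 01:00:00",
                      "2011-01-03 13:00:00"))
)

# datetime is stored as a factor, so convert to character before parsing.
stamp <- as.POSIXct(as.character(toy$datetime),
                    format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Hour of day as a two-digit factor ("00" to "23"), like the time column.
toy$time <- factor(format(stamp, "%H"))

# Day of week as a factor; weekdays() returns names in the current locale.
toy$day <- factor(weekdays(stamp))

str(toy)
```

A month feature could be built the same way with `format(stamp, "%m")`.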

> model1 <- lm(count ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed, data=bike)
> summary(model1)

Call:
lm(formula = count ~ season + holiday + workingday + weather + 
    temp + atemp + humidity + windspeed, data = bike)

Residuals:
    Min      1Q  Median      3Q     Max 
-335.81 -102.67  -31.95   66.44  677.02 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 135.79052    8.71016  15.590  < 2e-16 ***
season       22.75882    1.42662  15.953  < 2e-16 ***
holiday      -9.15872    9.27009  -0.988 0.323181    
workingday   -1.14953    3.31527  -0.347 0.728795    
weather       5.93872    2.61924   2.267 0.023389 *  
temp          1.84737    1.14210   1.618 0.105796    
atemp         5.63120    1.05057   5.360 8.49e-08 ***
humidity     -3.05684    0.09262 -33.003  < 2e-16 ***
windspeed     0.77762    0.19999   3.888 0.000102 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 155.8 on 10877 degrees of freedom
Multiple R-squared:  0.2609,	Adjusted R-squared:  0.2604 
F-statistic:   480 on 8 and 10877 DF,  p-value: < 2.2e-16

Our R-squared is not that great. Maybe we can improve the model by excluding some features. Usually, features marked with * are considered significant when evaluating the model. One thing to note is that atemp (feels like) has 3 *'s while temp does not have any. This is quite misleading: temp should be equally important. However, because the correlation between temp and atemp is 0.9849481 (highly correlated), most of the information in one variable is already being 'captured' by the other. So deleting one of these features will result in minimal loss of information.
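To illustrate the point about correlated features, here is a small sketch on simulated data (not the Kaggle data). The numbers used to generate temp, atemp and count are arbitrary assumptions, chosen only so that the two temperature columns are highly correlated. It compares the R-squared of a model with both columns against a model with only one of them.

```r
set.seed(1)
n <- 500

# Simulated data: atemp is almost a linear function of temp.
temp  <- runif(n, 0, 40)
atemp <- 1.1 * temp + 3 + rnorm(n, sd = 1)
count <- 5 * atemp + rnorm(n, sd = 20)
toy   <- data.frame(temp, atemp, count)

# Correlation between the two temperature features.
r <- cor(toy$temp, toy$atemp)

# Fit with both features, then with atemp only.
full    <- lm(count ~ temp + atemp, data = toy)
reduced <- lm(count ~ atemp, data = toy)

r2_full    <- summary(full)$r.squared
r2_reduced <- summary(reduced)$r.squared

print(c(correlation = r, r2_full = r2_full, r2_reduced = r2_reduced))
```

On the real data, the same check would be `cor(bike$temp, bike$atemp)` followed by refitting model1 without temp.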

Let's take a look at humidity. The coefficient is negative: the relationship between count and humidity is inverse, so the higher the humidity, the fewer the bike rentals. Since I am not a weather expert, I'm not sure how true this is, but let's plot it on a graph to check.

humidity vs count

Well, bike rentals do drop as humidity increases, although not smoothly (maybe some other factors are influencing bike rentals at those humidity levels).
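A graph like this can be produced by averaging count within humidity bins. The sketch below uses simulated data with an assumed negative trend, and the bin width of 20 is an arbitrary choice; it is not the code behind the graph above.

```r
set.seed(2)
n <- 500

# Simulated data with a downward trend in count as humidity rises.
humidity <- sample(0:100, n, replace = TRUE)
count    <- round(pmax(0, 400 - 3 * humidity + rnorm(n, sd = 50)))
toy      <- data.frame(humidity, count)

# Group humidity into bins of width 20 and average count per bin.
toy$humidity_bin <- cut(toy$humidity,
                        breaks = seq(0, 100, by = 20),
                        include.lowest = TRUE)
by_bin <- aggregate(count ~ humidity_bin, data = toy, FUN = mean)
print(by_bin)

# Bar chart of mean count per humidity bin.
barplot(by_bin$count,
        names.arg = as.character(by_bin$humidity_bin),
        xlab = "humidity", ylab = "mean count")
```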

Another thing I would like to do is break the rentals down into registered and casual users and see whether the time of day affects them differently. Registered users might use the bikes for work, and casual users might use them for leisure. Let's see this on a graph.

Looking at the graph, we can confirm our assumption. We can build separate models to predict casual and registered. Some of the features might remain the same, but we might model time differently for each. Once we have the predicted counts for casual and registered, we can add them up to get the predicted count.
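The two-model idea might look roughly like the sketch below. It uses simulated data in which registered rentals peak at commute hours on working days and casual rentals peak around midday on non-working days; those patterns, the rates and the model formulas are all assumptions for illustration. It also uses lm rather than the Random Forest mentioned earlier, just to keep the example short.

```r
set.seed(3)

# 20 simulated days of hourly data, alternating working and non-working days.
hours <- rep(0:23, times = 20)
toy <- data.frame(
  time       = factor(sprintf("%02d", hours)),
  workingday = rep(c(1, 0), each = 24, times = 10)
)

# Assumed usage patterns for the two groups.
commute <- hours %in% c(8, 17, 18)
midday  <- hours %in% 11:16
toy$registered <- rpois(nrow(toy), lambda = 20 + 150 * commute * toy$workingday)
toy$casual     <- rpois(nrow(toy), lambda = 5 + 40 * midday * (1 - toy$workingday))

# One model per group; time is modelled differently in each.
fit_registered <- lm(registered ~ time * workingday, data = toy)
fit_casual     <- lm(casual ~ time + workingday, data = toy)

# Clip negative predictions to zero, then add the two parts.
pred_registered <- pmax(0, predict(fit_registered, newdata = toy))
pred_casual     <- pmax(0, predict(fit_casual, newdata = toy))
pred_count      <- pred_registered + pred_casual

head(data.frame(time = toy$time, workingday = toy$workingday, pred_count))
```

Note that with plain lm and identical predictors in both models, the summed fitted values are the same as those from a single model on count, so the split only pays off when the two models use different features or a non-linear learner.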

This is part of my thought process when given data. I applied the same simple methodology to the other features, such as windspeed and day (Sunday is somehow important). Try to analyze the data first and feed it into a model later.