Bike Sharing Demand

Forecast use of a city bikeshare system

This data was provided to Kaggle by the Capital Bikeshare program in Washington, D.C. I analyzed it before writing a machine learning algorithm to fill in the missing data. These are the results of that analysis.

Original Kaggle Data



Analysis

Univariate Plots

Dimensions, Column Names and Structure

## [1] 10886    16
##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [11] "registered" "count"      "hour"       "month"      "year"      
## [16] "yearmonth"
## 'data.frame':    10886 obs. of  16 variables:
##  $ datetime  : POSIXct, format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
##  $ season    : Factor w/ 4 levels "Spring","Summer",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weather   : Factor w/ 4 levels "Clear","Mist",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num  0 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
##  $ hour      : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ year      : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ yearmonth : int  201101 201101 201101 201101 201101 201101 201101 201101 201101 201101 ...
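The hour, month, year and yearmonth columns and the labelled factors shown above are derived rather than raw. A minimal sketch of how they could be built follows. It assumes the raw file codes season and weather as integers 1 to 4 and uses a two-row toy data frame in place of train.csv. The label order comes from the output above; the time zone and the toy values are assumptions.

```r
# Toy stand-in for the raw file; the real data would come from read.csv("train.csv").
raw <- data.frame(
  datetime = c("2011-01-01 00:00:00", "2012-12-19 23:00:00"),
  season   = c(1L, 4L),
  weather  = c(1L, 2L),
  count    = c(16L, 88L),  # illustrative values only
  stringsAsFactors = FALSE
)

bikeShare <- raw
# Parse timestamps; the time zone is an assumption (UTC avoids daylight-saving gaps).
bikeShare$datetime <- as.POSIXct(raw$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
# Label the integer codes in the order shown in the str() output.
bikeShare$season  <- factor(raw$season,  levels = 1:4,
                            labels = c("Spring", "Summer", "Fall", "Winter"))
bikeShare$weather <- factor(raw$weather, levels = 1:4,
                            labels = c("Clear", "Mist", "Rain", "Heavy Rain"))
# Derive the calendar columns from the timestamp.
bikeShare$hour      <- as.integer(format(bikeShare$datetime, "%H"))
bikeShare$month     <- as.integer(format(bikeShare$datetime, "%m"))
bikeShare$year      <- as.integer(format(bikeShare$datetime, "%Y"))
bikeShare$yearmonth <- bikeShare$year * 100L + bikeShare$month

str(bikeShare)
```

Keeping yearmonth as an integer of the form YYYYMM makes it sort chronologically without further parsing.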

Factor Variables

Weather

## [1] "Clear"      "Mist"       "Rain"       "Heavy Rain"

Seasons

## [1] "Spring" "Summer" "Fall"   "Winter"

Summary

##     datetime                      season        holiday       
##  Min.   :2011-01-01 00:00:00   Spring:2686   Min.   :0.00000  
##  1st Qu.:2011-07-02 07:15:00   Summer:2733   1st Qu.:0.00000  
##  Median :2012-01-01 20:30:00   Fall  :2733   Median :0.00000  
##  Mean   :2011-12-27 05:18:05   Winter:2734   Mean   :0.02857  
##  3rd Qu.:2012-07-01 12:45:00                 3rd Qu.:0.00000  
##  Max.   :2012-12-19 23:00:00                 Max.   :1.00000  
##    workingday           weather          temp           atemp      
##  Min.   :0.0000   Clear     :7192   Min.   : 0.82   Min.   : 0.76  
##  1st Qu.:0.0000   Mist      :2834   1st Qu.:13.94   1st Qu.:16.66  
##  Median :1.0000   Rain      : 859   Median :20.50   Median :24.24  
##  Mean   :0.6809   Heavy Rain:   1   Mean   :20.23   Mean   :23.66  
##  3rd Qu.:1.0000                     3rd Qu.:26.24   3rd Qu.:31.06  
##  Max.   :1.0000                     Max.   :41.00   Max.   :45.45  
##     humidity        windspeed          casual         registered   
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.: 47.00   1st Qu.: 7.002   1st Qu.:  4.00   1st Qu.: 36.0  
##  Median : 62.00   Median :12.998   Median : 17.00   Median :118.0  
##  Mean   : 61.89   Mean   :12.799   Mean   : 36.02   Mean   :155.6  
##  3rd Qu.: 77.00   3rd Qu.:16.998   3rd Qu.: 49.00   3rd Qu.:222.0  
##  Max.   :100.00   Max.   :56.997   Max.   :367.00   Max.   :886.0  
##      count            hour           month             year     
##  Min.   :  1.0   Min.   : 0.00   Min.   : 1.000   Min.   :2011  
##  1st Qu.: 42.0   1st Qu.: 6.00   1st Qu.: 4.000   1st Qu.:2011  
##  Median :145.0   Median :12.00   Median : 7.000   Median :2012  
##  Mean   :191.6   Mean   :11.54   Mean   : 6.521   Mean   :2012  
##  3rd Qu.:284.0   3rd Qu.:18.00   3rd Qu.:10.000   3rd Qu.:2012  
##  Max.   :977.0   Max.   :23.00   Max.   :12.000   Max.   :2012  
##    yearmonth     
##  Min.   :201101  
##  1st Qu.:201107  
##  Median :201201  
##  Mean   :201157  
##  3rd Qu.:201207  
##  Max.   :201212

Standard Deviations for Numerical Data

## [1] "count: 181.14"
## [1] "registered: 151.04"
## [1] "casual: 49.96"
## [1] "windspeed: 8.16"
## [1] "humidity: 19.25"
## [1] "atemp: 8.47"
## [1] "temp: 7.79"
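Output in this shape can be produced by looping over the numeric columns. The sketch below is one way to do it, not necessarily the code used here. It runs on a toy data frame built from the first five rows of the str() output, so its numbers will not match the full-data values above.

```r
# Toy values taken from the first rows of the str() output above.
bikeShare <- data.frame(
  count      = c(16, 40, 32, 13, 1),
  registered = c(13, 32, 27, 10, 1),
  casual     = c(3, 8, 5, 3, 0)
)

# Standard deviation of each numeric column.
numeric_cols <- sapply(bikeShare, is.numeric)
sds <- sapply(bikeShare[numeric_cols], sd)

# Print one "name: value" line per column, rounded to two decimals.
for (name in names(sds)) {
  print(paste0(name, ": ", round(sds[[name]], 2)))
}
```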

Date Ranges

The data run from Jan 01, 2011 12am to Dec 19, 2012 11pm. Since 2012 was a leap year, that span covers 719 days, so there are 17,255 hours between the two timestamps (17,256 hourly slots if both endpoints are counted). However, we have only 10,886 entries, which means either that this is a sample of the timeframe (likely, as the file is called train.csv) or that hours with no riders were excluded. We can test for the latter possibility…

sum(bikeShare$count == 0)
## [1] 0

So either hours with no riders are not recorded or there are always riders. Either way, we certainly don’t have data for every hour between our selected dates. In fact, we have only 63.09% of the hours.
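The arithmetic behind these figures can be reproduced directly. This is a small sketch; the variable names are illustrative, and the UTC time zone is an assumption chosen so that daylight-saving shifts do not affect the hour count.

```r
# First and last timestamps from the summary above (UTC is an assumption).
first <- as.POSIXct("2011-01-01 00:00:00", tz = "UTC")
last  <- as.POSIXct("2012-12-19 23:00:00", tz = "UTC")

# Elapsed hours between the endpoints, and hourly slots counting both endpoints.
hours_between <- as.numeric(difftime(last, first, units = "hours"))
hourly_slots  <- hours_between + 1

# Share of the hours that actually appear in the data.
n_obs    <- 10886
coverage <- round(100 * n_obs / hours_between, 2)

hours_between  # 17255
hourly_slots   # 17256
coverage       # 63.09
```

Dividing by the 17,256 hourly slots instead gives the same percentage after rounding to two decimals.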

Weather

Since there is only a single entry for Heavy Rain in the Weather column, we will either have to drop that entry or combine it with Rain if we are to use the column.
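Both options are short in base R. The sketch below uses a toy factor with the same four levels as the weather column. Which option suits the later model is left open.

```r
# Toy factor with the same four levels as the weather column.
weather <- factor(c("Clear", "Mist", "Rain", "Heavy Rain", "Clear"),
                  levels = c("Clear", "Mist", "Rain", "Heavy Rain"))

# Option 1: merge "Heavy Rain" into "Rain" by renaming the level.
merged <- weather
levels(merged)[levels(merged) == "Heavy Rain"] <- "Rain"
table(merged)

# Option 2: drop the single observation and the now-empty level.
dropped <- droplevels(weather[weather != "Heavy Rain"])
table(dropped)
```

Merging keeps every row, while dropping shortens the data by one observation.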

Season

## Spring Summer   Fall Winter 
##   2686   2733   2733   2734

It’s interesting how closely matched the seasons are. If these were just the hours that contained riders, rather than a sample drawn from the timeframe, you wouldn’t expect the counts to be so closely in line with one another. Then again, even as a sample, it must have been drawn evenly from each season to produce these results.

Ridership

As mentioned earlier, all of the entries contain at least one rider and each entry represents one hour. Median ridership is 145 riders per hour, with a maximum of 977.

Distributions

At first glance it looks like the environmental variables (temp, atemp, humidity and windspeed) are close to normally distributed, whereas all the ridership variables (casual, registered and count) are strongly right-skewed. Once we start plotting we’ll see this better.
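The summary table already hints at this: for count the mean (191.6) sits well above the median (145), while for temp the two are nearly equal (20.23 vs 20.50). A sample skewness statistic makes the comparison explicit. The helper below is a hypothetical, hand-rolled function shown on toy vectors, not part of the original analysis.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Toy vectors: one symmetric, one with a long right tail.
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 3, 50)

skewness(symmetric)     # zero for perfectly symmetric data
skewness(right_skewed)  # positive for a long right tail
```

Applied to the real columns, values near zero would support the "close to normal" reading, and clearly positive values would confirm the right skew.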

Basic Histogram

After transforming the long tailed data to get a better understanding of the ridership…

The transformed ridership appears unimodal with a rise as we move past 50 to the peak at around 175 riders after which we have a steady decline.
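The transformation behind this histogram is not stated. As a sketch, two common choices for long-tailed counts are the square root and log10. The example uses simulated counts as a stand-in for bikeShare$count; the simulation parameters are arbitrary assumptions.

```r
set.seed(1)
# Simulated right-skewed counts standing in for bikeShare$count (not real data).
count <- rpois(1000, lambda = rexp(1000, rate = 1 / 190)) + 1

# Histograms on the raw scale and on two transformed scales (computed, not drawn).
h_raw  <- hist(count,        breaks = 30, plot = FALSE)
h_sqrt <- hist(sqrt(count),  breaks = 30, plot = FALSE)
h_log  <- hist(log10(count), breaks = 30, plot = FALSE)

# Draw any of them with plot(), e.g. plot(h_sqrt, main = "sqrt(count)").
```

Because every hour in the data has at least one rider, log10 is defined for all rows without adding an offset.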

There are really three sections to this dataset: Time, Weather and Ridership. Since we’ve just reviewed ridership, let’s take a glance at the other two.

The spread across the dates seems fairly constant over the period of 719 days.

After overlaying the distribution line with the temperature, we concluded that the temperature and dates don’t really tell us much since the expected distribution appears parametric.

Which I guess is something - though not exactly interesting.