This post documents my preliminary exploratory analysis of the data from Kaggle's bike sharing demand competition. Two years of hourly data are provided from the DC bike sharing system. The first 19 days of each month form the training set, while the remaining days of the month are held out as the test set. The goal is to predict hourly demand in the test set using the training set. The variables are:
- season: 1-4 for winter, spring, summer and fall
- holiday: Is the day a holiday?
- workingday: Is the day a working day?
- weather: 1-4, from best to worst
- atemp: "feels like" temperature
- casual: number of rentals by non-registered users
- registered: number of rentals by registered users
- count: total number of rentals (casual + registered)
First, the data is read into R and the datetime column is converted to a date-time format. This has many benefits, one being ease of plotting against time. Next, variables are added to encode month and week. Total rentals plotted at the hourly level are highly volatile (there are few rentals in the middle of the night), and aggregating by day or month removes much of this variation. Aggregation by day or month is done using the cut function on the datetime variable, with the breaks argument of cut set to whichever interval is needed. Next, three plots are created for total rentals vs. time at the hourly, daily and monthly level. See the following R code and plots.
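The steps in this paragraph can be sketched roughly as follows. This is a minimal illustration, not the code originally used for the post: it builds a small synthetic data frame in place of the competition file, reuses only the datetime and count column names, and the names train, daily and monthly are my own. The counts are random numbers, so the plots show the mechanics only.

```r
# Synthetic stand-in for the training data (made-up counts, assumption for illustration):
# hourly rows for the first part of two months
set.seed(1)
days  <- c(sprintf("2011-01-%02d", 1:19), sprintf("2011-02-%02d", 1:19))
stamp <- as.vector(outer(sprintf("%02d:00:00", 0:23), days,
                         function(h, d) paste(d, h)))
train <- data.frame(datetime = stamp,
                    count = rpois(length(stamp), lambda = 50),
                    stringsAsFactors = FALSE)

# Parse the timestamp; fixing tz keeps the result independent of the local machine
train$datetime <- as.POSIXct(train$datetime,
                             format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Bucket every hourly record into its day, week and month with cut()
train$day   <- cut(train$datetime, breaks = "day")
train$week  <- cut(train$datetime, breaks = "week")
train$month <- cut(train$datetime, breaks = "month")

# Total rentals per bucket
daily   <- aggregate(count ~ day, data = train, FUN = sum)
monthly <- aggregate(count ~ month, data = train, FUN = sum)

# Rentals vs. time at the hourly, daily and monthly level
plot(train$datetime, train$count, type = "l",
     xlab = "Hour", ylab = "Rentals")
plot(as.Date(as.character(daily$day)), daily$count, type = "l",
     xlab = "Day", ylab = "Rentals")
plot(as.Date(as.character(monthly$month)), monthly$count, type = "b",
     xlab = "Month", ylab = "Rentals")
```

Changing the breaks argument ("day", "week", "month") is all that is needed to switch the aggregation level.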
The hourly plot is probably the least useful because of all the variation. The daily plot is more informative, but it is still highly variable, and the 16 or so dips toward zero look odd. Applying tapply to count by day (summing total rentals for each day) shows that these dips occur on the last days of several months. It turns out that the 00:00:00 record is assigned to the last day of a month rather than to the first day of the following month. The only way I found to move it to the correct day is to add 60 minutes, turning 00:00:00 into 01:00:00. For some reason, adding anything less than 60 minutes doesn't change the date. The new plot is below.
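The daily check described above can be sketched with tapply on toy data. The second half of the sketch illustrates one possible cause of a midnight record landing on the previous day, and this is my assumption, not something the analysis above establishes: as.Date() applied to a POSIXct value takes the calendar date in the zone given by its tz argument, so a timestamp parsed in a zone one hour ahead of UTC and then converted with tz = "UTC" falls on the day before. The "Europe/London" zone, the dates and the counts are all hypothetical.

```r
# Toy data (made-up counts): 24 hourly records on 19 June plus the midnight record of 1 July.
# The "Europe/London" timezone is an assumption chosen purely for illustration.
stamp <- as.POSIXct(c(sprintf("2011-06-19 %02d:00:00", 0:23), "2011-07-01 00:00:00"),
                    format = "%Y-%m-%d %H:%M:%S", tz = "Europe/London")
count <- c(rep(10, 24), 3)

# Daily totals when the date is taken in UTC: local midnight in summer is 23:00 UTC
# on the previous day, so the 1 July record is counted under 30 June
by_utc <- tapply(count, format(as.Date(stamp, tz = "UTC")), sum)
print(by_utc)

# Daily totals when the date is taken in the timestamps' own zone:
# every record stays on its own calendar day
by_local <- tapply(count, format(as.Date(stamp, tz = "Europe/London")), sum)
print(by_local)

# With a one-hour offset, shifting by 59 minutes leaves the UTC date unchanged,
# while shifting by 60 minutes moves the record to the next day
shift59 <- format(as.Date(stamp[25] + 59 * 60, tz = "UTC"))
shift60 <- format(as.Date(stamp[25] + 60 * 60, tz = "UTC"))
print(c(shift59, shift60))
```

If something like this is the cause, passing the data's own timezone to as.Date() would be an alternative to shifting every timestamp by an hour.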
There are a few things that can be gleaned from this graph. First, there are seasonal oscillations in the number of rentals: warmer months have more rentals than the preceding winter months. Second, there is an upward trend in bike rentals overall. For example, the winter months of 2012 appear to have around the same average number of rentals as summer 2011. Something needs to be done about this trend. One option would be to remove the trend and seasonality effects. Another would be to try to explain the upward trend directly. For example, perhaps casual rentals can explain the growth? Casual renters may like the service and choose to register for greater future use.
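As a rough sketch of the first option, one simple way to remove trend and seasonality is to regress monthly totals on a linear time index and month-of-year dummies and then work with the residuals. This is only one of several possible approaches and not necessarily the one this analysis will end up using. The monthly totals below are synthetic (a made-up linear trend plus a summer-peaking yearly cycle), and the names trend, moy, total and fit are my own.

```r
# Synthetic monthly totals for 24 months (made-up numbers, not the competition data):
# a linear upward trend plus a yearly cycle that peaks in July, plus noise
set.seed(42)
trend <- 1:24
moy   <- factor(rep(month.abb, 2), levels = month.abb)  # month of year
total <- 20000 + 3000 * trend -
  15000 * cos(2 * pi * (trend - 1) / 12) + rnorm(24, sd = 2000)

# Linear trend plus one dummy per month of year
fit <- lm(total ~ trend + moy)

# Estimated month-over-month growth after accounting for seasonality
slope <- coef(fit)[["trend"]]
print(slope)

# Residuals are the totals with trend and seasonality removed
adjusted <- residuals(fit)
plot(trend, adjusted, type = "b",
     xlab = "Month index", ylab = "Rentals (detrended, deseasonalized)")
```

For the second option, the same regression could be fit separately to casual and registered totals to compare their trend coefficients.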
To be continued.