Hello. My blog has been moved to http://www.datextract.com.
The full post is located on Rpubs at
This paper explores Chrysler dealer trade data aggregated over 3.5 months, between Oct 2014 and Jan 2015. VINs were scraped biweekly from Chrysler dealer websites in Minnesota, North Dakota, South Dakota, Nebraska, Iowa, Wisconsin and Illinois. Raw inventory was tracked as individual vehicles moved from dealer to dealer through trade. The data gives a glimpse of the Chrysler dealer trade network. The trading of new vehicles between dealers is crucial to the functioning of this market, so questions about profitability and efficiency should take this network structure into account.
This paper explores one facet of the data: trade per unit inventory. Each dealer's trade volume is normalized by inventory size to enable comparisons across dealers of different sizes. One might expect the smallest dealers to trade the most relative to inventory size: the only way to satisfy demand with limited selection is to trade more per unit in stock. On the other hand, a smaller inventory makes it more difficult to trade. Limited selection also means distant dealers are unlikely to find their desired model at any one particular location very often, which makes it hard to develop trade relationships. The data provides some evidence of an inverted-U relationship between inventory size and trade per unit inventory, although a few dealers in the smallest group trade more relative to inventory size, and the overall relationship is noisy. Inventories are divided into 5 quintiles, from smallest to largest. The mean and median of the second quintile are lower than those of the third. The median of the first quintile is less than the second and third, but its mean is higher. Histograms show that trade per unit inventory in the smallest group (1st quintile) is highly right skewed, raising the mean: a few dealers in this group are able (or willing) to trade a lot per unit inventory. Without these few (and thus the skew), the mean of the first quintile would be less than the second and third. The mean and median of trade per unit inventory drop abruptly after the third quintile.
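The quintile comparison can be sketched in R roughly as follows. The `dealers` frame here is randomly generated stand-in data (the scraped inventories aren't reproduced in this post), and the column names are my own.

```r
# Hypothetical sketch of the quintile comparison; random stand-in data.
set.seed(1)
dealers <- data.frame(inventory = runif(100, 10, 500),  # average stock
                      trades    = rpois(100, 20))       # trade volume
dealers$trade_per_unit <- dealers$trades / dealers$inventory

# Cut inventories into 5 quintile groups, smallest to largest
dealers$quintile <- cut(dealers$inventory,
                        breaks = quantile(dealers$inventory, seq(0, 1, 0.2)),
                        include.lowest = TRUE, labels = 1:5)

# Compare centers of trade per unit inventory across groups
tapply(dealers$trade_per_unit, dealers$quintile, mean)
tapply(dealers$trade_per_unit, dealers$quintile, median)
```

Comparing both the mean and the median per group is what surfaces the right skew in the smallest group.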
The exploratory analysis raises several interesting theoretical questions. What is the optimal trade policy for smaller dealers to pursue? In the smallest inventory group, most dealers are unable or unwilling to trade relatively more often, but a few dealers do. Why is this histogram bimodal? If these dealers are unable to trade, in what ways can strategy limit the trade frictions caused by limited selection? Should small dealers pursue a few strong relationships or many one-off trades? Should they be willing to travel greater distances to trade than larger dealers?
The Kaggle bike sharing competition asks for hourly predictions on the test set, given the training data. The latter consists of the first 19 days of each month, while the test set is the 20th day to the end of each month. As a base model, I’ll just use linear regression. The model is
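In rough form, reconstructed from the variable list and discussion below (so the exact terms and symbols are my assumption, not the original equation):

```latex
\mathrm{count}_t = \beta_0 + \beta_1 \mathrm{time}_t
  + \beta_2 \mathrm{hour}_t + \beta_3 \mathrm{hour}_t^2
  + \beta_4 \mathrm{holiday}_t + \beta_5 \mathrm{workingday}_t
  + \gamma_2 \mathrm{season2}_t + \gamma_3 \mathrm{season3}_t + \gamma_4 \mathrm{season4}_t
  + \delta_2 \mathrm{weather2}_t + \delta_3 \mathrm{weather3}_t
  + \beta_6 \mathrm{atemp}_t + \beta_7 \mathrm{humidity}_t + \varepsilon_t
```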
Here time indicates the day (details later). Sorry for the horrible LaTeX formatting; I'm new to using LaTeX in WordPress and don't want to spend time right now figuring out how to align equations. The variables are
- season: Dummies for months 4-6 (2), months 7-9 (3) and months 10-12 (4); winter (1) is left out
- holiday: Is the day a holiday?
- working day: Is the day a working day?
- weather [1-4]: Dummy variables for weather
- 1: clear (left out)
- 2: mist, cloudy, broken clouds
- 3: light snow, light rain, thunderstorm
- 4: heavy rain, thunderstorm, snow and fog
- atemp: "feels like" temperature
- count: total rentals
There are several things to note in the model. First, I assume a linear deterministic time trend; in other words, the expected change in count from t to t+1 is constant for all t. There is no acceleration (or deceleration) in trend growth. This assumption is important for the theoretical implications of OLS (in particular consistency and asymptotic normality). Another model would take into account (and check for) stochastic time trends, but I don't do that here. Second, hour is included with its squared term. Why? It's reasonable to expect that the impact of hour of day on rentals starts low, increases toward midday and then decreases at night. The squared term allows for a parabola opening downward, which could depict this situation. Higher orders of hour could be included to fit more complex functional relationships; perhaps there are different peaks for morning, afternoon, evening, etc. This model uses the second order to capture the overall trend (again, I expect to see an inverted parabola in hours). Third, dummy variables are created for the season and weather variables. One category of each must be left out to avoid perfect multicollinearity.
I'll be extending the R code from the previous two posts on Kaggle bike sharing. First, create the time variable. I need to be a little careful here because of the missing data. Suppose I number each day in the training set, starting with January 1st 2011 as 1. January 19th would be 19, but then the 20th day is missing (as it is in the test set). February 1st should be 19 plus however many days are left in January. The following code creates this time variable and runs the regression stated above.
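A sketch of that code, using a synthetic stand-in for the Kaggle training frame so the chunk runs on its own (in the real post, `d` comes from `read.csv("train.csv")`; the column names match the competition, but the generated values are placeholders):

```r
# Synthetic stand-in for the Kaggle training data
set.seed(1)
dt <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "UTC"),
          by = "hour", length.out = 24 * 60)
n <- length(dt)
d <- data.frame(datetime   = dt,
                season     = factor(sample(1:4, n, replace = TRUE)),
                holiday    = rbinom(n, 1, 0.05),
                workingday = rbinom(n, 1, 0.7),
                weather    = factor(sample(1:3, n, replace = TRUE)),
                atemp      = runif(n, 0, 40),
                humidity   = runif(n, 20, 100),
                count      = rpois(n, 150))

d$hour <- as.numeric(format(d$datetime, "%H"))
# Day counter that keeps advancing over the missing test days:
# days elapsed since the first observation, plus one
d$time <- as.numeric(as.Date(d$datetime) - as.Date(d$datetime[1])) + 1

# Factors give dummy coding automatically, with level 1 left out
fit <- lm(count ~ time + hour + I(hour^2) + holiday + workingday +
            season + weather + atemp + humidity, data = d)
summary(fit)
```

Building `time` from the calendar date (rather than a running row index) is what keeps the missing test days counted in the trend.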
The regression output for this model is
The p-value for the test statistic is small enough that the model is not useless: formally, I can reject at the one percent level the hypothesis that all coefficients are simultaneously zero. Let's look at some of the coefficients for a minute. As expected, the hour variable forms an inverted parabola with maximum at hour 14.63 (shortly after 2:30 PM). In words, starting at midnight, each additional hour increases total demand until around 2:30 PM, whereas each hour after that decreases it. A higher "feels like" temperature corresponds to more total rentals, while higher humidity corresponds to fewer. It may have been interesting here to add a nonlinear (squared) "feels like" temperature variable, since too hot could correspond to fewer rentals (again we would expect an inverted parabola, this time in temperature).
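The peak hour falls out of the quadratic: the derivative of b1·hour + b2·hour² is zero at hour = −b1/(2·b2). The coefficient values below are made-up stand-ins chosen to reproduce the reported peak, not the actual estimates from the output:

```r
# Vertex of the fitted parabola in hour; stand-in coefficients
b1 <- 29.25   # hypothetical coefficient on hour
b2 <- -1.00   # hypothetical coefficient on hour^2
peak <- -b1 / (2 * b2)
peak  # 14.625, i.e. shortly after 2:30 PM
```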
The season and weather estimates give some counterintuitive results. First, all else equal, months 4-6 and 10-12 raise rentals relative to winter levels; months 7-9, however, decrease rentals. Intuitively I expect all of them to be positive. The second counterintuitive result is that, compared to the best weather, the worst weather reduces rentals by less than moderately bad weather does. Again, very hard to explain. Could the Kaggle coding for weather be wrong? The Kaggle-stated season indicators didn't match up, so maybe the weather indicators are off too.
The following code generates a predicted vs. observed plot aggregated by day.
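A sketch of that aggregation: sum observed and fitted counts by calendar day, then plot one against the other. A toy hourly frame and model stand in for the real `d` and `fit` so the chunk runs on its own.

```r
# Toy stand-in data and model
set.seed(2)
d <- data.frame(datetime = seq(as.POSIXct("2011-01-01", tz = "UTC"),
                               by = "hour", length.out = 24 * 30),
                count    = rpois(24 * 30, 150))
d$hour <- as.numeric(format(d$datetime, "%H"))
fit <- lm(count ~ hour + I(hour^2), data = d)

day  <- as.Date(d$datetime)
obs  <- tapply(d$count, day, sum)        # observed daily totals
pred <- tapply(fitted(fit), day, sum)    # predicted daily totals

plot(obs, pred, xlab = "Observed daily rentals",
     ylab = "Predicted daily rentals")
abline(0, 1)  # points on this line are perfect predictions
```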
And corresponding plot
This post continues the exploratory analysis of bike sharing data from Kaggle's competition. In the last post, I ended by trying to predict the general upward trend through time of total bike rentals. One idea I posted there was that the number of casual riders might be used to predict changes in later total rentals. For example, if the service is growing in popularity, people might start by renting as casual (non-registered) riders, grow to like the service, and eventually register themselves. The following code chunk generates the monthly plot of total, casual and registered rentals. The code that comes before it is located in the initial Kaggle post.
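A sketch of that chunk, with a toy stand-in frame carrying the same casual / registered / count columns as the Kaggle data:

```r
# Toy stand-in for the Kaggle frame
set.seed(3)
n <- 24 * 365
d <- data.frame(datetime   = seq(as.POSIXct("2011-01-01", tz = "UTC"),
                                 by = "hour", length.out = n),
                casual     = rpois(n, 30),
                registered = rpois(n, 120))
d$count <- d$casual + d$registered

# Aggregate each series to monthly totals
month <- cut(as.Date(d$datetime), breaks = "month")
tot <- tapply(d$count,      month, sum)
cas <- tapply(d$casual,     month, sum)
reg <- tapply(d$registered, month, sum)

plot(tot, type = "l", ylim = c(0, max(tot)), xlab = "Month", ylab = "Rentals")
lines(cas, lty = 2)
lines(reg, lty = 3)
legend("topleft", c("total", "casual", "registered"), lty = 1:3)
```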
The first notable trend in this plot is that casual rentals seem to be flat season over season; for example, winter months seem to have about the same average number of casual rentals regardless of year. The growth in total rentals is therefore driven by growth in registered rentals. The ideal data set would record the transition from casual to registered, but I don't have that. However, given that casual ridership is relatively flat across seasons, and if casual riders convert to registered riders at a roughly constant rate, I could expect continued trend growth in total rentals at the same rate. If, on the other hand, casual ridership were increasing season over season, one might expect total rentals to increase at an increasing rate.
This post documents my preliminary exploratory analysis of the data from Kaggle's bike sharing demand competition. Two years of hourly data are provided from the DC bike sharing system. The first 19 days of each month are included (training set) while the remaining days are left out (test set). The purpose is to predict hourly demand in the test set from the training set. The variables are
- season: 1-4 for winter, spring, summer and fall
- holiday: Is the day a holiday?
- working day: Is the day a working day?
- weather [1-4]: From best to worst
- atemp: "feels like" temperature
- casual: number of rentals by non-registered users
- registered: number of rentals by registered users
- count: total number of rentals (casual + registered)
First the data is read into R and the datetime is converted into date format. This has many benefits, one being ease of plotting against time. Next, variables are added to encode the month and day. Plotting total rentals against hourly time is highly volatile (not many rentals in the middle of the night), and aggregating by day/month removes much of this variation. Aggregation by day/month is done using the cut function on the datetime variable; set the breaks of cut to whichever interval is needed. Next, three plots are created for total rentals vs. time at the hourly, daily and monthly level. See the following R code and plots.
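A sketch of those steps, with a toy stand-in for train.csv (same datetime and count columns; the values are placeholders):

```r
# Toy stand-in: datetime arrives as character, as it would from read.csv
set.seed(4)
stamps <- seq(as.POSIXct("2011-01-01", tz = "UTC"),
              by = "hour", length.out = 24 * 90)
d <- data.frame(datetime = format(stamps, "%Y-%m-%d %H:%M:%S"),
                count    = rpois(24 * 90, 150))
# In the real post: d <- read.csv("train.csv")

d$datetime <- as.POSIXct(d$datetime, tz = "UTC")  # character -> date-time
d$month <- cut(d$datetime, breaks = "month")      # month bucket per row
d$day   <- cut(d$datetime, breaks = "DSTday")     # day bucket per row

daily   <- tapply(d$count, d$day,   sum)
monthly <- tapply(d$count, d$month, sum)

plot(d$datetime, d$count, type = "l")             # hourly: very noisy
plot(as.Date(names(daily)),   daily,   type = "l")
plot(as.Date(names(monthly)), monthly, type = "l")
```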
The hourly plot is probably the least useful because of all the variation. Check out the daily plot: it is still highly variable, and the 16 or so spikes down toward zero are a little weird. Using tapply to sum count by day shows that these spikes occur on the last days of several months. It turns out that the 00:00:00 observation is grouped with the last day of the month rather than the first day of the next month. The only way I found to push it into the next day is to add 60 minutes, turning 00:00:00 into 01:00:00; for some reason adding anything less than 60 minutes doesn't change the date. The new plot is below.
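The fix as described, on a toy pair of timestamps: shifting everything forward 60 minutes pushes the midnight observation onto the day it belongs to.

```r
# Shift timestamps forward an hour so 00:00:00 lands in the right day bucket
x <- as.POSIXct(c("2011-01-31 23:00:00", "2011-02-01 00:00:00"), tz = "UTC")
x <- x + 60 * 60   # add 60 minutes
format(x, "%Y-%m-%d %H:%M:%S")
# the 00:00:00 stamp is now "2011-02-01 01:00:00"
```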
There are a few things that can be gleaned from this graph. First, there are seasonal oscillations in the number of rentals: warmer months have more rentals than the preceding winter months. Second, there is an upward trend in bike rentals more generally; for example, the winter months of 2012 appear to have around the same average number of rentals as summer 2011. Something needs to be done with this trend. One option would be to eliminate the trend and seasonality effects. Another would be to try to explain the upward trend more generally. For example, perhaps casual rentals can explain the growth? Casual renters may like the service, choosing to register for greater future use.
To be continued.
The idea for this post came while thinking about wired.com's recent posts about Silicon Valley child vaccination rates. Basically, Wired looked at vaccination rates at several tech-company-sponsored day cares and found some low (as well as high) vaccination rates. Some of the companies with reportedly low vaccination rates blamed inaccuracies in the data. For more information see my post at
The data used for California child care and kindergarten vaccination rates is located at
A little background. The personal belief exemption (PBE) in California allows parents to not vaccinate their children if it violates their personal belief. On January 1st 2014, the law changed slightly so that parents could still claim a PBE, but only with the signature of a pediatrician. The permanent medical exemption (PME) exempts children for medical reasons.
The explanation of misreported data given by the Silicon Valley companies got me thinking: what fraction of the unvaccination rate is unexplained? PBEs and PMEs account for some unvaccination, but what about what's left over? I used kindergarten rather than child care facility data because the former has a longer time horizon. Children are required by law to be vaccinated upon entering kindergarten if they do not hold a PME or PBE. Perhaps, then, the unexplained unvaccination could be a proxy for mismeasurement? On the other hand, perhaps the regulation is somewhat lax and children without PMEs and PBEs are slipping through the cracks? I created the following graphic in R.
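The quantity in question is just the leftover after subtracting both exemption rates from the unvaccination rate. The numbers below are made-up stand-ins, not the California figures:

```r
# Unexplained unvaccination = unvaccinated share minus both exemption rates
unvaccinated <- 0.075  # hypothetical share not fully vaccinated
pbe          <- 0.030  # hypothetical personal belief exemption rate
pme          <- 0.010  # hypothetical permanent medical exemption rate
unexplained  <- unvaccinated - pbe - pme
unexplained
```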
The R code to clean the data and generate the plot is located at my GitHub account, napairolero.
There are several interesting things to note about the plot. First, the PME rate is relatively flat over time; children don't seem to be exempted for medical reasons any more or less than before. Second, the PBE rate drops sharply with the altered law in 2014; the required doctor's signature seems to be deterring its use. Next, take a look at the unexplained unvaccination rate. Recall that this is the part of unvaccination not explained by PME and PBE. It jumps sharply between 2011 and 2012, which the overall unvaccination rate mirrors. Why? It also increases in 2014. If the unexplained unvaccination rate just reflects measurement error, then this is not that interesting. However, if students are slipping through the cracks because of lax regulation, then perhaps some unvaccinated children moved out of PBE status and entered kindergarten without a PBE. If so, then perhaps CA has an enforcement issue and needs to do more than amend the PBE law.
On the other hand, if the unexplained unvaccination rate is just measurement error, then the accused Silicon Valley companies could point to it and argue that their vaccination levels are much higher than reported. Rather than looking at the unvaccination rate for these companies, wired.com should be looking at PBE and PME status at tech-company-sponsored child care facilities.
I read the article "Sickeningly low vaccination rates at Silicon Valley day cares" from wired.com this morning.
The author attempts to answer a very interesting question: how do some of the world's most renowned innovators from Silicon Valley stack up in the heated vaccination debate? The author uses vaccination data for children in tech-company-sponsored child care facilities to claim that "anti science, anti-vaccine thinking (exists) in one of the smartest regions on the Earth".
The first piece of evidence given to support this claim is: "of 12 day care facilities affiliated with tech companies, six – that's half – have below average vaccination rates, according to the state's data." (Their italics.) This does suggest that many Silicon Valley parents don't vaccinate their children. However, a much stronger argument could be made with another simple statistic.
Let me explain. Six of twelve simply provides evidence that Silicon Valley parents are probably a lot like the rest of us when it comes to vaccinating our children; one would expect roughly that by the very definition of average. Given the five numbers (1, 2, 3, 4, 5), the average is 3. If I pull a sample from a hat and report, in support of some claim, that half of it falls below 3, I haven't said much at all.
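The toy example in code: with the values 1 through 5 the average is 3, and by construction a chunk of the values sits below it.

```r
# By definition of the average, some values fall below it
x <- c(1, 2, 3, 4, 5)
mean(x)            # 3
mean(x < mean(x))  # 0.4: two of the five values are below the average
```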
The solution? Simply report the average of the sample. If it's below the national average, then you have some evidence to support the claim that Silicon Valley parents vaccinate their children less than the rest of us. In fact this would be a much more interesting claim than the one made by the article. Perhaps innovation and forward thinking thrive on unscientific thinking on occasion? The article gives the following graphic.
which definitely shows some low-end potential. However, without stating the national vaccination average, how can we really be sure about anything?
A better title might be "Silicon Valley parents might vaccinate their kids just like the rest of us". However, I understand why the author didn't take this route. Who would click? Again, there is a much more interesting story lying right there in the open: do Silicon Valley parents vaccinate their kids less than the rest of us? This counterintuitive piece of information would warrant further investigation into the minds that generate innovation.