Scraping distance between zip codes in R

As part of my research, I need the distance between every pair of 341 car dealerships in Minnesota, North Dakota, South Dakota, Nebraska, Iowa, Wisconsin and Illinois. One way to do this would be to manually enter “zip code 1 to zip code 2” into Google and record the result. However, there are 341 × 340 / 2 = 57,970 pairs of dealerships, so doing this by hand would take far too long. My first attempt to automate the process was through Google. Google “55901 to 55904” (two zip codes in SE Minnesota) and you immediately get the roadway travel distance, 10.9 miles. The entire process could be automated for all zip code pairs since the URL is standardized. The URL for the Google search “55901 to 55904” is

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#safe=off&q=55901%20to%2055904
where you could substitute a new pair of zips for 55901 and 55904 on every iteration. So far so good. But when I looked at the page source (click “view source” in your browser), I noticed that “10.9” occurred nowhere. I suspect Google keeps its results out of the page source because it is, in essence, a massive web scraping engine itself. Searching for another site, I came across distancecheck.com. Its URL was similarly substitutable, and the distance was visible in the HTML source.
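
To make the substitution idea concrete, here is a minimal sketch of enumerating every zip code pair and building one URL per pair. It is not the code from the post: the three zip codes, the `example.com` address and the `ZIP1`/`ZIP2` placeholder tokens are all assumptions for illustration. Placeholder tokens are used instead of substituting directly for 55901 and 55904 because a direct substitution can go wrong when the first new zip happens to equal the second template zip.

```r
# Hypothetical inputs: three zip codes stand in for the 341 dealership zips.
zips <- c("55901", "55904", "55906")

# Every unordered pair, one row per pair.
pairs <- t(combn(zips, 2))

# Number of pairs for the full problem: 341 * 340 / 2.
n_pairs_full <- choose(341, 2)

# Hypothetical URL template; example.com is a placeholder, not a real
# distance site, and ZIP1/ZIP2 are tokens that cannot collide with digits.
template <- "http://example.com/distance?from=ZIP1&to=ZIP2"

build_url <- function(zip1, zip2, url = template) {
  url <- gsub("ZIP1", zip1, url, fixed = TRUE)
  gsub("ZIP2", zip2, url, fixed = TRUE)
}

# One URL per pair of zips.
urls <- mapply(build_url, pairs[, 1], pairs[, 2], USE.NAMES = FALSE)
print(urls)
```

With the real list of 341 zips in `zips`, `pairs` would have one row for each of the dealership pairs.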

[Image: zipscrape1 (screenshot of the R code)]

The above R code extracts the HTML line from distancecheck.com that contains the distance between zip1 and zip2. The gsub function substitutes one thing for another in a string. For example, the first gsub call finds 55901 in the URL and substitutes the new zip code (zip1) for it. The readLines function reads the HTML source at that URL into a character vector, one element per line. The next bit of code was needed because the line structure of the HTML changed as I iterated through the zip codes. That is, the distance was sometimes on line 145, but then would show up on line 163 on the next iteration, and then it would change again. The grepl function tests whether its first argument (a pattern) occurs in its second (a string). The pattern I used was always present in the HTML line that also contained the distance. The for and if statements therefore step through the lines and record the one that contains the distance.
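
The line search described above can be sketched as follows. This is an illustration under stated assumptions, not the original code: the page content and the marker text “driving distance” are invented, and `readLines` reads from a text connection here so the example runs without a network connection. On a real run you would pass the substituted URL to `readLines` instead.

```r
# Hypothetical page content standing in for the downloaded HTML.
page_text <- c(
  "<html><body>",
  "<h1>Distance check</h1>",
  "<p>The driving distance is <b>10.9 miles</b></p>",
  "</body></html>"
)

# readLines returns one character element per line of its input.
con <- textConnection(page_text)
html <- readLines(con)
close(con)

# Marker text assumed to appear only on the line that holds the distance.
marker <- "driving distance"

# Step through the lines and keep the first one that contains the marker.
find_distance_line <- function(html, marker) {
  line <- NA_character_
  for (i in seq_along(html)) {
    if (grepl(marker, html[i], fixed = TRUE)) {
      line <- html[i]
      break
    }
  }
  line
}

line <- find_distance_line(html, marker)
print(line)
```

Because the search is by content rather than by position, it does not matter whether the distance sits on line 145 or line 163. A vectorised alternative would be `grep(marker, html, fixed = TRUE, value = TRUE)[1]`.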

All that's left is to pull the distance from the HTML line extracted in the previous step.

[Image: zipscrape2 (screenshot of the R code)]

The above code extracts the distance from line. The strsplit function splits the string at pre, which is defined to be “miles”. You can split the string in whichever way you choose; this is just the first way that came to me. distance then records the second part of the split (the part that contains the distance). The distance is directly preceded by “>”, so “>” is subbed out for a space ” “. The string is then split again at ” “, which breaks it into a vector of words, and the distance is the last word in the first element of that split. The tail function returns the last element of a vector, here the last word (the last set of characters directly preceded by ” “). This is the reason “>” was subbed out for ” “: to create the breathing room that leaves the distance as a word of its own.
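
The steps above can be sketched as follows. The HTML line is a made-up example, so which piece of the first split holds the distance (the second piece here) is an assumption that depends on how the real page is written.

```r
# Hypothetical HTML line; the real line from the site will look different.
line <- "<p>Distance in miles: <b>10.9 miles</b></p>"

# Split at the word "miles"; the piece just before the second "miles"
# holds the number.
pre <- "miles"
pieces <- strsplit(line, pre, fixed = TRUE)[[1]]
distance <- pieces[2]                      # ": <b>10.9 "

# Replace ">" with a space so the number stands alone as a word.
distance <- gsub(">", " ", distance, fixed = TRUE)

# Split into words and keep the last one.
words <- strsplit(distance, " ", fixed = TRUE)[[1]]
distance <- as.numeric(tail(words, 1))
print(distance)
```

Converting with `as.numeric` at the end is an addition that makes the result usable in calculations; it returns NA with a warning if the last word is not a number, which is a cheap check that the page layout has not changed.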

And that's it. The code is very fast except for the readLines call, which has to download each page. Scaled up to all 57,970 dealer pairs, it required a few days of computing time on my machine. Regardless, automating the process was a lot easier than the alternative.
