Google directions vs Esri’s Network Analyst: Estimates of time and distance

When it comes to the accuracy of estimates of network distance and time between points, our guess is that Google is king. Along with millions of others, we trust Google to provide accurate directions. In many cases Google Maps is the perfect tool, but there are a few instances where it is not an option. In these instances we use Esri’s Network Analyst to compute network distances. To feel confident about the results, though, we decided to compare calculations from the two approaches.

For small public datasets, estimating distance and time with Google (via Python) is the way to go

For many of our projects we need to compute network distances between two locations. When the amount of data is small and there are no security concerns about the data, we’re happy to use Google, accessed through Python with the packages simplejson and urllib. You can see examples of functions that do this on Google’s site as well as on Stack Overflow and other sources.

For large datasets and/or those that cannot be exposed to the Internet we’ve tended to use Esri’s Network Analyst

Many of our projects involve both relatively large datasets (tens of thousands of records) and health data from scientific studies that cannot be exposed to the Internet for privacy reasons. In these settings we tend to rely on Esri’s Network Analyst extension. (Note that we did experiment with PostgreSQL/pgRouting but found it too difficult to work with in its current version.) With Network Analyst we create our own network datasets, and distance/time estimates are based on trip length and the speed limit. I’ll admit that Network Analyst is not our favorite tool, but it seems to get the job done. My sense has always been that these estimates would not be as accurate as Google’s, particularly given that we’re working with speed limits rather than the actual speeds that are likely included in Google’s models. But I’ve also assumed that the estimates would come close to Google’s and that they would, on average, be the same. Nevertheless, we’ve never tested this… until now.
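To make the speed-limit approach concrete, here is a minimal sketch of the arithmetic. It assumes each street link carries a length in miles and a posted speed limit in miles per hour; the function names, units and example route are our own illustration, not Network Analyst's API. The travel time for a route is just the sum of length divided by speed over its links:

```python
def link_minutes(length_miles, speed_limit_mph):
    """Minutes needed to traverse one street link at the posted speed limit."""
    if speed_limit_mph <= 0:
        raise ValueError("speed limit must be positive")
    return length_miles / float(speed_limit_mph) * 60.0


def route_minutes(links):
    """Total minutes for a route given as (length_miles, speed_limit_mph) pairs."""
    return sum(link_minutes(length, speed) for length, speed in links)


# A hypothetical three-link route: residential street, arterial, highway
route = [(0.5, 25), (2.0, 40), (6.0, 60)]
total = route_minutes(route)
```

Because every link is assumed to flow at its speed limit, an estimate like this ignores congestion, signals and turn delays, which is why we expected some gap from Google.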

Comparing Google and Network Analyst

The data

To make the comparison we took advantage of data we generated for a recent project for Emory University on racial and ethnic disparities in HIV prevalence. As part of this study, we computed population-weighted centroids for all ZIP codes in the study areas. For each of these population-weighted centroids (~1100) we needed to compute the distance and travel time to potentially thousands of drug treatment facilities in a city. (The drug treatment data was extracted from PDFs as described in this post.) Although neither the centroids nor the drug treatment facility locations are sensitive data, the number of calculations would be more than a million, far more than Google would permit (plus, using Google for these kinds of calculations without showing a Google map would likely be against their terms of service).

The network datasets

Esri provides a set of map data (called Data & Maps) with ArcGIS licenses. The data sets include a street link file for the full US with more than 30 million individual links. In the past this data was based on US Census Bureau TIGER line files, which could be woefully inaccurate, but in the last few years the street data has been based on much more accurate data from TeleAtlas. Although the data made available from Esri can be several years old, our research on this project focuses on urban areas where the road network is not likely to change enough to make a difference.

For creating the network dataset, I recommend looking at Esri’s tutorial, but in short we created network datasets for 27 major cities in the US. We used Python to clip the streets file to the city boundaries (based on the Census ‘Core Based Statistical Area’ (CBSA), buffered by five miles). Unfortunately, we could not find a way to create a new network dataset using Python (arcpy); please let us know if there is a way.

When creating the network datasets we mostly accepted ArcGIS’ default settings, with a few minor tweaks. The datasets are created by right-clicking on the street file in ArcCatalog.

The To/From point data

As we mentioned above, we computed population-weighted centroids for all ZIP codes (or, to be precise, for US Census Bureau ZCTAs) in our target urban areas. I won’t go into detail on how these were computed, but we used PostGIS and SQL and computed the centroids based on block-level population. For fun, we put the centroids along with the block data on a Stamen Watercolor base map:

[Figure: population-weighted centroids and block data on a Stamen Watercolor base map]

In total we computed approximately 1100 centroids (for the ZCTAs in the study).
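For readers who have not met population-weighted centroids before, here is a rough sketch of the idea in plain Python. To be clear, this is not the PostGIS/SQL we actually ran; the function name and the sample blocks are made up for illustration. Each block contributes its own centroid coordinates, weighted by its population:

```python
def population_weighted_centroid(blocks):
    """Return the (x, y) population-weighted centroid of a set of blocks.

    `blocks` is an iterable of (x, y, population) tuples, where x and y are
    the coordinates of each block's own centroid.
    """
    total_pop = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for x, y, pop in blocks:
        total_pop += pop
        sum_x += x * pop
        sum_y += y * pop
    if total_pop == 0:
        raise ValueError("total population is zero; centroid is undefined")
    return (sum_x / total_pop, sum_y / total_pop)


# Hypothetical blocks in one ZCTA: (x, y, population)
blocks = [(0.0, 0.0, 100), (10.0, 0.0, 300), (0.0, 10.0, 0)]
centroid = population_weighted_centroid(blocks)
```

The unpopulated block has no pull on the result, which is the point: the centroid lands where people live rather than in the geometric middle of the ZCTA. A simple average like this is best done in projected coordinates rather than raw latitude/longitude.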

Select sample pairs for the experiment

We are computing the time and distance from the population-weighted ZIP centroids (as an ‘average’ of where people live) to the drug treatment facilities. For this experiment we randomly chose 2500 centroid-facility pairs, and for each pair we computed distance and time using both Google and Network Analyst. The pairs are not completely random, in the sense that we only selected pairs where both the centroid and the drug treatment facility are in the same city. Here is a map of where the sample pairs are located:

[Figure: map of the sample pair locations]
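The same-city sampling described above can be sketched roughly as follows. This is an illustration under our own assumptions, not the script we ran: the function name, the (id, city) tuple layout, the fixed seed and the placeholder ids are all hypothetical.

```python
import random


def sample_same_city_pairs(centroids, facilities, n_pairs, seed=0):
    """Randomly pick (centroid_id, facility_id) pairs that share a city.

    `centroids` and `facilities` are lists of (id, city) tuples.
    """
    # Group facility ids by city for quick lookup
    facilities_by_city = {}
    for facility_id, city in facilities:
        facilities_by_city.setdefault(city, []).append(facility_id)

    # Enumerate every valid same-city pair, then sample without replacement
    candidates = [
        (centroid_id, facility_id)
        for centroid_id, city in centroids
        for facility_id in facilities_by_city.get(city, [])
    ]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, min(n_pairs, len(candidates)))


# Placeholder ids and city labels, for illustration only
centroids = [("z1", "city_a"), ("z2", "city_a"), ("z3", "city_b")]
facilities = [("f1", "city_a"), ("f2", "city_b"), ("f3", "city_b")]
pairs = sample_same_city_pairs(centroids, facilities, n_pairs=3)
```

Enumerating all candidate pairs first keeps the logic simple and is fine at our scale (a bit over a million pairs); for much larger inputs you would want to sample without building the full list.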

Computing distance/time via Google with Python

We used Python to help us compute distance/time using Google. In particular, we made use of the libraries urllib and simplejson to connect with Google and manage the JSON response, respectively. The code for the directions function is below. It makes use of a custom geocode function that we created using a similar approach (not shown):

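def directions(fromLoc, toLoc, needsGeocode=False, fullresponse=False):
    """Return driving time and distance between two (lat, lng) locations via Google."""
    import simplejson
    try:
        from urllib import urlopen          # Python 2
    except ImportError:
        from urllib.request import urlopen  # Python 3

    if needsGeocode:
        # geocode() is our custom function (not shown); as used here, its first
        # two items are the coordinates and its third is the location type
        fromLoc = geocode(fromLoc)
        toLoc = geocode(toLoc)
        print('The location types are ' + fromLoc[2] + ' and ' + toLoc[2])

    # Pass each location as "lat,lng" (no parentheses or spaces in the URL)
    origin = '{0},{1}'.format(fromLoc[0], fromLoc[1])
    destination = '{0},{1}'.format(toLoc[0], toLoc[1])
    url = "http://maps.googleapis.com/maps/api/distancematrix/json?origins={0}&destinations={1}&mode=driving&language=en-EN&sensor=false".format(origin, destination)
    theresult = simplejson.load(urlopen(url))
    if fullresponse:
        return theresult

    element = theresult['rows'][0]['elements'][0]
    driving_status = str(element['status'])
    driving_timeMinutes = driving_distKM = None
    if driving_status == 'OK':
        # Google reports seconds and meters; divide by floats so Python 2 does not truncate
        driving_timeMinutes = round(element['duration']['value'] / 60.0, 2)
        driving_distKM = round(element['distance']['value'] / 1000.0, 2)
    return {'driveTime': driving_timeMinutes, 'driveTimeUnit': 'minutes', 'driveDist': driving_distKM, 'driveDistUnit': 'kilometers', 'status': driving_status}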
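To run a function like this over a list of sample pairs, a small wrapper along the following lines can help. This is a sketch under our own assumptions rather than the exact script we used: `batch_directions`, the pause length, the coordinates and the `fake_lookup` stand-in (which lets the example run without hitting Google) are all hypothetical.

```python
import time


def batch_directions(pairs, lookup, pause_seconds=0.5):
    """Call `lookup(origin, destination)` for each pair and collect the results.

    `lookup` is any callable with the same inputs and outputs as the
    directions() function above. Failures are recorded, not raised, so one
    bad pair does not stop the whole run.
    """
    rows = []
    for i, (origin, destination) in enumerate(pairs):
        if i > 0:
            time.sleep(pause_seconds)  # pause between requests
        try:
            result = lookup(origin, destination)
        except Exception as err:  # e.g. network error or malformed response
            result = {'status': 'ERROR: ' + str(err)}
        rows.append({'origin': origin, 'destination': destination, 'result': result})
    return rows


def fake_lookup(origin, destination):
    # Stand-in for directions() so the sketch runs without network access
    if destination is None:
        raise ValueError("missing destination")
    return {'driveTime': 10.0, 'driveDist': 5.0, 'status': 'OK'}


# Made-up coordinates; the second pair is deliberately broken
sample_pairs = [((33.7, -84.4), (33.8, -84.3)), ((33.7, -84.4), None)]
rows = batch_directions(sample_pairs, fake_lookup, pause_seconds=0)
```

Swapping `fake_lookup` for the real `directions` function is all that changes for a live run; pick a pause that keeps you within whatever request limits apply to your account.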

Computing distance/time with ArcGIS

For ArcGIS we wrote a Python script to compute the distances. The script is long and detailed but here is some pseudo-code to give you an idea of what it looks like:

# PSEUDO-CODE!! (arguments omitted; not runnable as written)
import arcpy, csv, sys, os
from arcpy import env
from dbfpy import dbf

env.overwriteOutput = True
env.qualifiedFieldNames = False

# Check out any necessary licenses
arcpy.CheckOutExtension("Network")

# Process: Make OD Cost Matrix Layer
arcpy.MakeODCostMatrixLayer_na()  # create the OD cost matrix layer
arcpy.AddLocations_na()           # add origins
arcpy.AddLocations_na()           # add destinations
arcpy.Solve_na()                  # solve
arcpy.SelectData_management()     # grab the lines with directions
arcpy.CopyFeatures_management()   # copy lines to new shapefile

# If you want to save the entire layer file
#arcpy.SaveToLayerFile_management()

# Here we read in the saved DBF and cursor through the features
# and write to a CSV

Compare results and conclusions

Now we’re ready to see how the results compare. You can see below that for both time and distance there is a decent amount of spread/variation, but on the whole the results are very similar. For distance the correlation (Pearson) is 0.99 (intercept of 0.30, slope of 1.0). For time, the values are also strongly correlated (0.97). The slope is again near 1 (0.97). The intercept of -3.12 suggests that, on average, Network Analyst time estimates are slightly lower. In this particular study we only care about relative distances and times (is one location farther away than another location?), so the strong correlation between the two methods for both distance and time strongly suggests that using ArcGIS Network Analyst will give us accurate results.

[Figure: Esri vs. Google estimates of time and distance]
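If you want to run the same kind of comparison on your own paired estimates, the correlation, slope and intercept can be computed with a few lines of plain Python. This is a sketch, not the code behind the numbers above: the function name is ours, the sample values are invented, and putting Google on the x-axis and Network Analyst on the y-axis is an assumption made for the example.

```python
import math


def pearson_and_fit(x, y):
    """Return (r, slope, intercept) for the least-squares line y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / float(n)
    mean_y = sum(y) / float(n)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)      # Pearson correlation
    slope = sxy / sxx                   # least-squares slope
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


# Made-up travel times in minutes, for illustration only (not our study data)
google_minutes = [10.0, 20.0, 30.0, 40.0]
esri_minutes = [7.0, 17.0, 26.0, 37.0]
r, slope, intercept = pearson_and_fit(google_minutes, esri_minutes)
```

A slope near 1 with a negative intercept, as in this toy data, is the pattern you would expect when one method runs consistently a few minutes faster than the other while still ranking trips the same way.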

One response

  1. I stumbled on this posting and found it a nice read. I am a technical manager in a GIS shop in city government and I shared this post with staff and colleagues as an example of how to organize a data analysis project.

    Thanks for hosting this blog. I am interested in getting into R and already have a background in ArcGIS, Sql Server and python. I see several articles I’ll be consulting soon.

    Cheers.
