
Sometimes, you just have to roll your own…

It was Doors Open Toronto last weekend, and the city published the locations as open data: Doors Open Toronto 2013. I thought I’d try to geocode it after Richard suggested we take a look. OpenStreetMap has the Nominatim geocoder, which you can use freely as long as you accept restrictions on bulk queries.

As a good and lazy programmer, I first tried to find pre-built modules. Mistake #1; they weren’t up to snuff:

  • Perl’s Geo::Coder::Many::OSM would only read from OSM’s server. MapQuest run their own mirror as part of their great MapQuest Open Platform Web Services suite, and they have almost no limitation on query volume. OSM runs their operation on a shoestring, and too many queries gets you the disapproval face, or worse.
  • Python’s geopy gave spurious results amid copious whiny error messages.
    (Standard operating procedure for python, then… ☺)

So I rolled my own, using nowt but the Nominatim Search Service Developer’s Guide, and good old simple modules like URI::Escape, LWP::Simple, and JSON::XS. Much to my surprise, it worked!
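
For the curious, all the script ends up doing is a plain GET to MapQuest’s Nominatim endpoint with the query terms as URL parameters. Roughly like this (wrapped for readability; the street is just an example address, and the e-mail is the placeholder from the code below):

http://open.mapquestapi.com/nominatim/v1/search.php?format=json
    &street=200%20University%20Ave&city=Toronto&state=ON&country=Canada
    &addressdetails=0&limit=1&email=me%40mydomain.com
    &bounded=1&viewbox=-81.0%2C45.0%2C-77.0%2C41.0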

Much as I love XML, it’s a bit hard to read as a human, so I smashed the Doors Open data down to simple pipe-separated text: dot.txt. Here’s my code, ever so slightly specialized for searching in Toronto:

#!/usr/bin/perl -w
# geonom.pl - geocode pipe-separated addresses with nominatim
# created by scruss on 02013/05/28

use strict;
use URI::Escape;
use LWP::Simple;
use JSON::XS;

# the URL for OpenMapQuest's Nominatim service
use constant BASEURI => 'http://open.mapquestapi.com/nominatim/v1/search.php';

# read pipe-separated values from stdin
# two fields: Site Name, Street Address
while (<>) {
    chomp;
    my ( $name, $address ) = split( '\|', $_, 2 );
    my %query_hash = (
        format  => 'json',
        street  => cleanaddress($address),    # decruft address a bit
                                              # You'll want to change these ...
        city    => 'Toronto',                 # fixme
        state   => 'ON',                      # fixme
        country => 'Canada',                  # fixme
        addressdetails => 0,                  # just basic results
        limit          => 1,                  # only want first result
             # it's considered polite to put your e-mail address into the query
             # just so the server admins can get in touch with you
        email => 'me@mydomain.com',    # fixme

        # limit the results to a box (quite a bit) bigger than Toronto
        bounded => 1,
        viewbox => '-81.0,45.0,-77.0,41.0'    # left,top,right,bottom - fixme
    );

    # get the result from Nominatim, and decode it to a hashref
    my $json = get( join( '?', BASEURI, escape_hash(%query_hash) ) );
    next unless defined $json;    # skip this record if the HTTP request failed
    my $result = decode_json($json);
    if ( scalar(@$result) > 0 ) {             # if there is a result
        print join(
            '|',    # print result as pipe separated values
            $name, $address,
            $result->[0]->{lat},
            $result->[0]->{lon},
            $result->[0]->{display_name}
          ),
          "\n";
    }
    else {          # no result; just echo input
        print join( '|', $name, $address ), "\n";
    }
}
exit;

sub escape_hash {

    # turn a hash into escaped string key1=val1&key2=val2...
    my %hash = @_;
    my @pairs;
    for my $key ( keys %hash ) {
        push @pairs, join( "=", map { uri_escape($_) } $key, $hash{$key} );
    }
    return join( "&", @pairs );
}

sub cleanaddress {

    # try to clean up street addresses a bit
    # doesn't understand proper 'Unit-Number' Canadian addresses tho.
    local $_ = shift;    # was 'my $_', which newer perls no longer allow
    s/Unit.*//;     # shouldn't affect result
    s/Floor.*//;    # won't affect result
    s/\s+/ /g;      # remove extraneous whitespace
    s/ $//;
    s/^ //;
    return $_;
}
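
Running it is just a case of pointing it at the pipe-separated file; for example (the output filename here is made up):

perl geonom.pl dot.txt > dot_geocoded.txt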

It quickly became apparent that the addresses had been entered by hand, and weren’t going to geocode neatly. Here are some examples of the bad ones:

  • 200 University Ave St W — it’s an avenue, not a street, and it runs north-south
  • 2087 Davenport Road (Rear House) Rd Unit: Rear — too many rears
  • 21 Colonel Sameul Smith Park Dr — we can’t fix typos
  • 0 Construction Trailer:Lower Simcoe at Lakeshore Blvd — 0? Zero??? What are you, some kinda python programmer?

Curiously, some (like the address for Black Creek Pioneer Village) were right, but just not found. Since the source was open data, I put the right address into OpenStreetMap, so for next year, typos aside, we should be able to find more events.

Now, how accurate were the results? Well, you decide:


a simple geocoder for toronto

I’m going to use SpatiaLite and the Toronto One Address Repository to try some simple geocoding: given an address, spit out its real-world map coordinates. As it happens, the way the Toronto data is structured, this doesn’t really need any GIS functions at all, just some SQL queries. There are faster and better ways to code this, but I’m just showing you how to load up data and run simple queries.

SpatiaLite is my definition of magic. It’s an extension to the lovely SQLite database that allows you to work with spatial data – instead of selecting data within tables, you can select within polygons, or intersections with lines, or within a distance of a point.
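
As a taste of what that looks like in SQL, here’s a hedged sketch of a ‘within a distance of a point’ query, assuming the Toronto address table loaded later in this post (TCL3_ADDRESS_POINT, with its geometry in the default Geometry column, SRID 2019):

-- address points within 100 m of the Toronto Reference Library point found later in this post
select address, lf_name
  from TCL3_ADDRESS_POINT
 where PtDistWithin(Geometry, MakePoint(313923.0, 4836665.6, 2019), 100);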

I’m going to try to avoid having too many maps here, as maps are a snapshot of a particular view of a GIS at a certain time. Maps I can make; GIS is what I’m trying to learn.

So, download the data and load up SpatiaLite GUI. Here I’ve created a new database file, addresses.sqlite, and I’m all ready to load the shapefile.

Shapefiles are messy things, and are definitely glaikit. Firstly, they’re a misnomer; a shapefile is really a bunch of files which need to be kept together. They’re also a really old format; the main information store is actually a dBaseIII database. They also have rather dodgy ways of handling projection metadata. For all their shortcomings, no-one’s come up with anything better that people actually use.

Projection information is important, because the world is inconveniently unflat. If you think of a projected X-Y coordinate system as a graph paper Post-It note stuck to a globe, the grid squares depend on where you’ve decided to stick the note. Also, really only the tiny flat part that’s sticking to the globe closely approximates to real-world coordinates.

Thankfully, the EPSG had a handle on all this projection information (and, likely, Post-It notes). Rather than using proprietary metadata files, they have a catalogue of numbers that exactly identify map projections. SpatiaLite uses these Spatial Reference System Identifiers (SRIDs) to keep different projections lined up.

Toronto says its address data is in ‘MTM 3 Degree Zone 10, NAD27’. That’s not a SRID. You can list all the SRIDs that SpatiaLite knows with:

select * from spatial_ref_sys

which returns over 3500 results.

As we know there’s an MTM (Modified Transverse Mercator) and a 27 in the title, we can narrow things down:

select srid,ref_sys_name from spatial_ref_sys where ref_sys_name like '%MTM%' and ref_sys_name like '%27%'

The results are a bit more manageable:

srid ref_sys_name
2017 NAD27(76) / MTM zone 8
2018 NAD27(76) / MTM zone 9
2019 NAD27(76) / MTM zone 10
2020 NAD27(76) / MTM zone 11
2021 NAD27(76) / MTM zone 12
2022 NAD27(76) / MTM zone 13
2023 NAD27(76) / MTM zone 14
2024 NAD27(76) / MTM zone 15
2025 NAD27(76) / MTM zone 16
2026 NAD27(76) / MTM zone 17
32081 NAD27 / MTM zone 1
32082 NAD27 / MTM zone 2
32083 NAD27 / MTM zone 3
32084 NAD27 / MTM zone 4
32085 NAD27 / MTM zone 5
32086 NAD27 / MTM zone 6

So it looks like 2019 is our SRID. spatialreference.org maintains a handy guide to projections and SRIDs if you want the details behind that number. (Incidentally, Open Toronto seems to use two different projections for its data – the other is ‘UTM 6 Degree Zone 17N NAD27’, with a SRID of 26717.)
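
If you want to double-check either of those before loading anything, the same table will hand the records straight back:

select srid, ref_sys_name from spatial_ref_sys where srid in (2019, 26717);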

So let’s load it:

This might take a while, as there are over 500,000 points in this data set.

If you want to use this data along with more complex geographic queries, add a Spatial Index by right-clicking on the Geometry table and ‘Build Spatial Index’. This will take a while again, and make the database file quite huge (128MB on my machine).
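
If you’d rather do that from SQL than from the GUI menu, SpatiaLite exposes the same operation as a function; a quick sketch, assuming the table name used further down and the default Geometry column:

-- same effect as the GUI's 'Build Spatial Index' menu item
select CreateSpatialIndex('TCL3_ADDRESS_POINT', 'Geometry');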

Update: there’s a much quicker way of doing this without messing with invproj in this comment.

Now we’re ready to geocode. I was at the Toronto Reference Library today, which is at 789 Yonge Street. Let’s find that location:

select easting, northing, address, lf_name, name, fcode_desc from TCL3_ADDRESS_POINT where lf_name like 'yonge%' and address=789

which gives:

EASTING     NORTHING     ADDRESS  LF_NAME   NAME                        FCODE_DESC
313923.031  4836665.602  789      YONGE ST  Toronto Reference Library

(for most places, NAME and FCODE_DESC are blank.)

Ooooh … but those coordinates don’t look anything like the degrees we had yesterday. We have to convert back to unprojected decimal degrees with my old friend, proj. If we store the northing, easting and a label in a file, we can get the geographic coordinates with:

invproj -E -r -f "%.6f" +proj=tmerc +lat_0=0 +lon_0=-79.5 +k=0.9999 +x_0=304800 +y_0=0 +ellps=clrk66 +units=m +no_defs < file.txt
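
Here file.txt just holds northing, easting and a label, whitespace-separated, one point per line; for the library record above, that’s the single line:

4836665.602 313923.031 Library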

which gives us:

4836665.602000	313923.031000	-79.386869	43.671824	Library

Now that’s more like it: 43.671824°N, 79.386869°W. On a map, that’s:

Pretty close, eh?

Incidentally, I didn’t just magic up that weird invproj line. Most spatial databases use proj to convert between projections, and carry an extra column with the command line parameters. For our SRID of 2019, we can call it up with this:

select proj4text from spatial_ref_sys where srid=2019;

which returns:

+proj=tmerc +lat_0=0 +lon_0=-79.5 +k=0.9999 +x_0=304800 +y_0=0 +ellps=clrk66 +units=m +no_defs

Update: there’s a much better way of invoking invproj. It understands EPSG SRIDs, so we could have done:

invproj -E -r -f "%.6f" +init=epsg:2019 < file.txt
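
For what it’s worth, SpatiaLite can also do the reprojection inside the query itself, so you never have to leave SQL. A sketch, assuming the default Geometry column, and noting that 4326 is WGS 84 rather than NAD27 geographic, so the degrees may differ very slightly from the invproj output above:

select address, lf_name,
       X(Transform(Geometry, 4326)) as longitude,
       Y(Transform(Geometry, 4326)) as latitude
  from TCL3_ADDRESS_POINT
 where lf_name like 'yonge%' and address = 789;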