The project is built as an Nginx/OpenResty/Lua app that implements a postal address lookup API as part of a bigger infrastructure (which provides databases for validating the entered address "city", "state", "postal_code", etc.). The API also depends on the availability of key/value datastore(s) with lat/long records for each valid street address.
Additional configuration and routing metadata has to be maintained to provide information about the topology of the DBs and key/value datastores.
The address lookup algorithm includes two major steps:
- street address parsing and validation.
- street address lookups for each valid address option in the key/value datastore(s) with lat/long.
For the address parsing and validation part of the algorithm the API performs multiple steps:
- the original (input) address should be parsed via libpostal. All parts of the address will be converted to lowercase.
- a simple normalization of the original address parsing results should be performed.
- then the original address should be expanded into multiple address options via libpostal.
- each address option should be validated via the "places" and "streets" classes (these classes should use catalogs of valid "places" and "streets" names, which could be stored in multiple databases). These classes could rewrite some of the address attributes with corresponding normalized values (for example, "state": "new york" to "state": "ny").
- also, the "places" class should extract the corresponding "routing_tag" for each valid address option (which should be used to direct an address lookup to the proper "lat/long" datastore). This "routing_tag" should be stored along with all other information inside the "places" database. For the US this tag is usually the same as the zip code.
- the set of processed address options should be checked for uniqueness (since some repeating addresses could be generated by the validation process).
- the set of processed address options should be sorted according to each option's "weight" (the "weight" could be calculated from the occurrence of common "road_types" and common "road_directions" inside the "road" attribute of the parsed address).
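The uniqueness check and the weight-based sort described in the last two steps could be sketched roughly as follows. This is only an illustration in plain Lua: the `ROAD_TYPES` and `ROAD_DIRECTIONS` catalogs, the "one point per matching word" scoring, and the helper names are assumptions, not the project's actual implementation.

```lua
-- Hypothetical catalogs; the real lists are not given in this document.
local ROAD_TYPES = { street = true, avenue = true, road = true, boulevard = true }
local ROAD_DIRECTIONS = { north = true, south = true, east = true, west = true }

-- Assumed scoring: one point per word of "road" that is a known type or direction.
local function road_weight(road)
    local weight = 0
    for word in road:gmatch("%S+") do
        if ROAD_TYPES[word] or ROAD_DIRECTIONS[word] then
            weight = weight + 1
        end
    end
    return weight
end

-- Build a string that identifies an option; used only to detect duplicates.
local function option_signature(opt)
    return table.concat({
        opt.house_number or "", opt.road or "", opt.city or "",
        opt.state or "", opt.postcode or "",
    }, "|")
end

-- Drop repeated options, assign weights, and sort by weight (highest first).
local function dedupe_and_sort(options)
    local seen, unique = {}, {}
    for _, opt in ipairs(options) do
        local sig = option_signature(opt)
        if not seen[sig] then
            seen[sig] = true
            opt.weight = road_weight(opt.road or "")
            unique[#unique + 1] = opt
        end
    end
    table.sort(unique, function(a, b) return a.weight > b.weight end)
    return unique
end

local options = dedupe_and_sort({
    { house_number = "119", road = "w 24 st", city = "new york", state = "ny" },
    { house_number = "119", road = "west 24 street", city = "new york", state = "ny" },
    { house_number = "119", road = "west 24 street", city = "new york", state = "ny" },
})
print(#options, options[1].road, options[1].weight)
```

Under these assumptions the fully spelled-out option ("west 24 street") ends up at the top of the set, because both "west" and "street" are found in the catalogs.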
For example, by the end of this process the original address:
119 w 24th st, NYC, New York
will be expanded into an address options set where the top of the set will include the address:
119 west 24 street new york ny 10001
and there will be a "routing_tag" attribute (which will correspond to the 10001 zip code).
For each address option the API will construct a "key" (using "house_number", "road" and "city") to perform the street address lookup. A number of simultaneous requests will be sent to multiple key/value datastores (according to the corresponding "routing_tags").
By the end of "lookup" processing the API will return the address with a valid "lat/long" and the highest "weight" within the address options set.
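As an illustration of the "key" construction, here is a minimal sketch in plain Lua. The document only states that the key uses "house_number", "road" and "city"; the concrete format (lowercased, with everything except letters and digits removed) and the function name `make_lookup_key` are assumptions.

```lua
-- Build a datastore key from an address option.
-- Assumed format: parts joined, lowercased, non-alphanumeric characters removed.
local function make_lookup_key(opt)
    if not (opt.house_number and opt.road and opt.city) then
        return nil, "missing key part"
    end
    local joined = table.concat({ opt.house_number, opt.road, opt.city }, " ")
    -- Parentheses drop gsub's second return value (the substitution count).
    return (joined:lower():gsub("[^%w]", ""))
end

local key = make_lookup_key({
    house_number = "119", road = "west 24 street", city = "new york",
})
print(key)
```

Returning `nil` plus an error string for incomplete options lets the caller skip options that cannot be looked up instead of sending a malformed key to a datastore.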
On an old laptop (converted to a Linux machine - Ubuntu 16.04):
- Nginx was configured to run 1 worker process (uses 1 CPU core to process all requests)
- running ab on the same machine to simulate 50 concurrent connections:
Server Software: openresty/1.11.2.2
Server Hostname: 127.0.0.1
Server Port: 8085
Document Path: /api/v0.1/address_lookup/119+west+24+street+New+York+NY
Document Length: 102 bytes
Concurrency Level: 50
Time taken for tests: 45.468 seconds
Complete requests: 100000
Failed requests: 0
Keep-Alive requests: 99950
Total transferred: 68499750 bytes
HTML transferred: 10200000 bytes
Requests per second: 2199.37 [#/sec] (mean)
Time per request: 22.734 [ms] (mean)
Time per request: 0.455 [ms] (mean, across all concurrent requests)
Transfer rate: 1471.25 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 5
Processing: 2 23 1.1 23 69
Waiting: 2 23 1.1 23 69
Total: 7 23 1.1 23 70
Percentage of the requests served within a certain time (ms)
50% 23
66% 23
75% 23
80% 23
90% 23
95% 23
98% 23
99% 24
100% 70 (longest request)
expects a URL that ends with "../424+South+Maple+Ave+Basking+Ridge+NJ+07920?language=en&country=us" (could also use %20 for spaces or any other common URL encoding)
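To make the encoding concrete, the following plain-Lua sketch decodes such a path segment back into an address string. The helper name `decode_address` is hypothetical; inside OpenResty the built-in `ngx.unescape_uri` would normally be the more natural choice for this kind of decoding.

```lua
-- Decode a URL-encoded address segment: "+" becomes a space,
-- and "%XX" hex escapes become the corresponding byte.
local function decode_address(segment)
    local s = segment:gsub("%+", " ")
    s = s:gsub("%%(%x%x)", function(hex)
        return string.char(tonumber(hex, 16))
    end)
    return s
end

print(decode_address("424+South+Maple+Ave+Basking+Ridge+NJ+07920"))
print(decode_address("119%20west%2024%20street"))
```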
"language" and "country" query parameters are optional
expects a JSON body with an array of JSON objects:
[
{ "address": "424 South Maple Ave Basking Ridge NJ 07920", "language": "en", "country": "us" },
{ "address": "119 w 24th st, New York, NY", "language": "en" }
]
"language" and "country" attributes are optional
expects a JSON object with the following attributes:
{
"address": "424 South Maple Ave Basking Ridge NJ 07920",
"language": "en",
"country": "us",
"lat": 40.6863623,
"lng": -74.53439190000002
}
"language", "country", "lat", "lng" attributes are optional
The address is expected to be free of typos and it should include all essential parts and the postal code
expects a URL that ends with "../424+South+Maple+Ave+Basking+Ridge+NJ+07920?language=en&country=us" (could also use %20 for spaces or any other common URL encoding)
"language" and "country" query parameters are optional
The address is expected to be free of typos and it should include all essential parts and the postal code
"make_XXXXX" methods are expected to be invoked from the Nginx request execution context.
The methods which implement "lookup", "insert" and "delete" logic could be invoked from tests and CLI (batch processing tools).
The "lookup" method could run multiple "green" IO threads (controlled by the Nginx "events" loop) to execute a fast asynchronous lookup into key/value datastores to get lat/long information for each valid street address that is returned by the "address" class (invoking the "lookup_address_option()" method of the "address" class).
This class implements address parsing and expands a received address into multiple address options (via libpostal). Each address option gets validated by the "places" and "streets" classes (which will also extract a "routing_tag" to be used for the address lookup). As part of the validation process for each address option, the "state" attribute could get normalized to a common format (for example, for the US a "short" format: "NJ", "NY"). Some other data normalization and formatting could be performed.
Only validated and unique address options are stored to be used for the lat/long lookup. The expansion, validation and normalization process could produce a number of repeating address options; a reduction algorithm will be used to remove them.
Since all valid address options are assigned a "weight" (based on the correctness of the provided information), the address with the highest "weight" and valid lat/long information will be returned.
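The final selection rule (highest "weight" among options that actually received lat/long) could look roughly like this in plain Lua. The function name `pick_best_option` is hypothetical, and the sample weights are made up for illustration; the coordinates are the ones used elsewhere in this document.

```lua
-- Return the option with the highest weight among those that have lat/lng.
-- Options without coordinates (failed lookups) are ignored.
local function pick_best_option(address_options)
    local best
    for _, opt in ipairs(address_options) do
        if opt.lat and opt.lng and (not best or opt.weight > best.weight) then
            best = opt
        end
    end
    return best
end

local best = pick_best_option({
    { road = "south maple avenue", weight = 2 },  -- lookup found nothing
    { road = "south maple ave", weight = 1, lat = 40.6863623, lng = -74.53439190000002 },
    { road = "s maple ave", weight = 0, lat = 40.6863623, lng = -74.53439190000002 },
})
print(best and best.road)
```

Note that the heaviest option is skipped here because it has no coordinates, so the function does not rely on the list already being sorted.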
Data structures:
{
address = {
country, (optional)
city, (required)
city_district, (optional)
state, (required)
postcode, (to be detected during address expansion)
road, (required)
house, (optional - "Empire State Building" for example)
house_number, (required if it is not a well known "house")
lat, (to be detected during address lookup)
lng (to be detected during address lookup)
},
address_options = { (sorted by weight)
{
country,
city,
city_district,
state,
postcode, (to be detected during address expansion if not provided)
road,
house,
house_number,
weight, (to be calculated during address expansion)
addresses_db_routing_tag, (required to perform address lookup - constructed from "state", "city" and/or "postal_code")
lat, (to be detected during address lookup)
lng (to be detected during address lookup)
},
...
{
...
}
}
}
The routing metatable for address lookups is a part of the API's configuration. It could be stored remotely (in Git) and API instances could load it on demand (decoding JSON into lookup_routing_table data structures).
This table links a lookup DB's routing tag with a lookup "driver" and its configuration parameters.
Only the Nginx Shared Dictionaries driver is implemented at this moment. Drivers for Redis and/or Elasticsearch, etc. could be implemented in the future.
Format:
{
routing_tag1 = {
"driver" = "nginx_shared_lookup",
"config" = {
"shared_dictionary_name" = "newjersey07920"
}
},
...
routing_tagN = {
"driver" = "nginx_shared_lookup",
"config" = {
"shared_dictionary_name" = "manhattannewyorkcity10001"
}
}
}
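To show how such a table might be consumed, here is a sketch of a driver dispatch in plain Lua. The routing tags ("07920", "10001"), the `drivers` registry, the `routed_lookup` function, the key format and the "lat,lng" string value format are all assumptions. When run outside OpenResty the sketch falls back to an in-memory stand-in for `ngx.shared`.

```lua
-- Hypothetical routing table using zip codes as routing tags.
local lookup_routing_table = {
    ["07920"] = {
        driver = "nginx_shared_lookup",
        config = { shared_dictionary_name = "newjersey07920" },
    },
    ["10001"] = {
        driver = "nginx_shared_lookup",
        config = { shared_dictionary_name = "manhattannewyorkcity10001" },
    },
}

-- Stand-in for an Nginx shared dictionary so the sketch runs outside OpenResty.
local function fake_dict(records)
    return { get = function(_, key) return records[key] end }
end

local fake_shared = {
    newjersey07920 = fake_dict({
        -- Key and value formats are assumptions.
        ["424southmapleavebaskingridge"] = "40.6863623,-74.53439190000002",
    }),
    manhattannewyorkcity10001 = fake_dict({}),
}

-- Driver registry: driver name -> function(config, key).
local drivers = {
    nginx_shared_lookup = function(config, key)
        -- Use real shared dictionaries when running under OpenResty.
        local dicts = (ngx and ngx.shared) or fake_shared
        local dict = dicts[config.shared_dictionary_name]
        if not dict then
            return nil, "unknown shared dictionary"
        end
        return dict:get(key)
    end,
}

-- Resolve the routing tag to a driver and run the lookup.
local function routed_lookup(routing_tag, key)
    local route = lookup_routing_table[routing_tag]
    if not route then
        return nil, "no route for tag"
    end
    local driver = drivers[route.driver]
    if not driver then
        return nil, "unknown driver"
    end
    return driver(route.config, key)
end

print(routed_lookup("07920", "424southmapleavebaskingridge"))
```

Keeping drivers in a registry keyed by name means a future Redis or Elasticsearch driver would only need a new entry in `drivers`, with no change to `routed_lookup`.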
This class implements the mapping of "city", "city_district", "state", "country", and "house" (a well-known building name) information to "postal_code" and "routing_tag" (this tag could be used to direct requests to a particular key/value store for fast lookup of individual street addresses).
This class should communicate with the database and could execute SQL queries on several DB tables:
- if the input is a USA address (default) then it should query individual "states" tables for information about particular "city" and/or "city_district" names (the table should maintain a collection of all valid names);
- if the previous search has failed (and/or the input has no "city" or "state" information) then it should query a "common_names" table, sending multiple queries for "house" (a well-known building name), then for "city_district", then for "city" and "state" (in case of international addresses where the input did not include "country");
- if the input is an international address then the input should include "country", so the "city"/"state" search could be directed to a proper database (or micro-service via REST API requests).
Each "states" database table should include the following columns:
| "city name" | "city alternative name" | "state name" | "postal_code" | "routing_tag" |
If "routing_tag" is empty it is assumed to be equal to "postal_code".
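The fallback rule for an empty "routing_tag" is simple enough to show in a few lines of plain Lua. The row shape (a table with `postal_code` and `routing_tag` fields) and the function name `effective_routing_tag` are assumptions about how a query result might be represented.

```lua
-- Return the routing tag for a "states" table row,
-- falling back to the postal code when the tag is missing or empty.
local function effective_routing_tag(row)
    local tag = row.routing_tag
    if tag == nil or tag == "" then
        return row.postal_code
    end
    return tag
end

print(effective_routing_tag({ postal_code = "07920", routing_tag = "" }))
print(effective_routing_tag({ postal_code = "10001", routing_tag = "manhattan" }))
```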
The "common_names" database table should include the following columns:
| "common_name" | "place_type" | "city name" | "state name" | "country name" | "postal_code" | "routing_tag" |
where "place_type" should be one of { "house", "city", "city_district", "state", "country" }.
In case of international addresses (and "country" records) the "routing_tag" should include all information that is necessary to execute queries against corresponding database.
The routing metatable for database queries is a part of the API's configuration. It could be stored remotely (in Git) and API instances could load it on demand (decoding JSON into lookup_routing_table data structures).
This table links lookup DB's routing tag with lookup "driver" and its configuration parameters.
Since the "places" class could extract multiple "postal_code" results for a given "city", "city_district" or "state", an additional reduction of these results could be required. That could be achieved by performing a database lookup for a street name (the "road" attribute of the address) inside the "city" or "state" table. This lookup should return a list of valid "postal_codes" for the given street name.
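One way to read this reduction is as an intersection of two lists: the postal codes found for the place and the postal codes valid for the street. A minimal plain-Lua sketch follows; the function name `reduce_postal_codes` is hypothetical and the sample postal codes are illustrative only, not taken from a real catalog.

```lua
-- Keep only the place's postal codes that are also valid for the street.
-- The order of place_codes is preserved.
local function reduce_postal_codes(place_codes, street_codes)
    local allowed = {}
    for _, code in ipairs(street_codes) do
        allowed[code] = true
    end
    local reduced = {}
    for _, code in ipairs(place_codes) do
        if allowed[code] then
            reduced[#reduced + 1] = code
        end
    end
    return reduced
end

-- Illustrative values only.
local reduced = reduce_postal_codes(
    { "10001", "10002", "10003" },  -- codes returned for the city
    { "10001", "10011" }            -- codes returned for the street
)
print(table.concat(reduced, ","))
```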
The routing metatable for database lookups is a part of the API's configuration. It could be stored remotely (in Git) and API instances could load it on demand (decoding JSON into lookup_routing_table data structures).
This table links lookup DB's routing tag with lookup "driver" and its configuration parameters.
Now, the most challenging part of the project would be handling incomplete or partial addresses and/or addresses with typos. After a short Google search I found libpostal ( https://github.com/openvenues/libpostal ), which looks like a very good solution for this problem (good read: https://mapzen.com/blog/inside-libpostal/ ; I definitely would need to spend more time reading about it, but on the surface it is based on the most complete and up-to-date set of data).
There are multiple language bindings for this library (including my favorite, Lua: https://github.com/bungle/lua-resty-postal ). Or it can be used directly from a handler written in C (an Nginx or Apache module).
libpostal.parse_address( "119 west 24th street new york ny" )
returns table:
{
state = "ny",
house_number = "119",
city = "new york",
road = "west 24th street",
}
or libpostal.parse_address( "119 w 24th str, Manhattan New York, NY" )
returns table:
{
city = "new york",
state = "ny",
road = "west 24 street",
city_district = "manhattan",
house_number = "119",
}
libpostal.expand_address( "119 w 24th str, New York, NY" )
returns an iterator which yields strings:
119 w 24th street new york ny
119 w 24th street new york new york
119 west 24th street new york new york
119 west 24th street new york ny