Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate synchronous bulk geocoding #231

Open
mraross opened this issue Mar 4, 2021 · 15 comments · Fixed by #239
Open

Investigate synchronous bulk geocoding #231

mraross opened this issue Mar 4, 2021 · 15 comments · Fixed by #239

Comments

@mraross
Copy link

mraross commented Mar 4, 2021

The objective of this task is to investigate the feasibility of synchronous bulk geocoding compared to asynchronous batch geocoding. In synchronized bulk geocoding, the online geocoder is extended with a new resource named:

 addresses/bulk

This resource implements a POST request that takes a short list of addresses, geocodes them, and returns the list of geocoded results in the response. This is a stateless request; no jobs are created or managed. Handling of very large address lists is handled through an bulk address list submitter script that is written in python and reads in the source address list, issues a series of addresses/bulk requests, and outputs the geocoded results to a single file.

Performance tests should be conducted with various settings of address list size and time budget. For example, set global config parms maxAddressListSize = 100 and timeBudget=20 (seconds) and run address lists of sizes ranging from 1000 to 10 000 000. In a single request, the bulk geocoder would only do as many addresses as can be completed within the timeBudget and to a maximum of 100 addresses. The timebudget is critical as there are various server timeouts in the entire round-trip processing that must be honoured or the hard work of the bulk geocoder would be lost to oblivion.

The objective of this work is not to have delivery of production code, just determination of the feasibility of bulk geocoding. The bulk geocoder need only support CSV and Geographics projection (4326).

Deliverables should include address/bulk resource, bulk address list submitter script, determination of feasibility of bulk geocoding, comparative timings between bulk and batch geocoding, and suggestions for implementing production versions of both the bulk geocoder and the bulk address list submitter.

@mraross mraross changed the title Investigate synchronise bulk geocoding Investigate synchronised bulk geocoding Mar 4, 2021
@mraross
Copy link
Author

mraross commented Mar 5, 2021

The time it takes for a single CPF worker to process a single group of 1,000 addresses might be a useful metric.

@mraross
Copy link
Author

mraross commented Mar 5, 2021

If order doesn't matter, the bulk address list submitter could issue multiple, concurrent POST requests and just write the results out in the order they are returned from the addresses/bulk resource. This would allow multiple geocoder nodes to process requests from the same address list simultaneously, much like the batch geocoder. Responsibility for retry of a failed request now lies with the bulk address list submitter, not the geocoder itself.

@mraross mraross added this to the Geocoder 4.2 milestone Mar 5, 2021
@mraross
Copy link
Author

mraross commented Mar 5, 2021

As discussed with Chris H by phone this AM, add support for a starting sequence number parameter on addresses/ bulk resource. All match results will include a sequence number generated from this base number. This allows yourId to be optional and necessary only when it refers to a unique business client number.

@mraross
Copy link
Author

mraross commented Mar 5, 2021

@cmhodgson Please proceed with this task.

@mraross
Copy link
Author

mraross commented Mar 7, 2021

In batch geocoding, the batch geocoder has a scheduler process which ensures fairness and freedom from resource starvation. In bulk geocoding, scheduling policies are no longer under exclusive control of the geocoder.  The bulk geocoder must provide the following necessary conditions to enable fair and starvation-free scheduling in geocoder clients: 

  1. Bulk geocoder nodes must be kept in a separate node pool from online geocoder nodes. Otherwise, bulk geocoding requests can easily take over all nodes and starve online requests:

  2. Bulk geocoder nodes must all have a relatively short maximum processing time (e.g., just a few seconds) to allow for fair scheduling.

@cmhodgson
Copy link
Collaborator

This unfortunately wastes resources because the bulk requests can never take advantage of idle online CPU resources, but it does allow us to take one more step that is likely necessary to prevent overloading of the bulk processors - putting a lower limit on the number of tomcat threads on the bulk nodes, based on the expected utilization per thread (from testing). Then the submitter script must also throttle its request rate when it experiences failed connection attempts (due to the server being "full"). Kong should be able to help direct the requests to available servers but at some point we may hit limits. At which point the submitter scripts should slow the rate of requests, first by reducing parallelism, and then with an exponential back-off.

This does mean that the scheduling in overloaded situations would effectively be driven by the submitter code, meaning that if someone wrote a pushier submitter, it might be able to game the system and get its work done sooner, at the cost of overloading the system further with additional (rejected) requests. Maybe Kong can help manage this problem to a degree.

@mraross
Copy link
Author

mraross commented Mar 8, 2021

Allocating scheduling to an external client does need to be treated as a risk when we evaluate feasibility. As you say, the script we write should implement Ethernet-type retry. By keeping the max time period of the bulk geocoder short, at least we guard against run-to-completion scheduling!

I agree that Kong could play a role here. We could have an access control group of gov api keys that kong routes to a special pool of nodes that support longer max time periods.

@mraross
Copy link
Author

mraross commented Mar 8, 2021

In terms of compatibility with batch geocoder output, we could have a batchGeooderCompatible parameter that outputs the same CSV schema as the batch geocoder.

We could also add support for streetAddress output to cut down on download time. In streetAddress output, locality and provinceCode are returned separately and everything but locality and provinceCode is returned in streetAddress. This is to allow batch geocoder clients like MCFD to perform a locality proximity test to confirm match quality.

@mraross
Copy link
Author

mraross commented Mar 8, 2021

Graeme said:

Some differences with our CPF

  • single "location" column e.g. "SRID=4326;POINT(-123.1750334 49.2530889)" vs 3 in bulk: "X","Y","srsCode" e.g. -123.1655578,49.2694856,4326
  • sid column at the end of CPF records (I don't recall why that is there -- pid links?) that is not in bulk

Some tests and observations

  • I put in a 10k request of fairly standardized Vancouver addresses which finished in under 2min
  • I put in 4 simultaneous 10k requests and the cpu usage was only about 1/3 so it could easily be cranked up more to 7+ concurrent requests
  • I tried a trickier HealthIdeas dataset -- Brian's 10k PSRS sample. In previous testing using gold CPF, that dataset clocked in around 2min (and I recall with our test settings pegged the CPU pretty high). Unfortunately I didn't time the bulk run but it took way longer than the Vancouver addresses (maybe around 15 min). If that file were split up into 7 equal sized chunks and sent concurrently that would roughly match the CPF timings and its CPU usage.
  • results start streaming to a browser download file almost immediately after submitting, and continuing to update in seconds-long chunks -- timeout limitations may not even be an issue if this keeps the interaction live
  • in a production setup with more nodes, the ability to handle many more mini-jobs at once, and a managing framework, the performance may very end up be on par with CPF -- that is something to put in the test plan

@mraross mraross changed the title Investigate synchronised bulk geocoding Investigate synchronous bulk geocoding Mar 17, 2021
@mraross
Copy link
Author

mraross commented Mar 17, 2021

  1. In the addresses/bulk resource, change maxRequests and maxTime parms to not exceed global config values of same.

  2. Deliver geocoder with addresses/bulk resource to a feature branch that Dave will set up for this purpose. That way, we can determine the worthiness of bulk geocoding before committing it to the main branch (trunk?).

  3. In global config parms, set maxRequests default to 1 and maxTime default to some short value that a single address can be geocoded in (e.g., 10 seconds?).

  4. Deliver Bulk Geocoder Submission form and Bulk Geocoder Submitter script to ols-devkit

@mraross
Copy link
Author

mraross commented Mar 18, 2021

From: Chris Hodgson
Sent: March 17, 2021 3:19 PM
To: Ross, Michael RA CITZ:EX [email protected]; Kelsey, Brian CITZ:EX [email protected]; Kelsey, David M CITZ:EX [email protected]; Leeming, Graeme CITZ:IN [email protected]
Subject: Bulk Geocoding submitter testing

I just shared some preliminary results from my new concurrent bulk geocoding submitter, in a google sheet. The test data was the version of the ATP that I had handy with 501 geocodes in it, testing against our RRI dev server.

The code consists of a geocoder loop, in which each thread reads GEOCODES_PER_REQUEST rows from the input CSV, sends the request, waits for the result, and writes the results to the file. There are MAX_WORKERS of these loops running in parallel. Note that there are locks on both reading the input and writing the output, so there will be some time spent waiting to acquire these locks - which is time that no request will be out to the server for that thread, which is why the apparent efficiency doesn't always match the request efficiency. This was more apparent with larger request size and more concurrency, specifically the 4x100 case is actually slower than the 4x10 case, this is because the proportion of time spent waiting to acquire the locks is relatively high.

Based on these results I would say that 10-20 geocodes per request is the right number. It seems to be twice as efficient in terms of request time per geocode as doing a single request at a time, and going up to 100/request only improves that somewhat. This also keeps the requests small and short. We will have to run similar tests in your prod environment, but I don't expect that to change too much. The real test there will be to see how much server load is generated when turning up the concurrency.

I still have to make some changes to the server side code, but nothing that I expect to effect performance. Will have some commits and PRs later.

Cheers,
Chris

@dkelsey
Copy link
Collaborator

dkelsey commented Mar 19, 2021

I've crated a feature branch:
bulk-geocoding-231

It is based on the dev branch

@cmhodgson cmhodgson linked a pull request Mar 26, 2021 that will close this issue
@mraross mraross removed this from the Geocoder 4.2 milestone Jun 29, 2021
@3rdmike
Copy link
Contributor

3rdmike commented Jul 13, 2023

@cmhodgson Has the feature been merged and deployed? Thanks!

@cmhodgson
Copy link
Collaborator

There are two parts to this. The first is the /geocoder/batch resource (also referred to as "bulk", above). It is commented out in the main geocoder controller here ("Instant batch"): https://github.com/bcgov/ols-geocoder/blob/main/ols-geocoder-web/src/main/java/ca/bc/gov/ols/geocoder/rest/controllers/GeocoderController.java

Other supporting code is in: https://github.com/bcgov/ols-geocoder/tree/main/ols-geocoder-web/src/main/java/ca/bc/gov/ols/geocoder/rest/batch

I think it only supports csv input and output at the moment but that can be added to. Also doesn't yet support the various parameters. I haven't tested it in a while (certainly not with the java 17 and spring boot 3 changes), but I don't expect it is too far from ready to go if you were to uncomment the one controller method.

There is this form that was used to test this API: https://bcgov.github.io/ols-devkit/bulk/

The second part is the client side "concurrent bulk geocoding submitter" which is a python script designed to basically abuse the batch API by submitting many simultaneous small batches and hen putting the results together at the end. It is here:
https://github.com/bcgov/ols-devkit/blob/gh-pages/bulk/bulk_submitter.py

@3rdmike
Copy link
Contributor

3rdmike commented Jul 14, 2023

Thanks Chris! I'm planning a new CPF software the first part alone is perfect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants