Investigate synchronous bulk geocoding #231
The time it takes for a single CPF worker to process a single group of 1,000 addresses might be a useful metric.
If order doesn't matter, the bulk address list submitter could issue multiple concurrent POST requests and just write the results out in the order they are returned from the addresses/bulk resource. This would allow multiple geocoder nodes to process requests from the same address list simultaneously, much like the batch geocoder. Responsibility for retrying a failed request now lies with the bulk address list submitter, not the geocoder itself.
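The submit-concurrently, write-in-completion-order pattern can be sketched with Python's `concurrent.futures`. Here `geocode_chunk` is a hypothetical stand-in for the actual POST to the addresses/bulk resource, not the real submitter code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def geocode_chunk(chunk):
    """Placeholder for a POST to the addresses/bulk resource.
    A real implementation would send `chunk` and return the parsed results."""
    return [f"result-for-{addr}" for addr in chunk]

def bulk_submit(addresses, chunk_size=1000, max_workers=4):
    """Issue concurrent bulk requests and collect results in the order
    they come back, not the order they were submitted."""
    chunks = [addresses[i:i + chunk_size]
              for i in range(0, len(addresses), chunk_size)]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(geocode_chunk, c) for c in chunks]
        for fut in as_completed(futures):  # completion order, not submission order
            results.extend(fut.result())
    return results
```

Because results arrive out of order, any client that needs the original ordering would have to re-sort afterwards, which is exactly why sequence numbers (discussed below in the thread) matter.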
As discussed with Chris H by phone this AM, add support for a starting sequence number parameter on the addresses/bulk resource. All match results will include a sequence number generated from this base number. This allows yourId to be optional and necessary only when it refers to a unique business client number.
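A minimal sketch of how the server might derive per-result sequence numbers from the supplied base. The parameter and field names (`starting_seq_num`, `seqNum`) are illustrative assumptions, not the actual API:

```python
def assign_sequence_numbers(results, starting_seq_num):
    """Attach a generated sequence number to each match result,
    counting up from the caller-supplied base number."""
    return [dict(r, seqNum=starting_seq_num + i) for i, r in enumerate(results)]
```

With this scheme, a submitter that sends chunks of 1,000 addresses would pass base numbers 0, 1000, 2000, and so on, so results can be re-ordered after the fact without requiring yourId on every input row.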
@cmhodgson Please proceed with this task.
In batch geocoding, the batch geocoder has a scheduler process which ensures fairness and freedom from resource starvation. In bulk geocoding, scheduling policies are no longer under exclusive control of the geocoder. The bulk geocoder must provide the following necessary conditions to enable fair and starvation-free scheduling in geocoder clients:
This unfortunately wastes resources because the bulk requests can never take advantage of idle online CPU resources, but it does allow us to take one more step that is likely necessary to prevent overloading of the bulk processors: putting a lower limit on the number of Tomcat threads on the bulk nodes, based on the expected utilization per thread (from testing). The submitter script must then also throttle its request rate when it experiences failed connection attempts (due to the server being "full"). Kong should be able to help direct requests to available servers, but at some point we may hit limits, at which point the submitter scripts should slow their rate of requests, first by reducing parallelism and then with an exponential back-off. This does mean that scheduling in overloaded situations would effectively be driven by the submitter code, so if someone wrote a pushier submitter, it might be able to game the system and get its work done sooner, at the cost of overloading the system further with additional (rejected) requests. Maybe Kong can help manage this problem to a degree.
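The exponential back-off half of that throttling could look like the following sketch. The `ConnectionError` trigger, retry count, and delay constants are assumptions for illustration, not the actual submitter behaviour:

```python
import random
import time

def send_with_backoff(send, request, max_retries=6, base_delay=0.5):
    """Retry a failed request with exponential back-off plus jitter.
    `send` is any callable that raises ConnectionError when the
    server refuses the connection (i.e., is "full")."""
    for attempt in range(max_retries):
        try:
            return send(request)
        except ConnectionError:
            # Sleep 0.5s, 1s, 2s, 4s, ... plus a little jitter, then retry.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError(f"server still full after {max_retries} retries")
```

In practice the submitter would first drop its worker count when it sees repeated failures, and only fall back to this per-request back-off once parallelism is already at its minimum.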
Allocating scheduling to an external client does need to be treated as a risk when we evaluate feasibility. As you say, the script we write should implement Ethernet-style retry. By keeping the max time period of the bulk geocoder short, at least we guard against run-to-completion scheduling! I agree that Kong could play a role here. We could have an access control group of gov API keys that Kong routes to a special pool of nodes that support longer max time periods.
In terms of compatibility with batch geocoder output, we could have a batchGeocoderCompatible parameter that outputs the same CSV schema as the batch geocoder. We could also add support for streetAddress output to cut down on download time. In streetAddress output, locality and provinceCode are returned separately and everything but locality and provinceCode is returned in streetAddress. This is to allow batch geocoder clients like MCFD to perform a locality proximity test to confirm match quality.
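A rough illustration of the proposed streetAddress split. The component field names here are purely hypothetical; the real geocoder schema may differ:

```python
def to_street_address(match):
    """Collapse address components into a single streetAddress field,
    leaving locality and provinceCode separate so clients such as MCFD
    can run a locality proximity test against their own records."""
    street = " ".join(
        str(match[k])
        for k in ("civicNumber", "streetName", "streetType")
        if match.get(k)
    )
    return {
        "streetAddress": street,
        "locality": match.get("locality"),
        "provinceCode": match.get("provinceCode"),
    }
```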
Graeme said: Some differences with our CPF
Some tests and observations
From: Chris Hodgson

I just shared some preliminary results from my new concurrent bulk geocoding submitter, in a Google sheet. The test data was the version of the ATP that I had handy, with 501 geocodes in it, testing against our RRI dev server. The code consists of a geocoder loop, in which each thread reads GEOCODES_PER_REQUEST rows from the input CSV, sends the request, waits for the result, and writes the results to the file. There are MAX_WORKERS of these loops running in parallel.

Note that there are locks on both reading the input and writing the output, so some time is spent waiting to acquire these locks, during which no request is out to the server for that thread; this is why the apparent efficiency doesn't always match the request efficiency. This was more apparent with larger request sizes and more concurrency: the 4x100 case is actually slower than the 4x10 case because the proportion of time spent waiting to acquire the locks is relatively high.

Based on these results I would say that 10-20 geocodes per request is the right number. It seems to be twice as efficient in terms of request time per geocode as doing a single request at a time, and going up to 100/request only improves that somewhat. It also keeps the requests small and short. We will have to run similar tests in your prod environment, but I don't expect that to change too much. The real test there will be to see how much server load is generated when turning up the concurrency. I still have to make some changes to the server-side code, but nothing that I expect to affect performance. Will have some commits and PRs later.

Cheers,
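The loop described here (worker threads sharing a read lock on the input and a write lock on the output) might be sketched as follows. The `reader`, `writer`, and `geocode` callables are stand-ins, not the actual script:

```python
import threading

GEOCODES_PER_REQUEST = 10
MAX_WORKERS = 4

def run_submitter(reader, writer, geocode):
    """Each worker repeatedly takes a small slice of the input under a
    read lock, geocodes it (the request is in flight with no lock held),
    and appends results under a write lock."""
    read_lock, write_lock = threading.Lock(), threading.Lock()

    def worker():
        while True:
            with read_lock:  # only one thread may consume input rows at a time
                rows = [r for _, r in zip(range(GEOCODES_PER_REQUEST), reader)]
            if not rows:
                return
            results = geocode(rows)  # the slow part: one bulk request round trip
            with write_lock:         # serialize writes to the shared output
                writer.extend(results)

    threads = [threading.Thread(target=worker) for _ in range(MAX_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The lock contention Chris observed follows directly from this shape: as the per-request batch grows, each thread holds the read and write locks longer relative to its request time, which is consistent with 4x100 underperforming 4x10.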
I've created a feature branch: It is based on the
@cmhodgson Has the feature been merged and deployed? Thanks! |
There are two parts to this.

The first is the /geocoder/batch resource (also referred to as "bulk", above). It is commented out in the main geocoder controller here ("Instant batch"): https://github.com/bcgov/ols-geocoder/blob/main/ols-geocoder-web/src/main/java/ca/bc/gov/ols/geocoder/rest/controllers/GeocoderController.java Other supporting code is in: https://github.com/bcgov/ols-geocoder/tree/main/ols-geocoder-web/src/main/java/ca/bc/gov/ols/geocoder/rest/batch I think it only supports CSV input and output at the moment, but that can be added to. It also doesn't yet support the various parameters. I haven't tested it in a while (certainly not with the Java 17 and Spring Boot 3 changes), but I don't expect it is too far from ready to go if you were to uncomment the one controller method. There is a form that was used to test this API: https://bcgov.github.io/ols-devkit/bulk/

The second part is the client-side "concurrent bulk geocoding submitter", a Python script designed to basically abuse the batch API by submitting many simultaneous small batches and then putting the results together at the end. It is here:
Thanks Chris! I'm planning new CPF software; the first part alone is perfect.
The objective of this task is to investigate the feasibility of synchronous bulk geocoding compared to asynchronous batch geocoding. In synchronous bulk geocoding, the online geocoder is extended with a new resource named addresses/bulk.
This resource implements a POST request that takes a short list of addresses, geocodes them, and returns the list of geocoded results in the response. This is a stateless request; no jobs are created or managed. Very large address lists are handled by a bulk address list submitter script, written in Python, that reads in the source address list, issues a series of addresses/bulk requests, and outputs the geocoded results to a single file.
Performance tests should be conducted with various settings of address list size and time budget. For example, set global config parameters maxAddressListSize = 100 and timeBudget = 20 (seconds) and run address lists of sizes ranging from 1,000 to 10,000,000. In a single request, the bulk geocoder would only do as many addresses as can be completed within the timeBudget, to a maximum of 100 addresses. The timeBudget is critical because there are various server timeouts in the entire round-trip processing that must be honoured, or the hard work of the bulk geocoder would be lost to oblivion.
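A sketch of the server-side budget check under those example settings. The function names and the shape of the response (results plus a completed count the submitter can use to resubmit the remainder) are assumptions based on the description above, not the actual implementation:

```python
import time

MAX_ADDRESS_LIST_SIZE = 100   # example global config: maxAddressListSize
TIME_BUDGET_SECONDS = 20      # example global config: timeBudget

def process_bulk(addresses, geocode_one, time_budget=TIME_BUDGET_SECONDS):
    """Geocode as many addresses as fit within the time budget, up to
    the configured maximum list size, and report how many completed so
    the submitter knows what to resubmit."""
    deadline = time.monotonic() + time_budget
    results = []
    for addr in addresses[:MAX_ADDRESS_LIST_SIZE]:
        if time.monotonic() >= deadline:
            break  # stop and respond before any upstream server timeout fires
        results.append(geocode_one(addr))
    return results, len(results)
```

Returning a partial result before the deadline, rather than risking a gateway timeout, is what keeps the geocoder's work from being "lost to oblivion".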
The objective of this work is not delivery of production code, just a determination of the feasibility of bulk geocoding. The bulk geocoder need only support CSV and the Geographic projection (EPSG:4326).
Deliverables should include the addresses/bulk resource, the bulk address list submitter script, a determination of the feasibility of bulk geocoding, comparative timings between bulk and batch geocoding, and suggestions for implementing production versions of both the bulk geocoder and the bulk address list submitter.