Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetch command #77

Closed
jqnatividad opened this issue Oct 12, 2021 · 20 comments
Closed

fetch command #77

jqnatividad opened this issue Oct 12, 2021 · 20 comments
Assignees
Labels
enhancement New feature or request. Once marked with this label, its in the backlog.

Comments

@jqnatividad
Copy link
Collaborator

jqnatividad commented Oct 12, 2021

fetch will allow qsv to fetch HTML or data from web pages or services, to enrich a CSV (e.g. geocoding, wikidata api, etc.)

It will support authentication, concurrent requests, thresholds, etc.

Reminiscent of OpenRefine's fetch url... (https://docs.openrefine.org/manual/columnediting#add-column-by-fetching-urls) , but optimized for the command line.

@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, its in the backlog. label Oct 12, 2021
@jqnatividad jqnatividad changed the title apifetch command fetch command Oct 13, 2021
@mhuang74
Copy link
Contributor

I would like to take a crack at this. Please let me know if you have specifics in mind already.

@jqnatividad
Copy link
Collaborator Author

jqnatividad commented Nov 30, 2021

Hi @mhuang74 , the thing I like about Docopt, is that the helptext becomes an API contract definition of sorts.

Let's start by drafting the helptext first, and once we lock that down, you can then work on the code implementing it.

Can you take a first pass at the helptext?

Also, we need to think about the underlying crates to use - should we use reqwest, actix or awc?

As their underlying capabilities will enable the options we use with fetch.

@mhuang74
Copy link
Contributor

mhuang74 commented Dec 3, 2021

Thanks @jqnatividad. Here's my first pass.

Fetch values via an URL column, and optionally store them in a new column.

This command fetches HTML/data from web pages or web services for every row in the URL column, 
and optionally stores them in a new column.

URL column must contain full and valid URL path, which can be constructed via the 'lua' command.

Usage:
    qsv fetch [options] <column> [<input>]

fetch options:
    -c, --new-column <name>    Put the fetched values in a new column instead.

Common options:
    -h, --help                 Display this message
    -o, --output <file>        Write output to <file> instead of stdout.
    -n, --no-headers           When set, the first row will not be interpreted
                               as headers. Namely, it will be sorted with the rest
                               of the rows. Otherwise, the first row will always
                               appear as the header row in the output.
    -d, --delimiter <arg>      The field delimiter for reading CSV data.
                               Must be a single character. (default: ,)
    -q, --quiet                Don't show progress bars.

@jqnatividad
Copy link
Collaborator Author

jqnatividad commented Dec 3, 2021

Great @mhuang74 !

Now we start thinking about the implementation, which will ultimately result in additional options we need to expose on the CLI.

The first one is throttling, as qsv cannot just hit an endpoint without it, as most APIs have rate limits, and your IP can even be blacklisted if your query can easily mistaken for a DOS attack. There should be a default, conservative throttle delay (say 5000 ms like OpenRefine) that the user can override.

Also, what user agent should qsv present? It should definitely have a default one that says qsv, but the user should be presented with an option to use an alternate user agent string.

Another one is response caching. You may want to check out cached. I used it with apply geocode and it doubled performance just by prefixing the fn with a cached! macro 1-liner (before rustfmt, that is).

https://github.com/jqnatividad/qsv/blob/ead871b5dd65610e3cff25135434d1c240a54b63/src/cmd/apply.rs#L563-L569

Also, there's some work being done to allow cached to leverage a REDIS instance (jaemk/cached#68), and to store the cache to disk (jaemk/cached#20), that we can take advantage of in the future.

Finally, cached supports multi-threading with its sync_writes option, so in a future fetch version, you can really have very fast fetch jobs.

As for the underlying HTTP client library to use - https://blog.logrocket.com/the-state-of-rust-http-clients/

@mhuang74
Copy link
Contributor

mhuang74 commented Dec 4, 2021

Thanks @jqnatividad for detailed feedback. Here's my 2nd pass.

  • The need for >1000 ms (1 sec) throttle delay seems to be fundamentally at odds with high number of concurrent requests. But in case some servers do allow higher burst rate limits, we probably want to support multi-thread off the bat.
  • Storing Cookie seems important for auth
  • On error defaults to returning blank, but can store error
  • Support response caching
  • HTTP Header can be fully controlled via file

And more importantly, since we are really looking at fairly low request/sec throughput due to server rate limits, choice of library would be more about practicality rather than performance.

Probably would just go with reqwest, since:

  • supports cookie storage
  • supports system proxies
  • supports HTTPS/TLS
  • and sits on top of production tested hyper crate
Fetch values via an URL column, and optionally store them in a new column.

This command fetches HTML/data from web pages or web services for every row in the URL column, 
and optionally stores them in a new column.

URL column must contain full and valid URL path, which can be constructed via the 'lua' command.

To set proxy, please set env var HTTP_PROXY and HTTPS_PROXY (eg export HTTPS_PROXY=socks5://127.0.0.1:1086)

Usage:
    qsv fetch [options] [<column>] [<input>]

fetch options:
    --new-column=<name>        Put the fetched values in a new column instead.
    --threads=<value>          Number of threads for concurrent requests.                 
    --throttle-delay=<ms>      Set delay between requests in milliseconds. Recommend 1000 ms or greater (default: 5000 ms).
    --http-header-file=<file>  File containing additional HTTP Request Headers. Useful for setting Authorization or overriding User Agent.
    --cache-responses          Cache HTTP Responses to increase throughput.
    --on-error-store-error     On error, store HTTP error instead of blank value.
    --store-and-send-cookies   Automatically store and send cookies. Useful for authenticated sessions.

Common options:
    -h, --help                 Display this message
    -o, --output <file>        Write output to <file> instead of stdout.
    -n, --no-headers           When set, the first row will not be interpreted
                               as headers. Namely, it will be sorted with the rest
                               of the rows. Otherwise, the first row will always
                               appear as the header row in the output.
    -d, --delimiter <arg>      The field delimiter for reading CSV data.
                               Must be a single character. (default: ,)
    -q, --quiet                Don't show progress bars.

@jqnatividad
Copy link
Collaborator Author

@mhuang74 I like your 2nd pass. Some thoughts:

  • The need for >1000 ms (1 sec) throttle delay seems to be fundamentally at odds with high number of concurrent requests. But in case some servers do allow higher burst rate limits, we probably want to support multi-thread off the bat.

I agree. We should bake in multi-threading. Performance and multi-threading are signature advantages of Rust, and we should leverage it while remaining cognizant of rate limits.

  • Storing Cookie seems important for auth
  • On error defaults to returning blank, but can store error
  • Support response caching
  • HTTP Header can be fully controlled via file

Agree on all the above. With all the files we need to keep around, we should consider creating a ~/.qsv directory where we can store all the files - cookies, headers, persistent cache, etc.

Probably would just go with reqwest, since:

  • supports cookie storage
  • supports system proxies
  • supports HTTPS/TLS
  • and sits on top of production tested hyper crate

And yes, reqwest seems to be the best choice.

As for the docopt usage help text:

  • I'd change threads to jobs to be consistent with the other multi-threaded qsv commands (frequency, split and stats).
  • I'd change http-header-file to just header (like curl's CLI), and consider having a default header (named default.header?) stored in ~/.qsv .
  • Also consider using less verbose options (throttle, cache, store-error and cookies)

Finally, you may want to look into RateLimit http header fields to bake in support for auto-adjusting rate limiters to rate-limited services.

There are some discussions about it on reqwest that's worth investigating.

@mhuang74
Copy link
Contributor

mhuang74 commented Dec 11, 2021

@jqnatividad As most API return JSON, this seems important to support. But I am struggling with:

  • JSON requires double quotes and commas, which means I cannot save it to CSV then run foreach with jq to parse out desired value
  • Believe CSV Encoding can encode comma as well. But jq can't handle it I don't think.
  • Also, foreach is not supported on Windows platform

Example JSON Response

)$ QSV_LOG_LEVEL=DEBUG ./target/debug/qsv fetch  3 --new-column Geo test.csv
Country,ZipCode,URL,Geo
US,90210,http://api.zippopotam.us/us/90210,"{""post code"": ""90210"", ""country"": ""United States"", ""country abbreviation"": ""US"", ""places"": [{""place name"": ""Beverly Hills"", ""longitude"": ""-118.4065"", ""state"": ""California"", ""state abbreviation"": ""CA"", ""latitude"": ""34.0901""}]}"
US,94105,http://api.zippopotam.us/us/94105,"{""post code"": ""94105"", ""country"": ""United States"", ""country abbreviation"": ""US"", ""places"": [{""place name"": ""San Francisco"", ""longitude"": ""-122.3892"", ""state"": ""California"", ""state abbreviation"": ""CA"", ""latitude"": ""37.7864""}]}"
US,92802,http://api.zippopotam.us/us/92802,"{""post code"": ""92802"", ""country"": ""United States"", ""country abbreviation"": ""US"", ""places"": [{""place name"": ""Anaheim"", ""longitude"": ""-117.9228"", ""state"": ""California"", ""state abbreviation"": ""CA"", ""latitude"": ""33.8085""}]}"

Questions:

  • should there be a better way to parse out desired value from JSON response than relying on foreach ? e.g. some kind of internal JSON parsing support.

@jqnatividad
Copy link
Collaborator Author

@mhuang74, here are some ideas for consideration:

  • what about saving the JSON response as escaped ndjson/JSONlines in the CSV column? In that way, you don't need to worry about commas, saving the response with escaped quotes and newline without breaking CSV parsing. Also, folks can leverage the existing jsonl qsv command downstream for further processing.
  • instead of foreach, use the lua command which works on Windows. Also, there are existing lua libraries for JSON parsing (check out how I used the date library in lua to compute quarter for a given date (https://github.com/jqnatividad/qsv/wiki/Cookbook#date-enrichment) that can be leveraged.
  • and yes, I agree there needs to be some internal support for JSON parsing. In addition to saving the JSON response as ndjson/jsonl, perhaps you can include another option to execute a json query before saving the result to the --new-column. I just found jql - it looks promising and can be used as a library.

@mhuang74
Copy link
Contributor

mhuang74 commented Dec 12, 2021

@jqnatividad

Thanks for feedback and suggestion. Directly supporting JSON parsing seems the way to go.

I was able to integrate with jql as a library.

$ QSV_LOG_LEVEL=DEBUG ./target/debug/qsv fetch  3 --new-column City --jql '."places"[0]."place name"' test.csv
Country,ZipCode,URL,City
US,90210,http://api.zippopotam.us/us/90210,Beverly Hills
US,94105,http://api.zippopotam.us/us/94105,San Francisco
US,92802,http://api.zippopotam.us/us/92802,Anaheim

$ QSV_LOG_LEVEL=DEBUG ./target/debug/qsv fetch  3 --jql '."places"[0]."place name"' test.csv
Beverly Hills
San Francisco
Anaheim

https://github.com/mhuang74/qsv/tree/fetch_poc

@jqnatividad
Copy link
Collaborator Author

Hi @mhuang74 !
I hope you don't mind but I went ahead and merged your changes even if fetch is not fully finished as I wanted to include it in today's release and mark it as beta, and perhaps, even as an optional feature, so other folks can test it and give us feedback.

Even in its current form, it's already very useful, and once we implement the other options, it'll be one of qsv's tentpole commands.

@mhuang74
Copy link
Contributor

@jqnatividad Okay, glad to hear it. Thanks for doing the merge and refactoring.

@jqnatividad
Copy link
Collaborator Author

@mhuang74 tweaked it a bit more as I'm planning to already use it in a pipeline I'm working on :)

And some additional thoughts on the other options:

  • I think we can get rid of the cache option. We have the function cache by default, and if folks want more persistent, flexible caching, they can take advantage of the HTTP_PROXY env var you pointed out in the fetch helptext (and use NGINX, Squid, etc.) as beyond simple proc level cache, that's well beyond the scope of qsv.
  • instead of throttle, use rate-limit using the approaches described in Async Rate Limiting seanmonstar/reqwest#491

@jqnatividad
Copy link
Collaborator Author

Hi @mhuang74 ,
Did some more tweaks here - #132

@mhuang74
Copy link
Contributor

mhuang74 commented Dec 17, 2021

Thanks @jqnatividad. Glad to see the feature quickly filling out.

Please let me know if you still need help with fetch rate-limiting.

I can also move to something less of a critical path for your projects since I only get to work on this over weekends.

@jqnatividad
Copy link
Collaborator Author

Thanks @mhuang74! If you can help with rate-limiting, that'd be awesome! I think getting headers & cookies working will also be great for authenticated sessions while you're at it.

And feel free to jump into anything in the backlog that piques your interest. #46 and #60 in particular would really bulk up qsv's schema validation capabilities.

@mhuang74
Copy link
Contributor

Sure. Here's my plan for fetch rate-limiting:

  • use actix-governor to simulate a simple JSON API service in Unit Test, including rate limits for bursts and requests-per-second
  • play with governor to do rate-limiting for fetch as well

@jqnatividad jqnatividad pinned this issue Dec 23, 2021
jqnatividad added a commit that referenced this issue Dec 30, 2021
Fetch: support rate limiting as per #77
@jqnatividad
Copy link
Collaborator Author

Hi @mhuang74 , fetch is really coming together!

I tweaked test_fetch.rs and added some convenience macros to make it easier to change the test port and ran rustfmt.

Once the --header parameter is done, I'll take off the WIP label. 🚀 💯 🎉

@mhuang74
Copy link
Contributor

mhuang74 commented Jan 7, 2022

Thanks @jqnatividad.

Regarding --header, I am planning to use Java Properties format via java-properties crate. It should be easy to copy-paste Headers from Chrome.

For integration test, I am having difficulties finding an api that requires key yet does not require my credit card (eg RapidAPI).

The only one I've found is api.data.gov but it also recommend not to share the key.

Thoughts?

X-Api-Key: DEMO_KEY 

@jqnatividad
Copy link
Collaborator Author

Thinking about it, --header can be confused with CSV headers.

Perhaps, we should rename it -httpheader?

As for integration testing, I googled this up - https://mixedanalytics.com/blog/list-actually-free-open-no-auth-needed-apis/.

After a quick scan of the list , JSONPlaceholder and HTTPBin looks promising.
nationalize.io is interesting too, as it returns rate-limit response headers too when your free API key exceeds the free daily quota.

Alternatively, perhaps you can parameterize the integration test that requires an API key and populate it from an environment variable, and if the envar is not set, the test is skipped. We just mention in the fetch documentation that they need to get their own API key.

@jqnatividad
Copy link
Collaborator Author

@mhuang74 with the 0.29 release, fetch is no longer WIP! 💯 🚀 🥳 🎉

I'm closing this issue, but you're welcome to keep improving fetch or take on any other qsv command as you wish.

@jqnatividad jqnatividad unpinned this issue Jan 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request. Once marked with this label, its in the backlog.
Projects
None yet
Development

No branches or pull requests

2 participants