Discontinue old CA scraping #537

Open
stucka opened this issue Aug 10, 2023 · 1 comment

stucka commented Aug 10, 2023

The CA scraper is parsing PDFs from 2015 and, not surprisingly, is the slowest-running scraper of the bunch.

chriszs commented Sep 13, 2023

I wonder whether:

  1. The scraper could be sped up or cleaned up.
  2. There could be a way to archive the data from older years so it is retained without our having to continually re-scrape it. There's precedent for hosting a static spreadsheet file somewhere, perhaps on BigLocalNews, which the scraper just pulls and integrates. That data doesn't change much, but archival data is still good to have. The states with the oldest data go back to when the WARN Act first took effect in 1989. I started thinking of completeness in terms of not just states but also years (I was shooting for seven years of coverage, based on what seemed achievable for historical comparison) and people (because all 50 states isn't possible, you'll want to be able to say it covers 9X% of the U.S. population), as well as percentage of overall job loss.
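
To make the archive idea in point 2 concrete, here is a rough sketch of how a scraper might combine a frozen archive with a fresh scrape. It is not the project's actual code: the function names, the `notice_date` column, the ISO date format, and the cutoff year are all assumptions for illustration, and the step that downloads the archive file from wherever it ends up hosted is left out.

```python
import csv
import io


def parse_rows(csv_text):
    """Parse CSV text into a list of dicts, one per notice."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def merge_archive(archive_rows, fresh_rows, cutoff_year, date_field="notice_date"):
    """Combine frozen archive rows with freshly scraped rows.

    Rows dated before cutoff_year are taken only from the archive;
    rows dated cutoff_year or later are taken only from the fresh scrape,
    so the old PDFs never need to be parsed again.
    """
    def year_of(row):
        # Assumption: dates are ISO formatted, e.g. "2015-03-02".
        return int(row[date_field][:4])

    old = [row for row in archive_rows if year_of(row) < cutoff_year]
    new = [row for row in fresh_rows if year_of(row) >= cutoff_year]
    return old + new


# Hypothetical sample data standing in for the hosted archive and a live scrape.
archive_csv = "company,notice_date\nAcme,2015-03-02\nStale Co,2020-01-15\n"
fresh_csv = "company,notice_date\nFresh Co,2020-01-15\nNew Co,2023-08-01\n"

merged = merge_archive(parse_rows(archive_csv), parse_rows(fresh_csv), cutoff_year=2019)
```

With a split like this, the slow PDF parsing would only have to run once to build the archive, and the regular scrape would cover just the years from the cutoff onward.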
