brick directions #4

Open
tomlue opened this issue Oct 9, 2024 · 3 comments

@tomlue
Contributor

tomlue commented Oct 9, 2024

I think this brick could be completed in 2 stages:

01_get_open_access_pdfs.py
deps: none
outs: brick/open_alex_open_access_pdfs.parquet
A script that pulls all open-access PDF URLs from OpenAlex. It should be incremental, so that it can be rerun to pick up new records without redoing all of the earlier work. One option is to query OpenAlex by publication date, starting from the greatest publication date found among the PDFs already downloaded.
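A minimal sketch of the incremental query described above, not the actual stage 01 script. The helper names `latest_publication_date` and `build_query_url` are hypothetical. It assumes the OpenAlex works endpoint accepts the `is_oa` and `from_publication_date` filters together with cursor paging (`cursor=*`, `per-page` up to 200); check these against the current OpenAlex documentation before relying on them.

```python
from datetime import date
from urllib.parse import urlencode

# Assumption: the public OpenAlex works endpoint.
OPENALEX_WORKS = "https://api.openalex.org/works"


def latest_publication_date(dates):
    """Return the greatest ISO date (YYYY-MM-DD) seen so far, or None if there is none."""
    parsed = [date.fromisoformat(d) for d in dates if d]
    return max(parsed).isoformat() if parsed else None


def build_query_url(latest, cursor="*", per_page=200):
    """Build a works query limited to open-access records.

    When `latest` is given, only works published on or after that date
    are requested, so a rerun skips most of what was already collected.
    """
    filters = ["is_oa:true"]
    if latest is not None:
        filters.append(f"from_publication_date:{latest}")
    params = {"filter": ",".join(filters), "per-page": per_page, "cursor": cursor}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


# Hypothetical dates read from the existing parquet file.
already_seen = ["2024-01-05", "2023-12-31", None]
print(build_query_url(latest_publication_date(already_seen)))
```

If the date filter is inclusive, records from the boundary day are returned again on each rerun, so the output still needs to be deduplicated by URL before it is appended to the parquet file.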

02_download_pdfs.py
deps: brick/open_alex_open_access_pdfs.parquet
outs:
brick/open_access_pdfs.pdf/*
brick/open_access_pdfs.parquet
A script that takes the URLs from stage 01 and downloads all the PDFs to the open_access_pdfs directory. It should also store metadata in open_access_pdfs.parquet (for example, linking each download URL to the path of the downloaded PDF). Each PDF should be saved under a filename based on a content hash of the file. In the future, this stage may depend on other stages that use other methods of finding open-access PDFs.
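The content-hash naming could look roughly like the sketch below. The function name `save_pdf_by_hash`, the choice of SHA-256, and the metadata field names are all assumptions for illustration; the issue only asks for "a content hash" and a link between URL and path. The download itself is left out so that the example runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path


def save_pdf_by_hash(content: bytes, url: str, out_dir: Path) -> dict:
    """Write `content` to out_dir/<sha256>.pdf and return a metadata record.

    Identical bytes always map to the same filename, so the same PDF
    fetched from two URLs is stored only once.
    """
    digest = hashlib.sha256(content).hexdigest()
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{digest}.pdf"
    if not path.exists():
        path.write_bytes(content)
    # One row for the metadata table: download URL linked to the stored path.
    return {"url": url, "path": str(path), "sha256": digest}


# Usage: two URLs serving the same bytes end up pointing at one file.
with tempfile.TemporaryDirectory() as tmp:
    first = save_pdf_by_hash(b"%PDF-1.4 demo", "https://example.org/a.pdf", Path(tmp))
    second = save_pdf_by_hash(b"%PDF-1.4 demo", "https://example.org/mirror.pdf", Path(tmp))
    files_written = len(list(Path(tmp).iterdir()))
```

The returned records can be collected into a list and written to open_access_pdfs.parquet in one step at the end of the run.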

@mahinth1
Collaborator

Stage 1: 01_get_openaccess.py; check_downloaded_url.py (remove duplicates)

Stage 2: 02_download.py

@mahinth1
Collaborator

The script that removes duplicates should be named remove_duplicates.py.

@mahinth1
Collaborator

PDFs are being downloaded.

