I think this brick could be completed in 2 stages:
01_get_open_access_pdfs.py
deps: none
outs: brick/open_alex_open_access_pdfs.parquet
Script that pulls all open access PDF URLs from OpenAlex. It needs to be incremental, so that it can be rerun to pick up new records without repeating all of the earlier work. One option is to query OpenAlex with a publication date filter based on the greatest publication date found among the already downloaded PDFs.
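A rough sketch of the incremental query, to make the idea concrete. It assumes the OpenAlex works endpoint accepts the `is_oa` and `from_publication_date` filters together with `per-page` and cursor paging (`cursor=*` for the first page); please check these against the OpenAlex docs before relying on them. The function names are hypothetical, and the previous dates would in practice be read from brick/open_alex_open_access_pdfs.parquet.

```python
from datetime import date
from typing import Iterable, Optional
from urllib.parse import urlencode

# Assumed endpoint for OpenAlex works; verify against the OpenAlex docs.
OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def latest_publication_date(dates: Iterable[Optional[str]]) -> Optional[date]:
    """Return the greatest ISO date (YYYY-MM-DD) seen so far, or None on a first run."""
    parsed = [date.fromisoformat(d) for d in dates if d]
    return max(parsed) if parsed else None


def build_query_url(since: Optional[date], cursor: str = "*") -> str:
    """Build a works query limited to open access records published on or after `since`."""
    filters = ["is_oa:true"]  # assumed filter name for open access works
    if since is not None:
        # Assumed to be inclusive, so the last known day is fetched again.
        filters.append(f"from_publication_date:{since.isoformat()}")
    params = {"filter": ",".join(filters), "per-page": 200, "cursor": cursor}
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"
```

Because the last known date is queried again, the rerun should drop rows whose work ID is already in the parquet file. One limitation of keying on publication date: a work that is indexed later but carries an older publication date would be missed.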
02_download_pdfs.py
deps: brick/open_alex_open_access_pdfs.parquet
outs:
brick/open_access_pdfs.pdf/*
brick/open_access_pdfs.parquet
Script that reads the URLs from stage 01 and downloads all the PDFs to the open_access_pdfs directory. It should also store metadata in open_access_pdfs.parquet (for example, linking each download URL to the path of the downloaded PDF). Each PDF should be saved under a filename based on a content hash of the file. In the future, this stage may depend on other stages that use other methods of finding open access PDFs.
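A minimal sketch of the content-hash naming, assuming SHA-256 as the hash (the choice of hash is not settled above). `store_pdf` and the metadata keys are hypothetical names, and the download itself is left out: the function takes the bytes that were already fetched.

```python
import hashlib
from pathlib import Path


def store_pdf(content: bytes, url: str, out_dir: Path) -> dict:
    """Write PDF bytes to <out_dir>/<sha256>.pdf and return one metadata row."""
    digest = hashlib.sha256(content).hexdigest()  # assumption: SHA-256 content hash
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{digest}.pdf"
    if not path.exists():
        # Identical content from a different URL maps to the same file.
        path.write_bytes(content)
    return {"url": url, "sha256": digest, "path": str(path)}
```

The returned rows could be collected into a table and written to brick/open_access_pdfs.parquet, for example with pandas `DataFrame.to_parquet`. Naming by content hash also means a rerun can skip any URL whose row already exists, and duplicate PDFs served from several URLs are stored once.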