brick directions #4

tomlue · 2024-10-09T20:53:24Z

I think this brick could be completed in 2 stages:

01_get_open_access_pdfs.py
deps: none
outs: brick/open_alex_open_access_pdfs.parquet
Script that pulls all open access PDF URLs from OpenAlex. It needs to be incremental, so that it can be rerun to pick up new records without redoing all of the earlier work. One option is to query OpenAlex with a publication date filter, starting from the greatest publication date found among the already downloaded PDFs.
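A minimal sketch of the incremental query idea, using only the standard library. The function names are hypothetical, and the filter keys (`is_oa`, `from_publication_date`) and paging parameters (`per-page`, `cursor`) are my understanding of the OpenAlex works API, so check them against the OpenAlex docs before relying on this. Reading the dates out of the existing parquet file and making the HTTP request are left out.

```python
from datetime import date
from typing import Dict, Iterable, Optional

# Assumed endpoint for OpenAlex works; verify against the OpenAlex docs.
OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def latest_publication_date(dates: Iterable[Optional[str]]) -> Optional[str]:
    """Return the greatest ISO date (YYYY-MM-DD) already stored, or None if there is none."""
    parsed = [date.fromisoformat(d) for d in dates if d]
    return max(parsed).isoformat() if parsed else None


def build_query_params(since: Optional[str], cursor: str = "*") -> Dict[str, str]:
    """Build query parameters for one page of open access works.

    `since` is the greatest publication date from the previous run;
    None means a first run that pulls everything.
    """
    filters = ["is_oa:true"]
    if since is not None:
        filters.append(f"from_publication_date:{since}")
    return {"filter": ",".join(filters), "per-page": "200", "cursor": cursor}


# Example: dates as they might be read from the existing parquet file.
since = latest_publication_date(["2024-09-30", None, "2024-10-01"])
params = build_query_params(since)
```

As far as I know the date filter is inclusive, so a rerun would fetch works from the boundary date again; those would need to be deduplicated before they are appended.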

02_download_pdfs.py
deps: brick/open_alex_open_access_pdfs.parquet
outs:
brick/open_access_pdfs.pdf/*
brick/open_access_pdfs.parquet
Script that takes the URLs from stage 01 and downloads all the PDFs to the open_access_pdfs directory. It should also store metadata in open_access_pdfs.parquet (for example, linking each download URL to the path of the downloaded PDF). Each PDF should be saved under a filename based on a content hash of the file. In the future, this stage may depend on other stages that use other methods of finding open access PDFs.
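A rough sketch of the content-hash naming, assuming SHA-256 as the hash (the issue does not specify one) and a hypothetical `save_pdf` helper. The download itself is left out; `content` is whatever bytes came back for the URL. The returned dict is one possible shape for a metadata row in open_access_pdfs.parquet.

```python
import hashlib
from pathlib import Path
from typing import Dict


def save_pdf(content: bytes, url: str, out_dir: Path) -> Dict[str, str]:
    """Write PDF bytes to out_dir under a name derived from their SHA-256 hash.

    Returns a metadata row linking the download URL to the saved path.
    """
    digest = hashlib.sha256(content).hexdigest()
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{digest}.pdf"
    # Identical content from different URLs maps to the same file,
    # so the file is only written once.
    if not path.exists():
        path.write_bytes(content)
    return {"url": url, "path": str(path), "sha256": digest}
```

With this naming, two URLs that serve the same PDF produce two metadata rows pointing at one file, which keeps the directory free of duplicate content.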

mahinth1 · 2024-10-15T13:32:17Z

Stage 1: 01_get_openaccess.py; check_downloaded_url.py (remove duplicates)

Stage 2: 02_download.py

mahinth1 · 2024-10-15T13:36:59Z

The script to remove duplicates should be named remove_duplicates.py.
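One way the duplicate removal could work, as a sketch only: drop URLs that were already downloaded and URLs repeated within the new batch, keeping the original order. The function name and the plain-set input are assumptions, not what remove_duplicates.py actually does.

```python
from typing import Iterable, List, Set


def new_unique_urls(candidates: Iterable[str], already_downloaded: Set[str]) -> List[str]:
    """Return candidate URLs not seen before, in first-seen order."""
    seen = set(already_downloaded)  # copy so the caller's set is not modified
    fresh: List[str] = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```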

mahinth1 · 2024-10-21T19:16:47Z

PDFs are being downloaded.

tomlue assigned mahinth1 Oct 9, 2024
