generated from biobricks-ai/brick-template
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
do away with template defaults; add in basic processing logic (needs work, large files are not being read)
1 parent 95d8c83, commit 3994a25
Showing 6 changed files with 51 additions and 76 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,35 +1,5 @@ | ||
# How to build bricks | ||
# PubTator3 Brick | ||
|
||
1. Create a brick named `{newbrick}` from this template | ||
``` | ||
gh repo create biobricks-ai/{newbrick} -p biobricks-ai/brick-template --public | ||
gh repo clone biobricks-ai/{newbrick} | ||
cd newbrick | ||
``` | ||
|
||
2. Edit stages according to your needs: | ||
Recommended scripts: | ||
- ``01_download.sh`` | ||
- ``02_unzip.sh`` | ||
- ``03_build.sh`` calling a function to process individual files like ``csv2parquet.R`` or ``csv2parquet.py`` | ||
|
||
3. Replace stages in dvc.yaml with your new stages | ||
|
||
4. Build your brick | ||
``` | ||
dvc repro # runs new stages | ||
``` | ||
|
||
5. Push the data to biobricks.ai | ||
``` | ||
dvc push -r s3.biobricks.ai | ||
``` | ||
|
||
6. Commit the brick | ||
``` | ||
git add -A && git commit -m "some message" | ||
git push | ||
``` | ||
|
||
7. Monitor the bricktools github action | ||
This brick contains the assets from the FTP server provided by [PubTator3](https://www.ncbi.nlm.nih.gov/research/pubtator3/). | ||
|
||
The assets include full-text summaries and abstracts from PubMed articles, and tab-delimited files containing various kinds of relation data for named-entity recognition (NER). See the [FTP Readme](https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/README.txt) for more detailed descriptions. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
dvc | ||
lxml | ||
pandas | ||
pyarrow |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
import pandas as pd | ||
import pyarrow as pa | ||
from pyarrow.csv import open_csv, ParseOptions, ReadOptions | ||
from pyarrow.parquet import write_table | ||
from pathlib import Path | ||
|
||
raw_dir = Path("raw") | ||
brick_dir = Path("brick") | ||
|
||
biocxml_out = brick_dir / "BioCXML" | ||
|
||
|
||
def read_csv(filename): | ||
    return open_csv( | ||
        filename, | ||
        parse_options=ParseOptions( | ||
            delimiter="\t", | ||
            invalid_row_handler=lambda _: "skip", | ||
            newlines_in_values=True, | ||
        ), | ||
        read_options=ReadOptions( | ||
            block_size=1024 * 1024 * 1024 | ||
        ),  # can't go higher than this or it overflows an int32... | ||
        memory_pool=pa.system_memory_pool() | ||
    ) | ||
|
||
|
||
if __name__ == "__main__": | ||
|
||
    if not biocxml_out.exists(): | ||
        biocxml_out.mkdir(parents=True) | ||
|
||
    for f in raw_dir.iterdir(): | ||
        if f.is_file() and f.suffix == "": | ||
            try: | ||
                table: pa.Table = read_csv(f).read_all() | ||
                write_table(table, brick_dir / f"{f.name}.parquet") | ||
            except Exception as e: | ||
                print(e) | ||
        elif f.is_dir(): | ||
            output = f / "BioCXML" | ||
            for xml in output.iterdir(): | ||
                df = pd.read_xml(xml) | ||
                df.to_parquet(brick_dir / f"BioCXML/{f.name}.parquet") |
This file was deleted.
This file was deleted.
This file was deleted.