do away with template defaults; add in basic processing logic (needs work, large files are not being read)
bhlieberman committed Jun 28, 2024
1 parent 95d8c83 commit 3994a25
Showing 6 changed files with 51 additions and 76 deletions.
36 changes: 3 additions & 33 deletions README.md
@@ -1,35 +1,5 @@
# How to build bricks
# PubTator3 Brick

1. Create a brick named `{newbrick}` from this template
```
gh repo create biobricks-ai/{newbrick} -p biobricks-ai/brick-template --public
gh repo clone biobricks-ai/{newbrick}
cd newbrick
```

2. Edit stages according to your needs:
Recommended scripts:
- ``01_download.sh``
- ``02_unzip.sh``
- ``03_build.sh`` calling a function to process individual files like ``csv2parquet.R`` or ``csv2parquet.py``

3. Replace stages in dvc.yaml with your new stages

4. Build your brick
```
dvc repro # runs new stages
```

5. Push the data to biobricks.ai
```
dvc push -r s3.biobricks.ai
```

6. Commit the brick
```
git add -A && git commit -m "some message"
git push
```

7. Monitor the bricktools github action
This is a brick containing the assets from the FTP server provided by [PubTator](https://www.ncbi.nlm.nih.gov/research/pubtator3/).

The assets include full-text summaries and abstracts from PubMed articles, and tab-delimited data containing various kinds of relation data for NER (see the [FTP Readme](https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/README.txt) for more detailed descriptions).
4 changes: 4 additions & 0 deletions requirements.txt
@@ -0,0 +1,4 @@
dvc
lxml
pandas
pyarrow
44 changes: 44 additions & 0 deletions stages/03_build.py
@@ -0,0 +1,44 @@
import pandas as pd
import pyarrow as pa
from pyarrow.csv import open_csv, ParseOptions, ReadOptions
from pyarrow.parquet import write_table
from pathlib import Path

raw_dir = Path("raw")
brick_dir = Path("brick")

biocxml_out = brick_dir / "BioCXML"


def read_csv(filename):
    return open_csv(
        filename,
        parse_options=ParseOptions(
            delimiter="\t",
            invalid_row_handler=lambda _: "skip",
            newlines_in_values=True,
        ),
        read_options=ReadOptions(
            block_size=1024 * 1024 * 1024
        ),  # can't go higher than this or it overflows an int32...
        memory_pool=pa.system_memory_pool()
    )


if __name__ == "__main__":

    if not biocxml_out.exists():
        biocxml_out.mkdir(parents=True)

    for f in raw_dir.iterdir():
        if f.is_file() and f.suffix == "":
            try:
                table: pa.Table = read_csv(f).read_all()
                write_table(table, brick_dir / f"{f.name}.parquet")
            except Exception as e:
                print(e)
        elif f.is_dir():
            output = f / "BioCXML"
            for xml in output.iterdir():
                df = pd.read_xml(xml)
                df.to_parquet(brick_dir / f"BioCXML/{f.name}.parquet")
30 changes: 0 additions & 30 deletions stages/03_build.sh

This file was deleted.

2 changes: 0 additions & 2 deletions stages/csv2parquet.R

This file was deleted.

11 changes: 0 additions & 11 deletions stages/csv2parquet.py

This file was deleted.
