do away with template defaults; add in basic processing logic (needs work, large files are not being read)
biobricks-ai · Jun 28, 2024 · commit 3994a25
1 parent 95d8c83
Showing 6 changed files with 51 additions and 76 deletions.
diff --git a/README.md b/README.md
@@ -1,35 +1,5 @@
-# How to build bricks
+# PubTator3 Brick
 
-1. Create a brick named `{newbrick}` from this template
-```
-gh repo create biobricks-ai/{newbrick} -p biobricks-ai/brick-template --public
-gh repo clone biobricks-ai/{newbrick}
-cd newbrick
-```
-
-2. Edit stages according to your needs:
-    Recommended scripts:
-    - ``01_download.sh``
-    - ``02_unzip.sh``
-    - ``03_build.sh`` calling a function to process individual files like ``csv2parquet.R`` or ``csv2parquet.py``
-
-3. Replace stages in dvc.yaml with your new stages
-
-4. Build your brick
-```
-dvc repro # runs new stages
-```
-
-5. Push the data to biobricks.ai
-```
-dvc push -r s3.biobricks.ai 
-```
-
-6. Commit the brick
-```
-git add -A && git commit -m "some message"
-git push
-```
-
-7. Monitor the bricktools github action
+This is a brick containing the assets from the FTP server provided by [PubTator3](https://www.ncbi.nlm.nih.gov/research/pubtator3/).
 
+The assets include full-text summaries and abstracts from PubMed articles, and tab-delimited data containing various kinds of relation data for NER (see the [FTP Readme](https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/README.txt) for more detailed descriptions).
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,4 @@
+dvc
+lxml
+pandas
+pyarrow
diff --git a/stages/03_build.py b/stages/03_build.py
@@ -0,0 +1,44 @@
+import pandas as pd
+import pyarrow as pa
+from pyarrow.csv import open_csv, ParseOptions, ReadOptions
+from pyarrow.parquet import write_table
+from pathlib import Path
+
+raw_dir = Path("raw")
+brick_dir = Path("brick")
+
+biocxml_out = brick_dir / "BioCXML"
+
+
+def read_csv(filename):
+    return open_csv(
+        filename,
+        parse_options=ParseOptions(
+            delimiter="\t",
+            invalid_row_handler=lambda _: "skip",
+            newlines_in_values=True,
+        ),
+        read_options=ReadOptions(
+            block_size=1024 * 1024 * 1024
+        ),  # can't go higher than this or it overflows an int32...
+        memory_pool=pa.system_memory_pool()
+    )
+
+
+if __name__ == "__main__":
+
+    if not biocxml_out.exists():
+        biocxml_out.mkdir(parents=True)
+
+    for f in raw_dir.iterdir():
+        if f.is_file() and f.suffix == "":
+            try:
+                table: pa.Table = read_csv(f).read_all()
+                write_table(table, brick_dir / f"{f.name}.parquet")
+            except Exception as e:
+                print(e)
+        elif f.is_dir():
+            output = f / "BioCXML"
+            for xml in output.iterdir():
+                df = pd.read_xml(xml)
+                df.to_parquet(biocxml_out / f"{f.name}_{xml.stem}.parquet")  # one file per XML input
diff --git a/stages/03_build.sh b/stages/03_build.sh
diff --git a/stages/csv2parquet.R b/stages/csv2parquet.R
diff --git a/stages/csv2parquet.py b/stages/csv2parquet.py