Documentation #12

Merged (12 commits, Dec 6, 2024)
10 changes: 10 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,10 @@
# Instructions on how to modify the repository

- Clone this repository
- Create a new branch
- Add your scripts
- Commit and push to your branch
- Create a pull request and add a reviewer
- Wait for the reviewer's approval to merge
- Wait for any CI tests to succeed
- Celebrate the merge
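
A minimal command-line sketch of the workflow above (the branch name and commit message are placeholders):

```
git clone https://github.com/ht-diva/gwasstudio.git
cd gwasstudio
git checkout -b my-new-branch          # create a new branch (placeholder name)
# ... add your scripts ...
git add .
git commit -m "Add analysis scripts"   # placeholder commit message
git push -u origin my-new-branch
# then open a pull request on GitHub, add a reviewer and wait for approval
```
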
18 changes: 16 additions & 2 deletions README.md
@@ -1,9 +1,23 @@
# GWASStudio
# Computational-Biobank

## Getting started
## A software package for the ingestion and analysis of summary statistics genomic data

### Description

Welcome aboard, fellow developer. This is a journey we are taking to implement a TileDB solution for analyzing summary statistics genomic data.
Please have a look [here](https://github.com/ht-diva/gwasstudio/blob/main/CONTRIBUTING.md) to learn how to contribute to this repo.

## Getting started

conda create --name gwasstudio --file conda-{linux, osx, osx-arm}-64.lock

conda activate gwasstudio

make install

## Additional links to help you get to know TileDB, Dask and Polars
- [TileDB](https://tiledb.com/)
- [Dask](https://www.dask.org/)
- [Polars](https://pola.rs/)

#### To understand in more detail how this software works, have a look at the README guidelines [HERE](https://github.com/ht-diva/gwasstudio/blob/main/docs/README.md)
4,858 changes: 4,858 additions & 0 deletions checksum_decode.txt

Large diffs are not rendered by default.

188 changes: 188 additions & 0 deletions docs/README.md
@@ -0,0 +1,188 @@
# TileDB for summary statistics data

## General commands
This program runs with Python on the HT HPC cluster using the tiledb conda environment. It consists of several subcommands, described in detail below, and can optionally use Dask to accelerate computation.


```
gwasstudio --help

Usage: gwasstudio [OPTIONS] COMMAND [ARGS]...

GWASStudio

Dask options:
--distribute Distribute the load to a Dask cluster
--minimum_workers INTEGER Minimum amount of running workers
--maximum_workers INTEGER Maximum amount of running workers
--memory_workers TEXT Memory amount per worker
--cpu_workers INTEGER CPU numbers per worker

TileDB configuration:
--aws-access-key-id TEXT aws access key id
--aws-secret-access-key TEXT aws access key
--aws-endpoint-override TEXT endpoint where to connect
--aws-use-virtual-addressing TEXT
virtual address option
--aws-scheme TEXT type of scheme used at the endpoint
--aws-region TEXT region where the s3 bucket is located
--aws-verify-ssl TEXT if ssl verification is needed

Other options:
--version Show the version and exit.
-q, --quiet Set log verbosity
--help Show this message and exit.

Commands:
export Exports data from a TileDB dataset.
info Show GWASStudio details
ingest Ingest data into a TileDB-unified dataset.
meta_ingest Ingest metadata into a MongoDB collection.
meta_query Query metadata records from MongoDB
meta_view Retrieve a dataset's metadata from MongoDB using its unique key
```
</br>
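
As a hypothetical sketch of the Dask options (worker counts and memory are illustrative values, and <subcommand> stands for any of the commands above):

```
gwasstudio --distribute --minimum_workers 2 --maximum_workers 8 \
  --memory_workers 8GB --cpu_workers 2 <subcommand> [subcommand options]
```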

## Ingestion
To ingest new data into a TileDB dataset, use the ```ingest``` subcommand of the program. All the options it accepts are shown below. If you decide to use this option, please contact the developers Gianmauro Cuccuru and Bruno Ariano first, so that the metadata and checksum files can be built before this step.

``` gwasstudio ingest --help```

<details>
<summary>Help message</summary>

```
Usage: gwasstudio ingest [OPTIONS]

Ingest data into a TileDB-unified dataset.

Ingestion options:
--checksum TEXT Path to the checksum file
--uri TEXT S3 path where to store the tiledb dataset.

Other options:
--help Show this message and exit.
```
</details>
</br>
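
For example, assuming the metadata and checksum files have already been prepared with the developers, a hypothetical ingestion into an S3-hosted dataset could look like this (the bucket path is a placeholder):

```
gwasstudio ingest --checksum checksum_decode.txt --uri s3://my-bucket/gwas_tiledb
```
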

### Query metadata
To query the metadata of a TileDB dataset, use the ```meta_query``` subcommand. Below we provide its help message.

``` gwasstudio meta_query --help```

<details>
<summary>Help message</summary>

```
Usage: gwasstudio meta_query [OPTIONS]

Query metadata records from MongoDB

Options:
--key TEXT query key
--values TEXT query values, separate multiple values with ;
--output [all|data_id] The detail that you would like to retrieve from the
metadata records
--search-file TEXT Path to search template
--help Show this message and exit.
```
</details>
</br>
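
For example, a hypothetical query that returns only the dataset ids matching two values of an assumed metadata key would look like this (key and values are illustrative):

```
gwasstudio meta_query --key trait --values "height;bmi" --output data_id
```
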

### Query TileDB and export the data

There are different options for querying the TileDB dataset. Below we describe all of them with their usage:
```
gwasstudio export --help
```
<details>
<summary>Help message</summary>

```
Usage: gwasstudio export [OPTIONS]

Exports data from a TileDB dataset.

TileDB mandatory options:
--uri TEXT TileDB dataset URI.
--output_path TEXT The path where you want the output.
--trait_id_file TEXT A file containing a list of trait ids to query.
--chr_list A list of chromosomes to be used for the analysis.

Options for Locusbreaker:
--locusbreaker BOOL Activate locusbreaker
--pvalue-sig FLOAT Minimum log p-value threshold within the window.
--pvalue-limit FLOAT Log p-value threshold for loci borders
--hole-size INTEGER Minimum base-pair distance between SNPs in different
loci (default: 250000)

Options for filtering using genomic regions or a list of SNP ids:
--snp-list TEXT A txt file with a column containing the SNP ids

Other options:
--help Show this message and exit.
```
</details>
</br>
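
For example, a hypothetical export restricted to a SNP list and a subset of chromosomes could look like this (all paths are placeholders; --chr_list takes comma-separated chromosome numbers):

```
gwasstudio export --uri s3://my-bucket/gwas_tiledb \
  --trait_id_file traits.txt \
  --snp-list my_snps.txt \
  --chr_list 1,2,21 \
  --output_path filtered_sumstats.csv
```
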

### Step 2: investigate the schema of the TileDB

To get details on how the TileDB dataset you created is structured, run the following command:

```
python main.py export --uri dummy_tiledb_CD14 --schema
```
<details>
<summary>Schema of the file</summary>

```
ArraySchema(
domain=Domain(*[
Dim(name='cell_type', domain=('', ''), tile=None, dtype='|S0', var=True, filters=FilterList([LZ4Filter(level=5), ])),
Dim(name='gene', domain=('', ''), tile=None, dtype='|S0', var=True, filters=FilterList([LZ4Filter(level=5), ])),
Dim(name='position', domain=(1, 3000000000), tile=3000000000, dtype='uint32', filters=FilterList([LZ4Filter(level=5), ])),
]),
attrs=[
Attr(name='SNP', dtype='ascii', var=True, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='af', dtype='float32', var=False, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='beta', dtype='float32', var=False, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='se', dtype='float32', var=False, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='allele0', dtype='ascii', var=True, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='allele1', dtype='ascii', var=True, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='p-value', dtype='float64', var=False, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
],
cell_order='row-major',
tile_order='row-major',
capacity=10000,
sparse=True,
allows_duplicates=True,
)
```
</details>
This will print the schema of the TileDB created.
</br>
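
The same schema can also be inspected directly from Python; a minimal sketch, assuming the tiledb package is installed and the array is available at the given path:

```
python -c "import tiledb; print(tiledb.open('dummy_tiledb_CD14').schema)"
```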

### Step 3: query and export data by a list of SNPs

Here we show how to query the TileDB dataset created above for a series of SNPs within cell types:

```
python main.py export --uri dummy_tiledb_CD14 --snp test_data/dummy_SNPs.txt --cell_types test_data/dummy_cell_list.txt --output_path test_out_snp_CD14
```
</br>

### Step 4: query and export data by a list of genes and cell types across all positions

```
python main.py export --uri dummy_tiledb_CD14 --genes test_data/dummy_gene_list.txt --cell_types test_data/dummy_cell_list.txt --output_path test_out_genes_CD14
```
</br>

### Step 5: run locus-breaker

```
python main.py export --uri dummy_tiledb_CD14 --genes test_data/dummy_gene_list.txt --cell_types test_data/dummy_cell_list.txt --locusbreaker --output_path test_out_genes_CD14
```
This will create two files in txt format: one containing the boundaries of each clumped region, and another with all the positions included in each independent region defined by significant SNPs.
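
With the gwasstudio CLI, a hypothetical equivalent invocation would combine the locus-breaker options shown in the export help (thresholds and paths are illustrative):

```
gwasstudio export --uri s3://my-bucket/gwas_tiledb \
  --trait_id_file traits.txt \
  --locusbreaker --pvalue-sig 7.3 --pvalue-limit 5.0 --hole-size 250000 \
  --output_path locusbreaker_out.txt
```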
11 changes: 9 additions & 2 deletions gwasstudio/__init__.py
@@ -1,10 +1,17 @@
import importlib.metadata

from appdirs import user_config_dir, user_log_dir
from cloup import Context, HelpFormatter, HelpTheme, Style
from loguru import logger as a_logger

__all__ = ["__appname__", "__version__", "context_settings", "config_dir", "config_filename", "log_file", "logger"]
__all__ = [
"__appname__",
"__version__",
"context_settings",
"config_dir",
"config_filename",
"log_file",
"logger",
]

__appname__ = __name__
__version__ = importlib.metadata.version(__appname__)
63 changes: 43 additions & 20 deletions gwasstudio/cli/export.py
@@ -2,7 +2,6 @@
import cloup
import pandas as pd
import tiledb

from gwasstudio import logger
from gwasstudio.methods.locus_breaker import locus_breaker
from gwasstudio.utils import process_write_chunk
@@ -16,14 +15,24 @@
@cloup.option_group(
"TileDB mandatory options",
cloup.option("--uri", default="None", help="TileDB dataset URI"),
cloup.option("--output_path", default="None", help="The chromosome used for the analysis"),
cloup.option("--trait_id_file", default="None", help="The chromosome used for the analysis"),
cloup.option("--output_path", default="None", help="The path of the output"),
cloup.option("--trait_id_file", default="None", help="The trait id used for the analysis"),
cloup.option(
"--chr_list",
default=None,
help="A list of chromosomes to be used for the analysis",
),
)
@cloup.option_group(
"Options for Locusbreaker",
cloup.option("--locusbreaker", default=False, is_flag=True, help="Option to run locusbreaker"),
cloup.option("--pvalue-sig", default=False, help="P-value threshold to use for filtering the data"),
cloup.option("--pvalue-limit", default=5.0, help="P-value threshold for loci borders"),
    cloup.option(
        "--pvalue-sig",
        default=None,
        type=float,
        help="Minimum log p-value threshold within the window",
    ),
cloup.option("--pvalue-limit", default=5.0, help="Log p-value threshold for loci borders"),
cloup.option(
"--hole-size",
default=250000,
@@ -32,20 +41,31 @@
)
@cloup.option_group(
"Options for filtering using a genomic regions or a list of SNPs ids",
cloup.option("--snp-list", default="None", help="A txt file with a column containing the SNP ids"),
cloup.option(
"--snp-list",
default="None",
help="A txt file with a column containing the SNP ids",
),
)
@click.pass_context
def export(ctx, uri, trait_id_file, output_path, pvalue_sig, pvalue_limit, hole_size, snp_list, locusbreaker):
def export(ctx, uri, trait_id_file, output_path, pvalue_sig, pvalue_limit, hole_size, snp_list, locusbreaker, chr_list):
# cfg = ctx.obj["cfg"]
tiledb_unified = tiledb.open(uri, mode="r")
logger.info("TileDB dataset loaded")
trait_id_list = open(trait_id_file, "r").read().rstrip().split("\n")
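    # Parse the comma-separated chromosome list into integers; with no list
    # given, slice(None) selects every chromosome in the TileDB dimension.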
if chr_list:
chr_list_str = chr_list.split(",")
chr_list_int = [int(x) for x in chr_list_str]
else:
chr_list_int = slice(None)

# If locus_breaker is selected, run locus_breaker
if locusbreaker:
print("running locus breaker")
for trait in trait_id_list:
subset_SNPs_pd = tiledb_unified.query(
dims=["CHR", "POS", "TRAITID"], attrs=["SNPID", "ALLELE0", "ALLELE1", "BETA", "SE", "EAF", "MLOG10P"]
dims=["CHR", "POS", "TRAITID"],
attrs=["SNPID", "ALLELE0", "ALLELE1", "BETA", "SE", "EAF", "MLOG10P"],
).df[:, :, trait_id_list]
results_lb = locus_breaker(subset_SNPs_pd)
logger.info(f"Saving locus-breaker output in {output_path}")
@@ -54,26 +74,29 @@ def export(ctx, uri, trait_id_file, output_path, pvalue_sig, pvalue_limit, hole_

# If snp_list is selected, run extract_snp
if snp_list != "None":
SNP_list = pd.read_csv(snp_list, dtype={"CHR": str, "POS": int, "ALLELE0": str, "ALLELE1": str})
SNP_list = pd.read_csv(snp_list, dtype={"CHR": str, "POS": int, "EA": str, "NEA": str})
chromosome_dict = SNP_list.groupby("CHR")["POS"].apply(list).to_dict()
unique_positions = list(set(pos for positions in chromosome_dict.values() for pos in positions))
with tiledb.open("/scratch/bruno.ariano/tiledb_decode_unified", "r") as A:
tiledb_iterator = A.query(return_incomplete=True).df[
:, unique_positions, trait_id_list
with tiledb_unified as tiledb_iterator:
tiledb_iterator_query = tiledb_iterator.query(return_incomplete=True).df[
chr_list_int, unique_positions, trait_id_list
] # Replace with appropriate filters if necessary
with open(output_path, mode="a") as f:
for chunk in tiledb_iterator:
for chunk in tiledb_iterator_query:
# Convert the chunk to Polars format
process_write_chunk(chunk, SNP_list, f)

logger.info(f"Saved filtered summary statistics by SNPs in {output_path}")
exit()

if pvalue_sig:
subset_SNPs = tiledb_unified.query(
cond=f"MLOGP10 > {pvalue_sig}",
dims=["CHR", "POS", "TRAITID"],
attrs=["SNPID", "ALLELE0", "ALLELE1", "BETA", "SE", "EAF", "MLOG10P"],
).df[:, trait_id_list, :]
subset_SNPs.to_csv(output_path, index=False)
logger.info(f"Saving filtered GWAS by regions and samples in {output_path}")
        with tiledb_unified as tiledb_iterator:
            tiledb_iterator_query = tiledb_iterator.query(
                return_incomplete=True,
                cond=f"MLOG10P > {pvalue_sig}",
                dims=["CHR", "POS", "TRAITID"],
                attrs=["SNPID", "EA", "NEA", "BETA", "SE", "EAF", "MLOG10P"],
            ).df[chr_list_int, :, trait_id_list]
            with open(output_path, mode="a") as f:
                for chunk in tiledb_iterator_query:
                    # Write each chunk through the open file handle so that
                    # successive chunks are appended rather than overwritten
                    chunk.to_csv(f, index=False)
        logger.info(f"Saving filtered GWAS by regions and samples in {output_path}")