Documentation #12

Merged (12 commits, Dec 6, 2024)
10 changes: 10 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,10 @@
# Instructions on how to modify the repository

- Clone this repository
- Create a new branch
- Add your scripts
- Commit and push to your branch
- Create a pull request and add a reviewer
- Wait for the reviewer's approval to merge
- Wait for any CI tests to succeed
- Celebrate the merge
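
A minimal command-line sketch of the workflow above (the branch name and commit message are placeholders):

```
git clone https://github.com/ht-diva/gwasstudio.git
cd gwasstudio
git checkout -b my-new-branch          # create a new branch (placeholder name)
# ... add your scripts ...
git add .
git commit -m "Add analysis scripts"   # placeholder commit message
git push -u origin my-new-branch
# then open a pull request on GitHub, add a reviewer and wait for approval
```
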
18 changes: 16 additions & 2 deletions README.md
@@ -1,9 +1,23 @@
# GWASStudio
# Computational-Biobank

## Getting started
## A software package for the ingestion and analysis of summary statistics genomic data

### Description

Welcome aboard, fellow developer. This is a journey we are taking to implement a TileDB solution for analyzing summary statistics genomic data.
Please have a look [here](https://github.com/ht-diva/gwasstudio/blob/main/CONTRIBUTING.md) to learn how to contribute to this repo.

## Getting started

conda create --name gwasstudio --file conda-{linux, osx, osx-arm}-64.lock

conda activate gwasstudio

make install

## Additional links to help you get to know TileDB, Dask and Polars
- [TileDB](https://tiledb.com/)
- [Dask](https://www.dask.org/)
- [Polars](https://pola.rs/)

#### To understand in more detail how this software works, have a look at the README guidelines [HERE](https://github.com/ht-diva/gwasstudio/blob/main/docs/README.md)
4,858 changes: 4,858 additions & 0 deletions checksum_decode.txt

Large diffs are not rendered by default.

188 changes: 188 additions & 0 deletions docs/README.md
@@ -0,0 +1,188 @@
# TileDB for summary statistics data

## General commands
This program runs with Python on the HT HPC cluster using the tiledb conda environment. It consists of several subcommands, described in detail below, and can optionally use Dask to accelerate computation.


```
gwasstudio --help

Usage: gwasstudio [OPTIONS] COMMAND [ARGS]...

GWASStudio

Dask options:
--distribute Distribute the load to a Dask cluster
--minimum_workers INTEGER Minimum amount of running workers
--maximum_workers INTEGER Maximum amount of running workers
--memory_workers TEXT Memory amount per worker
--cpu_workers INTEGER CPU numbers per worker

TileDB configuration:
--aws-access-key-id TEXT aws access key id
--aws-secret-access-key TEXT aws access key
--aws-endpoint-override TEXT endpoint where to connect
--aws-use-virtual-addressing TEXT
virtual address option
--aws-scheme TEXT type of scheme used at the endpoint
--aws-region TEXT region where the s3 bucket is located
--aws-verify-ssl TEXT if ssl verification is needed

Other options:
--version Show the version and exit.
-q, --quiet Set log verbosity
--help Show this message and exit.

Commands:
export Exports data from a TileDB dataset.
info Show GWASStudio details
ingest Ingest data into a TileDB-unified dataset.
meta_ingest Ingest metadata into a MongoDB collection.
meta_query Query metadata records from MongoDB
meta_view Retrieve a dataset's metadata from MongoDB using its unique key
```
</br>
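
As a hypothetical sketch of the Dask options (worker counts and memory are illustrative values, and <subcommand> stands for any of the commands above):

```
gwasstudio --distribute --minimum_workers 2 --maximum_workers 8 \
  --memory_workers 8GB --cpu_workers 2 <subcommand> [subcommand options]
```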

## Ingestion
To ingest new data into a TileDB dataset, use the ```ingest``` subcommand of the program. All the options it accepts are shown below. If you decide to use this option, please contact the developers Gianmauro Cuccuru and Bruno Ariano first, so that the metadata and checksum files can be built before this step.

``` gwasstudio ingest --help```

<details>
<summary>Help message</summary>

```
Usage: gwasstudio ingest [OPTIONS]

Ingest data into a TileDB-unified dataset.

Ingestion options:
--checksum TEXT Path to the checksum file
--uri TEXT S3 path where to store the tiledb dataset.

Other options:
--help Show this message and exit.
```
</details>
</br>
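
For example, assuming the metadata and checksum files have already been prepared with the developers, a hypothetical ingestion into an S3-hosted dataset could look like this (the bucket path is a placeholder):

```
gwasstudio ingest --checksum checksum_decode.txt --uri s3://my-bucket/gwas_tiledb
```
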

### Query metadata
To query the metadata of a TileDB dataset, use the ```meta_query``` subcommand. Below we provide its help message.

``` gwasstudio meta_query --help```

<details>
<summary>Help message</summary>

```
Usage: gwasstudio meta_query [OPTIONS]

Query metadata records from MongoDB

Options:
--key TEXT query key
--values TEXT query values, separate multiple values with ;
--output [all|data_id] The detail that you would like to retrieve from the
metadata records
--search-file TEXT Path to search template
--help Show this message and exit.
```
</details>
</br>
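
For example, a hypothetical query that returns only the dataset ids matching two values of an assumed metadata key would look like this (key and values are illustrative):

```
gwasstudio meta_query --key trait --values "height;bmi" --output data_id
```
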

### Query TileDB and export the data

There are different options for querying the TileDB dataset. Below we describe all of them with their usage:
```
gwasstudio export --help
```
<details>
<summary>Help message</summary>

```
Usage: gwasstudio export [OPTIONS]

Exports data from a TileDB dataset.

TileDB mandatory options:
--uri TEXT TileDB dataset URI.
--output_path TEXT The path where you want the output.
--trait_id_file TEXT A file containing a list of trait ids to query.
--chr_list A list of chromosomes to be used for the analysis.

Options for Locusbreaker:
--locusbreaker BOOL Activate locusbreaker
--pvalue-sig FLOAT Minimum log p-value threshold within the window.
--pvalue-limit FLOAT Log p-value threshold for loci borders
--hole-size INTEGER Minimum base-pair distance between SNPs in different
loci (default: 250000)

Options for filtering using genomic regions or a list of SNP ids:
--snp-list TEXT A txt file with a column containing the SNP ids

Other options:
--help Show this message and exit.
```
</details>
</br>
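
For example, a hypothetical export restricted to a SNP list and a subset of chromosomes could look like this (all paths are placeholders; --chr_list takes comma-separated chromosome numbers):

```
gwasstudio export --uri s3://my-bucket/gwas_tiledb \
  --trait_id_file traits.txt \
  --snp-list my_snps.txt \
  --chr_list 1,2,21 \
  --output_path filtered_sumstats.csv
```
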

### Step 2: investigate the schema of the TileDB

To get details on how the TileDB dataset you created is structured, run the following command:

```
python main.py export --uri dummy_tiledb_CD14 --schema
```
<details>
<summary>Schema of the file</summary>

```
ArraySchema(
domain=Domain(*[
Dim(name='cell_type', domain=('', ''), tile=None, dtype='|S0', var=True, filters=FilterList([LZ4Filter(level=5), ])),
Dim(name='gene', domain=('', ''), tile=None, dtype='|S0', var=True, filters=FilterList([LZ4Filter(level=5), ])),
Dim(name='position', domain=(1, 3000000000), tile=3000000000, dtype='uint32', filters=FilterList([LZ4Filter(level=5), ])),
]),
attrs=[
Attr(name='SNP', dtype='ascii', var=True, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='af', dtype='float32', var=False, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='beta', dtype='float32', var=False, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='se', dtype='float32', var=False, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='allele0', dtype='ascii', var=True, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='allele1', dtype='ascii', var=True, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
Attr(name='p-value', dtype='float64', var=False, nullable=False, enum_label=None, filters=FilterList([LZ4Filter(level=5), ])),
],
cell_order='row-major',
tile_order='row-major',
capacity=10000,
sparse=True,
allows_duplicates=True,
)
```
</details>
This will print the schema of the TileDB created.
</br>
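
The same schema can also be inspected directly from Python; a minimal sketch, assuming the tiledb package is installed and the array is available at the given path:

```
python -c "import tiledb; print(tiledb.open('dummy_tiledb_CD14').schema)"
```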

### Step 3: query and export data by a list of SNPs

Here we show how to query the TileDB dataset created above for a series of SNPs within cell types:

```
python main.py export --uri dummy_tiledb_CD14 --snp test_data/dummy_SNPs.txt --cell_types test_data/dummy_cell_list.txt --output_path test_out_snp_CD14
```
</br>

### Step 4: query and export data by a list of genes and cell types across all positions

```
python main.py export --uri dummy_tiledb_CD14 --genes test_data/dummy_gene_list.txt --cell_types test_data/dummy_cell_list.txt --output_path test_out_genes_CD14
```
</br>

### Step 5: run locus-breaker

```
python main.py export --uri dummy_tiledb_CD14 --genes test_data/dummy_gene_list.txt --cell_types test_data/dummy_cell_list.txt --locusbreaker --output_path test_out_genes_CD14
```
This will create two files in txt format: one containing the boundaries of each clumped region, and another with all the positions included in each independent region defined by significant SNPs.
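
With the gwasstudio CLI, a hypothetical equivalent invocation would combine the locus-breaker options shown in the export help (thresholds and paths are illustrative):

```
gwasstudio export --uri s3://my-bucket/gwas_tiledb \
  --trait_id_file traits.txt \
  --locusbreaker --pvalue-sig 7.3 --pvalue-limit 5.0 --hole-size 250000 \
  --output_path locusbreaker_out.txt
```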
11 changes: 9 additions & 2 deletions gwasstudio/__init__.py
@@ -1,10 +1,17 @@
import importlib.metadata

from appdirs import user_config_dir, user_log_dir
from cloup import Context, HelpFormatter, HelpTheme, Style
from loguru import logger as a_logger

__all__ = ["__appname__", "__version__", "context_settings", "config_dir", "config_filename", "log_file", "logger"]
__all__ = [
"__appname__",
"__version__",
"context_settings",
"config_dir",
"config_filename",
"log_file",
"logger",
]

__appname__ = __name__
__version__ = importlib.metadata.version(__appname__)
63 changes: 43 additions & 20 deletions gwasstudio/cli/export.py
@@ -2,7 +2,6 @@
import cloup
import pandas as pd
import tiledb

from gwasstudio import logger
from gwasstudio.methods.locus_breaker import locus_breaker
from gwasstudio.utils import process_write_chunk
@@ -16,14 +15,24 @@
@cloup.option_group(
"TileDB mandatory options",
cloup.option("--uri", default="None", help="TileDB dataset URI"),
cloup.option("--output_path", default="None", help="The chromosome used for the analysis"),
cloup.option("--trait_id_file", default="None", help="The chromosome used for the analysis"),
cloup.option("--output_path", default="None", help="The path of the output"),
cloup.option("--trait_id_file", default="None", help="The trait id used for the analysis"),
cloup.option(
"--chr_list",
default=None,
help="A list of chromosomes to be used for the analysis",
),
)
@cloup.option_group(
"Options for Locusbreaker",
cloup.option("--locusbreaker", default=False, is_flag=True, help="Option to run locusbreaker"),
cloup.option("--pvalue-sig", default=False, help="P-value threshold to use for filtering the data"),
cloup.option("--pvalue-limit", default=5.0, help="P-value threshold for loci borders"),
    cloup.option(
        "--pvalue-sig",
        default=None,
        type=float,
        help="Minimum log p-value threshold within the window",
    ),
cloup.option("--pvalue-limit", default=5.0, help="Log p-value threshold for loci borders"),
cloup.option(
"--hole-size",
default=250000,
@@ -32,20 +41,31 @@
)
@cloup.option_group(
"Options for filtering using a genomic regions or a list of SNPs ids",
cloup.option("--snp-list", default="None", help="A txt file with a column containing the SNP ids"),
cloup.option(
"--snp-list",
default="None",
help="A txt file with a column containing the SNP ids",
),
)
@click.pass_context
def export(ctx, uri, trait_id_file, output_path, pvalue_sig, pvalue_limit, hole_size, snp_list, locusbreaker):
def export(ctx, uri, trait_id_file, output_path, pvalue_sig, pvalue_limit, hole_size, snp_list, locusbreaker, chr_list):
# cfg = ctx.obj["cfg"]
tiledb_unified = tiledb.open(uri, mode="r")
logger.info("TileDB dataset loaded")
trait_id_list = open(trait_id_file, "r").read().rstrip().split("\n")
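    # Parse the comma-separated chromosome list into integers; with no list
    # given, slice(None) selects every chromosome in the TileDB dimension.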
if chr_list:
chr_list_str = chr_list.split(",")
chr_list_int = [int(x) for x in chr_list_str]
else:
chr_list_int = slice(None)

# If locus_breaker is selected, run locus_breaker
if locusbreaker:
print("running locus breaker")
for trait in trait_id_list:
subset_SNPs_pd = tiledb_unified.query(
dims=["CHR", "POS", "TRAITID"], attrs=["SNPID", "ALLELE0", "ALLELE1", "BETA", "SE", "EAF", "MLOG10P"]
dims=["CHR", "POS", "TRAITID"],
attrs=["SNPID", "ALLELE0", "ALLELE1", "BETA", "SE", "EAF", "MLOG10P"],
).df[:, :, trait_id_list]
results_lb = locus_breaker(subset_SNPs_pd)
logger.info(f"Saving locus-breaker output in {output_path}")
@@ -54,26 +74,29 @@ def export(ctx, uri, trait_id_file, output_path, pvalue_sig, pvalue_limit, hole_

# If snp_list is selected, run extract_snp
if snp_list != "None":
SNP_list = pd.read_csv(snp_list, dtype={"CHR": str, "POS": int, "ALLELE0": str, "ALLELE1": str})
SNP_list = pd.read_csv(snp_list, dtype={"CHR": str, "POS": int, "EA": str, "NEA": str})
chromosome_dict = SNP_list.groupby("CHR")["POS"].apply(list).to_dict()
unique_positions = list(set(pos for positions in chromosome_dict.values() for pos in positions))
with tiledb.open("/scratch/bruno.ariano/tiledb_decode_unified", "r") as A:
tiledb_iterator = A.query(return_incomplete=True).df[
:, unique_positions, trait_id_list
with tiledb_unified as tiledb_iterator:
tiledb_iterator_query = tiledb_iterator.query(return_incomplete=True).df[
chr_list_int, unique_positions, trait_id_list
] # Replace with appropriate filters if necessary
with open(output_path, mode="a") as f:
for chunk in tiledb_iterator:
for chunk in tiledb_iterator_query:
# Convert the chunk to Polars format
process_write_chunk(chunk, SNP_list, f)

logger.info(f"Saved filtered summary statistics by SNPs in {output_path}")
exit()

if pvalue_sig:
subset_SNPs = tiledb_unified.query(
cond=f"MLOGP10 > {pvalue_sig}",
dims=["CHR", "POS", "TRAITID"],
attrs=["SNPID", "ALLELE0", "ALLELE1", "BETA", "SE", "EAF", "MLOG10P"],
).df[:, trait_id_list, :]
subset_SNPs.to_csv(output_path, index=False)
logger.info(f"Saving filtered GWAS by regions and samples in {output_path}")
        with tiledb_unified as tiledb_iterator:
            tiledb_iterator_query = tiledb_iterator.query(
                return_incomplete=True,
                cond=f"MLOG10P > {pvalue_sig}",
                dims=["CHR", "POS", "TRAITID"],
                attrs=["SNPID", "EA", "NEA", "BETA", "SE", "EAF", "MLOG10P"],
            ).df[chr_list_int, :, trait_id_list]
            with open(output_path, mode="a") as f:
                for chunk in tiledb_iterator_query:
                    # Write each chunk through the open file handle so that
                    # successive chunks are appended rather than overwritten
                    chunk.to_csv(f, index=False)
        logger.info(f"Saving filtered GWAS by regions and samples in {output_path}")