BioThings v0.12.5 (#349)
* test add volume feature

* added in named volume option

* fixed parsing in named volume option

* needs to be tested

* remove named volume when the keep_container flag is true in the docker dumper

* added option to create multiple named volumes

* changed to a list

* fix: moved jmespath after other transformations

* add sqlite except for duplicate record _id

* test add volume feature

* added in named volume option

* fixed parsing in named volume option

* needs to be tested

* remove named volume when the keep_container flag is true in the docker dumper

* added option to create multiple named volumes

* changed to a list

* revert to pymongo 4.6.3 due to pymongo logs leaking into biothings hub logs

* rolling back pymongo version to see errors

* hide pymongo logs

Unnecessary pymongo logs were shown at the hub level after the pymongo 4.7 update.

From the pymongo changelog (https://pymongo.readthedocs.io/en/latest/changelog.html): "Added support for Python’s native logging library, enabling developers to customize the verbosity of log messages for their applications."
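
As a rough illustration (not necessarily how the hub hides them), pymongo 4.7+ emits its records under the "pymongo" logger namespace, so the standard logging module can cap their verbosity:

```python
import logging

# Raise the threshold on pymongo's logger so its DEBUG/INFO messages,
# introduced by the native logging integration in pymongo 4.7, stay out
# of the hub's log handlers.
logging.getLogger("pymongo").setLevel(logging.WARNING)
```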

* Add metadata query field validation (#330)

* Change the constructor for the ESQueryBuilder ...

Pass the actual BiothingsESMetadata object, rather than a property of it,
into the constructor for the ESQueryBuilder

<main change>
def __init__(
    ...
    metadata: BiothingsMetadata = None, <- used to be BiothingsMetadata.biothings_metadata
)

* Add additional logging to the QStringParser.parse method

* Make logical changes to ensure parity with 0.12.x branch

* Add initial attempt at building metadata fields

* Remove the metadata from the parser constructor

* Create metadata field set generation method

* Add metadata field checking at query runtime (see the sketch after this entry)

* Add significant docstring and logging to the parse method

* Add docstring comments to the structure of the metadata

* Improve error handling for the metadata access

---------

Co-authored-by: jschaff <[email protected]>
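
A minimal sketch of the kind of runtime field check described in the bullets above, assuming an Elasticsearch-style mapping; the helper names (`flatten_mapping`, `validate_query_fields`) are illustrative, not the hub's actual API.

```python
from typing import Iterable, Set


def flatten_mapping(mapping: dict, prefix: str = "") -> Set[str]:
    """Collect dotted field paths from an Elasticsearch-style 'properties' mapping."""
    fields = set()
    for name, spec in mapping.get("properties", {}).items():
        path = f"{prefix}{name}"
        fields.add(path)
        if isinstance(spec, dict) and "properties" in spec:
            fields |= flatten_mapping(spec, prefix=f"{path}.")
    return fields


def validate_query_fields(query_fields: Iterable[str], known_fields: Set[str]) -> None:
    """Reject a query that references a field absent from the metadata field set."""
    unknown = set(query_fields) - known_fields
    if unknown:
        raise ValueError(f"Unknown query field(s): {', '.join(sorted(unknown))}")
```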

* added comments for new dockercontainerdumper options volumes and named_volumes

* metadata query logical enhancements and bugfixes (#331)

* Correct minor logical errors and improve logging

* Add ResultFormatter for metadata field formatting

* Fix pipeline constructor arguments

* Remove breakpoint

---------

Co-authored-by: jschaff <[email protected]>

* Fix metadata fields. (#334)

* Improve metadata field index search (#333)

* Iterate over all potential metadata index fields

* Add 0 length check to ensure we return None for empty set

* Fix the regex tests by updating the method calls

* Update the metadata tests

---------

Co-authored-by: jschaff <[email protected]>

* client.snapshot.delete keyword arguments

Elasticsearch Python Client 8.0 requires keyword arguments

Credit to @ctrl-schaff for discovering this on pending.api
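
For reference, the 8.x client call then looks roughly like this (the repository and snapshot names are placeholders):

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")  # placeholder connection

# elasticsearch-py 8.0+ rejects positional arguments here; both values
# must be passed as keyword arguments.
client.snapshot.delete(repository="my_backup_repo", snapshot="snapshot_2024_07_22")
```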

* Add more idiomatic sqlite multiple document insert (#335)

* Add executemany improvement and extra exception handling (see the sketch after this entry)

* Modify the signature for bulk_write

* Fix the tuple malformatting

* Remove breakpoints

* Correct the arguments syntax

---------

Co-authored-by: jschaff <[email protected]>
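
A rough sketch of an executemany-style bulk insert like the one described above, under assumptions: documents are JSON-serialized and keyed by `_id`, and the table/column names are illustrative rather than the hub's actual sqlite schema.

```python
import json
import sqlite3


def bulk_insert_documents(documents, db_path="data_src_database.sqlite", table="src_docs"):
    """Insert many documents in a single executemany call instead of row-by-row inserts."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (_id TEXT PRIMARY KEY, doc TEXT)")
        rows = [(doc["_id"], json.dumps(doc)) for doc in documents]
        conn.executemany(f"INSERT INTO {table} (_id, doc) VALUES (?, ?)", rows)
        conn.commit()
    except sqlite3.IntegrityError:
        conn.rollback()  # e.g. a duplicate _id within the batch
        raise
    finally:
        conn.close()
```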

* Api customization (#336)

Initial implementation allowing the use of custom config_webs when creating APIs using the hub.

---------

Co-authored-by: mygene_hub <[email protected]>
Co-authored-by: Everaldo <[email protected]>
Co-authored-by: Dylan Welzel <[email protected]>

* Add advanced plugin support for the command line tooling  (#329)

* Remove requirements around the manifest file

* Revert the import order

* Re-add the btinspect import that was accidentally removed

---------

Co-authored-by: jschaff <[email protected]>

* Implement notifiers using asyncio (#337)

* Implement notifiers using asyncio library.

* Implement notifiers using asyncio library.

* Raise error if notifier is not implemented.

* Add certificate to the slack channel message.

* Add exponential backoff strategy.

* Add exponential backoff strategy.

* Clean code.

* Clean code.

* Add comment.

* Clean code.

* Move var to outside of the loop.

* Add comment.

* Keep same parameter name: event.

* Retry on HTTP 5xx responses from GA4 (see the sketch after this list).

* Add tests for Notifiers.

* Add tests for Notifiers.

* Add tests for Notifiers.

* Add tests for Notifiers.

* Configure SSL for Slack Notifier.
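
A minimal asyncio sketch of the exponential backoff and 5xx retry behaviour mentioned in the bullets above; the aiohttp usage and the retry parameters are assumptions, not the notifier's actual implementation.

```python
import asyncio

import aiohttp


async def post_with_backoff(url: str, payload: dict, max_retries: int = 3) -> int:
    """POST a payload, retrying on HTTP 5xx with exponentially growing delays."""
    delay = 1.0
    async with aiohttp.ClientSession() as session:
        for attempt in range(max_retries + 1):
            async with session.post(url, json=payload) as response:
                if response.status < 500:
                    return response.status  # success, or a non-retryable 4xx
            if attempt == max_retries:
                break
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff between retries
    return response.status


# e.g. asyncio.run(post_with_backoff("https://example.org/collect", {"events": []}))
```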

* Fix metadata mapping and remove duplicated code. (#341)

* Fix Slack notifier test. (#343)

* Upgrade to tornado 6.4.1 (#342)

* Correct the pyproject.toml black configuration ... (#340)

See https://black.readthedocs.io/en/stable/usage_and_configuration/the_basics.html#configuration-format

Co-authored-by: jschaff <[email protected]>

* Replace deprecated imp module with importlib (#339)

The imp module has been deprecated since Python 3.4, which was released in March 2014. Python 3.12, released in October 2023, removed support for the imp module entirely.
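
For context, the importlib equivalent of the old `imp.load_source` pattern looks roughly like this (the module name and path are placeholders):

```python
import importlib.util

# Rough replacement for the removed imp.load_source("config", "/path/to/config.py")
spec = importlib.util.spec_from_file_location("config", "/path/to/config.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
```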

* Fix pytest "Set" issue for older Python versions (3.7, 3.8)

* Build Config Err Fix

In the event an error occurs in one build config, every item in the list afterwards will contain the error. Resetting the error to None inside the loop fixes this.
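
A minimal sketch of the pattern being fixed, with hypothetical names (`validate`, `summarize_build_configs`); the key point is that the error variable is reset at the top of each iteration.

```python
def validate(conf: dict) -> None:
    """Hypothetical validation step standing in for the real build-config check."""
    if "name" not in conf:
        raise ValueError("build config is missing a name")


def summarize_build_configs(build_configs):
    """Collect per-config status; the error is reset every iteration so a failure
    in one config is not reported for every config after it."""
    results = []
    for conf in build_configs:
        err = None  # reset inside the loop; the bug was initializing it only once, before the loop
        try:
            validate(conf)
        except Exception as exc:
            err = str(exc)
        results.append({"name": conf.get("name"), "error": err})
    return results
```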

* Add CLI mongodb support (#338)

* Remove requirements around the manifest file

* Revert the import order

* Generalize structure for mongodb support

---------

Co-authored-by: jschaff <[email protected]>

* Add id checking to IgnoreDuplicatedStorage ... (#344)

* Add id checking to IgnoreDuplicatedStorage ...

For the sequential case, this storage works as expected because each
document is checked for uniqueness individually. When processing in
batches, more filtering is needed, especially when the number of
documents is small and the ratio of duplicates among them is high.

If each batch uploaded to the database contains duplicates, the entire
batch keeps getting thrown out, and in the end no documents are uploaded
at all. Instead, if we verify the uniqueness constraint prior to
uploading by ensuring a one-to-one ratio of ids to documents in our
collection, we can ensure a safe upload (see the sketch after this entry).

This still discards identical or near-identical documents, but without
throwing out an entire batch of otherwise valid documents during a batch
upload.

* Remove breakpoint

---------

Co-authored-by: jschaff <[email protected]>
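
A rough sketch of that pre-upload check, assuming documents are plain dicts keyed by `_id`; the function name and the in-memory dedup are illustrative, not the storage class's exact code.

```python
def dedupe_batch(documents):
    """Keep the first document per _id so that a batch insert with a
    unique-id constraint does not fail and discard otherwise valid documents."""
    seen_ids = set()
    unique_docs = []
    for doc in documents:
        if doc["_id"] in seen_ids:
            continue  # identical or near-identical duplicate; silently dropped
        seen_ids.add(doc["_id"])
        unique_docs.append(doc)
    return unique_docs
```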

* Refactor ES exceptions (#346)

* Refactor ES exceptions.

* Add tests.

* Rollback changes in the connections tests file.

* Refactor ES exceptions.

* Review changes.

* Raise exceptions correctly. (#347)

* Raise exceptions correctly.

* Raise exceptions correctly.

---------

Co-authored-by: jal347 <[email protected]>
Co-authored-by: Everaldo <[email protected]>
Co-authored-by: Chunlei Wu <[email protected]>
Co-authored-by: Dylan Welzel <[email protected]>
Co-authored-by: jschaff <[email protected]>
Co-authored-by: Chunlei Wu <[email protected]>
Co-authored-by: Jason Lin <[email protected]>
Co-authored-by: mygene_hub <[email protected]>
9 people authored Jul 22, 2024
1 parent 6573562 commit e154eaa
Showing 30 changed files with 1,355 additions and 420 deletions.
62 changes: 44 additions & 18 deletions biothings/cli/__init__.py
@@ -1,5 +1,7 @@
import functools
import importlib
import importlib.util
import inspect
import logging
import os
import pathlib
@@ -69,35 +71,59 @@


def setup_config():
"""Setup a config module necessary to launch the CLI"""
"""
Setup a config module necessary to launch the CLI
Depending on the backend hub database, the order of configuration
matters. If we attempt to load a module that checks for the configuration
we'll have to ensure that the configuration is properly configured prior
to loading the module
"""
working_dir = pathlib.Path().resolve()
_config = DummyConfig("config")
_config.HUB_DB_BACKEND = {
"module": "biothings.utils.sqlite3",
"sqlite_db_folder": ".biothings_hub",
}
_config.DATA_SRC_DATABASE = ".data_src_database"
_config.DATA_ARCHIVE_ROOT = ".biothings_hub/archive"
# _config.LOG_FOLDER = ".biothings_hub/logs"
_config.LOG_FOLDER = None # disable file logging, only log to stdout
_config.DATA_PLUGIN_FOLDER = f"{working_dir}"
_config.hub_db = importlib.import_module(_config.HUB_DB_BACKEND["module"])
configuration_instance = DummyConfig("config")

try:
config_mod = importlib.import_module("config")
for attr in dir(config_mod):
value = getattr(config_mod, attr)
if isinstance(value, ConfigurationError):
raise ConfigurationError("%s: %s" % (attr, str(value)))
setattr(_config, attr, value)
raise ConfigurationError(f"{attr}: {value}")
setattr(configuration_instance, attr, value)
except ModuleNotFoundError:
logger.debug("The config.py does not exists in the working directory, use default biothings.config")
logger.warning(ModuleNotFoundError)
logger.warning("Unable to find `config` module. Using the default configuration")
finally:
sys.modules["config"] = configuration_instance
sys.modules["biothings.config"] = configuration_instance

configuration_instance.HUB_DB_BACKEND = {
"module": "biothings.utils.sqlite3",
"sqlite_db_folder": ".biothings_hub",
}
configuration_instance.DATA_SRC_SERVER = "localhost"
configuration_instance.DATA_SRC_DATABASE = "data_src_database"
configuration_instance.DATA_ARCHIVE_ROOT = ".biothings_hub/archive"
configuration_instance.LOG_FOLDER = ".biothings_hub/logs"
configuration_instance.DATA_PLUGIN_FOLDER = f"{working_dir}"

try:
configuration_instance.hub_db = importlib.import_module(configuration_instance.HUB_DB_BACKEND["module"])
except ImportError as import_err:
logger.exception(import_err)
raise import_err

sys.modules["config"] = _config
sys.modules["biothings.config"] = _config
configuration_repr = [
f"{configuration_key}: [{configuration_value}]"
for configuration_key, configuration_value in inspect.getmembers(configuration_instance)
if configuration_value is not None
]
logger.info("CLI Configuration:\n%s", "\n".join(configuration_repr))


def main():
"""The main entry point for running the BioThings CLI to test your local data plugins."""
"""
The entrypoint for running the BioThings CLI to test your local data plugin
"""
if not typer_avail:
logger.error(
'"typer" package is required for CLI feature. Use "pip install biothings[cli]" or "pip install typer[all]" to install.'
165 changes: 107 additions & 58 deletions biothings/cli/utils.py
@@ -4,9 +4,11 @@
import os
import pathlib
import shutil
import sys
import time
from pprint import pformat
from types import SimpleNamespace
from typing import Union

import tornado.template
import typer
@@ -16,17 +18,19 @@
from rich.panel import Panel
from rich.table import Table

import biothings.utils.inspect as btinspect
from biothings.utils import es
from biothings.utils.common import timesofar
from biothings.utils.dataload import dict_traverse
from biothings.utils.serializer import load_json, to_json
from biothings.utils.sqlite3 import get_src_db
from biothings.utils.workers import upload_worker
import biothings.utils.inspect as btinspect


def get_logger(name=None):
"""Get a logger with the given name. If name is None, return the root logger."""
"""
Get a logger with the given name.
If name is None, return the root logger.
"""
# basicConfig has been setup in cli.py, so we don't need to do it again here
# If everything works as expected, we can delete this block.
# logging.basicConfig(
@@ -35,10 +39,13 @@ def get_logger(name=None):
# datefmt="[%X]",
# handlers=[RichHandler(rich_tracebacks=True, show_path=False)],
# )
logger = logging.getLogger(name=None)
logger = logging.getLogger(name)
return logger


logger = get_logger(name=__name__)


def run_sync_or_async_job(func, *args, **kwargs):
"""When func is defined as either normal or async function/method, we will call this function properly and return the results.
For an async function/method, we will use CLIJobManager to run it.
@@ -54,15 +61,30 @@ def run_sync_or_async_job(func, *args, **kwargs):
return func(*args, **kwargs)


def load_plugin_managers(plugin_path, plugin_name=None, data_folder=None):
"""Load a data plugin from <plugin_path>, and return a tuple of (dumper_manager, upload_manager)"""
from biothings import config as btconfig
def load_plugin_managers(
plugin_path: Union[str, pathlib.Path], plugin_name: str = None, data_folder: Union[str, pathlib.Path] = None
):
"""
Load a data plugin from <plugin_path>, and return a tuple of (dumper_manager, upload_manager)
"""
from biothings import config
from biothings.hub.dataload.dumper import DumperManager
from biothings.hub.dataload.uploader import UploaderManager
from biothings.hub.dataplugin.assistant import LocalAssistant
from biothings.hub.dataplugin.manager import DataPluginManager
from biothings.utils.hub_db import get_data_plugin

_plugin_path = pathlib.Path(plugin_path).resolve()
config.DATA_PLUGIN_FOLDER = _plugin_path.parent.as_posix()
sys.path.append(str(_plugin_path.parent))

if plugin_name is None:
plugin_name = _plugin_path.name

if data_folder is None:
data_folder = pathlib.Path(f"./{plugin_name}")
data_folder = pathlib.Path(data_folder).resolve().absolute()

plugin_manager = DataPluginManager(job_manager=None)
dmanager = DumperManager(job_manager=None)
upload_manager = UploaderManager(job_manager=None)
@@ -71,12 +93,9 @@ def load_plugin_managers(plugin_path, plugin_name=None, data_folder=None):
LocalAssistant.dumper_manager = dmanager
LocalAssistant.uploader_manager = upload_manager

_plugin_path = pathlib.Path(plugin_path).resolve()
btconfig.DATA_PLUGIN_FOLDER = _plugin_path.parent.as_posix()
plugin_name = plugin_name or _plugin_path.name
data_folder = data_folder or f"./{plugin_name}"
assistant = LocalAssistant(f"local://{plugin_name}")
# print(assistant.plugin_name, plugin_name, _plugin_path.as_posix(), btconfig.DATA_PLUGIN_FOLDER)
logger.debug(assistant.plugin_name, plugin_name, _plugin_path.as_posix(), config.DATA_PLUGIN_FOLDER)

dp = get_data_plugin()
dp.remove({"_id": assistant.plugin_name})
dp.insert_one(
@@ -88,7 +107,7 @@ def load_plugin_managers(plugin_path, plugin_name=None, data_folder=None):
"active": True,
},
"download": {
"data_folder": data_folder, # tmp path to your data plugin
"data_folder": str(data_folder), # tmp path to your data plugin
},
}
)
@@ -115,40 +134,40 @@ def get_plugin_name(plugin_name=None, with_working_dir=True):
return plugin_name, working_dir if with_working_dir else plugin_name


def is_valid_data_plugin_dir(data_plugin_dir):
"""Return True/False if the given folder is a valid data plugin folder (contains either manifest.yaml or manifest.json)"""
return (
pathlib.Path(data_plugin_dir, "manifest.yaml").exists()
or pathlib.Path(data_plugin_dir, "manifest.json").exists()
)


def load_plugin(plugin_name=None, dumper=True, uploader=True, logger=None):
"""Return a plugin object for the given plugin_name.
def load_plugin(plugin_name: str = None, dumper: bool = True, uploader: bool = True, logger: logging.Logger = None):
"""
Return a plugin object for the given plugin_name.
If dumper is True, include a dumper instance in the plugin object.
If uploader is True, include uploader_classes in the plugin object.
If <plugin_name> is not valid, raise the proper error and exit.
"""
logger = logger or get_logger(__name__)

_plugin_name, working_dir = get_plugin_name(plugin_name, with_working_dir=True)
data_plugin_dir = pathlib.Path(working_dir) if plugin_name is None else pathlib.Path(working_dir, _plugin_name)
if not is_valid_data_plugin_dir(data_plugin_dir):
if plugin_name is None:
# current working_dir has the data plugin
data_plugin_dir = pathlib.Path(working_dir)
data_folder = pathlib.Path(".").resolve().absolute()
plugin_args = {"plugin_path": working_dir, "plugin_name": None, "data_folder": data_folder}
else:
data_plugin_dir = pathlib.Path(working_dir, _plugin_name)
plugin_args = {"plugin_path": _plugin_name, "plugin_name": None, "data_folder": None}
try:
dumper_manager, uploader_manager = load_plugin_managers(**plugin_args)
except Exception as gen_exc:
logger.exception(gen_exc)
if plugin_name is None:
err = (
plugin_loading_error_message = (
"This command must be run inside a data plugin folder. Please go to a data plugin folder and try again!"
)
else:
err = f'The data plugin folder "{data_plugin_dir}" is not a valid data plugin folder. Please try another.'
logger.error(err, extra={"markup": True})
plugin_loading_error_message = (
f'The data plugin folder "{data_plugin_dir}" is not a valid data plugin folder. Please try another.'
)
logger.error(plugin_loading_error_message, extra={"markup": True})
raise typer.Exit(1)

if plugin_name is None:
# current working_dir has the data plugin
dumper_manager, uploader_manager = load_plugin_managers(working_dir, data_folder=".")
else:
dumper_manager, uploader_manager = load_plugin_managers(_plugin_name)

current_plugin = SimpleNamespace(
plugin_name=_plugin_name,
data_plugin_dir=data_plugin_dir,
@@ -288,7 +307,12 @@ def do_dump_and_upload(plugin_name, logger=None):


def process_inspect(source_name, mode, limit, merge, logger, do_validate, output=None):
"""Perform inspect for the given source. It's used in do_inspect function below"""
"""
Perform inspect for the given source. It's used in do_inspect function below
"""
from biothings import config
from biothings.utils import hub_db

VALID_INSPECT_MODES = ["jsonschema", "type", "mapping", "stats"]
mode = mode.split(",")
if "jsonschema" in mode:
@@ -305,14 +329,14 @@ def process_inspect(source_name, mode, limit, merge, logger, do_validate, output
t0 = time.time()
data_provider = ("src", source_name)

src_db = get_src_db()
src_db = hub_db.get_src_db()
pre_mapping = "mapping" in mode
src_cols = src_db[source_name]
inspected = {}
converters, mode = btinspect.get_converters(mode)
for m in mode:
inspected.setdefault(m, {})
cur = src_cols.findv2()
cur = src_cols.find()
res = btinspect.inspect_docs(
cur,
mode=mode,
@@ -500,18 +524,24 @@ def do_inspect(
process_inspect(source_name, mode, limit, merge, logger=logger, do_validate=True, output=output)


def get_manifest_content(working_dir):
"""return the manifest content of the data plugin in the working directory"""
manifest_file_yml = os.path.join(working_dir, "manifest.yaml")
manifest_file_json = os.path.join(working_dir, "manifest.json")
if os.path.isfile(manifest_file_yml):
manifest = yaml.safe_load(open(manifest_file_yml))
return manifest
elif os.path.isfile(manifest_file_json):
manifest = load_json(open(manifest_file_json).read())
return manifest
def get_manifest_content(working_dir: Union[str, pathlib.Path]) -> dict:
"""
return the manifest content of the data plugin in the working directory
"""
working_dir = pathlib.Path(working_dir).resolve().absolute()
manifest_file_yml = working_dir.joinpath("manifest.yaml")
manifest_file_json = working_dir.joinpath("manifest.json")

manifest = {}
if manifest_file_yml.exists():
with open(manifest_file_yml, "r", encoding="utf-8") as yaml_handle:
manifest = yaml.safe_load(yaml_handle)
elif manifest_file_json.exists():
with open(manifest_file_json, "r", encoding="utf-8") as json_handle:
manifest = load_json(json_handle)
else:
raise FileNotFoundError("manifest file does not exits in current working directory")
logger.info("No manifest file discovered")
return manifest


########################
@@ -520,14 +550,21 @@ def get_manifest_content(working_dir):


def get_uploaders(working_dir: pathlib.Path):
"""A helper function to get the uploaders from the manifest file in the working directory, used in show_uploaded_sources function below"""
"""
A helper function to get the uploaders from the manifest file in the working directory
used in show_uploaded_sources function below
"""
data_plugin_name = working_dir.name

manifest = get_manifest_content(working_dir)
upload_section = manifest.get("uploader")
table_space = [data_plugin_name]
if not upload_section:
upload_sections = manifest.get("uploaders")
upload_section = manifest.get("uploader", None)
upload_sections = manifest.get("uploaders", None)

if upload_section is None and upload_sections is None:
table_space = [data_plugin_name]
elif upload_section is None and upload_sections is not None:
table_space = [item["name"] for item in upload_sections]

return table_space


@@ -573,10 +610,15 @@ def get_uploaded_collections(src_db, uploaders):


def show_uploaded_sources(working_dir, plugin_name):
"""A helper function to show the uploaded sources from given plugin."""
"""
A helper function to show the uploaded sources from given plugin.
"""
from biothings import config
from biothings.utils import hub_db

console = Console()
uploaders = get_uploaders(working_dir)
src_db = get_src_db()
src_db = hub_db.get_src_db()
uploaded_sources, archived_sources, temp_sources = get_uploaded_collections(src_db, uploaders)
if not uploaded_sources:
console.print(
@@ -670,10 +712,14 @@ def do_list(plugin_name=None, dump=False, upload=False, hubdb=False, logger=None


def serve(host, port, plugin_name, table_space):
"""Serve a simple API server to query the data plugin source."""
"""
Serve a simple API server to query the data plugin source.
"""
from .web_app import main
from biothings import config
from biothings.utils import hub_db

src_db = get_src_db()
src_db = hub_db.get_src_db()
rprint(f"[green]Serving data plugin source: {plugin_name}[/green]")
asyncio.run(main(host=host, port=port, db=src_db, table_space=table_space))

@@ -724,8 +770,11 @@

def do_clean_uploaded_sources(working_dir, plugin_name):
"""Remove all uploaded sources by a data plugin in the working directory."""
from biothings import config
from biothings.utils import hub_db

uploaders = get_uploaders(working_dir)
src_db = get_src_db()
src_db = hub_db.get_src_db()
uploaded_sources = []
for item in src_db.collection_names():
if item in uploaders: