
Expose custom 'walkers' to CLI. #551

Merged (6 commits) on Aug 25, 2023

Conversation

@danielballan (Member) commented on Aug 15, 2023

Prompted by @dylanmcreynolds' question in the Nikea Slack.

The directory-walker in Tiled typically represents each file as a separate, single logical node. But what if we want to group files together and represent them as a single node? A common example is a TIFF sequence (A001.tif, A002.tif, ... A100.tif).

Until recently your options were:

  • Write your own directory walker. This is the approach that @jmaruland referred to in the Slack discussion.
  • Plug into the existing walker with a recently-removed feature called subdirectory_handler, which was both confusing and limited.

Now, you can extend the directory-walking machinery nicely. This PR:

  • Exposes this feature to the CLI. Previously it could only be accessed via the Python API.
  • Sketches a working example here in the PR description.

The example supposes that each node is backed by two files, named like x.json and x.csv for various x.

$ tree files
files
├── a.csv
├── a.json
├── b.csv
└── b.json
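To follow along, the example tree can be generated with a short script. The table contents and the `{"color": "blue"}` metadata for `a` are chosen to match the output shown later in this description; the metadata for `b` is an arbitrary illustrative value.

```python
# generate_files.py -- create the example tree shown above.
import json
from pathlib import Path

root = Path("files")
root.mkdir(exist_ok=True)
for stem, color in [("a", "blue"), ("b", "red")]:
    # A small table with an index column, matching the CSV output
    # shown later in this description.
    (root / f"{stem}.csv").write_text(",A,B\n0,1,4\n1,2,5\n2,3,6\n")
    (root / f"{stem}.json").write_text(json.dumps({"color": color}))
```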

In general you could make more complex examples with more files and/or with grouping less direct than "They have the same name but different extensions." The Tiled machinery would be the same.

You need two pieces of code:

  1. A custom "walker" that will be called for each directory in your (potentially nested) tree of files. It handles the files it recognizes and returns the rest.
  2. A custom adapter that knows how to construct a single node from the grouped files.

Working examples of both are defined here. Place this in your working directory as custom.py to follow along below.

# custom.py
import collections
import json

from tiled.adapters.csv import read_csv
from tiled.catalog.register import (
    Asset,
    DataSource,
    Management,
    create_node_safe,
    dict_or_none,
    ensure_uri,
    logger,
)


async def walk_csv_with_json(
    catalog,
    path,
    files,
    directories,
    settings,
):
    """
    Process groups of files as a single Node.

    As an example, suppose that we have data and metadata represented by a pair
    of files: a table as CSV and metadata as JSON, named like X.csv and
    X.json. Other examples could have different names or involve more than two
    files; this is just a simple example.

    But there may be other CSV or JSON files around, not involved in this scheme.
    Those should pass through this walker untouched, to be handled downstream as
    normal stand-alone CSV or JSON files.

    This function groups the files of interest, registers them, and passes on
    the files not of interest.
    """
    unhandled_directories = directories
    unhandled_files = []

    # The details of this section are _not_ related to Tiled specifically,
    # just general Python code for identifying which files match the pattern
    # of interest. The strategy here will vary widely depending on the
    # specifics. See the function tiled.catalog.register.tiff_sequence
    # for a different example.

    # Refresher on jargon in the Python Path API:
    # The stem of "a/b/c/x.csv" is "x".
    # The suffix of "a/b/c/x.csv" is ".csv".
    files_by_stem_and_suffix = collections.defaultdict(dict)
    matched_stems = []
    for file in files:
        files_by_stem_and_suffix[file.stem][file.suffix] = file

    for stem, files_by_suffix in files_by_stem_and_suffix.items():
        # Can we find a matching x.csv and x.json?
        if (".csv" in files_by_suffix) and (".json" in files_by_suffix):
            matched_stems.append(stem)
            unhandled_files.extend(
                file
                for suffix, file in files_by_suffix.items()
                if suffix not in {".csv", ".json"}
            )
        else:
            unhandled_files.extend(files_by_suffix.values())

    # This is a way to write a mimetype that means "many files".
    mimetype = "multipart/related;type=example"

    # Here comes the Tiled part, where we construct an Adapter to extract
    # metadata and structure information, and register it with a catalog.
    for stem in matched_stems:
        logger.info("    Grouped CSV and JSON into a node '%s'", stem)
        adapter_class = settings.adapters_by_mimetype[mimetype]
        files_by_suffix = files_by_stem_and_suffix[stem]
        csv_file = files_by_suffix[".csv"]
        json_file = files_by_suffix[".json"]
        try:
            adapter = adapter_class(csv_file, json_file)
        except Exception:
            logger.exception("    SKIPPED: Error constructing adapter for '%s'", stem)
            continue
        await create_node_safe(
            catalog,
            key=stem,
            structure_family=adapter.structure_family,
            metadata=dict(adapter.metadata()),
            specs=adapter.specs,
            data_sources=[
                DataSource(
                    mimetype=mimetype,
                    structure=dict_or_none(adapter.structure()),
                    parameters={},
                    management=Management.external,
                    assets=[
                        Asset(
                            data_uri=str(ensure_uri(str(csv_file.absolute()))),
                            is_directory=False,
                        ),
                        Asset(
                            data_uri=str(ensure_uri(str(json_file.absolute()))),
                            is_directory=False,
                        ),
                    ],
                )
            ],
        )
    return unhandled_files, unhandled_directories


def read_csv_with_json(csv_file, json_file, metadata=None, **kwargs):
    if metadata is None:
        metadata = json.loads(json_file.read_text())
    return read_csv(csv_file, metadata=metadata, **kwargs)
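The pairing logic at the heart of the walker can be exercised on its own, without a Tiled installation. This is a sketch that mirrors the stem/suffix grouping in walk_csv_with_json above, using plain `pathlib.PurePath` objects; the helper name `pair_by_stem` is hypothetical.

```python
import collections
from pathlib import PurePath


def pair_by_stem(files):
    """Return (matched_stems, unhandled) using the same rule as the
    walker above: a stem matches when both X.csv and X.json exist."""
    by_stem = collections.defaultdict(dict)
    for f in files:
        by_stem[f.stem][f.suffix] = f
    matched, unhandled = [], []
    for stem, by_suffix in by_stem.items():
        if ".csv" in by_suffix and ".json" in by_suffix:
            matched.append(stem)
            # Pass through any extra files sharing this stem.
            unhandled.extend(
                f for suffix, f in by_suffix.items()
                if suffix not in {".csv", ".json"}
            )
        else:
            unhandled.extend(by_suffix.values())
    return matched, unhandled


files = [PurePath(p) for p in ["a.csv", "a.json", "b.csv", "stray.txt"]]
matched, unhandled = pair_by_stem(files)
# "a" has both extensions; "b" lacks a JSON; stray.txt passes through.
```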

Start a server:

tiled serve directory \
  --walker custom:walk_csv_with_json \
  --adapter 'multipart/related;type=example=custom:read_csv_with_json' \
  --verbose \
  --public \
  files/

We see:

Creating catalog database at /tmp/tmp177tgr7e/catalog.db
Indexing 'files/' ...
  Overwriting '/'
  Walking 'files'
    Grouped CSV and JSON into a node 'b'
    Grouped CSV and JSON into a node 'a'
Indexing complete. Starting server...

The metadata and data can be accessed:

$ http :8000/api/v1/metadata/a | jq .data.attributes.metadata
{
  "color": "blue"
}
$ http :8000/api/v1/node/full/a Accept:text/csv
HTTP/1.1 200 OK
Set-Cookie: tiled_csrf=_PpHNuNXtdLrZ-9APkSlULXM7hyII8dzbBYNT0hpwzo; HttpOnly; Path=/; SameSite=lax
content-length: 33
content-type: text/csv; charset=utf-8
date: Tue, 15 Aug 2023 06:02:16 GMT
etag: ef4697000229bc8798aadd68b07cc5d9
server: uvicorn
server-timing: read;dur=2546.7, tok;dur=0.1, pack;dur=1.3, app;dur=2559.9

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6

There are some details in the implementation that deserve a closer look:

  • In testing this I found an important bug. The Assets passed in to read_csv_with_json come out of the database in a nondeterministic order, so you get the wrong output (JSON metadata where you wanted a table) 50% of the time. The code involved here is quite new. This should be easy to fix.
  • In this particular example, the JSON file is fully de-normalized into the catalog database, so it may not make sense to track that as an Asset at all. A more complex example would involve both the metadata and data split across multiple files.
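Until that ordering bug is fixed, an adapter can defend itself by dispatching on file suffix rather than argument position. This is a hypothetical variant, not part of the PR; only the suffix-dispatch step is shown.

```python
from pathlib import Path


def read_csv_with_json_defensive(*files, **kwargs):
    """Accept the two asset paths in either order, dispatching by suffix.
    (Hypothetical workaround; the real fix belongs in the catalog code.)"""
    by_suffix = {Path(f).suffix: Path(f) for f in files}
    csv_file, json_file = by_suffix[".csv"], by_suffix[".json"]
    # ...from here, proceed as read_csv_with_json(csv_file, json_file, **kwargs)
    return csv_file, json_file
```

With this, the order in which the catalog hands back the assets no longer matters.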

@danielballan
A built-in example may also be helpful for reference: the TIFF sequence walker

# Matches filename with (optional) non-digits \D followed by digits \d
# and then the file extension .tif or .tiff.
TIFF_SEQUENCE_STEM_PATTERN = re.compile(r"^(\D*)(\d+)\.(?:tif|tiff)$")
async def tiff_sequence(
    catalog,
    path,
    files,
    directories,
    settings,
):
    """
    Group files in the given directory into TIFF sequences.

    We are looking for any files:
    - with file extension .tif or .tiff
    - with file name ending in a number

    We group these into sorted groups and make one Node for each.
    A group may have one or more items.
    """
    unhandled_directories = directories
    unhandled_files = []
    sequences = collections.defaultdict(list)
    for file in files:
        if file.is_file():
            match = TIFF_SEQUENCE_STEM_PATTERN.match(file.name)
            if match:
                sequence_name, _sequence_number = match.groups()
                sequences[sequence_name].append(file)
                continue
        unhandled_files.append(file)
    mimetype = "multipart/related;type=image/tiff"
    for name, sequence in sorted(sequences.items()):
        logger.info("    Grouped %d TIFFs into a sequence '%s'", len(sequence), name)
        adapter_class = settings.adapters_by_mimetype[mimetype]
        key = settings.key_from_filename(name)
        try:
            adapter = adapter_class(*sequence)
        except Exception:
            logger.exception("    SKIPPED: Error constructing adapter for '%s'", name)
            return
        await create_node_safe(
            catalog,
            key=key,
            structure_family=adapter.structure_family,
            metadata=dict(adapter.metadata()),
            specs=adapter.specs,
            data_sources=[
                DataSource(
                    mimetype=mimetype,
                    structure=dict_or_none(adapter.structure()),
                    parameters={},
                    management=Management.external,
                    assets=[
                        Asset(
                            data_uri=str(ensure_uri(str(item.absolute()))),
                            is_directory=False,
                        )
                        for item in sorted(sequence)
                    ],
                )
            ],
        )
    return unhandled_files, unhandled_directories
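The stem pattern can be checked in isolation. This snippet simply exercises the regex quoted above: files in one sequence share a non-digit prefix, and the trailing digits give the ordering within the sequence.

```python
import re

TIFF_SEQUENCE_STEM_PATTERN = re.compile(r"^(\D*)(\d+)\.(?:tif|tiff)$")

# Group 1 is the (possibly empty) non-digit prefix; group 2 is the
# digit run that orders the file within its sequence.
match = TIFF_SEQUENCE_STEM_PATTERN.match("A001.tif")
name, number = match.groups()
```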

@danielballan danielballan marked this pull request as ready for review August 24, 2023 20:22
@danielballan

Thanks @Wiebke and @dylanmcreynolds.

@danielballan danielballan merged commit 28d32ae into bluesky:main Aug 25, 2023
8 checks passed
@danielballan danielballan deleted the custom-walker branch August 25, 2023 00:17