
Expose custom 'walkers' to CLI. #551

Merged (6 commits) on Aug 25, 2023

Conversation

@danielballan (Member) commented on Aug 15, 2023

Prompted by @dylanmcreynolds' question in the Nikea Slack.

The directory-walker in Tiled typically represents each file as a separate, single logical node. But what if we want to group files together and represent them as a single node? A common example is a TIFF sequence (A001.tif, A002.tif, ... A100.tif).

Until recently your options were:

  • Write your own directory walker. This is the approach that @jmaruland referred to in the Slack discussion.
  • Plug into the existing walker with a recently-removed feature called subdirectory_handler, which was both confusing and limited.

Now, you can extend the directory-walking machinery nicely. This PR:

  • Exposes this feature to the CLI. Previously it could only be accessed via the Python API.
  • Sketches a working example here in the PR description.

The example supposes that each node is backed by two files, named like x.json and x.csv for various x.

$ tree files
files
├── a.csv
├── a.json
├── b.csv
└── b.json
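To follow along, the example tree can be generated with a short script. The table contents and the `{"color": "blue"}` metadata for `a` are chosen to match the output shown later in this description; the metadata for `b` is an arbitrary illustrative value.

```python
# generate_files.py -- create the example tree shown above.
import json
from pathlib import Path

root = Path("files")
root.mkdir(exist_ok=True)
for stem, color in [("a", "blue"), ("b", "red")]:
    # A small table with an index column, matching the CSV output
    # shown later in this description.
    (root / f"{stem}.csv").write_text(",A,B\n0,1,4\n1,2,5\n2,3,6\n")
    (root / f"{stem}.json").write_text(json.dumps({"color": color}))
```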

In general you could make more complex examples with more files and/or with grouping less direct than "They have the same name but different extensions." The Tiled machinery would be the same.

You need two pieces of code:

  1. A custom "walker" that will be called for each directory in your (potentially nested) tree of files. It handles the files it recognizes and returns the rest.
  2. A custom adapter that knows how to construct a single node from the grouped files.

Working examples of both are defined here. Place this in your working directory as custom.py to follow along below.

# custom.py
import collections
import json

from tiled.adapters.csv import read_csv
from tiled.catalog.register import (
    Asset,
    DataSource,
    Management,
    create_node_safe,
    dict_or_none,
    ensure_uri,
    logger,
)


async def walk_csv_with_json(
    catalog,
    path,
    files,
    directories,
    settings,
):
    """
    Process groups of files as a single Node.

    As an example, suppose that we have data and metadata represented by a pair
    of files: a table as CSV and metadata as JSON, named like X.csv and
    X.json. Other examples could have different names or involve more than two
    files; this is just a simple example.

    But there may be other CSV or JSON files around, not involved in this scheme.
    Those should pass through this walker untouched, to be handled downstream as
    normal stand-alone CSV or JSON files.

    This function groups the files of interest, registers them, and passes on
    the files not of interest.
    """
    unhandled_directories = directories
    unhandled_files = []

    # The details of this section are _not_ related to Tiled specifically,
    # just general Python code for identifying which files match the pattern
    # of interest. The strategy here will vary widely depending on the
    # specifics. See the function tiled.catalog.register.tiff_sequence
    # for a different example.

    # Refresher on jargon in the Python Path API:
    # The stem of "a/b/c/x.csv" is "x".
    # The suffix of "a/b/c/x.csv" is ".csv".
    files_by_stem_and_suffix = collections.defaultdict(dict)
    matched_stems = []
    for file in files:
        files_by_stem_and_suffix[file.stem][file.suffix] = file

    for stem, files_by_suffix in files_by_stem_and_suffix.items():
        # Can we find a matching x.csv and x.json?
        if (".csv" in files_by_suffix) and (".json" in files_by_suffix):
            matched_stems.append(stem)
            unhandled_files.extend(
                file
                for suffix, file in files_by_suffix.items()
                if suffix not in {".csv", ".json"}
            )
        else:
            unhandled_files.extend(files_by_suffix.values())

    # This is a way to write a mimetype that means "many files".
    mimetype = "multipart/related;type=example"

    # Here comes the Tiled part, where we construct an Adapter to extract
    # metadata and structure information, and register it with a catalog.
    for stem in matched_stems:
        logger.info("    Grouped CSV and JSON into a node '%s'", stem)
        adapter_class = settings.adapters_by_mimetype[mimetype]
        files_by_suffix = files_by_stem_and_suffix[stem]
        csv_file = files_by_suffix[".csv"]
        json_file = files_by_suffix[".json"]
        try:
            adapter = adapter_class(csv_file, json_file)
        except Exception:
            logger.exception("    SKIPPED: Error constructing adapter for '%s'", stem)
            continue
        await create_node_safe(
            catalog,
            key=stem,
            structure_family=adapter.structure_family,
            metadata=dict(adapter.metadata()),
            specs=adapter.specs,
            data_sources=[
                DataSource(
                    mimetype=mimetype,
                    structure=dict_or_none(adapter.structure()),
                    parameters={},
                    management=Management.external,
                    assets=[
                        Asset(
                            data_uri=str(ensure_uri(str(csv_file.absolute()))),
                            is_directory=False,
                        ),
                        Asset(
                            data_uri=str(ensure_uri(str(json_file.absolute()))),
                            is_directory=False,
                        ),
                    ],
                )
            ],
        )
    return unhandled_files, unhandled_directories


def read_csv_with_json(csv_file, json_file, metadata=None, **kwargs):
    if metadata is None:
        metadata = json.loads(json_file.read_text())
    return read_csv(csv_file, metadata=metadata, **kwargs)
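The pairing logic at the heart of the walker can be exercised on its own, without a Tiled installation. This is a sketch that mirrors the stem/suffix grouping in walk_csv_with_json above, using plain `pathlib.PurePath` objects; the helper name `pair_by_stem` is hypothetical.

```python
import collections
from pathlib import PurePath


def pair_by_stem(files):
    """Return (matched_stems, unhandled) using the same rule as the
    walker above: a stem matches when both X.csv and X.json exist."""
    by_stem = collections.defaultdict(dict)
    for f in files:
        by_stem[f.stem][f.suffix] = f
    matched, unhandled = [], []
    for stem, by_suffix in by_stem.items():
        if ".csv" in by_suffix and ".json" in by_suffix:
            matched.append(stem)
            # Pass through any extra files sharing this stem.
            unhandled.extend(
                f for suffix, f in by_suffix.items()
                if suffix not in {".csv", ".json"}
            )
        else:
            unhandled.extend(by_suffix.values())
    return matched, unhandled


files = [PurePath(p) for p in ["a.csv", "a.json", "b.csv", "stray.txt"]]
matched, unhandled = pair_by_stem(files)
# "a" has both extensions; "b" lacks a JSON; stray.txt passes through.
```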

Start a server:

tiled serve directory \
  --walker custom:walk_csv_with_json \
  --adapter 'multipart/related;type=example=custom:read_csv_with_json' \
  --verbose \
  --public \
  files/

We see:

Creating catalog database at /tmp/tmp177tgr7e/catalog.db
Indexing 'files/' ...
  Overwriting '/'
  Walking 'files'
    Grouped CSV and JSON into a node 'b'
    Grouped CSV and JSON into a node 'a'
Indexing complete. Starting server...

The metadata and data can be accessed:

$ http :8000/api/v1/metadata/a | jq .data.attributes.metadata
{
  "color": "blue"
}
$ http :8000/api/v1/node/full/a Accept:text/csv
HTTP/1.1 200 OK
Set-Cookie: tiled_csrf=_PpHNuNXtdLrZ-9APkSlULXM7hyII8dzbBYNT0hpwzo; HttpOnly; Path=/; SameSite=lax
content-length: 33
content-type: text/csv; charset=utf-8
date: Tue, 15 Aug 2023 06:02:16 GMT
etag: ef4697000229bc8798aadd68b07cc5d9
server: uvicorn
server-timing: read;dur=2546.7, tok;dur=0.1, pack;dur=1.3, app;dur=2559.9

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6

There are some details in the implementation that deserve a closer look:

  • In testing this I found an important bug. The Assets passed in to read_csv_with_json come out of the database in a nondeterministic order, so you get the wrong output (JSON metadata where you wanted a table) 50% of the time. The code involved here is quite new. This should be easy to fix.
  • In this particular example, the JSON file is fully de-normalized into the catalog database, so it may not make sense to track that as an Asset at all. A more complex example would involve both the metadata and data split across multiple files.
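Until that ordering bug is fixed, an adapter can defend itself by dispatching on file suffix rather than argument position. This is a hypothetical variant, not part of the PR; only the suffix-dispatch step is shown.

```python
from pathlib import Path


def read_csv_with_json_defensive(*files, **kwargs):
    """Accept the two asset paths in either order, dispatching by suffix.
    (Hypothetical workaround; the real fix belongs in the catalog code.)"""
    by_suffix = {Path(f).suffix: Path(f) for f in files}
    csv_file, json_file = by_suffix[".csv"], by_suffix[".json"]
    # ...from here, proceed as read_csv_with_json(csv_file, json_file, **kwargs)
    return csv_file, json_file
```

With this, the order in which the catalog hands back the assets no longer matters.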

@danielballan
A built-in example may also be helpful for reference: the TIFF sequence walker

# Matches filename with (optional) non-digits \D followed by digits \d
# and then the file extension .tif or .tiff.
TIFF_SEQUENCE_STEM_PATTERN = re.compile(r"^(\D*)(\d+)\.(?:tif|tiff)$")
async def tiff_sequence(
    catalog,
    path,
    files,
    directories,
    settings,
):
    """
    Group files in the given directory into TIFF sequences.

    We are looking for any files:
    - with file extension .tif or .tiff
    - with file name ending in a number

    We group these into sorted groups and make one Node for each.
    A group may have one or more items.
    """
    unhandled_directories = directories
    unhandled_files = []
    sequences = collections.defaultdict(list)
    for file in files:
        if file.is_file():
            match = TIFF_SEQUENCE_STEM_PATTERN.match(file.name)
            if match:
                sequence_name, _sequence_number = match.groups()
                sequences[sequence_name].append(file)
                continue
        unhandled_files.append(file)
    mimetype = "multipart/related;type=image/tiff"
    for name, sequence in sorted(sequences.items()):
        logger.info("    Grouped %d TIFFs into a sequence '%s'", len(sequence), name)
        adapter_class = settings.adapters_by_mimetype[mimetype]
        key = settings.key_from_filename(name)
        try:
            adapter = adapter_class(*sequence)
        except Exception:
            logger.exception("    SKIPPED: Error constructing adapter for '%s'", name)
            return
        await create_node_safe(
            catalog,
            key=key,
            structure_family=adapter.structure_family,
            metadata=dict(adapter.metadata()),
            specs=adapter.specs,
            data_sources=[
                DataSource(
                    mimetype=mimetype,
                    structure=dict_or_none(adapter.structure()),
                    parameters={},
                    management=Management.external,
                    assets=[
                        Asset(
                            data_uri=str(ensure_uri(str(item.absolute()))),
                            is_directory=False,
                        )
                        for item in sorted(sequence)
                    ],
                )
            ],
        )
    return unhandled_files, unhandled_directories
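The stem pattern can be checked in isolation. This snippet simply exercises the regex quoted above: files in one sequence share a non-digit prefix, and the trailing digits give the ordering within the sequence.

```python
import re

TIFF_SEQUENCE_STEM_PATTERN = re.compile(r"^(\D*)(\d+)\.(?:tif|tiff)$")

# Group 1 is the (possibly empty) non-digit prefix; group 2 is the
# digit run that orders the file within its sequence.
match = TIFF_SEQUENCE_STEM_PATTERN.match("A001.tif")
name, number = match.groups()
```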

@danielballan danielballan marked this pull request as ready for review August 24, 2023 20:22
@danielballan

Thanks @Wiebke and @dylanmcreynolds.

@danielballan danielballan merged commit 28d32ae into bluesky:main Aug 25, 2023
8 checks passed
@danielballan danielballan deleted the custom-walker branch August 25, 2023 00:17