Artifact API #436

Draft
wants to merge 18 commits into
base: master

seanmacavaney
Collaborator

@seanmacavaney seanmacavaney commented Apr 18, 2024

WIP

Example:

import pyterrier as pt ; pt.init()
index = pt.artifact.from_url('hf:macavaney/msmarco-passage.terrier')
# TerrierIndex('/Users/sean/.pyterrier/artifacts/7ead118630437940852142386f67ab62123a6ce372bb4b8cf12b06a76c8ccc25' <from 'https://huggingface.co/datasets/macavaney/msmarco-passage.terrier/resolve/main/artifact.tar.lz4'>)
index.bm25() # -> TerrierRetrieve

# maintain support for centralized from_dataset
index = pt.Artifact.from_dataset('msmarco_document', 'terrier_stemmed') # maps to hf:macavaney/pyterrier-from-dataset@msmarco_document.terrier_stemmed
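
To make the two mappings in the example concrete, here is a minimal sketch of how an `hf:` identifier could resolve to a download URL, and how `from_dataset` could map onto an `hf:` identifier. The function names are hypothetical and are not part of this branch. Treating the text after `@` as a branch in the `resolve/` path is an assumption; only the `main` case and the `from_dataset` mapping are taken from the comments above.

```python
HF_PREFIX = 'hf:'


def hf_to_https(url: str) -> str:
    """Resolve 'hf:<user>/<repo>[@<branch>]' to an artifact download URL (sketch)."""
    if not url.startswith(HF_PREFIX):
        raise ValueError(f'not an hf: URL: {url!r}')
    repo, _, branch = url[len(HF_PREFIX):].partition('@')
    # Assumption: no '@' means the 'main' branch; the archive name follows the example above.
    return f'https://huggingface.co/datasets/{repo}/resolve/{branch or "main"}/artifact.tar.lz4'


def from_dataset_url(dataset: str, variant: str) -> str:
    """Map a (dataset, variant) pair onto the centralized from-dataset repository."""
    return f'hf:macavaney/pyterrier-from-dataset@{dataset}.{variant}'


print(hf_to_https('hf:macavaney/msmarco-passage.terrier'))
print(from_dataset_url('msmarco_document', 'terrier_stemmed'))
```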

@cmacdonald
Contributor

Thanks Sean, this is an interesting concept.

TerrierIndex is a bit of a complicated decision, as you know.

Could we have a branch of pyterrier_pisa using this functionality so we can road-test the API? And a guide on how to add a new artifact?

@seanmacavaney
Collaborator Author

The new pyterrier-quality repo has an example.

The key bits are:

Some things I'm still considering:

  • I think we can drop the _try_load stuff. Artifacts need the metadata to be loaded. This will be the primary case, which simplifies the implementation of a new artifact.
    • To handle existing artifacts that were not generated with a metadata file, simple "artifact metadata adapters" can be used to guess what the metadata file should be based on what files are in there. I think we can limit this to only a few rare cases.
  • An Artifact.from_hf(dataset_id), which just calls Artifact.from_url(f'hf:{dataset_id}')
  • Some common stuff for building an artifact. Similar to the builder stuff in pyterrier-caching

@seanmacavaney
Collaborator Author

artifact branch now on pyterrier-pisa: https://github.com/terrierteam/pyterrier_pisa/tree/artifact

@seanmacavaney
Collaborator Author

@cmacdonald
Contributor

cmacdonald commented May 7, 2024

I'm not sure what the entry_point stuff is for. Can you explain it simply? Is that just a code discovery mechanism for Python?

What use case does an Artefact address? Is it so that, even if I don't know the class I need in order to get an index or something, I can still load a factory object? But if I don't know its class, I don't know what (factory) methods it supports.

@seanmacavaney
Collaborator Author

seanmacavaney commented May 7, 2024

Entry points act as a registry of all the artifacts installed. Since they're metadata about the package itself, they do not involve loading any modules at runtime to establish what's registered.
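
As a rough illustration of that point, the sketch below lists what is registered under an entry point group using only `importlib.metadata`. The helper name `list_registered` and the group name `pyterrier.artifact` are assumptions made for the example. Reading `ep.name` and `ep.value` only touches package metadata; nothing is imported until `ep.load()` is called.

```python
from importlib.metadata import entry_points
from typing import Dict


def list_registered(group: str) -> Dict[str, str]:
    """Map entry point name -> 'module:attr' target, without importing the targets."""
    eps = entry_points()
    if hasattr(eps, 'select'):
        # Python 3.10+: selectable collection of entry points
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict of group name -> list of entry points
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


# 'pyterrier.artifact' is an assumed group name; the result is empty if nothing is registered.
registered = list_registered('pyterrier.artifact')
```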

Here's a short document on the use case for artifacts: https://gist.github.com/seanmacavaney/ceac1b5eacaac4b072caa69986089ff4

Beyond the use cases outlined in the document, as you mention, it also simplifies the loading of artifacts. This is similar to how AutoModel simplifies loading models in huggingface. Oftentimes you're already specifying the name of what you're loading in the ID itself, so it's annoying and redundant to write it out again. For instance:

import pyterrier as pt
pt.init()
from pyterrier_pisa import PisaIndex
index = PisaIndex.from_hf('pyterrier/msmarco-passage.pisa')
# vs
import pyterrier as pt
pt.init()
index = pt.Artifact.from_hf('pyterrier/msmarco-passage.pisa')

The metadata file provided by the artifact specification also gives a hint about what package you need to install to load the artifact. So in the above example, if pyterrier-pisa isn't installed, it could give an error message saying that you need to install this package to load the index. (This isn't implemented yet, but the metadata is there.)
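
A minimal sketch of how that error message could work is below. It is not the implementation: the metadata keys (`type`, `format`, `package_hint`), the registry shape, and the names `resolve_loader` and `ArtifactLoadError` are all assumptions for illustration.

```python
from typing import Callable, Dict, Tuple


class ArtifactLoadError(Exception):
    """Raised when no installed package knows how to load an artifact."""


def resolve_loader(metadata: dict, registry: Dict[Tuple[str, str], Callable]) -> Callable:
    """Pick the loader registered for (type, format), or explain what to install."""
    key = (metadata['type'], metadata['format'])
    if key in registry:
        return registry[key]
    message = f'No installed package can load artifact type={key[0]!r} format={key[1]!r}.'
    hint = metadata.get('package_hint')  # assumed metadata field
    if hint:
        message += f' Try: pip install {hint}'
    raise ArtifactLoadError(message)


# Example with an empty registry, i.e. pyterrier-pisa is not installed.
meta = {'type': 'sparse_index', 'format': 'pisa', 'package_hint': 'pyterrier-pisa'}
try:
    resolve_loader(meta, registry={})
except ArtifactLoadError as error:
    print(error)
```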

> But if I don't know its class, I don't know what (factory) methods it supports.

This is also true with huggingface's AutoModel. You can always do help(index) to get documentation once you have an instance of an object, but when you only have an identifier, it might be challenging to find the right artifact class to load it.

@seanmacavaney
Collaborator Author

seanmacavaney commented Jul 27, 2024

A prototype of the artifact API is in pyterrier-alpha. Integrated with extension packages:

Still to integrate:

@mam10eks
Contributor

This is indeed a very cool concept. It would also be cool if we could load artifact results, such as runs from TIRA. Could these also be prefixed, similar to irds:... or the hf:... example from above?

@seanmacavaney
Collaborator Author

Sounds reasonable! The idea would be that it would detect if it was loading a run file (or similar) and return it as a dataframe? We can experiment with this a bit on the implementation in pyterrier-alpha.

@mam10eks
Contributor

Yes, I think this style of magic (we should probably introduce a verbose flag :)) would be quite cool. Automatically detecting that an output is a run file should be no problem, as approaches in TIRA are always expected to produce a run.txt, which could easily be captured in combination with the scoped prefix.

@seanmacavaney
Collaborator Author

For results, it might make more sense to have a special URI-style format for loading with pt.io.read_results? E.g., pt.io.read_results('tirex:<task>/<team>/<approach>/<dataset>')?

@mam10eks
Contributor

mam10eks commented Aug 28, 2024

Yes, sounds very good.

I think another cool option would be to pass a dataset ID from ir_datasets directly, i.e., the mapping from dataset ID to TIRA task is done internally. The dataset ID might contain / characters, but I think this would be no problem, as it would imply a hierarchical structure of the task, which I think is a valid viewpoint.

E.g., if we have the irds ID clueweb12/touche-2020-task-2, a call would look like:

  • pt.io.read_results('tirex:clueweb12/touche-2020-task-2/<team>/<approach>')

We could maybe also think about listing of results. E.g., if I call something like:

  • pt.io.read_results('tirex:clueweb12/touche-2020-task-2/<team>')

it could print out all public approaches by the team and then fail, or if I call:

  • pt.io.read_results('tirex:clueweb12/touche-2020-task-2/')

It could print all public approaches and then fail.
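
One way the proposed `tirex:` identifiers could be parsed is sketched below. Because the dataset ID may itself contain `/`, the sketch matches the longest known dataset ID first and treats whatever remains as `<team>/<approach>`. The names `TirexRef` and `parse_tirex` are hypothetical, and the list of known dataset IDs is passed in rather than read from ir_datasets. A missing team or approach comes back as `None`, which a caller could use to print the available options and then fail.

```python
from typing import Iterable, NamedTuple, Optional


class TirexRef(NamedTuple):
    dataset: str
    team: Optional[str]
    approach: Optional[str]


def parse_tirex(url: str, known_datasets: Iterable[str]) -> TirexRef:
    """Split 'tirex:<dataset-id>[/<team>[/<approach>]]' where the dataset ID may contain '/'."""
    prefix = 'tirex:'
    if not url.startswith(prefix):
        raise ValueError(f'not a tirex: identifier: {url!r}')
    rest = url[len(prefix):].strip('/')
    # Try longer dataset IDs first so 'a/b' wins over 'a' when both are known.
    for dataset in sorted(known_datasets, key=len, reverse=True):
        if rest == dataset or rest.startswith(dataset + '/'):
            parts = [p for p in rest[len(dataset):].split('/') if p]
            if len(parts) > 2:
                raise ValueError(f'too many components after dataset ID: {parts}')
            team = parts[0] if len(parts) > 0 else None
            approach = parts[1] if len(parts) > 1 else None
            return TirexRef(dataset, team, approach)
    raise ValueError(f'no known dataset ID matches {rest!r}')
```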

@mam10eks
Contributor

I would start to play a bit around in pyterrier-alpha.

@mam10eks
Contributor

Cool, I have a first rough prototype (it did not require any change in the pyterrier-alpha codebase, and only minor additions to the tira client) so that this test case works:

https://github.com/mam10eks/pyterrier-alpha/blob/main/tests/test_artifacts_from_tira.py

In principle (plus/minus potentially changed design decisions and documentation and more unit tests), this is it :)

@seanmacavaney
Collaborator Author

seanmacavaney commented Sep 24, 2024

As a heads up -- I've replaced this branch with a version taken from alpha

The artifact-old branch records the state before the force push.

@cmacdonald
Contributor

I think the current commit omits the TerrierIndex artefact.

@seanmacavaney
Collaborator Author

Good catch! I had forgotten that it was done in the old one.



@contextmanager
def finalized_directory(path: str) -> str:
Contributor

This doesn't return a string?



@contextmanager
def download_stream(url: str, *, expected_sha256: Optional[str] = None, verbose: bool = True) -> io.IOBase:
Contributor

yield is not the same as returning IOBase:

Return type of generator function must be compatible with "Generator[Any, Any, Any]"
"Generator[Any, Any, Any]" is not assignable to "IOBase"

*,
expected_sha256: Optional[str] = None,
verbose: bool = True
) -> io.IOBase:
Contributor

ditto

os.replace(path_tmp, path)


def download(url: str, path: str, *, expected_sha256: str = None, verbose: bool = True) -> None:
Contributor

Should/can this replace wget (to reduce dependencies)?

pyterrier/terrier/_index.py
@@ -54,23 +64,26 @@ def find_files(dir):


@contextmanager
-def _finalized_open_base(path, mode, open_fn):
+def _finalized_open_base(path: str, mode: str, open_fn: Callable) -> io.IOBase:
Contributor

Again, yield vs. return (mypy told me about these).

@seanmacavaney
Collaborator Author

Thanks! From what I can tell, the correct return type annotation for the context managers (described here) is Generator[X, None, None]. (It's a bummer, because the annotation makes it look like the function is used as a generator rather than how it's actually used, as a context manager :/.)
