Artifact API #436

seanmacavaney · 2024-04-18T21:06:04Z

WIP

Example:

import pyterrier as pt ; pt.init()
index = pt.artifact.from_url('hf:macavaney/msmarco-passage.terrier')
# TerrierIndex('/Users/sean/.pyterrier/artifacts/7ead118630437940852142386f67ab62123a6ce372bb4b8cf12b06a76c8ccc25' <from 'https://huggingface.co/datasets/macavaney/msmarco-passage.terrier/resolve/main/artifact.tar.lz4'>)
index.bm25() # -> TerrierRetrieve

# maintain support for centralized from_dataset
index = pt.Artifact.from_dataset('msmarco_document', 'terrier_stemmed') # maps to hf:macavaney/pyterrier-from-dataset@msmarco_document.terrier_stemmed

cmacdonald · 2024-04-22T20:06:01Z

Thanks Sean, this is an interesting concept.

TerrierIndex is a bit of a complicated decision, as you know.

Could we have a branch of pyterrier_pisa using this functionality so we can road-test the API? And a guide on how to add a new artifact?

seanmacavaney · 2024-05-03T14:55:44Z

The new pyterrier-quality repo has an example.

The key bits are:

Some things I'm still considering:

- I think we can drop the _try_load stuff. Artifacts will need a metadata file in order to be loaded. This will be the primary case, which simplifies the implementation of a new artifact.
  - To handle existing artifacts that were not generated with a metadata file, simple "artifact metadata adapters" can be used to guess what the metadata file should be based on what files are in there. I think we can limit this to only a few rare cases.
- An Artifact.from_hf(dataset_id), which just calls Artifact.from_url(f'hf:{dataset_id}')
- Some common stuff for building an artifact, similar to the builder stuff in pyterrier-caching
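
To make the "artifact metadata adapter" idea concrete, here is a minimal sketch. It is not taken from the branch: the function names, the adapter list, and the `type`/`format` values are assumptions for illustration. The only fact it relies on is that a Terrier index directory contains a `data.properties` file.

```python
import os
from typing import Callable, Dict, List, Optional

Metadata = Dict[str, str]

def terrier_adapter(files: List[str]) -> Optional[Metadata]:
    """Hypothetical adapter: guess metadata for a Terrier index from its file names."""
    if 'data.properties' in files:
        # The type/format values below are assumptions, not confirmed by this thread.
        return {'type': 'sparse_index', 'format': 'terrier'}
    return None

# Hypothetical registry of adapters, tried in order.
ADAPTERS: List[Callable[[List[str]], Optional[Metadata]]] = [terrier_adapter]

def guess_metadata(path: str) -> Optional[Metadata]:
    """Return guessed metadata for a directory that has no metadata file, or None."""
    files = sorted(os.listdir(path))
    for adapter in ADAPTERS:
        meta = adapter(files)
        if meta is not None:
            return meta
    return None
```

Each adapter only looks at file names, so supporting another legacy format would mean adding one small function to the list.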

seanmacavaney · 2024-05-05T15:21:46Z

artifact branch now on pyterrier-pisa: https://github.com/terrierteam/pyterrier_pisa/tree/artifact

seanmacavaney · 2024-05-05T16:13:06Z

And on pyterrier-dr: https://github.com/terrierteam/pyterrier_dr/tree/artifact

cmacdonald · 2024-05-07T14:25:54Z

I'm not sure what the entry_point stuff is for. Can you explain it simply? Is that just a code discovery mechanism for Python?

What use case does an Artefact address? Is it so that, if I don't know the class I need in order to get an index or something, I can still load a factory object? But if I don't know its class, I don't know what (factory) methods it supports.

seanmacavaney · 2024-05-07T16:02:43Z

Entry points act as a registry of all the artifacts installed. Since they're metadata about the package itself, they do not involve loading any modules at runtime to establish what's registered.
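
As a rough illustration of the registry lookup, the sketch below lists the entry points in a group using only the standard library. The group name `pyterrier.artifact` is an assumption for illustration; the function name is hypothetical.

```python
from importlib.metadata import entry_points
from typing import Dict

def list_registered(group: str) -> Dict[str, str]:
    """Map entry point names to 'module:attr' strings without importing the targets."""
    eps = entry_points()
    if hasattr(eps, 'select'):
        # Python 3.10+: selectable EntryPoints object
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict of group name -> list of entry points
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}

# 'pyterrier.artifact' is an assumed group name; an empty dict means nothing is registered.
print(list_registered('pyterrier.artifact'))
```

The lookup reads the installed distributions' metadata only; a target module is imported later, when `ep.load()` is called on a specific entry point.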

Here's a short document on the use case for artifacts: https://gist.github.com/seanmacavaney/ceac1b5eacaac4b072caa69986089ff4

Beyond the use cases outlined in the document, as you mention, it also simplifies the loading of artifacts. This is similar to how AutoModel simplifies loading models in huggingface. Oftentimes you're already specifying the name of what you're loading in the ID itself, so it's annoying and redundant to write it out again. For instance:

import pyterrier as pt
pt.init()
from pyterrier_pisa import PisaIndex
index = PisaIndex.from_hf('pyterrier/msmarco-passage.pisa')
# vs
import pyterrier as pt
pt.init()
index = pt.Artifact.from_hf('pyterrier/msmarco-passage.pisa')

The metadata file provided by the artifact specification also gives a hint about what package you need to install to load the artifact. So in the above example, if pyterrier-pisa isn't installed, it could give an error message saying that you need to install this package to load the index. (This isn't implemented yet, but the metadata is there.)
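
A minimal sketch of how such an error message could be produced is below. It is not implemented in the branch; the `package_hint`, `type` and `format` keys and the function name are assumptions for illustration, and the hyphen-to-underscore mapping from package name to module name is only a heuristic.

```python
import importlib.util
import json
from typing import Optional

def missing_package_message(meta_json: str) -> Optional[str]:
    """Return an install hint if the package named in the metadata is not importable."""
    meta = json.loads(meta_json)
    hint = meta.get('package_hint')  # assumed key name
    if hint is None:
        return None
    # Heuristic: 'pyterrier-pisa' on PyPI is imported as 'pyterrier_pisa'.
    module_name = hint.replace('-', '_')
    if importlib.util.find_spec(module_name) is not None:
        return None  # package is available, nothing to report
    return (f"Loading this artifact (type={meta.get('type')!r}, "
            f"format={meta.get('format')!r}) requires a package that is not "
            f"installed. Try: pip install {hint}")
```

Returning the message instead of raising keeps the sketch easy to test; a real loader would probably raise an error carrying this text.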

"But if I don't know its class, I don't know what (factory) methods it supports."

This is also true with huggingface's AutoModel. You can always do help(index) to get documentation once you have an instance of an object, but when you only have an identifier, it might be challenging to find the right artifact class to load it.

seanmacavaney · 2024-07-27T11:19:18Z

A prototype of the artifact API is in pyterrier-alpha. Integrated with extension packages:

Still to integrate:

pyterrier-quality QualCache
pyterrier-adaptive CorpusGraph
(Anything else?)

mam10eks · 2024-08-13T14:02:59Z

This is indeed a very cool concept. It would also be nice if we could load artifact results such as runs from TIRA. Perhaps they could be prefixed in a similar way to irds:... or the hf:... example above?

seanmacavaney · 2024-08-13T15:56:05Z

Sounds reasonable! The idea would be that it would detect if it was loading a run file (or similar) and return it as a dataframe? We can experiment with this a bit on the implementation in pyterrier-alpha.

mam10eks · 2024-08-14T07:34:21Z

Yes, I think this style of magic (we likely should introduce a verbose flag :)) would be quite cool. Automatically detecting that an output is a run file should be no problem, as in tira they are always expected to produce a run.txt, which could be easily captured in combination with the scoped prefix.

seanmacavaney · 2024-08-27T16:51:18Z

For results, it might make more sense to have a special URI-style format for loading with pt.io.read_results? E.g., pt.io.read_results('tirex:<task>/<team>/<approach>/<dataset>')?

mam10eks · 2024-08-28T01:26:34Z

Yes, sounds very good.

I think it would also be cool if one could directly pass a dataset id from ir_datasets, i.e., the mapping from dataset id to tira task is done internally. The dataset id might contain / characters, but I think this would be no problem, as it would imply a hierarchical structure of the task, which I think is a valid viewpoint.

E.g., if we have the irds id clueweb12/touche-2020-task-2, a call would look like:

pt.io.read_results('tirex:clueweb12/touche-2020-task-2/<team>/<approach>')

We could maybe also think about listing of results. E.g., if I call something like:

pt.io.read_results('tirex:clueweb12/touche-2020-task-2/<team>')

it could print out all public approaches by the team and then fail, or if I call:

pt.io.read_results('tirex:clueweb12/touche-2020-task-2/')

It could print all public approaches and then fail.
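
One way to split such an identifier when the dataset id itself contains `/` is to take the team and approach from the right. The sketch below is only an illustration of that idea: `TirexRef` and `parse_tirex` are hypothetical names, and it handles the full form only, not the listing variants.

```python
from typing import NamedTuple

class TirexRef(NamedTuple):
    dataset: str   # ir_datasets id, may contain '/'
    team: str
    approach: str

def parse_tirex(uri: str) -> TirexRef:
    """Parse 'tirex:<dataset-id>/<team>/<approach>' (hypothetical format)."""
    scheme, sep, rest = uri.partition(':')
    if sep != ':' or scheme != 'tirex':
        raise ValueError(f'not a tirex URI: {uri!r}')
    # Split from the right so that slashes inside the dataset id are preserved.
    parts = rest.rsplit('/', 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f'expected <dataset-id>/<team>/<approach> in {uri!r}')
    return TirexRef(*parts)
```

A right-split cannot tell a shortened listing request apart from a full one when the trailing slash is missing, so the listing behaviour would likely need a lookup against the known dataset ids.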

mam10eks · 2024-08-28T07:54:24Z

I will start playing around with this a bit in pyterrier-alpha.

mam10eks · 2024-08-28T11:19:01Z

Cool, I have a first rough prototype (it did not require any change in the pyterrier-alpha codebase, and only minor additions to the tira client) so that this test case works:

https://github.com/mam10eks/pyterrier-alpha/blob/main/tests/test_artifacts_from_tira.py

In principle (plus/minus potentially changed design decisions and documentation and more unit tests), this is it :)

seanmacavaney · 2024-09-24T16:58:23Z

As a heads up -- I've replaced this branch with a version taken from alpha

The artifact-old branch records the state before the force push.

cmacdonald · 2024-09-24T17:15:04Z

I think the current commit omits the TerrierIndex artefact.

seanmacavaney · 2024-09-24T17:18:05Z

Good catch! I had forgotten that it was done in the old one.

cmacdonald · 2024-09-25T08:58:59Z

pyterrier/io.py

+
+
+@contextmanager
+def finalized_directory(path: str) -> str:


This doesn't return a string?

cmacdonald · 2024-09-25T09:00:03Z

pyterrier/io.py

+
+
+@contextmanager
+def download_stream(url: str, *, expected_sha256: Optional[str] = None, verbose: bool = True) -> io.IOBase:


Yielding is not the same as returning IOBase:

Return type of generator function must be compatible with "Generator[Any, Any, Any]"
"Generator[Any, Any, Any]" is not assignable to "IOBase"

cmacdonald · 2024-09-25T09:00:09Z

pyterrier/io.py

+    *,
+    expected_sha256: Optional[str] = None,
+    verbose: bool = True
+) -> io.IOBase:


cmacdonald · 2024-09-25T16:48:17Z

pyterrier/io.py

+    os.replace(path_tmp, path)
+
+
+def download(url: str, path: str, *, expected_sha256: str = None, verbose: bool = True) -> None:


Should/can this replace wget (to reduce dependencies)?
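
For reference, a dependency-free download with checksum verification can be sketched with the standard library alone. This is a minimal sketch that borrows the signature from the diff above, not the implementation in the PR; the `chunk_size` parameter and the temporary-file naming are assumptions, and the `verbose` progress reporting is omitted.

```python
import hashlib
import os
import urllib.request
from typing import Optional

def download(url: str, path: str, *, expected_sha256: Optional[str] = None,
             chunk_size: int = 1 << 16) -> None:
    """Stream url to path, verifying the SHA-256 digest before finalizing the file."""
    path_tmp = path + '.tmp'  # write here first so a failed download never leaves `path`
    hasher = hashlib.sha256()
    try:
        with urllib.request.urlopen(url) as response, open(path_tmp, 'wb') as fout:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                hasher.update(chunk)
                fout.write(chunk)
        if expected_sha256 is not None and hasher.hexdigest() != expected_sha256.lower():
            raise ValueError(f'sha256 mismatch for {url}: got {hasher.hexdigest()}')
        os.replace(path_tmp, path)  # move the verified file into place
    finally:
        # Clean up the temporary file if anything above failed.
        if os.path.exists(path_tmp):
            os.remove(path_tmp)
```

Hashing each chunk as it is written avoids a second pass over the file, which matters for multi-gigabyte index archives.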

pyterrier/terrier/_index.py

cmacdonald · 2024-09-25T17:32:53Z

pyterrier/io.py

@@ -54,23 +64,26 @@ def find_files(dir):


 @contextmanager
-def _finalized_open_base(path, mode, open_fn):
+def _finalized_open_base(path: str, mode: str, open_fn: Callable) -> io.IOBase:


Again, yield vs. return (mypy flagged these).

seanmacavaney · 2024-09-25T19:31:45Z

Thanks! From what I can tell, the correct return type annotation for these context managers (described here) is Generator[X, None, None]. (It's a bummer, because the annotation makes it look like the function is used as a generator rather than as a context manager, which is how it's actually used :/.)
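
To illustrate the annotation, here is a small self-contained sketch of a generator-based context manager typed this way. The body is a simplified stand-in, not the code from pyterrier/io.py; only the function name and the `path: str` parameter come from the diff.

```python
import os
import shutil
from contextlib import contextmanager
from typing import Generator

@contextmanager
def finalized_directory(path: str) -> Generator[str, None, None]:
    """Yield a temporary directory that is moved to `path` only if the block succeeds."""
    path_tmp = path + '.tmp'
    os.makedirs(path_tmp)
    try:
        yield path_tmp  # the yielded type (str) is the first Generator parameter
    except BaseException:
        # The body of the with-block failed: discard the partial output.
        shutil.rmtree(path_tmp, ignore_errors=True)
        raise
    os.replace(path_tmp, path)
```

Type checkers also accept `Iterator[str]` here, which is shorter when the send and return types are both `None`.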

mam10eks mentioned this pull request Aug 29, 2024

Add EntryPoints compatible with the (currently in alpha development) PyTerrier Artifacts API tira-io/tira#659

Open

pull artifact stuff from pyterrier-alpha

98810ed

seanmacavaney force-pushed the artifact branch from 5c1c7f7 to 98810ed Compare September 24, 2024 16:57

url resolver entry point

9822833

seanmacavaney added 9 commits September 24, 2024 21:30

integration fixes

25b82b1

TerrierIndex artifact

3104cd0

narrowing pt.java.required

803f5ef

fix tests

cc883f7

integration bug

85f028c

integration

d814070

moved from-dataset hf repo under pyterrier org

98a8246

zenodo integration

8d791d4

note about zenodo

ce69f18

cmacdonald requested changes Sep 25, 2024

documentation wip

10ffa8f

seanmacavaney added 6 commits September 28, 2024 11:13

organization

b08001c

preliminary Artifact.to_p2p and .from_p2p using wormhole

df05acf

better p2p

f2cf081

documentation

3452998

a few tweaks

2e9d306

a few more tweaks

407e402
