-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Artifact API #436
base: master
Are you sure you want to change the base?
Artifact API #436
Conversation
Thanks Sean, this is interesting concept. TerrierIndex is a bit of a complicated decision, as you know. Could have a branch of pyterrier_pisa using this functionality so we can roadtest the API? A guide on how to add a new artifact? |
The new pyterrier-quality repo has an example. The key bits are: Some things I'm still considering:
|
|
And on pyterrier-dr: https://github.com/terrierteam/pyterrier_dr/tree/artifact |
I'm not sure what the entry_point stuff is for. Can you explain it simply? Is that just a code discovery mechanism for Python? What use case does an Artefact address? Is it so I dont know the class that I am looking for to get an index or something, I can still load a factory object? But if I dont know its class, I dont know what (factory) methods it supports. |
Entry points act as a registry of all the artifacts installed. Since they're metadata about the package itself, they do not involve loading any modules at runtime to establish what's registered. Here's a short document on the use case for artifacts: https://gist.github.com/seanmacavaney/ceac1b5eacaac4b072caa69986089ff4 Beyond the use cases outlined in the document, as you mention, it also simplifies the loading of artifacts. This is similar to how import pyterrier as pt
pt.init()
from pyterrier_pisa import PisaIndex
index = PisaIndex.from_hf('pyterrier/msmarco-passage.pisa')
# vs
import pyterrier as pt
pt.init()
index = pt.Artifact.from_hf('pyterrier/msmarco-passage.pisa') The metadata file provided by the artifact specification also gives a hint about what package you need to install to load the artifact. So in the above example, if
This is also true with huggingface's AutoModel. You can always do |
A prototype of the artifact API is in pyterrier-alpha. Integrated with extension packages:
Still to integrate:
|
This is indeed a very cool concept, it would maybe also be cool if we could load artifact-results such as runs from TIRA? Could maybe be also prefixed similar to |
Sounds reasonable! The idea would be that it would detect if it was loading a run file (or similar) and return it as a dataframe? We can experiment with this a bit on the implementation in |
yes, I think this style of magic (we likely should introduce a verbose flag :)) would be quite cool, automatically detecting that an ouptut is a run file should be no problem, as in tira they are always expected to produce a run.txt, which could be easily captured in combination with the scoped prefix. |
For results, it might make more sense to have a special URI-style format for loading with |
Yes, sounds very good. I think a cool way could also be when one could directly pass a dataset id from ir_datasets, i.e., that the mapping from dataset-id to tira task is done internally. The dataset id might contain E.g., if we have the irds id
We could maybe also think about listing of results. E.g., if I call something like:
it could print out all public approaches by the team and then fail, or if I call:
It could print all public approaches and then fail. |
I would start to play a bit around in pyterrier-alpha. |
Cool, I have a first rough prototype (did not require no change in the pyterrier-alpha codebase, and only minor additions to the tira client) so that this test case works: https://github.com/mam10eks/pyterrier-alpha/blob/main/tests/test_artifacts_from_tira.py In principle (plus/minus potentially changed design decisions and documentation and more unit tests), this is it :) |
5c1c7f7
to
98810ed
Compare
As a heads up -- I've replaced this branch with a version taken from alpha The artifact-old branch records the state before the force push. |
I think current commit omits the TerrierIndex artefact |
Good catch! I had forgotten that it was done in the old one. |
|
||
|
||
@contextmanager | ||
def finalized_directory(path: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesnt return a string?
|
||
|
||
@contextmanager | ||
def download_stream(url: str, *, expected_sha256: Optional[str] = None, verbose: bool = True) -> io.IOBase: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yield not the same as returning IOBase:
Return type of generator function must be compatible with "Generator[Any, Any, Any]"
"Generator[Any, Any, Any]" is not assignable to "IOBase"
*, | ||
expected_sha256: Optional[str] = None, | ||
verbose: bool = True | ||
) -> io.IOBase: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
os.replace(path_tmp, path) | ||
|
||
|
||
def download(url: str, path: str, *, expected_sha256: str = None, verbose: bool = True) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should/can this replace wget (to reduce dependencies?)
@@ -54,23 +64,26 @@ def find_files(dir): | |||
|
|||
|
|||
@contextmanager | |||
def _finalized_open_base(path, mode, open_fn): | |||
def _finalized_open_base(path: str, mode: str, open_fn: Callable) -> io.IOBase: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again yield vs return (mypy told me these)
Thanks! From what I can tell, it looks like the correct return type annotations for the context managers (are described here) is |
WIP
Example: