Fix tracking run inputs #98

Closed
Zethson opened this issue Nov 21, 2024 · 8 comments · Fixed by #124
Labels: bug (Something isn't working), priority-high

@Zethson
Member

Zethson commented Nov 21, 2024

We observed that when data from any non-default instance, such as cellxgene, is used as input during a Run, it does not show up in the lineage graph.

@lazappi suggested that this is probably because we use the API to get the data rather than the Python code, which probably does more magic.

@Zethson Zethson added the bug Something isn't working label Nov 21, 2024
@falexwolf
Member

The problem is that reticulate isn't used for .load() and .cache().

If reticulate were used, this would all be resolved.

Hence, the fix is to use reticulate for these two methods.

@lazappi
Collaborator

lazappi commented Nov 22, 2024

I think we should discuss whether it is worth using the API at all. We have more or less reached the limit of what the API can currently do, and if we have to use {reticulate} for some things anyway, maybe it's better to use it for everything? It would mean a fairly big refactor, but development might be quicker after that.

@falexwolf
Member

Yes, we should have that discussion next week or so, I agree.

But what's indeed much better with the API is that you're not relying on 20 Python packages that Django needs to map all the schema modules.

So, it's not a clear decision in favor of reticulate for querying, say, bionty. That's likely more elegant through the REST API.

@falexwolf falexwolf changed the title Keep data lineage when getting data from other instances Add data lineage feature Nov 25, 2024
@falexwolf
Member

falexwolf commented Nov 25, 2024

Under the hood of artifact$load() one calls:

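import lamindb as ln

def load(uid):
    # uid is the artifact's uid, obtained from the REST API
    artifact = ln.Artifact.get(uid)
    # Load through the Python client rather than through the REST API
    return artifact.load()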

@rcannood
Collaborator

Solution 1

  • Add a REST endpoint that lets the lamindb instance know that we are caching a certain artifact in the context of a certain transform

Solution 2

  • Switch to reticulate for caching files

Solution 3

  • Use the REST API as long as db$track() has not been called; switch to reticulate once tracking has started.
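
To make Solution 3 concrete, here is a minimal sketch of the dispatch it implies. Everything in it is hypothetical: the class `ArtifactLoader`, its methods, and the `data:<uid>` return values are stand-ins for illustration, not laminr or lamindb code. The REST path returns data without recording anything, while the path taken after tracking starts also records the artifact as a run input.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactLoader:
    """Hypothetical dispatcher: REST before tracking starts, Python client after."""

    tracking: bool = False
    run_inputs: list = field(default_factory=list)  # uids recorded as run inputs

    def track(self) -> None:
        # Stand-in for db$track(): later loads go through the Python client path
        self.tracking = True

    def _load_via_rest(self, uid: str) -> str:
        # Stand-in for the REST path: returns data, records no lineage
        return f"data:{uid}"

    def _load_via_python(self, uid: str) -> str:
        # Stand-in for the reticulate path: returns data and records the input
        self.run_inputs.append(uid)
        return f"data:{uid}"

    def load(self, uid: str) -> str:
        # Choose the backend based on whether tracking has started
        if self.tracking:
            return self._load_via_python(uid)
        return self._load_via_rest(uid)
```

With this sketch, an artifact loaded before `track()` is never recorded, and one loaded afterwards appears in `run_inputs`. The cost of this design is that two code paths must return equivalent results.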

@rcannood rcannood added this to the 0.2.1 milestone Nov 25, 2024
@falexwolf
Member

I strongly favor Solution 2 for the upcoming weeks.

@falexwolf falexwolf changed the title Add data lineage feature Fix tracking run inputs Nov 25, 2024
@falexwolf
Member

@lazappi (or @rcannood), do you have time to fix this on Monday morning so that it goes into the 0.3.0 release by about 2 pm?

I believe nobody really tried laminr much last week because of Thanksgiving. But they will this week.

Tracking of inputs is the only key piece still missing.

@lazappi
Collaborator

lazappi commented Dec 2, 2024

I'll have a look this morning. There are other things I would like to improve, but we haven't looked at this one yet, so I'll do it first.
