Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚚 Database migrations for lamindb 1.0a1 #2323

Merged
merged 131 commits into from
Jan 11, 2025
Merged

🚚 Database migrations for lamindb 1.0a1 #2323

merged 131 commits into from
Jan 11, 2025

Conversation

falexwolf
Copy link
Member

@falexwolf falexwolf commented Jan 5, 2025

Bumps __version__ from 0.78a1 to 1.0a1.

Internal instances migrated:

  • laminlabs/lamindata
  • laminlabs/lamin-site-assets
  • laminlabs/lamin-dev
  • laminlabs/arrayloader-benchmarks
  • laminlabs/bionty-assets

Other changes

  • upgraded Python to >=3.10 and Django to >=5 because we're using db_default -- discussion here on GitHub

Schema changes

  • remove _source_code_artifact from Transform, it's been deprecated since 0.75 222931a
    • data migration: for all transforms that have _source_code_artifact populated, populate source_code
  • rename Transform.name to Transform.description because it behaves like Artifact.description and it's confusing because people expect versioning to happen based on name 5d68c5e Claude chat
    • backward compat:
      • in the Transform constructor use name to populate key in all cases in which only name is passed
      • return the same transform based on key in case source_code is None via ._name_field = "key"
    • data migrations:
      • there already was a legacy description field that was never exposed on the constructor; to be safe, we concatenated potential data in it on the new description field Claude chat
      • for all transforms that have key=None and name!=None, use name to pre-populate key
  • rename Collection.name to Collection.key for consistency with Artifact & Transform and the high likelihood of users wanting to organize them hierarchically 2bef93a
  • a _branch_code integer on every record to model pull requests 483fb37
    • include visibility within that code
    • repurpose visibility=0 as _branch_code=0 as "archive"
    • put an index on it: see Claude chat
    • code a "draft" as _branch_code = 2, and "draft prs" as negative branch codes Claude thread -- this is better than adding another field as originally planned (internal Slack thread)
  • rename values "number" to "num" in dtype
  • an .aux json field on Record: Claude thread
  • a SmallInteger run._status_code that allows to write finished_at in clean up operations so that there is a run time also for aborted runs
  • rename Run.is_consecutive to Run._is_consecutive
  • a _template_id FK to store the information of the generating template (whether a record is a template is coded via _branch_code)
  • rename _accessor to otype to publicly declare the data format as suffix, accessor
  • rename Artifact.type to Artifact.kind (internal Slack thread) Claude chat
  • a FK to artifact run._logfile which holds logs
  • a hash field on ParamValue and FeatureValue to enforce uniqueness without running the danger of failure for large dictionaries 9e4b503 Claude chat
  • an Artifact._curator: dict field 068bcf7
  • add a boolean field ._expect_many to Feature/Param that defaults to True/False and indicates whether values for this feature/param are expected to occur a single or multiple times for every single artifact/run 1cc1ca8
    • for feature
      • if it's True (default), the values come from an observation-level aggregation and a dtype of datetime on the observation-level mean set[datetime] on the artifact-level
      • if it's False it's an artifact-level value and datetime means datetime; this is an edge case because an arbitrary artifact would always be a set of arbitrary measurements that would need to be aggregated ("one just happens to measure a single cell line in that artifact")
    • for param
      • if it's False (default), the values mean artifact/run-level values and datetime means datetime
      • if it's True, the values would be from an aggregation, this seems like an edge case but say when characterizing a model ensemble trained with different parameters it could be relevant
  • remove the .transform foreign key from artifact and collection for consistency with all other records; introduce a property and a simple filter statement instead (context: https://github.com/laminlabs/laminhub/issues/1498) a4e0fb2
  • store provenance metadata for TransformULabel, RunParamValue, ArtifactParamValue 576d858
  • enable linking projects & references to transforms & collections: laminlabs/ourprojects@370a869
  • rename Run.parent to Run.initiated_by_run 9fbdd6f
  • introduce a boolean flag on artifact that's called _overwrite_versions, which indicates whether versions are overwritten or stored separately; it defaults to False for file-like artifacts and to True for folder-like artifacts - see Slack thread
  • Rename n_objects to n_files for more clarity laminlabs/lamindb-setup@eba36ba Claude chat
  • Add a Space registry to lamindb with an FK on every BasicRecord internal Notion page a25df16
  • add a name column to run so that a specific run can be used as an analysis and named; for almost all runs it's going to be None
  • add a .type CharField to Feature and Param so that features & params can be hierarchically grouped
  • remove _previous_runs field on everything except Artifact & Collection; we can model this better with events going forward
  • a boolean on ULabel called is_concept to distinguish ontological concepts from the labels themselves; for instance, you would never want to label an artifact with a ulabel Project, you'll only want to label with actual project values Project 1, Project 2, etc. 6bd7f62

TODO in another PR:

  • add a Metric and MetricValue registry that parallels Param and ParamValue up to an FK to run
    • or rename Feature to Dimension and FeatureSet to Schema and let people group metrics through a feature type metric
  • move the Project registry into lamindb with meaningful releationships (e.g. also for ULabel) and make it easy to label things by default with a project
  • attach curator to FeatureSet or re-purpose FeatureSet as curator in some way or rename FeatureSet to Schema Slack thread
  • think through naming conventions of lamindb-related records: transfer, save_vitessce_config, run reports, etc. and migrate them
  • rename "glue" transform type

Materials

Needs:

Affects:

Triggered these issues:

Did not move forward with these changes

  • a Record._page_md textfield so that every entity can hold markdown content
  • a Record._public boolean that indicates whether a record is public Claude chat
  • check that there is an index on paramvalue.value and featurevalue.value, there is None but indexes on jsons are complicated and these tables won't be big so we rather hold off
  • consider whether we want to eliminate the link tables associated with "_previous_runs" runs; decided against it Claude thread
  • given that feature_sets is a public attribute and given that some people might want to query for feature_values or param_values outside the .features and .params managers; we should consider making _feature_values and _param_values public. This will likely also help people better understand the underlying mechanics if they're ever interested. Decided against because then there would be two ways to query by a feature or param. Everything is already supported through the lamindb-native API and by that considerably simpler than doing this with Django, in particular since ✨ Enable to easily join features onto artifacts via Artifact.df() #2238. Also see the run docs where nested queries are exemplified: image
  • rename User to Person? no, better to keep it separate; it's OK to have to tables for the internal concept and things that people want to do with Person otherwise

@falexwolf falexwolf changed the title 🚚 Remove _source_code_artifact M2M from Transform 🚚 Migrations for lamindb 1.0 Jan 5, 2025
@Koncopd
Copy link
Member

Koncopd commented Jan 5, 2025

automatically tracking artifact folder: https://claude.ai/chat/85b7a953-36d5-4afa-8efc-1bd6886eaefe

The conversation you were looking for could not be found.
what is it actually?

@Koncopd
Copy link
Member

Koncopd commented Jan 5, 2025

a Folder model with registry | key (see internal Claude discussion)

I don't really understand the point, as we have Storage already. Also we have different storage types, like http, hf, s3 etc, not clear what is a folder for http or hf. If a meaningful grouping is needed, we have Collection for that. This will b super confusing if we have Storage, Artifact, Collection and then also Folder.
Do we have a discussion for this?

@falexwolf falexwolf linked an issue Jan 5, 2025 that may be closed by this pull request
@falexwolf falexwolf changed the title 🚚 Migrations for lamindb 1.0 🚚 Migrations for LaminDB 1.0 Jan 5, 2025
@falexwolf
Copy link
Member Author

The conversation you were looking for could not be found. what is it actually?

This is something for the frontend; I shared the Claude discussion with you. After the whole exploration around the block model this will likely make more sense in laminhub_client (https://github.com/laminlabs/laminhub/issues/1784).

@Koncopd
Copy link
Member

Koncopd commented Jan 5, 2025

I still don't see it.

@falexwolf
Copy link
Member Author

falexwolf commented Jan 5, 2025

@Koncopd
Copy link
Member

Koncopd commented Jan 5, 2025

Ok, that is clearer to me now, but i am still not sure. If we have a path in s3, gs, and http with the same "folder" in the key, is it really the same folder and do we want to see it as a unified folder really?

@falexwolf
Copy link
Member Author

If we have a path in s3, gs, and http with the same "folder" in the key, is it really the same folder and do we want to see it as a unified folder really?

ArtifactFolder and TransformFolder and CollectionFolder are merely used in the UI and merely for virtual keys. Let's discuss this here when we get to it: https://github.com/laminlabs/laminhub/issues/1787

@falexwolf falexwolf changed the title 🚚 Migrations for lamindb 1.0 🗃️ Main database migrations for lamindb 1.0 Jan 11, 2025
@github-actions github-actions bot temporarily deployed to pull request January 11, 2025 14:08 Inactive
@falexwolf falexwolf changed the title 🗃️ Main database migrations for lamindb 1.0 🚚 Database migrations for lamindb 1.0a1 Jan 11, 2025
@github-actions github-actions bot temporarily deployed to pull request January 11, 2025 15:44 Inactive
@falexwolf falexwolf merged commit f078f27 into main Jan 11, 2025
14 checks passed
@falexwolf falexwolf deleted the migrate branch January 11, 2025 16:27
@github-actions github-actions bot temporarily deployed to pull request January 11, 2025 16:32 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

🏗️ Migrations for LaminDB 1.0
4 participants