Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add Person and Reference models #5

Merged
merged 12 commits into from
Nov 21, 2024
Merged

Conversation

Zethson
Copy link
Member

@Zethson Zethson commented Nov 21, 2024

Add Person and Reference models into the schema.

As part of integrating laminlabs/findrefs into ourprojects, the Reference model is now incorporated directly within the ourprojects schema.

Signed-off-by: zethson <[email protected]>
Signed-off-by: zethson <[email protected]>
@Zethson Zethson changed the title Feature/findrefs person ✨ Add Person and Reference models Nov 21, 2024
Signed-off-by: zethson <[email protected]>
Signed-off-by: zethson <[email protected]>
@Zethson Zethson requested a review from sunnyosun November 21, 2024 14:19
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Show resolved Hide resolved
@sunnyosun
Copy link
Member

No instance has this schema module yet, so you can remove all the migrations and generate a first migration script.

Signed-off-by: zethson <[email protected]>
@Zethson Zethson requested a review from sunnyosun November 21, 2024 15:01
ourprojects/__init__.py Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
Signed-off-by: zethson <[email protected]>
Signed-off-by: zethson <[email protected]>
@Zethson Zethson requested a review from sunnyosun November 21, 2024 16:05
ourprojects/models.py Outdated Show resolved Hide resolved
@sunnyosun
Copy link
Member

@falexwolf Can you have a final look, in case I missed anything?

Signed-off-by: zethson <[email protected]>
ourprojects/models.py Outdated Show resolved Hide resolved
ourprojects/models.py Outdated Show resolved Hide resolved
Signed-off-by: zethson <[email protected]>
@Zethson Zethson merged commit 37cad62 into main Nov 21, 2024
2 checks passed
@Zethson Zethson deleted the feature/findrefs_person branch November 21, 2024 19:22
@falexwolf
Copy link
Member

Ahhhhhhhh, didn't we discuss that we do not want to change the Reference model?

Already the "full_text" vs. "text" change was problematic. I forgot the specific discussion but somehow there was an assumption of separating abstract vs. full_text or something like this and in my opinion this added too much structure; leaving the user confused and data fragmented.

Now, it turns out that several fields were added here without them being discussed. Because it was a migration from another repo, we don't have a diff that we could review and I didn't notice the schema changes. The PR description is also silent about it and pretends the registry was simply moved over.

I'm only noticing now when trying to run data migrations in SQL across all instances (including customer instances). Of course these are throwing errors now.

Next time, please make diffs when making schema changes. Migrating package structure and database schema at the same time is a bad idea.

Version in findrefs

class Reference(Record, CanCurate, TracksRun, TracksUpdates, ValidateFields):
    """References such as a publication or document, with unique identifiers and metadata.

    Example:
        >>> reference = Reference(
        ...     name="A paper title",
        ...     doi="A doi",
        ... ).save()
    """

    class Meta(Record.Meta, TracksRun.Meta, TracksUpdates.Meta):
        abstract = False

    id: int = models.AutoField(primary_key=True)
    """Internal id, valid only in one DB instance."""
    uid: str = CharField(max_length=12, unique=True, default=ids.base62_12)
    """Universal id, valid across DB instances."""
    name: str = CharField(db_index=True)
    """Title or name of the reference document."""
    abbr: str | None = CharField(
        max_length=32,
        db_index=True,
        unique=True,
        null=True,
    )
    """A unique abbreviation for the reference."""
    url: str | None = models.URLField(null=True, blank=True)
    """URL linking to the reference."""
    pubmed_id: int | None = BigIntegerField(null=True)
    """A PudMmed ID."""
    doi: int | None = CharField(
        null=True,
        db_index=True,
        validators=[
            RegexValidator(
                regex=r"^(?:https?://(?:dx\.)?doi\.org/|doi:|DOI:)?10\.\d+/.*$",
                message="Must be a DOI (e.g., 10.1000/xyz123 or https://doi.org/10.1000/xyz123)",
            )
        ],
    )
    """Digital Object Identifier (DOI) for the reference."""
    text: str | None = TextField(null=True)
    """Text of the reference such as the abstract or the full-text to enable search."""
    artifacts: Artifact = models.ManyToManyField(
        Artifact, through="ArtifactReference", related_name="references"
    )
    """Artifacts labeled with this reference."""

Version here in ourprojects

class Reference(Record, CanCurate, TracksRun, TracksUpdates, ValidateFields):
    """References such as a publication or document, with unique identifiers and metadata.

    Example:
        >>> reference = Reference(
        ...     name="A Paper Title",
        ...     abbr="APT",
        ...     url="https://doi.org/10.1000/xyz123",
        ...     pubmed_id=12345678,
        ...     doi="10.1000/xyz123",
        ...     preprint=False,
        ...     journal="Nature Biotechnology",
        ...     description="A groundbreaking research paper.",
        ...     text="A really informative abstract.",
        ...     published_at=date(2023, 11, 21),
        ... ).save()
    """

    class Meta(Record.Meta, TracksRun.Meta, TracksUpdates.Meta):
        abstract = False

    id: int = models.AutoField(primary_key=True)
    """Internal id, valid only in one DB instance."""
    uid: str = CharField(
        unique=True, max_length=12, db_index=True, default=ids.base62_12
    )
    """Universal id, valid across DB instances."""
    name: str = CharField(db_index=True)
    """Title or name of the reference document."""
    abbr: str | None = CharField(
        max_length=32,
        db_index=True,
        unique=True,
        null=True,
    )
    """A unique abbreviation for the reference."""
    url: str | None = URLField(null=True)
    """URL linking to the reference."""
    pubmed_id: int | None = BigIntegerField(null=True, db_index=True)
    """A PudMmed ID."""
    doi: str | None = CharField(
        null=True,
        db_index=True,
        validators=[
            RegexValidator(
                regex=r"^(?:https?://(?:dx\.)?doi\.org/|doi:|DOI:)?10\.\d+/.*$",
                message="Must be a DOI (e.g., 10.1000/xyz123 or https://doi.org/10.1000/xyz123)",
            )
        ],
    )
    """Digital Object Identifier (DOI) for the reference."""
    preprint: bool = BooleanField(default=False, db_index=True)
    """Whether the reference is from a preprint."""
    public: bool = BooleanField(default=True, db_index=True)
    """Whether the reference is public."""
    journal: str | None = TextField(null=True)
    """Name of the journal."""
    description: str | None = TextField(null=True)
    """Description of the reference."""
    text: str | None = TextField(null=True)
    """Abstract or full text of the reference."""
    published_at: date | None = DateField(null=True, default=None)
    """Publication date."""
    authors: Person = models.ManyToManyField(Person, related_name="references")
    """All people associated with this reference."""
    artifacts: Artifact = models.ManyToManyField(
        Artifact, through="ArtifactReference", related_name="references"
    )
    """Artifacts labeled with this reference."""

@falexwolf
Copy link
Member

In the absence of a diff, I'm identifying three changes by eye.

Change 1

    preprint: bool = BooleanField(default=False, db_index=True)
    """Whether the reference is from a preprint."""

I'd remove this field: It makes the model complicated and might lead users to register the same paper as two different references: once for the preprint and once for the published version. In almost all cases, we'd not want this and professional managers thereby support "merging references". Of course we don't support this and hence this will likely cause more problems than solve.

Change 2

    journal: str | None = TextField(null=True)
    """Name of the journal."""

I'd also remove this field or think more deeply before adding it. Journal names are highly standardized and it's hella annoying if people just add random strings to it; one will have all kinds of abbreviations, typos etc. -- this was the reason for why there was no such field before.

In my experience it's hard to model this well (internally on notion we have the ReferenceSource table for this which is very dissatisfactory, too).

A solution could be a field where people could add free form text to "type" the reference similar to the W&B API. But if it's called "Journal" it should be an FK to a constrained vocab for a journal ontology; which IMO is over-engineering.

Change 3

    authors: Person = models.ManyToManyField(Person, related_name="references")
    """All people associated with this reference."""

I think this is OK! This works both for internal and external references and the name choice "authors" is universally accepted, I believe. There is an issue with ordering that we can resolve via an ordered_authors property like we do it for collections.ordered_artifacts. That's not nice but I still don't have a better idea.

@falexwolf
Copy link
Member

Change 4

    public: bool = BooleanField(default=True, db_index=True)
    """Whether the reference is public."""

Isn't this inconsistent with the external field? Is a "public reference" a non-internal reference is an "external reference"? If so, why use different words to describe the same thing? We have person.external.

@falexwolf
Copy link
Member

About this:

    preprint: bool = BooleanField(default=False, db_index=True)
    """Whether the reference is from a preprint."""
    public: bool = BooleanField(default=True, db_index=True)
    """Whether the reference is public."""
    journal: str | None = TextField(null=True)
    """Name of the journal."""
    description: str | None = TextField(null=True)
    """Description of the reference."""

Why are the indexes on the booleans but no indexes on the char fields? Not so critical right now, but there should be consistency on where we put indexes.

@Zethson
Copy link
Member Author

Zethson commented Jan 6, 2025

I'm back tomorrow but IIRC these changes were done for cellxgene and I certainly remember having discussed them with Sunny.

The PR description is also silent about it and pretends the registry was simply moved over.

True - point taken!

You were the final reviewer btw and we even had discussions on some of these schema changes: #5 (comment)

@falexwolf
Copy link
Member

Now I'm feeding this into Claude who finds a few more changes:

image

You were the final reviewer btw

Yes. But I approved assuming that there are zero changes to the Reference model given we need to migrate customer instances. This is what the PR description said and you can't expect me to memorize all fields of all registries we have. 😅

we even had discussions on some of these schema changes

I only reviewed the Person model assuming this is the only thing that is changing. Again, the discussion about the "full_text" vs. "text" vs. "abstract" changes were only a few days old, so I wasn't expecting that amount of fragmentation of to this model from a data modeling standpoint alone.

But given these two more reasons I had assumed that there is no way that the Reference model might actually have changed:

  1. a PR that moved code over from another project where at least my process is to always make one "moving code PR" and then more PRs in case I want to make changes to code that was moved; only then we have diffs
  2. there are migrations to do for several deployed instances and these are obviously harder if the schema changes before performing these migrations

Please, going forward, more cautiousness and more extensive PR descriptions in particular for migrations which persist stuff and are very hard to undo.

Also @sunnyosun!

@falexwolf
Copy link
Member

@sunnyosun
Copy link
Member

I think we assumed no instance had the findrefs schema before... Otherwise this should have been 2 PRs, one copies over the models, the other modifies them futher.

@falexwolf
Copy link
Member

I think we assumed no instance had the findrefs schema before...

We had a few discussions about findrefs in the context of our biggest customer. I think both of you should have been aware that they were using it. We also communicated the migration strategy to them. Happy to link GitHub threads in an internal Slack discussion.

I think you're probably already convinced that this PR should have been more thoughtful. But if you don't do SQL migrations yourself, it's hard to develop the kind of cautiousness one should have. Below is how debugging the migration script proceeds.

https://lamin.ai/laminlabs/lamin-dev/transform/ySlebT1AbuX80002

image

@falexwolf
Copy link
Member

falexwolf commented Jan 6, 2025

🎉 7 versions later! what a great AI agent I am 😆 💪

https://lamin.ai/laminlabs/lamin-dev/transform/ySlebT1AbuX80007

image

@falexwolf
Copy link
Member

And the data seems correctly populated with the artifact being re-linked:

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants