Support physical file-level deletion #42

Open
slabrams opened this issue Oct 6, 2022 · 10 comments
Labels
Component: Specification Confirmed: In-scope Use case will be included in the upcoming version of the spec or implementation notes.

Comments

@slabrams

slabrams commented Oct 6, 2022

There are legitimate curatorial reasons for being able to physically remove individual files from an object. Right now, the only way to deal with this is through the Purge procedure outlined in the Implementation notes. This requires deleting the entire object and then re-creating it without the implicated files. It would be useful to work with the OCFL community to create an easier, more automated way to do this that would rewrite inventories and perhaps leave a tombstone somewhere, either in the directory structure or just as metadata.

@rosy1280 rosy1280 added Proposed: In-Scope Use case is up for discussion and may change the spec, implementation notes, or become an extension. Component: Specification labels Sep 22, 2023
@zimeon
Contributor

zimeon commented Sep 22, 2023

Thoughts from 2023-09-22 editors' meeting:

This would be a big change to OCFL, where up to v1.1 we consider versions to be immutable once written. We see possible uses with mutable filesystems where there is some compelling reason to delete a file, or with either mutable or immutable storage where a file is corrupted and irrecoverable. Unless the whole object is rewritten, in both of these cases the versions using the file will be broken and fail validation. This could be recorded in a new version (that does validate) that indicates what content from prior versions is no longer available or valid.

One way to do this would be to have something like a tombstone block that parallels the manifest block. So, imagine that the one file.txt in the spec's minimal object example is broken/deleted, then a v2 inventory might be:

{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal",
  "manifest": { },
  "tombstone": {
    "7545b8...f67": [ "v1/content/file.txt" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2018-10-02T12:00:00Z",
      "message": "One file",
      "state": {
        "7545b8...f67": [ "file.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-10-02T12:00:00Z",
      "message": "The one file is gone",
      "state": { },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}

If something like this were supported, then we may wish to have a root-level flag to say that the storage root has mutable content.
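To make the effect on earlier versions concrete, here is a minimal Python sketch of how a tool might list the logical paths in a version whose content is tombstoned. It assumes the block is called `tombstone` as in the example above and is keyed by digest; the function name `unavailable_files` is hypothetical, and the truncated digests are copied as-is from the example.

```python
def unavailable_files(inventory, version, block="tombstone"):
    """Return logical paths in `version` whose content digest is tombstoned.

    Assumes (as in the example inventory) that the tombstone block is a
    digest -> content-path map that parallels the manifest.
    """
    tombstoned = inventory.get(block, {})
    state = inventory["versions"][version]["state"]
    return sorted(
        path
        for digest, paths in state.items()
        if digest in tombstoned
        for path in paths
    )


# Trimmed copy of the v2 inventory from the comment above.
inventory = {
    "head": "v2",
    "manifest": {},
    "tombstone": {"7545b8...f67": ["v1/content/file.txt"]},
    "versions": {
        "v1": {"state": {"7545b8...f67": ["file.txt"]}},
        "v2": {"state": {}},
    },
}

print(unavailable_files(inventory, "v1"))
print(unavailable_files(inventory, "v2"))
```

With this inventory, v1 reports `file.txt` as unavailable while v2, whose state is empty, reports nothing.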

@neilsjefferies
Member

@tomwrobel

@tomwrobel
Collaborator

tomwrobel commented Sep 29, 2023

I gave some thought to this some time ago, while coming up with the description for how to purge files in ORA if it were ever required.

There were a few things I thought important. What is presented here isn't a proposed solution for the community, but it's a list of considerations and what we thought important. I like the idea of a manifest section much more than I like our proposed internal solution (a JSON file)!

We would want to record why a file was purged

File purges can happen for an arbitrary reason, such as because a file became corrupt, but when they happen for a reason, it's often a legal or other compliance reason. We might, therefore, want to be able to audit the object at a later stage. If we were to find two copies of the object, one with a purged file and one without, it would be useful to know if we could restore the file (if the file was purged from OCFL because it was corrupt) or if we should never restore the file.

We decided to store the date/time of the purge, the user responsible for the purge, and a message stating the reason for the purge.

We would want to know which file was purged

We would want to maintain a record of which file was purged. This would allow us to demonstrate that the file was previously present but is no longer there. Again, this allows for accurate comparison with a copy of an object which contains the binary file, as well as providing a demonstration that the purged file is no longer on the system (i.e. it would be possible to demonstrate that no file with that checksum remains). We didn't want to preserve filenames, as a filename in itself might constitute purged information. We settled on storing a checksum of the purged binary file, alongside the digest algorithm used to generate that checksum.

We would want to know the state of the object at the time of purge

This was a way to be able to compare two copies of an object, one with the purged file and one without. The solution to this was to store the inventory digest for the current version of the record at the time of purge (updated thought: better would be the inventory digests for all versions of the object at time of the purge). This would allow comparison between two copies of the same object.

What we proposed internally (not necessarily a good idea)

Create a new version of the object with a filename of {sha512_of_binary_file}.purged.json. This would create a json file with the following information:

  • digest of the purged file
  • digest algorithm used to generate that digest
  • responsible user
  • reason for purge
  • datetime of purge
  • digest of the manifest at the time of the purge (a better way would have been a list of all the OCFL versions and the digests of their inventory.json files)
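The record described in the list above could be sketched in Python roughly as follows. This is only an illustration of the internal idea, not a proposed format: the JSON field names (`digest`, `digestAlgorithm`, `user`, `reason`, `purged`, `inventoryDigests`) and the function name are assumptions, and the per-version inventory digests follow the "better way" mentioned in the last bullet.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_purge_record(content: bytes, user: str, reason: str,
                       inventory_digests: dict) -> tuple[str, str]:
    """Return (filename, JSON body) for a hypothetical purge record.

    `content` is the binary file about to be purged; only its digest is
    kept, so neither the bytes nor the filename survive in the record.
    `inventory_digests` maps version names to inventory.json digests.
    """
    digest = hashlib.sha512(content).hexdigest()
    record = {
        "digest": digest,
        "digestAlgorithm": "sha512",
        "user": user,
        "reason": reason,
        "purged": datetime.now(timezone.utc).isoformat(),
        "inventoryDigests": inventory_digests,
    }
    return f"{digest}.purged.json", json.dumps(record, indent=2, sort_keys=True)


name, body = build_purge_record(
    b"example bytes", "Alice", "legal takedown", {"v1": "placeholder-digest"}
)
print(name)
print(body)
```

Keeping only the digest in both the filename and the body matches the stated goal of not preserving a filename that might itself be sensitive.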

@rosy1280
Contributor

Feedback on Use Cases

In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.

Polling on Use Cases

In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as Proposed: In Scope for version 2. You can contribute to the poll for this use case by reacting to this comment. The following reactions are supported:

  • 👍🏼 In favor of the use case
  • 👎🏼 Against the use case
  • 👀 Neutral on the use case

The poll will remain open through the end of February 2024.

@je4

je4 commented Nov 4, 2023

As long as inventory.json is not changed, the deletion of files should be supported. To prevent the validation from failing, the files could be replaced with a file with a defined checksum, which then generates a warning instead of an error during validation. Tracking can then take place in a new version.
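A rough Python sketch of that validation rule, under stated assumptions: the placeholder content `PLACEHOLDER_BYTES` is invented here purely for illustration (no such well-known placeholder is defined anywhere in this thread), and `check_file` is a hypothetical helper rather than part of any existing validator.

```python
import hashlib

# Hypothetical placeholder content; a real proposal would have to define it.
PLACEHOLDER_BYTES = b"OCFL-DELETED-PLACEHOLDER\n"
PLACEHOLDER_DIGEST = hashlib.sha512(PLACEHOLDER_BYTES).hexdigest()


def check_file(expected_digest: str, content: bytes) -> str:
    """Classify a file during fixity checking.

    Returns "ok" if the content matches the manifest digest, "warning" if
    it is the agreed placeholder for a deleted file, and "error" otherwise.
    """
    actual = hashlib.sha512(content).hexdigest()
    if actual == expected_digest:
        return "ok"
    if actual == PLACEHOLDER_DIGEST:
        return "warning"
    return "error"


original = b"original file content"
expected = hashlib.sha512(original).hexdigest()
print(check_file(expected, original))
print(check_file(expected, PLACEHOLDER_BYTES))
print(check_file(expected, b"something else"))
```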

@bdwheele

This seems useful, as we've had cases where we've had to delete files out of objects for legal reasons.
One question: does the content address for the removed file exist in both the manifest and the tombstone, or just in the tombstone?

@rosy1280 rosy1280 added Confirmed: In-scope Use case will be included in the upcoming version of the spec or implementation notes. and removed Proposed: In-Scope Use case is up for discussion and may change the spec, implementation notes, or become an extension. labels Feb 29, 2024
@rosy1280
Contributor

At the time of this comment the vote tallied to +6. Confirming this as in scope for version 2 of the specification.

@zimeon
Contributor

zimeon commented Sep 20, 2024

Editors' meeting 2024-09-20: Handling file loss, file corruption or version collapse all change the assumption of version immutability. This is necessarily a version 2 concept so it can only apply in a version 2+ storage root. Even with a notionally immutable system, one can have corruption. Possible solutions without mutability would be to delete corrupted objects or just store a record of corrupted files outside of the system.

We agree on the tombstones idea described in use case #42.

Question from @brian Wheeler #42 (comment) : "If the file is gone then it would not appear in the manifest?". We agree that when a file is gone then the file would be shown in the tombstones block and not in the manifest block.

We will create a new version to record that a file has been deleted, vanished or corrupted. We will recommend that no other changes be made at the same time as the recording of deletion/corruption. The creation of a new version gives the chance to write a new version message with user/time/etc. and any other human-readable information about why the change is occurring.

Why might a file be tombstoned?

  • missing/vanished, removed/deleted -- spec will not distinguish these cases: file does not appear in manifest, appears with original digest in tombstones
  • corrupted -- file appears with "new" digest in manifest, with original digest in tombstones
  • name in file system but unreadable or not reliably readable -- file appears with empty digest string (not a valid digest output in any digest format we know, and an empty string is valid as key on JSON object whereas Null etc. are not) in manifest, with original digest in tombstones
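The three cases above can be told apart mechanically from the manifest and tombstones blocks. The following Python sketch is one reading of those rules; the function name and the returned labels (including "intact" for untouched files) are assumptions made for illustration, and the sample data is taken from the examples later in this comment.

```python
def classify(content_path, manifest, tombstones):
    """Classify a content path using the manifest and tombstones blocks.

    - not tombstoned                          -> "intact"
    - tombstoned, absent from manifest        -> "missing-or-deleted"
    - tombstoned, manifest digest is ""       -> "unreadable"
    - tombstoned, manifest has another digest -> "corrupted"
    """
    if not any(content_path in paths for paths in tombstones.values()):
        return "intact"
    for digest, paths in manifest.items():
        if content_path in paths:
            return "unreadable" if digest == "" else "corrupted"
    return "missing-or-deleted"


# Blocks from the corruption example.
manifest = {
    "": ["v1/content/file1.txt"],
    "aaa143...79a": ["v1/content/file2.txt"],
}
tombstones = {
    "7545b8...f67": ["v1/content/file1.txt"],
    "fe4512...e47": ["v1/content/file2.txt"],
}

print(classify("v1/content/file1.txt", manifest, tombstones))
print(classify("v1/content/file2.txt", manifest, tombstones))
# Blocks from the deletion example: empty manifest, one tombstone.
print(classify("v1/content/file.txt", {}, {"7545b8...f67": ["v1/content/file.txt"]}))
```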

Use cases for corrupted and/or unreadable:

  • write once storage where we can't delete
  • corruption where we want to keep corrupted file for possible later analysis

We will add an extra parameter in ocfl_layout.json to flag use of mutability features such as tombstones with the implication that tooling MUST check latest inventory before trying to read any version.
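A hedged Python sketch of what "check latest inventory before trying to read any version" might look like in tooling. The parameter name `mutable` in ocfl_layout.json is purely an assumption (the name has not been decided in this thread), as is the helper `resolve_for_read`; the caller is expected to pass the head inventory, not the inventory stored with the older version.

```python
import json


def resolve_for_read(layout_json, head_inventory, version, logical_path,
                     flag="mutable"):
    """Map a logical path in `version` to a content path, or None.

    Returns None when the storage root declares mutability (hypothetical
    `flag` in ocfl_layout.json) and the head inventory tombstones the
    digest that `version` refers to.
    """
    mutable = bool(json.loads(layout_json).get(flag, False))
    state = head_inventory["versions"][version]["state"]
    digest = next((d for d, paths in state.items() if logical_path in paths), None)
    if digest is None:
        raise KeyError(f"{logical_path} is not in {version}")
    if mutable and digest in head_inventory.get("tombstones", {}):
        return None  # content is gone or untrustworthy; do not read it
    return head_inventory["manifest"][digest][0]


layout = '{"mutable": true}'
head = {
    "manifest": {"": ["v1/content/file1.txt"],
                 "aaa143...79a": ["v1/content/file2.txt"]},
    "tombstones": {"7545b8...f67": ["v1/content/file1.txt"],
                   "fe4512...e47": ["v1/content/file2.txt"]},
    "versions": {
        "v1": {"state": {"7545b8...f67": ["file1.txt"],
                         "fe4512...e47": ["file2.txt"]}},
        "v2": {"state": {"aaa143...79a": ["file2_corrupted.txt"]}},
    },
}

print(resolve_for_read(layout, head, "v1", "file1.txt"))
print(resolve_for_read(layout, head, "v2", "file2_corrupted.txt"))
```

Reading v1 of `file1.txt` yields None because its digest is tombstoned in the head inventory, while the v2 entry resolves to the corrupted file's content path.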

Implementation notes must:

  • account for deduped files
  • talk about read errors and inconsistency
  • talk about corruption characteristics of different storage types
  • talk about need for documentation in new version
  • impact on other V2 features - packages and content-linking
  • validation strategies

Example of file deletion (essentially unchanged from the 2023-09-22 comment)

{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal_deletion",
  "manifest": { },
  "tombstones": {
    "7545b8...f67": [ "v1/content/file.txt" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2020-10-12T01:00:00Z",
      "message": "One file",
      "state": {
        "7545b8...f67": [ "file.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2021-01-01T02:00:00Z",
      "message": "The one file had to be deleted entirely for legal reasons",
      "state": { },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}

Example of marking file corruption (cannot be read, and readable but bad digest)

{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal_deletion",
  "manifest": {
    "": [ "v1/content/file1.txt" ],
    "aaa143...79a": [ "v1/content/file2.txt" ]
  },
  "tombstones": {
    "7545b8...f67": [ "v1/content/file1.txt" ],
    "fe4512...e47": [ "v1/content/file2.txt" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2023-10-02T12:00:00Z",
      "message": "Two files",
      "state": {
        "7545b8...f67": [ "file1.txt" ],
        "fe4512...e47": [ "file2.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2024-09-20T10:09:00Z",
      "message": "File 1 vanished or cannot be read reliably, exclude. File 2 is corrupted with a different checksum, change name in state",
      "state": {
        "aaa143...79a": [ "file2_corrupted.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}

@je4

je4 commented Sep 21, 2024

I really like the idea of having a second list with entities which have a special status.
How about just adding a status to the manifest entries instead of a tombstone?
Every manifest entry not listed in the state has status "ok".
This would avoid having to deal with two lists referencing manifest entries, and allows more information about what happened to the file. In most cases the inventory will look the same, but there is the possibility of enhancing it with additional status information.

Version 1 (minimal)

{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal_deletion",
  "manifest": {
    "7545b8...f67": [ "v1/content/file1.txt" ],
    "5543b8...ae9": [ "ark:abc/123" ],
    "fe4512...e47": [ "v1/content/file2.txt" ]
  },
  "state": {
    "7545b8...f67": "deleted",
    "fe4512...e47": "corrupted",
    "5543b8...ae9": "remote"
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2023-10-02T12:00:00Z",
      "message": "Two files",
      "state": {
        "7545b8...f67": [ "file1.txt" ],
        "fe4512...e47": [ "file2.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2024-09-20T10:09:00Z",
      "message": "File 1 vanished or cannot be read reliably, exclude. File 2 is corrupted with a different checksum. Bigdata.txt remote file added",
      "state": {
        "aaa143...79a": [ "file2.txt" ],
        "5543b8...ae9": [ "bigdata.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}

Version 2 (more information in state)

{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal_deletion",
  "manifest": {
    "7545b8...f67": [ "v1/content/file1.txt" ],
    "5543b8...ae9": [ "ark:abc/123" ],
    "fe4512...e47": [ "v1/content/file2.txt" ]
  },
  "state": {
    "7545b8...f67": { "status": "deleted", "message": "copyright issues" },
    "5543b8...ae9": { "status": "remote", "message": "ARK reference" },
    "fe4512...e47": { "status": "corrupted", "message": "corrupted with a different checksum" }
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2023-10-02T12:00:00Z",
      "message": "Two files",
      "state": {
        "7545b8...f67": [ "file1.txt" ],
        "fe4512...e47": [ "file2.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2024-09-20T10:09:00Z",
      "message": "File 1 has copyright issues, delete. File 2 is corrupted with a different checksum. Bigdata.txt remote file added",
      "state": {
        "aaa143...79a": [ "file2.txt" ],
        "5543b8...ae9": [ "bigdata.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}

@je4

je4 commented Oct 11, 2024

Since there can be multiple files for one checksum, deletion/corruption/remote MUST refer to filenames and not to checksums.

   [...]
 "state": {
   "v1/content/file1.txt": "deleted",
   "v1/content/file2.txt": "corrupted",
   "ark:abc/123": "remote"
 },
   [...]
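A small Python sketch of why path keys matter, assuming the top-level `state` block shown above and the default status "ok" for unlisted entries. The content paths `a.txt` and `copy_of_a.txt` and the helper `status_of` are hypothetical.

```python
def status_of(content_path, status_by_path):
    """Look up a content path's status; unlisted paths default to "ok"."""
    return status_by_path.get(content_path, "ok")


# One digest stored at two content paths (hypothetical names).
manifest = {
    "7545b8...f67": ["v1/content/a.txt", "v2/content/copy_of_a.txt"],
}
# Keyed by path, so only one of the two copies is marked as deleted.
status_by_path = {"v1/content/a.txt": "deleted"}

report = {
    path: status_of(path, status_by_path)
    for paths in manifest.values()
    for path in paths
}
print(report)
```

A digest-keyed block could not express this, since marking `7545b8...f67` as deleted would apply to both content paths at once.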
