Support physical file-level deletion #42
Thoughts from 2023-09-22 editors' meeting: This would be a big change to OCFL, where up to v1.1 we consider versions to be immutable once written. We see possible uses with mutable filesystems where there is some compelling reason to delete a file, or with either mutable or immutable storage where a file is corrupted and irrecoverable. Unless the whole object is rewritten, in both of these cases the versions using the file will be broken and fail validation. This could be indicated in a new version (that does validate) that records what content from prior versions is no longer available or valid. One way to do this would be to have something like a
If something like this were supported then we may wish to have a root level flag to say that the root has |
I gave some thought to this some time ago, while coming up with the description for how to purge files in ORA if it were ever required. There were a few things I thought important. What is presented here isn't a proposed solution for the community, but it's a list of considerations and what we thought important. I like the idea of a manifest section much more than I like our proposed internal solution (a json file)!

We would want to record why a file was purged

File purges can happen for an arbitrary reason, such as because a file became corrupt, but when they happen for a reason, it's often a legal or other compliance reason. We might, therefore, want to be able to audit the object at a later stage. If we were to find two copies of the object, one with a purged file and one without, it would be useful to know if we could restore the file (if the file was purged from OCFL because it was corrupt) or if we should never restore the file. We decided to store the date/time of purge, the user responsible for the purge, and a message stating the reason for the purge.

We would want to know which file was purged

We would want to maintain a record of which file was purged. This would allow us to demonstrate that the file was previously present, but was no longer there. Again, this allows for accurate comparison with a copy of an object which contained the binary file, as well as providing a demonstration that the file that was purged was no longer on the system (i.e. it would be possible to demonstrate that no file with that checksum remained). We didn't want to preserve filenames, as a filename in itself might constitute purged information. We settled on storing a checksum of the purged binary file, alongside the digest algorithm used to generate that checksum.

We would want to know the state of the object at the time of purge

This was a way to be able to compare two copies of an object, one with the purged file and one without. The solution to this was to store the inventory digest for the current version of the record at the time of purge (updated thought: better would be the inventory digests for all versions of the object at time of the purge). This would allow comparison between two copies of the same object.

What we proposed internally (not necessarily a good idea)

Create a new version of the object with a filename of
|
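The purge record described in the comment above (date/time, user, reason, checksum plus digest algorithm, and inventory digests, but no filename) could be sketched roughly as follows. This is only an illustration: the `PurgeRecord` class, its field names, and `make_purge_record` are hypothetical and are not part of OCFL or of the ORA implementation.

```python
import hashlib
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class PurgeRecord:
    """Hypothetical record of a purge; field names are assumptions."""
    purged_at: str           # date/time of the purge (UTC, ISO 8601)
    user: str                # who performed the purge
    reason: str              # why the file was purged
    digest_algorithm: str    # algorithm used for purged_digest
    purged_digest: str       # checksum of the purged binary (no filename kept)
    inventory_digests: dict  # version name -> inventory digest at purge time


def make_purge_record(content, user, reason, inventory_digests,
                      algorithm="sha512"):
    # Compute the checksum before the content is removed from storage.
    digest = hashlib.new(algorithm, content).hexdigest()
    return PurgeRecord(
        purged_at=datetime.now(timezone.utc).isoformat(timespec="seconds"),
        user=user,
        reason=reason,
        digest_algorithm=algorithm,
        purged_digest=digest,
        inventory_digests=dict(inventory_digests),
    )


record = make_purge_record(
    b"content that must be purged",
    user="Alice",
    reason="Removed for legal reasons; must never be restored",
    inventory_digests={"v1": "placeholder-inventory-digest"},
)
print(asdict(record))
```

Keeping the filename out of the record follows the concern above that a filename might itself be purged information; the checksum is still enough to show that no file with that digest remains.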
Feedback on Use Cases

In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.

Polling on Use Cases

In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as
The poll will remain open through the end of February 2024. |
As long as inventory.json is not changed, the deletion of files should be supported. To prevent the validation from failing, the files could be replaced with a file with a defined checksum, which then generates a warning instead of an error during validation. Tracking can then take place in a new version. |
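A minimal sketch of the placeholder idea in the comment above, assuming a validator that compares each content file's digest with the manifest. The placeholder content, the `PLACEHOLDER_DIGEST` constant, and `check_manifest` are hypothetical; no such placeholder is defined by OCFL, and files are modelled as an in-memory dictionary rather than a real storage root.

```python
import hashlib

# Hypothetical well-known placeholder content; its digest is the
# "defined checksum" that a validator would recognise.
PLACEHOLDER_CONTENT = b"OCFL-DELETED-PLACEHOLDER"
PLACEHOLDER_DIGEST = hashlib.sha512(PLACEHOLDER_CONTENT).hexdigest()


def check_manifest(manifest, files):
    """Compare stored content against an unchanged manifest.

    manifest: digest -> list of content paths (as in inventory.json)
    files:    content path -> bytes (stand-in for the filesystem)
    Returns (errors, warnings).
    """
    errors, warnings = [], []
    for expected, paths in manifest.items():
        for path in paths:
            data = files.get(path)
            if data is None:
                errors.append(f"missing file: {path}")
                continue
            actual = hashlib.sha512(data).hexdigest()
            if actual == expected:
                continue
            if actual == PLACEHOLDER_DIGEST:
                # Deliberately replaced file: warn instead of failing.
                warnings.append(f"file replaced by placeholder: {path}")
            else:
                errors.append(f"digest mismatch: {path}")
    return errors, warnings


manifest = {
    hashlib.sha512(b"kept").hexdigest(): ["v1/content/kept.txt"],
    hashlib.sha512(b"removed").hexdigest(): ["v1/content/removed.txt"],
}
files = {
    "v1/content/kept.txt": b"kept",
    "v1/content/removed.txt": PLACEHOLDER_CONTENT,
}
errors, warnings = check_manifest(manifest, files)
print(errors, warnings)
```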
This seems useful, as we've had cases where we've had to delete files out of objects for legal reasons. |
At the time of this comment the vote tallied to +6. Confirming this as in scope for version 2 of the specification |
Editors' meeting 2024-09-20: Handling file loss, file corruption or version collapse all change the assumption of version immutability. This is necessarily a version 2 concept, so it can only apply in a version 2+ storage root. Even with a notionally immutable system, one can have corruption. Possible solutions without mutability would be to delete corrupted objects or just store a record of corrupted files outside of the system. We agree on the

Question from @brian Wheeler #42 (comment) : "If the file is gone then it would not appear in the

We will create a new version to record that a file has been deleted, vanished or corrupted. We will recommend that no other changes be made at the same time as the recording of deletion/corruption. The creation of a new version gives the chance to write a new version message with user/time/etc. and any other human-readable information about why the change is occurring.

Why might a file be tombstoned?

Use cases for corrupted and/or unreadable:
We will add an extra parameter in Implementation notes must:
Example of file deletion (unchanged from 2023-09-23 comment)

{
"digestAlgorithm": "sha512",
"head": "v2",
"id": "http://example.org/minimal_deletion",
"manifest": { },
"tombstones": {
"7545b8...f67": [ "v1/content/file.txt" ]
},
"type": "https://ocfl.io/1.1/spec/#inventory",
"versions": {
"v1": {
"created": "2020-10-12T01:00:00Z",
"message": "One file",
"state": {
"7545b8...f67": [ "file.txt" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
},
"v2": {
"created": "2021-01-00T02:00:00Z",
"message": "The one file had to be deleted entirely for legal reasons",
"state": { },
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
}
}
}

Example of marking file corruption (cannot be read, and readable but bad digest)

{
"digestAlgorithm": "sha512",
"head": "v2",
"id": "http://example.org/minimal_deletion",
"manifest": {
"": [ "v1/content/file1.txt" ],
"aaa143...79a": [ "v1/content/file2.txt" ]
},
"tombstones": {
"7545b8...f67": [ "v1/content/file1.txt" ],
"fe4512...e47": [ "v1/content/file2.txt" ]
},
"type": "https://ocfl.io/1.1/spec/#inventory",
"versions": {
"v1": {
"created": "2023-10-02T12:00:00Z",
"message": "Two files",
"state": {
"7545b8...f67": [ "file1.txt" ],
"fe4512...e47": [ "file2.txt" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
},
"v2": {
"created": "2024-09-20T10:09:00Z",
"message": "File 1 vanished or cannot be read reliably, exclude. File 2 is corrupted with a different checksum, change name in state",
"state": {
"aaa143...79a": [ "file2_corrupted.txt" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
}
}
} |
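To illustrate how a `tombstones` block like the ones above might be consumed, here is a rough sketch that checks that every digest referenced by a version state is accounted for in either `manifest` or `tombstones`, and lists the versions that reference tombstoned content. The rule itself and the function names are assumptions made for illustration; the thread does not define validation behaviour.

```python
def referenced_digests(inventory):
    """All digests that appear in any version's state."""
    return {
        digest
        for version in inventory["versions"].values()
        for digest in version["state"]
    }


def unresolved_digests(inventory):
    """Digests used in a state but found in neither manifest nor tombstones."""
    known = set(inventory.get("manifest", {})) | set(inventory.get("tombstones", {}))
    return referenced_digests(inventory) - known


def versions_with_tombstoned_content(inventory):
    """Version names whose state refers to at least one tombstoned digest."""
    tombstoned = set(inventory.get("tombstones", {}))
    return [
        name
        for name, version in inventory["versions"].items()
        if tombstoned & set(version["state"])
    ]


# Trimmed copy of the file deletion example above.
inventory = {
    "head": "v2",
    "manifest": {},
    "tombstones": {"7545b8...f67": ["v1/content/file.txt"]},
    "versions": {
        "v1": {"state": {"7545b8...f67": ["file.txt"]}},
        "v2": {"state": {}},
    },
}

print(unresolved_digests(inventory))                # empty set
print(versions_with_tombstoned_content(inventory))  # ['v1']
```

Under this assumed rule, v1 is reported as no longer fully retrievable while the head version v2 has nothing missing, which matches the intent that the new version validates.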
I really like the idea of having a second list with entities which have a special status.

Version 1 (minimal)

{
"digestAlgorithm": "sha512",
"head": "v2",
"id": "http://example.org/minimal_deletion",
"manifest": {
"7545b8...f67": [ "v1/content/file1.txt" ],
"5543b8...ae9": [ "ark:abc/123" ],
"fe4512...e47": [ "v1/content/file2.txt" ]
},
"state": {
"7545b8...f67": "deleted",
"fe4512...e47": "corrupted",
"5543b8...ae9": "remote"
},
"type": "https://ocfl.io/1.1/spec/#inventory",
"versions": {
"v1": {
"created": "2023-10-02T12:00:00Z",
"message": "Two files",
"state": {
"7545b8...f67": [ "file1.txt" ],
"fe4512...e47": [ "file2.txt" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
},
"v2": {
"created": "2024-09-20T10:09:00Z",
"message": "File 1 vanished or cannot be read reliably, exclude. File 2 is corrupted with a different checksum. Bigdata.txt remote file added",
"state": {
"aaa143...79a": [ "file2.txt" ],
"5543b8...ae9": [ "bigdata.txt" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
}
}
}

Version 2 (more information in state)

{
"digestAlgorithm": "sha512",
"head": "v2",
"id": "http://example.org/minimal_deletion",
"manifest": {
"7545b8...f67": [ "v1/content/file1.txt" ],
"fe4512...e47": [ "v1/content/file2.txt" ]
},
"state": {
"7545b8...f67": { "status": "deleted", "message": "copyright issues" },
"5543b8...ae9": { "status": "remote", "message": "ARK reference" },
"fe4512...e47": { "status": "corrupted", "message": "corrupted with a different checksum" }
},
"type": "https://ocfl.io/1.1/spec/#inventory",
"versions": {
"v1": {
"created": "2023-10-02T12:00:00Z",
"message": "Two files",
"state": {
"7545b8...f67": [ "file1.txt" ],
"fe4512...e47": [ "file2.txt" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
},
"v2": {
"created": "2024-09-20T10:09:00Z",
"message": "File 1 has copyright issues, delete. File 2 is corrupted with a different checksum. Bigdata.txt remote file added",
"state": {
"aaa143...79a": [ "file2.txt" ],
"5543b8...ae9": [ "bigdata.txt" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
}
}
}
|
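The two variants in the comment above differ only in the shape of each status entry: a bare string in Version 1 and an object with `status` and `message` in Version 2. A reader could treat both uniformly with something like the sketch below; `normalize_status` is a hypothetical helper, and the top-level `state` block is the commenter's proposal rather than anything in the specification.

```python
def normalize_status(entry):
    """Return a {"status", "message"} dict for either proposed form."""
    if isinstance(entry, str):
        # Version 1 (minimal): the value is just the status string.
        return {"status": entry, "message": None}
    # Version 2: an object carrying a status and an optional message.
    return {"status": entry["status"], "message": entry.get("message")}


minimal = {"7545b8...f67": "deleted", "5543b8...ae9": "remote"}
detailed = {
    "7545b8...f67": {"status": "deleted", "message": "copyright issues"},
}

for block in (minimal, detailed):
    for digest, entry in block.items():
        print(digest, normalize_status(entry))
```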
Since there can be multiple files for one checksum, deletion/corruption/remote MUST refer to filenames and not to checksums.

[...]
"state": {
"v1/content/file1.txt": "deleted",
"v1/content/file2.txt": "corrupted",
"ark:abc/123": "remote"
},
[...] |
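A small sketch of why path-keyed status matters: when one digest maps to several content paths, a status keyed by digest cannot say which copy was lost, while a status keyed by path can. The `available_paths` helper and the example paths are hypothetical.

```python
UNAVAILABLE = {"deleted", "corrupted"}


def available_paths(manifest, path_state):
    """For each digest, list the content paths not marked as unavailable.

    manifest:   digest -> list of content paths
    path_state: content path -> status string (the path-keyed proposal)
    """
    return {
        digest: [p for p in paths if path_state.get(p) not in UNAVAILABLE]
        for digest, paths in manifest.items()
    }


# One digest stored at two content paths; only one copy is deleted.
manifest = {
    "7545b8...f67": ["v1/content/file1.txt", "v2/content/copy_of_file1.txt"],
}
path_state = {"v1/content/file1.txt": "deleted"}

print(available_paths(manifest, path_state))
```

Here the content is still retrievable through the second path, which a digest-keyed `"deleted"` status could not express.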
There are legitimate curatorial reasons for being able to physically remove individual files from an object. Right now, the only way to deal with this is through the Purge procedure outlined in the Implementation notes. This requires deleting the entire object and then re-creating it without the implicated files. It would be useful to work with the OCFL community to create an easier way to do this in a more automated manner that would rewrite inventories and perhaps leave a tombstone somewhere, either in the directory structure or just as metadata.