Possible solution for changed files, and also alternate database suggestions #383

bridgebrain · 2024-08-23T23:40:37Z

Hello all, I saw on CyanVoxel's YouTube channel that this project is alive and well! About halfway through the video, he mentions that how the system handles files that have been moved or renamed still has room for improvement. I had a thought about that.

Create an MD5/SHA256 checksum of every indexed file. Inject that into the metadata of the file (not certain if all file formats can hold metadata and have a common field like "comments"?). When the file is altered (renamed, moved), the system can efficiently compare the checksum stored in the database against the checksums embedded in all un-indexed files, without having to re-perform the calculation on the actual file. This also means that duplicates of the same file, or altered files (such as a Word document whose text was changed and which was then renamed), will still be tracked.
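A minimal sketch of the lookup side of this idea, assuming the checksums live in the library database rather than inside the files. `file_checksum`, `find_moved`, and the `index` dictionary are hypothetical names used only for illustration, not part of TagStudio.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_moved(index: dict[str, Path], candidates: list[Path]) -> dict[Path, Path]:
    """Map each missing indexed path to an un-indexed file with the same checksum.

    `index` maps checksum -> last known path (a stand-in for the database).
    """
    moved = {}
    for candidate in candidates:
        old_path = index.get(file_checksum(candidate))
        # Only report a move if the previously indexed path is really gone.
        if old_path is not None and not old_path.exists():
            moved[old_path] = candidate
    return moved
```

Note that this version hashes every un-indexed candidate, which is exactly the cost that embedding the checksum in the file's metadata is meant to avoid.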

While I'm here, I'd also like to suggest an alternative to a SQL database. A lot of the ways tags interact with each other will be very hard to manage in SQL. My recommendation is ArangoDB, as it's multi-model (it lets you do key-value, but also document and graph). That said, Neo4j is a lot simpler but would still be a good fit for this: your data gets stored with relationships between things, so [mario >wearing> overalls] is a query you could make, but you could also search for mario, wearing, overalls, or any combination thereof. Actually making the tags for that might be a bit more complicated, since you now have to add all the relationships in order to search them, but there are a few good interfaces around to simplify that. Lastly, SurrealDB is a document database, but its special query language is basically SQL with some extra abstraction. I haven't really used SurrealDB much, so I can't weigh in too much on whether it's right for this, only that it would offer a lot more flexibility and doesn't have as much of a learning curve as the others might if people are already familiar with SQL.
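To illustrate the kind of relationship query described here, a small sketch in plain Python (not ArangoDB, Neo4j, or SurrealDB code). Relationships are stored as (subject, relation, object) triples, and `None` acts as a wildcard. The `match` helper and the sample data are made up for illustration.

```python
from typing import Optional

Triple = tuple[str, str, str]

# Hypothetical sample data: (subject, relation, object)
TRIPLES: list[Triple] = [
    ("mario", "wearing", "overalls"),
    ("mario", "wearing", "cap"),
    ("luigi", "wearing", "overalls"),
]


def match(
    triples: list[Triple],
    subject: Optional[str] = None,
    relation: Optional[str] = None,
    obj: Optional[str] = None,
) -> list[Triple]:
    """Return the triples that fit the pattern; None matches anything."""
    pattern = (subject, relation, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]


# The full relationship, or any combination of its parts:
print(match(TRIPLES, "mario", "wearing", "overalls"))
print(match(TRIPLES, obj="overalls"))
```

A graph database would index these relationships instead of scanning a list, but the query shape is the same.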

Also, tossing this in from one of the YouTube comments I saw: someone asked for sidecar files. While that's exactly what we're trying to avoid creating with this whole thing, I think having the option to export sidecar files could eventually be useful, and it shouldn't be too difficult to add (take the current tags for a selected file, punt them into a standard existing format like JSON, and create the sidecar file next to the original). I think this could help with some backup problems, or when moving the files to a new system which doesn't have TagStudio on it, such as when sharing with a team.
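A rough sketch of what such an export could look like, assuming tags are available as a plain list of strings. The `export_sidecar` function, the `.json` naming scheme, and the payload layout are all hypothetical and do not follow any existing sidecar standard.

```python
import json
from pathlib import Path


def export_sidecar(file_path: Path, tags: list[str]) -> Path:
    """Write the tags for `file_path` to a JSON file next to it."""
    # Hypothetical naming scheme: photo.jpg -> photo.jpg.json
    sidecar_path = file_path.with_name(file_path.name + ".json")
    payload = {"file": file_path.name, "tags": sorted(tags)}
    sidecar_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return sidecar_path
```

The original file is never opened or modified; only the new sidecar file is written.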

flodejr · 2024-09-03T10:54:34Z

I would suggest not a checksum but a unique ID added to the EXIF data or other metadata; adding the checksum to the file changes the file, which already makes that checksum invalid. Furthermore, if you had to calculate the checksum every time, consider how much computational resources you would need.
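The first point can be shown with a short sketch. It treats a file as a byte string and simulates a metadata write by appending a made-up `comment=` field, which is an assumption for illustration rather than any real metadata format.

```python
import hashlib

original = b"example file contents"
checksum = hashlib.sha256(original).hexdigest()

# Simulate writing the checksum into a metadata field by appending it to the data.
with_embedded = original + b"\ncomment=" + checksum.encode("ascii")
checksum_after = hashlib.sha256(with_embedded).hexdigest()

# The file no longer hashes to the checksum that was embedded in it.
print(checksum == checksum_after)
```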

As for the DB, I would suggest using Flask-SQLAlchemy. It uses Python classes to automatically create relations and tables in the native DB, which makes relations easier to understand and makes accessing data just as easy. Furthermore, you have the freedom to choose any native DB.

2 replies

bridgebrain Sep 8, 2024
Author

Ah, maybe I wasn't clear. Only on the initial touch of a file does it generate a checksum to use as the unique ID, just as a nice universal non-repeating id generator system. After that, we're no longer looking for the actual checksum of the file, but for the ID, which we'd store and hunt for. If a file doesn't have an ID, it generates a new one.

Any changes to the file would change the checksum, but not the ID. However, multiple copies of the same file in different locations would checksum to the same ID, which would help manage duplicates at the same time.
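A minimal sketch of that behavior, assuming the ID is stored somewhere it can be read back (file metadata or the database). `assign_id` is a hypothetical helper, and the byte strings stand in for file contents.

```python
import hashlib
from typing import Optional


def assign_id(contents: bytes, stored_id: Optional[str] = None) -> str:
    """Return the stored ID if there is one, otherwise derive one from the contents."""
    if stored_id is not None:
        return stored_id
    return hashlib.sha256(contents).hexdigest()


first_id = assign_id(b"version 1")                      # first touch: ID comes from the checksum
after_edit = assign_id(b"version 2", stored_id=first_id)  # edited later: the ID is kept
copy_id = assign_id(b"version 1")                       # untouched copy elsewhere: same ID

print(after_edit == first_id, copy_id == first_id)
```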

Ooo nice! I haven't explored that system at all, I'll have to add it to my tech queue.

Computerdores Sep 8, 2024
Collaborator

> just as a nice universal non-repeating id generator system

If that's the goal, it's better to just use a UUID; they can be generated in just one line (excluding the import):

```python
import uuid
print(uuid.uuid4())
```

Even if you generate 1000 of these each day, you would need $2.71 \cdot 10^{15}$ days (about 7.4 trillion years) just to have a 50% chance of finding a duplicate. On top of that, generating a UUID takes the same time no matter the file size and is most likely faster than computing a checksum.
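That estimate can be checked with the usual birthday approximation. This is a sketch that assumes the 122 random bits of a version-4 UUID and a 365.25-day year; the variable names are just for illustration.

```python
import math

RANDOM_BITS = 122  # a version-4 UUID has 122 random bits
PER_DAY = 1000     # UUIDs generated per day

# Birthday approximation: about sqrt(2 * N * ln 2) draws from N values
# give a 50% chance of at least one collision.
n_half = math.sqrt(2 * 2**RANDOM_BITS * math.log(2))
days = n_half / PER_DAY
years = days / 365.25

print(f"{days:.2e} days, {years:.2e} years")
```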

Leseratte10 · 2024-09-09T14:48:21Z

> Inject that into the metadata of the file (not certain if all file formats can hold metadata and have a common field like "comments"?).

I would definitely want such a feature to be disabled by default. I think it's great that TagStudio doesn't move, touch or modify the original files and I would want to keep it that way. I wouldn't want checksums to change or digital signatures of a file to become invalid just because I'm managing them in TagStudio.

Also, very few file formats allow for custom metadata like this. You'd need specialized parsers for each file format, and it may break some file formats. It's mainly just media files (video, audio, images) that support this kind of arbitrary metadata, and that metadata wasn't intended for random software to add its own IDs to a file.

1 reply

CyanVoxel Sep 9, 2024
Maintainer

> > Inject that into the metadata of the file (not certain if all file formats can hold metadata and have a common field like "comments"?).
>
> I would definitely want such a feature to be disabled by default. I think it's great that TagStudio doesn't move, touch or modify the original files and I would want to keep it that way. I wouldn't want checksums to change or digital signatures of a file to become invalid just because I'm managing them in TagStudio.
>
> Also, very few file formats allow for custom metadata like this. You'd need specialized parsers for each file format, and it may break some file formats. It's mainly just media files (video, audio, images) that support this kind of arbitrary metadata, and that metadata wasn't intended for random software to add its own IDs to a file.

This echoes my exact thoughts on this, couldn't have put it better ☝️
