Track inodes and use them for hard link comparisons on linux #535

Open
wants to merge 1 commit into
base: master

Conversation

KaibutsuX

This implements inode comparisons for accurate hard link detection on Linux.

I'm not sure what effect the ProtoMember changes have when adding a new field to the FileEntry struct.
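As a rough illustration of what inode-based comparison means, here is a minimal Python sketch. It is not the PR's code (the PR targets the C# project), and the helper name `group_hard_links` is hypothetical. It assumes a POSIX filesystem where two paths are hard links to the same file exactly when they share both device ID and inode number.

```python
import os
from collections import defaultdict


def group_hard_links(paths):
    """Group paths that share the same (device, inode) pair.

    Hypothetical helper for illustration only. os.lstat is used so that
    a symbolic link is identified by its own inode rather than by the
    inode of the file it points to.
    """
    groups = defaultdict(list)
    for path in paths:
        st = os.lstat(path)
        # Inode numbers are only unique per device, so key on both.
        groups[(st.st_dev, st.st_ino)].append(path)
    # Only groups with more than one path are hard links of each other.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Note that this only finds links among the paths it is given, which is the behavior discussed in the rest of this thread.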

@KaibutsuX
Author

@0x90d This PR is preferable to my other hard link PR because it doesn't rely on any external utilities like the find command.

@0x90d
Owner

0x90d commented Sep 23, 2024

Same as in your other PR. This seems to exclude files only when both the 'real' file and the 'link' file are part of the scan, but that is not how the hard link exclusion feature works in VDF. See the Windows implementation: it excludes files that are hard links even if the 'real' file is outside the scan directories.

@KaibutsuX
Author

I'm not sure I understand the purpose of the setting then. If I say scan this single folder, I assume I want to find duplicate videos in this folder.

It sounds like what you're describing is a separate setting of "Detect linked files outside of scan directories". When I look for hard links in the scanned directories, I am trying to exclude duplicates in the scanned folder.

Are hard/soft links the same on Windows and Linux?

My understanding:
"Soft" links: similar to Windows shortcuts. There is a reference file that holds the data, and multiple "soft" links can point to it. Deleting the reference file breaks the links, which are then invalid. Deleting the links leaves the reference file intact.
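The soft link behavior described above can be checked with a short Python sketch. It assumes a POSIX system where creating symbolic links is permitted; the function name `symlink_demo` and the file names are made up for illustration.

```python
import os
import tempfile


def symlink_demo():
    """Show that a symbolic link breaks when its target is deleted."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "reference.mp4")
        link = os.path.join(d, "link.mp4")
        with open(target, "wb") as f:
            f.write(b"video data")
        os.symlink(target, link)

        # The link is its own filesystem object with its own inode.
        different_inode = os.lstat(link).st_ino != os.stat(target).st_ino

        os.remove(target)
        still_a_link = os.path.islink(link)  # the link entry itself survives
        resolves = os.path.exists(link)      # but it no longer resolves
        return different_inode, still_a_link, resolves
```

Under these assumptions the link remains on disk after the target is removed, but following it fails, which matches the "invalid link" description.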

Hard links: I don't know what this means on Windows. On Linux, a hard link is like a shortcut, except that instead of pointing to the reference file it points to the underlying file data. Three hard links to a given file do not rely on each other; each simply points to the same inode in the filesystem. Renaming any one of them has no effect, since all three still point to the same data. Deleting two of the hard links has no effect on the third, and disk usage is unchanged because the data still remains.
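The hard link behavior described above can be sketched in Python as follows. This assumes a POSIX filesystem that supports hard links in the temporary directory; the function name `hard_link_demo` and the file names are hypothetical.

```python
import os
import tempfile


def hard_link_demo():
    """Create three names for one file, delete two, and inspect the third."""
    with tempfile.TemporaryDirectory() as d:
        a, b, c = (os.path.join(d, n) for n in ("a.mp4", "b.mp4", "c.mp4"))
        with open(a, "wb") as f:
            f.write(b"video data")
        os.link(a, b)  # second name for the same inode
        os.link(a, c)  # third name for the same inode

        same_inode = os.stat(a).st_ino == os.stat(b).st_ino == os.stat(c).st_ino
        links_before = os.stat(c).st_nlink

        # Removing two of the names leaves the data reachable via the third.
        os.remove(a)
        os.remove(b)
        links_after = os.stat(c).st_nlink
        with open(c, "rb") as f:
            data = f.read()
        return same_inode, links_before, links_after, data
```

None of the three names is the "real" file here: each is an equal directory entry for the same inode, and the link count simply records how many such entries exist.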

@0x90d
Owner

0x90d commented Oct 2, 2024

> I'm not sure I understand the purpose of the setting then.

The purpose is that any file which is a hard link is excluded. It doesn't matter where it links to.
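One way such "exclude any hard-linked file" semantics could be approximated on Linux is by checking the link count that `stat` reports, which needs no knowledge of where the other names live. The sketch below is an assumption for illustration only: it is not taken from VDF's Windows implementation, and the helper name `has_other_hard_links` is hypothetical.

```python
import os
import stat


def has_other_hard_links(path):
    """Return True if a regular file has more than one name (st_nlink > 1).

    Hypothetical helper. It does not locate the other names; it only
    reports that at least one exists somewhere on the same filesystem.
    """
    st = os.lstat(path)
    # Restrict to regular files: directories normally have st_nlink > 1
    # for unrelated reasons, and symlinks are not hard links.
    return stat.S_ISREG(st.st_mode) and st.st_nlink > 1
```

A consequence of this check is that every name of a multiply linked file is flagged, including the one a user might think of as the original, because Linux makes no distinction between them.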

@KaibutsuX
Author

Then that setting implicitly ignores the user's request to inspect only the given directories. It is also impossible to operate reliably within a given timeout, since there is no way of knowing how big the rest of the filesystem is. If Archive.org used this tool to find duplicate videos in a given folder (site) and turned on hard link detection, they would be scanning essentially the entire internet.
