New feature: Blacklisting the grouping of confirmed non-matching file pairs #438

Open
wants to merge 8 commits into
base: master

@whacklezz whacklezz commented Jul 14, 2023

Background
When dealing with large groups, the likelihood of an item being added to a group grows with the group's size. Similarity between false positives tends to create daisy chains that eventually merge into super-groups.

If we explicitly declare items as "not matching" (blacklisted), we can break these daisy chains and reduce them to smaller groups that are easier to manage. A simple daisy chain of similar items can look like this:

A-B-C-D-E-F-G-H

If we declare items D and E as not matching, they are forbidden from grouping together and we end up with this group configuration instead:

A-B-C-D
E-F-G-H

This is easier to compare visually, especially considering that groups have no upper limit on the number of members.

There may be instances where the members of a group are in reality more heavily interconnected through their similarities. In the example above, for instance, all of the members A-H could be similar to each other. In that case, severing a single link through blacklisting may do little except rearrange the items into two groups on the next scan (each containing anywhere from 2 to 6 members, depending on the order of comparison). This can still be pruned down through repeated blacklisting, but in the extreme case where every match is a false positive, it takes n*(n-1)/2 = 8*7/2 = 28 unique item pairs, or n*(n-1) = 8*7 = 56 entries when each pair is stored in both directions. In other words, large sets of internally similar false positives require on the order of n^2 blacklist entries.
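
To make the arithmetic concrete, here is a small Python sketch (not part of the PR) that counts the pairs for the eight-item example. The helper name `blacklist_cost` is made up for illustration, and the doubling assumes each pair is stored in both directions, as described under "Adding to the blacklist".

```python
from itertools import combinations

def blacklist_cost(n):
    """Return (unique_pairs, two_way_entries) for n mutually similar items.

    Hypothetical helper: assumes every pair must be blacklisted and that
    each pair is stored once per direction.
    """
    unique_pairs = n * (n - 1) // 2
    return unique_pairs, 2 * unique_pairs

items = "ABCDEFGH"
pairs = list(combinations(items, 2))  # every unordered pair of items

print(len(pairs))         # 28
print(blacklist_cost(8))  # (28, 56)
```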

Summary
Added scan-time blacklisting of the grouping of user-confirmed not-matching file pairs.
The files need to be scanned/rescanned in order to rearrange the already displayed groups after adding to the blacklist.
Tried to add inline comments to explain what each part of the code does.

Adding to the blacklist
Added a JSON file that keeps a dictionary of hashsets, one for each file that has a blacklisted pairing. The entries are two-way, so you can look up A and see that it blacklists B, and you can look up B and see that it blacklists A.
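
As an illustration of the data structure only, a two-way dictionary of sets with JSON persistence could look roughly like the Python sketch below. This is not the PR's code: the function names, the use of file names as keys, and the on-disk layout (each set written as a sorted list) are all assumptions.

```python
import json
from collections import defaultdict
from pathlib import Path

def add_pair(blacklist, a, b):
    """Record that a and b must not be grouped, in both directions."""
    if a == b:
        return
    blacklist[a].add(b)
    blacklist[b].add(a)

def is_blacklisted(blacklist, a, b):
    """True if b is in a's set; .get avoids creating empty entries."""
    return b in blacklist.get(a, ())

def save(blacklist, path):
    # JSON has no set type, so each set is written as a sorted list.
    data = {key: sorted(values) for key, values in blacklist.items()}
    Path(path).write_text(json.dumps(data, indent=2), encoding="utf-8")

def load(path):
    # A missing file simply yields an empty blacklist.
    blacklist = defaultdict(set)
    p = Path(path)
    if p.exists():
        for key, values in json.loads(p.read_text(encoding="utf-8")).items():
            blacklist[key] = set(values)
    return blacklist
```

Storing both directions doubles the number of entries but keeps the lookup for any item a single dictionary access.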

Initially the file does not exist. The entries are created from the right-click menu of the scan result. When you right-click an item, you find new options in the respective submenus:

  1. Selected Group -> To blacklist the selected item from matching with each of the other items in the group.
  2. Selected Group -> To blacklist each item in the group from matching with each of the other items in the group.
  3. Selected Item -> To blacklist each checkmarked item from matching with each of the other checkmarked items.
  4. Selected Item -> To blacklist each checkmarked item from matching with each of the other items in the same group.
  5. Selected Item -> To blacklist the selected item from matching with each of the checkmarked items.

When an option is selected, it adds the resulting entries to the corresponding hashsets in the dictionary file.
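
The five menu options reduce to three ways of producing item pairs. The Python sketch below shows one possible reading of them; the function names are hypothetical and the PR's actual handlers may differ. Each pair is a `frozenset`, so (A, B) and (B, A) count as the same pair.

```python
from itertools import combinations

def pairs_one_vs_many(item, others):
    """One item against every other item (options 1 and 5)."""
    return {frozenset((item, other)) for other in others if other != item}

def pairs_all_vs_all(items):
    """Every pair within a set of items (options 2 and 3)."""
    return {frozenset(pair) for pair in combinations(set(items), 2)}

def pairs_checked_vs_group(checked, group):
    """Each checked item against every other member of its group (option 4).

    Assumes all checked items belong to the given group; call once per group.
    """
    pairs = set()
    for item in checked:
        pairs |= pairs_one_vs_many(item, group)
    return pairs

group = ["A", "B", "C", "D"]
print(len(pairs_one_vs_many("A", group)))                # 3
print(len(pairs_all_vs_all(group)))                      # 6
print(len(pairs_checked_vs_group(["A", "B"], group)))    # 5
```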

Using the blacklist
When scanning/rescanning the files, the duplicate check runs as usual. The difference is that it also reads the blacklist dictionary and applies the rules described under "Blacklisting logic" when it compares items for similarity.

If it encounters an item with a blacklist, it checks the elements on the list to evaluate whether or not to group the candidate items. I have tried to comment on the logic to make it understandable.

New rescan mode
After adding items to the blacklist dictionary, the displayed list of duplicates will shrink due to groups collapsing.
You can now choose "Regroup" as a scan option, which tries to match up only the items that passed the similarity check in the previous scan. As the blacklist will prohibit grouping of certain items, the "regroup" scan will potentially create new permutations of groups in accordance with the blacklist.

Blacklisting logic
When comparing A and B and they are found to be similar:

  • First check whether A and B blacklist each other. If so, they cannot be grouped under any circumstances.
  • If neither is a member of an existing group:
    As they are not blacklisting each other, create a new group for A and B.
  • If A is a member of an existing group and B is not:
    We can't add B to the group if there is a blacklist entry between B and one of A's group members (and similarly the other way around).
    Note: We could potentially move A from its existing group into a new group with B, but that means there can be a disconnect inside A's group where an item is not similar to any remaining items. Not implemented.
  • If A and B are members of different existing groups:
    If A blacklists one of B's members AND B blacklists one of A's members, we cannot merge the groups
    If either A blacklists one of B's members OR B blacklists one of A's members, we still cannot merge the groups (again, could move item from one group to another, e.g. because of higher similarity, but not implemented due to complexity).
    If neither A nor B blacklists a group member of the other group, but a member of group A blacklists a member of group B, we cannot merge the groups. (Again, could move item from one group to another, but not implemented due to complexity)
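
The rules above can be sketched in Python as follows. This is a simplified illustration rather than the PR's implementation: `group_similar` and `blocked` are hypothetical names, the input is assumed to be a list of already-confirmed similar pairs, and the blacklist is assumed to be two-way. Under that assumption the three "different groups" sub-cases collapse into a single check: merge only if no member of one group blacklists a member of the other.

```python
def blocked(blacklist, item, group):
    """True if item has a blacklist entry against any member of group."""
    return any(member in blacklist.get(item, ()) for member in group)

def group_similar(similar_pairs, blacklist):
    """Group items from (a, b) similarity hits while honouring the blacklist."""
    group_of = {}  # item -> the shared set object that is its group
    for a, b in similar_pairs:
        if b in blacklist.get(a, ()):
            continue  # A and B blacklist each other: never group
        ga, gb = group_of.get(a), group_of.get(b)
        if ga is None and gb is None:
            # Neither is grouped yet: start a new group
            group_of[a] = group_of[b] = {a, b}
        elif gb is None:
            # Only A is grouped: B may join unless it conflicts with a member
            if not blocked(blacklist, b, ga):
                ga.add(b)
                group_of[b] = ga
        elif ga is None:
            # Only B is grouped: same check the other way around
            if not blocked(blacklist, a, gb):
                gb.add(a)
                group_of[a] = gb
        elif ga is not gb:
            # Different groups: merge only without any cross-group conflict
            if not any(blocked(blacklist, member, gb) for member in ga):
                ga |= gb
                for member in gb:
                    group_of[member] = ga
    unique = {id(g): g for g in group_of.values()}  # each group once
    return sorted(sorted(g) for g in unique.values())

chain = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
         ("E", "F"), ("F", "G"), ("G", "H")]
print(group_similar(chain, {"D": {"E"}, "E": {"D"}}))
# [['A', 'B', 'C', 'D'], ['E', 'F', 'G', 'H']]
```

With an empty blacklist the same chain collapses into a single group of eight, which is the daisy-chain behaviour described in the background.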

Some final thoughts
When you have very big groups, it can be hard to pinpoint which items inside the group have a high similarity. Two items inside the group can be very dissimilar, but be in the same group because each is similar to other elements in the group. Think of them as being on opposite ends of the daisy chain. Blacklisting the grouping of those two items does not help as much as blacklisting a false positive with a high similarity, but we don't know which is which.

A solution to this could be to add a link between the matching pair of items during grouping. The link could be used to highlight/navigate to/blacklist the group-internal match that caused the element to join the group in the first place. We could also count and display how many links each item is included in. Or even make a wizard for breaking up a group, that selects the most beneficial link, displays the two screenshots for comparison and asks "Are these duplicates (yes/no)". In any case this could allow us to surgically slice the set by targeting high-similarity false positives. For large scale false positives with a large degree of clustering overlap in the similarities (e.g. large collections of items with black bars), it would still require a lot of pruning to get anywhere.

Furthermore, a link between the items would enable us to analyze whether or not an item could be moved from one group to another as mentioned in the blacklisting logic. If no other item in the group has joined the group because of a comparison with the item in question, we could move the item out of the group and into a new one based on higher similarity, without worrying that we're leaving gaps in the existing group.
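
A rough Python sketch of this idea, which is not implemented in the PR, is shown below. It records for each item the existing member it matched when it joined a group. All names are hypothetical, and the sketch only covers single items joining a group, not merges between groups.

```python
from collections import Counter

def record_join(joined_via, new_item, existing_item):
    """Remember which existing member caused new_item to join the group."""
    joined_via[new_item] = existing_item

def link_counts(joined_via):
    """Count how many recorded links each item takes part in."""
    counts = Counter()
    for new_item, existing_item in joined_via.items():
        counts[new_item] += 1
        counts[existing_item] += 1
    return counts

def can_move_out(joined_via, item):
    """True if no other item joined the group through a match with item."""
    return item not in joined_via.values()

# Chain A-B-C: B joined via A, then C joined via B
joined_via = {}
record_join(joined_via, "B", "A")
record_join(joined_via, "C", "B")

print(can_move_out(joined_via, "C"))  # True: nothing depends on C
print(can_move_out(joined_via, "B"))  # False: C joined through B
```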

Targeting specific item links instead of the fuzzy group similarity would, in the approach that comes to mind, require expanding the DuplicateItem model class. I don't want to do that in this PR, as the scope has already gotten big.

Another observation is that when you encounter a pair of duplicates within a large group, they may split up again upon regrouping, if they are sufficiently similar to other items in the large group.
A possible solution here (other than dealing with them immediately, or pruning the groups down further) would be to add a "confirmed duplicates" repository that attempts to pre-group confirmed duplicates when they are part of the included set of file entries. This would require some thought to avoid introducing unnecessary processing time.

Other than that, I can safely say that the changes introduced here have already allowed me to successfully find a lot more duplicates hiding in mega groups as discussed in #306.

Performance-wise, I have not noticed any increase in scanning time. However, I assume that sufficiently large blacklist dictionaries require a bit of extra RAM for the in-memory processing. Tested with a 12 MB dictionary (120k entries) without issues.

Potential refactoring points are to split out the blacklist dictionary handler into its own class, and maybe consider rearranging the right-click menu entries into a dedicated submenu.
