GitHub

DESCRIPTION:

Cross-platform code to scan through an archive of documents (photos, codes, documents, etc.) and remove duplicates if any.

This code traverses through a directory and its subdirectories, identifies repeat files by matching their MD5 checksums and deletes the repeats after prompting the user.

Files can be filtered based on file name extensions. If filtering is used, only files belonging to an input list of extensions are checked for repeats.

As always, use with prudence as the deleted files cannot be recovered later in some OS!

One typical use is for removing duplicate photos from your archive.

TYPICAL USE-CASES:

Delete repeated/replicate photos

Remove multiple backup archives

Remove reduntant copies of log files

Streamline document archives

Manage code bases

INPUT:

rootDir : full directory to start the scan (use unix forward slash format)
           <string>
           
extList : Provide a list of string extensions for filtering files to process.
            use ["*.*"] to look for all files (default). See examples below.
            
caseIndependentMatch: <boolean> When set to True, file extension filtering
           is done independently. Default is False.

OUTPUT:

tuple of (result_df, summary_df, resultFlag)

result_df  : complete scan results of all files <pandas dataframe>

summary_df : scanned results agg by checksum <pandas dataframe>

resultFlag : 1 if all delete operations were completed, 0 if no deletes 
              were requested, -1 if deletes were done partially
              if -1 is returned, some of the deletes were not done <int>

USAGE:

import repeatFileChecker as rf

# search for repeats without filtering on file extensions
result_df, summary_df, resultFlag = rf.repeatFileChecker('./testimages', extList = ["*.*"], caseIndependentMatch=True)

# search for repeats only on .jpg files. case independent extension filtering
#  is performned. I.e. .jpg, .JPG, .JpG etc files are all searched. 
result_df, summary_df, resultFlag = rf.repeatFileChecker('./testimages', extList = [".jpg"], caseIndependentMatch=True)

# search for repeats only on .jpg files. case dependent search is performed
#  i.e. only .jpg files are searched. .JPG files are ignored 
result_df, summary_df, resultFlag = rf.repeatFileChecker('./testimages', extList = [".jpg"], caseIndependentMatch=False)

TESTING:

Use the photos in the testimages directory for testing this function.

CAVEATS:

If you select to delete the files at the prompt, make sure the files
are not read-only. Otherwise, the code throws an exception and exits.
The catch-try has been omitted by design.

Please notify author if you encounter an error on your OS (see caveats above before emailing).

@author: Balakumar B J, 2018

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
__pycache__		__pycache__
testimages		testimages
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md
repeatFileChecker.py		repeatFileChecker.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DESCRIPTION:

TYPICAL USE-CASES:

INPUT:

OUTPUT:

USAGE:

TESTING:

CAVEATS:

About

Releases

Packages

Languages

License

bbalasub1/repeatFileChecker

Folders and files

Latest commit

History

Repository files navigation

DESCRIPTION:

TYPICAL USE-CASES:

INPUT:

OUTPUT:

USAGE:

TESTING:

CAVEATS:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages