DupeRelink - Post-processes the output of CPIPMain.py to remove duplicate HTML files.

DupeRelink.py – Searches for HTML files whose contents are identical, writes a single copy into a common area and deletes the others. It then re-links every remaining HTML file that referenced the deleted files so that it points at the copy in the common area. This is a space-saving optimisation to run after CPIPMain.py has processed a directory of source files.

(CPIP36) $ python src/cpip/DupeRelink.py --help
usage: DupeRelink.py [-h] [-s SUBDIR] [-n] [-v] [-l LOGLEVEL] path

DupeRelink.py - Delete duplicate HTML files and relink them to save space. WARNING: This deletes in-place.
  Created by Paul Ross on 2017-09-26.
  Copyright 2017. All rights reserved.
  Licensed under GPL 2.0
USAGE

positional arguments:
  path                  Path to source directory. WARNING: This will be
                        rewritten in-place.

optional arguments:
  -h, --help            show this help message and exit
  -s SUBDIR, --subdir SUBDIR
                        Sub-directory for writing the common files. [default:
                        _common_html]
  -n, --nervous         Nervous mode, don't do anything but report what would
                        be done. Use -l20 to see detailed result. [default:
                        False]
  -v, --verbose         Verbose, lists duplicate files and sizes. [default:
                        False]
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Log Level (debug=10, info=20, warning=30, error=40,
                        critical=50) [default: 30]

Copy a single instance of each duplicated file to the common area, rewrite the links in that copy so they point back to the original location, then delete all the duplicates.

cpip.DupeRelink._get_hash_result(dir_path, file_glob)

Returns a dict of {hash : [file_path, ...], ...} from a root directory.
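The implementation is not reproduced here; as a rough sketch (the function name without the leading underscore and the choice of SHA-1 are assumptions, the real code in cpip may hash differently), building that dict might look like:

```python
import fnmatch
import hashlib
import os

def get_hash_result(dir_path, file_glob):
    """Walk dir_path and map a hash of each matching file's contents
    to the list of paths that share that content."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(dir_path):
        for name in fnmatch.filter(filenames, file_glob):
            fpath = os.path.join(dirpath, name)
            with open(fpath, 'rb') as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            # Files with identical contents collect under the same key.
            result.setdefault(digest, []).append(fpath)
    return result
```

Entries whose list holds more than one path are the duplicates that later steps operate on.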

cpip.DupeRelink._prepare_to_process(root_dir, file_glob)

Creates a dict of {hash : [file_path, ...], ...} containing only duplicated files.

cpip.DupeRelink._prune_hash_result(hash_result)

Prunes a dict of {hash : [file_path, ...], ...} to just those entries that have >1 file_path.
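Pruning is a simple filter over the hash dict. A minimal sketch (illustrative name, assuming the dict shape described above):

```python
def prune_hash_result(hash_result):
    """Keep only entries whose value lists more than one file path,
    i.e. only genuine duplicates."""
    return {
        digest: paths for digest, paths in hash_result.items()
        if len(paths) > 1
    }
```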

cpip.DupeRelink._replace_in_file(fpath, text_find, text_repl, nervous_mode, len_root_dir)

Reads the contents of the file at fpath, replaces text_find with text_repl and writes the result back to the same fpath.

In the directories where files have been deleted, rewrite the links to point at the common directory.
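A sketch of that in-place replacement, respecting nervous mode (the function name without the leading underscore is illustrative, and the use of `len_root_dir` to shorten the path for logging is an assumption about the real code):

```python
import logging

def replace_in_file(fpath, text_find, text_repl, nervous_mode, len_root_dir):
    """Read fpath, replace text_find with text_repl and write the file
    back in place. In nervous mode only report what would be done."""
    with open(fpath, 'r') as f:
        content = f.read()
    count = content.count(text_find)
    if count == 0:
        return 0
    # len_root_dir is assumed here to trim the root prefix so the log
    # message shows a path relative to the processed directory.
    logging.info('Replacing %d link(s) in %s', count, fpath[len_root_dir:])
    if not nervous_mode:
        with open(fpath, 'w') as f:
            f.write(content.replace(text_find, text_repl))
    return count
```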

cpip.DupeRelink.main()

Delete and relink common files.

cpip.DupeRelink.process(root_dir, sub_dir_for_common_files='_common_html', file_glob='*.html', nervous_mode=False, verbose=False)

Process a directory in-place by making a single copy of common files, deleting the rest and fixing the links.
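The overall flow can be sketched end to end. This is a simplified, self-contained illustration, not the cpip implementation: it names copies after their content hash, rewrites only same-basename `href` attributes, and omits nervous mode, logging and verbosity.

```python
import hashlib
import os
import shutil

def dedupe_html(root_dir, common_sub='_common_html'):
    """Copy one instance of each duplicated .html file into a common
    sub-directory, delete the duplicates, and rewrite links in the
    surviving files to point at the common copy."""
    # 1. Hash every .html file under root_dir.
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        if os.path.basename(dirpath) == common_sub:
            continue  # Skip the common area itself.
        for name in filenames:
            if name.endswith('.html'):
                fpath = os.path.join(dirpath, name)
                with open(fpath, 'rb') as f:
                    digest = hashlib.sha1(f.read()).hexdigest()
                hashes.setdefault(digest, []).append(fpath)
    # 2. For each duplicate set, copy one file to the common area and
    #    delete every original, remembering old -> new locations.
    common_dir = os.path.join(root_dir, common_sub)
    os.makedirs(common_dir, exist_ok=True)
    moved = {}  # Old absolute path -> path of the common copy.
    for digest, paths in hashes.items():
        if len(paths) < 2:
            continue
        common_path = os.path.join(common_dir, digest + '.html')
        shutil.copy2(paths[0], common_path)
        for path in paths:
            moved[path] = common_path
            os.remove(path)
    # 3. Rewrite links in the surviving files: an href naming a deleted
    #    file becomes a relative link to its common copy. (Matching on
    #    the bare basename is a simplification of real link rewriting.)
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith('.html'):
                continue
            fpath = os.path.join(dirpath, name)
            with open(fpath, 'r') as f:
                text = f.read()
            for old, new in moved.items():
                old_href = 'href="%s"' % os.path.basename(old)
                new_href = 'href="%s"' % os.path.relpath(new, dirpath)
                text = text.replace(old_href, new_href)
            with open(fpath, 'w') as f:
                f.write(text)
    return moved
```

As in the real tool, this rewrites the directory in place, so it should only ever be pointed at regenerable output.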