DupeRelink - Post-processes the output of CPIPMain.py to remove duplicate HTML files.

DupeRelink.py – Searches for HTML files whose contents are identical, writes a single copy into a common area and deletes the others. It then re-links every remaining HTML file that referenced the deleted files so that it points at the copy in the common area. This is a space-saving optimisation to run after CPIPMain.py has processed a directory of source files.

(CPIP36) $ python src/cpip/DupeRelink.py --help
usage: DupeRelink.py [-h] [-s SUBDIR] [-n] [-v] [-l LOGLEVEL] path

DupeRelink.py - Delete duplicate HTML files and relink them to save space. WARNING: This deletes in-place.
  Created by Paul Ross on 2017-09-26.
  Copyright 2017. All rights reserved.
  Licensed under GPL 2.0
USAGE

positional arguments:
  path                  Path to source directory. WARNING: This will be
                        rewritten in-place.

optional arguments:
  -h, --help            show this help message and exit
  -s SUBDIR, --subdir SUBDIR
                        Sub-directory for writing the common files. [default:
                        _common_html]
  -n, --nervous         Nervous mode, don't do anything but report what would
                        be done. Use -l20 to see detailed result. [default:
                        False]
  -v, --verbose         Verbose, lists duplicate files and sizes. [default:
                        False]
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Log Level (debug=10, info=20, warning=30, error=40,
                        critical=50) [default: 30]

Copy a single instance of each duplicated file to the common area, rewrite the links in that copy so they point back to the original location, then delete all the duplicates.

cpip.DupeRelink._get_hash_result(dir_path, file_glob)

Returns a dict of {hash : [file_path, ...], ...} from a root directory.
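The implementation is not reproduced here; as a rough sketch (the function name without the leading underscore and the choice of SHA-1 are assumptions, the real code in cpip may hash differently), building that dict might look like:

```python
import fnmatch
import hashlib
import os

def get_hash_result(dir_path, file_glob):
    """Walk dir_path and map a hash of each matching file's contents
    to the list of paths that share that content."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(dir_path):
        for name in fnmatch.filter(filenames, file_glob):
            fpath = os.path.join(dirpath, name)
            with open(fpath, 'rb') as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            # Files with identical contents collect under the same key.
            result.setdefault(digest, []).append(fpath)
    return result
```

Entries whose list holds more than one path are the duplicates that later steps operate on.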

cpip.DupeRelink._prepare_to_process(root_dir, file_glob)

Creates a dict of {hash : [file_path, ...], ...} containing only duplicated files.

cpip.DupeRelink._prune_hash_result(hash_result)

Prunes a dict of {hash : [file_path, ...], ...} to just those entries that have >1 file_path.
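Pruning is a simple filter over the hash dict. A minimal sketch (illustrative name, assuming the dict shape described above):

```python
def prune_hash_result(hash_result):
    """Keep only entries whose value lists more than one file path,
    i.e. only genuine duplicates."""
    return {
        digest: paths for digest, paths in hash_result.items()
        if len(paths) > 1
    }
```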

cpip.DupeRelink._replace_in_file(fpath, text_find, text_repl, nervous_mode, len_root_dir)

Reads the contents of the file at fpath, replaces text_find with text_repl and writes the result back to the same fpath.

In the directories where files have been deleted, rewrite the links to point at the common directory.
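A sketch of that in-place replacement, respecting nervous mode (the function name without the leading underscore is illustrative, and the use of `len_root_dir` to shorten the path for logging is an assumption about the real code):

```python
import logging

def replace_in_file(fpath, text_find, text_repl, nervous_mode, len_root_dir):
    """Read fpath, replace text_find with text_repl and write the file
    back in place. In nervous mode only report what would be done."""
    with open(fpath, 'r') as f:
        content = f.read()
    count = content.count(text_find)
    if count == 0:
        return 0
    # len_root_dir is assumed here to trim the root prefix so the log
    # message shows a path relative to the processed directory.
    logging.info('Replacing %d link(s) in %s', count, fpath[len_root_dir:])
    if not nervous_mode:
        with open(fpath, 'w') as f:
            f.write(content.replace(text_find, text_repl))
    return count
```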

cpip.DupeRelink.main()

Delete and relink common files.

cpip.DupeRelink.process(root_dir, sub_dir_for_common_files='_common_html', file_glob='*.html', nervous_mode=False, verbose=False)

Process a directory in-place by making a single copy of common files, deleting the rest and fixing the links.
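The overall flow can be sketched end to end. This is a simplified, self-contained illustration, not the cpip implementation: it names copies after their content hash, rewrites only same-basename `href` attributes, and omits nervous mode, logging and verbosity.

```python
import hashlib
import os
import shutil

def dedupe_html(root_dir, common_sub='_common_html'):
    """Copy one instance of each duplicated .html file into a common
    sub-directory, delete the duplicates, and rewrite links in the
    surviving files to point at the common copy."""
    # 1. Hash every .html file under root_dir.
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        if os.path.basename(dirpath) == common_sub:
            continue  # Skip the common area itself.
        for name in filenames:
            if name.endswith('.html'):
                fpath = os.path.join(dirpath, name)
                with open(fpath, 'rb') as f:
                    digest = hashlib.sha1(f.read()).hexdigest()
                hashes.setdefault(digest, []).append(fpath)
    # 2. For each duplicate set, copy one file to the common area and
    #    delete every original, remembering old -> new locations.
    common_dir = os.path.join(root_dir, common_sub)
    os.makedirs(common_dir, exist_ok=True)
    moved = {}  # Old absolute path -> path of the common copy.
    for digest, paths in hashes.items():
        if len(paths) < 2:
            continue
        common_path = os.path.join(common_dir, digest + '.html')
        shutil.copy2(paths[0], common_path)
        for path in paths:
            moved[path] = common_path
            os.remove(path)
    # 3. Rewrite links in the surviving files: an href naming a deleted
    #    file becomes a relative link to its common copy. (Matching on
    #    the bare basename is a simplification of real link rewriting.)
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith('.html'):
                continue
            fpath = os.path.join(dirpath, name)
            with open(fpath, 'r') as f:
                text = f.read()
            for old, new in moved.items():
                old_href = 'href="%s"' % os.path.basename(old)
                new_href = 'href="%s"' % os.path.relpath(new, dirpath)
                text = text.replace(old_href, new_href)
            with open(fpath, 'w') as f:
                f.write(text)
    return moved
```

As in the real tool, this rewrites the directory in place, so it should only ever be pointed at regenerable output.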