DupeRelink - Post processing the output of CPIPMain.py to remove duplicate HTML files.
DupeRelink.py searches for HTML files that are identical, writes a single copy of each into a common area and deletes all the others. It then rewrites the links in the remaining HTML files that pointed at the deleted files so that they point at the copy in the common area. This is a space-saving optimisation to run after CPIPMain.py has processed a directory of source files.
(CPIP36) $ python src/cpip/DupeRelink.py --help
usage: DupeRelink.py [-h] [-s SUBDIR] [-n] [-v] [-l LOGLEVEL] path

DupeRelink.py - Delete duplicate HTML files and relink them to save space. WARNING: This deletes in-place.

  Created by Paul Ross on 2017-09-26.
  Copyright 2017. All rights reserved.
  Licensed under GPL 2.0

USAGE

positional arguments:
  path                  Path to source directory. WARNING: This will be
                        rewritten in-place.

optional arguments:
  -h, --help            show this help message and exit
  -s SUBDIR, --subdir SUBDIR
                        Sub-directory for writing the common files. [default:
                        _common_html]
  -n, --nervous         Nervous mode, don't do anything but report what would
                        be done. Use -l20 to see detailed result. [default:
                        False]
  -v, --verbose         Verbose, lists duplicate files and sizes. [default:
                        False]
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Log Level (debug=10, info=20, warning=30, error=40,
                        critical=50) [default: 30]
cpip.DupeRelink._copy_delete_duplicates_fix_links(hash_result, common_dir, nervous_mode, len_root_dir)
    Copy a single file that is duplicated to the common area, rewrite the links in that copy to the original location then delete all duplicates.
cpip.DupeRelink._get_hash_result(dir_path, file_glob)
    Returns a dict of {hash : [file_path, ...], ...} from a root directory.
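As an illustration of the {hash : [file_path, ...], ...} mapping described above, here is a minimal sketch, not the CPIP implementation. The function name get_hash_result_sketch, the use of SHA-256 and the recursive os.walk traversal are all assumptions; the source does not say which hash or traversal is used.

```python
import fnmatch
import hashlib
import os
from collections import defaultdict


def get_hash_result_sketch(dir_path, file_glob="*.html"):
    """Return {hex_digest: [file_path, ...]} for matching files under dir_path."""
    hash_result = defaultdict(list)
    for parent, _dirs, file_names in os.walk(dir_path):
        # Keep only the names that match the glob, in a stable order.
        for name in sorted(fnmatch.filter(file_names, file_glob)):
            file_path = os.path.join(parent, name)
            with open(file_path, "rb") as f:
                # SHA-256 is an assumption; the source does not name the hash.
                digest = hashlib.sha256(f.read()).hexdigest()
            hash_result[digest].append(file_path)
    return dict(hash_result)
```

Files with byte-identical content end up in the same list, so any list longer than one entry is a set of duplicates.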
cpip.DupeRelink._prepare_to_process(root_dir, file_glob)
    Create a dict {hash : [file_paths, ...], ...} for duplicated files.
cpip.DupeRelink._prune_hash_result(hash_result)
    Prunes a dict of {hash : [file_path, ...], ...} to just those entries that have >1 file_path.
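The pruning step described above can be sketched in one expression. The name prune_hash_result_sketch is hypothetical, and returning a new dict rather than modifying the argument in place is an assumption; the source does not say which the real function does.

```python
def prune_hash_result_sketch(hash_result):
    """Keep only the hashes that map to more than one file path."""
    # Returns a new dict; the input mapping is left untouched (an assumption).
    return {h: paths for h, paths in hash_result.items() if len(paths) > 1}
```

For example, given {"aa": ["one.html"], "bb": ["two.html", "three.html"]} only the "bb" entry survives, since "aa" has no duplicate.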
cpip.DupeRelink._replace_in_file(fpath, text_find, text_repl, nervous_mode, len_root_dir)
    Reads the contents of the file at fpath, replaces text_find with text_repl and writes it back out to the same fpath.
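A minimal sketch of an in-place find-and-replace that honours a nervous (dry-run) flag might look like the following. The name replace_in_file_sketch, the UTF-8 encoding and the boolean return value are assumptions, and the len_root_dir parameter is omitted because the source does not describe its role.

```python
def replace_in_file_sketch(fpath, text_find, text_repl, nervous_mode=False):
    """Replace text_find with text_repl in the file at fpath.

    Returns True if the content would change. In nervous mode the file
    is never written.
    """
    # UTF-8 is an assumption about how the HTML files are encoded.
    with open(fpath, encoding="utf-8") as f:
        original = f.read()
    updated = original.replace(text_find, text_repl)
    changed = updated != original
    if changed and not nervous_mode:
        with open(fpath, "w", encoding="utf-8") as f:
            f.write(updated)
    return changed
```

Reading the whole file before opening it for writing avoids truncating the file if the read fails part way through.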
cpip.DupeRelink._rewrite_links_where_files_deleted(root_dir, sub_dir_for_common_files, nervous_mode, hash_result, len_root_dir)
    In the directories where we have deleted files rewrite the links to the common directory.
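When a link is rewritten to point into the common directory, the new target has to be expressed relative to the file that contains the link. The sketch below shows one way to compute such a target; the helper name relative_link_sketch and the example file names are hypothetical, and the source does not say how DupeRelink itself names the common files or builds the links.

```python
import os


def relative_link_sketch(linking_file, common_file):
    """Return an href from linking_file to common_file, using '/' separators."""
    rel = os.path.relpath(common_file, start=os.path.dirname(linking_file))
    # HTML links use forward slashes regardless of the host OS.
    return rel.replace(os.sep, "/")


# Hypothetical layout: a page in out/sub linking to a file in out/_common_html.
link = relative_link_sketch(
    os.path.join("out", "sub", "index.html"),
    os.path.join("out", "_common_html", "abc.html"),
)
```

The resulting string could then be substituted for the old link text in each file that referred to a deleted duplicate.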
cpip.DupeRelink.main()
    Delete and relink common files.
cpip.DupeRelink.process(root_dir, sub_dir_for_common_files='_common_html', file_glob='*.html', nervous_mode=False, verbose=False)
    Process a directory in-place by making a single copy of common files, deleting the rest and fixing the links.