My initial requirement is to track/find changes in directories over time, without duplicating them and without needing direct access to them. For background: I need this to help me compute deltas for an incremental backup module (incremental-backups-tools).
(If you want to compare local directories, you should take a look at the filecmp module.)
My approach is to take a "snapshot" / compute an "index" of a directory which will contain:
- list of files
- list of subdirectories
- a mapping filename => last modified time or the file hash
You can easily dump it as JSON.
Later, when you want to compute changes, you just need to compare the previous index with the current one and track:
- deleted files
- new files
- updated files
- deleted subdirectories
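As mentioned, such an index dumps easily to JSON. Here is a minimal sketch of the round-trip with the standard json module; the snapshot contents are hypothetical, but the layout (files, subdirs, index) matches what we build below:

```python
import json

# Hypothetical hand-made snapshot, same layout as the index built below:
# a list of files, a list of subdirs, and a filepath => mtime mapping.
snapshot = {
    'files': ['file1', 'subdir1/file3'],
    'subdirs': ['subdir1'],
    'index': {'file1': 1404161120.0, 'subdir1/file3': 1404161121.0},
}

dumped = json.dumps(snapshot)   # store this anywhere (file, database...)
restored = json.loads(dumped)   # later, reload it to compute changes
```

Since the index only holds strings, lists, and floats, the round-trip is lossless.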
I create a directory for testing:
!rm -rf /tmp/test_tracking
!mkdir /tmp/test_tracking
!mkdir /tmp/test_tracking/subdir1
!mkdir /tmp/test_tracking/subdir2
!echo content1 > /tmp/test_tracking/file1
!echo content2 > /tmp/test_tracking/file2
!echo content3 > /tmp/test_tracking/subdir1/file3
!echo content4 > /tmp/test_tracking/subdir2/file4
!ls -lR /tmp/test_tracking/
Files and subdirs
First, we need to get a list of files and subdirectories relative to the root; using os.walk is the easiest way.
import os
path = '/tmp/test_tracking/'
files = []
subdirs = []
for root, dirs, filenames in os.walk(path):
    for subdir in dirs:
        subdirs.append(os.path.relpath(os.path.join(root, subdir), path))
    for f in filenames:
        files.append(os.path.relpath(os.path.join(root, f), path))
We check the result:
files
subdirs
Create an index
Once we have the list of files and subdirectories, we need to create the actual index, which will help us detect updated files.
There are two ways to check if a file has been modified that meet the requirement: the slow way, computing a hash (md5, sha256...), and the fast way, relying on the last modified time (even rsync relies on the last modified time by default).
Last modified time
Using os.path:
file_mtime = os.path.getmtime(os.path.join(path, files[0]))
We can convert it easily to a datetime object:
from datetime import datetime
file_mtime, datetime.fromtimestamp(file_mtime)
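To see the mtime comparison at work, here is a minimal, self-contained sketch using a temporary file. It uses os.utime to force a newer timestamp, since a real edit landing within the same clock tick might not change the mtime on coarse-grained filesystems:

```python
import os
import tempfile

fd, tmp = tempfile.mkstemp()
os.close(fd)

before = os.path.getmtime(tmp)
# Simulate a later modification by bumping the mtime one second forward
os.utime(tmp, (before + 1, before + 1))
after = os.path.getmtime(tmp)

assert after != before  # the mtime comparison flags the file as updated
os.remove(tmp)
```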
Sha256 hash
Using hashlib:
import hashlib
def filehash(filepath, blocksize=4096):
    """ Return the hash hexdigest for the file `filepath`, processing the file
    in chunks of `blocksize` bytes.

    :type filepath: str
    :param filepath: Path to file

    :type blocksize: int
    :param blocksize: Size of each chunk when processing the file

    """
    sha = hashlib.sha256()
    with open(filepath, 'rb') as fp:
        while True:
            data = fp.read(blocksize)
            if not data:
                break
            sha.update(data)
    return sha.hexdigest()
file_hash = filehash(os.path.join(path, files[0]))
file_hash
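As a quick sanity check, the chunked digest should match hashing the whole content in one shot. This self-contained sketch inlines the same read loop as filehash above, on a temporary file larger than the block size so the loop runs several times:

```python
import hashlib
import os
import tempfile

payload = b'x' * 10000  # larger than blocksize, so several chunks are read

fd, tmp = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as fp:
    fp.write(payload)

# Chunked digest, same loop as filehash with blocksize=4096
sha = hashlib.sha256()
with open(tmp, 'rb') as fp:
    while True:
        data = fp.read(4096)
        if not data:
            break
        sha.update(data)
chunked = sha.hexdigest()

# One-shot digest over the same bytes
one_shot = hashlib.sha256(payload).hexdigest()
os.remove(tmp)
```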
Since I don't want to spend hours computing changes, I often choose the mtime. We can build an index, basically a dictionary, that maps each filepath (relative to its own root) to its last modified time.
index = {}
for f in files:
    index[f] = os.path.getmtime(os.path.join(path, f))
index[files[0]]
Final index
So to be able to find differences between directories we need:
- The list of files
- A list of every subdirectory
- And an index (filepath => last mtime or hash)
import os

def compute_dir_index(path):
    """ Return a dict containing:
    - the list of files (relative to path)
    - the list of subdirs (relative to path)
    - a dict: filepath => last modified time
    """
    files = []
    subdirs = []
    for root, dirs, filenames in os.walk(path):
        for subdir in dirs:
            subdirs.append(os.path.relpath(os.path.join(root, subdir), path))
        for f in filenames:
            files.append(os.path.relpath(os.path.join(root, f), path))
    index = {}
    for f in files:
        index[f] = os.path.getmtime(os.path.join(path, f))
    return dict(files=files, subdirs=subdirs, index=index)
diff = compute_dir_index(path)
Computing changes
I make some changes and compute a new snapshot:
!rm -rf /tmp/test_tracking/subdir2
!echo newcontent1 > /tmp/test_tracking/file1
!echo content5 > /tmp/test_tracking/file5
!ls -lR /tmp/test_tracking/
diff2 = compute_dir_index(path)
Computing the difference is easy; we actually only need a few things:
- Deleted files (just a set subtraction)
- New files (just a set subtraction)
- Updated files (thanks to the index)
- Deleted subdirectories
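The set subtractions above can be sketched on their own with hypothetical old/new file lists, before wrapping them in a function:

```python
# Hypothetical file lists from an old and a new snapshot
old_files = {'file1', 'file2', 'subdir2/file4'}
new_files = {'file1', 'file2', 'file5'}

deleted = old_files - new_files  # files only in the old snapshot
created = new_files - old_files  # files only in the new snapshot
common = old_files & new_files   # candidates for the mtime/hash comparison
```

Only the files present in both snapshots need the (potentially costly) index comparison.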
Here is the compute_diff function:
def compute_diff(dir_base, dir_cmp):
    """ Compare `dir_base` (the newer index) against `dir_cmp` (the older one). """
    data = {}
    data['deleted'] = list(set(dir_cmp['files']) - set(dir_base['files']))
    data['created'] = list(set(dir_base['files']) - set(dir_cmp['files']))
    data['updated'] = []
    data['deleted_dirs'] = list(set(dir_cmp['subdirs']) - set(dir_base['subdirs']))
    for f in set(dir_cmp['files']).intersection(set(dir_base['files'])):
        if dir_base['index'][f] != dir_cmp['index'][f]:
            data['updated'].append(f)
    return data
We can check with our previously created diffs:
compute_diff(diff2, diff)
Dirtools
I recently added this feature in Dirtools, pull requests and feedback are welcome!
from dirtools import Dir, DirState
d = Dir(path)
dir_state = DirState(d)
state_file = dir_state.to_json()
# Later... after some changes
dir_state = DirState.from_json(state_file)
dir_state2 = DirState(d)
changes = dir_state2 - dir_state
Your feedback
Please don't hesitate to reach out if you have any questions, ideas, or improvements!