Tracking Changes in Directories with Python

My initial requirement was to track changes in a directory over time, without keeping a duplicate of it or needing direct access to it. For background, I need this to compute deltas for an incremental backup module (incremental-backups-tools).

(If you want to compare local directories, you should take a look at the filecmp module.)
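
For reference, here is a minimal sketch of that local comparison with filecmp.dircmp from the standard library. The two temporary directories, the file names and the write helper are made up for the illustration. By default the comparison is shallow, so it looks at file type, size and modification time before reading any content.

```python
import filecmp
import os
import tempfile


def write(directory, name, content):
    """Create a small text file inside `directory` (illustration helper)."""
    with open(os.path.join(directory, name), "w") as fp:
        fp.write(content)


# Two throwaway directories standing in for "before" and "after".
before = tempfile.mkdtemp()
after = tempfile.mkdtemp()

write(before, "file1", "content1\n")
write(before, "file2", "content2\n")
write(after, "file1", "newcontent1\n")  # same name, different size
write(after, "file5", "content5\n")

cmp_result = filecmp.dircmp(before, after)
only_before = sorted(cmp_result.left_only)   # entries missing from `after`
only_after = sorted(cmp_result.right_only)   # entries missing from `before`
changed = sorted(cmp_result.diff_files)      # in both, but different

print(only_before, only_after, changed)
```

This only works when both directories are reachable at the same time, which is exactly the constraint the rest of this article avoids.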

My approach is to take a "snapshot" (compute an "index") of a directory, which will contain:

  • list of files
  • list of subdirectories
  • a mapping filename => last modified time or the file hash

You can easily dump it as JSON.
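
As a rough sketch of what that looks like, the snippet below round-trips a snapshot through the json module. The snapshot values are invented for the example; only the three keys mirror the list above.

```python
import json

# Hypothetical snapshot with the three parts listed above.
snapshot = {
    "files": ["file1", "subdir1/file3"],
    "subdirs": ["subdir1"],
    "index": {"file1": 1386881105.388609, "subdir1/file3": 1386881105.392609},
}

# Serialize to a string that can be stored anywhere (file, database, remote).
serialized = json.dumps(snapshot, indent=2, sort_keys=True)

# Later: load it back and get an equivalent structure.
restored = json.loads(serialized)

print(restored == snapshot)
```

Since the snapshot only holds lists, strings and floats, nothing needs a custom encoder.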

Later, when you want to compute changes, you just need to compare the previous index with the current one and track:

  • deleted files
  • new files
  • updated files
  • deleted subdirectories

I create a directory for testing:

In [1]:
!rm -rf /tmp/test_tracking
!mkdir /tmp/test_tracking
!mkdir /tmp/test_tracking/subdir1
!mkdir /tmp/test_tracking/subdir2
!echo content1 > /tmp/test_tracking/file1
!echo content2 > /tmp/test_tracking/file2
!echo content3 > /tmp/test_tracking/subdir1/file3
!echo content4 > /tmp/test_tracking/subdir2/file4
!ls -lR /tmp/test_tracking/
/tmp/test_tracking/:
total 16
-rw-rw-r-- 1 thomas thomas    9 déc.  12 21:45 file1
-rw-rw-r-- 1 thomas thomas    9 déc.  12 21:45 file2
drwxrwxr-x 2 thomas thomas 4096 déc.  12 21:45 subdir1
drwxrwxr-x 2 thomas thomas 4096 déc.  12 21:45 subdir2

/tmp/test_tracking/subdir1:
total 4
-rw-rw-r-- 1 thomas thomas 9 déc.  12 21:45 file3

/tmp/test_tracking/subdir2:
total 4
-rw-rw-r-- 1 thomas thomas 9 déc.  12 21:45 file4

Files and subdirectories

First, we need a list of files and subdirectories relative to the root. Using os.walk is the easiest way.

In [2]:
import os

path = '/tmp/test_tracking/'
files = []
subdirs = []

for root, dirs, filenames in os.walk(path):
    for subdir in dirs:
        subdirs.append(os.path.relpath(os.path.join(root, subdir), path))

    for f in filenames:
        files.append(os.path.relpath(os.path.join(root, f), path))

We check the result:

In [3]:
files
Out[3]:
['file2', 'file1', 'subdir2/file4', 'subdir1/file3']
In [4]:
subdirs
Out[4]:
['subdir2', 'subdir1']
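
Note that os.walk returns entries in whatever order the operating system lists them, which is why file2 shows up before file1 above. This does not matter for the set comparisons used later, but if you want snapshots that look the same from one run to the next, sorting is enough. A small sketch, where list_tree and the temporary directory are names made up for this example:

```python
import os
import tempfile


def list_tree(path):
    """Return (files, subdirs) relative to `path`, both sorted."""
    files = []
    subdirs = []
    for root, dirs, filenames in os.walk(path):
        for subdir in dirs:
            subdirs.append(os.path.relpath(os.path.join(root, subdir), path))
        for f in filenames:
            files.append(os.path.relpath(os.path.join(root, f), path))
    return sorted(files), sorted(subdirs)


# Throwaway directory mirroring the test layout used in this article.
demo_root = tempfile.mkdtemp()
for subdir in ("subdir1", "subdir2"):
    os.mkdir(os.path.join(demo_root, subdir))
for name in ("file1", "file2",
             os.path.join("subdir1", "file3"),
             os.path.join("subdir2", "file4")):
    with open(os.path.join(demo_root, name), "w") as fp:
        fp.write("content\n")

demo_files, demo_subdirs = list_tree(demo_root)
print(demo_files)
print(demo_subdirs)
```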

Create an index

Once we have the list of files and subdirectories, we need to create the actual index, which will help us detect updated files.

There are two ways to check whether a file has been modified that meet the requirement. The slow way is to compute a hash (md5, sha256...); the fast way is to rely on the last modified time (even rsync relies on size and last modified time by default).

Last modified time

Using os.path:

In [5]:
file_mtime = os.path.getmtime(os.path.join(path, files[0]))

We can convert it easily to a datetime object:

In [6]:
from datetime import datetime
file_mtime, datetime.fromtimestamp(file_mtime)
Out[6]:
(1386881105.388609, datetime.datetime(2013, 12, 12, 21, 45, 5, 388609))

Sha256 hash

Using hashlib:

In [7]:
import hashlib

def filehash(filepath, blocksize=4096):
    """ Return the hash hexdigest for the file `filepath', processing the file
    by chunk of `blocksize'.

    :type filepath: str
    :param filepath: Path to file

    :type blocksize: int
    :param blocksize: Size of the chunk when processing the file

    """
    sha = hashlib.sha256()
    with open(filepath, 'rb') as fp:
        # Read fixed-size chunks until read() returns an empty result (EOF).
        while True:
            data = fp.read(blocksize)
            if not data:
                break
            sha.update(data)
    return sha.hexdigest()
In [8]:
file_hash = filehash(os.path.join(path, files[0]))
In [9]:
file_hash
Out[9]:
'e0763097d2327a89fb7fc6a1fad40f87d2261dcdd6c09e65ee00b200a0128e1c'
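
To make the trade-off between the two methods concrete, here is a small sketch of a case where they disagree: the content changes, but the modification time is put back to its previous value (something a restore or copy tool may do). The sha256_of helper and the temporary file are only there for the illustration, and the ns argument of os.utime assumes Python 3.3 or later.

```python
import hashlib
import os
import tempfile


def sha256_of(filepath):
    """Hash a small file in one go (fine for a demo, not for big files)."""
    with open(filepath, "rb") as fp:
        return hashlib.sha256(fp.read()).hexdigest()


fd, demo_file = tempfile.mkstemp()
os.close(fd)

with open(demo_file, "w") as fp:
    fp.write("content1\n")

stat_before = os.stat(demo_file)
hash_before = sha256_of(demo_file)

# Change the content, then restore the original access/modification times.
with open(demo_file, "w") as fp:
    fp.write("content2\n")
os.utime(demo_file, ns=(stat_before.st_atime_ns, stat_before.st_mtime_ns))

same_mtime = os.stat(demo_file).st_mtime_ns == stat_before.st_mtime_ns
same_hash = sha256_of(demo_file) == hash_before
print(same_mtime, same_hash)
```

An mtime-based index would miss this change, while a hash-based one would catch it at the cost of reading every file.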

Since I don't want to spend hours computing changes, I usually choose the mtime. We can build an index, basically a dictionary, that maps each file path (relative to the root) to its last modified time.

In [10]:
index = {}
for f in files:
    index[f] = os.path.getmtime(os.path.join(path, f))
In [11]:
index[files[0]]
Out[11]:
1386881105.388609
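
If accuracy matters more than speed, the same kind of index can map each file to its hash instead. A minimal sketch under that assumption: build_hash_index is a name invented for this example, and it reads files in chunks the same way filehash does.

```python
import hashlib
import os
import tempfile


def build_hash_index(path, files, blocksize=4096):
    """Map each relative file path to the SHA-256 hexdigest of its content."""
    index = {}
    for f in files:
        sha = hashlib.sha256()
        with open(os.path.join(path, f), "rb") as fp:
            # iter() with a sentinel stops when read() returns b"" (EOF).
            for chunk in iter(lambda: fp.read(blocksize), b""):
                sha.update(chunk)
        index[f] = sha.hexdigest()
    return index


# Throwaway directory with two files, just to exercise the function.
hash_root = tempfile.mkdtemp()
for name, content in (("file1", b"content1\n"), ("file2", b"content2\n")):
    with open(os.path.join(hash_root, name), "wb") as fp:
        fp.write(content)

hash_index = build_hash_index(hash_root, ["file1", "file2"])
print(hash_index["file1"] != hash_index["file2"])
```

The resulting dict has the same shape as the mtime index, so the comparison logic does not need to know which of the two was used.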

Final index

So, to be able to find differences between directories, we need:

  • The list of files
  • A list of every subdirectory
  • An index (last mtime or hash)
In [12]:
import os

def compute_dir_index(path):
    """ Return a dict containing:
    - files: list of files (relative to path)
    - subdirs: list of subdirs (relative to path)
    - index: a dict mapping filepath => last modified time
    """
    files = []
    subdirs = []

    for root, dirs, filenames in os.walk(path):
        for subdir in dirs:
            subdirs.append(os.path.relpath(os.path.join(root, subdir), path))

        for f in filenames:
            files.append(os.path.relpath(os.path.join(root, f), path))
        
    index = {}
    for f in files:
        index[f] = os.path.getmtime(os.path.join(path, f))

    return dict(files=files, subdirs=subdirs, index=index)

diff = compute_dir_index(path)

Computing changes

I make some changes and compute a new index:

In [13]:
!rm -rf /tmp/test_tracking/subdir2
!echo newcontent1 > /tmp/test_tracking/file1
!echo content5 > /tmp/test_tracking/file5
!ls -lR /tmp/test_tracking/
/tmp/test_tracking/:
total 16
-rw-rw-r-- 1 thomas thomas   12 déc.  12 21:45 file1
-rw-rw-r-- 1 thomas thomas    9 déc.  12 21:45 file2
-rw-rw-r-- 1 thomas thomas    9 déc.  12 21:45 file5
drwxrwxr-x 2 thomas thomas 4096 déc.  12 21:45 subdir1

/tmp/test_tracking/subdir1:
total 4
-rw-rw-r-- 1 thomas thomas 9 déc.  12 21:45 file3

In [14]:
diff2 = compute_dir_index(path)

Computing the difference is easy; we only need a few things:

  • Deleted files (just a set subtraction)
  • New files (just a set subtraction)
  • Updated files (thanks to the index)
  • Deleted subdirectories

Here is the compute_diff function:

In [15]:
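def compute_diff(dir_base, dir_cmp):
    """ Compare two directory indexes, where `dir_base' is the newer
    index and `dir_cmp' is the older one it is compared against.
    """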
    data = {}
    data['deleted'] = list(set(dir_cmp['files']) - set(dir_base['files']))
    data['created'] = list(set(dir_base['files']) - set(dir_cmp['files']))
    data['updated'] = []
    data['deleted_dirs'] = list(set(dir_cmp['subdirs']) - set(dir_base['subdirs']))

    for f in set(dir_cmp['files']).intersection(set(dir_base['files'])):
        if dir_base['index'][f] != dir_cmp['index'][f]:
            data['updated'].append(f)

    return data

We can check with our two previously computed indexes:

In [16]:
compute_diff(diff2, diff)
Out[16]:
{'created': ['file5'],
 'deleted': ['subdir2/file4'],
 'deleted_dirs': ['subdir2'],
 'updated': ['file1']}

Dirtools

I recently added this feature to Dirtools; pull requests and feedback are welcome!

In [19]:
from dirtools import Dir, DirState

d = Dir(path)
dir_state = DirState(d)

state_file = dir_state.to_json()

# Later... after some changes

dir_state = DirState.from_json(state_file)
dir_state2 = DirState(d)

changes = dir_state2 - dir_state

Your feedback

Please don't hesitate to get in touch if you have any questions, ideas, or improvements!

© Thomas Sileo. Powered by Pelican and hosted by DigitalOcean.