NMC Tcl Extension

n-Depth Mean Compare (NMC) is an algorithm and Tcl extension for gauging the similarity of two blocks of binary data, which may differ in length. Given a depth (an integer greater than 0) and two blocks of data, it returns a floating-point number indicating how similar the two blocks are. The closer the value is to zero, the more similar the blocks are deemed to be. A value of exactly 0 means the blocks are deemed identical.


Building and Installing

Running “make” will build and test the extension. You can install with “make install”. The default prefix is /usr and the default lib path is ${PREFIX}/lib. You can override either of these like so:

make install PREFIX=$HOME

Walkthrough of Comparisons

Let's walk through some comparisons using nmc.

When using the Tcl extension, an n-depth of 0 means 'use the largest valid value for n'.

% # Load the package and import the nmc command
% package require nmc
% namespace import nmc::nmc
% # Simple equality test of identical strings
% nmc 0 asdf asdf
% # Zero means identical.
% # If we compare slightly different strings of the same length:
% nmc 0 asdd asdf
% # We get a value close to zero. How about a string that is a substring of the other?
% nmc 0 asdff asdf
% # Also note that comparisons are commutative, like you'd expect:
% nmc 0 asdf asdff
% # 'b' is closer to 'a' than 'z' is:
% nmc 0 asda asdb
% nmc 0 asda asdz
% # 'asdf' and 'asdfasdf' are more dissimilar than 'asdfasdf' and 'asdfasdfasdf'
% nmc 0 asdf asdfasdf
% nmc 0 asdfasdf asdfasdfasdf
% # The Tcl extension can also take a percentage as its n argument, which is interpreted as a percentage of the maximum valid n.
% # Get a couple of large, equal-sized random strings
% set fh [open /dev/urandom r]
% # Read raw bytes, with no encoding or end-of-line translation
% fconfigure $fh -translation binary
% set rand1 [read $fh [expr {0x1000}]]; set rand2 [read $fh [expr {0x1000}]]; close $fh
% # Depending on use case, we may wish to see purely random data as similar.
% # Percent-n is useful for this. Comparison using max-valid n:
% nmc 0 $rand1 $rand2
% # And using 1% of the max-valid n:
% nmc 1% $rand1 $rand2

Shortcomings

The extension requires both blocks of data to be loaded into memory in full, so very large files cannot be compared directly. A crude workaround is to break the files into chunks and compare the chunks. Eventually, an nmc for file streams (C-based) and channels (Tcl-based) will be implemented.
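
The chunking workaround could be sketched roughly as follows. This is only an illustration, not part of the extension: compareChunks, its arguments, and the 64 KiB default chunk size are hypothetical names and values chosen for the example, and the comparison command is passed in as a command prefix (for instance {nmc 0} once the package is loaded). It assumes Tcl 8.5 or later for the {*} syntax.

```tcl
# Hypothetical helper (not provided by the nmc package): compare two files
# chunk by chunk so that neither file has to be held in memory in full.
#   cmpCmd    - command prefix taking two blocks and returning a number,
#               e.g. {nmc 0} after "package require nmc"
#   chunkSize - bytes read from each file per comparison (assumed default)
# Returns a two-element list: the per-chunk scores, and a flag that is 1
# when one file still had data after the other was exhausted.
proc compareChunks {cmpCmd pathA pathB {chunkSize 65536}} {
    set fa [open $pathA r]
    set fb [open $pathB r]
    # Read raw bytes, with no encoding or end-of-line translation.
    fconfigure $fa -translation binary
    fconfigure $fb -translation binary
    set scores {}
    set leftover 0
    while {1} {
        set a [read $fa $chunkSize]
        set b [read $fb $chunkSize]
        if {[string length $a] == 0 || [string length $b] == 0} {
            # At least one file is exhausted; note whether the other is not.
            set leftover [expr {[string length $a] > 0 || [string length $b] > 0}]
            break
        }
        lappend scores [{*}$cmpCmd $a $b]
    }
    close $fa
    close $fb
    return [list $scores $leftover]
}

# Possible usage, assuming the nmc command has been imported:
#   lassign [compareChunks {nmc 0} big1.bin big2.bin] scores leftover
```

Because chunks are paired by position, data inserted near the start of one file shifts every later chunk, so the per-chunk scores are not equivalent to a single comparison of the whole files. How the scores are combined, and how unpaired trailing data is treated, is left to the caller.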

Floating-point math is used, so results are not guaranteed to be exactly consistent between architectures. NMC is intended for making comparisons on a single machine, not between different devices: an NMC value computed on one machine should not be compared against an NMC value computed for the same data on another machine.
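
In practice this means scores are best used relative to one another within a single run on a single machine. A minimal sketch of that pattern is shown here; rankBySimilarity is a hypothetical helper, not part of the extension, and it takes the comparison command as a prefix (such as {nmc 0}) so that it does not assume anything about nmc beyond "smaller means more similar". It assumes Tcl 8.5 or later.

```tcl
# Hypothetical helper (not provided by the nmc package): order candidates
# from most to least similar to a reference block, using scores that were
# all computed in this process on this machine.
#   cmpCmd - command prefix taking two blocks and returning a number,
#            e.g. {nmc 0} after "package require nmc"
# Returns a list of {score candidate} pairs, lowest score first.
proc rankBySimilarity {cmpCmd reference candidates} {
    set scored {}
    foreach c $candidates {
        lappend scored [list [{*}$cmpCmd $reference $c] $c]
    }
    # Smaller scores mean "more similar", so sort ascending by score.
    return [lsort -real -index 0 $scored]
}

# Possible usage, assuming the nmc command has been imported:
#   rankBySimilarity {nmc 0} asdf {asdz asdd asdfasdf}
```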


All code included is released to the public domain, so long as the original author is credited.