I THINK ∴ I'M DANGEROUS

n-Depth Mean Compare

Abstract

n-Depth Mean Compare (NMC) is an algorithm to gauge the similarity of two blocks of binary data of potentially differing lengths. Given a depth (some integer greater than 0) and two blocks of data, a floating point number is returned indicating the similarity of the two blocks. The closer to zero, the more similar the two blocks are deemed to be. If a value of 0 is returned, the blocks are deemed identical.

Use Cases

The original purpose of NMC, what it was developed for, was to evaluate the output of genetically programmed algorithms. Given a particular input, how close to an ideal output could the algorithm get.

Additional potential use cases include identifying and grouping similar objects for object-based storage or on-disk file management. If you normalize your input data (e.g. convert audio data to 16-bit signed PCM or convert image data to 24-bit bmp) it can roughly compare image and sound data.

Fundamental Idea

Suppose we have two sets of integers, Set A and Set B.

If you take the average of Set A you can use this to loosely compare it to Set B. The difference between two identical sets of numbers loosely gauges the difference between the sets. A difference of zero of the averages means the averages are the same.

If we divide the two sets evenly in half, we can compare the average of the first half of Set A to the average of the first half of Set B and likewise for the second half of each set. This takes into account the positions of values the sets in the first and second halves. The difference between the first half of the sets can be averaged with the second half to get a single number representing similarity. Again, the closer two zero, the more similar the sets.

If the sets are divided into 3 (near-)equal portions, yet more positional information is accounted for. We can continue this trend up to the length of which ever set is shorter, producing ever more precise comparisons.

To account for differences in length, the length of each set is appended to their respective sets.

There is an additional operation–it's not just the average of the differences. The final result is the square root of the average of the square of the differences. This is similar to how standard deviation is calculated and I cannot offer any better rationale than I find using this calculation produces better results in my testing.

Proof of Concept Implementation

nmc.c can be compiled to a shared library with gcc -O2 -shared -fPIC nmc.c -o libnmc.so -lm

nmc.h
double nmc_mean (double *l, unsigned int len);
double nmc_mean_c (unsigned char *l, unsigned int len);
double nmc (unsigned int n, unsigned char *s1, unsigned int s1_len, unsigned char *s2, unsigned int s2_len);
nmc.c
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <math.h>
 
#include "nmc.h"
 
double nmc_mean (double *l, unsigned int len)
{
    /* given an array of doubles (l) of length len, cacluate their mean */
 
    double avg = 0;
    unsigned int i=len;
    if (len == 1) return (double) *l;
    while (i--) avg += *(l++);
    return avg / len;
}
 
double nmc_mean_c (unsigned char *l, unsigned int len)
{
    double avg = 0;
    unsigned int i = len;
    if (len == 1) return (double) *l;
    while (i--) avg += (double) *(l++);
    return avg / len;
}
 
double nmc (unsigned int n, unsigned char *s1, unsigned int s1_len, unsigned char *s2, unsigned int s2_len)
{
    unsigned int s1_chunk, s1_chunk_last;
    unsigned int s2_chunk, s2_chunk_last;
    unsigned int i;
    double *diffs, mean;
 
    if ( n > s1_len || n > s2_len )
    {
        errno = EINVAL;
        return -1.0;
    }
 
    if ( 0 == n )
		n = (s1_len < s2_len) ? s1_len : s2_len;
 
    /* allocate memory for the difference array. We need n+1, n pieces
     * plus one final slot for the difference in length */
    diffs = calloc(n+1, sizeof(double));
 
    /* breaking s1 and s2 apart into n-pieces gives chunk-sized pieces */
    s1_chunk = s1_len / n;
    s2_chunk = s2_len / n;
 
    /* to account for s1 and s2 possibly not being evenly divisible by n */
    s1_chunk_last = ( s1_len % n ) ? s1_len % n : s1_chunk;
    s2_chunk_last = ( s2_len % n ) ? s2_len % n : s2_chunk;
 
    for (i = 0; i < n-1; i++)
    {
        diffs[i] =
			fabs (
				nmc_mean_c ( s1 + i*s1_chunk, s1_chunk )
				-
				nmc_mean_c ( s2 + i*s2_chunk, s2_chunk )
			);
        diffs[i] = diffs[i] < 128 ? diffs[i] : 256 - diffs[i];
        diffs[i] *= diffs[i];
    }
 
    /* the final chunk */
    diffs[n-1] =
		fabs (
			nmc_mean_c ( s1 + (n-1)*s1_chunk, s1_chunk_last )
			-
			nmc_mean_c ( s2 + (n-1)*s2_chunk, s2_chunk_last )
		);
    diffs[n-1] = diffs[n-1] < 128 ? diffs[n-1] : 256 - diffs[n-1];
    diffs[n-1] *= diffs[n-1];
 
    /* final diff: size difference */
    diffs[n] = fabs ( (double) s1_len - s2_len );
	diffs[n] *= diffs[n];
 
    mean = sqrt(nmc_mean(diffs,n+1));
 
    free(diffs);
 
    return mean;
}

Examples