Jun 1, 2016

MIBR: The Randomness Metric of Last Resort

Related Dyspoissometer functions: dyspoissometer_mibr_expected_get(), dyspoissometer_mibr_from_skew_get(), dyspoissometer_mibr_get(), dyspoissometer_skew_from_mibr_get().

In most practical cases, it's possible to assess the randomness quality of a mask list by computing its dyspoissonism. However, in some cases, the mask span (Z) vastly exceeds the mask count (Q). Here we introduce the "maximum index before repetition" (MIBR), the so-called "metric of last resort" because it can sometimes provide an informative measure of randomness in the latter case.

The MIBR is related to hash collisions, birthday attacks, and the birthday bound. It's the zero-based index immediately prior to the first repetition of a mask. For example:

{6, 5, 8, 9, 1, 3, 2, 4, 5, 0, 1}

has MIBR 7 because the 9th mask, which is a 5, is repeated. By definition, the MIBR of a mask list containing no repeats is the number of masks it contains, less one. For example:

{9, 0, 2, 4, 6}

has MIBR 4.

The MIBR is a useful randomness metric because it's possible to compute its expected (mean or average) and asymptotic median values, which can then be compared to the MIBR of a given mask list. Mask lists deviating substantially from these values are more likely to be nonrandom in origin.

To begin with, it's obvious that there is a 100% chance of having an MIBR on the interval [0, (Z-1)] because (1) it can't be negative and (2) after index (Z-1), there are no more mask choices, which forces us to repeat one of them. Here we introduce the "MIBR skew", K(N), which is the probability that after having selected the mask at index N, no repeats will yet have occurred.

We have established that:

K(0) = 1

But what about K(1), K(2), etc.? To begin with:

K(1) = ((Z - 1) / Z)

because (Z - 1) of Z possible masks are different from the mask at index zero. Then we have:

K(2) = ((Z - 1) / Z)((Z - 2) / Z)

because the 2nd mask must differ from the first one (hence ((Z - 1) / Z)) while the 3rd must differ from the first 2 (hence ((Z - 2) / Z)). By extension:

K(N) = Π((I = 1, N), ((Z - I) / Z))

where N is on [1, (Q-1)].

We can use K as a randomness metric. For example, if it turns out very close to zero, then it's likely that the source of the mask list is nonrandom because it tends to repeat itself too often. Inversely, if it turns out very close to one, then it's also likely nonrandom because it fails to repeat itself often enough. In the latter case, the generator might be a permutation, such as AES encryption. (This is one obvious way in which permutations make dubious cryptosystems, but I digress.) In any event, K can sometimes reveal nonrandom behavior in a mask list much smaller than Z -- but how much smaller?

This question is the subject of the birthday bound. In my own understanding, the birthday bound is the number of masks after which the probability of repetition having occurred is as close to 0.5 as possible, but no less than 0.5; the "limit median maximum index before repetition" (LMMIBR) is one less than that value. "Limit" refers to the fact that we're notionally dealing with infinite (and odd) Q but finite Z. "Median" refers to the fact that we want to pick the middle value in a list sorted from least to greatest MIBR. (This middle value exists because Q is odd, although if Q is even and finite, it can be approximated as the maximum value in the lesser half of MIBRs.) Dyspoissometer can approximate the LMMIBR via dyspoissometer_mibr_from_skew_get() with a skew of 0.5, which solves the problem using a binary search over all MIBRs up to (Z - 1). By the way, I investiged this all previously in this article regarding the LMICBR, but I think this current treatment is more concise.

Having thus computed it, approximately half of the measured MIBRs from a given generator should be less, and half greater. Any asymptotic deviation therefrom is most likely due to bias in the generator, although small deviations with small Z may be due to the quantization error induced by the asymmetry of the LMMIBR itself, in the sense that it may live more on one side of 0.5 than the other. Note that this test is essentially inconclusive in any particular case if (Q - 1) is less than or slightly greater than the LMMIBR, and the MIBR is (Q - 1), because the mask list is too small to demonstrate that the MIBR is atypically high, even though it is maximal given Q.

In practice, if enough Z-sized mask lists exist to test multiple MIBR values, then the dyspoissonism of their concatenation would probably be more informative. Still, the LMMIBR provides certain bounds on the usefulness of hashes and the deniability of the existence of ciphertext. It can even be used to estimate the number of animals in a given population.

The expected value of the MIBR (EMIBR) is more complicated to compute and less useful, mainly because it tends to be greater than the LMMIBR due to the long tail of possible MIBRs, thus requiring more samples, and thus slower convergence. Nevertheless, for your convenience, dyspoissometer_mibr_expected_get() will compute it, even if (Q < Z), which limits MIBR to at most (Q - 1). If (Q >= Z), then the result equals the infinite limit value, which in turn seems to have ratio one with respect to following approximation:

EMIBR(Z) ~ ((πZ / 2) ^ 0.5) - (4 / 3)

Note also that the EMIBR is closely related to expected kernel density.

Quiz:

1. What Dyspoissometer function will return the MIBR of mask list?

dyspoissometer_mibr_get().

2. What is the LMMIBR of (Z = (2 ^ 32))?

dyspoissometer_mibr_from_skew_get(U32_MAX, 0.5) returns 77,162, which is the LMMIBR. The birthday bound is thus 77,163. (Note that mask_idx_max is ((2 ^ 32) - 1).)

3. What is the EIBR of (Q = 71) and (Z = 93)?

dyspoissometer_mibr_expected_get(71-1, 93-1) returns 1.076...E+01, which is the EIBR.

4. What MIBR has skew closest to, but no less than, 0.37, given (Z = 12345)?

dyspoissometer_mibr_from_skew_get(12345-1, 0.37) returns 155.

5. What is the skew of the above MIBR?

dyspoissometer_skew_from_mibr_get(155, 12345-1) returns 3.74006...E-01. (Note that we don't subtract one from 155 because it's already an index as opposed to a count.)

6. How can we verify the answer in #3?

We already know from #5 that MIBR 155 has (K > 0.37). So check MIBR 156:

dyspoissometer_skew_from_mibr_get(156, 12345-1) returns 3.6...E-01, which is less than 0.37, so 155 is indeed the correct answer.