May 11, 2015

Dyspoissonism Defined

Home / Previous Lesson / Next Lesson

Related Dyspoissometer functions: dyspoissometer_dyspoissonism_get().

Dyspoissonism is a fraction expressing the extent to which a population list corresponding to a given mask list and mask span (Z) diverges from the most random possible case. The most random possible case, in turn, is asymptotically equivalent to a Q-scaled, integer constrained Poisson distribution with lambda = (Q / Z). Hence the word "dyspoissonism", which refers to the fractional extent of divergence from such a scaled Poisson distribution. (Note that "dys" refers to "divergence from", as opposed to "dis" which simply means "the opposite of".)
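To make the most random case concrete, here is a minimal Python sketch (standalone, not part of Dyspoissometer, with hypothetical names) of the scaled Poisson population list in question, ignoring the integer constraint. Scaling the Poisson PMF by Z makes the population list span Z masks while the implied mask count sums to Q:

import math

def poisson_population_list(q, z, k_max):
    # Expected population list of the most random case: h[k] is the
    # (real-valued) count of masks occurring exactly k times, namely
    # Z * (lambda^k) * exp(-lambda) / k! with lambda = (Q / Z).
    lam = q / z
    return [z * (lam ** k) * math.exp(-lam) / math.factorial(k)
            for k in range(k_max + 1)]

# Example: Q = 1000 masks drawn from a span of Z = 500, so lambda = 2.
h = poisson_population_list(1000, 500, 8)
print([round(h_k, 1) for h_k in h])  # [67.7, 135.3, 135.3, 90.2, ...]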

Mathematically, the dyspoissonism D of a mask list with logfreedom L, mask span Z, and implied mask count Q is given, in the trivial cases, by:

D = 0 if (Q = 1) or (Z = 1)

otherwise:

D = 1 - (L / (Q log(Z)))

See dyspoissometer_dyspoissonism_get() in the source code for an implementation of D.
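For clarity, here is the same formula as a standalone Python sketch (merely illustrative; the names are hypothetical, and only the C function above is authoritative):

import math

def dyspoissonism(logfreedom, q, z):
    # Trivial cases: a single mask or a single mask index admits only
    # one possible H, so the divergence is zero by definition.
    if (q == 1) or (z == 1):
        return 0.0
    # D = 1 - (L / (Q log(Z))), with log() taken as the natural log,
    # consistent with logfreedom being measured in eubits.
    return 1.0 - logfreedom / (q * math.log(z))

print(dyspoissonism(1000.0, 300, 256))  # approx 0.399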

The trivial cases are defined as such because if we have only one mask or only one mask index -- and neither can be less than one -- then there is only one possible value of H, so the mask list is, by default, as random as it can possibly be, given the constraints. Therefore, intuitively, the dyspoissonism must be zero.

By the way, you might wonder why we don't simply have poissonism instead of dyspoissonism. The reason is simple: most real world data tends to be quite noisy and rich in entropy, even if it is dominated by a sparse core of simple signals. So our expectation is to encounter population lists which look very much like Poisson distributions (poissonism close to one). Therefore, for the sake of clear comparison, dyspoissonism is more practical.

Now, where does the formula for D come from? Well, the fractional term (L / (Q log(Z))) is asymptotically equal to one. Specifically, it approaches this limit as Q and Z (and therefore L) approach infinity. In turn, this corresponds to minimum D -- the "dyspoissonism floor", D0 -- of a particular Q and Z. Thus under these same limiting conditions, D0 approaches zero.

For its part, (Q log(Z)) is the logfreedom of the raw mask list (which by definition is literal and complete without any provision of H). The reason is that the way count of all mask lists -- that is, the number of unique mask lists complying with Q and Z -- is simply (Z ^ Q) because we have Q mask indexes to fill, each with one of Z possible masks. Appropriately, (Q log(Z)) is called the "obtuse logfreedom" of the mask list.
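For tiny parameters, the (Z ^ Q) way count is easy to confirm by brute force. A quick Python check (hypothetical names again):

import itertools
import math

q, z = 4, 3  # tiny parameters so that enumeration stays cheap
mask_lists = list(itertools.product(range(z), repeat=q))
assert len(mask_lists) == z ** q  # 81 unique mask lists

# Obtuse logfreedom: the log of the raw way count, (Q log(Z)) eubits.
print(q * math.log(z), math.log(len(mask_lists)))  # both approx 4.394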

The fractional part above approaches one because the amount of entropy required to encode H under the above limit conditions is asymptotically negligible, and once H has been defined, we need only L eubits more information to determine the mask list. (In other words, it takes very little information to define a very random distribution, but a comparatively huge amount of information to provide the position of every mask which created it.)

Note that under finite conditions, D can never reach zero except in the trivial cases. Nor, for that matter, can it reach one unless (Q = 1). (In practice, either bound might be produced as an error condition, but this has nothing to do with dyspoissonism itself.)

Dyspoissonism is not merely a vague characterization of randomness quality. It actually conveys the fraction of obtuse logfreedom which, in the absence of mask list bias, should fairly be dedicated to the encoding of H itself. It does not tell us how to most efficiently encode H, however. (Encoding H efficiently is hard because it exhibits Poissonoid population biases. Fortunately, H is usually of negligible size even if we encode it obtusely.) Thus (1 - D) is the fraction of obtuse logfreedom which must be dedicated (under combinatorial data compression) to encoding the particular "way" of the mask set; whereas D itself is the fraction of obtuse logfreedom which must be dedicated to encoding H as efficiently as possible (under some as yet undiscovered optimal compression scheme for population lists). How can we possibly be confident of this?

The answer is quite straightforward. If any legal mask list compliant with Q and Z (but no particular H) can occur with the same probability as any other, then any unbiased and efficient data compression scheme must actually preserve the exact size of the raw mask list. In other words, a maximally efficient data compression scheme is merely a size-preserving permutation of the masks themselves! So given the logfreedom of any particular mask list, we then immediately know the number of eubits which should be dedicated to encoding H, which is just the obtuse logfreedom minus the actual logfreedom. Provided that the compression scheme is reversible and 100% efficient, every mask list should emerge from the compression process with identically the same size.
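To see this accounting in action, here is a Python sketch which assumes that logfreedom is the log of the way count of the population list -- W = (Z! / product of H(k)!) * (Q! / product of (k!) ^ H(k)), where H(k) is the number of masks occurring exactly k times, including H(0), the masks which never occur -- and which splits the obtuse logfreedom of a concrete mask list into L and the eubits left over for H. Treat the formula and all names as assumptions of this sketch:

import math
from collections import Counter

def logfreedom(mask_list, z):
    # Log of the assumed way count: the number of mask lists which
    # share this mask list's population list H.
    q = len(mask_list)
    freq = Counter(mask_list)    # frequency of each mask actually seen
    h = Counter(freq.values())   # population list for k >= 1
    h[0] = z - len(freq)         # masks which never occur at all
    result = math.lgamma(q + 1) + math.lgamma(z + 1)
    for k, h_k in h.items():
        result -= math.lgamma(h_k + 1) + h_k * math.lgamma(k + 1)
    return result

mask_list = [0, 1, 1, 2, 2, 2, 4, 4]  # Q = 8 mask indexes, span Z = 5
q, z = len(mask_list), 5
l = logfreedom(mask_list, z)  # approx 11.52 eubits
obtuse = q * math.log(z)      # approx 12.88 eubits
print(l, obtuse - l)          # way number vs. eubits left over for H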

But you might rightly ask why we would care about a noncompressing "compression" scheme, and how it could possibly measure entropy. As usual, entropy is a matter of perspective: if we have genuinely no expectations whatsoever, then the most efficient compression scheme should preserve size (bits in equals bits out). But if we expect a certain level of randomness, then we implicitly have a bias toward certain population lists at the expense of others. In this manner, we can achieve some level of data compression by changing the method of encoding H, so that we expect size savings overall (even though that may not be the case in any given trial). In any event, the maximum fraction by which we might compress a mask set is simply the dyspoissonism, because we cannot compress H to a negative size.

And that is precisely why dyspoissonism is a useful entropy metric: it measures the fraction of maximum lossless compressibility (under 100% efficient combinatorial compression), even assuming that H requires zero storage because it's implied (unequivocally predictable). For example, if (D = 0.31), then we cannot hope to compress the mask list by more than 31% (assuming, as always, that the masks all occur with equal probability and in unbiased random order). By "100% efficient combinatorial compression", I mean combinatorial compression which encodes nothing other than the "way number" corresponding to the particular mask set whose dyspoissonism is being measured. Technically, 100% efficiency cannot occur unless the logfreedom in bits (as opposed to eubits) is an integer, but for all practical purposes, rounding up to the nearest bit won't hurt.
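Concretely, converting eubits (natural log units) back to whole bits gives the compression floor. A short Python sketch of the (D = 0.31) example, with Z = 256 and Q = 1000 chosen arbitrarily for illustration:

import math

def min_compressed_bits(logfreedom):
    # One bit equals log(2) eubits; round up because 100% efficiency
    # would generally require a fractional bit count.
    return math.ceil(logfreedom / math.log(2))

q, z, d = 1000, 256, 0.31
l = (1.0 - d) * q * math.log(z)  # logfreedom implied by D = 0.31
print(min_compressed_bits(l), q * math.log2(z))  # approx 5520 vs 8000.0 raw bits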

In summary:

1. "Compressibility" in the context of dyspoissonism refers to the fractional extent to which a mask list and its population list -- jointly -- could be compressed if the latter required zero storage.

2. Dyspoissonism is a fraction on [0, 1]. Lesser values indicate lesser compressibility (more randomness).

3. Logfreedom is a nonnegative number with (Q log(Z)) as a loose upper bound. Greater values indicate lesser compressibility (more randomness).

