May 23, 2015

Minimum Dyspoissonism vs. Qualitative Entropy

Home

The quantification of entropy is in the eye of the beholder. In other words, we can only measure entropy through a metric of our own choosing. What one metric regards as orderly, another might regard as entropic.

This is a stern warning to anyone who might otherwise consider dyspoissonism as some archetypical entropy metric: there is no such thing. It's merely another tool in the chest.

Take the following mask list with (Q = Z = 16), for example:

G = {0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9, 9, 10, 10, 10}

G has a population list given by:

H = {7, 3, 1}

which turns out to have what I believe to be the minimum dyspoissonism (maximum logfreedom) for this Q and Z -- a logfreedom of about (4.2368106628E+01). But it hardly looks random to a human. If you're confused as to how this could possibly occur at maximum logfreedom, reread the first sentence of this post.
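(For the record, that value follows from the logfreedom formula covered in the lessons below: only 11 of the 16 possible masks appear, so (H0 = 5), and L = 2 log(16!) - log(5!) - (log(7!) + log(3!) + log(1!)) - (7 log(1!) + 3 log(2!) + log(3!)), which is about 42.368, in agreement with the figure above. Whether that is truly the maximum attainable for (Q = Z = 16) is a separate claim.)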

This is one particularly obtuse case in which, perhaps, someone has attempted to pass a dyspoissonism filter by contriving an easily constructed mask list which happens to have minimum dyspoissonism. I just thought I should point this out for those who might assume that such construction would be impossible.

What to do? Fortunately, there are plenty of other tools in the chest.

For one thing, we could take a simple derivative (next minus previous, with the first mask kept as-is) transformation, yielding G':

G' = {0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0}
H' = {0, 0, 0, 0, 0, 1, 0, 0, 0, 1}

Now the logfreedom collapses and the dyspoissonism surges closer to one. This unremarkable data set has been revealed for what it is. For maximum paranoia, we could continue differentiating until we hit maximum dyspoissonism, searching for what I call "maximum differential dyspoissonism". This technique is particularly useful when analyzing PRNGs that tend to fall apart after several differentiations.
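For concreteness, here is a minimal sketch of that derivative transform in C. The function name delta_transform is mine, not Dyspoissometer's; it keeps the first mask as-is (equivalently, it subtracts an implied previous value of zero), which is why G' above has the same mask count as G.

#include <stdio.h>

/* "Next minus previous" derivative transform of a mask list. */
static void delta_transform(const unsigned *g, unsigned *g_prime, unsigned mask_count) {
  unsigned previous = 0;
  unsigned i;
  for (i = 0; i < mask_count; i++) {
    /* Plain subtraction suffices for a nondecreasing list like G above; a
       general filter would presumably wrap modulo the mask span, Z, in order
       to keep every output mask on [0, Z - 1]. */
    g_prime[i] = g[i] - previous;
    previous = g[i];
  }
}

int main(void) {
  unsigned g[16] = {0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9, 9, 10, 10, 10};
  unsigned g_prime[16];
  unsigned i;
  delta_transform(g, g_prime, 16);
  for (i = 0; i < 16; i++) {
    printf("%u%s", g_prime[i], (i == 15) ? "\n" : ", ");
  }
  return 0;
}

Running it prints the 16 values of G' exactly as listed above.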

Alternatively, one could evaluate the kernel skew of G, purely as a backstop between dyspoissonism and the human sense of entropy.

May 13, 2015

Dyspoissometer Demo Output

Home

Even if you cannot build Dyspoissometer on your system, the demo output is a tutorial in its own right. Here is what it printed out on my 64-bit Linux system, after compiling for quad precision ("make demo_quad"):

---------------------------------------------------------------------------
Dyspoissometer Demo
Copyright 2016 Russell Leidich
http://dyspoissonism.blogspot.com
build_id_in_hex=00000020

Follow along with main() in demo.c, so you can learn how to use Dyspoissometer.
The math details are explained at the webpage linked above.

First of all, we have already called dyspoissometer_init() to ensure that (1)
Dyspoissometer is compatible with the way we intend to use it and (2) all
critical fixes which this demo knows about are present.

So let's get on with the demo. Given that:

mask_idx_max=000000000000001D
mask_max=0000000000000008

We then have the following list (mask_list0_base) of (mask_idx_max+1) masks,
all on the interval [0, mask_max]. It looks random -- but how random?

mask_list0_base[000000000000001E]={
  0000000000000004,0000000000000001,0000000000000006,0000000000000007,
  0000000000000006,0000000000000008,0000000000000000,0000000000000004,
  0000000000000007,0000000000000004,0000000000000006,0000000000000001,
  0000000000000000,0000000000000004,0000000000000006,0000000000000003,
  0000000000000000,0000000000000001,0000000000000002,0000000000000008,
  0000000000000008,0000000000000005,0000000000000003,0000000000000001,
  0000000000000007,0000000000000002,0000000000000001,0000000000000004,
  0000000000000007,0000000000000003
}

In order to answer this question, we first need to count how many of each mask
there are in a process called "frequency accrual". Begin by calling
dyspoissometer_uint_list_malloc_zero().

OK, that worked. Now we need to determine the frequency of each mask using the
freq_list0_base which we just allocated and has been set to zero.
freq_list0_base[N] will then contain the frequency of mask N. To do this, call
dyspoissometer_freq_list_update_autoscale() with (mask_count_implied==0) as
required on the first call. Watch out for the return value, which is ignored for
the purposes of this demo.

Done. The output is:

freq_list0_base[0000000000000009]={
  0000000000000003,0000000000000005,0000000000000002,0000000000000003,
  0000000000000005,0000000000000001,0000000000000004,0000000000000004,
  0000000000000003
}

So this means that mask zero has frequency 3 (occurs 3 times), mask one has
frequency 5, etc. You can confirm this by looking at mask_list0_base above.

Now we need to create a population list corresponding to the frequency list
which we just generated. This can be done via
dyspoissometer_pop_list_init().

So then here is the resulting population list:

pop_list0_base[0000000000000005]={
  0000000000000001,0000000000000001,0000000000000003,0000000000000002,
  0000000000000002
}

So this means that frequency one has population one (occurs once), frequency 2
occurs once, frequency 3 occurs 3 times, etc. You can confirm this by looking
at freq_list0_base above. (You can think of population as the frequency of the
frequency.)

Now that we have a population list, we can answer the "how random"
question. Start by getting its logfreedom from
dyspoissometer_logfreedom_dense_get():

logfreedom=+6.2282493264059594837404470739206729E+01

While this may not be very informative on its own, if we take (e==2.71828...)
raised to logfreedom, we'll get the number of possible sets of
(mask_idx_max+1) masks, each of which on [0, mask_max], which would result in
exactly the pop_list0_base given above:

way_count=+1.1192913403533332190720000000000000E+27

Trust me, way_count is an integer, which is most clearly visible on account of
trailing zeroes in DYSPOISSOMETER_NUMBER_QUAD mode (make demo_quad).

But most usefully, ceil(logfreedom/log(2)) will provide us with the number of
bits required to combinatorially encode a mask list adhering to
pop_list0_base -- in other words, to guarantee that we can fully represent an
integer on the interval [0, way_count-1]:

bits_in_mask_set=0000005A

That is, if we ignore the cost of encoding pop_list0_base itself, we would
require 0x5A (90) bits of storage to encode mask_list0_base, given
mask_idx_max and mask_max with their values as given above.

Now we can compute the dyspoissonism of pop_list0_base using
dyspoissometer_dyspoissonism_get():

dyspoissonism=+5.5133858315519750531462878265266815E-02

Now that we know the actual logfreedom, how does it compare to the maximum
possible logfreedom attainable, given the constraints of mask_idx_max and
mask_max? We can answer this question with dyspoissometer_logfreedom_max_get(),
which uses a Monte Carlo method to provide an accurate approximation. (In this
case, we'll use 100 iterations, but in practice, you might need orders of
magnitude more, as convergence happens in somewhat unpredictable leaps.) Note
that we need to check for a negative return value, indicating memory
exhaustion. Also, for the sake of parallel searching, the function requires a
random number seed. Here's the result:

logfreedom_max=+6.2687958372167759219382483854671102E+01

But how can we have any faith in the accuracy of logfreedom_max (which by the
way is much more likely to be understated than overstated)? Well, we know that
the dyspoissonism floor (D0) is positive but approaches zero as the mask count
and mask span approach infinity. The denominator is (Q ln Z), where Q is the
mask count and Z is the mask span. So if our estimate of maximum logfreedom
is accurate, then it should be slightly less than this value. Let's see:

logfreedom_upper_bound=+6.5916737320086581483714714215351545E+01

So that worked fine. Now that we have what appears to be an accurate
approximation logfreedom_max, we can find D0, the dyspoissonism floor, from
dyspoissometer_dyspoissonism_get():

d0=+4.8982687541710707816454996837645826E-02

We could assess randomness by computing the ratio of dyspoissonism to D0, but
that approach would only be valid with constant mask count and mask span.
Fortunately, there is a more generic basis of comparison...

Returning to the original question, how random is mask_list0_base? One way to
answer this is to compute (logfreedom/logfreedom_max), which will give the
ratio of the information content in mask_list0_base to what would be expected
in the most random mask list compliant with mask_idx_max and mask_max:

information_density=+9.9353200967718573769967784308194695E-01

But because the information density is usually close to one, a more useful
metric is information _sparsity_, which is just one minus that quantity (but
should be computed using dyspoissometer_sparsity_get() for safety reasons):

information_sparsity=+6.4679903228142623003221569180530485E-03

The great thing about information sparsity is that it can be used as a fair,
normalized, and unambiguous comparator across sets of different mask counts and
mask spans in order to assess randomness quality. Its only downside is that it
requires finding logfreedom_max, which is easily approximated, although the
error in which is somewhat unclear at present. This is precisely why
dyspoissonism is a better randomness comparator, provided that mask counts and
mask spans are held constant, as is often the case in signal analysis.

Note that, while dyspoissonism can only reach zero in the limit of an infinite
set (although it can readily reach one), information sparsity can reach either
extreme with any set.

Back to the original question: information_sparsity suggests that
mask_list0_base is almost -- but not quite -- as random as it could possibly
be. This is unsurprising for such a small mask list. But for large lists, an
elevated dyspoissonism or information sparsity can help to identify unseen
statistical biases -- and perhaps even predictable mathematical structure -- in
the data source.

Now let me show you another way to compute logfreedom. Suppose that we
generate a frequency list from a mask list:

freq_list1_base[0000000000000009]={
  0000000000000008,000000000000000B,000000000000000C,0000000000000009,
  000000000000000A,000000000000000A,0000000000000009,000000000000000B,
  000000000000000A
}

This means that mask zero has frequency 8, mask one has frequency 11, mask 2
has frequency 12, etc. But now, without taking the time to produce a population
list, we can compute the logfreedom directly using
dyspoissometer_logfreedom_sparse_get() (and again, checking for success on
return).

In order to do this, though, we must first allocate one other frequency list
equal in size to freq_list1_base above -- namely, 9 (DYSPOISSOMETER_UINT)s
because (mask_max==8). We use dyspoissometer_uint_list_malloc() for this
purpose. Here we go:

logfreedom=+1.9126308750039845767695139189084438E+02

OK, that worked. Now we can deallocate those 2 frequency lists that we just
allocated, using dyspoissometer_free()...

Done. Now, just to be paranoid, let's compute the population list of
freq_list1_base above:

pop_list1_base[000000000000000C]={
  0000000000000000,0000000000000000,0000000000000000,0000000000000000,
  0000000000000000,0000000000000000,0000000000000000,0000000000000001,
  0000000000000002,0000000000000003,0000000000000002,0000000000000001
}

You can verify that this is consistent with freq_list1_base: it contains one
frequency 8, 2 frequency 9s, etc. So we can now call
dyspoissometer_logfreedom_dense_get():

logfreedom=+1.9126308750039845767695139189084438E+02

As you can see, this matches the value above within the limits of numerical
error.

So as mentioned above, the minimum frequency with nonzero population is
frequency 8. Conveniently, we can avoid wasting resources dealing with all
those leading zeroes by calling dyspoissometer_logfreedom_dense_get() with
(freq_min==8) and the same freq_max_minus_1:

pop_list2_base[0000000000000005]={
  0000000000000001,0000000000000002,0000000000000003,0000000000000002,
  0000000000000001
}
logfreedom=+1.9126308750039845767695139189084438E+02

As expected, the logfreedom is unchanged.

From time to time, you may find it useful to analyze integer approximations of
Poisson distributions. dyspoissometer_poisson_term_get() can help you with
this. Unfortunately, there are many ways to map (analog) Poisson distributions
to integer approximations thereof. In theory, as the mask counts and mask spans
grow, the particular approximation method becomes less important, and the
information sparsity of the resulting discrete distribution approaches zero.
Here's a simple example using successively larger integer approximations of the
lambda-one Poisson distribution (LOPD). At each step, the mask count and mask
span are both multiplied by about 256. Let's see what happens. (This might
take a few minutes):

poisson_mask_idx_max=0000000000000100
poisson_mask_max=00000000000000FF
poisson_information_sparsity=+1.1294726924381859360730610406606087E-04
poisson_mask_idx_max=000000000000FF04
poisson_mask_max=000000000000FF00
poisson_information_sparsity=+7.7909880882871078266079775867239554E-07
poisson_mask_idx_max=0000000000FEFFFE
poisson_mask_max=0000000000FEFFFE
poisson_information_sparsity=+6.0988639656266055913442929059338777E-10
poisson_mask_idx_max=00000000FEFFFFF1
poisson_mask_max=00000000FEFFFFFE
poisson_information_sparsity=+6.9691891013055711558023866053140997E-15
poisson_mask_idx_max=000000FF0000000A
poisson_mask_max=000000FF00000000
poisson_information_sparsity=+2.7435640484631022740993965157706537E-15
poisson_mask_idx_max=0000FEFFFFFFFFFA
poisson_mask_max=0000FEFFFFFFFFFF
poisson_information_sparsity=+1.6849273151579334872820991222149401E-18
poisson_mask_idx_max=00FEFFFFFFFFFFF4
poisson_mask_max=00FEFFFFFFFFFFFE
poisson_information_sparsity=+5.2962575165164824927570952982632527E-25
poisson_mask_idx_max=FEFFFFFFFFFFFFC9
poisson_mask_max=FEFFFFFFFFFFFFFC
poisson_information_sparsity=+1.6668441166413227665972389848858803E-21

The trend above is not convincingly monotonic (if you "make demo_quad"). Is
this merely reflective of a poor approximation of logfreedom_max at high mask
counts? Or are Poisson distributions asymptotically but nonmonotonically of zero
information sparsity? This is a hard open question. (I'm even slightly doubtful
that the asymptotic hypothesis is true, because I'm only mostly convinced that
maximum information density is expected behavior in large mask lists.) In any
event, we frequently speak of Poisson distributions as though they are discrete
because their domain is the whole numbers; their range is analog, however,
which leaves a lot to be desired.

Speaking of distributions, Dyspoissometer has a way to generate pseudorandom
mask lists which adhere to a given mask count and mask span, namely
dyspoissometer_mask_list_pseudorandom_get(). This function is particularly
useful for doing statistical surveys of such lists, which for example could
be used to find the variance of dyspoissonism. For instance, here's a random
mask list for (mask_idx_max==37) and (mask_max==14):

mask_list_base[0000000000000026]={
  000000000000000A,0000000000000002,000000000000000E,0000000000000004,
  0000000000000008,000000000000000A,0000000000000009,0000000000000006,
  000000000000000A,0000000000000000,000000000000000C,0000000000000001,
  000000000000000C,000000000000000B,000000000000000A,0000000000000006,
  0000000000000006,000000000000000C,0000000000000004,0000000000000001,
  000000000000000A,0000000000000003,0000000000000009,000000000000000C,
  0000000000000003,0000000000000002,0000000000000007,000000000000000B,
  0000000000000001,000000000000000C,0000000000000004,0000000000000003,
  0000000000000006,000000000000000E,0000000000000009,0000000000000001,
  0000000000000007,0000000000000000
}

As you can see, mask_list_base contains 38 masks on [0, 14]. If you run this
demo using a different numerical precision, you will get a different random
mask list due to variances in the evolutionary pathway of
dyspoissometer_logfreedom_max_get(), as its decision process and therefore the
number of times it iterates the random seed vary with numerical error, despite
having asymptotically constant output.

Let's get on with a more exciting application of this generator, which is the
asymptotic discovery of median logfreedom, courtesy of
dyspoissometer_logfreedom_median_get(). It works like this: generate lots of
random mask lists, all of which adhering to the same mask counts and mask
spans. Then find the logfreedom of each such mask list, and accumulate them in
a logfreedom list. Finally, find the median value thereof. Median logfreedom is
more important than expected (mean) logfreedom because only the former
essentially guarantees that any random modification of the mask list has equal
probability to decrease or increase the logfreedom; in other words, median (not
mean!) logfreedom is an entropy equilibrium point. So a good true random number
generator should make mask lists whose median logfreedom is the aforementioned
value. Just watch the return value for error status!

logfreedom_median=+9.7136567195470554483466520955731780E+01

So the value above is an approximation of the median logfreedom, obtained from
1000 random mask lists. If you found the logfreedom of _every_ possible mask
list, sorted those logfreedoms, then took the middle value thereof, you should
find it to be close to if not exactly the value above (which is, in fact, an
actual logfreedom as opposed to an interpolated value).

Sanity check: the value above is slightly less than ((37+1) ln (14+1)), which
is about 103. Pass!

Now I have something amazing to show you. Take a look at pop_list1_base
above. Notice how it looks quite like a Poisson distribution with lambda
(expected frequency) equal to 10. By contrast, look at pop_list3_base below,
which has the same mask count and mask span.

pop_list3_base[000000000000000D]={
  0000000000000000,0000000000000000,0000000000000000,0000000000000000,
  0000000000000000,0000000000000000,0000000000000001,0000000000000001,
  0000000000000002,0000000000000001,0000000000000002,0000000000000001,
  0000000000000001
}

In particular, note that pop_list3_base is bimodal; it has 2 peaks -- very
unlike a Poisson distribution. So logically, it should have lower logfreedom:

logfreedom=+1.9218634690158870608247890704275966E+02

Whoa! What's going on here? How could such a distribution have _greater_
logfreedom, corresponding to lesser dyspoissonism and thus greater entropy?
And that's only the tip of the iceberg...

Let's look at every single neighbor that pop_list1_base has. A "neighbor" of a
population list is simply another population list which corresponds to a single
mask change. In other words, the population of nonzero frequency A decreases by
one, while the population of frequency (A-1) increases by one; simultaneously,
the population of frequency B decreases by one, while the population of
frequency (B+1) increases by one, where (A!=(B+1)). (In the code, frequency A
is called the "down right" frequency, and B is called the "up left" frequency.)
This is simply what happens when one mask is changed into another.

In classical physics, the Second Law of Thermodynamics implies that the entropy
of a closed system always increases over time (because if we make ice in the
freezer, then enough heat is dissipated into the kitchen to create a net
positive change in the entropy of the entire house). So logically, at least one
neighbor of pop_list1_base should have greater logfreedom, because we
demonstrated above that its logfreedom is less than that of a "camel humps"
distribution. You'll have to look at the code in demo.c for this exercise, but
here are the logfreedoms of each of its neighbors:

logfreedom_self=+1.9126308750039845767695139189084438E+02
logfreedom_neighbor=+1.9034679676852430261176786467907636E+02
logfreedom_neighbor=+1.9094463376927992306114114467725377E+02
logfreedom_neighbor=+1.9085762239229029329497337877538004E+02
logfreedom_neighbor=+1.9077757968461675686914960081365358E+02
logfreedom_neighbor=+1.8977143262362074075688942666708872E+02
logfreedom_neighbor=+1.9106241680493630651567993878672431E+02
logfreedom_neighbor=+1.9097540542794667674951217288485058E+02
logfreedom_neighbor=+1.9089536272027314032368839492312412E+02
logfreedom_neighbor=+1.8998215365493639335934442862876736E+02
logfreedom_neighbor=+1.9076231221248596843492942665209929E+02
logfreedom_neighbor=+1.9108076594360450305073967386568985E+02
logfreedom_neighbor=+1.9100072323593096662491589590396342E+02
logfreedom_neighbor=+1.8967199872663255383741036763658378E+02
logfreedom_neighbor=+1.8974895976776868216239458468089895E+02
logfreedom_neighbor=+1.8978978176228893729194916174605424E+02
logfreedom_neighbor=+1.9040288623517534617554261590578599E+02
logfreedom_neighbor=+1.8935354499551401922160012042299319E+02
logfreedom_neighbor=+1.8965364958796435730235063255761818E+02
logfreedom_neighbor=+1.8996380451626819682428469354980174E+02
logfreedom_self=+1.9126308750039845767695139189084438E+02

Notice how _every_ neighbor has _lesser_ logfreedom. In other words,
pop_list1_base exists at a _local_ entropy maximum even though we just proved
above that it does _not_ exist at a _global_ maximum; if we change any single
mask, the result will be a population list with lesser (never equal)
logfreedom. This is profound. It implies that some discrete systems are so
heavily constrained that _any_ minimum quantum of change will send them into a
lower entropy state! If the universe is ultimately digital, then perhaps there
are natural examples of this phenomenon. Note that this goes beyond simple
quantum systems which occasionally lose entropy by sheer chance; in this case,
_any_ single step we take will result in lesser entropy!

I should add that it's precisely this lack of entropy gradient which makes the
deterministic discovery of D0 so difficult. Indeed, I only discovered this
problem when I wrote a function intended to do so by riding down the gradient.
dyspoissometer_logfreedom_max_get() is therefore somewhat nondeterministic,
although it rides gradients eagerly when possible.

I hope you learned something from this demo and find Dyspoissometer useful.
I have just one more suggestion before concluding: if you want a direct and
easy way to find logfreedom, try one of the
dyspoissometer_u8/u16/u24/u32_list_logfreedom_get() functions. Yes, we could
have used them above, but then you would not have learned as much about
dyspoissonism.

Scroll up or redirect to a file in order to see what you missed!
---------------------------------------------------------------------------

May 12, 2015

How to Tune a Quantum Computer with Dyspoissometer

Home

In the 20th century, human laborers would use a primitive device called a tuning fork to tune the musical notes on a piano in order to ensure that they complied with musical frequency standards. (Ah, the 20th century -- how mechanical it was!) Similarly, we can now use Dyspoissometer to tune the entropy quality of a quantum computer. You can imagine a robot showing up at your lab in a pair of overalls, Dyspoissometer in hand, ready to politely inform you that your qubits are slightly biased!

Following is the general procedure:

1. Instruct the quantum computer to produce a random mask list where the mask span, Z, is a power of 2. For maximum statistical significance, the mask count, Q, should equal Z. (2^16) 16-bit samples should suffice in practice. Each mask should come from the collapse of an N-qubit superposition (ideally, entangled) to an N-bit (classical) mask. Make sure to round-robin the qubits so we uniformly sample each of them.

2. Compute the logfreedom of the mask list with Dyspoissometer (see related functions on the logfreedom page).

3. Loop back to step 1, accumulating a list of logfreedoms.

4. Sort the list.

5. Compare the median value (in the middle of the list) with the output of dyspoissometer_logfreedom_median_get() after running it for 10 or more times as many iterations.

6. If the median is much less than expected, then the qubits are probably biased. But if the median is much more than expected, then perhaps the random qubit output is being pseudorandomly postprocessed to hide a systematic error caused by manufacturing problems. Either way, we have a problem necessitating qubit recalibration. Double check after running dyspoissometer_logfreedom_median_get() for as long as possible, perhaps in multiple parallel threads. In practice, (10^6) iterations should be considered minimal.
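As a ballpark sanity check for step 6 (in the same spirit as the sanity check printed by the demo above): with (Q = Z = 2^16), the obtuse logfreedom is (Q ln Z) = 65536 * ln(65536), or roughly 727,000 eubits, and both the measured median and the estimate from dyspoissometer_logfreedom_median_get() should land slightly below that bound. A median nowhere near that neighborhood points to a bookkeeping bug rather than a qubit problem.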

Dyspoissometer Source Code

Home

Latest Stable Release

Sample Images #1

These are the same images used in this YouTube video about Dyspoissometer.

This is the source code for Dyspoissometer for Linux (32-bit and 64-bit), Mac (64-bit), and Windows (32-bit), which is mostly C but the basic functions (for example, computing logfreedom and dyspoissonism) can also be done in Mathematica ("MathKernel" in a Linux terminal).

README.txt has everything you need to get started. It will tell you how to build and run the demo, which contains code for several commonplace purposes. And be sure to check out dyspoissofile (Dyspoissometer for Files) which will allow you to find the logfreedom and dyspoissonism of arbitrary binary files, even if you don't know C. For your convenience, you can see the actual text of the demo output here because I think it has educational value (even though it might be somewhat out-of-sync with the most recent source code).

Rapid Approximation of Maximum Logfreedom

Home / Previous Lesson / Next Lesson

Related Dyspoissometer functions: dyspoissometer_logfreedom_max_get(), dyspoissometer_mask_list_pseudorandom_get(), dyspoissometer_poisson_term_get().

We want to identify the maximum logfreedom, Lmax, attainable given a population list restricted to Q masks and a mask span of Z. This value is required in order to compute the information sparsity, S, and the dyspoissonism floor, D0.

Unfortunately, my analysis thus far indicates that this task is intractably hard even for practical sizes of Q and Z because it involves a combinatorial optimization problem at its core. But fortunately, approximating Lmax to several digits appears to be readily doable on a cheap computer -- even for Q and Z on the order of (2 ^ 64). The Dyspoissometer demo includes a call to dyspoissometer_logfreedom_max_get() for the sake of illustration.

The process is rather complicated and nothing remotely like evaluating a simple power series, so unless you prefer to understand it directly from the C code, I suggest watching the following video. Unfortunately, it was recorded when I still referred to the mask count, Q, as the "block count"; and the mask span, Z, as the "mask count". Hopefully the graphics will make the process sufficiently clear in spite of this.




Information Sparsity vs. Dyspoissonism

Home / Previous Lesson / Next Lesson

Related Dyspoissometer functions: dyspoissometer_sparsity_get().

Dyspoissonism provides an answer to the question "How compressible is this mask list?" Information sparsity answers a slightly different question: "How does this mask list compare to the most random possible mask list?" Like dyspoissonism, information sparsity is expressed as a fraction on [0, 1], but unlike dyspoissonism, it can easily reach either extreme in finite test cases.

The information sparsity, S, is given by:

S = 1 - (L / Lmax)

where L is the logfreedom of the mask list in question and Lmax is the maximum possible logfreedom, given Q and Z. (The ratio on the right is called "information density", but because it is usually close to one for real world data, sparsity is a more practical metric.) So more formally (and awkwardly), we have:

S = 1 - (L(H, Z) / Lmax(Q, Z))

See dyspoissometer_sparsity_get() in the source code for an implementation of S.
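Here is a minimal sketch of S in C. It is not the library's actual implementation; in particular, I am assuming that the "safety" in dyspoissometer_sparsity_get() amounts to guarding the degenerate and rounding cases so that S stays on [0, 1].

/* Information sparsity: S = 1 - (L / Lmax), clamped to [0, 1]. */
static double sparsity_get(double logfreedom, double logfreedom_max) {
  double sparsity;
  /* Degenerate case: only one possible mask list, so call it maximally dense. */
  if (logfreedom_max <= 0.0) {
    return 0.0;
  }
  sparsity = 1.0 - (logfreedom / logfreedom_max);
  /* Clamp against numerical error (or a slightly understated Lmax estimate). */
  if (sparsity < 0.0) {
    sparsity = 0.0;
  } else if (sparsity > 1.0) {
    sparsity = 1.0;
  }
  return sparsity;
}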

So S is a fraction which is zero when the logfreedom is maximal (Lmax), and one when it's minimal (zero). The wonderful thing about S is that it allows us to compare various mask sets with different mask counts and mask spans in a normalized and reasonably fair manner, as opposed to dyspoissonism which, but for some trivial corner cases, has a nonzero lower bound which depends upon Q and Z. So we can conclude, for example, that mask set A is closer to its state of maximum randomness than mask set B, even though the latter is larger and involves more masks.

In theory, armed with S, we would have little or no use for D. Unfortunately, finding Lmax appears to involve a combinatorial problem of complexity which is bounded on the low side by Q -- yikes!

But all is not lost. We can, in practice, estimate Lmax to several digits in a matter of seconds on a cheap computer. Indeed, this is why Dyspoissometer for Files (command: "dyspoissofile") gives us the option of whether to compute S at all, and with how many approximating iterations for Lmax. How to do that is the topic of the next lesson.


May 11, 2015

Random Number Generators and Logfreedom

Home / Previous Lesson / Next Lesson

Related Dyspoissometer functions: dyspoissometer_logfreedom_median_get(), dyspoissometer_mask_list_pseudorandom_get().

There are 2 basic types of random number generators: (1) pseudorandom number generators (PRNGs), which are deterministic integer systems that produce output which appears (to humans) to be random, apart from the fact that they can be made to recur given the same seed (input value); and (2) true random number generators (TRNGs), which produce random bits from a physical noise source. Traditionally, the stated goal of PRNG design is to mimic TRNGs so faithfully as to be indistinguishable from them, apart from the aforementioned seed issue.

Logfreedom shows such a design goal to be fundamentally misguided. Here's why:

Imagine that we have an "ideal" binary TRNG -- one without any bitwise bias whatsoever. Assume that we use it to create a mask list. For maximal statistical significance, Q and Z should be equal, say, (2^16) for binary convenience. Should we expect this ideal TRNG to generate the most random possible mask list (with maximum logfreedom)?

No, because zero bias does not guarantee maximum randomness -- quite the opposite. What it does imply is that the output is just as likely to be any particular mask list as any other -- including, rarely, one with very low logfreedom (high dyspoissonism). So how can we tell the difference between an unbiased TRNG and a biased one, without sampling an astronomical number of mask lists?

One good approach is to look at the median logfreedom. The median is simply the middle value in a list (not to be confused with the mean, which is the average value). We use the median because computing the mean tends to involve a significant degree of error due to the tiny differences among most logfreedom samples, competing for precision with the enormous differences among a few others. Nevertheless, in theory, analyzing the mean should be useful.

Determining the median logfreedom is conceptually easy:

1. Create an unbiased pseudorandom mask list. (Obviously, we don't want to bias our notion of median logfreedom due to biases in the PRNG used to generate the list. So we need to be confident that the PRNG has an intractably long limit cycle, for starters.)

2. Measure its logfreedom and append it to a list.

3. Loop to step 1 until the desired number of samples has been reached.

4. Sort the list of logfreedoms.

5. Read the logfreedom halfway down the list.

Dyspoissometer uses various statistical tricks to emulate the above steps without a sort process, but nevertheless produces the correct median once the list has been populated. See dyspoissometer_logfreedom_median_get() in the source code for details.
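To make the recipe concrete, here is a minimal, self-contained sketch of those five steps in C. It is emphatically not how dyspoissometer_logfreedom_median_get() works internally (as noted, the library avoids the sort), it uses rand() merely as a stand-in for a serious PRNG, and it computes each logfreedom directly from the logfreedom formula of the earlier lesson via lgamma().

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define Q 256              /* mask count; the lesson suggests 2^16 for real work */
#define Z 256              /* mask span; likewise */
#define SAMPLE_COUNT 1001  /* odd, so the median is an actual sample */

/* log(N!) via the log-gamma function. */
static double logsum(unsigned n) {
  return lgamma((double)n + 1.0);
}

/* Logfreedom of one mask list, straight from the logfreedom formula:
   L = log(Q!) + log(Z!) - log(H0!) - sum(log(H[F]!)) - sum(H[F] log(F!)). */
static double mask_list_logfreedom(const unsigned *mask_list) {
  unsigned freq_list[Z] = {0};
  unsigned pop_list[Q + 1] = {0};  /* populations of frequencies 0 through Q */
  double logfreedom;
  unsigned f, i;
  for (i = 0; i < Q; i++) {
    freq_list[mask_list[i]]++;
  }
  for (i = 0; i < Z; i++) {
    pop_list[freq_list[i]]++;
  }
  logfreedom = logsum(Q) + logsum(Z) - logsum(pop_list[0]);
  for (f = 1; f <= Q; f++) {
    logfreedom -= logsum(pop_list[f]) + ((double)pop_list[f] * logsum(f));
  }
  return logfreedom;
}

static int double_compare(const void *a, const void *b) {
  double delta = *(const double *)a - *(const double *)b;
  return (delta > 0.0) - (delta < 0.0);
}

int main(void) {
  static double logfreedom_list[SAMPLE_COUNT];
  static unsigned mask_list[Q];
  unsigned i, j;
  srand(1);  /* rand() is only a stand-in for an unbiased, long-cycle PRNG */
  for (i = 0; i < SAMPLE_COUNT; i++) {
    for (j = 0; j < Q; j++) {
      mask_list[j] = (unsigned)(rand() % Z);
    }
    logfreedom_list[i] = mask_list_logfreedom(mask_list);
  }
  qsort(logfreedom_list, SAMPLE_COUNT, sizeof(double), double_compare);
  printf("median logfreedom approx %.6f\n", logfreedom_list[SAMPLE_COUNT / 2]);
  return 0;
}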

An ideal TRNG should have very nearly the same median logfreedom as the purportedly unbiased PRNG employed above. If the medians do not closely agree, then at least one of the generators is weak.

But what if the TRNG consistently produces logfreedoms well above the PRNG's median? Assuming that the PRNG was indistinguishable from random, that would still be a sign of statistical bias, as though the TRNG were artificially contrived to do so. Think of the extreme case: one could create a PRNG masquerading as a TRNG which always outputs the same mask list, which happens to have been architected to have maximum logfreedom.

By the way, precisely computing the median, mean, or maximum logfreedom is computationally hard, and perhaps combinatorially so. So we must rest content with approximation pending further insights into this problem.

Thus an ideal TRNG actually creates mask lists with logfreedoms above and below the median with equal probability. TRNGs which do not do this are probably weak.

But by definition, a PRNG produces only one mask list. It could be intractably large, for example, the set of all (2 ^ 128) masks produced by AES128 encryption using equally many different keys as inputs; in which case, we would need a minimodel thereof in order to extrapolate its randomness quality (for example, a similar AES32 implementation). In any event, if we have only one mask list, with each output (mask) corresponding to a unique input (mask index), then commonsense security considerations say that we should make it as random as possible within the constraints of time and storage. In other words, a good PRNG should have a minimodel implementation with as close to maximum logfreedom as possible.

In summary, an ideal TRNG exhibits no bias with respect to median logfreedom and certainly makes no attempt to restrict itself to maximum logfreedom; whereas the mask list produced by an ideal PRNG does indeed approximate maximum logfreedom to the degree attainable within its design constraints.

Fortunately, Dyspoissometer does contain a fast algo for computing maximum logfreedom. This is important because it can provide us with an accurate picture of the "information sparsity" of a mask list. This is the topic of the next lesson.


Dyspoissonism Defined

Home / Previous Lesson / Next Lesson

Related Dyspoissometer functions: dyspoissometer_dyspoissonism_get().

Dyspoissonism is a fraction expressing the extent to which a population list corresponding to a given mask list and mask span (Z) diverges from the most random possible case. The most random possible case, in turn, is asymptotically equivalent to a Q-scaled, integer constrained Poisson distribution with lambda = (Q / Z). Hence the word "dyspoissonism", which refers to the fractional extent of divergence from such a scaled Poisson distribution. (Note that "dys" refers to "divergence from", as opposed to "dis" which simply means "the opposite of".)

Mathematically, the dyspoissonism D of a mask list with logfreedom L, mask span Z, and implied mask count Q is given, in the trivial cases, by:

D = 0 if (Q = 1) or (Z = 1)

otherwise:

D = 1 - (L / (Q log(Z)))

See dyspoissometer_dyspoissonism_get() in the source code for an implementation of D.

The trivial cases are defined as such because if we have only one mask index (Q = 1) or only one possible mask (Z = 1) -- and neither can be less than one -- then there is only one possible value of H, so the mask list is, by default, as random as it can possibly be, given the constraints. Therefore, intuitively, the dyspoissonism must be zero.
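As a minimal sketch of the formula in C (not necessarily how dyspoissometer_dyspoissonism_get() itself is written), assuming the logfreedom L has already been computed:

#include <math.h>

static double dyspoissonism(double logfreedom, double mask_count_q, double mask_span_z) {
  /* Trivial cases: only one mask index, or only one possible mask value. */
  if ((mask_count_q <= 1.0) || (mask_span_z <= 1.0)) {
    return 0.0;
  }
  /* D = 1 - (L / (Q log(Z))). */
  return 1.0 - (logfreedom / (mask_count_q * log(mask_span_z)));
}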

By the way, you might wonder why we don't simply have poissonism instead of dyspoissonism. The reason is simple: most real world data tends to be quite noisy and rich in entropy, even if it is dominated by a sparse core of simple signals. So our expectation is to encounter population lists which look very much like Poisson distributions (poissonism close to one). Therefore, for the sake of clear comparison, dyspoissonism is more practical.

Now, where does the formula for D come from? Well, the fractional term is asymptotically equal to one. Specifically, it approaches this limit when Q, Z, and L approach infinity. In turn, this corresponds to minimum D -- the "dyspoissonism floor", D0 -- of a particular Q and Z. Thus under these same limiting conditions, D0 approaches zero.

For its part, (Q log(Z)) is the logfreedom of the raw mask list (which by definition is literal and complete without any provision of H). The reason is that the way count of all mask lists -- that is, the number of unique mask lists complying with Q and Z -- is simply (Z ^ Q) because we have Q mask indexes to fill, each with one of Z possible masks. Appropriately, (Q log(Z)) is called the "obtuse logfreedom" of the mask list.

The fractional part above approaches one because the amount of entropy required to encode H under the above limit conditions is asymptotically negligible, and once H has been defined, we need only L eubits more information to determine the mask list. (In other words, it takes very little information to define a very random distribution, but a comparatively huge amount of information to provide the position of every mask which created it.)

Note that under finite conditions, D can never reach zero except in the trivial cases. For that matter, nor can it reach one unless (Q = 1). (In practice, either bound might be produced as an error condition, but this has nothing to do with dyspoissonism itself.)

Dyspoissonism is not merely a vague characterization of randomness quality. It actually conveys the fraction of obtuse logfreedom which, under the absence of mask list bias, should fairly be dedicated to the encoding of H itself. It does not tell us how to most efficiently encode H, however. (Encoding H efficiently is hard because it exhibits Poissonoid population biases. Fortunately, H is usually of negligible size even if we encode it obtusely.) Thus (1 - D) is the fraction of obtuse logfreedom which must be dedicated (under combinatorial data compression) to encode the particular "way" of the mask set; whereas D itself is the fraction of obtuse logfreedom which must be dedicated to most efficiently encode H (under some as yet undiscovered optimal compression scheme for population lists). How can we possibly be confident of this?

The answer is quite straightforward. If any legal mask list compliant with Q and Z (but no particular H) can occur with equal probability as any other, then any unbiased and efficient data compression scheme must actually preserve the exact size of the raw mask list. In other words, a maximally efficient data compression scheme is merely a size-preserving permutation of the mask lists themselves! So given the logfreedom of any particular mask list, we then immediately know the number of eubits which should be dedicated to encoding H, which is just the obtuse logfreedom minus the actual logfreedom. Provided that the compression scheme is reversible and 100% efficient, every mask list should emerge from the compression process with identically the same size.

But you might rightly ask why we would care about a noncompressing "compression" scheme, and how it could possibly measure entropy. As usual, entropy is a matter of perspective: if we have genuinely no expectations whatsoever, then the most efficient compression scheme should preserve size (bits in equals bits out). But if we expect a certain level of randomness, then we implicitly have a bias toward certain population lists at the expense of others. In this manner, we can achieve some level of data compression by changing the method of encoding H, so that we expect to achieve size savings overall (even though that may not be the case in any given trial). In any event, the maximum fraction by which we might compress a mask set is simply the dyspoissonism, because we cannot compress H to a negative size.

And that is precisely why dyspoissonism is a useful entropy metric: it measures the fraction of maximum lossless compressibility (under 100% efficient combinatorial compression), even assuming that H requires zero storage because it's implied (unequivocally predictable). For example, if (D = 0.31), then we cannot hope to compress the mask list by more than 31% (assuming, as always, that the masks all occur with equal probability and in unbiased random order). By "100% efficient combinatorial compression", I mean combinatorial compression which encodes nothing other than the "way number" corresponding to the particular mask set whose dyspoissonism is being measured. Technically, 100% efficiency cannot occur unless the logfreedom in bits (as opposed to eubits) is an integer, but for all practical purposes, rounding up to the nearest bit won't hurt.

In summary:

1. "Compressibility" in the context of dyspoissonism refers to the fractional extent to which a mask list and its population list -- jointly -- could be compressed if the latter required zero storage.

2. Dyspoissonism is a fraction on [0, 1]. Lesser values indicate lesser compressibility (more randomness).

3. Logfreedom is a nonnegative number with (Q log(Z)) as a loose upper bound. Greater values indicate lesser compressibility (more randomness).


Bits, Eubits, Logfreedom, and Entropy

Home / Previous Lesson / Next Lesson

Typically we express entropy content in units of bits (zeroes and ones). This makes sense in light of the ease of constructing binary computers. But we could just as easily express entropy in units of hexadecimal digits, or even (gag!) decimal digits. Those are all examples of integer-base number systems. We could, in fact, have an irrational base for our number system, say, Euler's number. Just as one hexadecimal digit is worth 4 bits because (log2(16) = 4), one "eubit" (YOO - bit) is worth (log2(e) = 1.442695...) bits. The motivation for this is that it saves us the trouble of converting back and forth between logfreedom and bits, which is wasteful because they measure the same thing.
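As a quick worked example, the Dyspoissometer demo above reports a logfreedom of about 62.28 for its 30-mask list; that is 62.28 eubits, or (62.28 * 1.442695), about 89.86 bits, which rounds up to the 90 (0x5A) bits the demo reports as bits_in_mask_set.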

Thus what logfreedom is expressing is the amount of entropy required to exactly represent one of (some huge way count) possible mask lists compliant with a given population list and mask span. (If you absolutely must think in bits, then simply multiply by log2(e).) Logfreedom is not concerned with whether or not we literally wish to represent the mask list that way, which is the business of combinatorial data compression, but rather, quantifying the smallest amount of information into which it could be reversibly compressed, absent any statistical bias between one possible mask list vs. another -- in other words, the entropy content of the mask list. (In practice, of course, we would need to round up the storage requirement to an integer number of bits, but this is a trivial consideration.)

Put another way, we can choose to quantify entropy as the logfreedom of a population list corresponding to a particular mask list and mask span.

Note that this fundamentally differs from the definition of entropy from the perspective of compressed sensing. Compressed sensing begins with the assertion that most complicated real world signals are dominated by a small number of simple signals of high amplitude -- such that the signal can be, for all practical purposes, precisely reconstructed from a small set of symbols. This is equivalent to a large mask list dominated by a "zero mask" (corresponding to zero amplitude) with a small number of other ones. But viewed from the perspective of logfreedom, the simplest systems are those with the least way count -- often, but not necessarily, those which are dominated by a single such zero mask.

But which definition of entropy is more reasonable? The trouble with the compressed sensing definition is that it emphasizes the quantity of "large" components (essentially, the L0 norm of a list subjected to a zeroing threshold), not their diversity of amplitude or lack thereof. It ignores another empirical fact about real world signals, which is that they tend to exhibit components of equal amplitude (once digitized). For example, a realistic signal might have 10 components each of amplitude 3. From the compressed sensing perspective, this would be more complicated than, say, 4 components of amplitudes 3, 17, 19, and 24. Depending on the number of signals in the digitized raw data, the former might actually have lesser logfreedom (and therefore consume less storage) than the latter, even if we assume that all other components would be ignored.

We can therefore imagine a revitalization in the field of compressed sensing, focused not on L1, L-fraction, or L0 minimization, but rather on the minimization of logfreedom subject to some metric of signal reconstruction quality. I think this would lead to smaller compressed files and perhaps much higher performance than the former approach, not least because logfreedom can be accurately estimated using a subsampling of the raw data, and computed with maximum precision in one pass of the data without the expensive matrix operations required by norm minimization algos such as Lasso and Homotopy.

To bring this all full circle, look at the following binary mask lists. Which one contains more entropy?

G0 = {0, 0, 0, 0, 0, 0, 0, 0, 1, 0}
G1 = {1, 0, 1, 0, 0, 1, 1, 1, 0, 1}

If you guessed G1, you're wrong. The answer is "neither" because the definition of "entropy" depends on one's expectation. With no expectation of any bias one way or the other, G0 and G1 contain the same amount of entropy, namely, 10 bits' worth. Granted, had I told you to expect mundane real world signals, then G1 would be the correct answer. On the other hand, had I told you that mask lists with 4 zeroes were disproportionately popular, then at a certain point, G1 would have less entropy than G0. Entropy content is, truly, in the eye of the beholder.

Logfreedom reflects the philosophy that in the real world, entropy varies monotonically with the number of states which a system can assume, once its population list and mask span have been chosen. Dyspoissonism, which is the topic of the next lesson, is the fractional difference between the entropy content as measured through the lens of logfreedom, and the assumption that all mask lists contain equal amounts of entropy.

Computing the Logarithm of a Factorial

Home / Previous Lesson / Next Lesson

Related Dyspoissometer macros: LOG(), LOG_N_PLUS_1(), LOG_SUM(), LOG_SUM_N_PLUS_1().

UPDATE: Interval math considerations, and a suggested accelerated approximation method, are discussed in this whitepaper which was also published here at vixra.org. sha256sum=39895297e3b50093bcc6b0595f2e1d1cf2639a8ffa7bb942487c0256dbff5b8f

NOTE: If you arrived here because you searched for 0.918938533204672741780..., then you are dealing with ((1/2) log(2π)), which suggests that your math problem may have something to do with logarithmic summation.

This lesson is about how to compute the natural logarithm ("log") of a factorial, namely:

log(N!)

where N is an integer greater than 1. (In the trivial cases, where N is zero or one, (N!) is one by definition, so log(N!) is zero.) We refer to log(N!) as the "logsum" of N (which is the sum of the logs, not the log of the sum).

Clearly, we can evaluate it by brute force:

log(N!) = Σ(M = (1, N), log(M))

This works quickly and accurately for small N, but at a certain point, neither adjective applies. Fortunately, we can use what I call "Bayes' Approximation" to provide a huge extension in the domain of N which satisfies both adjectives.

A bit of background: James Stirling popularized an approximation for the logsum which was first published by Abraham de Moivre sometime prior to his death in 1754. This became known as "Stirling's Approximation", even though it was neither his discovery nor, as it turns out, convergent!

To make a long story short, Stirling's Approximation comes from approximating log(N!) as an integral using calculus, instead of what it literally is, which is the sum of (a finite number of (an infinite number of rational fractions)). (The log itself is computable using various infinite series, the most rapidly converging of which being the "area hyperbolic tangent" series.) Stirling (well, de Moivre) used the fact that:

∫ log(x) dx = x log(x) - x

if we take the constant of indefinite integration as zero. Integrating this function on the domain [1, N] produces an approximation of log(N!). Then, by adding evermore corrective terms, the approximation can approach the exact value.

At least, that was the theory. But in 1763, 2 years after Thomas Bayes' death, the Royal Society published a letter from him to John Canton, in which he demonstrated that the approximation which had been misattributed to Stirling was in any event nonconvergent! In other words, after a certain number of terms, it became increasingly less accurate. Oops.

I don't know who rectified the situation. Perhaps it was Stirling himself, but in my view, Bayes deserves the credit for finding the problem, and certainly we want to draw a distinction between "Stirling's Broken Approximation" and "Someone's Fixed Approximation". It is the latter which I call "Bayes' Approximation". Yet for some vexing reason, Stirling's Broken Approximation appears to be in popular use today, perhaps because the first few terms are similar to those of Bayes' (be careful!). I wonder how many products have been misengineered as a result...

Now it turns out that Bayes' Approximation, which is to my knowledge absolutely convergent for (N > 0), has buried within it a deep combinatorics problem not unlike the determination of way count itself! I suppose this is logical, since logfreedom, which contains logsum terms, is informationally equivalent to way count. Here is how the approximation works:

log(N!) = (N + (1/2)) log(N + 1) - (N + 1) + (1/2) log(2π) + C(N)

All of the above is derived from calculus with an optimized constant of indefinite integration. Note that the approximation is officially for computing log(Γ(N)), which I have adapted to logsum using the following identity:

log(N!) ≡ log(Γ(N+1))

For its part, the C version of Dyspoissometer uses lgamma(N + 1) to compute the log(N!) in double precision and lgammaq() to compute it in quad precision; the Mathematica version uses LogGamma[] (as opposed to Log[Gamma[]], which is slow).

But what about C(N)? It's given by the following infinite series of finite products:

C(N) = Σ(M = (1, ∞), c(M) / Π(P = (1, M),  (N + P + 1)))

where c(M) is the finite series given by:

c(M) = (1 / (2M)) Σ(P = (1, M), P |s(M, P)| / ((P + 1)(P + 2)))

where |s(M, P)| (note the bars) is the absolute value of the Stirling numbers of the first kind (which, strictly speaking, are signed). Those numbers are easily generated, if you trudge through Wikipedia's awkward presentation of the algo. Without attempting to duplicate the article here, we can see the echoes of way count in their mathematical properties: "In particular, the Stirling numbers of the first kind count permutations according to their number of cycles (counting fixed points as cycles of length one)." So perhaps there is a connection from way count to graph theory which I have not yet appreciated.

Putting it all together:

log(N!) = (N + (1/2)) log(N + 1) - (N + 1) + (1/2) log(2π) + (1 / (12 (N + 2))) + (1 / (12 (N + 2) (N + 3))) + (59 / (360 (N + 2) (N + 3) (N + 4))) + (29 / (60 (N + 2) (N + 3) (N + 4) (N + 5))) + ...

Quiz:

Find log(7!) using brute force and Bayes' Approximation.

Answer:

1. log(7) + log(6) + log(5) + log(4) + log(3) + log(2) = 8.525...

2. Using Bayes' Approximation, we get approximately the same answer at WolframAlpha here.

Fortunately -- at least as far as I can tell -- the lgamma() function in GCC uses Bayes' Approximation. (I discovered this after painstakingly implementing the above formula for a week, only to discover that lgamma() was both slightly slower and more accurate, yet radically simpler by virtue of being a builtin function. All this was quite a coincidence, triggered by opening to the wrong instruction manual page, whereupon I discovered lgamma.)
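A quick check of both routes in C, using lgamma() just as Dyspoissometer itself does; both lines should print approximately 8.525161361065415:

#include <math.h>
#include <stdio.h>

int main(void) {
  double logsum_brute = 0.0;
  unsigned m;
  /* Brute force: log(7!) = log(2) + log(3) + ... + log(7). */
  for (m = 2; m <= 7; m++) {
    logsum_brute += log((double)m);
  }
  printf("brute force: %.15f\n", logsum_brute);
  /* Via the log-gamma function: log(7!) = lgamma(8). */
  printf("lgamma(8):   %.15f\n", lgamma(8.0));
  return 0;
}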

By the way, there are other approaches to approximating the logsum, none of which I prefer to Bayes' Approximation. But see for yourself here at StackExchange.

Now that you know how to compute the logsum efficiently, we can explore the applications of logsum in the next lesson.

Keywords: 0.918938533204672741780, 0.91893853320467274178, 0.9189385332046727417, 0.918938533204672741, 0.91893853320467274, 0.9189385332046727, 0.918938533204672, 0.91893853320467, 0.9189385332046, 0.918938533204, 0.91893853320, 0.9189385332, 0.918938533, 0.91893853, 0.9189385, 0.918938, 0.91893

The Logfreedom Formula

Home / Previous Lesson / Next Lesson

Related Dyspoissometer functions: dyspoissometer_logfreedom_dense_get(), dyspoissometer_logfreedom_sparse_get(), dyspoissometer_u16_list_logfreedom_get(), dyspoissometer_u24_list_logfreedom_get(), dyspoissometer_u32_list_logfreedom_get(), dyspoissometer_u8_list_logfreedom_get().

From the previous lesson, we have the following formula for way count, W:

Formula 1.

X = (Z!) / ((H0!) Π(F = (1, K), H[F]!))
Y = (Q!) / Π(F = (1, K), (F!) ^ H[F])
W = XY

Obviously, W gets too large to handle very quickly, even exceeding physical memory size for the sake of storing a single integer in some practical cases. What to do?

Fortunately, W happens to be composed of nothing more than the products and ratios of a bunch of whole numbers. So we can easily create an expression for its natural logarithm. We abbreviate the natural logarithm as "log" or "ln". The logarithm to base 2 is denoted "log2"; the logarithm to base 10 is denoted "log10"; etc. Fortunately, the log of the way count is indeed susceptible to accurate representation using everyday floating-point arithmetic. Recall that:

Identity 1.

log(AB) ≡ log(A) + log(B)

and

Identity 2.

log(A / B) ≡ log(A) - log(B), B nonzero

and

Identity 3.

log(A^B) ≡ B log(A)

Note that in the identities above we have used the "threequals" symbol ("≡") because they are identities as opposed to solvable equations.

Quiz: What is the log of ((7!)/(3!))?

Answer: (log(7) + log(6) + log(5) + log(4)), or about 6.73.

Given a population list H whose K items correspond to frequencies one through K, and a mask span Z, let Q be the "implied mask count":

Q = Σ(F = (1, K), F H[F])

Then the way count, W, is given by Formula 1 above. We define log(W) as the "logfreedom", L, of H and Z. Applying identities 1, 2, and 3 to Formula 1 gives:

Formula 2.

L = log(X) + log(Y)
L = log(Z!) - log(H0!) - Σ(F = (1, K), log(H[F]!)) + log(Q!) - Σ(F = (1, K), H[F] log(F!))
L = log(Q!) + log(Z!) - log(H0!) - Σ(F = (1, K), log(H[F]!)) - Σ(F = (1, K), H[F] log(F!))

Formula 2 is what we call the "logfreedom formula". It appears in many instances throughout the Dyspoissometer source code. It's quite easy to compute, given H and Z. The only troublesome terms are of the form log(N!). This uncomplicated expression actually gives rise to a fascinating mathematical odyssey, which is the subject of the next lesson.

By the way, the explicit reference to log(H0!) is redundant because 0! is simply one, so we could run the last 2 terms from (F = 0) instead of (F = 1) without introducing a singularity. But we isolate log(H0!) for safety reasons: some builtin functions for computing the log-of-factorial cannot take zero as an input.
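To make Formula 2 concrete, here is a minimal C implementation using lgamma() for the log(N!) terms. (The actual library functions have richer interfaces, for example the (freq_min, freq_max_minus_1) window mentioned in the demo; this is just the bare formula.) Run on the demo's pop_list0_base (H = {1, 1, 3, 2, 2} with Z = 9, hence Q = 30 and H0 = 0), it should print approximately 62.2825, matching the demo output above.

#include <math.h>
#include <stdio.h>

/* log(N!) = lgamma(N + 1); safe for N == 0 because lgamma(1) == 0. */
static double logsum(unsigned n) {
  return lgamma((double)n + 1.0);
}

/* Formula 2: L = log(Q!) + log(Z!) - log(H0!)
              - sum(F = 1..K, log(H[F]!)) - sum(F = 1..K, H[F] log(F!)).
   pop_list[f - 1] holds H[F] for F = 1 through K. */
static double logfreedom_dense(const unsigned *pop_list, unsigned k, unsigned z) {
  unsigned f;
  unsigned h0 = z;
  unsigned q = 0;
  double logfreedom;
  for (f = 1; f <= k; f++) {
    h0 -= pop_list[f - 1];     /* H0 = Z - sum(H[F]) */
    q += f * pop_list[f - 1];  /* Q = sum(F * H[F]) */
  }
  logfreedom = logsum(q) + logsum(z) - logsum(h0);
  for (f = 1; f <= k; f++) {
    logfreedom -= logsum(pop_list[f - 1]);
    logfreedom -= (double)pop_list[f - 1] * logsum(f);
  }
  return logfreedom;
}

int main(void) {
  /* H = {1, 1, 3, 2, 2} for frequencies 1 through 5, and Z = 9 (from the demo). */
  unsigned pop_list[5] = {1, 1, 3, 2, 2};
  printf("logfreedom=%.13f\n", logfreedom_dense(pop_list, 5, 9));
  return 0;
}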

May 10, 2015

Way Count: Counting the Number of Configurations

Home / Previous Lesson / Next Lesson

Central to the discussion of information content in discrete sets is "way count", or the number of mask lists which comply with a given population list, H, and mask span, Z. For example, suppose we have:

H = {5, 1, 2, 1}

Thus we have five masks with frequency 1, one mask with frequency 2, two masks with frequency 3, and one mask with frequency 4. (K, the maximum frequency with nonzero population, is 4.) Remember that the population of frequency zero (H0) is omitted by convention, which means that we must provide the mask span as a given:

Z = 15

(In the real world, Z is almost always a power of 2, and usually a power of 256 because masks usually consist of one or a few bytes each. So compliance with Z is a foregone conclusion in practice, and not something we need to test for. But this is a math lesson.)

Quiz:

1. Find H0, the population of masks with zero frequency.

2. Find Q, the mask count.

Answers:

1.

H0 = Z - Σ(F = (1, K), H[F])
H0 = 6

("Σ(F = (1, K), H[F])" means "the sum of H[F] with F running from 1 through K, inclusive".)

2.

Q = Σ(F = (1, K), F H[F])
Q = (1 * 5) + (2 * 1) + (3 * 2) + (4 * 1)
Q = 17

Look at the frequency list, J, below. Does it comply with H?

J = {3, 4, 1, 1, 1, 3, 1, 2, 1}

We have: 5 (F = 1)s, 1 (F = 2), 2 (F = 3)s, and 1 (F = 4). So its population list is indeed H.

Does J also comply with Z? No, because J contains only 9 items corresponding to 9 masks. Let's try another frequency list, J':

J' = {0, 0, 1, 3, 1, 4, 0, 0, 1, 1, 2, 3, 0, 1, 0}

As you can see if you count the frequencies in J', it also complies with H. And this time, we have 6 masks with frequency zero (H0 = 6), so J' also complies with Z. Thus any mask list G having frequency list J' is a "way" of H and Z.

The "way count" W of H and Z is the number of unique mask lists G complying with H and Z. We could calculate W by brute force:

W = 0
For every J compliant with H and Z
    For every G compliant with J
        W = W + 1
    Loop
Loop

Unfortunately, for all but the most trivial cases, this method rapidly becomes intractable. (Nonetheless, brute force testing proved quite useful for verifying Dyspoissometer's notion of W in simple test cases.) Fortunately, high school combinatorics can help us out here.

First of all, how many (J)s are compliant with H and Z? Well, we know that J contains Z items -- one for each mask. How many ways, X1, can we choose the masks with frequency 1? (There are H[1] such masks.)

X1 = Z nCr H[1]

where the right side means "Z combination H[1]".

And how many ways, Y1, can we choose the mask indexes corresponding to those masks? The least mask could occur at any of Q different mask indexes; the 2nd least mask could occur at any of (Q - 1); the 3rd least could occur at any of (Q - 2), etc. All possible combinations are allowed, so we need to take the product of these terms. So we have:

Y1 = Q nPr H[1]

where the right side means "Q permutation H[1]".

But what about masks with frequency 2? How many ways, X2, can we choose them? Well, after choosing the frequency-one masks, we have only (Z - H[1]) masks left from which to select a subset of size H[2]:

X2 = (Z - H[1]) nCr H[2]

Similarly, we can quantify X3 and X4, the number of ways in which we could choose masks of frequency 3 and 4, respectively:

X3 = (Z - H[1] - H[2]) nCr H[3]
X4 = (Z - H[1] - H[2] - H[3]) nCr H[4]

But how many ways, Y2, can H[2] different masks each have frequency 2, thereby occurring a total of 2H[2] times? Recall that we have only (Q - H[1]) mask indexes remaining from which to choose. The first pair of mask indexes containing masks with frequency 2 can thus be chosen in ((Q - H[1]) nCr 2) ways. The next set can then be chosen in ((Q - H[1] - 2) nCr 2) ways. The 3rd set can then be chosen in ((Q - H[1] - 4) nCr 2) ways, the 4th in ((Q - H[1] - 6) nCr 2) ways, etc. So we have:

Y2 = Π(N = (1, H[2]), ((Q - H[1] - 2(N - 1)) nCr 2))

where "Π" means "product of" with the index, N, running from 1 to H[2]. Now, look at the "nCr 2" part. This term, first of all, implies that we have H[2] factors of (2!) (where "!" means "factorial"). (By the way, (0!) and (1!) are both defined to be one.) So we could simplify this to:

Y2 = Π(N = (1, H[2]), (Q - H[1] - 2(N - 1))(Q - H[1] - 2(N - 1) - 1)) / ((2!) ^ H[2])

Where "A ^ B" means (A raised to the power of B) -- not to be confused with the xor operator in C. And notice that we have expanded the pair of terms in the numerator of the combination into an explicit product of neighboring whole numbers. But if you look at this expression, we can simply replace the product series with a generic permutation:

Y2 = ((Q - H[1]) nPr (2H[2])) / ((2!) ^ H[2])

For Y3, we have (Q - H[1] - 2H[2]) mask indexes available from which to choose 3H[3] of them to contain all the frequency-3 masks. Using a very similar process, we can easily derive:

Y3 = ((Q - H[1] - 2H[2]) nPr (3H[3])) / ((3!) ^ H[3])

and then, analogously:

Y4 = ((Q - H[1] - 2H[2] - 3H[3]) nPr (4H[4])) / ((4!) ^ H[4])

By the way, look again at Y1:

Y1 = Q nPr H[1]

Note that we can rewrite it in the same pattern as the higher-order (Y)s:

Y1 = (Q nPr 1H[1]) / ((1!) ^ H[1])

Finding W is then a simple matter of plugging in the appropriate items from:

H = {5, 1, 2, 1}

then taking products:

W = X1 * X2 * X3 * X4 * Y1 * Y2 * Y3 * Y4
W = 3003 * 10 * 36 * 7 * 742,560 * 66 * 4200 * 1
W = 1,557,688,630,417,920,000

So there you have it: we can map every unique mask list compliant with H and Z to an integer on [0, (1,557,688,630,417,920,000 - 1)]. The mechanics of performing such a mapping are nontrivial and computationally expensive. Nevertheless, this concept is the basis of combinatorial data compression: given Z (usually implied) and H (explicitly specified in a usually-negligible amount of information), we can encode a single integer of ceiling(log2(W)) bits which exactly implies -- but is (maybe vastly) smaller than -- the mask list. Assuming that all compliant mask lists occur with equal probability, combinatorial compression is maximally efficient (minimum size).

But for the purposes of statistical analysis as opposed to data compression, we have an obvious problem: W is huge, approaching the capacity of a 64-bit integer -- all from an H with only 4 small items! In practice, standard integer types are useless for this reason, and even quad precision floating-point is rapidly exhausted by way counts of realistic size. We have to do something about this.

First of all, we can simplify the expression for W. Generically, we have:

W = X * Y

where:

X = Π(F = (1, K), (Z - Σ(N = (1, (F - 1)), H[N])) nCr H[F])

and

Y = Π(F = (1, K), ((Q - Σ(N = (1, (F - 1)), N H[N])) nPr (F H[F])) / ((F!) ^ H[F]))

But by expanding the combinations and permutations, we can simplify X and Y as follows:

X = (Z!) / ((H0!) * Π(F = (1, K), H[F]!))
Y = (Q!) / Π(F = (1, K), (F!) ^ H[F])

This is somewhat intuitive: (Q!) reflects that all mask indexes are used, but ((Z!) / (H0!)) reflects that only masks with nonzero frequency are used. But is this really correct? Try the example again:

X = (15!) / ((6!) * (5!) * (1!) * (2!) * (1!))
Y = (17!) / (((1!) ^ 5) * ((2!) ^ 1) * ((3!) ^ 2) * ((4!) ^ 1))

X = 7,567,560
Y = 205,837,632,000
W = XY
W = 1,557,688,630,417,920,000

Looks good! Now proceed to the next lesson to see how we deal with the problem of huge W.
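
For this small example, W happens to fit in a 64-bit integer, so we can verify the simplified X and Y expressions with exact arithmetic. A standalone sketch (exact only because the numbers are tiny; as the next lesson explains, Dyspoissometer works with logarithms instead):

#include <stdio.h>

static unsigned long long factorial_get(unsigned long long n) {
  unsigned long long result = 1;

  while (n > 1) {
    result *= n--;
  }
  return result;
}

int main(void) {
  /* H = {5, 1, 2, 1}, Z = 15, Q = 17, H0 = 6 from above. */
  unsigned long long h[] = {5, 1, 2, 1};
  unsigned long long k = 4;
  unsigned long long z = 15;
  unsigned long long q = 17;
  unsigned long long h0 = 6;
  unsigned long long f;
  unsigned long long n;
  unsigned long long x;
  unsigned long long y;

  /* X = (Z!) / ((H0!) * Π(F = (1, K), H[F]!)) */
  x = factorial_get(z) / factorial_get(h0);
  for (f = 1; f <= k; f++) {
    x /= factorial_get(h[f - 1]);
  }
  /* Y = (Q!) / Π(F = (1, K), (F!) ^ H[F]) */
  y = factorial_get(q);
  for (f = 1; f <= k; f++) {
    for (n = 0; n < h[f - 1]; n++) {
      y /= factorial_get(f);
    }
  }
  /* Prints "X=7567560 Y=205837632000 W=1557688630417920000". */
  printf("X=%llu Y=%llu W=%llu\n", x, y, x * y);
  return 0;
}

As an aside, ceiling(log2(1,557,688,630,417,920,000)) is 61, so the combinatorial compression described above would encode this entire 17-mask list as a 61-bit integer, plus whatever it costs to convey H.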


The Terminology of Mask Lists

Home / Next Lesson

Related Dyspoissometer functions: dyspoissometer_uint_list_malloc_zero(), dyspoissometer_freq_max_minus_1_get(), dyspoissometer_pop_list_get(), dyspoissometer_freq_list_get(), dyspoissometer_freq_list_update_autoscale().

In order to understand dyspoissonism, it is necessary to start with a few basic definitions.

First of all, we have the concept of a "mask list". (Mathematicians would call this a "mask set", but computer programmers think of "set" as a verb with an entirely different meaning, so "list" is preferable. We use G in the math and "mask_list" in the source code.) A mask list is a nonnull list of whole numbers, each of which is on the (inclusive) interval [0, Z - 1]. That is, no integer in the list can be less than zero or more than (Z - 1), where Z is called the "mask span". For example, suppose that (Z = 6). We could then have the following value of G:

G = {1, 5, 2, 2, 0, 3, 5, 1, 2}

Each whole number in G is called a "mask". But there is another set of whole numbers in G, relaying the position information of each mask: the mask indexes. Index zero contains (mask = 1); index one contains (mask = 5); and so on. (The terminology originated with the historical work which prompted the discovery of generalized dyspoissonism, although we now say "mask indexes" instead of "block numbers", "mask span" instead of "mask count", and "mask count" instead of "block count".)

In this case, we have 9 mask indexes, numbered zero through 8. So mask indexes are all on [0, 8]. The maximum such index, denoted mask_idx_max in the Dyspoissometer source code, is thus 8, and the "mask count", denoted Q in the math, is 9.

By the way, notice that masks all fall on a dense interval (with no skipped integers), so we are implicitly assuming that you have already performed a mapping from "real world symbols" to whole numbers. For example, the masks might correspond to English words, people, fruits, cars, or whatever.

Look again at the mask list G above. Notice that there is a single zero, two 1s, three 2s, one 3, no 4s, and two 5s. These are what we call the "frequencies" of these masks. We can make a list, denoted J in the math and "freq_list" in the code, which contains 6 items, one for each mask on [0, mask_max], showing its frequency:

J = {1, 2, 3, 1, 0, 2}

As a sanity check, note that the sum of the frequencies equals the mask count:

1 + 2 + 3 + 1 + 0 + 2 = 9

In turn, we can create a "population list", denoted as H in the math and "pop_list" in the code, containing the frequencies of each frequency. ("Population" and "frequency" mean basically the same thing in this case, but we choose to use the former as an unambiguous short form for "frequency of frequency".) Again, we can count the frequencies: we have one 0, two 1s, two 2s, and one 3:

H = {2, 2, 1}

Oops! What happened to that zero? Shouldn't it be:

H = {1, 2, 2, 1}

Well, not by convention in this case. We omit the population of frequency zero -- in other words, the number of possible (allowed) masks which do not occur in mask_list. The reason is that it's usually implicit because mask_max is given and we have all the other nonzero populations in H. The population of frequency zero is referred to as H0 ("H zero"). Similarly, the population of frequency F is denoted H[F] (note the brackets). Also by convention, population lists always terminate with a nonzero value and do not omit any nonzero populations (other than H0).

Once again, as a sanity check, note that the sum of populations equals the mask span:

(1 + 2 + 2 + 1) = 6

For further verification, note that the sum of (F H[F]) equals the mask count:

(1 * 2) + (2 * 2) + (3 * 1) = 9
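
All of this counting is mechanical. Just to make the definitions concrete, here is a tiny standalone C sketch (illustrative only -- in practice you would use the related Dyspoissometer functions listed at the top of this lesson) which derives J and H from the example G:

#include <stdio.h>

int main(void) {
  /* G = {1, 5, 2, 2, 0, 3, 5, 1, 2}, with (Z = 6), so masks are on [0, 5]. */
  unsigned long g[] = {1, 5, 2, 2, 0, 3, 5, 1, 2};
  unsigned long q = 9;        /* mask count, i.e. (mask_idx_max + 1) */
  unsigned long z = 6;        /* mask span, i.e. (mask_max + 1) */
  unsigned long j[6] = {0};   /* frequency list: one item per mask */
  unsigned long h[10] = {0};  /* populations indexed by frequency; h[0] is H0 */
  unsigned long idx;

  for (idx = 0; idx < q; idx++) {
    j[g[idx]]++;    /* frequency of each mask */
  }
  for (idx = 0; idx < z; idx++) {
    h[j[idx]]++;    /* population of each frequency */
  }
  /* Prints "J={1,2,3,1,0,2} H0=1 H={2,2,1}". */
  printf("J={");
  for (idx = 0; idx < z; idx++) {
    printf("%lu%s", j[idx], (idx == (z - 1)) ? "} " : ",");
  }
  printf("H0=%lu H={%lu,%lu,%lu}\n", h[0], h[1], h[2], h[3]);
  return 0;
}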


Intro

Dyspoissometer source code in C and Mathematica is here for the impatient. Documentation follows.

This work evolved into a protracted research project funded by Tigerspike that almost got me fired. Without the pioneering mentality and superhuman patience of Luke Janssen, Oliver Palmer, Stuart Christmas, and Steven Zhang, you would not have this tool. They even granted me the copyright when I left on amicable terms so I could continue to manage it. If you want to build a custom app around it, give them a call.

Dyspoissonism (dis - PWAH - suhn - izm) is a quantity in the range of zero to one which measures the maximum possible compression fraction achievable under conditions of lossless combinatorial data compression applied to an unbiased random list of symbols. Put another way, dyspoissonism is a robust metric of distributed (global) randomness, with lesser values corresponding to greater randomness. Part of the mission of this blog is to mathematically justify this assertion. The other goal is to teach you how to apply it in practice, on real world digital signals, using Dyspoissometer, an open source discrete statistical analysis tool available in C and Mathematica languages.

Over the centuries, various tools have been developed to allow us to quantify the error in measurements, such as the Gaussian, Maxwell-Boltzmann, and Poisson distributions familiar to most scientists and engineers. Loosely speaking, wide fuzzy distributions correspond to more information and narrow spikey distributions correspond to less information.

Bear in mind, what humans perceive as "interesting" is, from a statistical perspective, generally quite sparse in information content. This is the fundamental assertion behind the field of compressed sensing. Thus the narrower distributions are regarded as more "informative", which is reflective of their capacity to offer us insights into the operation of the universe -- not to provide us with a huge quantity of incomprehensible information.

It is entirely too common that analog statistical methods are (sometimes grossly) misapplied to discrete systems. Unfortunately, there is a dearth of tools appropriate for such ends. But they say that necessity is the mother of invention. So thanks again to Tigerspike, I'm pleased to offer you the following tutorial on dyspoissonism in addition to the source code linked above. You will find other articles posted to this blog from time to time, but here are the basics:

1. The Terminology of Mask Lists
2. Way Count: Counting the Number of Configurations
3. The Logfreedom Formula
4. Computing the Logarithm of a Factorial
5. Bits, Eubits, Logfreedom, and Entropy
6. Dyspoissonism Defined
7. Random Number Generators and Logfreedom
8. Information Sparsity vs. Dyspoissonism
9. Rapid Approximation of Maximum Logfreedom
10. Dyspoissometer Source Code
11. Dyspoissometer Demo Output