Participants may optionally register for the competition by sending e-mail to barak+list-abbadingo-request@pearlmutter.net. We will put registered participants on a mailing list for competition-related announcements.
To be specific, anyone who deliberately consults the Oracle more than 25 times in one day will be disqualified, as will anyone who mounts any kind of security or denial-of-service attack against the Abbadingo server.
Figuring out how to generate the best possible hypothesis from the available training data is part of the competition. We provide one useful hint below. You also might want to think about cross-validation.
Short theoretical discussion: In the worst case, learning DFAs from given data is NP-hard [A-78, PW-93, KV-89]. However, the average case is not bad [TB-73, L-92, FKRRSS-93]. The problems in this competition are of exactly the same type as those described in [L-92], but they are harder than the ones successfully solved in that paper because there is less training data. However, there is more data than the information-theoretic lower bound requires, so the problems are solvable in principle. Because the amount of training data for these problems lies between the current upper and lower sample complexity bounds, the problems are in prime research territory.
However, other approaches to the competition are also valid. In fact, participants are not required to produce a DFA at all, only an accurate set of labels for the test set.
There is also an R4RS Scheme implementation of the Trakhtenbrot-Barzdin algorithm available. This Scheme code has not been optimized and is orders of magnitude slower than the C version; it is provided for expository purposes only.
In order to avoid biasing participants, we are not publishing our own library of routines for DFA manipulation, minimization, etc. However, if you have source code, useful calculations, or anything else you'd like to share with other participants, we will gladly add it to the collection of generously donated materials.
You might also want to search the self-proclaimed Homepage for the International Grammatical Inference Community.
It is a straightforward exercise to find a closed-form expression that approximates the distribution of the number of nodes (or links) of the target DFA that the training set fails to terminate at (or traverse), as a function of the training set size.
It may be that some of the sparser training sets do not fully constrain the solution. Even if they don't, it might be that the error rate induced by missing information would be below the 1% tolerance.
Doing the best you can with limited data is part of the competition.
    depth = max_{s in states} min_{x : x leads to s} length(x)

This terminology extends the notion of the depth of a tree in the natural way, and was used by Trakhtenbrot and Barzdin' [TB-73].
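Concretely, the depth of a DFA is the largest shortest-path distance from the start state to any reachable state, so it can be computed by a breadth-first search. A minimal Python sketch (the transition-table representation, a list of per-state successor lists, is an assumption made for illustration):

```python
from collections import deque

def dfa_depth(delta, start=0):
    """Depth of a DFA given as delta[state][symbol] -> state:
    the longest shortest-path distance from the start state to any
    reachable state, matching the max-min definition above."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in delta[s]:
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return max(dist.values())
```

For example, the two-state parity automaton `[[0, 1], [1, 0]]` has depth 1, since every state is reachable from the start state in at most one step.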