Abbadingo Data Sets

Problems and their Current Status
Problem  Status           Training set  Testing set  Target DFA states  Target DFA depth  Strings in training set
A        practice         train.a.gz    test.a.gz    61                 10                4456
B        practice         train.b.gz    test.b.gz    119                12                13894
C        practice         train.c.gz    test.c.gz    247                14                36992
D        practice         train.d.gz    test.d.gz    498                16                115000
1        solved[1] 3/10   train.1.gz    test.1.gz    63                 10                3478
2        solved[1] 3/19   train.2.gz    test.2.gz    138                12                10723
3        solved[2] 5/6    train.3.gz    test.3.gz    260                14                28413
R        solved[2] 5/15   train.r.gz    test.r.gz    499                16                87500
4        solved[1] 3/14   train.4.gz    test.4.gz    68                 10                2499
5        solved[1] 3/31   train.5.gz    test.5.gz    130                12                7553
6        solved[1] 6/8    train.6.gz    test.6.gz    262                14                19834
S        solved[2] 5/16   train.s.gz    test.s.gz    506                16                60000
7        solved[1] 7/14   train.7.gz    test.7.gz    65                 10                1521
8        unsolved         train.8.gz    test.8.gz    125                12                4382
9        unsolved         train.9.gz    test.9.gz    267                14                11255
T        unsolved         train.t.gz    test.t.gz    519                16                32500

[1] By Hugues Juille, loyal grad student of the incomparable Jordan Pollack.
[2] By Rod Price, England.

(The above ordering does not necessarily correspond to difficulty. The official ranking system, reflected in the criteria for winning an award, is two-dimensional.)

The files train.?.gz contain sample strings labeled according to the sixteen languages in the competition, one language per file. You should use them to infer the languages. You can check your answers with the files test.?.gz, which contain strings for you to classify and then test using the Abbadingo Oracle.

Data sets A, B, C, and D are for practice only.

Data sets R, S, and T are new official (i.e. non-practice) problems of nominal size 512.

The above individual files are compressed with gzip. Alternatively, you can get all the files in one shot (2,675K gzipped tar file; 2,676K zip file; 1,884K bzipped tar file), or the twelve smallest problems in one shot, that is, everything except D, R, S, and T (885K gzipped tar file; 887K zip file; 629K bzipped tar file). Note that you can force Netscape to download to a file instead of displaying it by shift-left-clicking on the link.

In each file, the first line is a header giving the number of strings in the file and the number of symbols (for this competition always two: 0 and 1). Each succeeding line specifies one string. These lines have the format "label len sym1 sym2 ... symlen", where len is the length of the string and sym1 sym2 ... symlen are its symbols, separated by white space. The label 1 means accepted, the label 0 means rejected, and the label -1 means unknown (used in testing set files). So if the last line of a file were "0 7 1 0 0 0 1 1 1", it would indicate that the string 1000111 is rejected. For a toy example problem complete with files in this format, see the deterministic finite automata page.
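
The format described above can be read with a few lines of code. The following Python sketch is an illustration only, not part of the official Abbadingo tools. The names parse_abbadingo and load_gz are hypothetical, and the sketch assumes that every symbol is a single character (true for the 0/1 alphabet used here), so that a string's symbols can be joined without separators.

```python
import gzip
from typing import Iterable, List, Tuple

Sample = Tuple[int, str]  # (label, string of symbols)


def parse_abbadingo(lines: Iterable[str]) -> Tuple[int, int, List[Sample]]:
    """Parse a header line followed by 'label len sym1 ... symlen' lines.

    Hypothetical helper: returns (number of strings, number of symbols, samples).
    """
    # Split every line on white space and skip blank lines.
    rows = [fields for fields in (line.split() for line in lines) if fields]
    if not rows:
        raise ValueError("empty input: missing header line")

    # Header: number of strings in the file, then number of symbols.
    num_strings, num_symbols = int(rows[0][0]), int(rows[0][1])

    samples: List[Sample] = []
    for fields in rows[1:]:
        label, length = int(fields[0]), int(fields[1])
        symbols = fields[2:]
        # 1 = accepted, 0 = rejected, -1 = unknown (testing set files).
        if label not in (1, 0, -1):
            raise ValueError(f"unexpected label {label}")
        if len(symbols) != length:
            raise ValueError(f"expected {length} symbols, found {len(symbols)}")
        # Assumption: single-character symbols, so joining is unambiguous.
        samples.append((label, "".join(symbols)))

    if len(samples) != num_strings:
        raise ValueError(f"header promises {num_strings} strings, found {len(samples)}")
    return num_strings, num_symbols, samples


def load_gz(path: str) -> Tuple[int, int, List[Sample]]:
    """Read a gzip-compressed file such as train.1.gz (hypothetical helper)."""
    with gzip.open(path, "rt") as handle:
        return parse_abbadingo(handle)


if __name__ == "__main__":
    # Toy input made up for illustration; the last line is the example from the text.
    toy = ["2 2", "1 3 0 1 1", "0 7 1 0 0 0 1 1 1"]
    count, alphabet_size, samples = parse_abbadingo(toy)
    for label, string in samples:
        print(label, string)
```

After downloading a training file, a call such as load_gz("train.1.gz") would return the header values together with the list of labeled strings. In a testing set file every label would come back as -1.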

Good luck, and may the best algorithms win!
