problem name | problem status | training set | testing set | target DFA states | target DFA depth | strings in training set |
---|---|---|---|---|---|---|
A | practice | train.a.gz | test.a.gz | 61 | 10 | 4456 |
B | practice | train.b.gz | test.b.gz | 119 | 12 | 13894 |
C | practice | train.c.gz | test.c.gz | 247 | 14 | 36992 |
D | practice | train.d.gz | test.d.gz | 498 | 16 | 115000 |
1 | solved1 3/10 | train.1.gz | test.1.gz | 63 | 10 | 3478 |
2 | solved1 3/19 | train.2.gz | test.2.gz | 138 | 12 | 10723 |
3 | solved2 5/6 | train.3.gz | test.3.gz | 260 | 14 | 28413 |
R | solved2 5/15 | train.r.gz | test.r.gz | 499 | 16 | 87500 |
4 | solved1 3/14 | train.4.gz | test.4.gz | 68 | 10 | 2499 |
5 | solved1 3/31 | train.5.gz | test.5.gz | 130 | 12 | 7553 |
6 | solved1 6/8 | train.6.gz | test.6.gz | 262 | 14 | 19834 |
S | solved2 5/16 | train.s.gz | test.s.gz | 506 | 16 | 60000 |
7 | solved1 7/14 | train.7.gz | test.7.gz | 65 | 10 | 1521 |
8 | unsolved | train.8.gz | test.8.gz | 125 | 12 | 4382 |
9 | unsolved | train.9.gz | test.9.gz | 267 | 14 | 11255 |
T | unsolved | train.t.gz | test.t.gz | 519 | 16 | 32500 |
(The above ordering does not necessarily correspond to difficulty. The official ranking system, which is reflected in the criteria for winning an award, is two-dimensional.)
The files train.?.gz contain sample strings labeled according to the sixteen languages in the competition. You should use them to infer the languages. To check your answers, classify the strings in the corresponding test.?.gz file and then test your classifications using the Abbadingo Oracle.
Data sets A, B, C, and D are for practice only.
Data sets R, S, and T are new official (i.e. non-practice) problems of nominal size 512.
The individual files above are compressed with gzip. Alternatively, you can get all the files in one shot (2,675K gzipped tar file; 2,676K zip file; 1,884K bzipped tar file), or just the twelve smallest problems in one shot, that is, everything except D, R, S, and T (885K gzipped tar file; 887K zip file; 629K bzipped tar file). Note that you can force Netscape to download to a file instead of displaying it by using shift-leftClick on the link.
In each file, the first line is a header giving the number of strings in the file and the number of symbols (for this competition always two: 0 and 1). Each succeeding line specifies one string. These lines have the format "label len sym1 sym2 ... symlen", where len is the length of the string and sym1 sym2 ... symlen are its symbols, separated by white space. The label 1 means accepted, the label 0 means rejected, and the label -1 means unknown (used in testing set files). So if the last line of a file were "0 7 1 0 0 0 1 1 1", it would indicate that the string 1000111 is rejected. For a toy example problem complete with files in this format, see the deterministic finite automata page.
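As an illustration of the file format just described, here is a minimal Python sketch of a reader. The function names `parse_abbadingo` and `load_abbadingo` are hypothetical and not part of the competition materials, and the strict checks (symbol count must equal len, string count must equal the header) are an assumption about how carefully you want to validate input.

```python
import gzip
from typing import Iterable, List, Tuple

# A sample is (label, string), where label is 1 (accepted), 0 (rejected),
# or -1 (unknown), and the string is written as characters, e.g. "1000111".
Sample = Tuple[int, str]


def parse_abbadingo(lines: Iterable[str]) -> Tuple[int, List[Sample]]:
    """Parse lines in the format described above (hypothetical helper).

    Returns (number_of_symbols, samples).
    """
    it = iter(lines)
    # Header: number of strings, then number of symbols in the alphabet.
    header = next(it).split()
    n_strings, n_symbols = int(header[0]), int(header[1])

    samples: List[Sample] = []
    for line in it:
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines (an assumption, not in the spec)
        label, length = int(fields[0]), int(fields[1])
        symbols = fields[2:]
        if len(symbols) != length:
            raise ValueError(f"expected {length} symbols, got {len(symbols)}")
        samples.append((label, "".join(symbols)))

    if len(samples) != n_strings:
        raise ValueError(f"header says {n_strings} strings, found {len(samples)}")
    return n_symbols, samples


def load_abbadingo(path: str) -> Tuple[int, List[Sample]]:
    """Read a gzip-compressed file such as train.a.gz (hypothetical helper)."""
    with gzip.open(path, "rt") as f:
        return parse_abbadingo(f)


if __name__ == "__main__":
    # Tiny made-up file whose last line is the example from the text.
    demo = ["2 2", "1 3 0 1 1", "0 7 1 0 0 0 1 1 1"]
    n_symbols, samples = parse_abbadingo(demo)
    print(n_symbols, samples)
```

Joining the symbols into a single string is convenient for a two-symbol alphabet; for larger alphabets you would probably keep the list of symbols instead.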
Good luck, and may the best algorithms win!