eg-GRIDS+: What is it?
eg-GRIDS+ is an open-source grammatical inference algorithm that learns context-free grammars from positive examples only.
Generalisation is controlled through heuristics (minimum description length).
eg-GRIDS+ sources and binaries can be found here:
You will need Tcl/Tk to run the binaries, since the latest version is distributed as Tcl packages. You can run eg-GRIDS+ through the eg-GRIDS+.tcl file.
How to install it
The current version of eg-GRIDS+ is distributed as an extension to the Tcl/Tk programming language. (There is a standalone binary that can be built from the sources, but it exposes less functionality through its command line arguments.) In order to use it (e.g. under Linux), I suggest the following steps:
- Download and install ActiveTcl 8.5/8.6 from ActiveState (it is free):
I usually install it in /opt/ActiveTcl-8.5
- Copy the folder contained inside egGRIDS1.0-linux64.tar.gz into the /opt/ActiveTcl-8.5/lib folder.
- Start tclsh8.5 from ActiveTcl. (/opt/ActiveTcl-8.5/bin/tclsh8.5)
- Type inside tclsh:
package require egGRIDS
and press return. You should see 1.0 printed. Type exit to quit tclsh.
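The steps above can be sketched as a small shell script. This is only a sketch under assumptions: it assumes ActiveTcl is already installed in /opt/ActiveTcl-8.5, that the archive sits in the current directory, and that you have write permission to the lib folder (you may need root for /opt). The names TCL_HOME, ARCHIVE and install_eggrids are made up for this example.

```shell
#!/bin/sh
# Sketch of the install steps; every path here is an assumption, adjust as needed.
TCL_HOME="${TCL_HOME:-/opt/ActiveTcl-8.5}"        # where ActiveTcl was installed
ARCHIVE="${ARCHIVE:-egGRIDS1.0-linux64.tar.gz}"   # the downloaded eg-GRIDS+ archive

install_eggrids() {
    # Unpack the archive so the folder it contains lands in the Tcl lib folder.
    tar -xzf "$ARCHIVE" -C "$TCL_HOME/lib" || return 1
    # Ask tclsh to load the package. tclsh reads the command from stdin, and
    # puts is needed because non-interactive tclsh does not echo results.
    echo 'puts [package require egGRIDS]' | "$TCL_HOME/bin/tclsh8.5"
}
```

Calling install_eggrids should print the package version (1.0 for this archive) if the package was found.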
How to use it
Download a Tcl script that can act as a front-end from here:
You can run eg-GRIDS+ with the following commands:
1. /opt/ActiveTcl-8.5/bin/tclsh8.5 eg-GRIDS+.tcl <grammar_file.grm>
2. /opt/ActiveTcl-8.5/bin/tclsh8.5 eg-GRIDS+.tcl <examples_file.txt>
Use (1) if you have your sentences in a CFG format that eg-GRIDS+ can parse, or use (2) if you have your sentences as a series of space-separated tokens, with one sentence per line. In case (2), the input file name must have a .txt extension. eg-GRIDS+ will then convert your training sentences into the grammar format (a .grm file will be generated in the same folder as the .txt file, with the same name but without the .txt extension), and proceed as in case (1).
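To make the input format for case (2) concrete, here is a small sketch that writes a toy examples file and works out the name of the .grm file that should appear next to it. The sentences, the file names and the helper names make_examples and grm_name are invented for illustration; only the "one sentence per line, tokens separated by spaces, .txt extension" rules come from the description above.

```shell
#!/bin/sh
# Write a tiny training file: one sentence per line, tokens separated by spaces.
# The sentences are made up for illustration.
make_examples() {
    printf '%s\n' \
        'the cat sleeps' \
        'the dog sleeps' \
        'the cat sees the dog' > "$1"
}

# Name of the grammar file expected next to the .txt file,
# e.g. data/examples.txt -> data/examples.grm
grm_name() {
    case "$1" in
        *.txt) printf '%s\n' "${1%.txt}.grm" ;;
        *)     echo "input must have a .txt extension: $1" >&2; return 1 ;;
    esac
}
```

After make_examples examples.txt, the file can be passed to eg-GRIDS+.tcl as in command (2), and grm_name examples.txt gives the name of the grammar file to look for afterwards.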
During learning, the algorithm will create a directory named "learned" inside the directory from which you run eg-GRIDS+. All output is written there, including all intermediate grammars and the final one (FINAL.grm). Some .dot files will also be written; you can ignore these.
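If you drive eg-GRIDS+ from a script, a quick check for the final grammar can be handy. The following is a minimal sketch, assuming only the "learned" directory and the FINAL.grm file name described above; the function name final_grammar is hypothetical.

```shell
#!/bin/sh
# Print the path of the final grammar, or fail if it is not there.
# $1 is the directory eg-GRIDS+ was started from (defaults to the current one).
final_grammar() {
    dir="${1:-.}/learned"
    if [ -f "$dir/FINAL.grm" ]; then
        printf '%s\n' "$dir/FINAL.grm"
    else
        echo "FINAL.grm not found in $dir" >&2
        return 1
    fi
}
```

For example, final_grammar . prints ./learned/FINAL.grm once a run started from the current directory has written its final grammar.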
What if there is a problem?
Let's hope you will not need to compile the sources yourself, as the build system is not good :-)
Finally, I am very interested in hearing whether the algorithm has worked for your data or not :-) Right now it runs with a fast heuristic. If it does not work for your data, alternatives should be tried, such as the beam search or the genetic search (described in the papers), which are more time-consuming because they explore a larger search space.