eg-GRIDS+: What is it?

eg-GRIDS+ is an open source grammatical inference algorithm, which is able to learn context-free grammars from positive examples only.

Generalisation is controlled through heuristics (minimum description length).

Download eg-GRIDS+

eg-GRIDS+ sources and binaries can be found here:

http://www.ellogon.org/~petasis/eg-GRIDS+/

You will need Tcl/Tk to run the binaries, as in the latest form I distribute them as tcl packages. You can run eg-GRIDS+ through the eg-GRIDS+.tcl file.

How to install it

The current version of eg-GRIDS+ is distributed as an extension to the Tcl/Tk programming language. (There is a standalone binary that can be build from the sources, but it has less functionality exposed through its command line arguments.) In order to use it (i.e. under Linux), I suggest the following steps:

  1. Download and install ActiveTcl 8.5/8.6 from ActiveState (it is free):
    https://www.activestate.com/activetcl/downloads/
    I usually install it in /opt/ActiveTcl-8.5
  2. Copy the folder contained inside egGRIDS1.0-linux64.tar.gz in the /opt/ActiveTcl-8.5/lib folder.
  3. Start tclsh8.5 from ActiveTcl. (/opt/ActiveTcl-8.5/bin/tclsh8.5)
  4. Type inside tclsh:
    package require egGRIDS
    and press return. You should see 1.0 printed. Type exit to quit tclsh.

How to use it

Download a Tcl script that can act as a front-end from here:

http://www.ellogon.org/~petasis/eg-GRIDS+/eg-GRIDS+.tcl

You can run eg-GRIDS+ with the following commands:

  1. /opt/ActiveTcl-8.5/bin/tclsh8.5 eg-GRIDS+.tcl <grammar_file.grm>
  2. /opt/ActiveTcl-8.5/bin/tclsh8.5 eg-GRIDS+.tcl <examples_file.txt>

Use (1) if you have your sentences in a CFG format eg-GRIDS+ can parse, or use (2) if you have your sentences
as a series of (space separated) tokens, with each sentence in a single line. In case of (2), the input file name must has a .txt extension. Using (2), eg-GRIDS+ will convert your training sentences into a grammar format (a .grm file will be generated, in the same folder as the txt file, using the same name - without the .txt extension), and then it will proceed as in (1) case.

During learning, the algorithm will produce a directory named "learned" in the same directory as the directory you run eg-GRIDS+. All output will be written in this "learned" directory. All intermediate grammars and the final one (FINAL.grm) will be saved in this directory. Also, some .dot files will be written, you can ignore these...

What if there is a problem?

If you have any problem, just send me an e-mail :-) [This email address is being protected from spambots. You need JavaScript enabled to view it.]. The sources can be found here:

http://www.ellogon.org/~petasis/eg-GRIDS+/eg-GRIDS+.zip

But lets hope you will not need to compile them, as the build system is not good :-)

Finally, I am very interested in hearing if the algorithm has worked for your data or not :-) Right now it is running with a fast heuristic. If does not work for your data alternatives should be tried, like the beam search or the genetic search (described in the papers) which are more time-consuming as they explore a larger search space.

Relevant publications

Year: 2011

  1. Georgios Petasis.
    Machine Learning in Natural Language Processing.
    Ph.D. Thesis, Department of Informatics and Telecommunications, University of Athens, 2011.
    URL BibTeX

    @phdthesis{PhD-2011-Petasis,
    	author = "Petasis, Georgios",
    	abstract = "This thesis examines the use of machine learning techniques in various tasks of natural language processing, mainly for the task of information extraction from texts. The objectives are the improvement of adaptability of information extraction systems to new thematic domains (or even languages), and the improvement of their performance using as fewer resources (either linguistic or human) as possible. This thesis has examined two main axes: a) the research and assessment of existing algorithms of machine learning mainly in the stages of linguistic pre-processing (such as part of speech tagging) and named-entity recognition, and b) the creation of a new machine learning algorithm and its assessment on synthetic data, as well as in real world data from the task of relation extraction between named entities. This new algorithm belongs to the category of inductive grammar learning, and can infer context free grammars from positive examples only.",
    	keywords = "information extraction, machine learning, grammatical inference",
    	month = "July 1",
    	school = "Department of Informatics and Telecommunications, University of Athens",
    	title = "{M}achine {L}earning in {N}atural {L}anguage {P}rocessing",
    	type = "Ph.D. Thesis",
    	url = "http://www.ellogon.org/petasis/bibliography/Petasis/Ph.D.Thesis-GeorgiosPetasis.pdf",
    	year = 2011
    }
    

Year: 2008

  1. Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras and Constantine D Spyropoulos.
    Learning context-free grammars to extract relations from text.
    In Malik Ghallab, Constantine D Spyropoulos, Nikos Fakotakis and Nikolaos M Avouris (eds.). Proceeding of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence 178. 2008, 303–307.
    URL BibTeX

    @inproceedings{Petasis:2008:LCG:1567281.1567350,
    	author = "Petasis, Georgios and Karkaletsis, Vangelis and Georgios Paliouras and Constantine D. Spyropoulos",
    	abstract = "In this paper we propose a novel relation extraction method, based on grammatical inference. Following a semi-supervised learning approach, the text that connects named entities in an annotated corpus is used to infer a context free grammar. The grammar learning algorithm is able to infer grammars from positive examples only, controlling overgeneralisation through minimum description length. Evaluation results show that the proposed approach performs comparable to the state of the art, while exhibiting a bias towards precision, which is a sign of conservative generalisation.",
    	address = "Amsterdam, The Netherlands, The Netherlands",
    	booktitle = "Proceeding of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence",
    	editor = "Malik Ghallab and Constantine D. Spyropoulos and Nikos Fakotakis and Nikolaos M. Avouris",
    	isbn = "978-1-58603-891-5",
    	pages = "303--307",
    	publisher = "IOS Press",
    	series = "Frontiers in Artificial Intelligence and Applications",
    	title = "{L}earning context-free grammars to extract relations from text",
    	url = "http://www.ellogon.org/petasis/bibliography/ECAI2008/ECAI2008_0371.pdf",
    	volume = 178,
    	year = 2008
    }
    

Year: 2004

  1. Georgios Petasis, Georgios Paliouras, Constantine D Spyropoulos and Constantine Halatsis.
    Eg-GRIDS: Context-Free Grammatical Inference from Positive Examples Using Genetic Search.
    In Georgios Paliouras and Yasubumi Sakakibara (eds.). Grammatical Inference: Algorithms and Applications, Proceedings of the 7th International Colloquium on Grammatical Inference (ICGI 2004) 3264. 2004, 223–234.
    URL BibTeX

    @inproceedings{DBLP:conf/icgi/PetasisPSH04,
    	author = "Petasis, Georgios and Georgios Paliouras and Constantine D. Spyropoulos and Constantine Halatsis",
    	abstract = "In this paper we present eg-GRIDS, an algorithm for inducing context-free grammars that is able to learn from positive sample sentences. The presented algorithm, similar to its GRIDS predecessors, uses simplicity as a criterion for directing inference, and a set of operators for exploring the search space. In addition to the basic beam search strategy of GRIDS, eg-GRIDS incorporates an evolutionary grammar selection process, aiming to explore a larger part of the search space. Evaluation results are presented on artificially generated data, comparing the performance of beam search and genetic search. These results show that genetic search performs better than beam search while being significantly more efficient computationally.",
    	address = "Athens, Greece",
    	booktitle = "Grammatical Inference: Algorithms and Applications, Proceedings of the 7th International Colloquium on Grammatical Inference (ICGI 2004)",
    	editor = "Georgios Paliouras and Yasubumi Sakakibara",
    	isbn = "3-540-23410-1",
    	month = "October 11--13",
    	pages = "223--234",
    	publisher = "Springer Berlin / Heidelberg",
    	series = "Lecture Notes in Computer Science",
    	title = "{E}g-{GRIDS}: {C}ontext-{F}ree {G}rammatical {I}nference from {P}ositive {E}xamples {U}sing {G}enetic {S}earch",
    	url = "http://www.ellogon.org/petasis/bibliography/ICGI2004/e-GRIDS-ICGI-2004-Submission.pdf",
    	volume = 3264,
    	year = 2004
    }
    
  2. Georgios Petasis, Georgios Paliouras, Vangelis Karkaletsis, Constantine Halatsis and Constantine D Spyropoulos.
    E-GRIDS: Computationally Efficient Grammatical Inference from Positive Examples.
    GRAMMARS 7:69–110, 2004.
    Technical Report referenced in the paper: http://www.ellogon.org/petasis/bibliography/GRAMMARS/GRAMMARS2004-SpecialIssue-Petasis-TechnicalReport.pdf.
    URL BibTeX

    @article{GRAMMARS-vol.7-Petasis,
    	author = "Petasis, Georgios and Georgios Paliouras and Vangelis Karkaletsis and Constantine Halatsis and Constantine D. Spyropoulos",
    	abstract = "In this paper we present a new computationally efficient algorithm for inducing context-free grammars that is able to learn from positive sample sentences. This new algorithm uses simplicity as a criterion for directing inference, and the search process of the new algorithm has been optimised by utilising the results of a theoretical analysis regarding the behaviour and complexity of the search operators. Evaluation results are presented on artificially generated data, while the scalability of the algorithm is tested on a large textual corpus. These results show that the new algorithm performs well and can infer grammars from large data sets in a reasonable amount of time.",
    	journal = "GRAMMARS",
    	keywords = "grammatical inference, context-free grammars, minimum description length, positive examples",
    	note = "Technical Report referenced in the paper: http://www.ellogon.org/petasis/bibliography/GRAMMARS/GRAMMARS2004-SpecialIssue-Petasis-TechnicalReport.pdf",
    	pages = "69--110",
    	title = "{E}-{GRIDS}: {C}omputationally {E}fficient {G}rammatical {I}nference from {P}ositive {E}xamples",
    	url = "http://www.ellogon.org/petasis/bibliography/GRAMMARS/GRAMMARS2004.pdf",
    	volume = 7,
    	year = 2004
    }