Description of Ellogon

Ellogon is a multi-lingual, cross-platform, general-purpose language engineering environment, developed in order to aid both researchers who are doing research in computational linguistics, as well as companies who produce and deliver language engineering systems. Ellogon as a language engineering platform offers an extensive set of facilities, including tools for processing and visualising textual/HTML/XML data and associated linguistic information, support for lexical resources (like creating and embedding lexicons), tools for creating annotated corpora, accessing databases, comparing annotated data, or transforming linguistic information into vectors for use with various machine learning algorithms.

During the last decade, a large number of software infrastructures aiming at facilitating R&D in the field of natural language processing have been presented. Some of these infrastructures, such as LT-NSL/LT-XML tools or GATE, have become extremely popular as they have been applied to a wide range of tasks by many institutions around the world.

Ellogon belongs to the category of referential or annotation based platforms, where the linguistic information is stored separately from the textual data, having references back to the original text. Based on the TIPSTER data model, Ellogon provides infrastructure for:

Ellogon Architecture

Ellogon can be used either as an NLP integrated development environment (IDE) or as a library that can be embedded to foreign applications. To achieve this, Ellogon proposes and implements a modular architecture with four independent subsystems:

[Ellogon Architecture Image]

Ellogon Data Model

Ellogon shares the same data model as the TIPSTER architecture. Due to this, it shares some basic features with other TIPSTER-based infrastructures, such as GATE. However, it also offers a large number of features that differentiate it from such infrastructures.

The central element for storing data in Ellogon is the Collection. A collection is a finite set of Documents. An Ellogon document consists of textual data as well as linguistic information about the textual data. This linguistic information is stored in the form of attributes and annotations.

An attribute associates a specific type of information with a typed value. An annotation associates arbitrary information (in the form of attributes) with portions of textual data. Each such portion, named span, consists of two character offsets denoting the start and the end characters of the portion, as measured from the first character of some textual data. Annotations typically consist of four elements:

[Ellogon Data Model Image]

Why to use Ellogon?

The motivation behind the development of Ellogon (which started in 1998) was the inadequacy of existing platforms to support, at that time, some essential properties, such as the ability to

Ellogon in its present form satisfies all of these requirements. As Ellogon is based on the TIPSTER architecture, it shares many basic properties with other TIPSTER-based infrastructures like GATE. However, Ellogon offers several important features that differentiate it from similar infrastructures:

Working with Ellogon

The users of Ellogon can be roughly classified in three major categories: computational linguists, language engineers and end users.

Ellogon Components

For most users of Ellogon, the central point of interest is the linguistic processing that can be carried out within it. Ellogon provides a generic framework where external components can be easily embedded. As Ellogon follows a modular paradigm, it utilises components of various types, with each type specialising in a specific processing task. A taxonomy of the currently defined component types are shown in the following figure:

[Ellogon Component Hierarchy Image]

The most important component type from the user’s point of view is of course the linguistic processing component, as natural language processors usually belong to this component type. These components (along with components of the machine-learning processing type) can be organised into Systems for performing some specific task. The tasks can range from basic linguistic tasks, such as part-of-speech tagging or parsing, to application level tasks, such as information extraction or machine translation.

A linguistic processing component consists mainly of two parts. The first part is responsible for performing the desired linguistic processing while the main responsibility of the second component part is to interface the linguistic processing sub-component with Ellogon, through the provided API. Components can appear either as wrappers or as native components. Wrappers usually provide the needed code in order to interface an existing independent implementation of a linguistic processing tool to the Ellogon platform. Native components on the other hand are processing tools specifically designed for use within the Ellogon platform. Usually, in such components the two component parts cannot be easily identified or separated.

Each component is associated with metadata, which include a set of pre-conditions and a set of post-conditions among other information. Pre-conditions declare the linguistic information that must be present in a document before this specific component can be applied to it. Post-conditions describe the linguistic information that will be added in the document as a side effect of processing the document with this specific component. Ellogon uses these two sets in order to establish relations among the various components or to "undo" the results of a component application on a corpus.

Each component can also specify a set of parameters, as well as a set of viewers (components of type "visualisation component"). Parameters represent various runtime dependent options (such as the location of a file containing the grammar of a syntactic parser). They can be edited by the end user through the graphical interface and are given to the component every time it is executed. A component can also specify a set of predefined viewers, in order to present in a graphical way the linguistic information produced during the component execution. Examples of available viewers can been seen here and here.

Creating components can be easily done through the Ellogon GUI. Currently, Ellogon components can be developed in five languages, C/C++, Tcl, Java, Perl and Python. The Ellogon GUI offers a specialised dialog where the user can specify various parameters of the component he/she intends to create, including its pre/post-conditions. Then Ellogon creates the skeleton of the new component that will handle all the interaction with the Ellogon platform. If the language of the component is C/C++ or Java, proper Makefiles for compiling the component under Unix and Windows will also be created. Besides creating a skeleton, Ellogon tries to facilitate the development of the component by allowing the developer to edit the source code and reload the specific component into Ellogon from its GUI.