Function tip_ExtractTextBasedOnAnnotations


Prototype: tip_ExtractTextBasedOnAnnotations(Utilities);
Arguments:

Text: A text object. (CDM_ByteSequence or CDM_RawData)
Set: An Annotation set. (CDM_AnnotationSet)
mode: One of the values CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT or CDM_EXTRACT_TEXT_REMOVE_UNSPANNED_TEXT (int)


Description:

This function accepts a text object (of type CDM_ByteSequence or CDM_RawData), a set of Annotations (of type CDM_AnnotationSet) and an optional "mode" parameter. It returns a new text object (of type CDM_ByteSequence or CDM_RawData) that contains the portions of the provided text that are spanned by Annotations in the specified Annotation set. The portions of the original text that are not spanned by Annotations are either replaced with white space characters or eliminated from the resulting text, depending on the value of the "mode" parameter.

Valid values for the "mode" parameter are "CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT" and "CDM_EXTRACT_TEXT_REMOVE_UNSPANNED_TEXT". The first value converts the unspanned text portions into white space characters (either spaces (' ') or new-line characters ('\n')), while the second value eliminates the unspanned text portions from the resulting text object.
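The difference between the two modes can be illustrated with a small model written in plain Tcl. This is only a sketch and not the Ellogon implementation: the procedure name model_extract, the mode labels "leave" and "remove", and the representation of Annotations as {start end} character offsets (end exclusive) are all assumptions made for illustration. The sketch also assumes that unspanned new-line characters stay new-lines while every other unspanned character becomes a single space.

```tcl
# Hypothetical model of the two modes; NOT the Ellogon implementation.
# spans: list of {start end} character offsets, end exclusive (assumed).
# mode:  "leave"  models CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT,
#        "remove" models CDM_EXTRACT_TEXT_REMOVE_UNSPANNED_TEXT.
proc model_extract {text spans {mode leave}} {
    set result ""
    set len [string length $text]
    for {set i 0} {$i < $len} {incr i} {
        set ch [string index $text $i]
        # Check whether any span covers this character position.
        set spanned 0
        foreach span $spans {
            lassign $span start end
            if {$i >= $start && $i < $end} {
                set spanned 1
                break
            }
        }
        if {$spanned} {
            # Spanned text is copied unchanged in both modes.
            append result $ch
        } elseif {$mode eq "leave"} {
            # Unspanned text is blanked: new-lines are kept, the rest
            # becomes spaces (assumed rule).
            if {$ch eq "\n"} {
                append result "\n"
            } else {
                append result " "
            }
        }
        # In "remove" mode unspanned characters are simply dropped.
    }
    return $result
}

set text  "<b>Hi</b> all"
set spans {{3 5} {9 13}}   ;# cover "Hi" and " all"

# "leave" keeps the length of the text, "remove" shortens it.
puts [string length [model_extract $text $spans leave]]
puts [model_extract $text $spans remove]
```

In this model the "leave" result has the same length as the input, so character offsets are unchanged, while the "remove" result contains only the spanned characters.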

A common use of this function is to perform a linguistic task on only part of the text contained in a Document, while ensuring that the resulting Annotations span the proper text portions when they are added to the original Document. For example, suppose we want to apply a sentence splitter to an HTML document. As the sentence splitter has no knowledge of HTML tags, we want to create a version of the Document text in which the HTML tags have been replaced with spaces. Because the function keeps the spanned text and blanks out the rest, the Annotation set (stored here in the variable HTML_Annotations) must span the text outside the HTML tags. Provided that we have such a set, we can create this text by simply executing the following command:

set Text [tip_ExtractTextBasedOnAnnotations $DocText $HTML_Annotations]

The value of the "mode" parameter can be omitted, in which case it defaults to CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT.
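The default mode suits the sentence splitter scenario because, assuming each unspanned character is replaced by exactly one white space character, character offsets in the new text match those in the original. The following plain Tcl sketch shows the idea without calling Ellogon at all: the "blanked" string is written by hand as a stand-in for the function's output, and the sentence splitter is simulated with a simple search, so both are assumptions made for illustration.

```tcl
# Hand-written stand-ins: "blanked" is what the default mode is assumed
# to produce when every tag character is replaced by one space.
set original "<p>One. Two.</p>"
set blanked  "   One. Two.    "

# A hypothetical sentence splitter reports offsets in the blanked text.
set start [string first "Two." $blanked]
set end   [expr {$start + [string length "Two."] - 1}]

# The same offsets select the same sentence in the original text.
puts [string range $original $start $end]
```

With CDM_EXTRACT_TEXT_REMOVE_UNSPANNED_TEXT the text becomes shorter, so offsets computed on the result would no longer line up with the original Document.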

This function always creates and returns a new text object, even when the Annotation set is empty or the text object passed through the "Text" parameter contains no text.

Notes:

This function is equivalent to CDM_ExtractTextBasedOnAnnotations (C++ API).


Generated by: petasis@aias on Wed Aug 16 10:31:45 PM EEST 2006.