Function tip_ExtractTextBasedOnAnnotations

Prototype: tip_ExtractTextBasedOnAnnotations Text Set ?mode?

Arguments:
Text: A text object. (CDM_ByteSequence or CDM_RawData)
Set: An Annotation set. (CDM_AnnotationSet)
mode: One of the values CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT or CDM_EXTRACT_TEXT_REMOVE_UNSPANNED_TEXT (int)

Description:
This function accepts a text object (of type CDM_ByteSequence or
CDM_RawData), a set of Annotations (of type CDM_AnnotationSet) and an
optional "mode" parameter. It returns a new text object (of type
CDM_ByteSequence or CDM_RawData) that contains the text portions of the
provided text object that are spanned by Annotations in the specified
Annotation set. The portions of the original text that are not spanned
by Annotations are either replaced with white space characters or
eliminated from the resulting text, depending on the value of the "mode"
parameter. Valid values for the "mode" parameter are
"CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT" and
"CDM_EXTRACT_TEXT_REMOVE_UNSPANNED_TEXT". The first value converts text
portions not spanned into white space characters (either spaces (' ') or
new-line characters ('\n')), while the second value eliminates unspanned
text portions from the resulting text object.

A common use of this function is when we want to perform a linguistic
task only on part of the text contained in the Document, but we want the
resulting Annotations to span the proper text portions when added to
the original Document. For example, we want to apply a sentence splitter
on an HTML document. As the sentence splitter has no knowledge of HTML
tags, we want to create a version of the Document text where the HTML
tags have been replaced with spaces. Because the function keeps spanned
text and blanks out the rest, we need Annotations that span all the text
outside the HTML tags. Given such a set, we can create this text by
simply executing the following command:
set Text [tip_ExtractTextBasedOnAnnotations $DocText $NonTag_Annotations]
The value of the "mode" parameter can be omitted. In this case, it
defaults to CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT.
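To illustrate the difference between the two modes, the following
pure-Tcl sketch emulates the documented behaviour on a plain string. It
is not the Ellogon implementation: the helper "extract_text_sketch", the
mode names "leave" and "remove", and the representation of Annotations
as {start end} character offset pairs (with "end" exclusive) are all
assumptions made only for this example. Which unspanned characters
become spaces and which become new-lines is also an assumption; here
new-lines are preserved and everything else becomes a space.

```tcl
# Emulation of the documented behaviour on a plain Tcl string.
# Hypothetical helper, not part of Ellogon. "spans" is a list of
# {start end} character offsets, "end" exclusive (assumption).
proc extract_text_sketch {text spans {mode leave}} {
    set len [string length $text]
    # Record every character position covered by at least one span.
    foreach span $spans {
        lassign $span start end
        for {set i [expr {max($start, 0)}]} {$i < $end && $i < $len} {incr i} {
            set covered($i) 1
        }
    }
    set result ""
    for {set i 0} {$i < $len} {incr i} {
        set ch [string index $text $i]
        if {[info exists covered($i)]} {
            # Spanned text is copied unchanged.
            append result $ch
        } elseif {$mode eq "leave"} {
            # Unspanned text becomes white space; new-lines are kept
            # as new-lines so the line structure survives (assumption).
            if {$ch eq "\n"} {
                append result "\n"
            } else {
                append result " "
            }
        }
        # In "remove" mode, unspanned characters are simply dropped.
    }
    return $result
}

set doc "<p>Hi.</p>\n<p>Bye.</p>"
set spans {{3 6} {14 18}}   ;# the text outside the tags: "Hi." and "Bye."

set blanked [extract_text_sketch $doc $spans]
puts [expr {[string length $blanked] == [string length $doc]}]   ;# 1
puts [extract_text_sketch $doc $spans remove]                    ;# Hi.Bye.
```

In this sketch the "leave" mode preserves the length of the text, so
offsets computed on the result can be applied directly to the original
string, whereas the "remove" mode shifts every offset that follows a
dropped portion.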
This function always creates and returns a new text object, even when
the Annotation set is empty or the text object specified through the
"Text" parameter has no text.
Notes:
This function is equivalent to CDM_ExtractTextBasedOnAnnotations
(C++ API).
Generated by: petasis@aias on Wed Aug 16 10:31:45 PM EEST 2006.