Function CDM_ExtractTextBasedOnAnnotations
Description:
This function accepts three parameters: a text object (of type
CDM_ByteSequence or CDM_RawData), a set of Annotations (of type
CDM_AnnotationSet) and a "mode" value. It returns a new text object (of
type CDM_ByteSequence or CDM_RawData) that contains the portions of the
provided text that are spanned by Annotations in the specified
Annotation set. The portions of the original text that are not spanned
by Annotations are either replaced with white space characters or
eliminated from the resulting text, depending on the value of the
"mode" parameter. Valid values for the "mode" parameter are
"CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT" and
"CDM_EXTRACT_TEXT_REMOVE_UNSPANNED_TEXT". The first value converts
unspanned text portions into white space characters (either spaces
(' ') or new-line characters ('\n')), while the second value
eliminates unspanned text portions from the resulting text object.

A common use of this function is to perform a linguistic task on only
part of the text contained in the Document, while ensuring that the
resulting Annotations still span the proper text portions when added to
the original Document. For example, suppose we want to apply a sentence
splitter to an HTML document. As the sentence splitter has no knowledge
of HTML tags, we want to create a version of the Document text where
the HTML tags have been replaced with spaces. Provided that we have
Annotations that span all the HTML tags, we can create this text by
simply executing the following command:
Text = CDM_ExtractTextBasedOnAnnotations(DocText, HTML_Annotations, CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT);
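The effect of the two "mode" values can be illustrated with a small,
self-contained C sketch. It is not CDM's implementation and does not
use the CDM API: the Span and ExtractMode types and the extract_text
function are hypothetical stand-ins, with Annotation spans modelled as
half-open byte ranges. It also assumes that, in the "leave" mode,
unspanned new-line characters stay new-lines and every other unspanned
byte becomes a space, which is only one reading of the description
above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only; not part of the CDM API. */
typedef struct { size_t start; size_t end; } Span;  /* half-open [start, end) */
typedef enum { LEAVE_UNSPANNED, REMOVE_UNSPANNED } ExtractMode;

/* Returns a newly allocated string that the caller must free(),
 * or NULL if memory could not be allocated. */
char *extract_text(const char *text, const Span *spans, size_t nspans,
                   ExtractMode mode)
{
    size_t len = strlen(text);
    char *keep = calloc(len + 1, 1);  /* keep[i] != 0 when byte i is spanned */
    char *out = malloc(len + 1);
    size_t i, s, n = 0;

    if (keep == NULL || out == NULL) {
        free(keep);
        free(out);
        return NULL;
    }

    /* Mark every byte covered by at least one span (clamped to the text). */
    for (s = 0; s < nspans; s++) {
        size_t end = spans[s].end < len ? spans[s].end : len;
        for (i = spans[s].start; i < end; i++) {
            keep[i] = 1;
        }
    }

    /* Copy spanned bytes; blank out or drop the rest, depending on mode. */
    for (i = 0; i < len; i++) {
        if (keep[i]) {
            out[n++] = text[i];
        } else if (mode == LEAVE_UNSPANNED) {
            /* Assumption: keep new-lines, turn everything else into ' '. */
            out[n++] = (text[i] == '\n') ? '\n' : ' ';
        }
    }
    out[n] = '\0';
    free(keep);
    return out;
}
```

Under these assumptions, for the text "ab<i>cd</i>\nef" with spans
covering "ab", "cd" and "ef", the "leave" mode yields a string of the
same length in which the unspanned bytes are blanked out, so character
offsets computed on the result remain valid for the original text. The
"remove" mode yields "abcdef", whose offsets no longer match the
original.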
Return Value:
On success, this function always creates and returns a new text object,
even when the Annotation set is empty or the text object passed through
the "Text" parameter contains no text.
CDM does not own the returned object: the caller is responsible for
freeing the object and the memory associated with it, using
Tcl_DecrRefCount or CDM_Free.
In case of an error, NULL will be returned and an error message will be
placed in the currently active Tcl interpreter (CDM_Interp).
Notes:
This function is equivalent to tip_ExtractTextBasedOnAnnotations
(Tcl API).
See Also:
CDM_Free, Tcl_DecrRefCount,
tip_ExtractTextBasedOnAnnotations
Generated by: petasis@aias on Wed Aug 16 10:31:57 PM EEST 2006.