Function CDM_ExtractTextBasedOnAnnotations


Definition: #include <CDM.h>
Prototype: CDM_ByteSequence CDM_ExtractTextBasedOnAnnotations(CDM_ByteSequence Text, CDM_AnnotationSet Set, int mode);
Arguments:

Text: A text object. (CDM_ByteSequence or CDM_RawData)
Set: An Annotation set. (CDM_AnnotationSet)
mode: One of the values CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT or CDM_EXTRACT_TEXT_REMOVE_UNSPANNED_TEXT (int)

Description:

This function accepts three parameters: a text object (of type CDM_ByteSequence or CDM_RawData), a set of Annotations (of type CDM_AnnotationSet), and an integer "mode". It returns a new text object (of type CDM_ByteSequence or CDM_RawData) containing the portions of the provided text that are spanned by Annotations in the specified Annotation set. The portions of the original text that are not spanned by any Annotation are either replaced with white space characters or eliminated from the result, depending on the value of the "mode" parameter. Valid values for "mode" are CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT and CDM_EXTRACT_TEXT_REMOVE_UNSPANNED_TEXT. The first converts unspanned text portions into white space characters (either spaces (' ') or new-line characters ('\n')); the second removes unspanned text portions from the resulting text object.
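
The behaviour described above can be modelled without the CDM library. The following is only a minimal sketch, not the CDM implementation: the Span type, the LEAVE_UNSPANNED and REMOVE_UNSPANNED constants and the extract_spanned function are hypothetical stand-ins for CDM_AnnotationSet, the two "mode" values and CDM_ExtractTextBasedOnAnnotations. It further assumes that spans are half-open byte offsets and that, in the "leave" mode, an unspanned new-line stays a new-line while every other unspanned byte becomes a space.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one Annotation span: byte offsets [start, end). */
typedef struct { size_t start, end; } Span;

/* Hypothetical stand-ins for the two "mode" values. */
enum { LEAVE_UNSPANNED = 0, REMOVE_UNSPANNED = 1 };

/* Returns a newly allocated string that the caller must free(),
 * or NULL if memory cannot be allocated. */
static char *extract_spanned(const char *text, const Span *spans,
                             size_t nspans, int mode)
{
    assert(text != NULL);
    size_t len = strlen(text), out = 0;
    char *keep = calloc(len + 1, 1);      /* 1 = byte is covered by a span */
    char *result = malloc(len + 1);
    if (keep == NULL || result == NULL) {
        free(keep);
        free(result);
        return NULL;
    }

    /* Mark every byte covered by at least one span (clipped to the text). */
    for (size_t i = 0; i < nspans; i++)
        for (size_t j = spans[i].start; j < spans[i].end && j < len; j++)
            keep[j] = 1;

    for (size_t j = 0; j < len; j++) {
        if (keep[j])
            result[out++] = text[j];                     /* spanned: copy */
        else if (mode == LEAVE_UNSPANNED)                /* assumed mapping */
            result[out++] = (text[j] == '\n') ? '\n' : ' ';
        /* REMOVE_UNSPANNED: the unspanned byte is dropped */
    }
    result[out] = '\0';
    free(keep);
    return result;
}
```

With the text "<b>Hi</b> you" and the spans {3, 5} and {9, 13}, the "leave" mode of this sketch keeps the result at the original length of 13 bytes, so "Hi" and "you" stay at their original offsets, while the "remove" mode yields "Hi you".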

A common use of this function is to perform a linguistic task on only part of the text contained in a Document, while still having the resulting Annotations span the proper text portions when they are added to the original Document. For example, suppose we want to apply a sentence splitter to an HTML document. As the sentence splitter has no knowledge of HTML tags, we want to create a version of the Document text in which the HTML tags have been replaced with spaces. Provided that we have Annotations that span all the HTML tags, we can create this text by simply executing the following command:

Text = CDM_ExtractTextBasedOnAnnotations(DocText, HTML_Annotations, CDM_EXTRACT_TEXT_LEAVE_UNSPANNED_TEXT);

Return Value:

This function always creates and returns a new text object, even when the Annotation set is empty or the text object passed through the "Text" parameter contains no text. CDM does not own the returned object: the caller is responsible for freeing it, and the memory associated with it, with Tcl_DecrRefCount or CDM_Free. In case of an error, NULL is returned and an error message is placed in the currently active Tcl interpreter (CDM_Interp).

Notes:

This function is equivalent to tip_ExtractTextBasedOnAnnotations (Tcl API).

See Also:

CDM_Free, Tcl_DecrRefCount, tip_ExtractTextBasedOnAnnotations


Generated by: petasis@aias on Wed Aug 16 10:31:57 PM EEST 2006.