text

Classes and functions for text markup manipulation.

MarkupText

Text with optional markup.

Warning

This class should not be instantiated directly. Use its class method constructors instead.

Example
title = MarkupText.from_("A Structured Review of the Validity of BLEU")
title = MarkupText.from_(r"TTCS$^{\mathcal{E}}$: a Vectorial Resource for Computing Conceptual Similarity")

Note

This class implements a limited subset of string methods to make common operations more convenient, for example:

title = MarkupText.from_string("A Structured Review of the Validity of BLEU")
title == "A Structured Review of the Validity of BLEU"  # True
"BLEU" in title  # True
title.startswith("A ")  # True

These operate on the stringified XML representation of the object (see as_xml()), so any markup tags are part of the string being compared.
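
To illustrate what "operating on the stringified XML" implies, here is a minimal sketch using a hypothetical stand-in class, ToyMarkup. It is not the library's implementation; it only assumes that the string-like helpers delegate to the XML string, as described above.

```python
import xml.etree.ElementTree as ET


class ToyMarkup:
    """Hypothetical stand-in for MarkupText; not the library's code."""

    def __init__(self, xml_fragment: str):
        # Inner XML of the text, e.g. "A <b>bold</b> title".
        self._xml = xml_fragment

    def as_xml(self) -> str:
        return self._xml

    def as_text(self) -> str:
        # Strip tags by parsing the fragment inside a throwaway wrapper.
        root = ET.fromstring(f"<span>{self._xml}</span>")
        return "".join(root.itertext())

    # The string-like helpers below all delegate to the XML string.
    def __eq__(self, other):
        if isinstance(other, str):
            return self.as_xml() == other
        if isinstance(other, ToyMarkup):
            return self.as_xml() == other.as_xml()
        return NotImplemented

    def __contains__(self, item: str) -> bool:
        return item in self.as_xml()

    def startswith(self, prefix, start=None, end=None) -> bool:
        return self.as_xml().startswith(prefix, start, end)


plain = ToyMarkup("A Structured Review of the Validity of BLEU")
marked = ToyMarkup("A <b>Structured</b> Review")
```

With this sketch, marked.startswith("A Structured") is False because the XML string begins with "A <b>Structured", while marked.as_text() yields "A Structured Review". When comparing against plain strings, converting with as_text() first may therefore be the safer choice for text that contains markup.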

contains_markup property

contains_markup

True if this text contains markup; False if it is a plain string.

as_html

as_html(allow_url=True)

Returns:

Type Description
str

Text with markup transformed into HTML.

Parameters:

Name Type Description Default
allow_url bool

If True (the default), URLs are wrapped in <a href="..."> tags; if False, they are wrapped in plain <span> tags instead.

True
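
The effect of allow_url can be pictured with a small sketch. The helper render_url below is hypothetical and only mirrors the behaviour described for the parameter; the HTML that as_html() actually emits (attributes, classes, escaping) may differ.

```python
from html import escape


def render_url(url: str, allow_url: bool = True) -> str:
    """Hypothetical helper: wrap a URL as a link or as a plain span."""
    safe = escape(url, quote=True)  # escape &, <, >, and quotes
    if allow_url:
        return f'<a href="{safe}">{safe}</a>'
    return f"<span>{safe}</span>"


linked = render_url("https://example.org")
unlinked = render_url("https://example.org", allow_url=False)
```

Passing allow_url=False is useful when the result is placed inside another link, since nested <a> elements are not valid HTML.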

as_latex

as_latex()

Returns:

Type Description
str

Text with markup transformed into LaTeX commands.

as_text

as_text()

Returns:

Type Description
str

The plain text with any markup stripped. The only transformation that will be performed is replacing TeX-math expressions with their corresponding Unicode representation, if possible.
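
As a rough illustration of replacing TeX-math expressions with Unicode "if possible", the sketch below uses a hypothetical three-entry lookup table. The table, the function name math_to_unicode, and the choice to leave unconvertible spans untouched are all assumptions; this page does not specify how as_text() handles those cases.

```python
import re

# Hypothetical, deliberately tiny lookup table (an assumption, not the library's).
TEX_TO_UNICODE = {r"\alpha": "α", r"\beta": "β", r"\times": "×"}

MATH_SPAN = re.compile(r"\$([^$]+)\$")


def math_to_unicode(text: str) -> str:
    """Replace $...$ spans with Unicode when every command in them is known."""

    def convert(match: re.Match) -> str:
        body = match.group(1)
        for command, char in TEX_TO_UNICODE.items():
            body = body.replace(command, char)
        # Assumption: if a backslash command is left over, keep the span as-is.
        return match.group(0) if "\\" in body else body

    return MATH_SPAN.sub(convert, text)


converted = math_to_unicode(r"$\alpha$-divergence")
untouched = math_to_unicode(r"TTCS$^{\mathcal{E}}$")
```

Here the first call converts the math span to "α", while the second leaves the span unchanged because \mathcal has no entry in the toy table.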

as_xml

as_xml()

Returns:

Type Description
str

Text with markup in the original XML format.

endswith

endswith(suffix, start=None, end=None)

Return True if the string ends with the specified suffix, False otherwise.

Equivalent to self.as_xml().endswith(...).

from_ classmethod

from_(content)

Instantiate MarkupText from an XML element or a string, heuristically parsing any supported markup.

  • If called with an XML element, assumes it uses the Anthology's markup format and calls from_xml().
  • If called with a string, assumes the string might contain markup and tries to parse it heuristically. At the moment, only LaTeX markup is supported, so this is identical to calling from_latex_maybe(); this may change if other types of markup are supported in the future.

Note

If you want more fine-grained control over markup detection, call one of the more specific builder functions instead.

Parameters:

Name Type Description Default
content _Element | str

A string potentially containing markup, or an XML element containing valid MarkupText according to the schema.

required

Returns:

Type Description
MarkupText

Instantiated MarkupText object corresponding to the content.
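
The dispatch rule described for from_() can be sketched as follows. The function describe_dispatch is hypothetical and only reports which builder would be chosen; it uses the standard library's ElementTree as a stand-in for the _Element type named in the signature.

```python
import xml.etree.ElementTree as ET


def describe_dispatch(content) -> str:
    """Hypothetical helper: name the builder that from_() is described as using."""
    if isinstance(content, str):
        # Strings may contain markup; currently only LaTeX is considered.
        return "from_latex_maybe"
    if isinstance(content, ET.Element):
        # XML elements are assumed to use the Anthology's markup format.
        return "from_xml"
    raise TypeError(f"unsupported content type: {type(content).__name__}")


for_string = describe_dispatch("A Structured Review of the Validity of BLEU")
for_element = describe_dispatch(ET.fromstring("<title>A <b>bold</b> title</title>"))
```

If the input is known to be plain text or known to be LaTeX, calling from_string() or from_latex() directly avoids the heuristic altogether.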

from_latex classmethod

from_latex(text, clean=True)

Parameters:

Name Type Description Default
text str

A text string potentially containing LaTeX markup.

required
clean bool

If True, applies the Anthology's Unicode normalization.

True

Returns:

Type Description
MarkupText

Instantiated MarkupText object corresponding to the string.

from_latex_maybe classmethod

from_latex_maybe(text, clean=True)

Like from_latex(), but intended for strings that may be either plain text or LaTeX. Prevents percent signs from being interpreted as LaTeX comments, and applies a heuristic to decide whether a tilde is literal or a non-breaking space.

Parameters:

Name Type Description Default
text str

A text string potentially in plain text or LaTeX format.

required
clean bool

If True, applies the Anthology's Unicode normalization.

True

Returns:

Type Description
MarkupText

Instantiated MarkupText object corresponding to the string.
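
One possible way to keep percent signs from being read as LaTeX comments is to escape them before parsing. The sketch below shows that idea with a hypothetical helper, escape_percent; it is an assumption about the general technique, not a description of what from_latex_maybe() does internally. The tilde heuristic is not shown because this page does not specify it.

```python
import re

# Match a percent sign that is not already preceded by a backslash.
UNESCAPED_PERCENT = re.compile(r"(?<!\\)%")


def escape_percent(text: str) -> str:
    """Hypothetical helper: turn bare '%' into '\\%' so LaTeX keeps the text."""
    return UNESCAPED_PERCENT.sub(r"\\%", text)


escaped = escape_percent("Gains of 50% on average")
already_escaped = escape_percent(r"Gains of 50\% on average")
```

This simple lookbehind treats any percent sign after a backslash as escaped, so an input such as an escaped backslash followed by a percent sign would need extra handling.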

from_string classmethod

from_string(text, clean=True)

Parameters:

Name Type Description Default
text str

A simple text string without any markup.

required
clean bool

If True, applies the Anthology's Unicode normalization.

True

Returns:

Type Description
MarkupText

Instantiated MarkupText object corresponding to the string.

from_xml classmethod

from_xml(element)

Parameters:

Name Type Description Default
element _Element

An XML element containing valid MarkupText according to the schema.

required

Returns:

Type Description
MarkupText

Instantiated MarkupText object corresponding to the element.

startswith

startswith(prefix, start=None, end=None)

Return True if the string starts with the specified prefix, False otherwise.

Equivalent to self.as_xml().startswith(...).

to_xml

to_xml(tag='span')

Parameters:

Name Type Description Default
tag str

Name of outer tag in which the text should be wrapped.

'span'

Returns:

Type Description
_Element

A serialization of this MarkupText in Anthology XML format.
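
The role of the tag parameter can be illustrated with a small sketch. The helper wrap_in_tag is hypothetical and uses the standard library's ElementTree rather than the _Element type returned by to_xml(); it only shows how the same inner markup ends up under a different outer element.

```python
import xml.etree.ElementTree as ET


def wrap_in_tag(inner_xml: str, tag: str = "span") -> ET.Element:
    """Hypothetical helper: parse inner markup under an outer element named `tag`."""
    return ET.fromstring(f"<{tag}>{inner_xml}</{tag}>")


as_span = wrap_in_tag("A <b>bold</b> title")
as_title = wrap_in_tag("A <b>bold</b> title", tag="title")
serialized = ET.tostring(as_title, encoding="unicode")
```

Only the outer element name changes between the two calls; the nested markup is identical.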