utils.text

Functions for simpler string manipulation.

clean_unicode

clean_unicode(s)

Performs some opinionated string normalization.

Originally implemented in https://github.com/acl-org/acl-anthology/blob/master/bin/normalize_anth.py by David Wei Chiang, this function standardizes how certain Unicode characters are represented in our data, e.g. by decomposing ligatures and removing "invisible" soft hyphens.

Parameters:

Name Type Description Default
s str Any text string. required

Returns:

Type Description
str The cleaned up string.
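
As a rough illustration of the kind of normalization described above, here is a minimal sketch. It is not the library's implementation: the name clean_unicode_sketch is hypothetical, and the choice to handle only soft hyphens, the Latin ligature block U+FB00 to U+FB06, and NFC composition is an assumption made for this example.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"


def clean_unicode_sketch(s: str) -> str:
    """Illustrative only; the real clean_unicode may apply other rules."""
    # Drop "invisible" soft hyphens.
    s = s.replace(SOFT_HYPHEN, "")
    # Decompose Latin ligatures (assumed range U+FB00..U+FB06) into plain letters.
    s = "".join(
        unicodedata.normalize("NFKC", c) if "\ufb00" <= c <= "\ufb06" else c
        for c in s
    )
    # Compose base characters with combining marks into a canonical form.
    return unicodedata.normalize("NFC", s)
```

Applying NFKC only to the ligature range, rather than to the whole string, avoids rewriting other compatibility characters (such as superscripts) that the data may want to keep.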

interpret_pages

interpret_pages(text)

Splits up a 'pages' field into first and last page.

Parameters:

Name Type Description Default
text str A text string representing a page range. required

Returns:

Type Description
tuple[str, str] A tuple (first_page, last_page). If a known separator is found, this is the result of splitting the input on that separator; otherwise, the field is assumed to contain a single page.
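
The splitting behaviour can be sketched as follows. This is a hedged example, not the library's code: the name interpret_pages_sketch is hypothetical, the separator list ("--", "-", and an en dash) is an assumption, and so is returning the same value twice for a single page.

```python
def interpret_pages_sketch(text: str) -> tuple[str, str]:
    """Illustrative only; the separators below are assumed, not documented."""
    # Try the longest separator first so "12--34" is not split on a single "-".
    for separator in ("--", "-", "\u2013"):
        if text.count(separator) == 1:
            first_page, last_page = text.split(separator)
            return (first_page, last_page)
    # No known separator: treat the field as a single page (assumed behaviour).
    return (text, text)
```

Requiring exactly one occurrence of the separator keeps values such as "A-1-A-5" from being split into more than two parts.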

month_str2num

month_str2num(text)

Converts a month string to a number, e.g. February -> 2.

Parameters:

Name Type Description Default
text str A text string representing a month value. required

Returns:

Type Description
Optional[int] None if the string doesn't correspond to a month; the numeric month value otherwise.

Note

We're not using Python's datetime here since its behaviour depends on the system locale.
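
A locale-independent lookup of this kind can be sketched with a fixed table of English month names. This is an illustration under assumptions, not the library's code: the name month_str2num_sketch is hypothetical, and accepting three-letter abbreviations, ignoring case, and stripping surrounding whitespace are choices made for this example.

```python
from typing import Optional

_MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

# Map full names and (assumed) three-letter abbreviations to 1..12.
_MONTH_LOOKUP: dict[str, int] = {}
for _number, _name in enumerate(_MONTH_NAMES, start=1):
    _MONTH_LOOKUP[_name] = _number
    _MONTH_LOOKUP[_name[:3]] = _number


def month_str2num_sketch(text: str) -> Optional[int]:
    """Illustrative only; returns None for anything not in the fixed table."""
    return _MONTH_LOOKUP.get(text.strip().lower())
```

Because the table is hard-coded, the result is the same on every machine, which is the property the note above is concerned with.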

remove_extra_whitespace

remove_extra_whitespace(text)

Parameters:

Name Type Description Default
text str An arbitrary string. required

Returns:

Type Description
str The input string with newlines removed and consecutive whitespace replaced by a single whitespace character.
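
One possible reading of this return value is sketched below. It is an assumption-laden example, not the library's code: the name remove_extra_whitespace_sketch is hypothetical, newlines are assumed to be deleted outright rather than turned into spaces, and every remaining run of whitespace (including tabs) is assumed to collapse to a single space.

```python
import re


def remove_extra_whitespace_sketch(text: str) -> str:
    """Illustrative only; the real function may treat tabs or edges differently."""
    # Delete newlines outright (assumed reading of "without newlines").
    without_newlines = text.replace("\n", "")
    # Collapse each remaining run of whitespace into a single space.
    return re.sub(r"\s+", " ", without_newlines)
```

Note that under this reading, text broken across lines without a trailing space is joined directly, so "foo\nbar" becomes "foobar".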