utils.text
Functions for simpler string manipulation.
clean_unicode
clean_unicode(s)
Performs some opinionated string normalization.
Originally implemented in https://github.com/acl-org/acl-anthology/blob/master/bin/normalize_anth.py by David Wei Chiang, this is intended to standardize how we represent certain Unicode characters in our data, e.g. by decomposing ligatures, removing "invisible" soft hyphens, etc.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `s` | `str` | Any text string. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The cleaned up string. |
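
The kind of normalization described above can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch, not the actual implementation of `clean_unicode`: the function name `clean_unicode_sketch` is hypothetical, and the use of NFKC normalization is an assumption that may change more characters than the real function does.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # invisible in most renderers


def clean_unicode_sketch(s: str) -> str:
    """Illustrative only; not the library's clean_unicode."""
    # Normalization leaves soft hyphens in place, so drop them explicitly.
    s = s.replace(SOFT_HYPHEN, "")
    # NFKC (an assumption here) replaces compatibility characters such as
    # the "fi" ligature (U+FB01) with their plain-letter equivalents.
    return unicodedata.normalize("NFKC", s)


print(clean_unicode_sketch("\ufb01le"))          # file
print(clean_unicode_sketch("co\u00adoperate"))   # cooperate
```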
interpret_pages
interpret_pages(text)
Splits up a 'pages' field into first and last page.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | A text string representing a page range. | required |
Returns:
| Type | Description |
|---|---|
| `tuple[str, str]` | A tuple of the first and last page. |
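
To make the idea of splitting a page range concrete, here is a minimal sketch. It is not the library's `interpret_pages`: the name `split_pages_sketch` is hypothetical, and both the accepted separators (hyphens, en dashes, em dashes) and the handling of a single page (returned as both first and last page) are assumptions made for illustration.

```python
import re


def split_pages_sketch(text: str) -> tuple[str, str]:
    """Illustrative only; the real function may treat edge cases differently."""
    # Split once on a run of hyphens or dashes, ignoring surrounding spaces.
    parts = re.split(r"\s*[-\u2013\u2014]+\s*", text.strip(), maxsplit=1)
    first = parts[0]
    # Assumption: a single page counts as both the first and the last page.
    last = parts[1] if len(parts) > 1 else first
    return first, last


print(split_pages_sketch("123--130"))  # ('123', '130')
print(split_pages_sketch("45"))        # ('45', '45')
```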
month_str2num
month_str2num(text)
Converts a month string to a number, e.g. February -> 2.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | A text string representing a month value. | required |
Returns:
| Type | Description |
|---|---|
| `Optional[int]` | None if the string doesn't correspond to a month; the numeric month value otherwise. |
Note
We're not using Python's datetime here since its behaviour depends on the system locale.
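
One locale-independent way to do this kind of lookup is a fixed list of English month names. The sketch below is an illustration under stated assumptions, not the library's `month_str2num`: the name `month_to_number_sketch` is hypothetical, and accepting three-letter abbreviations and ignoring case are assumptions that the real function may not share.

```python
from typing import Optional

# Fixed English names, so the result never depends on the system locale.
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def month_to_number_sketch(text: str) -> Optional[int]:
    """Illustrative only; not the library's month_str2num."""
    key = text.strip().lower()
    for number, name in enumerate(MONTHS, start=1):
        # Assumption: accept the full name or its three-letter abbreviation.
        if key == name or key == name[:3]:
            return number
    return None


print(month_to_number_sketch("February"))  # 2
print(month_to_number_sketch("Smarch"))    # None
```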