Accessing Authors/Editors¶
People are complicated.[1] Metadata for publications often includes only the "name" of each author, given as a string. But names can be ambiguous (the same name can refer to different people), and conversely, the same person may have published under different names.
Therefore, when it comes to names and personal identities, this library distinguishes between the following three concepts:
- Name objects represent a name. They are essentially strings with a little bit of metadata, but contain no information about the actual identity of a person behind the name.
- NameSpecification objects represent authors/editors as specified on a publication. They are essentially names with optional extra information for disambiguation, such as the person's affiliation or their internal Anthology ID.
- Person objects represent natural persons. They may have one or more names, but will always have one name that we consider to be the "canonical" one.
Tip
It is useful to remember that only a Person can have publications. If you have only a Name or a NameSpecification, you first need to resolve that to a Person before you can look up papers authored/edited by that person.
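The resolve-first workflow from the tip can be sketched roughly as follows. This is a minimal sketch, assuming an `anthology` object that offers `find_people()` and persons that offer `papers()` and an `id`, as described on this page; the helper name `papers_for_name` is hypothetical and not part of the library.

```python
def papers_for_name(anthology, name):
    """Collect papers for every person that `name` may refer to.

    Hypothetical helper for illustration, not a library function.
    A name alone is ambiguous, so it is first resolved to persons.
    """
    results = {}
    for person in anthology.find_people(name):
        # Only a Person carries publications; key the result by person ID.
        results[person.id] = list(person.papers())
    return results
```

Keying the result by person ID keeps the papers of different people apart, even when they share the same name.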
Names¶
A person's name is always split up into first and last name components. While this, of course, doesn't fully reflect the complexities of how names work across different cultures, it is the minimum structure that we assume in order to, e.g., generate accurate bibliographic information.
The following ways to instantiate a Name are equivalent:
from acl_anthology.people import Name
Name("Yang", "Liu")
Name(last="Liu", first="Yang")
If a person only has a single name, the convention is to record this as the
last name. In this case, the first name part must be explicitly given as
None:
Name(None, "Mausam")
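To make the convention concrete without depending on the library, here is a small sketch using a stand-in dataclass. `SketchName` is hypothetical and only mirrors the two fields shown above; it is not the library's Name class. The helper `as_last_first` is likewise an assumption, showing one way a missing first name might be handled when formatting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchName:
    """Stand-in for illustration only; NOT acl_anthology.people.Name."""

    first: Optional[str]
    last: str


def as_last_first(name: SketchName) -> str:
    """Format as '{last}, {first}', or just '{last}' for single names."""
    if name.first is None:
        return name.last
    return f"{name.last}, {name.first}"


# Positional and keyword construction yield equal objects.
assert SketchName("Yang", "Liu") == SketchName(last="Liu", first="Yang")

print(as_last_first(SketchName("Yang", "Liu")))    # Liu, Yang
print(as_last_first(SketchName(None, "Mausam")))   # Mausam
```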
Looking up names¶
To look up names, use
anthology.find_people, which
will return a list of persons that can be referred to by that name:
>>> anthology.find_people(Name("Yang", "Liu"))
[
    Person(id='yang-liu-edinburgh', names=[Name(first='Yang', last='Liu')], item_ids=<set of 15 AnthologyIDTuple objects>, comment='Edinburgh'),
    Person(id='yang-liu-blcu', names=[Name(first='Yang', last='Liu')], item_ids=<set of 1 AnthologyIDTuple objects>, comment='Beijing Language and Culture University'),
    Person(id='yang-liu-hk', names=[Name(first='Yang', last='Liu')], item_ids=<set of 3 AnthologyIDTuple objects>, comment='The Chinese University of Hong Kong (Shenzhen)'),
    ... 12 more ...
]
For convenience, you can also call .find_people() with tuples or strings; the
following are all equivalent:
anthology.find_people("Yang Liu")
anthology.find_people("Liu, Yang")
anthology.find_people(("Yang", "Liu"))
However, supplying a {first} {last} string only works as long as the split is
unambiguous. If either the first or last name contains spaces, you must use the
{last}, {first} format; if either of them contains a comma, you must use a
tuple instead:
anthology.find_people("Daniel A. McFarland") # raises ValueError
anthology.find_people("McFarland, Daniel A.") # works
anthology.find_people("Stabler, Jr., Edward P.") # raises ValueError
anthology.find_people(("Edward P.", "Stabler, Jr.")) # works
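The splitting rules above can be illustrated with a short stand-alone sketch. `split_name_string` is a hypothetical helper written only from the rules stated here; the library's own parsing may differ, and the handling of a single-token string is an assumption.

```python
from typing import Optional, Tuple


def split_name_string(text: str) -> Tuple[Optional[str], str]:
    """Split a name string into (first, last) following the rules above.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if "," in text:
        parts = [part.strip() for part in text.split(",")]
        if len(parts) != 2 or not all(parts):
            # Several commas (or an empty component): it is unclear which
            # comma separates the last name from the first name.
            raise ValueError(f"ambiguous name string: {text!r}")
        last, first = parts
        return first, last
    tokens = text.split()
    if len(tokens) == 1:
        # Assumption: a single token is treated as a last name only.
        return None, tokens[0]
    if len(tokens) != 2:
        # Empty input, or spaces inside a name component: the split is unclear.
        raise ValueError(f"ambiguous name string: {text!r}")
    return tokens[0], tokens[1]
```

With this sketch, "Yang Liu" and "Liu, Yang" both yield `("Yang", "Liu")`, while "Daniel A. McFarland" and "Stabler, Jr., Edward P." raise ValueError, matching the behaviour described above.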
Name specifications¶
Author or editor fields, e.g. on papers, always contain NameSpecification
objects. A NameSpecification is mostly a regular name with an optional ID
(present if the name was already manually disambiguated by us) and an optional
affiliation. In the example below, you can see that author "Yang Liu" was
assigned an explicit ID in the metadata:
>>> paper = anthology.get("2021.emnlp-main.151")
>>> paper.authors
[
    NameSpecification(name=Name(first='Jialu', last='Wang'), id=None, affiliation=None, variants=[]),
    NameSpecification(name=Name(first='Yang', last='Liu'), id='yang-liu-umich', affiliation=None, variants=[]),
    NameSpecification(name=Name(first='Xin', last='Wang'), id=None, affiliation=None, variants=[])
]
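As a small illustration of working with these fields, the sketch below keeps only the name specifications that carry an explicit ID. The helper name `explicitly_identified` is hypothetical; it relies only on the `id` attribute visible in the output above.

```python
def explicitly_identified(name_specs):
    """Return the name specifications that carry an explicit ID.

    Hypothetical helper for illustration; it only reads the `id`
    attribute, which is None when no ID was assigned in the metadata.
    """
    return [spec for spec in name_specs if spec.id is not None]
```

Applied to `paper.authors` from the example above, this would be expected to keep only the entry for "Yang Liu".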
The variants field is not systematically used at the moment, but is intended for name variants written in a different script, such as:
>>> anthology.get("2021.ccl-1.1").authors
[
    NameSpecification(name=Name(first='Hao', last='Wang'), id=None, affiliation=None,
                      variants=[Name(first='浩', last='汪')]),
    NameSpecification(name=Name(first='Junhui', last='Li'), id=None, affiliation=None,
                      variants=[Name(first='军辉', last='李')]),
    NameSpecification(name=Name(first='Zhengxian', last='Gong'), id=None, affiliation=None,
                      variants=[Name(first='正仙', last='贡')])
]
Looking up name specifications¶
In contrast to names, name specifications will always resolve to a single
person. To find the person that is being referred to, use
.resolve():
>>> paper = anthology.get("2021.emnlp-main.151")
>>> name_spec = paper.authors[1]
>>> name_spec
NameSpecification(name=Name(first='Yang', last='Liu'), id='yang-liu-umich', affiliation=None, variants=[])
>>> name_spec.resolve()
Person(
    id='yang-liu-umich',
    names=[Name(first='Yang', last='Liu')],
    item_ids=<set of 4 AnthologyIDTuple objects>,
    comment='Univ. of Michigan, UC Santa Cruz'
)
This will work as long as the NameSpecification that you want to resolve is
attached to an Anthology item (paper, volume, or talk).
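Resolving every author of a paper at once might look like the sketch below. The helper name `author_person_ids` is hypothetical; it assumes only the `authors` attribute, the `.resolve()` method, and the `id` attribute of Person shown above, and that the paper was obtained from an Anthology instance so that resolution can succeed.

```python
def author_person_ids(paper):
    """Return the person ID behind each author of `paper`, in author order.

    Hypothetical helper for illustration. Each NameSpecification
    resolves to exactly one Person, so the result has one ID per author.
    """
    return [spec.resolve().id for spec in paper.authors]
```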
Persons¶
A Person object represents a natural
person. The documentation above showed how persons can be looked up via names
or name specifications; they can also be retrieved directly from their ID:
anthology.get_person("yang-liu-umich")
A person will always have exactly one canonical name, which is the one that is used as the leading name on author pages:
>>> person = anthology.get_person("dan-mcfarland")
>>> person.canonical_name
Name(first='Dan', last='McFarland')
They may also have additional names:
>>> person.names
[
    Name(first='Dan', last='McFarland'),
    Name(first='Daniel', last='McFarland'),
    Name(first='Daniel A.', last='McFarland')
]
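To separate the canonical name from the remaining ones, a sketch along these lines could be used. The helper name `alternate_names` is hypothetical; it relies only on the `canonical_name` and `names` attributes shown above and assumes that names can be compared for equality.

```python
def alternate_names(person):
    """Return all names of `person` except the canonical one.

    Hypothetical helper for illustration; the order of `person.names`
    is preserved.
    """
    canonical = person.canonical_name
    return [name for name in person.names if name != canonical]
```

For the person in the example above, this would be expected to leave the two "Daniel" variants.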
Looking up publications¶
You can get the set of IDs of all items associated with a person:
>>> person.item_ids
{('Q18', '1', '28'), ('2020.findings', 'emnlp', '158'), ('W11', '15', '16'), ...}
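Since the IDs are plain tuples, they can be summarized without loading the items themselves. The sketch below counts items per collection; `items_per_collection` is a hypothetical helper, and it assumes that each ID is a 3-tuple whose first element identifies the collection, as the output above suggests.

```python
from collections import Counter


def items_per_collection(item_ids):
    """Count how many item IDs fall into each collection.

    Hypothetical helper for illustration. Assumes every ID is a 3-tuple
    and that its first element is the collection ID.
    """
    return Counter(collection_id for collection_id, _volume, _item in item_ids)
```

A call such as `items_per_collection(person.item_ids)` would then return a mapping from collection ID to the number of associated items.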
You can iterate over
Person.anthology_items()
to get the actual items the person is associated with. If you know that you only
want to iterate over papers or volumes, you can also use
Person.papers() or
Person.volumes() instead.
An Entity-Relationship diagram¶
erDiagram
    Name {
        str first
        str last
    }
    NameSpecification {
        Optional[str] id
        Optional[str] affiliation
    }
    Person {
    }
    "Paper, Volume, etc." {
    }
    Person ||--|{ NameSpecification : identified-by
    Person }|--|{ Name : has
    NameSpecification }o--|| Name : contains
    "Paper, Volume, etc." }o--|{ NameSpecification : refers-to
[1] Both in real life and in bibliographic metadata.