Accessing Publications¶
Data hierarchy¶
Publications in the ACL Anthology are organized into collections of volumes containing papers. A volume is a set of related papers that would traditionally have been bound and published as a physical book. A collection is a group of volumes that were published at the same time under the same venue.
For example, the paper with the Anthology ID “2022.acl-long.220” belongs to the collection 2022.acl (indicating that it was published at ACL 2022), the volume long (which subsumes all long papers at the conference), and has the paper ID 220. The following graph illustrates this hierarchy:
Anthology
└── Collection (id: 2022.acl)
    └── Volume (id: long)
        └── Paper (id: 220), full ID: 2022.acl-long.220
The library organizes data the same way. You can use
anthology.get() to access
collections, volumes, or papers by their corresponding IDs.
anthology.get("2022.acl") # returns the Collection '2022.acl'
anthology.get("2022.acl-long") # returns the Volume '2022.acl-long'
anthology.get("2022.acl-long.220") # returns the Paper '2022.acl-long.220'
Warning
All .get and .get_* functions will return None if the given ID doesn't correspond to any item in the Anthology data, so you probably want to check their return value.
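One way to handle this is a small guard that turns a missing item into an explicit error. The helper below is a hypothetical sketch and not part of the library; it only relies on the behaviour described above, namely that the lookup returns None for unknown IDs. A plain dictionary stands in for the Anthology so that the example runs on its own.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def get_or_raise(lookup: Callable[[str], Optional[T]], item_id: str) -> T:
    """Call `lookup(item_id)` and raise LookupError if it returns None.

    `get_or_raise` is a hypothetical helper, not a library function.
    """
    item = lookup(item_id)
    if item is None:
        raise LookupError(f"No item with ID {item_id!r}")
    return item


# A plain dict stands in for the Anthology here; with the library, you would
# pass `anthology.get` (or one of the `.get_*` methods) as `lookup` instead.
fake_data = {"2022.acl-long.220": "a paper"}
found = get_or_raise(fake_data.get, "2022.acl-long.220")
```

If an explicit `if item is None:` check at the call site reads better in your code, that works just as well; the point is only that the None case is handled somewhere.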
Manipulating Anthology IDs¶
As you can see above, Anthology IDs like 2022.acl-long.220 can be split into parts
representing the collection, volume, and paper. You can use the utility
functions in utils.ids to convert IDs between strings
and tuples:
from acl_anthology.utils.ids import parse_id, build_id
parse_id("2022.acl-long.220") # returns ('2022.acl', 'long', '220')
build_id("2022.acl", "long", "220") # returns '2022.acl-long.220'
Functions that take Anthology IDs will usually accept both the string and the tuple form, so the following also works:
anthology.get(("2022.acl", "long", "220")) # returns the Paper '2022.acl-long.220'
To distinguish between their local ID and their full(y qualified) ID, child objects provide separate attributes for these:
paper = anthology.get("2022.acl-long.220")
paper.id # returns '220'
paper.full_id # returns '2022.acl-long.220'
paper.full_id_tuple # returns ('2022.acl', 'long', '220')
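To make the structure of these IDs concrete, here is a minimal sketch of how an ID in the format shown above could be split by hand. The function name `split_new_style_id` is hypothetical, and the sketch rests on two assumptions: that volume and paper IDs contain neither "-" nor ".", and that missing components are represented as None. It does not attempt to cover any other ID format; for real work, use parse_id from utils.ids.

```python
from typing import Optional, Tuple


def split_new_style_id(anthology_id: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split an ID like '2022.acl-long.220' into (collection, volume, paper).

    Hypothetical illustration only: it assumes the ID format shown above and
    that neither the volume ID nor the paper ID contains '-' or '.'.
    """
    # Everything before the first '-' is the collection ID.
    collection, sep, rest = anthology_id.partition("-")
    if not sep:
        return (collection, None, None)  # collection ID only
    # The remainder is '<volume>' or '<volume>.<paper>'.
    volume, sep, paper = rest.partition(".")
    return (collection, volume, paper if sep else None)


parts = split_new_style_id("2022.acl-long.220")
```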
Looking up publications¶
If you know that you are looking for a specific type of item, you can also use:
- anthology.get_collection() to only look up collections
- anthology.get_volume() to only look up volumes
- anthology.get_paper() to only look up papers
Working with containers¶
Following the data hierarchy described above, child objects
of the Anthology class are also organized
in a hierarchical fashion:
- The CollectionIndex is a container mapping collection IDs to Collection objects. (The index is accessible as anthology.collections.)
- Collection objects are containers mapping volume IDs to Volume objects.
- Volume objects are containers mapping paper IDs to Paper objects.
This means that the following are all equivalent:
anthology.get("2022.acl-long.220") # Paper '2022.acl-long.220'
anthology.get("2022.acl").get("long").get("220") # same
anthology.collections["2022.acl"]["long"]["220"] # same
Tip
As a general rule, all containers provide complete dictionary-like functionality.
The following all work as they would with a regular dictionary object:
volume = anthology.get_volume("2022.acl-long")
volume.get("220") # returns the Paper '2022.acl-long.220'
volume["220"] # returns the Paper '2022.acl-long.220'
# ^... but raises KeyError on invalid IDs
"220" in volume # returns True if paper ID '220' exists in this volume
len(volume) # returns the number of papers in this volume
list(volume) # returns a list of paper IDs in this volume
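For readers wondering why these operations come as a set: in Python, a class that derives from collections.abc.Mapping and implements `__getitem__`, `__iter__`, and `__len__` gets `.get()`, `in`, `.keys()`, `.values()`, and `.items()` for free. Whether the library's containers are built exactly this way is an assumption; `ToyVolume` below is a hypothetical stand-in that only mimics the behaviour listed above.

```python
from collections.abc import Mapping


class ToyVolume(Mapping):
    """Hypothetical stand-in for a container; not the library's Volume class."""

    def __init__(self, papers):
        self._papers = dict(papers)  # paper ID -> paper object

    def __getitem__(self, paper_id):
        return self._papers[paper_id]  # raises KeyError on unknown IDs

    def __iter__(self):
        return iter(self._papers)  # iterates over paper IDs (the keys)

    def __len__(self):
        return len(self._papers)


toy = ToyVolume({"1": "first paper", "220": "another paper"})
missing = toy.get("999")  # inherited from Mapping; falls back to None
```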
Iterating over containers¶
Since containers behave like dictionaries, you can iterate over their contents in the same way:
volume = anthology.get_volume("2022.acl-long")
for paper_id, paper in volume.items():
    print(paper.full_id)
Note that this means iterating over the container directly will iterate over its keys:
for paper in volume:
    print(paper) # will print paper IDs, not Paper objects!
To return a generator over the values, i.e. the actual child objects, you can use these semantically named functions:
collection = anthology.get("2022.acl")
for volume in collection.volumes():
    for paper in volume.papers():
        print(paper.full_id)
Accessing parents¶
All child objects also keep pointers to their direct parent (.parent) as well
as the Anthology instance they belong to
(.root), so you can also easily navigate upwards in the hierarchy.
paper = anthology.get("2022.acl-long.220")
paper.parent # returns the Volume '2022.acl-long'
paper.parent.parent # returns the Collection '2022.acl'
paper.parent.parent.parent # returns the CollectionIndex
paper.parent.parent.parent.parent # returns the Anthology
paper.root # returns the Anthology, but less confusingly
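The relationship between .parent and .root can be illustrated with a toy tree. This is a sketch under stated assumptions, not the library's code: `ToyNode` is a hypothetical class, and it computes the root by walking the parent pointers, whereas the text above says the library objects keep a pointer to the Anthology instance directly. Both approaches end up at the same object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical tree node; not a class from the library."""

    name: str
    parent: Optional["ToyNode"] = None

    @property
    def root(self) -> "ToyNode":
        # Follow parent pointers until reaching the node that has none.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


toy_top = ToyNode("anthology")
toy_collection = ToyNode("2022.acl", parent=toy_top)
toy_volume = ToyNode("long", parent=toy_collection)
toy_paper = ToyNode("220", parent=toy_volume)
```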
Looking up events¶
Events can be accessed through
anthology.get_event() or via
anthology.events, which is the
EventIndex. Event IDs are
currently required to be of the form {venue}-{year}; e.g., "acl-2022" is the
event ID for ACL 2022:
event = anthology.get_event("acl-2022")
Papers and volumes can infer their associated events via
.get_events():
[event.id for event in paper.get_events()] # returns ['acl-2022']
Explicitly loading data¶
The Anthology metadata is distributed across many individual XML and YAML files, and loading all of this data can take some time and memory. Fortunately, you normally don't need to worry about that, as the library implements lazy loading and only loads files on demand, when they are required.
For example, if you access the paper with ID 2022.acl-long.220, the library
will load the file xml/2022.acl.xml from the data directory, which (hopefully)
contains the metadata for the requested paper.
However, if you know that you will eventually process the entire Anthology anyway, it can be faster to load all of the data at once:
anthology.load_all()
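The difference between on-demand and eager loading can be sketched with a small cache. Everything here is hypothetical: `LazyStore` and `fake_loader` are illustrative names, and the library's actual loading logic is not shown in this guide. The sketch only demonstrates the general pattern of loading an item on first access and offering a method that loads everything up front.

```python
from typing import Callable, Dict, Iterable


class LazyStore:
    """Hypothetical sketch of lazy loading; not the library's implementation."""

    def __init__(self, known_ids: Iterable[str], load_one: Callable[[str], dict]):
        self._known_ids = list(known_ids)
        self._load_one = load_one  # e.g. a function that parses one file
        self._cache: Dict[str, dict] = {}

    def get(self, collection_id: str) -> dict:
        # Load on first access only; later calls are served from the cache.
        if collection_id not in self._cache:
            self._cache[collection_id] = self._load_one(collection_id)
        return self._cache[collection_id]

    def load_all(self) -> None:
        # Eagerly load everything that has not been loaded yet.
        for collection_id in self._known_ids:
            self.get(collection_id)


calls = []


def fake_loader(collection_id: str) -> dict:
    calls.append(collection_id)  # record each simulated file read
    return {"id": collection_id}


store = LazyStore(["2021.acl", "2022.acl"], fake_loader)
store.get("2022.acl")
store.get("2022.acl")  # second access does not trigger another load
calls_after_lazy_access = list(calls)
store.load_all()  # loads the remaining collection
```

In this toy version, eager loading pays the full cost once, while lazy access pays only for what is touched; which of the two is faster overall depends on how much of the data ends up being used.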
Tip
It should never be required to call .load_all(), and calling it may
or may not provide a speed-up, depending on what kind of data you are
accessing and in which manner.
The following actions currently cause the entire Anthology data to be loaded,
so calling .load_all() up front may be worthwhile if you plan to do any of
them:
- Accessing persons or the PersonIndex.
- Accessing venue objects.
- Accessing the BibkeyIndex.
When developing or modifying the Anthology data, .load_all() can also be
useful to check that the library can read and process the entirety of the data
in the data directory without errors.