Modifying Data¶
Rules of thumb¶
This library also aims to make it easy to modify or create data in the Anthology. The implementation is complicated by the various indices, and by objects such as persons that cross-reference other objects. Here are some rules of thumb when making modifications to the data:
- To create new objects, use create_ functions provided by the library whenever possible, rather than instantiating them directly.
- You can modify objects by simply modifying their attributes, as long as the object in question has an explicit representation in the Anthology data.
  - This includes collections, volumes, papers, events.
  - It also includes persons where Person.is_explicit == True, as those have an explicit representation in people.yaml.
- Saving data is always non-destructive. In XML files, it will also avoid introducing unnecessary changes (e.g. no needless reordering of tags).
- Affected indices and their child objects should automatically update on relevant changes.
  - This includes the item_ids attribute of affected Person or Venue instances, or the colocated_ids of Event instances.
  - It also includes the BibkeyIndex and the reverse-mapping of volume IDs to events in EventIndex.
Caution: Dynamic updating of linked items is experimental
There may be edge cases where it currently doesn't work yet as expected. If you encounter such issues, please report them as a bug.
Modifying publications¶
To make modifications to existing publications, you can normally just modify the
attributes of the respective object. For example, to add a DOI to a paper, we can
just fetch the paper and set its doi attribute:
>>> paper = anthology.get("2022.acl-long.99")
>>> paper.doi = '10.18653/v1/2022.acl-long.99'
Rule of thumb
As a general rule, all classes perform automatic input conversion and validation. This means that setting attributes should either "do the right thing" or raise a TypeError.
Input validation and conversion¶
Attributes generally perform input validation. For example, since a paper's
pdf attribute needs to be an instance of
PDFReference, trying to set it to a path
won't work and will raise a TypeError:
>>> paper.pdf = Path("2025.test-1.pdf") # will raise TypeError
However, many attributes also provide input converters that perform simpler
conversions automatically. For example, paper titles and abstracts are stored
as MarkupText objects, but setting
such an attribute to a string will automatically convert it:
>>> paper.title = "Improving the ACL Anthology"
>>> paper.title
MarkupText('Improving the ACL Anthology')
Here are a few more examples of input converters, showing what you can set and how it will be converted & stored internally:
>>> paper.awards = ["Best paper award"]
>>> paper.awards # stored as a tuple
('Best paper award',)
>>> person = anthology.get_person("marcel-bollmann")
>>> person.orcid = "https://orcid.org/0000-0003-2598-8150"
>>> person.orcid # stored without URL prefix
'0000-0003-2598-8150'
>>> volume = anthology.get("2022.acl-long")
>>> volume.year = 2022
>>> volume.year # stored as string
'2022'
>>> from datetime import date
>>> volume.ingest_date = date.today()
>>> volume.ingest_date # stored as string
'2025-01-08'
This design philosophy means that there’s normally no need to check values in your code before you set them, as you can just check for errors upon setting.
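To make the "convert or raise" behaviour concrete, here is a minimal, self-contained sketch of the pattern using plain Python properties. ToyVolume and its setters are hypothetical stand-ins written for this illustration; they are not the library's classes, and the library's own conversion machinery may work differently.

```python
class ToyVolume:
    """Hypothetical stand-in for a volume; not the library's own class."""

    def __init__(self, year, awards=()):
        # Both assignments go through the property setters below.
        self.year = year
        self.awards = awards

    @property
    def year(self):
        return self._year

    @year.setter
    def year(self, value):
        # Accept int or str and store a string; reject everything else.
        if isinstance(value, bool) or not isinstance(value, (int, str)):
            raise TypeError(f"year must be int or str, not {type(value).__name__}")
        self._year = str(value)

    @property
    def awards(self):
        return self._awards

    @awards.setter
    def awards(self, value):
        # A bare string is iterable, but almost certainly a mistake here.
        if isinstance(value, str):
            raise TypeError("awards must be an iterable of strings, not a str")
        self._awards = tuple(value)  # tuple() raises TypeError for non-iterables


volume = ToyVolume(year=2022, awards=["Best paper award"])

# No pre-checking: just set the value and handle the error if it is rejected.
try:
    volume.year = 20.22  # invalid: neither int nor str
except TypeError as exc:
    error_message = str(exc)
```

Because the rejected assignment raises before anything is stored, the previous value stays in place, which is what makes "set first, handle errors afterwards" a safe pattern in this sketch.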
Immutable attributes¶
Since input conversion and validation can only happen upon setting an attribute, but not generally when modifying a (mutable) attribute, collection objects generally use immutable attributes.
A common example is the author list on papers, which is exposed as a tuple
(rather than a list). To modify the author list, you need to create a new tuple
and set it on the paper. For example, to add an author to a paper, you can
create a new NameSpecification,
wrap it in a tuple, and "append" it to the author list via +=:
>>> spec = NameSpecification("Bollmann, Marcel")
>>> paper.authors += (spec,)
Anthology objects that can be set on collection items are generally immutable,
too. For example, to correct a checksum on a
PDFReference, it won't work to update the
reference itself – you need to create a new one and re-set the attribute:
>>> paper.pdf.checksum = "f9f4f558" # will raise
>>> paper.pdf = PDFReference.from_file(...) # works
The main exception to this rule is modifying name specifications. To update an existing author's name, you need to remember that names are immutable, but name specifications can be modified:
>>> paper.authors[0].name.first = "Marc Marcel" # will raise
>>> paper.authors[0].name = Name("Bollmann, Marc Marcel") # works
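The "replace, don't mutate" pattern can be sketched with the standard library alone. ToyReference below is a hypothetical frozen dataclass used only for illustration; it says nothing about how PDFReference or Name are actually implemented.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ToyReference:
    """Hypothetical stand-in for an immutable file reference."""

    name: str
    checksum: str


ref = ToyReference(name="2025.test-1.pdf", checksum="00000000")

# In-place modification of a frozen object is rejected...
try:
    ref.checksum = "f9f4f558"
    in_place_worked = True
except FrozenInstanceError:
    in_place_worked = False

# ...so build a new object and assign that instead.
fixed = replace(ref, checksum="f9f4f558")

# Tuples behave the same way: "+=" creates a new tuple and rebinds the name,
# which is why it triggers validation when used on an attribute.
authors = ("Doe, John",)
authors += ("Bollmann, Marcel",)
```

The design benefit is that every change passes through an assignment, which is the only point where conversion and validation can run.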
Things to keep in mind¶
Citation keys¶
If a paper's title or author list has changed, you might want to recreate its
citation key (or 'bibkey'). This can be done by calling
Paper.refresh_bibkey().
If the auto-generated bibkey is identical to the current one, the bibkey will
not change.
Dependent indices¶
Changing an item's attribute might affect various indices. As a rule of thumb, all indices – as well as objects connected to these indices – will update automatically. For example:
- Changing an item's bibkey will update the BibkeyIndex.
- Changing an item's author or editor list will update the PersonIndex and the item_ids of any Person objects affected by the change.
- Changing an item's venue_ids will update the VenueIndex and the item_ids of any Venue objects affected by the change. It will also update any Event objects that are implicitly inferred from the venue assignment, as well as the reverse-indexing in the EventIndex for such events.
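As a rough mental model of how such automatic updates can work, the following toy sketch re-indexes a paper whenever its author tuple is assigned. ToyPersonIndex, ToyPaper, and the person ID "john-doe" are hypothetical and exist only for this illustration; the library's real indices are more involved.

```python
from collections import defaultdict


class ToyPersonIndex:
    """Hypothetical reverse index: person ID -> set of item IDs."""

    def __init__(self):
        self.item_ids = defaultdict(set)

    def reindex(self, item_id, old_authors, new_authors):
        # Remove the item from everyone it used to belong to, then re-add it.
        for author in old_authors:
            self.item_ids[author].discard(item_id)
        for author in new_authors:
            self.item_ids[author].add(item_id)


class ToyPaper:
    """Hypothetical paper whose author setter keeps the index in sync."""

    def __init__(self, item_id, index, authors=()):
        self.item_id = item_id
        self._index = index
        self._authors = ()
        self.authors = authors  # goes through the setter below

    @property
    def authors(self):
        return self._authors

    @authors.setter
    def authors(self, value):
        new_authors = tuple(value)
        self._index.reindex(self.item_id, self._authors, new_authors)
        self._authors = new_authors


index = ToyPersonIndex()
paper = ToyPaper("2022.acl-long.99", index, authors=("john-doe",))
paper.authors += ("marcel-bollmann",)  # adds the paper to a second person
paper.authors = ("marcel-bollmann",)   # removes it from the first person
```

Note that the update hook only fires on assignment, which is another reason the author list is exposed as an immutable tuple.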
Modifying people¶
A person can be explicit (has an entry in people.yaml) or inferred (was instantiated from a name specification without an ID). To make modifications to persons, it is important to remember that:
- Only an explicit person's attributes can be meaningfully modified.
- Changing which person a paper/volume is assigned to should be done by modifying the name specification on the paper/volume, not by changing anything on the Person object. In other words, do not modify Person.item_ids.
A note on terminology
Within the library, the term explicit refers to a person that has an entry in people.yaml, whereas inferred refers to a person that was instantiated automatically while loading the XML data files (and has no entry in people.yaml).
Currently, all inferred persons have IDs ending with /unverified, while IDs of explicit persons must not end with /unverified. (More specifically, they may not even contain a slash.)
In practice, this means that "inferred" persons are currently equivalent to "unverified" persons, but the library intentionally uses terminology that is agnostic to the semantics of the ID. If the semantics of whom we consider "(un)verified" change, the terminology in the library needn't change, as it only refers to the technical aspect of where the ID came from (people.yaml vs. implicit instantiation).
Creating a new person¶
Manually creating a new person (that will get saved to people.yaml and can
have an ORCID and other metadata) can be done in two ways:
- By calling PersonIndex.create(). The returned Person is not linked to any papers/volumes, but you can set their ID afterwards on name specifications.
- By calling make_explicit() on a currently inferred person. This will not only add this person to the database, but also set their ID on all papers/volumes currently associated with them.
Example: Merging two persons¶
Situation: An author has published under multiple names, and therefore two separate persons get instantiated for them (let's call them p1 and p2). We want to merge them into a single person.
- If neither person is explicit yet: Call p1.make_explicit(). This will create an entry in people.yaml with all current names of p1 and add the new ID to all papers and volumes currently inferred to belong to p1.
- p1 can now be assumed to be explicit. If p2 is not explicit, call p2.merge_into(p1). This will add all of p2's names to p1 and set p1's ID on all papers and volumes currently inferred to belong to p2.
- Save the changes, e.g. via Anthology.save_all().
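The steps above can be wrapped in a small helper. This is only a sketch: merge_persons is a hypothetical function name, the calls are written exactly as they appear in the steps (check the API reference for any required arguments), and the case where p2 is already explicit is deliberately left unhandled because the steps do not cover it.

```python
def merge_persons(anthology, p1, p2):
    """Merge the inferred person p2 into p1, following the steps above.

    Hypothetical helper; `anthology`, `p1`, and `p2` are assumed to be an
    Anthology instance and two Person objects obtained from it.
    """
    if p2.is_explicit:
        # Not covered by the documented steps, so refuse rather than guess.
        raise NotImplementedError("merging an explicit p2 is not covered here")

    # Step 1: if neither person is explicit yet, make p1 explicit.
    if not p1.is_explicit:
        p1.make_explicit()

    # Step 2: p1 is explicit now; fold p2's names and items into it.
    p2.merge_into(p1)

    # Step 3: write the changes to disk.
    anthology.save_all()
```

If the two persons could be passed in either order, one option is to swap them before calling the helper so that an already explicit person ends up as p1.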
Example: Disambiguating a person¶
Situation: A person p1 is currently associated with papers/volumes that actually belong to different people, who just happened to publish under the same name. We want to create a new person instance for the other author with the same name.
- Call anthology.people.create() for all persons who do not have an explicit ID yet, giving all the names that can refer to this person. Also supply the ORCID when calling this function, if it is known.
- For each person, go through the papers that actually belong to them and update the name specification where namespec.id == p1.id by setting the explicit ID of the correct newly created person.

TODO
Ingesting new proceedings¶
Proceedings can be ingested almost entirely via functionality from this library; in particular, no data files (XML or YAML) need to be saved manually. (The only functionality that is currently not part of this library is the fixed-caser for paper titles, which is described below.)
New collections, volumes, and papers¶
Creating new objects from acl_anthology.collections should be done with create_ functions from their respective parent objects.
All attributes that can be set on these objects (Volumes, Papers, etc.) can also be supplied as keyword parameters to the create_ functions. Some required attributes don't need to be supplied here:
- A Paper's id will be set to the next-highest numeric ID that doesn't already exist in the volume, starting at "1".
- A Paper's bibkey will be automatically generated if not explicitly set.
- A Volume's year attribute will be derived from the collection ID (e.g., "2049" in a collection with ID "2049.acl").
- A Volume's type will default to PROCEEDINGS.
However, it is strongly recommended to supply the author/editor list when calling a create_ function, as this will resolve person IDs and create correct bibkeys automatically.
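To illustrate two of the defaults listed above, here is a small sketch of how a next paper ID and a volume year could be derived. Both functions are hypothetical and written for this page; they reflect one plausible reading of the rules, only handle collection IDs of the form "2049.acl", and are not the library's implementation.

```python
def next_paper_id(existing_ids):
    """Return the next numeric paper ID as a string, starting at "1".

    Assumption: "next-highest" means one more than the largest numeric ID
    already present; non-numeric IDs are ignored.
    """
    numeric = [int(i) for i in existing_ids if i.isdecimal()]
    return str(max(numeric, default=0) + 1)


def year_from_collection_id(collection_id):
    """Derive the year from a collection ID such as "2049.acl"."""
    year, sep, _venue = collection_id.partition(".")
    if not sep or len(year) != 4 or not year.isdecimal():
        raise ValueError(f"cannot derive a year from {collection_id!r}")
    return year
```

For example, an empty volume yields the ID "1", and a volume that already contains papers "1", "2", and "5" yields "6" under this reading.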
Example
Here is an example for how to create a new paper in an entirely new collection:
collection = anthology.create_collection("2049.acl")
volume = collection.create_volume(
    id="long",
    title=MarkupText.from_latex_maybe("Proceedings of the ..."),
    venue_ids=["acl"],
)
paper = volume.create_paper(
    title=MarkupText.from_latex_maybe("GPT$^{\\infty}$ is all you need"),
    authors=[NameSpecification(first="John", last="Doe")],
)
When all volumes and papers have been added, the XML file is written by calling:
collection.save()
If you don't supply an author list here...
If you don't supply authors or editors when calling a create_ function, or you need to modify those afterwards for some reason, you will need to perform these steps manually (which are otherwise handled by the create_ function):
- Call anthology.people.ingest_namespec() on each NameSpecification.
- Call refresh_bibkey() on the Paper.
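The two manual steps could be bundled as follows. finalize_paper_authors is a hypothetical helper name; the sketch assumes that `anthology` is a loaded Anthology, that `paper.authors` holds the NameSpecification objects to ingest, and that any return value of ingest_namespec() can be ignored. Editors, if present, would need the same treatment.

```python
def finalize_paper_authors(anthology, paper):
    """Do by hand what a create_ function would otherwise handle.

    Hypothetical helper: resolves each author's name specification against
    the person index, then regenerates the paper's citation key.
    """
    for namespec in paper.authors:
        anthology.people.ingest_namespec(namespec)
    # The bibkey depends on the author list, so refresh it last.
    paper.refresh_bibkey()
```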
Specifying titles and abstracts¶
Paper titles and abstracts are stored internally as MarkupText, but it is possible to simply set them to a string value, in which case heuristic markup conversion will be performed. Generally, for setting attributes that expect MarkupText, the following applies:
- Supplying a string value s is equivalent to using MarkupText.from_(s), which will parse and convert simple markup. Currently, only LaTeX markup is supported.
- If the heuristic conversion is not desired (or you want to make more explicit that you're converting from LaTeX), other builder methods of MarkupText can be used, such as MarkupText.from_latex(), or MarkupText.from_string() for plain strings.
Example
Setting a paper's title to a string automatically parses LaTeX markup contained in the string:
>>> paper.title = "Towards $\\infty$"
>>> paper.title
MarkupText('Towards <tex-math>\\infty</tex-math>')
If this is not desired, MarkupText can be explicitly instantiated with one of its builder methods, for example:
>>> paper.title = MarkupText.from_string("Towards $\\infty$")
>>> paper.title # No markup parsing here
MarkupText('Towards $\\infty$')
Paper titles should also have our fixed-casing algorithm applied to protect certain characters, e.g. by wrapping them in braces in BibTeX entries. The fixed-caser is currently not part of this Python library. There are two options for running the fixed-casing on a new ingestion:
- Outside the ingestion script: Run bin/fixedcase/protect.py on the new XML files produced by the ingestion script.
- Within the ingestion script: Convert titles to XML, run fixedcase.protect(), then set the title again from the modified XML element:

      import fixedcase

      xml_title = paper.title.to_xml("title")
      fixedcase.protect(xml_title)
      paper.title = MarkupText.from_xml(xml_title)
Specifying authors¶
Authors need to be specified by creating name specifications, for example:
NameSpecification(Name("Marcel", "Bollmann"), orcid="0000-0003-2598-8150")
If an ORCID is supplied, the NameSpecification also needs to have an explicit ID
referring to an entry in people.yaml. The library can add an ID
automatically as long as you supply the author/editor list to the create_
function, so there is typically no need to call PersonIndex.create() during
ingestion!
Example
If you create a paper in the following way...
paper = volume.create_paper(
    title=MarkupText.from_string("The past and future of the ACL Anthology"),
    authors=[NameSpecification(Name("Marcel", "Bollmann"), orcid="0000-0003-2598-8150")],
)
...the name specification will automatically be updated with an ID referring to this person in one of two ways:
- If a person with this ORCID already exists in people.yaml, their ID will be filled in.
- If a person with this ORCID does not exist in people.yaml, a new entry with this ORCID will be added to people.yaml with an auto-generated person ID. The ID is a slug of the person's name; if necessary to avoid an ID clash, the last four digits of their ORCID will be appended.
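The ID scheme described in the second case can be approximated with a few lines of standard-library Python. Both functions below are hypothetical illustrations: the slug rules and the hyphen used to append the ORCID digits are assumptions made for this sketch, and the library's own slugging may treat some names differently.

```python
import re
import unicodedata


def slugify(name):
    """Lowercase ASCII slug of a name, with hyphens between word parts."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def make_person_id(name, orcid, existing_ids):
    """Slug of the name; on a clash, append the last four ORCID characters.

    Assumption: the suffix is joined with a hyphen.
    """
    base = slugify(name)
    if base not in existing_ids:
        return base
    return f"{base}-{orcid[-4:]}"
```

With no clash, "Marcel Bollmann" maps to "marcel-bollmann" in this sketch; if that ID is already taken, the ORCID "0000-0003-2598-8150" produces "marcel-bollmann-8150".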
New events¶
Creating an explicit event works the same way as with other collection items:
event = collection.create_event(id="acl-2049")
An Event's id, if not given, will be automatically generated from the
collection ID (e.g., "2049.acl" will generate "acl-2049" for the event).
Since the mixture of implicit and explicit creation of events and linking them
to volumes can sometimes become a bit unintuitive (see the documentation of
create_event()
or the source code of
EventIndex.load() for
the gory details), it's best to ensure that:
- The EventIndex has been loaded before creating a new event (e.g. by running anthology.events.load() or anthology.load_all()).
- Any volumes in the same collection are explicitly added to the event via event.add_colocated(volume).
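Both recommendations can be combined into one helper. create_linked_event and default_event_id are hypothetical names introduced here; the sketch assumes the caller passes in the volumes to link, and it uses only the calls named above. The ID helper merely mirrors the "2049.acl" to "acl-2049" example and is not the library's code.

```python
def default_event_id(collection_id):
    """Mirror the documented example: "2049.acl" becomes "acl-2049"."""
    year, _, venue = collection_id.partition(".")
    return f"{venue}-{year}"


def create_linked_event(anthology, collection, volumes, event_id=None):
    """Create an explicit event and link the given volumes to it."""
    # Make sure the EventIndex is loaded before creating the event.
    anthology.events.load()

    if event_id is None:
        event = collection.create_event()  # ID is auto-generated
    else:
        event = collection.create_event(id=event_id)

    # Link volumes explicitly instead of relying on implicit inference.
    for volume in volumes:
        event.add_colocated(volume)
    return event
```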
Connecting to venues and SIGs¶
Volumes can be connected to venues by modifying the volume's venue_ids list.
New venues can be added by calling
VenueIndex.create(), which will also
create a corresponding YAML file upon saving. Afterwards, the ID used when
instantiating the venue can be used in a volume's venue_ids.
TODO: connecting to SIGs; we may want to refactor how SIGs are represented before introducing this functionality.
Saving changes¶
Rule of thumb
Call anthology.save_all() to save all metadata changes.
Calling save_all() will write
XML and YAML files to the Anthology's data directory, with the following
caveats:
- Collections will track if they have been modified to prevent writing XML files unnecessarily.
- Saving a collection manually can be done by calling Collection.save().
- Saving a collection uses a minimal-diff algorithm by default to avoid introducing "noise" in the diffs, i.e. changes to the XML that do not make a semantic difference, such as reordering certain tags, attributes, or introducing/deleting certain empty tags. It is also guaranteed to be non-destructive through integration tests on the entire Anthology data.
- YAML files will always be written. Serializing all YAML files is much faster than serializing all XML files, so they are written unconditionally, without tracking changes.
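The difference between tracked and unconditional saving can be pictured with a simple "dirty flag". ToyCollection, its mark_modified() method, and the save_all() function below are hypothetical and only model the behaviour described above; no files are written, and the library's own bookkeeping may differ.

```python
class ToyCollection:
    """Hypothetical collection that remembers whether it was changed."""

    def __init__(self, collection_id):
        self.id = collection_id
        self.is_modified = False
        self.write_count = 0  # stands in for actual XML writes

    def mark_modified(self):
        self.is_modified = True

    def save(self):
        # A manual save always writes, then clears the flag.
        self.write_count += 1
        self.is_modified = False


def save_all(collections):
    """Write only the collections that were actually modified."""
    for collection in collections:
        if collection.is_modified:
            collection.save()


acl = ToyCollection("2049.acl")
emnlp = ToyCollection("2049.emnlp")
acl.mark_modified()
save_all([acl, emnlp])  # only "2049.acl" is written
```

Skipping unmodified collections matters because, as noted above, serializing XML is the slow part; the YAML files are cheap enough to write every time.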