Modifying Data

Rules of thumb

This library also aims to make it easy to modify or create data in the Anthology. The implementation is complicated by the various indices and by objects, such as persons, that cross-reference other objects. Here are some rules of thumb when making modifications to the data:

  1. To create new objects, use create_ functions provided by the library whenever possible, rather than instantiating them directly.
  2. You can modify objects by simply modifying their attributes, as long as the object in question has an explicit representation in the Anthology data.
    • This includes collections, volumes, papers, events.
    • It also includes persons where Person.is_explicit == True, as those have an explicit representation in people.yaml.
  3. Saving data is always non-destructive. In XML files, it will also avoid introducing unnecessary changes (e.g. no needless reordering of tags).
  4. Affected indices and their child objects should automatically update on relevant changes.
    • This includes the item_ids attribute of affected Person or Venue instances, or the colocated_ids of Event instances.
    • It also includes the BibkeyIndex and the reverse-mapping of volume IDs to events in EventIndex.

Caution: Dynamic updating of linked items is experimental

There may be edge cases where it currently doesn't work yet as expected. If you encounter such issues, please report them as a bug.

Modifying publications

To make modifications to existing publications, you can normally just modify the attributes of the respective object. For example, to add a DOI to a paper, we can just fetch the paper and set its doi attribute:

>>> paper = anthology.get("2022.acl-long.99")
>>> paper.doi = '10.18653/v1/2022.acl-long.99'

Rule of thumb

As a general rule, all classes perform automatic input conversion and validation. This means that setting attributes should either "do the right thing" or raise a TypeError.

Input validation and conversion

Attributes generally perform input validation. For example, since a paper's PDF attribute needs to be an instance of PDFReference, trying to set it to a path won't work and will raise a TypeError:

>>> paper.pdf = Path("2025.test-1.pdf")    # will raise TypeError

However, many attributes also provide input converters that perform simpler conversions automatically. For example, paper titles and abstracts are stored as MarkupText objects, but setting such an attribute to a string will automatically convert it:

>>> paper.title = "Improving the ACL Anthology"
>>> paper.title
MarkupText('Improving the ACL Anthology')

Here are a few more examples of input converters, showing what you can set and how it will be converted & stored internally:

>>> paper.awards = ["Best paper award"]
>>> paper.awards                           # stored as a tuple
('Best paper award',)
>>> person = anthology.get_person("marcel-bollmann")
>>> person.orcid = "https://orcid.org/0000-0003-2598-8150"
>>> person.orcid                           # stored without URL prefix
'0000-0003-2598-8150'
>>> volume = anthology.get("2022.acl-long")
>>> volume.year = 2022
>>> volume.year                            # stored as string
'2022'
>>> from datetime import date
>>> volume.ingest_date = date.today()
>>> volume.ingest_date                     # stored as string
'2025-01-08'

This design philosophy means that there’s normally no need to check values in your code before you set them, as you can just check for errors upon setting.
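The "convert or raise" pattern can be illustrated with a small standalone sketch. This is not the library's code: ToyVolume and normalize_orcid are hypothetical names made up for this illustration, and the real converters may accept more input types or behave differently in edge cases.

```python
def normalize_orcid(value: str) -> str:
    """Strip a leading ORCID URL prefix, keeping only the bare identifier."""
    prefix = "https://orcid.org/"
    return value[len(prefix):] if value.startswith(prefix) else value


class ToyVolume:
    """Toy stand-in for illustration only, NOT the library's Volume class."""

    def __init__(self, year):
        self.year = year  # routed through the property setter below

    @property
    def year(self) -> str:
        return self._year

    @year.setter
    def year(self, value) -> None:
        # Converter: accept int or str, always store a string.
        # Validator: anything else is rejected with a TypeError.
        if isinstance(value, bool) or not isinstance(value, (int, str)):
            raise TypeError(f"year must be int or str, got {type(value).__name__}")
        self._year = str(value)


volume = ToyVolume(2022)       # the int is converted and stored as a string

# No pre-checking of values: just set the attribute and handle the error.
error = None
try:
    volume.year = 20.22        # a float is rejected by the toy validator
except TypeError as exc:
    error = exc                # volume.year keeps its previous value
```

Because the setter raises before storing anything, a failed assignment leaves the previous value untouched.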

Immutable attributes

Since input conversion and validation can only happen upon setting an attribute, but not generally when modifying a (mutable) attribute, collection objects generally use immutable attributes.

A common example is the author list on papers, which is exposed as a tuple (rather than a list). To modify the author list, you need to create a new tuple and set it on the paper. For example, to add an author to a paper, you can create a new NameSpecification, wrap it in a tuple, and "append" it to the author list via +=:

>>> spec = NameSpecification("Bollmann, Marcel")
>>> paper.authors += (spec,)

Anthology objects that can be set on collection items are generally immutable, too. For example, to correct a checksum on a PDFReference, it won't work to update the reference itself – you need to create a new one and re-set the attribute:

>>> paper.pdf.checksum = "f9f4f558"                         # will raise
>>> paper.pdf = PDFReference.from_file(...)                 # works

The main exception to this rule is modifying name specifications. To update an existing author's name, you need to remember that names are immutable, but name specifications can be modified:

>>> paper.authors[0].name.first = "Marc Marcel"             # will raise
>>> paper.authors[0].name = Name("Bollmann, Marc Marcel")   # works
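To see why immutable containers matter here, consider the following standalone sketch. ToyPaper is a hypothetical class written for this illustration (plain strings stand in for name specifications); it only shows the general mechanism, not how the library implements it.

```python
class ToyPaper:
    """Toy stand-in for illustration only, NOT the library's Paper class."""

    def __init__(self, authors=()):
        self.authors = authors  # routed through the setter below

    @property
    def authors(self) -> tuple:
        return self._authors

    @authors.setter
    def authors(self, value) -> None:
        items = tuple(value)  # convert any iterable to an immutable tuple
        for item in items:
            if not isinstance(item, str):
                raise TypeError(f"author must be str, got {type(item).__name__}")
        self._authors = items


paper = ToyPaper(["Bollmann, Marcel"])

# "+=" on a tuple builds a NEW tuple and assigns it, so the setter
# (and with it the validation) runs again.
paper.authors += ("Doe, John",)

# In-place mutation is impossible, so nothing can bypass the setter.
mutation_blocked = False
try:
    paper.authors.append(42)
except AttributeError:
    mutation_blocked = True
```

Had the attribute been a list, the append(42) call would have succeeded silently and stored an invalid entry without any validation.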

Things to keep in mind

Citation keys

If a paper's title or author list has changed, you might want to recreate its citation key (or 'bibkey'). This can be done by calling Paper.refresh_bibkey(). If the auto-generated bibkey is identical to the current one, the bibkey will not change.

Dependent indices

Changing an item's attribute might affect various indices. As a rule of thumb, all indices – as well as objects connected to these indices – will update automatically. For example:

  • Changing an item's bibkey will update the BibkeyIndex.
  • Changing an item's author or editor list will update the PersonIndex and the item_ids of any Person objects affected by the change.
  • Changing an item's venue_ids will update the VenueIndex and the item_ids of any Venue objects affected by the change. It will also update any Event objects that are implicitly inferred from the venue assignment, as well as the reverse-indexing in the EventIndex for such events.
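Conceptually, this kind of automatic updating can be pictured as an attribute setter that notifies a reverse index. The sketch below is a simplified standalone model: ToyIndex and IndexedPaper are hypothetical names, and the library's actual mechanism is likely more involved.

```python
from collections import defaultdict


class ToyIndex:
    """Maps an author name to the set of paper IDs that list this author."""

    def __init__(self):
        self.item_ids = defaultdict(set)

    def update(self, paper_id, old_authors, new_authors):
        # Remove the paper from authors that were dropped...
        for name in set(old_authors) - set(new_authors):
            self.item_ids[name].discard(paper_id)
        # ...and register it for authors that were added.
        for name in set(new_authors) - set(old_authors):
            self.item_ids[name].add(paper_id)


class IndexedPaper:
    """Toy paper whose author setter keeps the index in sync."""

    def __init__(self, paper_id, index, authors=()):
        self.paper_id = paper_id
        self._index = index
        self._authors = ()
        self.authors = authors

    @property
    def authors(self) -> tuple:
        return self._authors

    @authors.setter
    def authors(self, value) -> None:
        new_authors = tuple(value)
        self._index.update(self.paper_id, self._authors, new_authors)
        self._authors = new_authors


index = ToyIndex()
paper = IndexedPaper("2022.acl-long.99", index, ["Author A", "Author B"])
paper.authors = ("Author A", "Author C")  # B is unlinked, C is linked
```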

Modifying people

A person can be explicit (has an entry in people.yaml) or inferred (was instantiated from a name specification without an ID). To make modifications to persons, it is important to remember that:

  1. Only an explicit person's attributes can be meaningfully modified.

  2. Changing which person a paper/volume is assigned to should be done by modifying the name specification on the paper/volume, not by changing anything on the Person object. In other words, do not modify Person.item_ids.

A note on terminology

Within the library, the term explicit refers to a person that has an entry in people.yaml, whereas inferred refers to a person that was instantiated automatically while loading the XML data files (and has no entry in people.yaml).

Currently, all inferred persons have IDs ending with /unverified, while IDs of explicit persons must not end with /unverified. (More specifically, they may not even contain a slash.)

In practice, this means that "inferred" persons are currently equivalent to "unverified" persons, but the library intentionally uses terminology that is agnostic to the semantics of the ID. If the semantics of whom we consider "(un)verified" change, the terminology in the library needn't change, as it only refers to the technical aspect of where the ID came from (people.yaml vs. implicit instantiation).

Creating a new person

Manually creating a new person (that will get saved to people.yaml and can have an ORCID and other metadata) can be done in two ways:

  1. By calling PersonIndex.create(). The returned Person is not linked to any papers/volumes, but you can set their ID afterwards on name specifications.

  2. By calling make_explicit() on a currently inferred person. This will not only add this person to the database, but also set their ID on all papers/volumes currently associated with them.

Example: Merging two persons

Situation: An author has published under multiple names, and therefore two separate persons get instantiated for them (let's call them p1 and p2). We want to merge them into a single person.

  1. If neither person is explicit yet: Call p1.make_explicit(). This will create an entry in people.yaml with all current names of p1 and add the new ID to all papers and volumes currently inferred to belong to p1.

  2. p1 can now be assumed to be explicit. If p2 is not explicit, call p2.merge_into(p1). This will add all of p2's names to p1 and set p1's ID on all papers and volumes currently inferred to belong to p2.

  3. Save the changes, e.g. via Anthology.save_all().
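The steps above could be wrapped in a small helper along the following lines. This is only a sketch: merge_persons is a hypothetical function name, the person IDs are supplied by the caller, and make_explicit() is called without arguments as written in the text above, although its real signature may require arguments. Swapping the roles when only the second person is explicit, and refusing to merge two explicit persons, are assumptions of this sketch rather than documented behavior.

```python
def merge_persons(anthology, id_a: str, id_b: str):
    """Merge two persons following the steps above; returns the surviving person."""
    p1 = anthology.get_person(id_a)
    p2 = anthology.get_person(id_b)

    if p1.is_explicit and p2.is_explicit:
        # This case is not covered by the steps above.
        raise ValueError("both persons are explicit; not handled in this sketch")

    if p2.is_explicit:
        # Assumption: if only p2 is explicit, it becomes the merge target.
        p1, p2 = p2, p1
    elif not p1.is_explicit:
        # Step 1: neither person is explicit yet.
        # (Written without arguments as in the text; check the API reference.)
        p1.make_explicit()

    # Step 2: p1 is now the explicit target, p2 is inferred.
    p2.merge_into(p1)

    # Step 3: write the changes to disk.
    anthology.save_all()
    return p1
```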

Example: Disambiguating a person

Situation: A person p1 is currently associated with papers/volumes that actually belong to different people, who just happened to publish under the same name. We want to create a new person instance for the other author with the same name.

  1. Call anthology.people.create() for all persons who do not have an explicit ID yet, giving all the names that can refer to this person. Also supply the ORCID when calling this function, if it is known.

  2. For each person, go through the papers that actually belong to them and update the name specification where namespec.id == p1 by setting the explicit ID of the correct newly-created person. TODO

Ingesting new proceedings

Proceedings can be ingested almost entirely via functionality from this library; in particular, no data files (XML or YAML) need to be saved manually. (The only functionality that is currently not part of this library is the fixed-caser for paper titles, which is described below.)

New collections, volumes, and papers

Creating new objects from acl_anthology.collections should be done with create_ functions from their respective parent objects.

All attributes that can be set on these objects (Volumes, Papers, etc.) can also be supplied as keyword parameters to the create_ functions. Some required attributes don't need to be supplied here:

  • A Paper's id will be set to the next-highest numeric ID that doesn't already exist in the volume, starting at "1".
  • A Paper's bibkey will be automatically generated if not explicitly set.
  • A Volume's year attribute will be derived from the collection ID (e.g., "2049" in a collection with ID "2049.acl").
  • A Volume's type will default to PROCEEDINGS.

However, it is strongly recommended to supply the author/editor list when calling a create_ function, as this will resolve person IDs and create correct bibkeys automatically.
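To make the first and third defaults concrete, here is a standalone sketch of how such values could be derived. Both function names are hypothetical and this is not the library's code. It assumes "next-highest" means one more than the largest existing numeric ID, and it only handles collection IDs of the form "2049.acl".

```python
import re


def next_paper_id(existing_ids) -> str:
    """Return the next free numeric paper ID, starting at "1"."""
    numeric = [int(i) for i in existing_ids if str(i).isdecimal()]
    return str(max(numeric, default=0) + 1)


def year_from_collection_id(collection_id: str) -> str:
    """Extract the four-digit year prefix from an ID such as "2049.acl"."""
    match = re.match(r"(\d{4})\.", collection_id)
    if match is None:
        raise ValueError(f"cannot derive a year from {collection_id!r}")
    return match.group(1)


first_id = next_paper_id([])                  # empty volume
later_id = next_paper_id(["1", "2", "5"])     # gaps are not reused here
year = year_from_collection_id("2049.acl")
```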

Example

Here is an example for how to create a new paper in an entirely new collection:

collection = anthology.create_collection("2049.acl")
volume = collection.create_volume(
    id="long",
    title=MarkupText.from_latex_maybe("Proceedings of the ..."),
    venue_ids=["acl"],
)
paper = volume.create_paper(
    title=MarkupText.from_latex_maybe("GPT$^{\\infty}$ is all you need"),
    authors=[NameSpecification(first="John", last="Doe")],
)

When all volumes and papers have been added, the XML file is written by calling:

collection.save()

If you don't supply an author list here...

If you don't supply authors or editors when calling a create_ function, or you need to modify those afterwards for some reason, you will need to perform these steps manually (which are otherwise handled by the create_ function):

  • Call anthology.people.ingest_namespec() on each NameSpecification.
  • Call refresh_bibkey() on the Paper.
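The two manual steps could be bundled as follows. This is a sketch only: finalize_authors is a hypothetical helper name, it handles just the author list of a paper (editors would be treated analogously), and it ignores whatever ingest_namespec() returns, so consult the API reference for its exact behavior.

```python
def finalize_authors(anthology, paper) -> None:
    """Perform the steps that create_paper() would otherwise handle."""
    # Step 1: let the person index process each name specification.
    for namespec in paper.authors:
        anthology.people.ingest_namespec(namespec)

    # Step 2: regenerate the citation key, since it depends on the authors.
    paper.refresh_bibkey()
```

The bibkey is refreshed last because it is derived from the author list.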

Specifying titles and abstracts

Paper titles and abstracts are stored internally as MarkupText, but it is possible to simply set them to a string value, in which case heuristic markup conversion will be performed. Generally, for setting attributes that expect MarkupText, the following applies:

  • Supplying a string value s is equivalent to using MarkupText.from_(s), which will parse and convert simple markup. Currently, only LaTeX markup is supported.

  • If the heuristic conversion is not desired (or you want to make more explicit that you're converting from LaTeX), other builder methods of MarkupText can be used, such as MarkupText.from_latex(), or MarkupText.from_string() for plain strings.

Example

Setting a paper's title to a string automatically parses LaTeX markup contained in the string:

>>> paper.title = "Towards $\\infty$"
>>> paper.title
MarkupText('Towards <tex-math>\\infty</tex-math>')

If this is not desired, MarkupText can be explicitly instantiated with one of its builder methods, for example:

>>> paper.title = MarkupText.from_string("Towards $\\infty$")
>>> paper.title  # No markup parsing here
MarkupText('Towards $\\infty$')

Paper titles should also have our fixed-casing algorithm applied to protect certain characters, e.g. by wrapping them in braces in BibTeX entries. The fixed-caser is currently not part of this Python library. There are two options for running the fixed-casing on a new ingestion:

  1. Outside the ingestion script: Run bin/fixedcase/protect.py on the new XML files produced by the ingestion script.

  2. Within the ingestion script: Convert titles to XML, run fixedcase.protect(), then set the title again from the modified XML element:

    import fixedcase
    
    xml_title = paper.title.to_xml("title")
    fixedcase.protect(xml_title)
    paper.title = MarkupText.from_xml(xml_title)
    

Specifying authors

Authors need to be specified by creating name specifications, for example:

NameSpecification(Name("Marcel", "Bollmann"), orcid="0000-0003-2598-8150")

If an ORCID is supplied, the NameSpecification also needs to have an explicit ID referring to an entry in people.yaml. The library can add an ID automatically as long as you supply the author/editor list to the create_ function, so there is typically no need to call PersonIndex.create() during ingestion!

Example

If you create a paper in the following way...

paper = volume.create_paper(
    title=MarkupText.from_string("The past and future of the ACL Anthology"),
    authors=[NameSpecification(Name("Marcel", "Bollmann"), orcid="0000-0003-2598-8150")],
)

...the name specification will automatically be updated with an ID referring to this person in one of two ways:

  • If a person with this ORCID already exists in people.yaml, their ID will be filled in.
  • If a person with this ORCID does not exist in people.yaml, a new entry with this ORCID will be added to people.yaml with an auto-generated person ID. The ID is a slug of the person's name; if necessary to avoid an ID clash, the last four digits of their ORCID will be appended.
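The ID generation described in the second bullet can be approximated with the standalone sketch below. Both function names are hypothetical. The exact slug rules and the "-" separator before the ORCID suffix are assumptions, so IDs produced by the library may differ in detail.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Lowercase ASCII slug, e.g. "Marcel Bollmann" becomes "marcel-bollmann"."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def make_person_id(name: str, orcid: str, taken_ids) -> str:
    """Slug of the name; on a clash, append the last four ORCID characters."""
    base = slugify(name)
    if base not in taken_ids:
        return base
    return f"{base}-{orcid[-4:]}"  # separator "-" is an assumption


orcid = "0000-0003-2598-8150"
free_id = make_person_id("Marcel Bollmann", orcid, taken_ids=set())
clash_id = make_person_id("Marcel Bollmann", orcid, taken_ids={"marcel-bollmann"})
```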

New events

Creating an explicit event works the same way as with other collection items:

event = collection.create_event(id="acl-2049")

An Event's id, if not given, will be automatically generated from the collection ID (e.g., "2049.acl" will generate "acl-2049" for the event).
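The mapping in that example amounts to swapping the two halves of the collection ID. A minimal sketch, assuming collection IDs of the form "<year>.<venue>" (default_event_id is a hypothetical name, and other ID formats are not handled):

```python
def default_event_id(collection_id: str) -> str:
    """Turn "<year>.<venue>" into "<venue>-<year>"."""
    year, venue = collection_id.split(".", 1)
    return f"{venue}-{year}"


event_id = default_event_id("2049.acl")
```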

Since the mixture of implicit and explicit creation of events and linking them to volumes can sometimes become a bit unintuitive (see the documentation of create_event() or the source code of EventIndex.load() for the gory details), it's best to ensure that:

  1. The EventIndex has been loaded before creating a new event (e.g. by running anthology.events.load() or anthology.load_all()).
  2. Any volumes in the same collection are explicitly added to the event via event.add_colocated(volume).
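Both recommendations can be combined in a helper like the one below. It is a sketch under assumptions: create_event_with_volumes is a hypothetical name, and the volumes to link are passed in explicitly by the caller rather than looked up on the collection.

```python
def create_event_with_volumes(anthology, collection, volumes, event_id=None):
    """Create an explicit event and link the given volumes to it."""
    # 1. Make sure the EventIndex is loaded before creating the event.
    anthology.events.load()

    # The ID is optional; if omitted, it is generated from the collection ID.
    if event_id is None:
        event = collection.create_event()
    else:
        event = collection.create_event(id=event_id)

    # 2. Explicitly add each volume of the collection to the event.
    for volume in volumes:
        event.add_colocated(volume)
    return event
```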

Connecting to venues and SIGs

Volumes can be connected to venues by modifying the volume's venue_ids list. New venues can be added by calling VenueIndex.create(), which will also create a corresponding YAML file upon saving. Afterwards, the ID used when instantiating the venue can be used in a volume's venue_ids.

TODO: connecting to SIGs; we may want to refactor how SIGs are represented before introducing this functionality.

Saving changes

Rule of thumb

Call anthology.save_all() to save all metadata changes.

Calling save_all() will write XML and YAML files to the Anthology's data directory, with the following caveats:

  • Collections will track if they have been modified to prevent writing XML files unnecessarily.

  • Saving a collection manually can be done by calling Collection.save().

  • Saving a collection uses a minimal-diff algorithm by default to avoid introducing "noise" in the diffs, i.e. changes to the XML that do not make a semantic difference, such as reordering certain tags, attributes, or introducing/deleting certain empty tags. It is also guaranteed to be non-destructive through integration tests on the entire Anthology data.

  • YAML files will always be written. Serializing all YAML files is much faster than serializing all XML files, so they are written unconditionally, without tracking changes.
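The change tracking described in the first caveat resembles a classic "dirty flag". The following standalone sketch illustrates the idea only: ToyCollection and this save_all function are hypothetical stand-ins, not the library's implementation, and a counter replaces the actual writing of XML files.

```python
class ToyCollection:
    """Toy stand-in for illustration only, NOT the library's Collection."""

    def __init__(self, collection_id: str):
        self.id = collection_id
        self.is_modified = False
        self.times_written = 0
        self._title = ""

    @property
    def title(self) -> str:
        return self._title

    @title.setter
    def title(self, value: str) -> None:
        if value != self._title:
            self._title = value
            self.is_modified = True   # remember that a save is needed

    def save(self) -> None:
        self.times_written += 1       # stand-in for writing the XML file
        self.is_modified = False


def save_all(collections) -> list:
    """Save only modified collections; return the IDs that were written."""
    written = []
    for collection in collections:
        if collection.is_modified:
            collection.save()
            written.append(collection.id)
    return written


untouched = ToyCollection("2022.acl")
changed = ToyCollection("2049.acl")
changed.title = "Proceedings of the ..."
first_pass = save_all([untouched, changed])    # only the changed one is written
second_pass = save_all([untouched, changed])   # nothing left to write
```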