Openlibrary: Merge works with same title and spelling differences in author name

Created on 25 Feb 2019  ·  5 Comments  ·  Source: internetarchive/openlibrary

Description

Some editions are not merged to the work they belong to (and new unnecessary work pages are created) due to minor differences in the author name spelling.

Evidence

Lacapra vs. LaCapra kept these two separate:
https://openlibrary.org/works/OL8382164W
https://openlibrary.org/works/OL2731955W

Expectation

I think an automatic merge is in order for such minor mistakes/differences in spelling of either title or author name.

Proposal & Constraints

A case-insensitive comparison would fix this specific case, I believe. Computing a Levenshtein distance may be trickier, or it would need to be very restrictive (max 1 character difference?) given middle names; cf. https://github.com/internetarchive/openlibrary/issues/77#issuecomment-372389677
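To make the proposal concrete, here is a minimal sketch of both ideas: a case-insensitive comparison, plus an optional and deliberately tight Levenshtein threshold. The function names `edit_distance` and `same_author_name` and the `max_distance` parameter are made up for illustration; this is not how Open Library currently compares names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the usual row-by-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def same_author_name(a: str, b: str, max_distance: int = 0) -> bool:
    """Hypothetical helper: case-insensitive match, optionally fuzzy.

    max_distance=0 means exact match after case folding only.
    """
    a_key, b_key = a.strip().casefold(), b.strip().casefold()
    if a_key == b_key:
        return True
    return max_distance > 0 and edit_distance(a_key, b_key) <= max_distance
```

With the default threshold, `same_author_name("Lacapra", "LaCapra")` matches. A threshold of 1 still keeps "John Smith" and "John A. Smith" apart, since the added middle initial costs three edits, which is the middle-name concern raised above.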

Doing merges manually is very tedious, if at all feasible; cf. https://github.com/internetarchive/openlibrary/issues/684 https://github.com/internetarchive/openlibrary/issues/805

Data Triage 3 Bug merging

All 5 comments

The issue isn't just capitalization. It is also a matter of accents, whitespaces, translations, transliterations, and codespace normalizations. We simply must move away from using spelling as the identifier for an authority. There's a sound reason for using VIAF, ISNI, or Wikidata identifiers: simple spelling cannot reliably distinguish author identities.
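For the accent, whitespace, and code-point normalization part of this, a rough sketch of a comparison key might look like the following. `normalize_name` is a hypothetical helper, not existing Open Library code, and it does nothing for translations or transliterations, which is where shared identifiers would still be needed.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Hypothetical comparison key for an author name."""
    # NFKD splits accented letters into base letter + combining mark
    # and folds compatibility forms (e.g. full-width letters).
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Case-fold, then collapse any run of whitespace to a single space.
    return " ".join(without_marks.casefold().split())
```

Names that differ only in case, accents, spacing, or composed versus decomposed Unicode forms then map to the same key.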

LeadSongDog, 28/02/19 20:06:

> The issue isn't just capitalization. It is also a matter of accents,
> whitespaces, translations, transliterations, and codespace
> normalizations.

Sure, but I wanted to avoid an overbroad issue as this one is easier to
fix than the general case.

> We simply must move away from using spelling as the
> identifier for an authority. There's a sound reason for using VIAF,
> ISNI, or Wikidata identifiers: simple spelling cannot reliably
> distinguish author identities.

But then VIAF clusters use spelling comparisons just like OpenLibrary,
and it's not trivial to connect every record to a Wikidata ID.

Even identical spelling of author and title does not reliably indicate that the works are the same. We have many problem titles that are very common, such as "Journal" or "Works". We also have some very common (often incomplete) author names such as "Smith" or "Brown". Unless a human user makes the comparison between two author records, we won't be able to trust that they refer to the same identity.
I agree that ISNI or Wikidata would be more reliable than VIAF, but any of them would be better than the simple text comparison we have now. This is not a new issue; see #853, for instance, or even earlier ones.

I'll lean on @hornc's assessment to decide whether to subsume this under #853 (this also relates to work @cdrini is doing on Solr), or whether there is bandwidth to do a stopgap solution for this specific case.

We have ~10 issues all surrounding merging (works, editions, authors). I think this is somewhat blocked on our merging infrastructure (e.g. #2553). Let's track this as related to #2114 and close this issue.

There is no clear beginning and end to this issue -- it is a proposal that we merge works w/ similar title and author name. We can also use isbn, ocaid, lccn, year, and several other fields to do this at scale.
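As a rough illustration of the identifier-based approach, the sketch below groups edition records that share an isbn, ocaid, or lccn value. The record shape (a dict with a `key` and list-valued identifier fields) and the names `ID_FIELDS` and `candidate_groups` are assumptions for the example, not the actual Open Library schema or merge code.

```python
from collections import defaultdict

# Identifier fields named in the comment above; the list-valued
# layout of each field is an assumption for this sketch.
ID_FIELDS = ("isbn", "ocaid", "lccn")


def candidate_groups(editions):
    """Return {(field, value): {edition keys}} for shared identifiers."""
    by_identifier = defaultdict(set)
    for edition in editions:
        for field in ID_FIELDS:
            for value in edition.get(field, []):
                by_identifier[(field, value)].add(edition["key"])
    # Only identifiers seen on more than one edition suggest a merge.
    return {
        ident: keys
        for ident, keys in by_identifier.items()
        if len(keys) > 1
    }
```

The output is only a list of merge candidates; deciding whether the grouped editions really belong to one work would still need title and author checks or human review.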

Closing for now.
