Openlibrary: Method to convert LCCs to LCC class names

Created on 23 Apr 2020 · 5 Comments · Source: internetarchive/openlibrary

LCCs can be displayed as a ~path down the classification tree. These provide useful information which we want to display to the user. In order to do that, we need to be able to decode the LCC into classes. (This issue split off from #3290)

Describe the problem that you'd like solved

Want to be able to programmatically get the data on the right:

| Sample LCC from a real book | Expected Result |
| -- | -- |
| F1047 .C95 | [<br>("History of the Americas", (F)),<br>("British American (including Canada)", (F1001, F1145.2)),<br>("British America", (F1001, F1145.2)),<br>("Canada", (F1001, F1145.2)),<br>("Maritime Provinces", (F1035.8)),<br>("Prince Edward Island", (F1046, F1049.7)),<br>] |
| NC760 .B2813 2004 | [<br>("Visual Arts", (N)),<br>("Drawing. Design. Illustration", (NC)),<br>("Special subjects", (NC760, NC825)),<br>] |
| QH81 .C3525 1996 | [<br>("Science", (Q)),<br>("Natural History - Biology", (QH)),<br>("Natural History (General)", (QH1, QH278.5)),<br>] |
| RF290 .E73 2009 | [<br>("Medicine", (R)),<br>("Otorhinolaryngology", (RF)),<br>("Otology. Diseases of the ear", (RF110, RF320)),<br>] |
| NB699.N4 B4 1969b | [<br>("Visual Arts", (N)),<br>("Sculpture", (NB)),<br>("History", (NB60, NB1115)),<br>] |

See https://github.com/internetarchive/openlibrary/issues/3290 for more examples; note that the table there is missing the first LCC class.

Proposal & Constraints

[ ] Need a function that, given a human-entered LCC string from Open Library, returns a list of LCC classes
[ ] Each class should also include either a range of LCCs or an LCC prefix (see examples above)
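To make the two requirements concrete, here is a minimal sketch of what such a function might look like. Everything in it is an assumption for illustration: the name `lcc_to_classes`, the `OUTLINE` table (a tiny excerpt hand-copied from the examples above, not a real data source), and the choice to return prefixes as 1-tuples and ranges as 2-tuples.

```python
import re
from typing import List, Tuple

# Hypothetical excerpt of the outline: (letters, start, end, name).
# start/end of None means the entry covers the whole letter prefix.
OUTLINE = [
    ("N", None, None, "Visual Arts"),
    ("NC", None, None, "Drawing. Design. Illustration"),
    ("NC", 760, 825, "Special subjects"),
    ("Q", None, None, "Science"),
    ("QH", None, None, "Natural History - Biology"),
    ("QH", 1, 278.5, "Natural History (General)"),
]

# Class letters followed by an optional (possibly decimal) class number.
LCC_RE = re.compile(r"^\s*([A-Za-z]{1,3})\s*(\d+(?:\.\d+)?)?")


def lcc_to_classes(lcc: str) -> List[Tuple[str, Tuple[str, ...]]]:
    """Return (class name, prefix-or-range) pairs, outermost class first."""
    m = LCC_RE.match(lcc)
    if not m:
        return []
    letters = m.group(1).upper()
    number = float(m.group(2)) if m.group(2) else None

    result = []
    for prefix, start, end, name in OUTLINE:
        if start is None:
            # Prefix entry: matches any LCC whose letters begin with it.
            if letters.startswith(prefix):
                result.append((name, (prefix,)))
        elif prefix == letters and number is not None and start <= number <= end:
            # Range entry: matches when the class number falls inside it.
            result.append((name, (f"{prefix}{start:g}", f"{prefix}{end:g}")))
    return result


print(lcc_to_classes("NC760 .B2813 2004"))
```

The sketch relies on `OUTLINE` being listed from general to specific, so the result comes out in tree order without any sorting. A real implementation would load the full outline from a data file rather than a literal list.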

Notes:

For this stage, although LCCs provide information beyond the first number (e.g. NB699.A14), this feature will be considered complete once classes are given for the LCC up to, but not including, the first cutter number (i.e. not including "A14" in "NB699.A14"). Decoding the rest is an expansion we can do in future issues.
Optional expansion (not required to close this issue; can be done in a future issue): should pass the LCC class names through i18n.
The examples above are generated using https://www.loc.gov/catdir/cpso/lcco/ . The result doesn't have to be _identical_ to the above, but it should be very similar.
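The "up to, but not including, the first cutter number" rule can be sketched as a small parsing step. This is only an illustration under assumptions: the helper name `class_part` and the regular expression are hypothetical, and human-entered LCCs may well come in forms this pattern does not cover.

```python
import re
from typing import Optional, Tuple

# Matches the class letters and the class number, and stops there.
# A cutter such as ".A14" starts with a dot followed by a LETTER, so the
# optional decimal part (a dot followed by DIGITS) never swallows it.
CLASS_PART_RE = re.compile(
    r"""^\s*
    (?P<letters>[A-Za-z]{1,3})      # class letters, e.g. NB
    \s*
    (?P<number>\d+(?:\.\d+)?)?      # class number, e.g. 699 or 1145.2
    """,
    re.VERBOSE,
)


def class_part(lcc: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (letters, number) for the part of an LCC before any cutter.

    Returns None when the string does not start with class letters.
    The number is kept as a string so "1145.2" is not altered by
    float formatting.
    """
    m = CLASS_PART_RE.match(lcc)
    if m is None:
        return None
    return m.group("letters").upper(), m.group("number")


for sample in ["NB699.A14", "NB699.N4 B4 1969b", "F1047 .C95", "QA76.9.A25"]:
    print(sample, "->", class_part(sample))
```

Keeping this step separate from the class lookup means later issues can extend the parser to cutters without touching the outline data.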

Additional context

LCC breakdown: https://www.loc.gov/catdir/cpso/lcco/
Crash course LCCs: https://github.com/internetarchive/openlibrary/blob/master/openlibrary/utils/lcc.py
In depth breakdown of LCCs: https://www.terkko.helsinki.fi/files/9666/classify_trnee_manual.pdf

Stakeholders

@cclauss @BrittanyBunk

Source

cdrini

All 5 comments

@cdrini There are two outlines, the LCCO and the schedule outlines. @cclauss was using the schedule outlines: https://www.loc.gov/aba/cataloging/classification/. Should we use the LCCO if @cclauss's work is based on the schedules?

BrittanyBunk on 23 Apr 2020

Although incomplete, the LCCO is much easier to work with, because the schedules have subclasses where the indentation goes both forward and backward, and idk how to visualize or program that in a way that makes sense to viewers and coders. The LCCO only indents forward, so the classes always come after each other (not both before and after each other).

An example would be when it looks like this in the schedules:
------subclass 1
subclass 2
------subclass 3

Like how can that be represented easily? It can't. However, the LCCO can, because it looks like this:
subclass 1
----subclass 2
-------subclass 3

That's easy to represent. The only issue with the LCCO is that it's not the complete list of classes and subclasses. The schedules are the complete one.
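As a rough illustration of why forward-only indentation is easy to handle, here is a sketch that turns dash-indented lines like the ones above into root-to-leaf paths using a stack. The function name `outline_paths` and the dash-prefix input format are assumptions taken from the toy examples in this comment, not from the actual LCCO or schedule files.

```python
from typing import List, Tuple


def outline_paths(lines: List[str]) -> List[List[str]]:
    """Turn dash-indented outline lines into root-to-leaf paths.

    The number of leading dashes is treated as the depth. A stack holds
    the current chain of ancestors; entries at the same or a deeper
    depth are popped before the new line is pushed.
    """
    stack: List[Tuple[int, str]] = []  # (depth, name) of current ancestors
    paths: List[List[str]] = []
    for line in lines:
        name = line.lstrip("-")
        depth = len(line) - len(name)
        while stack and stack[-1][0] >= depth:
            stack.pop()
        stack.append((depth, name.strip()))
        paths.append([n for _, n in stack])
    return paths


lcco_style = ["subclass 1", "----subclass 2", "-------subclass 3"]
for path in outline_paths(lcco_style):
    print(" > ".join(path))
```

Feeding the schedule-style example (`------subclass 1`, `subclass 2`, `------subclass 3`) through the same function leaves "subclass 1" with no parent at all, which is one way to see the representation problem described above.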

That's my current dilemma, where something needs to be sacrificed: 1) completeness, 2) accuracy in representation.

It's up to you and @cclauss which you choose. I think the schedules are the best choice because they're complete and official: we could always find a way to represent the info, but we can't easily get what we're missing.

BrittanyBunk on 23 Apr 2020

I believe @cclauss is using the dumps from https://github.com/thisismattmiller/lcc-pdf-to-json . I think using those seems best, because we can get something working and experiment with it to see how it "feels" :+1: Whatever we choose is not set in stone. We can always adjust it to handle more complexity if we find we need to :)

A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system. - John Gall

cdrini on 23 Apr 2020

👍1

@cdrini agreed. Let's go with what's already being used before taking on more :) That said, what's next?

BrittanyBunk on 23 Apr 2020

Next step is once @cclauss has a method he thinks is ready, he or I can add it to the UI, and put it on dev.openlibrary.org for testing :) Does that seem correct @cclauss ?

cdrini on 23 Apr 2020

👍2 🚀1
