Registry: Import iDigBio collections into GrSciColl

Created on 5 Feb 2020  ·  12 Comments  ·  Source: gbif/registry

Goal(s)

What needs to happen before the actual import

We could do these in a different order of course.

1. Link iDigBio and GrSciColl entries

Since iDigBio describes collections, we should probably:

  1. Match the iDigBio entries to the GrSciColl collections (based on title, code, etc.)
  2. If no match can be found in collections, we should try to find out if the corresponding iDigBio institution is available in GrSciColl.
  3. If we cannot find any match in the GrSciColl collections or institutions, I think we should create both an institution and a collection attached to it (similar to what we discussed for Index Herbariorum: https://github.com/gbif/registry/issues/167). Does that make sense?

Once we have a list of matches, we could add identifiers to the GrSciColl entries to work on the import (similar to what we do in the case of IH).

Who should do the matching: iDigBio or GBIF?

Everyone probably has an idea on how to proceed but for the sake of tracking what is happening, I am writing here the steps of the matching process:

  • [x] Getting the data from iDigBio (from here: http://idigbio.github.io/idb-us-collections/collections.json)
  • [x] Getting the data from GrSciColl (most likely with the collection API)
  • [x] Clean up the data (using OpenRefine for example)
  • [x] Use your favorite algorithm to match the data with the relevant fields.
  • [x] Check manually the fuzzy/suspicious matches.
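A minimal sketch of the clean-up, matching and manual-check steps above, assuming both datasets have already been downloaded and parsed into lists of dicts. The keys `name` and `code`, the helper names and the 0.85 threshold are illustrative assumptions rather than the actual iDigBio or GrSciColl schema, and `difflib` from the standard library simply stands in for "your favorite algorithm".

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(title.split())


def best_match(entry, candidates, threshold=0.85):
    """Return (candidate, score, needs_review) for the closest title, or None.

    Only an identical normalized title combined with an identical code is
    treated as safe; every other match above the threshold is flagged so
    that it can be checked manually.
    """
    target = normalize(entry["name"])
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, target, normalize(cand["name"])).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best is None or best_score < threshold:
        return None
    same_code = bool(entry.get("code")) and entry.get("code") == best.get("code")
    needs_review = not (best_score == 1.0 and same_code)
    return best, best_score, needs_review
```

Entries for which `best_match` returns `None` would be the candidates for new GrSciColl records, and those with `needs_review` set would go on the list for manual checking.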

Now who will do what?

2. Agree on the mapping of iDigBio and GrSciColl fields

The models between iDigBio and GrSciColl seem pretty similar. Here is how we propose to map the fields. Could you go over this and let us know if you have any comment?

iDigBio | GrSciColl
-- | --
Institution | Mapped to "Institution" in the Collection entity, and to "Name" if used to create an institution
Collection | Name in Coll
Recordsets | Set as a MachineTag (since it is for internal use) in coll
RecordsetQuery | MachineTag in coll
Institution Code | Mapped to "Code" in Institution
Collection Code | Mapped to "Code" in Collection
Collection Uuid | Added as an identifier
Collection Lsid | Added as an identifier
Collection Url | Homepage in Coll
Collection Catalog Url | Catalogue URL in Coll
Description | Description in Coll
DescriptionForSpecialists | Concatenated to Description in Coll (or new field?)
CataloguedSpecimens | Number of Specimen in Coll
KnownToContainTypes | Discard? (the field is used less than 100 times) Is it necessary for internal use? In that case, we can add it as a machineTag.
TaxonCoverage | Taxonomic coverage in Coll
Geographic Range | Geographic coverage in Coll
CollectionExtent | Discard? (it seems like in most cases it contains a string with the same value as cataloguedSpecimens)
Contact | Mapped to Staff Name
Contact Role | Mapped to Staff Position
Contact Email | Mapped to Staff Email
Mailing Address | Mailing Address in Coll
Mailing City | Mailing City in Coll
Mailing State | Mailing State in Coll
Mailing Zip | Mailing Postal Code in Coll
Physical Address | Physical Address in Coll
Physical City | Physical City in Coll
Physical State | Physical State in Coll
Physical Zip | Physical Postal Code in Coll
UniqueNameUUID | Added as identifier in inst
AttributionLogoURL | New field?
ProviderManagedID | Added as identifier
DerivedFrom | Added as MachineTag if it is for internal use?
SameAs | Added as identifier
Flags | Added as MachineTag
PortalDisplay | Added as MachineTag
Lat | Latitude in Institution
Lon | Longitude in Institution
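To make the proposed mapping easier to review, here is a rough sketch of how part of the table could be expressed as a lookup and applied to one iDigBio record. All key names on both sides (`collection_code`, `catalogUrl`, `numberSpecimens`, the `idigbio.org` namespace, and so on) are assumptions for illustration and would need to be checked against the collections.json readme and the GrSciColl API. Concatenating DescriptionForSpecialists is only one of the two options raised in the table.

```python
# Hypothetical key names on both sides; not the confirmed schemas.
FIELD_MAP = {
    "collection": "name",
    "collection_code": "code",
    "collection_url": "homepage",
    "collection_catalog_url": "catalogUrl",
    "description": "description",
    "taxonCoverage": "taxonomicCoverage",
    "geographicRange": "geography",
    "cataloguedSpecimens": "numberSpecimens",
}
IDENTIFIER_FIELDS = ["collection_uuid", "collection_lsid", "providerManagedID", "sameAs"]
MACHINE_TAG_FIELDS = ["recordsets", "recordsetQuery", "derivedFrom", "flags", "portalDisplay"]


def to_grscicoll_collection(record):
    """Build a GrSciColl-style collection dict from one iDigBio record."""
    collection = {"identifiers": [], "machineTags": []}
    # Plain one-to-one fields; empty values are skipped.
    for source, target in FIELD_MAP.items():
        value = record.get(source)
        if value not in (None, ""):
            collection[target] = value
    # Option from the table: append DescriptionForSpecialists to Description.
    extra = record.get("descriptionForSpecialists")
    if extra:
        base = collection.get("description", "")
        collection["description"] = f"{base}\n\n{extra}".strip()
    for field in IDENTIFIER_FIELDS:
        if record.get(field):
            collection["identifiers"].append(
                {"type": field, "identifier": record[field]}
            )
    for field in MACHINE_TAG_FIELDS:
        if record.get(field):
            collection["machineTags"].append(
                {"namespace": "idigbio.org", "name": field, "value": str(record[field])}
            )
    return collection
```

Contact, address and institution-level fields are left out of the sketch to keep it short; they would follow the same pattern.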

3. Decide what to do when there is an overlap between IH and iDigBio

As mentioned earlier, we are working on synchronising Index Herbariorum and GrSciColl (https://github.com/gbif/registry/issues/167). There is a partial overlap between iDigBio and IH.

What should we do in these cases?
I suggest overwriting the information for the fields provided by IH (the IH value overwrites the iDigBio or GrSciColl value) and keeping the fields that come from iDigBio only.
If the iDigBio record is the most up to date, we would create a GitHub issue and then send the latest update to IH.
Would that be ok?
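As a sketch of the precedence rule suggested above, assuming each record is a flat dict of field values: IH values overwrite whatever is already there, and fields that IH does not provide are kept. The function name is hypothetical, and treating empty IH values as "not provided" is an assumption.

```python
def apply_ih(existing, ih):
    """Return a copy of `existing` with IH-provided fields overwritten.

    `existing` holds the GrSciColl/iDigBio values. Fields that IH leaves
    out (or leaves empty) are kept unchanged.
    """
    merged = dict(existing)
    for field, value in ih.items():
        if value not in (None, ""):
            merged[field] = value
    return merged
```

For example, `apply_ih({"name": "Old", "recordsets": "abc"}, {"name": "New"})` keeps `recordsets` from iDigBio and takes `name` from IH.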


All 12 comments

regarding part 1:

As far as who performs the work, I respectfully think it would be best and most expedient if GBIF is able to devote the time to this. iDigBio/ACIS IT is still short by 1 team member and, despite our feelings that the resulting product will work much better for everyone, I don't think we could guarantee that we'd be able to commit to it anytime soon.

Here are some other notes for section 1 of this issue:

  • 1-3 on your list make sense, including the proposed solution in 3 for when no matches can be found
  • for matching, it might be possible to match from GBIF's institution code to collections.json institution code

  • based on existing documentation of collections.json (in the repo readme), the institution_lsid is mapped to a "GRBio LSID or coolURI for the institution LSID" if found, otherwise is blank

  • other matches will likely need to be string-based match algorithms. A potentially helpful note for matching/verification purposes is that the recordset uuid in collections.json will match the recordset uuid served from our API.

Part 2:
The individual records in iDigBio’s collections.json are Institution-Collection records. GBIF properly breaks Institution and Collection out into separate entities. See attached diagram for intended hierarchy.

[Diagram: intended Institution/Collection hierarchy]

Note: there are field definitions in the readme of: https://github.com/iDigBio/idb-us-collections

Comments on individual mappings:

“UniqueNameUUID Added as identifier” - this appears to be intended as an "institution" UUID in a hierarchy of iDigBio records but does not seem to have been implemented. Keep as identifier in GBIF system.

recordsetQuery: This generates a link to the iDigBio recordset (e.g., https://www.idigbio.org/portal/recordsets/ea12da76-1b2e-4944-8709-1de3af1c65e2). This field can be discarded if you are generating links to the recordset another way.

Recordsets - Reminder: this is our parent object for individual records in our system

KnownToContainTypes: this seems okay to discard.

CollectionExtent: can be copied into CataloguedSpecimens where CataloguedSpecimens is blank, but it does not need to be kept as a separate field (discard).
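The CollectionExtent rule could look roughly like this, assuming the record is a dict and that the keys are spelled `collectionExtent` and `cataloguedSpecimens` (an assumption about the JSON key names):

```python
def fill_catalogued_specimens(record):
    """Copy collectionExtent into a blank cataloguedSpecimens, then drop it."""
    result = dict(record)
    if result.get("cataloguedSpecimens") in (None, ""):
        extent = result.get("collectionExtent")
        if extent not in (None, ""):
            result["cataloguedSpecimens"] = extent
    # The extent is not kept as a separate field.
    result.pop("collectionExtent", None)
    return result
```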

“attributionLogoURL, providerManagedID, derivedFrom” - note that these are Audubon Core terms

Regarding part 3:

We are okay with the proposed method of integrating IH and iDigBio data. To help determine which record is the most recent, IH or iDigBio, you can use the commit date of an individual file in the iDigBio repo as an added/modified date.

The way that repository works is that a human creates/updates a chunk of json named ./collections/{collection_uuid}.json and commits. The software workflow then runs tests and aggregates that json chunk into the full collections.json. An example individual json file would be:

https://github.com/iDigBio/idb-us-collections/blob/master/collections/001c5234-048b-11e5-b0ee-002315492bbc
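One possible way to read that commit date, assuming a local clone of the repository and `git` on the PATH. The function names are hypothetical, and letting IH win ties is an assumption. `%cI` is git's strict ISO 8601 committer-date format.

```python
import subprocess
from datetime import datetime


def last_commit_date(repo_dir, path):
    """Return the committer date of the latest commit touching `path`.

    Returns None when the file has no history in the checked-out branch.
    """
    out = subprocess.run(
        ["git", "log", "-1", "--format=%cI", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Normalise a trailing "Z" so older Python versions can parse it.
    return datetime.fromisoformat(out.replace("Z", "+00:00")) if out else None


def most_recent_source(ih_modified, idigbio_committed):
    """Return "iDigBio" if its commit is strictly newer, otherwise "IH".

    Both arguments must be comparable datetimes (both timezone-aware or
    both naive).
    """
    return "iDigBio" if idigbio_committed > ih_modified else "IH"
```

A call such as `last_commit_date("idb-us-collections", "collections/001c5234-048b-11e5-b0ee-002315492bbc")` would then give the date to compare with the IH modification date.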

Important Note: The collections.json file that actually gets loaded and used is served from the json-index or gh-pages branch (it gets pushed to both) and not the master branch. For instance:

https://raw.githubusercontent.com/iDigBio/idb-us-collections/json-index/collections.json

or

http://idigbio.github.io/idb-us-collections/collections.json

I hope that all of this helps. Please feel free to @ us for additional questions or clarification.

@roncanepa @nrejack I was checking the mappings, and it looks like AttributionLogoURL is the only iDigBio field we're missing in our registry. However, I checked the collections.json file and noticed that this field is always empty. Should we still add it to our registry, or can we discard it too?

@asturcon We picked this field up from Audubon Core, but we agreed that you can discard the field since we are not doing anything with it.

Many thanks for your replies @roncanepa and @nrejack !
In that case, we will get started on [1. Link iDigBio and GrSciColl entries]. We will do as much as possible automatically and send you and Cat the things that might need manual checking. Is that OK with you?

Fine with me, send away! Thanks so much, everyone!!

Hey @CatChapman, Morten has been working on matching iDigBio and GrSciColl entries: https://github.com/gbif/registry/issues/187
It turns out that it makes more sense to first match everything to GrSciColl institutions, because these are the entries for which we have many more details and identifiers. Once we have the institution matches, we can take a look at the collections and match those as well.

Morten described his whole matching process and results on the issue linked above but here are the highlights:

  1. Match the iDigBio entries based on the IRN
  2. Match the remaining iDigBio entries based on other identifiers
  3. Match the remaining iDigBio entries based on title and code (note that the titles were processed to facilitate the matching)
  4. Match the remaining iDigBio entries based on city and code
  5. Match the remaining iDigBio entries based on title alone when there is no iDigBio institution code
  6. Match the remaining iDigBio entries based on title (despite conflicting codes)
  7. Match the remaining iDigBio entries manually

This leaves 235 iDigBio entries unmatched for which we would create new entries in GrSciColl.
Now we need your help to check the matching! Could you go over https://github.com/gbif/registry/issues/187 and take a look at the matching result? (We can also provide you with a spreadsheet if it is more convenient).
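For anyone reviewing the results, the general shape of such a cascade (each pass only looks at what the previous passes left over) can be sketched as below. This is not Morten's actual code. The record keys (`irn`, `title`, `code`, `city`), the helper names and the rule of accepting only unambiguous hits are assumptions, and the identifier and manual passes are left out.

```python
def by(*fields):
    """Build a key function from record fields; None if any field is empty."""
    def key_of(record):
        values = tuple(str(record.get(f) or "").strip().lower() for f in fields)
        return values if all(values) else None
    return key_of


def cascade_match(entries, institutions, passes):
    """Run successive matching passes over the still-unmatched entries.

    `passes` is an ordered list of (label, key_function) pairs.
    Returns (matches, remaining), where each match is a tuple of
    (entry, institution, label of the pass that matched it).
    """
    matches, remaining = [], list(entries)
    for label, key_of in passes:
        index = {}
        for inst in institutions:
            key = key_of(inst)
            if key is not None:
                index.setdefault(key, []).append(inst)
        still_unmatched = []
        for entry in remaining:
            hits = index.get(key_of(entry), [])
            if len(hits) == 1:  # accept unambiguous matches only
                matches.append((entry, hits[0], label))
            else:
                still_unmatched.append(entry)
        remaining = still_unmatched
    return matches, remaining


PASSES = [
    ("irn", by("irn")),
    ("title+code", by("title", "code")),
    ("city+code", by("city", "code")),
    ("title", by("title")),
]
```

Whatever is left in `remaining` after the last pass corresponds to the entries that need manual matching or new GrSciColl records.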

Note that we might have some duplicate collections at the beginning as some collection titles can be a bit vague in GrSciColl and we don't always have reliable codes. No worries, we expect to iron them out a bit later.

Morten also documented how we expect to do the merging itself here: https://github.com/gbif/registry/issues/188

@ManonGros WOW! This is great. You guys rock, so much.

A spreadsheet would be fantastic - I just emailed you, so feel free to send it there, or link to it (if it's a Google Sheet, etc) in here.

Will take a peek at #188 now.

Great! I am adding the tab-separated (TSV) file for the matching:
iDigBio_GrSciColl_matches_march2020.tsv.zip

It would be great to get your checks back in a machine-readable format. We suggest adding a column to this file with true/false for each match, along with a "correction" column containing the match you believe to be correct.
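To illustrate what "machine-readable" would allow, here is a small sketch of how the reviewed file could be read back. The column names `idigbio_uuid`, `grscicoll_key`, `match_ok` and `correction` are placeholders, not the headers of the attached file.

```python
import csv
import io


def apply_review(tsv_text):
    """Split reviewed rows into confirmed, corrected and unresolved matches."""
    confirmed, corrected, unresolved = [], [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        verdict = (row.get("match_ok") or "").strip().lower()
        correction = (row.get("correction") or "").strip()
        if verdict == "true":
            confirmed.append(row)
        elif correction:
            # The reviewer supplied the match they believe is right.
            corrected.append({**row, "grscicoll_key": correction})
        else:
            unresolved.append(row)
    return confirmed, corrected, unresolved
```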

Morten's JSON file, updated with input from Cat:
iDigBio_Morten_matches_AND_Cat_addition.json.zip
