We could do these in a different order of course.
Since iDigBio describes collections, we should probably:
Once we have a list of matches, we could add identifiers to the GrSciColl entries to work on the import (similar to what we do in the case of IH).
Everyone probably has an idea on how to proceed but for the sake of tracking what is happening, I am writing here the steps of the matching process:
Now who will do what?
The models between iDigBio and GrSciColl seem pretty similar. Here is how we propose to map the fields. Could you go over this and let us know if you have any comment?
iDigBio | GrSciColl
-- | --
Institution | Mapped to "Institution" in the Collection entity, and to "Name" if used to create an institution
Collection | Name in Coll
Recordsets | Set as a MachineTag (since it is for internal use) in coll
RecordsetQuery | MachineTag in coll
Institution Code | Mapped to "Code" in Institution
Collection Code | Mapped to "Code" in Collection
Collection Uuid | Added as an identifier
Collection Lsid | Added as an identifier
Collection Url | Homepage in Coll
Collection Catalog Url | Catalogue URL in Coll
Description | Description in Coll
DescriptionForSpecialists | Concatenated to Description in Coll (or new field?)
CataloguedSpecimens | Number of Specimen in Coll
KnownToContainTypes | Discard? (the field is used fewer than 100 times) Is it necessary for internal use? In that case, we can add it as a MachineTag.
TaxonCoverage | Taxonomic coverage in Coll
Geographic Range | Geographic coverage in Coll
CollectionExtent | Discard? (it seems like in most cases it contains a string with the same value as cataloguedSpecimens)
Contact | Mapped to Staff Name
Contact Role | Mapped to Staff Position
Contact Email | Mapped to Staff Email
Mailing Address | Mailing Address in Coll
Mailing City | Mailing City in Coll
Mailing State | Mailing State in Coll
Mailing Zip | Mailing Postal Code in Coll
Physical Address | Physical Address in Coll
Physical City | Physical City in Coll
Physical State | Physical State in Coll
Physical Zip | Physical Postal Code in Coll
UniqueNameUUID | Added as identifier in inst
AttributionLogoURL | New field?
ProviderManagedID | Added as identifier
DerivedFrom | Added as MachineTag if it is for internal use?
SameAs | Added as identifier
Flags | Added as MachineTag
PortalDisplay | Added as MachineTag
Lat | Latitude in Institution
Lon | Longitude in Institution
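
To make the proposal easier to review, here is a minimal Python sketch of how part of the table above could be expressed as a lookup. It covers only a subset of the rows. The keys follow the left column of the table; the target labels (`"code"`, `"homepage"`, etc.) and the function name `map_record` are hypothetical and are not the registry's actual property names or API.

```python
# Hypothetical lookup derived from the table above:
# iDigBio field -> (target entity, target field). Target labels are illustrative only.
FIELD_MAP = {
    "Institution Code": ("institution", "code"),
    "Collection Code": ("collection", "code"),
    "Collection": ("collection", "name"),
    "Collection Url": ("collection", "homepage"),
    "Collection Catalog Url": ("collection", "catalogue_url"),
    "Description": ("collection", "description"),
    "CataloguedSpecimens": ("collection", "number_specimens"),
    "TaxonCoverage": ("collection", "taxonomic_coverage"),
    "Geographic Range": ("collection", "geographic_coverage"),
    "Lat": ("institution", "latitude"),
    "Lon": ("institution", "longitude"),
}
IDENTIFIER_FIELDS = {"Collection Uuid", "Collection Lsid", "UniqueNameUUID",
                     "ProviderManagedID", "SameAs"}
MACHINE_TAG_FIELDS = {"Recordsets", "RecordsetQuery", "DerivedFrom", "Flags",
                      "PortalDisplay"}
# Marked "Discard?" in the table, so still open for discussion.
DISCARDED_FIELDS = {"KnownToContainTypes", "CollectionExtent"}


def map_record(idigbio_record):
    """Sort the fields of one iDigBio record into the buckets proposed in the table."""
    out = {"institution": {}, "collection": {}, "identifiers": [],
           "machine_tags": [], "unmapped": {}}
    for field, value in idigbio_record.items():
        if value in (None, ""):
            continue  # skip empty values
        if field in FIELD_MAP:
            entity, target = FIELD_MAP[field]
            out[entity][target] = value
        elif field in IDENTIFIER_FIELDS:
            out["identifiers"].append((field, value))
        elif field in MACHINE_TAG_FIELDS:
            out["machine_tags"].append((field, value))
        elif field not in DISCARDED_FIELDS:
            out["unmapped"][field] = value  # keep anything not covered for review
    return out
```

Anything that lands in `unmapped` is a field the sketch does not cover and would need a decision.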
As mentioned earlier, we are working on synchronising Index Herbariorum and GrSciColl (https://github.com/gbif/registry/issues/167). There is a partial overlap between iDigBio and IH.
What should we do in these cases?
I suggest overwriting the information for the fields provided by IH (the IH value overwrites the iDigBio or GrSciColl value) and keeping the fields that come from iDigBio only.
If the iDigBio record is the most up to date, we would create a GitHub issue and then send the latest update to IH.
Would that be ok?
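A minimal Python sketch of the overwrite rule suggested above, assuming both records are flat dictionaries keyed by the same field names. The function name `merge_with_ih` is hypothetical, and treating an empty IH value as "not provided" is an assumption rather than something agreed in this thread.

```python
def merge_with_ih(ih_record, other_record):
    """Return a merged record in which IH wins for every field it provides.

    `other_record` is the iDigBio (or existing GrSciColl) record. Fields that
    only exist there are kept as they are.
    """
    merged = dict(other_record)
    for field, value in ih_record.items():
        # Assumption: an empty IH value does not overwrite an existing value.
        if value not in (None, ""):
            merged[field] = value
    return merged
```

For example, an IH record providing only `name` would replace the name and leave an iDigBio-only field such as the recordsets untouched.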
Regarding part 1:
As far as who performs the work, I respectfully think it would be best and most expedient if GBIF is able to devote the time to this. iDigBio/ACIS IT is still short by 1 team member and, despite our feelings that the resulting product will work much better for everyone, I don't think we could guarantee that we'd be able to commit to it anytime soon.
Here are some other notes for section 1 of this issue:
for matching, it might be possible to match from GBIF's institution code to collections.json institution code
based on existing documentation of collections.json (in the repo readme), the institution_lsid is mapped to a "GRBio LSID or coolURI for the institution LSID" if found, otherwise is blank
other matches will likely need to be string-based match algorithms. A potentially helpful note for matching/verification purposes is that the recordset uuid in collections.json will match the recordset uuid served from our API.
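The notes above could be sketched roughly as follows in Python: try an exact institution-code match first, then fall back to a string-similarity match on the name. This is only an illustration, not the matching process actually used. The dictionary keys (`institution_code`, `institution`, `code`, `name`), the function names, and the 0.9 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def match_institution(idigbio_entry, grscicoll_institutions, threshold=0.9):
    """Return (matched institution or None, how it was matched)."""
    # 1. Exact match on institution code, accepted only if it is unambiguous.
    code = (idigbio_entry.get("institution_code") or "").strip().upper()
    if code:
        by_code = [g for g in grscicoll_institutions
                   if (g.get("code") or "").strip().upper() == code]
        if len(by_code) == 1:
            return by_code[0], "code"

    # 2. Fallback: best string similarity on the institution name.
    name = normalise(idigbio_entry.get("institution") or "")
    if not name:
        return None, "unmatched"
    best, best_score = None, 0.0
    for g in grscicoll_institutions:
        score = SequenceMatcher(None, name, normalise(g.get("name") or "")).ratio()
        if score > best_score:
            best, best_score = g, score
    if best is not None and best_score >= threshold:
        return best, "name"
    return None, "unmatched"
```

Name-based matches would still need a manual check, since similar names do not guarantee the same institution.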
Part 2:
The individual records in iDigBio’s collections.json are Institution-Collection records. GBIF properly breaks Institution and Collection out into separate entities. See attached diagram for intended hierarchy.
Note: there are field definitions in the readme of: https://github.com/iDigBio/idb-us-collections
Comments on individual mappings:
“UniqueNameUUID Added as identifier” - this appears to be intended as an "institution" UUID in a hierarchy of iDigBio records but does not seem to have been implemented. Keep as identifier in GBIF system.
recordsetQuery: This generates a link to the iDigBio recordset (e.g., https://www.idigbio.org/portal/recordsets/ea12da76-1b2e-4944-8709-1de3af1c65e2). This field can be discarded if you are generating links to the recordset another way.
Recordsets - Reminder: this is our parent object for individual records in our system
KnownToContainTypes: this seems okay to discard.
CollectionExtent: can be copied into CataloguedSpecimens where CataloguedSpecimens is blank, but it does not need to be kept as a separate field (discard).
“attributionLogoURL, providerManagedID, derivedFrom” - note that these are Audubon Core terms
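The CollectionExtent comment above could be handled with something like the following Python sketch. The key names `cataloguedSpecimens` and `collectionExtent` are assumed spellings of the fields in the JSON, and the function name is hypothetical.

```python
def fill_specimen_count(record):
    """Copy collectionExtent into cataloguedSpecimens when the latter is blank,
    then drop collectionExtent. Returns a new dict; the input is not modified."""
    updated = dict(record)
    if updated.get("cataloguedSpecimens") in (None, ""):
        extent = updated.get("collectionExtent")
        if extent not in (None, ""):
            updated["cataloguedSpecimens"] = extent
    updated.pop("collectionExtent", None)
    return updated
```

The value is copied as-is. If the extent is free text rather than a number, it would still need cleaning before it fits a numeric field.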
Regarding part 3:
We are okay with the proposed method of integrating IH and iDigBio data. To help determine which record is the most recent, IH or iDigBio, you can use the commit date for an individual file in the iDigBio repo as an added/modified date.
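One possible way to read that commit date, sketched in Python with only the standard library and the public GitHub REST "list commits" endpoint filtered by `path`. The function names are hypothetical, the file path pattern `collections/{collection_uuid}.json` is taken from the repository layout, and the sketch assumes the individual files live on the repository's default branch. Unauthenticated requests are rate-limited.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime

GITHUB_API = "https://api.github.com"
REPO = "iDigBio/idb-us-collections"


def commits_url(collection_uuid, repo=REPO):
    """URL asking GitHub for the newest commit touching one collection file."""
    path = f"collections/{collection_uuid}.json"
    query = urllib.parse.urlencode({"path": path, "per_page": 1})
    return f"{GITHUB_API}/repos/{repo}/commits?{query}"


def last_commit_date(commits):
    """Committer date of the first (newest) commit in the response, or None."""
    if not commits:
        return None
    stamp = commits[0]["commit"]["committer"]["date"]  # e.g. "2020-03-01T12:00:00Z"
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def fetch_last_modified(collection_uuid):
    """Network call: newest commit date for one collection file."""
    with urllib.request.urlopen(commits_url(collection_uuid)) as response:
        return last_commit_date(json.load(response))


def idigbio_is_newer(idigbio_date, ih_date):
    """True only when both dates are known and the iDigBio one is later."""
    return (idigbio_date is not None and ih_date is not None
            and idigbio_date > ih_date)
```

The IH date passed to `idigbio_is_newer` must be timezone-aware as well, otherwise Python refuses the comparison.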
The way that repository works is that a human creates/updates a chunk of json named ./collections/{collection_uuid}.json and commits. The software workflow then runs tests and aggregates that json chunk into the full collections.json. An example individual json file would be:
Important Note: The collections.json file that actually gets loaded and used is served from the json-index or gh-pages branch (it gets pushed to both) and not the master branch. For instance:
https://raw.githubusercontent.com/iDigBio/idb-us-collections/json-index/collections.json
or
http://idigbio.github.io/idb-us-collections/collections.json
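For reference, a small Python sketch of loading the aggregated file from the json-index branch and indexing it for lookups. It assumes that collections.json is a JSON array of record objects and that each record carries a `collection_uuid` key; both are assumptions to check against the readme, and the function names are hypothetical.

```python
import json
import urllib.request

# URL of the json-index copy, as given above.
COLLECTIONS_URL = ("https://raw.githubusercontent.com/iDigBio/"
                   "idb-us-collections/json-index/collections.json")


def parse_collections(text):
    """Parse the aggregated file; assumed to be a JSON array of records."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of collection records")
    return records


def index_by_uuid(records, key="collection_uuid"):
    """Build a lookup from collection UUID to record, skipping records without one."""
    return {r[key]: r for r in records if r.get(key)}


def fetch_collections(url=COLLECTIONS_URL):
    """Network call: download and parse the aggregated collections file."""
    with urllib.request.urlopen(url) as response:
        return parse_collections(response.read().decode("utf-8"))
```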
I hope that all of this helps. Please feel free to @ us for additional questions or clarification.
@roncanepa @nrejack I was checking the mappings and it looks like AttributionLogoURL is the only iDigBio field we're missing in our registry. But I checked the collections.json file and noticed that this field is always empty. Should we still add it to our registry, or can we discard it too?
@asturcon We picked this field up from Audubon Core, but we agreed that you can discard the field since we are not doing anything with it.
Many thanks for your replies @roncanepa and @nrejack !
In that case, we will get started on [1. Link iDigBio and GrSciColl entries]. We will do as much as possible automatically and send you and Cat some things that might need manual checking, is that ok with you?
Fine with me, send away! Thanks so much, everyone!!
Hey @CatChapman, Morten has been working on matching iDigBio and GrSciColl entries: https://github.com/gbif/registry/issues/187
It turns out that it makes more sense to match everything to GrSciColl institutions first, because these are the entries for which we have a lot more details and identifiers. Then, once we have the matches for institutions, we can take a look at the collections and match these as well.
Morten described his whole matching process and results on the issue linked above but here are the highlights:
This leaves 235 iDigBio entries unmatched for which we would create new entries in GrSciColl.
Now we need your help to check the matching! Could you go over https://github.com/gbif/registry/issues/187 and take a look at the matching result? (We can also provide you with a spreadsheet if it is more convenient).
Note that we might have some duplicate collections at the beginning as some collection titles can be a bit vague in GrSciColl and we don't always have reliable codes. No worries, we expect to iron them out a bit later.
Morten also documented how we expect to do the merging itself here: https://github.com/gbif/registry/issues/188
@ManonGros WOW! This is great. You guys rock, so much.
A spreadsheet would be fantastic - I just emailed you, so feel free to send it there, or link to it (if it's a Google Sheet, etc) in here.
Will take a peek at #188 now.
Great! I am adding the tab-separated CSV file for the matching:
iDigBio_GrSciColl_matches_march2020.tsv.zip
It would be great to get your check back in a machine-readable format. We suggest adding a column to this file with true/false for each match, along with a "correction" column giving the match you believe to be correct where needed.
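If it helps, the review columns could be added with a few lines of Python, sketched below. The column names `match_ok` and `correction` and the function name are only suggestions, and the sketch makes no assumption about the existing columns in the file.

```python
import csv
import io


def add_review_columns(tsv_text, verdicts):
    """Append 'match_ok' and 'correction' columns to tab-separated text.

    `verdicts` maps a zero-based data-row index to (is_correct, correction).
    Rows without a verdict get empty cells so they can be filled in later.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames or []) + ["match_ok", "correction"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for index, row in enumerate(reader):
        ok, correction = verdicts.get(index, (None, ""))
        row["match_ok"] = "" if ok is None else str(ok).lower()  # "true"/"false"
        row["correction"] = correction
        writer.writerow(row)
    return out.getvalue()
```

Writing lowercase `true`/`false` keeps the column easy to parse when the file comes back.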
Morten's JSON file updated with input from CAT:
iDigBio_Morten_matches_AND_Cat_addition.json.zip