Registry: Synchronize with Index Herbariorum

Created on 11 Dec 2019  ·  9 Comments  ·  Source: gbif/registry

Index Herbariorum is an authoritative catalog which should be the master source for Herbaria entities. Herbaria records in the registry should be kept in sync with the ongoing editing efforts of IH.

This first iteration of work is deliberately scoped to accommodate the minimal functionality needed to achieve this. Once complete, additional feature requests can be opened as new issues.

It is envisaged the general synchronization will operate as follows:

  • Retrieve all herbaria from Index Herbariorum
  • For each entity, locate the equivalent institution or collection in GRSciColl using the IH IRN

    • If the entity exists and the two records differ, update GRSciColl

    • If the entity does not exist, insert it as an institution with an identifier holding the IH IRN

    • If there is a conflict (e.g. multiple possible matches), notify editors for resolution

  • Create, update or delete the associated staff members for the entities
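
The steps above can be sketched roughly as follows. This is only an illustration of the decision logic, not the registry's implementation: the `Entity` class, the `irn` and `name` keys, and the action labels are all assumptions, and comparing a single name field stands in for a full record comparison.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical, simplified GRSciColl institution or collection."""
    key: str
    name: str
    ih_irns: list = field(default_factory=list)  # identifiers holding IH IRNs


def plan_sync(ih_records, grscicoll_entities):
    """Return (action, irn) pairs: 'create', 'update', 'noop' or 'conflict'."""
    # Index GRSciColl entities by every IH IRN they carry.
    by_irn = {}
    for entity in grscicoll_entities:
        for irn in entity.ih_irns:
            by_irn.setdefault(irn, []).append(entity)

    actions = []
    for record in ih_records:
        matches = by_irn.get(record["irn"], [])
        if not matches:
            actions.append(("create", record["irn"]))    # insert as institution
        elif len(matches) > 1:
            actions.append(("conflict", record["irn"]))  # notify editors
        elif matches[0].name != record["name"]:
            actions.append(("update", record["irn"]))    # IH is the master
        else:
            actions.append(("noop", record["irn"]))
    return actions
```

Returning a plan instead of writing directly keeps the conflict cases visible, so they can be sent to editors before anything is changed.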

A future version may allow the editing of IH entities in GRSciColl. Under that scenario, more complex logic is required when entities differ, likely including notification to GRSciColl and IH staff to resolve the differences.

GRSciColl

All 9 comments

The institutions from IH have a contact field that holds only a phone, an email and a webUrl (http://sweetgum.nybg.org/science/api/v1/institutions/UARK). In a GrSciColl institution/collection we have a contacts field, but its entries are actually person entities (http://api.gbif.org/v1/grscicoll/institution/f7068d69-cf88-42d8-a984-0c4de6ce8579 whose contact is http://api.gbif.org/v1/grscicoll/person/118b48f0-9af9-45ac-8ea9-d8221d7fa2af ).

What should we do with the IH contact? Ignore it? Or add it as a GrSciColl person and link it to the institution/collection? For the latter, a first name is required, so we would need to make one up.

I don't know who can answer this best @timrobertson100 @MortenHofft @ManonGros

Those contact fields are not for a person. They are for the herbarium as an entity, which matters because people come and go. I am quite sure that this would be considered essential from an IH standpoint, and my feeling is that it is important as well. So I would suggest we extend our model instead. But better to check with others as well.

As for people/staff: IH has an endpoint for those as well. They are - as far as I know - only linked by institution codes. In time we should sync those as well. But we might want to discuss our goal for handling contacts of this sort (ORCiD etc.) further. @ManonGros do you have a preferred approach for this?

I like the idea of extending our model.

For the herbarium contacts, I agree with you @MortenHofft , we should extend our model to have something like what we have for the GBIF publishing organisations (see for example "email":["[email protected]"],"phone":["+47 99642071"] in http://api.gbif.org/v1/organization/b670ea7c-48e7-45e4-ba66-5bf01ee4d398).
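
A minimal sketch of what that mapping could look like, assuming the extended institution model gets list-valued `email` and `phone` fields like the organisation example. The `homepage` field name and the function name are assumptions for illustration only. The input keys `phone`, `email` and `webUrl` are the IH contact fields mentioned earlier in the thread.

```python
def contact_to_institution_fields(ih_contact):
    """Map an IH contact block to list-valued fields (hypothetical names)."""
    def as_list(value):
        # IH gives a single string; the proposed fields hold lists.
        value = (value or "").strip()
        return [value] if value else []

    return {
        "email": as_list(ih_contact.get("email")),
        "phone": as_list(ih_contact.get("phone")),
        "homepage": as_list(ih_contact.get("webUrl")),
    }
```

Empty or missing values become empty lists, so an institution without a phone number does not end up with a blank entry.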

For people/staff, I also agree: we should synchronise/import the people as well. Perhaps even before we synchronise the institutions? (I am asking because it would seem logical to update the contacts when synchronising the IH institutions, but this would require having the staff/people up to date.)

As far as I understand, for us staff/people can have a primary institution but be affiliated with several collections and institutions, while for IH one person is associated with one institution code. Plus the information is a bit different (http://api.gbif.org/v1/grscicoll/person/118b48f0-9af9-45ac-8ea9-d8221d7fa2af and http://sweetgum.nybg.org/science/ih/person-details/?irn=131429).

For synchronising people/staff, should we proceed as we do for the institutions, meaning checking the matches semi-automatically first? If yes, how could we link them? There are no identifiers or machine tags for people. Plus as Morten suggested, we could use the ORCiDs when available but I doubt that most people have created one. And even for those who have one, we need to find them first.

I don't know if it is possible at all, but ideally I imagine something like this:

  1. Find potential ORCiDs for all the GrSciColl staff/people (if we have confirmation that the ORCiD is correct for a given person, give it priority when synchronising)
  2. Match and link the IH person list with the GrSciColl staff/people
  3. Update the GrSciColl staff entries if older than IH
  4. Synchronise the GrSciColl institutions with IH (based on the identifiers we use to link them after our matching/checking, e.g. what we did in UAT)

I know it is not that simple, let me know what you think.

Syncing the staff is already in the description of this task, so I was planning to sync them as part of this process. I don't think we need to do anything manually.

EDIT: when I said I don't think we need to do anything manually, I meant that I will try to match them using the name, email or any other representative field, and if I can't match a person to any existing one I will create a new one. I did something similar in the last DB migration, although the matching was not perfect because a lot of staff are duplicated with only a different address or phone. So this matching won't be perfect either; if we want it to be more accurate, we need some manual editing.
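
A rough sketch of that kind of matching, for illustration only. The dictionary keys `name` and `email`, the function names and the email-before-name order are assumptions, not the code used in the migration.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join((text or "").lower().split())


def match_person(ih_person, grscicoll_people):
    """Return the matching GRSciColl person, or None when there is no confident match."""
    email = normalise(ih_person.get("email"))
    name = normalise(ih_person.get("name"))
    # Prefer email: it is more specific than a name.
    if email:
        for person in grscicoll_people:
            if normalise(person.get("email")) == email:
                return person
    # Fall back to the full name, but only when it is unambiguous.
    if name:
        same_name = [p for p in grscicoll_people if normalise(p.get("name")) == name]
        if len(same_name) == 1:
            return same_name[0]
    return None
```

A `None` result covers both "no such person" and "several people share this name". The second case is where duplicated staff records would show up, and where manual editing would help.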

Plus as Morten suggested, we could use the ORCiDs when available but I doubt that most people have created one.

As of Dec 2017, there were 454,000 users in the biological sciences who have created ORCID IDs—one of the three highest adoption rates of any discipline (see Study of ORCID Adoption Across Disciplines and Locations). Tbh, we should commit to this, use the existing infrastructure (including becoming an ORCID member, imo) and encourage members of the community to sign up—the promise being that we can provide value-for-service if they do.

Note that Bloodhound is already using ORCIDs to pull both past and present institutional affiliation, e.g. https://bloodhound-tracker.net/organization/Q1122595. You all will know better how that works, but we could also consider this as (part of?) our approach…

I suggest we move ORCID-related ideas to a new issue so as not to conflate things. This ticket is specifically to get GrSciColl and IH in sync (adding links to ORCID accounts is desirable but not necessary).

(In reply to Kyle Copas, 8 Jan 2020, 14:31; his comment appears in full above.)



Something else to take into account for the synchronisation:
In the long term, we want IH records to be edited directly in IH and then synchronised with GrSciColl.
But right now, we have a handful of editors who have already been editing their GrSciColl records, which means that for those records GrSciColl, not IH, contains the most up-to-date information about a collection/institution.
See this example:

These are only a few cases but it would be nice to not overwrite these entries. For now we should check the modified dates before synchronising and notify IH if the GrSciColl version is more up to date.
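
The date check could look something like the sketch below. It assumes both sides expose a last-modified timestamp as an ISO 8601 string in the same form (both with or both without a UTC offset). The function name and the two action labels are made up for illustration.

```python
from datetime import datetime


def resolve(ih_modified, grscicoll_modified):
    """Decide what to do with one matched pair of records.

    Both arguments are ISO 8601 timestamps; the field names that carry
    them in either API are not specified here.
    """
    ih_time = datetime.fromisoformat(ih_modified)
    grscicoll_time = datetime.fromisoformat(grscicoll_modified)
    if grscicoll_time > ih_time:
        return "notify_ih"        # GRSciColl is newer: do not overwrite
    return "update_grscicoll"     # IH is newer or the same age: IH stays the master
```

For example, `resolve("2019-12-01T00:00:00", "2020-01-08T14:31:00")` returns `"notify_ih"`, because the GrSciColl record was edited after the IH one.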

In production and scheduled to run weekly.
