Registry: Implement a lookup service for GrSciColl collections

Created on 17 Jun 2020 · 11 Comments · Source: gbif/registry

This lookup service is intended to link occurrence data with collections. It will use collections data for the lookup, but this behaviour could be overridden with dataset machine tags.

This service could receive the following parameters:

institution code
institution ID
collection code
collection ID
dataset key
owner institution code ??

If there are machine tags in the dataset we use them and stop the lookup.

The service should return how good the match is (exact, fuzzy, etc.). Exact matches will only happen if the codes match and the IDs either match or are not contradictory (e.g. present on only one side).
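
The exact-match rule as stated in this description could be sketched like this. It is only an illustration under assumptions, not the service's implementation: the function name `classify`, the `NONE` result and the case-sensitive comparison are all hypothetical.

```python
from typing import Optional


def classify(code_in: Optional[str], code_entity: Optional[str],
             id_in: Optional[str], id_entity: Optional[str]) -> str:
    """Return 'EXACT', 'FUZZY' or 'NONE' for one candidate entity (hypothetical helper)."""
    codes_match = code_in is not None and code_in == code_entity
    ids_match = id_in is not None and id_in == id_entity
    # IDs contradict only when both sides carry one and the two differ.
    ids_contradict = (id_in is not None and id_entity is not None
                      and id_in != id_entity)

    if codes_match and not ids_contradict:
        return "EXACT"
    if codes_match or ids_match:
        return "FUZZY"
    return "NONE"
```

With this sketch, a matching code plus an ID present on only one side is still exact, while a matching code with two different IDs drops to fuzzy.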

Anything else? Are there any other parameters that would be useful to take into account?

question GRSciColl

marcos-lg

All 11 comments

I deployed a first version of the collections lookup service to DEV (it doesn't use the machine tags yet) to see if this is what we were expecting.

It returns a list of institution matches and another for collection matches. It's a list because codes are not unique so there may be cases where we can have multiple options and we can't discriminate by any other field.

For each match it shows:

type: it can be exact or fuzzy. exact is only if both the code and identifier match. Otherwise it's fuzzy.
remarks: they are observations to understand how the match was done. The possible values are:
- CODE_MATCH: it doesn't ignore the case
- IDENTIFIER_MATCH: it doesn't ignore the case
- ALTERNATIVE_CODE_MATCH: it doesn't ignore the case
- NAME_MATCH: it ignores the case and removes accents and whitespaces but doesn't do prefix or suffix matches
- PROBABLY_ON_LOAN: it happens when the owner institution and the institution are not the same
- INST_COLL_MISMATCH: it happens when the institution of a collection is not present in the institutions matched.
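
The NAME_MATCH normalisation described above (ignore case, remove accents and whitespace, no prefix or suffix matching) could look roughly like this. The helper names are hypothetical and the use of Unicode NFKD decomposition is an assumption about how accents are stripped.

```python
import unicodedata


def normalize_name(name: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Remove all whitespace and ignore case.
    return "".join(without_accents.split()).casefold()


def is_name_match(a: str, b: str) -> bool:
    # Whole-string equality only: no prefix or suffix matching.
    return normalize_name(a) == normalize_name(b)
```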

When there are institution matches, a collection only matches fuzzily if it belongs to one of the institutions matched. Exact collection matches will always be returned.

You can check these examples to see how the service works:

http://api.gbif-dev.org/v1/grscicoll/lookup?collectionCode=herbarium returns 567 collections
http://api.gbif-dev.org/v1/grscicoll/lookup?collectionCode=herbarium&institutionCode=UNCA returns 1 institution and 1 collection
http://api.gbif-dev.org/v1/grscicoll/lookup?collectionCode=UNCA&institutionCode=UNCA&institutionId=gbif:ih:irn:240135 returns 1 institution and 2 collections
http://api.gbif-dev.org/v1/grscicoll/lookup?collectionCode=UIMNH&collectionId=http://grbio.org/cool/zrdp-fspx&institutionCode=K is an example of an INST_COLL_MISMATCH
http://api.gbif-dev.org/v1/grscicoll/lookup?ownerInstitutionCode=MBMC&institutionCode=K is an example of a PROBABLY_ON_LOAN
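
For anyone scripting these calls, the query strings above can be assembled with the standard library. This sketch only builds the URL and does not call the service; `lookup_url` is a hypothetical helper, and the parameter names are the ones used in the examples.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.gbif-dev.org/v1/grscicoll/lookup"


def lookup_url(**params: str) -> str:
    # Drop empty values so only the supplied fields are sent.
    query = {name: value for name, value in params.items() if value}
    return f"{BASE_URL}?{urlencode(query)}"


url = lookup_url(
    collectionCode="UNCA",
    institutionCode="UNCA",
    institutionId="gbif:ih:irn:240135",
)
```

Note that `urlencode` percent-encodes the colons in the identifier, so the resulting URL looks slightly different from the hand-written ones above.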

Am I missing something @MortenHofft @timrobertson100 ?

marcos-lg on 22 Jul 2020

I've added the machine tag check following the format that was used already in pipelines for now:

Namespace: processing.gbif.org
Names:
- institutionCode: maps an institution code to a GrSciColl institution
- collectionCode: maps a collection code to a GrSciColl collection
- collectionToInstitutionCode: maps a collection code to a GrSciColl institution. I'd probably rename it to collectionCodeToInstitution (TBD).
- institutionToCollectionCode: maps an institution code to a GrSciColl collection. I'd probably rename it to institutionCodeToCollection (TBD).

The value of the tags should follow the pattern {key}:{code}.
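
Reading such a tag value could be sketched as below. `TagMapping` and `parse_tag_value` are hypothetical names, and splitting on the first colon only (so that codes containing a colon survive) is an assumption rather than confirmed behaviour.

```python
from typing import NamedTuple

NAMESPACE = "processing.gbif.org"


class TagMapping(NamedTuple):
    key: str   # GrSciColl entity key
    code: str  # code as it appears in the occurrence data


def parse_tag_value(value: str) -> TagMapping:
    # Split on the first colon only; everything after it is the code.
    key, sep, code = value.partition(":")
    if not sep or not key or not code:
        raise ValueError(f"expected '{{key}}:{{code}}', got {value!r}")
    return TagMapping(key, code)
```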

marcos-lg on 23 Jul 2020

It looks good - I'm very curious to see it applied to actual data.

verbose or not
Should we have a "verbose" option like in the species match?

If we imagine anyone but staff using this, it might be useful to just have a match or none option. Say Plazi using it to create links from collection codes in articles.

// plain lookup - not verbose - will at most return one institution and one collection.
{
  institutionMatch: {
    matchType: 'NONE'
  },
  collectionMatch: {
    matchType: 'FUZZY',
    reasons: ['SAME_NAME', 'SAME_CODE', 'SAME_COUNTRY'],
    entity: {
      ...
    }
  }
}

// verbose option
// could be like the one running in dev
{
  "institutionMatches": [],
  "collectionMatches": [
    ...
  ]
}

on naming
The species match API uses matchType instead of type, and the clusters use reasons instead of remarks.

Real data
Before running it on real data, I wonder if it would be worth just checking the top 500 distinct combinations of institution code/ID and collection code/ID and manually assessing some of them? We might learn something (say, that we need to try flipping ID and code).

What constitutes a match?
When do you imagine this will trigger a match?

Exact only?
Fuzzy but only one result
Exact institution, but only one fuzzy collection?

Country as a disambiguator
Would it make sense to add country as a search param? When indexing occurrences we could add the country of the publisher to disambiguate when there are for example 2 collection matches? Or is that better done by the consumer iterating the results?

MortenHofft on 23 Jul 2020

I just tried to match based on a csv extract I had from some time ago.
The csv is distinct institutionId, institutionCode, datasetKey, publisherKey with an occurrence count for each.

I took anything with more than 2500 occurrences and tried to match them against the service.

105,871,241 occurrences had a single match
 24,553,278 with an exact singular match
 27,113,941 occurrences had multiple matches
 34,998,066 occurrences had no match

2,374 combinations were tested against the service (several can be the same, since the extract also included dataset and publisher)

That isn't a bad start. I haven't evaluated the quality of the matches though.
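
To put the counts above in proportion, the shares can be worked out as follows. This assumes the three top-level buckets (single, multiple, none) are disjoint and that the exact singular matches are a subset of the single matches, which is how the indentation reads.

```python
single = 105_871_241
exact_single = 24_553_278   # assumed to be a subset of `single`
multiple = 27_113_941
no_match = 34_998_066

total = single + multiple + no_match


def share(count: int) -> float:
    # Percentage of all tested occurrences, rounded to one decimal.
    return round(100 * count / total, 1)


for label, count in [
    ("single match", single),
    ("exact single match", exact_single),
    ("multiple matches", multiple),
    ("no match", no_match),
]:
    print(f"{label}: {share(count)}%")
```

Under those assumptions roughly 63% of the tested occurrences get a single match, about 16% get multiple matches and about 21% get none.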

MortenHofft on 23 Jul 2020

👍2

_What constitute a match?_
When do you imagine this will trigger a match?

Exact only?

Fuzzy but only one result

Exact institution, but only one fuzzy collection?

For this you mean for the non-verbose version where we show only 1 match, right?

It could be something like:

For institutions
- Only one exact match
- Only one fuzzy match
For collections
- Only one exact match
- If there was an institution match, only one fuzzy match whose institution is the same as the institution matched
- If there wasn't an institution match, only one fuzzy match

Do you think we should also provide an overall match status?

marcos-lg on 23 Jul 2020

_What constitute a match?_
When do you imagine this will trigger a match?

Exact only?

Fuzzy but only one result

Exact institution, but only one fuzzy collection?

For this you mean for the non-verbose version where we show only 1 match, right?

I meant when using it in pipelines for assigning GrSciColl IDs to occurrences. I had imagined that this service included the decision and all logic. How does it work for other lookup services?

That reminds me, you mentioned the other day that you considered adding all the matched IDs to the occurrence index.
I guess there are 2 possible versions:

1. Adding all possible candidates. That effectively pushes the burden to the UI or the user, and the same specimens would appear under multiple collections.
2. Only adding a link when we have one confident match. The service takes responsibility for the statement. We will be able to match fewer.

I'm more in favour of version 2: only adding a GrSciColl ID to an occurrence when we have 1 confident match, not an array of candidate matches. And if we want more matches, then we ask publishers to add better identifiers to either GrSciColl or the occurrences. Or we add machine tags to the datasets where needed.

If it is useful to have all candidates indexed, could we then consider a separate field for it?

MortenHofft on 24 Jul 2020

Should the service return flags, e.g. FUZZY COLLECTION CODE MATCH or NO COLLECTION CODE MATCH, similar to the species match service?

MortenHofft on 24 Jul 2020

I can think of these flags:

AMBIGUOUS_INSTITUTION: more than 1 institution was found and we couldn't break the tie
AMBIGUOUS_COLLECTION: same as above but for collections
FUZZY_INSTITUTION_MATCH: 1 institution matched but fuzzily
FUZZY_COLLECTION_MATCH: same as above but for collections
INSTITUTION_NAME_USED: the institutionCode field contains the institution name instead of the code
OWNER_INSTITUTION_NAME_USED: same as above but for the owner institution
COLLECTION_NAME_USED: same as above but for collections
NO_COLLECTION_CODE_MATCH: the code provided didn't match
NO_INSTITUTION_CODE_MATCH: same as above
NO_COLLECTION_ID_MATCH: the ID provided didn't match
NO_INSTITUTION_ID_MATCH: same as above
INSTITUTION_COLLECTION_MISMATCH: the collection found doesn't belong to the institution matched

EDIT: the *_NAME_USED flags could maybe be removed, with FUZZY_INSTITUTION_MATCH (or FUZZY_COLLECTION_MATCH) used for these cases instead. I don't know which would be more useful for publishers.

marcos-lg on 24 Jul 2020

👍1

I like it - it is my impression that many publishers appreciate those flags and act on them. This will give them the insights to modify data and improve the matching.

MortenHofft on 24 Jul 2020

The service now returns a response like:

{
  "institutionMatch": {
    ...
  },
  "collectionMatch": {
    ...
  },
  "alternativeMatches": {
    "institutionMatches": [],
    "collectionMatches": []
  }
}

The alternative matches are only shown if the verbose parameter is set to true. The fuzzy matches are limited to 20 results for performance reasons.
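
A rough sketch of how the verbose switch and the limit of 20 could shape the response. This is not the actual implementation: `build_response` and `cap_fuzzy` are hypothetical, and applying the limit at response-building time (rather than in the query) is an assumption made to keep the example small.

```python
from typing import Any, Dict, List, Optional

MAX_FUZZY = 20  # limit on fuzzy matches mentioned above


def cap_fuzzy(matches: List[dict]) -> List[dict]:
    # Keep every non-fuzzy match and at most MAX_FUZZY fuzzy ones.
    other = [m for m in matches if m.get("matchType") != "FUZZY"]
    fuzzy = [m for m in matches if m.get("matchType") == "FUZZY"]
    return other + fuzzy[:MAX_FUZZY]


def build_response(institution: Optional[dict], collection: Optional[dict],
                   alt_institutions: List[dict], alt_collections: List[dict],
                   verbose: bool = False) -> Dict[str, Any]:
    response: Dict[str, Any] = {
        "institutionMatch": institution,
        "collectionMatch": collection,
    }
    # Alternatives are only included when verbose=true is requested.
    if verbose:
        response["alternativeMatches"] = {
            "institutionMatches": cap_fuzzy(alt_institutions),
            "collectionMatches": cap_fuzzy(alt_collections),
        }
    return response
```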

A country parameter was also added, which is used to break ties: http://api.gbif-dev.org/v1/grscicoll/lookup?institutionCode=BR&country=BE&verbose=true

A match happens if any of these conditions are met:

There's only 1 machine tag match
There's only 1 exact match
There are multiple exact matches but only one matches the country parameter received
There's only 1 fuzzy match
There are multiple fuzzy matches but only one matches at least the code or the id and one more field (name or alternative code)
There are multiple fuzzy matches but only one matches the country parameter received

Additionally, institutions are not considered a match when the owner institution differs from the institution. Collections whose institution differs from the accepted institution match are not considered a match either.

I haven't added the flags but a status field instead:

ACCEPTED: accepted match
AMBIGUOUS: more than 1 result was found and we couldn't break the tie
AMBIGUOUS_MACHINE_TAGS: same as above but with machine tag matches
AMBIGUOUS_OWNER: there are results, but they don't match the owner institution, so we skip them to avoid linking to collections that are on loan
AMBIGUOUS_INSTITUTION_MISMATCH: there are fuzzy matches, but they don't belong to the institution matched
DOUBTFUL: the match found is fuzzy

The rest of the flags can be inferred from the reasons field of the match. Issues can be set from this field in pipelines.
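
The match conditions and status values described above can be sketched as a small decision cascade. This is an illustration under assumptions, not the service code: `Candidate`, `is_strong` and `choose` are hypothetical names, an ambiguous set of exact matches is assumed not to fall through to the fuzzy ones, and the owner-institution and institution-mismatch rules are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Candidate:
    key: str
    match_type: str                      # "MACHINE_TAG", "EXACT" or "FUZZY"
    reasons: Set[str] = field(default_factory=set)
    country: Optional[str] = None


def only(items: List[Candidate]) -> Optional[Candidate]:
    return items[0] if len(items) == 1 else None


def is_strong(c: Candidate) -> bool:
    # Code or identifier matched, plus one more field (name or alternative code).
    primary = {"CODE_MATCH", "IDENTIFIER_MATCH"} & c.reasons
    extra = {"NAME_MATCH", "ALTERNATIVE_CODE_MATCH"} & c.reasons
    return bool(primary) and bool(extra)


def choose(cands: List[Candidate],
           country: Optional[str] = None) -> Tuple[Optional[Candidate], Optional[str]]:
    # Machine tags win and stop the lookup.
    tagged = [c for c in cands if c.match_type == "MACHINE_TAG"]
    if tagged:
        hit = only(tagged)
        return (hit, "ACCEPTED") if hit else (None, "AMBIGUOUS_MACHINE_TAGS")

    for match_type, status in (("EXACT", "ACCEPTED"), ("FUZZY", "DOUBTFUL")):
        group = [c for c in cands if c.match_type == match_type]
        if not group:
            continue
        strong = [c for c in group if is_strong(c)] if match_type == "FUZZY" else []
        same_country = [c for c in group if country and c.country == country]
        # Single result first, then the stronger fuzzy match, then the country tie-break.
        hit = only(group) or only(strong) or only(same_country)
        return (hit, status) if hit else (None, "AMBIGUOUS")
    return None, None
```

For example, two exact candidates from BE and DK resolve to the Belgian one when `country="BE"` is passed, and stay AMBIGUOUS without it.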

marcos-lg on 3 Aug 2020

I've extracted from our data in PROD combinations of these fields that are present in more than 1000 records:

v_institutionid
v_institutioncode
v_ownerinstitutioncode
v_collectioncode
v_collectionid
datasetkey

Additionally, I took the country from the publishing organization of the dataset.

Then I passed them to the lookup service in UAT. The results are in this spreadsheet.

marcos-lg on 12 Aug 2020
