Registry: Implement a lookup service for GrSciColl collections

Created on 17 Jun 2020  ·  11 Comments  ·  Source: gbif/registry

This lookup service is intended to link occurrence data with collections. It will use collections data for the lookup, but this behaviour could be overridden with dataset machine tags.

This service could receive the following parameters:

  • institution code
  • institution ID
  • collection code
  • collection ID
  • dataset key
  • owner institution code ??

If there are machine tags in the dataset we use them and stop the lookup.

The service should return how good the match is (exact, fuzzy, etc.). Exact matches will only happen if the codes match and the IDs either match or are not contradictory (e.g. present on only one side).
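To make the proposed exact-match rule concrete, here is a minimal Python sketch. It is only one reading of the sentence above, not the registry implementation: the function and argument names are hypothetical, and it assumes codes and IDs are compared as plain strings.

```python
from typing import Optional


def ids_compatible(record_id: Optional[str], registry_id: Optional[str]) -> bool:
    """IDs agree, or at most one side has an ID, so they cannot contradict."""
    if record_id is None or registry_id is None:
        return True
    return record_id == registry_id


def is_exact_match(
    record_code: Optional[str],
    record_id: Optional[str],
    registry_code: Optional[str],
    registry_id: Optional[str],
) -> bool:
    """Codes must match; IDs must match or be non-contradictory (assumed rule)."""
    if record_code is None or record_code != registry_code:
        return False
    return ids_compatible(record_id, registry_id)
```

Under this reading, a record with a matching code and no ID would still be exact, while a record with a matching code and a different ID would not.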

Anything else? Are there any other parameters that would be useful to take into account?

question GRSciColl

Most helpful comment

I just tried to match based on a csv extract I had from some time ago.
The csv is distinct institutionId, institutionCode, datasetKey, publisherKey with an occurrence count for each.

I took anything with more than 2500 occurrences and tried to match them against the service.

105,871,241 occurrences had a single match
 24,553,278 with an exact singular match
 27,113,941 occurrences had multiple matches
 34,998,066 occurrences had no match

2,374 combinations were tested against the service (several can refer to the same institution, since the combinations include dataset and publisher)

That isn't a bad start. I haven't evaluated the quality of the matches though.
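For a sense of proportion, the counts above can be turned into shares. This is only a sketch: it assumes the single, multiple and no-match groups are disjoint and together make up everything tested, and that the exact singular count is a subset of the single matches. The variable names are hypothetical.

```python
# Occurrence counts quoted in the comment above.
single_match = 105_871_241
exact_single_match = 24_553_278  # assumed to be a subset of single_match
multiple_matches = 27_113_941
no_match = 34_998_066

# Assumption: the three top-level groups are disjoint and exhaustive.
total = single_match + multiple_matches + no_match


def share(count: int) -> float:
    """Percentage of all tested occurrences that fall into a group."""
    return 100 * count / total


for label, count in [
    ("single match", single_match),
    ("exact single match", exact_single_match),
    ("multiple matches", multiple_matches),
    ("no match", no_match),
]:
    print(f"{label}: {share(count):.1f}%")
```

Under those assumptions, roughly 63% of the tested occurrences had a single match, about 16% had multiple matches and about 21% had none.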

All 11 comments

I put in DEV a first version of the collections lookup service (it still doesn't use the machine tags) to see if this is what we were expecting.

It returns a list of institution matches and another for collection matches. It's a list because codes are not unique so there may be cases where we can have multiple options and we can't discriminate by any other field.

For each match it shows:

  • type: either exact or fuzzy. A match is exact only if both the code and the identifier match; otherwise it is fuzzy.
  • remarks: observations that explain how the match was made. The possible values are:

    • CODE_MATCH: case-sensitive

    • IDENTIFIER_MATCH: case-sensitive

    • ALTERNATIVE_CODE_MATCH: case-sensitive

    • NAME_MATCH: ignores case and removes accents and whitespace, but doesn't do prefix or suffix matching

    • PROBABLY_ON_LOAN: set when the owner institution and the institution are not the same

    • INST_COLL_MISMATCH: set when the institution of a collection is not among the matched institutions.

When there are institution matches, a collection only matches fuzzily if it belongs to one of the matched institutions. Exact collection matches will always be returned.
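As an illustration of the NAME_MATCH behaviour described above (ignore case, remove accents and whitespace, no prefix or suffix matching), here is a minimal Python sketch. It is not the service's code; the function names are hypothetical, and stripping accents through Unicode decomposition is an assumption about how the normalization could be done.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Strip accents and all whitespace, then fold case."""
    # NFD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(without_accents.split()).casefold()


def names_match(a: str, b: str) -> bool:
    """Whole-string comparison only: no prefix or suffix matching."""
    return normalize_name(a) == normalize_name(b)
```

With this sketch, "Muséum  National" and "museum national" compare equal, while "Museum" and "Museum of Natural History" do not.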

You can check these examples to see how the service works:

Am I missing something @MortenHofft @timrobertson100 ?

For now, I've added the machine tag check using the format that was already used in pipelines:

  • Namespace: processing.gbif.org
  • Names:

    • institutionCode: maps an institution code to a GrSciColl institution

    • collectionCode: maps a collection code to a GrSciColl collection

    • collectionToInstitutionCode: maps a collection code to a GrSciColl institution. I'd probably rename it to collectionCodeToInstitution (TBD).

    • institutionToCollectionCode: maps an institution code to a GrSciColl collection. I'd probably rename it to institutionCodeToCollection (TBD).

The value of the tags should follow the pattern {key}:{code}.
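A tag value in this format could be parsed roughly as follows. This is a sketch rather than the pipelines code: `TagMapping` and `parse_machine_tag` are hypothetical names, and splitting on the first colon assumes the key itself never contains one (which holds for a UUID-style key, although the thread does not say what the key looks like).

```python
from typing import NamedTuple, Optional

NAMESPACE = "processing.gbif.org"
TAG_NAMES = {
    "institutionCode",
    "collectionCode",
    "collectionToInstitutionCode",
    "institutionToCollectionCode",
}


class TagMapping(NamedTuple):
    name: str  # machine tag name, e.g. "institutionCode"
    key: str   # key of the GrSciColl entity the code maps to
    code: str  # code as it appears in the occurrence data


def parse_machine_tag(namespace: str, name: str, value: str) -> Optional[TagMapping]:
    """Return the mapping for a tag shaped like {key}:{code}, else None."""
    if namespace != NAMESPACE or name not in TAG_NAMES:
        return None
    # Assumption: the key has no colon, so the first colon is the separator.
    key, sep, code = value.partition(":")
    if not sep or not key or not code:
        return None
    return TagMapping(name, key, code)
```

Splitting on the first colon means a code that itself contains a colon is kept whole on the right-hand side.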

It looks good - I'm very curious to see it applied to actual data.

verbose or not
Should we have a "verbose" option like in the species match?

If we imagine anyone but staff using this, it might be useful to just have a match or none option. Say Plazi using it to create links from collection codes in articles.

// plain lookup - not verbose - will at most return one institution and one collection.
{
  institutionMatch: {
    matchType: 'NONE'
  },
  collectionMatch: {
    matchType: 'FUZZY',
    reasons: ['SAME_NAME', 'SAME_CODE', 'SAME_COUNTRY'],
    entity: {
      ...
    }
  }
}

// verbose option
// could be like the one running in dev
{
  "institutionMatches": [],
  "collectionMatches": [
    ...
  ]
}

on naming
The species match API uses matchType instead of type, and the clusters use reasons instead of remarks.

Real data
Before running it on real data, I wonder if it would be worth just checking the top 500 distinct combinations of institution code/ID and collection code/ID and manually assessing some of them? We might learn something (say, that we need to try flipping ID and code).

What constitutes a match?
When do you imagine this will trigger a match?

  • Exact only?
  • Fuzzy but only one result
  • Exact institution, but only one fuzzy collection?

Country as a disambiguator
Would it make sense to add country as a search param? When indexing occurrences, we could add the country of the publisher to disambiguate when there are, for example, 2 collection matches. Or is that better done by the consumer iterating the results?

_What constitutes a match?_
When do you imagine this will trigger a match?

  • Exact only?
  • Fuzzy but only one result
  • Exact institution, but only one fuzzy collection?

For this you mean for the non-verbose version where we show only 1 match, right?

It could be something like:

  • For institutions

    • Only one exact match

    • Only one fuzzy match

  • For collections

    • Only one exact match

    • If there was an institution match, only one fuzzy match whose institution is the same as the institution matched

    • If there wasn't an institution match, only one fuzzy match

Do you think we should also provide an overall match status?

_What constitutes a match?_
When do you imagine this will trigger a match?

  • Exact only?
  • Fuzzy but only one result
  • Exact institution, but only one fuzzy collection?

For this you mean for the non-verbose version where we show only 1 match, right?

I meant when using it in pipelines for assigning GrSciColl IDs to occurrences. I had imagined that this service included the decision and all logic. How does it work for other lookup services?


That reminds me, you mentioned the other day that you considered adding all the matched IDs to the occurrence index.
I guess there are 2 possible versions:

  • Adding all possible candidates. That effectively pushes the burden to the UI or the user, and the same specimens would appear under multiple collections.
  • Only adding a link when we have one confident match. The service takes responsibility for the statement. We will be able to match fewer occurrences.

I'm more in favour of version 2: only adding a GrSciColl ID to an occurrence when we have 1 confident match, not an array of candidate matches. If we want more matches, then we ask publishers to add better identifiers to either GrSciColl or the occurrences, or we add machine tags to the datasets where needed.

If it is useful to have all candidates indexed, could we then consider a separate field for it?

Should the service return flags, e.g. FUZZY COLLECTION CODE MATCH or NO COLLECTION CODE MATCH, similar to the species match service?

I can think of these flags:

  • AMBIGUOUS_INSTITUTION: more than 1 institution was found and we couldn't break the tie
  • AMBIGUOUS_COLLECTION: same as above but for collections
  • FUZZY_INSTITUTION_MATCH: 1 institution matched but fuzzily
  • FUZZY_COLLECTION_MATCH: same as above but for collections
  • INSTITUTION_NAME_USED: the institutionCode field contains the institution name instead of the code
  • OWNER_INSTITUTION_NAME_USED: same as above but for the owner institution
  • COLLECTION_NAME_USED: same as above but for collections
  • NO_COLLECTION_CODE_MATCH: the code provided didn't match
  • NO_INSTITUTION_CODE_MATCH: same as above
  • NO_COLLECTION_ID_MATCH: the ID provided didn't match
  • NO_INSTITUTION_ID_MATCH: same as above
  • INSTITUTION_COLLECTION_MISMATCH: the collection found doesn't belong to the institution matched

EDIT: the INSTITUTION_NAME_USED ones could perhaps be removed, with FUZZY_INSTITUTION_MATCH used for these cases instead. I don't know which would be more useful for publishers.

I like it - it is my impression that many publishers appreciate those flags and act on them. This will give them the insights to modify data and improve the matching.

The service now returns a response like:

{
  "institutionMatch": {
    ...
  },
  "collectionMatch": {
    ...
  },
  "alternativeMatches": {
    "institutionMatches": [],
    "collectionMatches": []
  }
}

The alternative matches are only shown if the verbose parameter is set to true. The fuzzy matches are limited to 20 results for performance reasons.

A country parameter was also added to break ties: http://api.gbif-dev.org/v1/grscicoll/lookup?institutionCode=BR&country=BE&verbose=true
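A request URL like the one above can be assembled as in the sketch below. The base URL and the institutionCode, country and verbose parameters are taken from that example; `build_lookup_url` is a hypothetical helper, and any other query parameter names passed to it would be assumptions rather than documented API.

```python
from urllib.parse import urlencode

# DEV endpoint as quoted in the thread.
LOOKUP_URL = "http://api.gbif-dev.org/v1/grscicoll/lookup"


def build_lookup_url(**params: object) -> str:
    """Build a lookup URL, skipping unset values and lower-casing booleans."""
    query = {
        name: str(value).lower() if isinstance(value, bool) else value
        for name, value in params.items()
        if value is not None
    }
    return f"{LOOKUP_URL}?{urlencode(query)}"


url = build_lookup_url(institutionCode="BR", country="BE", verbose=True)
```

Here `url` reproduces the example request; `urlencode` also takes care of escaping codes that contain spaces or other reserved characters.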

A match happens if any of these conditions are met:

  • There's only 1 machine tag match
  • There's only 1 exact match
  • There are multiple exact matches but only one matches the country parameter received
  • There's only 1 fuzzy match
  • There are multiple fuzzy matches but only one matches at least the code or the id and one more field (name or alternative code)
  • There are multiple fuzzy matches but only one matches the country parameter received

Additionally, institutions are not considered a match when the owner institution differs from the institution. Likewise, collections whose institution doesn't match the accepted institution match are not considered a match.

I haven't added the flags but a status field instead:

  • ACCEPTED: accepted match
  • AMBIGUOUS: more than 1 result was found and we couldn't break the tie
  • AMBIGUOUS_MACHINE_TAGS: same as above but with machine tag matches
  • AMBIGUOUS_OWNER: there are results, but they don't match the owner institution, so we skip them to avoid linking on-loan collections
  • AMBIGUOUS_INSTITUTION_MISMATCH: there are fuzzy matches, but they don't belong to the matched institution
  • DOUBTFUL: the match found is fuzzy

The rest of the flags can be inferred from the reasons field of the match. Issues can be set from this field in pipelines.
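The match conditions and status values above could be combined roughly as in this Python sketch. It is not the registry implementation: `Candidate`, `break_tie` and `decide` are hypothetical names, the precedence of exact over fuzzy matches and the use of DOUBTFUL for an accepted fuzzy match are assumptions, and the owner-institution and institution-mismatch checks (AMBIGUOUS_OWNER, AMBIGUOUS_INSTITUTION_MISMATCH) are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

PRIMARY = {"CODE_MATCH", "IDENTIFIER_MATCH"}
SECONDARY = {"NAME_MATCH", "ALTERNATIVE_CODE_MATCH"}


@dataclass
class Candidate:
    key: str
    match_type: str  # "MACHINE_TAG", "EXACT" or "FUZZY" (assumed labels)
    reasons: Set[str] = field(default_factory=set)
    country: Optional[str] = None


def only_one(items: List[Candidate]) -> Optional[Candidate]:
    return items[0] if len(items) == 1 else None


def break_tie(items: List[Candidate], country: Optional[str],
              use_reasons: bool) -> Optional[Candidate]:
    """Try a single candidate, then reasons (fuzzy only), then country."""
    winner = only_one(items)
    if winner is None and use_reasons:
        # Code or ID matched, plus one more field (name or alternative code).
        winner = only_one(
            [c for c in items if c.reasons & PRIMARY and c.reasons & SECONDARY]
        )
    if winner is None and country is not None:
        winner = only_one([c for c in items if c.country == country])
    return winner


def decide(candidates: List[Candidate],
           country: Optional[str] = None) -> Tuple[Optional[Candidate], Optional[str]]:
    """Return (accepted match, status); (None, None) when nothing was found."""
    tagged = [c for c in candidates if c.match_type == "MACHINE_TAG"]
    if tagged:  # machine tags win and stop the lookup
        winner = only_one(tagged)
        return (winner, "ACCEPTED") if winner else (None, "AMBIGUOUS_MACHINE_TAGS")
    exact = [c for c in candidates if c.match_type == "EXACT"]
    if exact:
        winner = break_tie(exact, country, use_reasons=False)
        return (winner, "ACCEPTED") if winner else (None, "AMBIGUOUS")
    fuzzy = [c for c in candidates if c.match_type == "FUZZY"]
    if fuzzy:
        winner = break_tie(fuzzy, country, use_reasons=True)
        return (winner, "DOUBTFUL") if winner else (None, "AMBIGUOUS")
    return None, None
```

For example, two exact candidates in different countries resolve to a single ACCEPTED match only when the country parameter matches exactly one of them; without it the result is AMBIGUOUS.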

I've extracted from our data in PROD combinations of these fields that are present in more than 1000 records:

  • v_institutionid
  • v_institutioncode
  • v_ownerinstitutioncode
  • v_collectioncode
  • v_collectionid
  • datasetkey

Additionally, I took the country from the publishing organization of the dataset.

Then I passed them to the lookup service in UAT. The results are in this spreadsheet
