This lookup service is intended to link occurrence data with collections. It will use collections data for the lookup, but this behaviour can be overridden with dataset machine tags.
This service could receive the following parameters:
If there are machine tags in the dataset we use them and stop the lookup.
The service should return how good the match is (exact, fuzzy, etc.). Exact matches will only happen if codes match and IDs either match or are not contradictory (e.g. present on only one side).
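To make the exact/fuzzy rule concrete, here is a minimal Python sketch. The function and parameter names are hypothetical, the comparison is assumed to be case-sensitive, and treating IDs that are missing on both sides as "not contradictory" is also an assumption; it is not the service's actual implementation.

```python
from typing import Optional


def ids_compatible(occurrence_id: Optional[str], entity_id: Optional[str]) -> bool:
    """IDs are compatible when they are equal or when at least one side is missing."""
    if occurrence_id is None or entity_id is None:
        # Assumption: a missing ID on either side (or both) is not a contradiction.
        return True
    return occurrence_id == entity_id


def match_quality(
    occurrence_code: Optional[str],
    occurrence_id: Optional[str],
    entity_code: Optional[str],
    entity_id: Optional[str],
) -> str:
    """Return 'exact' when codes match and IDs are compatible, otherwise 'fuzzy'."""
    # Assumption: codes are compared case-sensitively.
    codes_match = occurrence_code is not None and occurrence_code == entity_code
    if codes_match and ids_compatible(occurrence_id, entity_id):
        return "exact"
    return "fuzzy"
```

With this sketch, a code match with an ID on only one side still counts as exact, while a code match with two different IDs falls back to fuzzy.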
Anything else? Are there any other parameters that would be useful to take into account?
I put in DEV a first version of the collections lookup service (it still doesn't use the machine tags) to see if this is what we were expecting.
It returns a list of institution matches and another for collection matches. It's a list because codes are not unique so there may be cases where we can have multiple options and we can't discriminate by any other field.
For each match it shows:
- `exact` or `fuzzy`. `exact` is only if both the code and identifier match. Otherwise it's `fuzzy`.
- `CODE_MATCH`: it doesn't ignore the case
- `IDENTIFIER_MATCH`: it doesn't ignore the case
- `ALTERNATIVE_CODE_MATCH`: it doesn't ignore the case
- `NAME_MATCH`: it ignores the case and removes accents and whitespace but doesn't do prefix or suffix matches
- `PROBABLY_ON_LOAN`: it happens when the owner institution and the institution are not the same
- `INST_COLL_MISMATCH`: it happens when the institution of a collection is not present in the institutions matched.

When there are institution matches, a collection only matches fuzzily if it belongs to any of the institutions matched. Exact collection matches will always be returned.
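The name comparison described for `NAME_MATCH` (ignore case, remove accents and whitespace, no prefix or suffix matching) could look roughly like the sketch below. The helper names are hypothetical and the Unicode handling is an assumption; the service may normalize differently.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lower-case the name and strip accents and all whitespace."""
    # Decompose accented characters so the accent becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(without_accents.split()).casefold()


def names_match(a: str, b: str) -> bool:
    """Whole-string equality after normalization: no prefix or suffix matching."""
    return normalize_name(a) == normalize_name(b)
```

Because the comparison is whole-string equality, a name that is only a prefix of another name does not match.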
You can check these examples to see how the service works:
INST_COLL_MISMATCH
PROBABLY_ON_LOAN
Am I missing something @MortenHofft @timrobertson100 ?
I've added the machine tag check following the format that was used already in pipelines for now:
processing.gbif.org
- `institutionCode`: maps an institution code to a GrSciColl institution
- `collectionCode`: maps a collection code to a GrSciColl collection
- `collectionToInstitutionCode`: maps a collection code to a GrSciColl institution. I'd probably rename it to `collectionCodeToInstitution` (TBD).
- `institutionToCollectionCode`: maps an institution code to a GrSciColl collection. I'd probably rename it to `institutionCodeToCollection` (TBD).

The value of the tags should follow the pattern `{key}:{code}`.
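As an illustration of the `{key}:{code}` pattern, a tag value could be split as in the sketch below. The names `TagMapping` and `parse_tag_value` are hypothetical, and splitting on the first colon (so that a code may itself contain colons) is an assumption rather than something fixed by the format.

```python
from typing import NamedTuple


class TagMapping(NamedTuple):
    key: str   # key of the GrSciColl entity the code maps to
    code: str  # institution or collection code found in the occurrence


def parse_tag_value(value: str) -> TagMapping:
    """Split a machine tag value of the form '{key}:{code}'."""
    # Assumption: the key never contains ':', so the first colon is the separator.
    key, separator, code = value.partition(":")
    if not separator or not key or not code:
        raise ValueError(f"expected '{{key}}:{{code}}', got {value!r}")
    return TagMapping(key=key, code=code)
```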
It looks good - I'm very curious to see it applied to actual data.
verbose or not
Should we have a "verbose" option like in the species match?
If we imagine anyone but staff using this, it might be useful to just have a `match or none` option. Say Plazi using it to create links from collection codes in articles.
// plain lookup - not verbose - will at most return one institution and one collection.
{
  institutionMatch: {
    matchType: 'NONE'
  },
  collectionMatch: {
    matchType: 'FUZZY',
    reasons: ['SAME_NAME', 'SAME_CODE', 'SAME_COUNTRY'],
    entity: {
      ...
    }
  }
}

// verbose option
// could be like the one running in dev
{
  "institutionMatches": [],
  "collectionMatches": [
    ...
  ]
}
on naming
the species match API uses `matchType` instead of `type`, and the clusters use `reasons` instead of `remarks`.
Real data
Before running it on real data, I wonder if it would be worth just checking the top 500 distinct combinations of institution code/ID and collection code/ID and manually assessing some of them? We might learn something (say, that we need to try flipping ID and code).
What constitutes a match?
When do you imagine this will trigger a match?
- Exact only?
- Fuzzy but only one result
- Exact institution, but only one fuzzy collection?
Country as a disambiguator
Would it make sense to add country as a search param? When indexing occurrences we could add the country of the publisher to disambiguate when there are for example 2 collection matches? Or is that better done by the consumer iterating the results?
I just tried to match based on a CSV extract I had from some time ago.
The CSV is distinct `institutionId, institutionCode, datasetKey, publisherKey` with an occurrence count for each.
I took anything with more than 2500 occurrences and tried to match them against the service.
- 105,871,241 occurrences had a single match
  - 24,553,278 with an exact singular match
- 27,113,941 occurrences had multiple matches
- 34,998,066 occurrences had no match
- 2,374 combinations were tested against the service (multiple can be the same since it included dataset and publisher)
That isn't a bad start. I haven't evaluated the quality of the matches though.
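For a rough sense of proportions, the counts above can be turned into shares. This is only a sketch under assumptions: it treats the single, multiple and no-match buckets as disjoint and as covering everything that was tested, and it treats the exact singular count as a subset of the single matches.

```python
# Occurrence counts reported above, per match outcome.
counts = {
    "single match": 105_871_241,
    "multiple matches": 27_113_941,
    "no match": 34_998_066,
}
# Assumption: this count is a subset of "single match".
exact_single = 24_553_278

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
exact_share_of_single = exact_single / counts["single match"]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"exact among single matches: {exact_share_of_single:.1%}")
```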
_What constitute a match?_
When do you imagine this will trigger a match?
- Exact only?
- Fuzzy but only one result
- Exact institution, but only one fuzzy collection?
For this you mean for the non-verbose version where we show only 1 match, right?
It could be something like:
Do you think we should also provide an overall match status?
_What constitute a match?_
When do you imagine this will trigger a match?
- Exact only?
- Fuzzy but only one result
- Exact institution, but only one fuzzy collection?
For this you mean for the non-verbose version where we show only 1 match, right?
I meant when using it in pipelines for assigning GrSciColl IDs to occurrences. I had imagined that this service included the decision and all logic. How does it work for other lookup services?
That reminds me, you mentioned the other day that you considered adding all the matched IDs to the occurrence index.
I guess there are 2 possible versions:
I'm more in favour of version 2. Only adding a GrSciColl ID to an occurrence when we have 1 confident match. Not an array of candidate matches. And if we want more matches, then we address publishers to add better identifiers to either GrSciColl or the occurrences. Or we add machine tags to the datasets in case.
If it is useful to have all candidates indexed, could we then consider a separate field for it?
Should the service return flags (FUZZY COLLECTION CODE MATCH, NO COLLECTION CODE MATCH), similar to the species match service?
I can think of these flags:
- `AMBIGUOUS_INSTITUTION`: more than 1 institution was found and we couldn't break the tie
- `AMBIGUOUS_COLLECTION`: same as above but for collections
- `FUZZY_INSTITUTION_MATCH`: 1 institution matched but fuzzily
- `FUZZY_COLLECTION_MATCH`: same as above but for collections
- `INSTITUTION_NAME_USED`: the `institutionCode` field contains the institution name instead of the code
- `OWNER_INSTITUTION_NAME_USED`: same as above but for the owner institution
- `COLLECTION_NAME_USED`: same as above but for collections
- `NO_COLLECTION_CODE_MATCH`: the code provided didn't match
- `NO_INSTITUTION_CODE_MATCH`: same as above
- `NO_COLLECTION_ID_MATCH`: the ID provided didn't match
- `NO_INSTITUTION_ID_MATCH`: same as above
- `INSTITUTION_COLLECTION_MISMATCH`: the collection found doesn't belong to the institution matched

EDIT: the `INSTITUTION_NAME_USED` ones could maybe be removed, using just `FUZZY_INSTITUTION_MATCH` for these cases. I don't know what would be more useful for publishers.
I like it - it is my impression that many publishers appreciate those flags and act on them. This will give them the insights to modify data and improve the matching.
The service now returns a response like:
{
  "institutionMatch": {
    ...
  },
  "collectionMatch": {
    ...
  },
  "alternativeMatches": {
    "institutionMatches": [],
    "collectionMatches": []
  }
}
The alternative matches are only shown if the `verbose` parameter is set to true. The fuzzy matches are limited to 20 results for performance reasons.
A `country` parameter was also added; it is used to break ties: http://api.gbif-dev.org/v1/grscicoll/lookup?institutionCode=BR&country=BE&verbose=true
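For anyone scripting against the DEV endpoint, the request URL above can be assembled as in this sketch. Only `institutionCode`, `country` and `verbose` are taken from the example URL; the `lookup_url` helper is hypothetical and the sketch does not perform the HTTP call.

```python
from urllib.parse import urlencode

# DEV endpoint taken from the example URL above.
BASE_URL = "http://api.gbif-dev.org/v1/grscicoll/lookup"


def lookup_url(**params) -> str:
    """Build a lookup URL, skipping parameters that are None."""
    query = {}
    for name, value in params.items():
        if value is None:
            continue
        # Render booleans as lower-case 'true'/'false', as in the example URL.
        query[name] = str(value).lower() if isinstance(value, bool) else value
    return f"{BASE_URL}?{urlencode(query)}"


print(lookup_url(institutionCode="BR", country="BE", verbose=True))
```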
A match happens if any of these conditions are met:
Additionally, institutions whose owner institution is different from the institution are not considered a match. Likewise, collections whose institution is not the accepted institution match are not considered a match.
I haven't added the flags but a status field instead:
- `ACCEPTED`: accepted match
- `AMBIGUOUS`: more than 1 result was found and we couldn't break the tie
- `AMBIGUOUS_MACHINE_TAGS`: same as above but with machine tag matches
- `AMBIGUOUS_OWNER`: there are results but they don't match the institution owner, so we skip them so as not to link on-loan collections
- `AMBIGUOUS_INSTITUTION_MISMATCH`: there are fuzzy matches but they don't belong to the institution matched
- `DOUBTFUL`: the match found is fuzzy

The rest of the flags can be inferred from the `reasons` field of the match. Issues can be set from this field in pipelines.
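On the consumer side (for example in pipelines), the status values could drive the decision of whether to attach a GrSciColl ID to an occurrence. The sketch below is hypothetical: the `Status` enum mirrors the list above, but `should_link` and its `allow_doubtful` switch are assumptions about one possible policy, not something the service defines.

```python
from enum import Enum


class Status(Enum):
    ACCEPTED = "ACCEPTED"
    AMBIGUOUS = "AMBIGUOUS"
    AMBIGUOUS_MACHINE_TAGS = "AMBIGUOUS_MACHINE_TAGS"
    AMBIGUOUS_OWNER = "AMBIGUOUS_OWNER"
    AMBIGUOUS_INSTITUTION_MISMATCH = "AMBIGUOUS_INSTITUTION_MISMATCH"
    DOUBTFUL = "DOUBTFUL"


def should_link(status: Status, allow_doubtful: bool = False) -> bool:
    """Decide whether a match is confident enough to link an occurrence.

    Assumption: only ACCEPTED links by default; DOUBTFUL (fuzzy) matches link
    only when the caller opts in. Ambiguous statuses never link.
    """
    if status is Status.ACCEPTED:
        return True
    return allow_doubtful and status is Status.DOUBTFUL
```

Keeping the default strict follows the preference voiced earlier in the thread for linking only when there is one confident match.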
I've extracted from our data in PROD combinations of these fields that are present in more than 1000 records:
v_institutionid
v_institutioncode
v_ownerinstitutioncode
v_collectioncode
v_collectionid
datasetkey
Additionally, I took the country from the publishing organization of the dataset.
Then I passed them to the lookup service in UAT. The results are in this spreadsheet