Registry: Add category to dataset

Created on 3 Nov 2020  ·  20 Comments  ·  Source: gbif/registry

The current Dataset has type and subtype, which is slightly problematic. Type really indicates the row format used in the DwC-A, and this causes problems since a checklist can have occurrences, and an occurrence dataset can in fact be the output of sampling event data.

Better use of SubType may help, but I feel it could add more confusion due to the overlap (e.g. an occurrence dataset with subtype sampling event).

Since the API is now so well used and changing this is disruptive, I propose to introduce a new multi-value field named category to categorize datasets. In time we can deprecate type and subtype.
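A minimal sketch of how this might look in a Dataset API response, assuming `category` is simply an additional multi-value field alongside the existing `type` and `subtype` (the field name and values below are illustrative, not a final design):

```python
# Hypothetical shape of a Dataset API response with the proposed multi-value
# "category" field; "type" and "subtype" stay in place until deprecated.
dataset = {
    "key": "00000000-0000-0000-0000-000000000000",  # placeholder UUID
    "title": "Example camera-trap monitoring dataset",
    "type": "SAMPLING_EVENT",   # existing field: really the DwC-A row format
    "subtype": None,            # existing field, often ambiguous
    "category": [               # proposed: zero or more tags per dataset
        "MACHINE_OBSERVATION",
        "LONG_TERM_MONITORING",
    ],
}

# Being multi-value, a dataset can carry several categories at once,
# e.g. both a sampling-event and a machine-observation tag.
```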

The categories would include the likes of (edited to include suggestions that came in from chat below):

  1. Citizen science data
  2. Observation data
  3. Natural history collection
    a. Consider separating out fossils as a separate category, to avoid accidental misuse
  4. Single organism sequenced (i.e. tissue from an NHM specimen)
    a. Consider adding tissue sample as well (which may or may not be sequenced) to aid discovery of preserved tissue without drawing on ambiguous other terms
  5. Environmental DNA and/or metagenomics (e.g. soil sample, water, insect soup etc)
  6. Targeted species detection (PCR-based assays)
  7. Long term monitoring data
  8. Sampling event (where some protocol has been used)
  9. Checklist data
  10. Material citations (e.g. taxonomic treatments in literature)
  11. Private sector data
    a. Consider splitting this into finer categories (e.g. proponent data for environmental impact assessment prior to development) versus other categories (to be defined)
  12. Tracking data (i.e. recaptures or GPS tracking of individual organisms)
  13. Machine observation (e.g. camera trap)

The multiple categories would be added to each occurrence record at indexing, allowing an intuitive filter to be added in GBIF.org so people can select on/off the dataset categories that interest them.
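One possible shape for that GBIF.org filter is an extra parameter on the existing occurrence search API. A rough sketch, where the `datasetCategory` parameter is hypothetical and does not exist today:

```python
# Sketch of an occurrence search filter once dataset categories are copied
# onto each occurrence record at indexing time. The endpoint is the existing
# GBIF occurrence search API; the "datasetCategory" parameter is hypothetical.
import requests

resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={
        "country": "DK",
        "datasetCategory": ["CITIZEN_SCIENCE", "MACHINE_OBSERVATION"],  # hypothetical, repeatable
        "limit": 20,
    },
)
resp.raise_for_status()
print(resp.json()["count"])  # how many records match the selected categories
```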

CC @ahahn-gbif @MortenHofft for comments in particular

All 20 comments

Thanks!

~Assuming this will also support metrics (and understanding that multivalue means that a dataset can belong to more than one category), I would like to add~
~9. private sector data~
~10. tracking data (i.e. recaptures or GPS tracking of individual organisms)~

[Tim: Thanks - Added above!]

Question: should 4. metagenomic (eDNA) be two separate categories? There is quite a difference in interpretation of these data, even though they are both "sequence based". @ManonGros, would you comment?

[Tim Edited to add: I've split them above now, but will change again based on more comments]

Machine observation seems like a subcategory of Sampling Event.

> Machine observation seems like a subcategory of Sampling Event.

That's OK, isn't it? Because it's multi-value, a dataset can be marked as both, or just as sampling event; or perhaps there are cases where machine observation would be appropriate and no real sampling protocol is used.

This new category would be free text using the vocab server? Or are we trying to have all the categories defined?

> This new category would be free text using the vocab server? Or are we trying to have all the categories defined?

~Undecided, but at this point we're proposing the categories~

Revised: I'd now suggest the vocabulary server, as detailed later in this thread.

Great! I love the idea!

~Just one comment:~
~> 4. Single organism metagenomic (i.e. tissue from an NHM specimen)~
~> 5. Environmental eDNA (e.g. soil sample, water, insect soup etc)~

~Number 4 doesn't seem right. What I understand when reading "Single organism metagenomic" is that someone took a gut sample of a cow (for example) and sequenced it, resulting in a bunch of occurrences for the gut microbiome. I guess this isn't the idea, is it?~
~If you mean that tissues from a specimen were sequenced, then I would write something more along the lines of "Single organism sequenced". And actually, we could group metagenomics with eDNA (often eDNA is metagenomics). So in the end, I think we could do something like:~

~4. Single organism sequenced (i.e. tissue from an NHM specimen)~
~5. Environmental eDNA and/or metagenomics (e.g. soil sample, water, insect soup etc)~

[Tim: Edited with suggestions expressed here - thanks, you indeed understood what I intended!]

Perhaps @thomasstjerne has some thoughts on this?

Added Targeted species detection (PCR-based assays)

Thanks @timrobertson100 for making me aware of the thread, very exciting. So far, I found eight likely independent variables that may determine the evidence / dataset type in GBIF. I need to meditate a bit more before presenting my views here, and happy to brainstorm / whiteboard a bit if people are available?

Keeping track of this as well

Hello all, I like the idea of sorting datasets and types of evidence, but I am not sure it is most attractive for users to do so using a single filter / vocabulary (though I take the feasibility point as put by Tim). I drew some mind maps but don't have time to add pictures here, so I'll just type them out for your consideration. I started from thinking about why users would need to sort datasets / types of evidence: it is a quick way to in/exclude types of data that matter for your case, based on how the evidence was generated and its properties. I came up with 8 independent variables that cross over the suggested categorization of the dataset and the basisOfRecord vocabulary as we have it today. Note that I think the word independent is important here, though some of the combinations of 1-8 below are impossible in real life.

I am using loose words to describe my thinking, this is not a vocabulary I am suggesting, and there are some unresolved overlaps:

  1. Preservation status of evidence: virtual only or physical: fossil, dead, living (zoos, cultures, gardens, aquaria). Note some things like amber are not easy to place, as one can get DNA from amber, there are subfossils, etc. _Question_: Can I re-examine the physical material? What and where is it?
  2. Integrity / N species: single & whole (e.g. an insect, i.e. contains all of its genet within one individual), partial (tissue sample, leaf, fruit body) or mixed specimen (common in moss and lichen collections, when collecting individual species is not possible; this is not intentional sampling like e.g. plankton, see 6). _Question_: Can I study full morphology, or only some traits, or only link a museum specimen to a DNA sequence?
  3. DNA: not explored, sequence, PCR. Note: this is in between virtual and physical, as DNA or PCR products can be stored for a long time (physical), but DNA evidence for species presence, often a sequence, is machine-generated virtual evidence not much different from a digital image or a sound. _Question_: Can I re-examine the identification and do phylogeny, or is all I have a label name?
  4. Dynamic / static data. Dynamic: tracking, time series, mark-recapture. _Question_: Can I study processes, or only patterns?
  5. The way the evidence is generated: literature processing, collection digitization, personal observations, systematic sampling. _Question_: Can I sort the data by reliability of its generation?
  6. For sampling event data, but maybe occurrences, too: presence-only (sampling effort unknown / undocumented), presence-absence, abundance (quantitative). _Question_: What kinds of statistical analyses are possible?
  7. The way data is packed in GBIF: metadata only, checklist, occurrences only, sampling event. Might include a filter by extension used, esp. if we are getting more of those in TDWG. _Question_: What do I get in my GBIF download, verbatim and GBIF interpreted?
  8. Community generating the data (perhaps this is more relevant to tagging publishers, but one may need to filter occurrences and datasets by it): (groups of) individuals, natural history collections, private sector, marine, citizen science, machine. Some of these are not mutually exclusive: it can be "natural history collection" + "citizen science", or "machine". _Question_: Can I study data trends in a particular demographic sector?

Once again, this is just a capture of unfinished thoughts; it would be nice to brainstorm / whiteboard what good categorization would look like. I was thinking about how to slice it, as e.g. 1, 7, and 13 in the original post can be simultaneously true. If these are tags and overlap is no problem, then fine. But if this is a strict filter, we may need more than one field to capture types of preservation vs. generating community vs. ways of generating vs. quantitativeness etc. Feel free to discard if out of scope. I also could not find the collection of BoR discussions, which is partly applicable here.
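One way to make the contrast between a single flat vocabulary and several independent variables concrete is to treat them as orthogonal facets. A rough sketch, using made-up facet and value names loosely based on the list above:

```python
# Illustrative only: the same dataset described as a flat bag of tags
# (the proposal in this issue) versus as several independent facets.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DatasetFacets:
    preservation: Optional[str] = None      # e.g. "FOSSIL", "LIVING", "VIRTUAL_ONLY"
    dna_evidence: Optional[str] = None      # e.g. "NOT_EXPLORED", "SEQUENCE", "PCR"
    quantitativeness: Optional[str] = None  # e.g. "PRESENCE_ONLY", "PRESENCE_ABSENCE", "ABUNDANCE"
    generating_community: List[str] = field(default_factory=list)

# Flat tags: overlap is fine, but the axes are mixed together.
flat_tags = ["NATURAL_HISTORY_COLLECTION", "CITIZEN_SCIENCE", "FOSSIL"]

# Faceted: each independent variable gets its own slot.
faceted = DatasetFacets(
    preservation="FOSSIL",
    dna_evidence="NOT_EXPLORED",
    quantitativeness="PRESENCE_ONLY",
    generating_community=["NATURAL_HISTORY_COLLECTION", "CITIZEN_SCIENCE"],
)
```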

I assume the categorisations would come from us (at least that's how it is at the moment for citizen science datasets) but it would be great if other people could help with the curation as well. Just something to keep in mind.

For example, let's say that we ask Node managers to check the datasets tagged "citizen science". We want:

  1. An easy way for them to see all the citizen science datasets for their node (see the sketch after this list).
  2. If a Node manager notices a dataset tagged erroneously, we want to keep track of that so that we don't re-tag it next time.
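A rough sketch of the first check, assuming a hypothetical `category` filter on the existing dataset search API (the real `publishingCountry` parameter stands in for "datasets relevant to my node"):

```python
# Rough sketch: list datasets tagged as citizen science for a given country.
# The dataset search endpoint is real; the "category" parameter is hypothetical.
import requests

def citizen_science_datasets(country_code):
    """Yield (key, title) for citizen-science datasets published from one country."""
    offset, limit = 0, 100
    while True:
        resp = requests.get(
            "https://api.gbif.org/v1/dataset/search",
            params={
                "publishingCountry": country_code,
                "category": "CITIZEN_SCIENCE",  # hypothetical parameter
                "offset": offset,
                "limit": limit,
            },
        )
        resp.raise_for_status()
        page = resp.json()
        for d in page["results"]:
            yield d["key"], d["title"]
        if page.get("endOfRecords", True):
            break
        offset += limit

for key, title in citizen_science_datasets("NO"):
    print(key, title)
```

Tracking removals (the second point) would additionally need some record of who untagged a dataset and why, which ties in with the annotation-with-provenance idea further down the thread.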

Looking at this issue: https://github.com/gbif/portal-feedback/issues/3381, we would be missing the data extracted from taxonomic literature (i.e., Plazi) category. You are right, I missed it!

Thanks @ManonGros

> Looking at this issue: gbif/portal-feedback#3381, we would be missing the data extracted from taxonomic literature (i.e., Plazi) category.

That is what this was intended to be:

> Material citations (e.g. taxonomic treatments in literature)

(Related: Plazi just proposed Material citation as an addition to the basisOfRecord vocabulary in the Darwin Core issues for public commentary.)

+1 @Dmitry for one-to-many and using keyword tags (instead of a 1:1 core record to category)
+1 @Marie for thinking of enabling Node staff to curate categories --> we can also add a feature request for enabling anybody to annotate a datapoint/dataset with category information (with provenance intact)

Remember also that a "dataset" (as in a Darwin-Core-archive dataset) can be a mixed bag of "evidence records" (aka core records, e.g. occurrences) of different categories -- if a category "tag" is designed to apply to all core records in a DwC-A.

And that the de-normalization of the "evidence records" (core records) means that one cannot be certain which class a given property linked to a core record is intended to be linked to.

I really like this idea. Certainly the ALA has users who want a very simple way to select groupings of records across data providers. The group I hear this request from most are curators/researchers who ‘just’ want museum or herbarium specimens.

A couple of suggestions:

  1. Natural history collection - it might still be useful to also have a category for Fossil specimens so these can easily be separated out.
    The reason for separating fossils out is that subfossils (or any fossil species still extant) often show up outside the extant distribution and can easily be mistaken for errors and flagged as such, when they're perfectly legitimate.
  2. Single organism sequenced (i.e. tissue from an NHM specimen)
    Having an additional category for Tissue sample would be very useful, whether sequences have been derived or not.
    Users of this category might be researchers seeking tissues for loan/destructive sampling, who currently have to search basisOfRecord = MaterialSample plus pot luck in Preparations.

  3. Private sector data - do you mean data gathered by companies undertaking environmental impact assessments prior to approval of development/mining projects? If so, in Australia this would commonly be called "Proponent data" (being data from proponents of a development). If Private sector data means something else, perhaps we could have both?

> Remember also that a "dataset" (as in a Darwin-Core-archive dataset) can be a mixed bag of "evidence records" (aka core records, e.g. occurrences) of different categories -- if a category "tag" is designed to apply to all core records in a DwC-A.

Thanks, @dagendresen. My thinking here was to try and decouple this from the class/basisOfRecord issues in Darwin Core to be able to react to reporting/user needs quickly (e.g. introduce a new tag for datasets). Acknowledging that there can be "mixed bag" datasets, my intuition is that most users would appreciate broad filtering to e.g. "omit records that originate from datasets tagged as eDNA" even if there were a few entries in there that might be of some interest, or to produce reports (e.g. growth charts) based on e.g. data originating from datasets tagged as private-sector related. Does this seem reasonable, please?
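For example, the "omit records that originate from datasets tagged as eDNA" case could be expressed as an occurrence download predicate. The and/not/equals structure is the existing GBIF download API, while the `DATASET_CATEGORY` key is hypothetical:

```python
# Sketch of the "omit eDNA-tagged datasets" example as an occurrence download
# predicate. The and/not/equals structure is the existing GBIF download API;
# the "DATASET_CATEGORY" key is hypothetical.
predicate = {
    "type": "and",
    "predicates": [
        {"type": "equals", "key": "COUNTRY", "value": "AU"},
        {
            "type": "not",
            "predicate": {
                "type": "equals",
                "key": "DATASET_CATEGORY",  # hypothetical search parameter
                "value": "ENVIRONMENTAL_DNA",
            },
        },
    ],
}
```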

> I really like this idea.

Thanks, @elywallis - I'll add your input to the list at the top now.

> Private sector data - do you mean data gathered by companies undertaking environmental impact assessments prior to approval of development/mining projects?

I believe that was the intention, yes. I don't know the details, but I'm aware the data management team is increasingly running reports on trends using categories like this. I'll add your comments in the top list, without proposing a final decision.

Slightly off-topic, but perhaps useful:

It may not be known to many, but GBIF is progressively moving vocabularies like this into our integrated vocabulary server. This will allow data managers (including node managers, @dagendresen) to be involved in defining the concepts. Concepts can be hierarchical (e.g. finer categorizations of private data) and once a vocabulary version is released, it is picked up in the data processing pipelines. This is still evolving, but LifeStage is in production now.

What this means for this issue is that, as we find new requirements to categorise datasets for a new report or an emerging community, we'll have the tools in place to accommodate that without needing software developer involvement (it only requires a vocabulary change, followed by tagging datasets).
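As an illustration of the hierarchical concepts mentioned above, a sketch of what a "DatasetCategory" vocabulary could contain, in which "Proponent data" sits under "Private sector data" (field names are illustrative, not the actual vocabulary-server schema):

```python
# Illustrative sketch of a hierarchical "DatasetCategory" vocabulary.
# Field names are a sketch, not the actual vocabulary-server schema.
concepts = [
    {"name": "PrivateSectorData", "parent": None,
     "label": {"en": "Private sector data"}},
    {"name": "ProponentData", "parent": "PrivateSectorData",
     "label": {"en": "Proponent data (environmental impact assessments)"}},
]

def descendants(name, concepts):
    """Return the names of all concepts nested under the given concept."""
    children = [c["name"] for c in concepts if c["parent"] == name]
    return children + [d for c in children for d in descendants(c, concepts)]

# A filter on the broad concept would then also match the finer one:
assert "ProponentData" in descendants("PrivateSectorData", concepts)
```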

"mixed bag" datasets

@timrobertson100 I would (if asked) completely agree that best practice is to avoid "mixed bag" datasets and that a "tag" to enable filtering for a _"purpose-of-reuse"_ would be very useful and welcome! And I believe we could live well with such functionality not applying 100% to "mixed bag" datasets :-)

(apropos -- GBIF Norway is "negotiating" with Norwegian data publishers to "break" up "mixed bag" datasets into smaller datasets that would be more homogeneous)

@timrobertson100 wrote:

> Slightly off-topic, but perhaps useful:
>
> It may not be known to many, but GBIF is progressively moving vocabularies like this into our integrated vocabulary server. This will allow data managers (including node managers, @dagendresen) to be involved in defining the concepts. Concepts can be hierarchical (e.g. finer categorizations of private data) and once a vocabulary version is released, it is picked up in the data processing pipelines. This is still evolving, but LifeStage is in production now.
>
> What this means for this issue is that, as we find new requirements to categorise datasets for a new report or an emerging community, we'll have the tools in place to accommodate that without needing software developer involvement (it only requires a vocabulary change, followed by tagging datasets).

Tim, can you see my ? At some point we need something, a talk from GBIF or a TDWG webinar, about this effort. I think the broader community will find it very enlightening to see how we can use the data we have to improve and understand the data.
