The current Dataset has type and subtype, which is slightly problematic. Type really indicates the row format used in the DwC-A, and causes problems since a checklist can have occurrences, and an occurrence dataset can in fact be the output of sampling-event data.
Better use of SubType may help, but I feel it could add more confusion due to the overlap (e.g. an occurrence dataset with subtype sampling event).
Since the API is now so well used and changing this is disruptive, I propose to introduce a new multi-value field named category
to categorize datasets. In time we can deprecate type and subtype.
The categories would include the likes of (edited to include suggestions that came in from chat below):
The multiple categories would be added to each occurrence record at indexing, allowing an intuitive filter to be added in GBIF.org so people can select on/off the dataset categories that interest them.
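To make the indexing and filtering idea concrete, here is a minimal sketch. All dataset keys, category names, and function names are invented for illustration; this is not the actual GBIF schema or pipeline code:

```python
# Hypothetical sketch of the proposed multi-value "category" tagging.
# Categories are assigned at the dataset level (multi-value: a dataset
# may belong to several categories at once).
DATASET_CATEGORIES = {
    "dataset-A": {"citizen-science", "machine-observation"},
    "dataset-B": {"environmental-edna"},
    "dataset-C": {"sampling-event"},
}

def index_occurrence(occurrence):
    """Copy the parent dataset's categories onto the record at indexing time."""
    occurrence["categories"] = set(
        DATASET_CATEGORIES.get(occurrence["datasetKey"], set())
    )
    return occurrence

def filter_occurrences(occurrences, include=None, exclude=None):
    """Keep records matching the on/off category toggles of the filter."""
    kept = []
    for occ in occurrences:
        cats = occ["categories"]
        if exclude and cats & exclude:
            continue  # record comes from a dataset with a switched-off category
        if include and not (cats & include):
            continue  # no switched-on category matches this record
        kept.append(occ)
    return kept

occurrences = [
    index_occurrence({"key": i, "datasetKey": d})
    for i, d in enumerate(["dataset-A", "dataset-B", "dataset-C"])
]

# e.g. a user switches off everything tagged as eDNA
non_edna = filter_occurrences(occurrences, exclude={"environmental-edna"})
```

Because the categories are denormalized onto each record at indexing time, the filter needs no join back to the dataset at query time.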
CC @ahahn-gbif @MortenHofft for comments in particular
Thanks!
~Assuming this will also support metrics (and understanding that multivalue means that a dataset can belong to more than one category), I would like to add~
~9. private sector data~
~10. tracking data (i.e. recaptures or GPS tracking of individual organisms)~
[Tim: Thanks - Added above!]
Question: should 4. metagenomic (eDNA) be two separate categories? There is quite a difference in interpretation of these data, even though they are both "sequence based". @ManonGros, would you comment?
[Tim Edited to add: I've split them above now, but will change again based on more comments]
Machine observation seems like a sub category of Sampling Event.
> Machine observation seems like a sub category of Sampling Event.
That's OK, isn't it? Because it's multi-value, a dataset can be marked as both, or just sampling event; or perhaps there are cases where machine observation would be appropriate when no real sampling protocol is used.
This new category would be free text using the vocab server? Or are we trying to have all the categories defined?
> This new category would be free text using the vocab server? Or are we trying to have all the categories defined?
~Undecided, but at this point we're proposing the categories~
Revised: I'd now suggest the vocabulary server, as detailed later in this thread.
Great! I love the idea!
~Just one comment:~
~> 4. Single organism metagenomic (i.e. tissue from an NHM specimen)~
~> 5. Environmental eDNA (e.g. soil sample, water, insect soup etc)~
~Number 4 doesn't seem right. What I understand when reading "Single organism metagenomic" is that someone took a gut sample of a cow (for example) and sequenced it, resulting in a bunch of occurrences for the gut microbiome. I guess this isn't the idea, is it?~
~If you mean that tissues from a specimen were sequenced, then I would write something more along the lines of "Single organism sequenced". And actually, we could group metagenomics with eDNA (often eDNA is metagenomics). So in the end, I think we could do something like:~
~4. Single organism sequenced (i.e. tissue from an NHM specimen)~
~5. Environmental eDNA and/or metagenomics (e.g. soil sample, water, insect soup etc)~
[Tim: Edited with suggestions expressed here - thanks, you indeed understood what I intended!]
Perhaps @thomasstjerne has some thoughts on this?
Added Targeted species detection (PCR-based assays)
Thanks @timrobertson100 for making me aware of the thread, very exciting. So far, I found eight likely independent variables that may determine the evidence / dataset type in GBIF. I need to meditate a bit more before presenting my views here, and happy to brainstorm / whiteboard a bit if people are available?
Keeping track of this as well
Hello all, I like the idea of sorting datasets and types of evidence, but I am not sure it is most attractive for users to do so using a single filter / vocabulary (though I understand the feasibility argument as put by Tim). I drew some mind maps but don't have time to add pictures here, so I'll just type for your consideration. I started from thinking about why users would need to sort datasets / types of evidence: it is a quick way to in/exclude types of data that matter for your cases based on how the evidence was generated and its properties. I came up with eight independent variables that cross over the suggested categorization of the dataset and the basisOfRecord vocabulary as we have it today. Note that I think the word independent is important here, though some of the combinations of 1-8 below are impossible in real life.
I am using loose words to describe my thinking; this is not a vocabulary I am suggesting, and there are some unresolved overlaps:
Once again, this is just a capture of unfinished thoughts; it would be nice to brainstorm / whiteboard what good categorization would look like. I was thinking of how to slice it out, as e.g. 1, 7, and 13 in the original post can be simultaneously true. If these are tags and overlap is no problem, then fine. But if this is a strict filter, we may need more than one field to capture types of preservation vs. generating community vs. ways of generating vs. quantitativeness etc. Feel free to discard if out of scope. I also did not find the collection of BoR discussions, which is partly applicable here.
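The overlap point above can be sketched in a few lines: if categories are tags rather than a single strict field, a dataset can satisfy several of them at once. Tag and dataset names below are made up for illustration:

```python
# Sketch of the overlap problem: with tags (sets), several of the
# "independent variables" can be true for the same dataset at once,
# which a single strict single-valued field cannot express.
dataset_tags = {
    "dataset-X": {"citizen-science", "machine-observation", "quantitative"},
    "dataset-Y": {"museum-specimen", "tissue-sample"},
}

def select(tagged, required):
    """Datasets whose tag set contains every required tag (overlap is fine)."""
    return sorted(name for name, tags in tagged.items() if required <= tags)
```

With a strict one-value filter, "dataset-X" would have to be either citizen science or machine observation; as tags, `select(dataset_tags, {"citizen-science", "machine-observation"})` matches it on both at once.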
I assume the categorisations would come from us (at least that's how it is at the moment for citizen science datasets) but it would be great if other people could help with the curation as well. Just something to keep in mind.
For example, let's say that we ask Node managers to check the datasets tagged "citizen science". We want:
Looking at this issue: https://github.com/gbif/portal-feedback/issues/3381, we would be missing the data extracted from taxonomic literature (i.e., Plazi) category.
[You are right, I missed it!]
Thanks @ManonGros
> Looking at this issue: gbif/portal-feedback#3381, we would be missing the data extracted from taxonomic literature (i.e., Plazi) category.
That is what this was intended to be:
> Material citations (e.g. taxonomic treatments in literature)
(Relatedly, Plazi just proposed Material citation as an addition to the basisOfRecord vocabulary in the Darwin Core issues for public commentary)
+1 @Dmitry for one to many and using keyword tags (instead of a 1:1 core record to category)
+1 @Marie for thinking of enabling Node staff to curate categories --> and can also add a feature request for enabling anybody to annotate a datapoint/set with category information (with provenance intact)
Remember also that a "dataset" (as in a Darwin-Core-archive dataset) can be a mixed bag of "evidence records" (aka core records, e.g. occurrences) of different categories -- if a category "tag" is designed to apply to all core records in a DwC-A.
And the de-normalization of the "evidence records" (core records) means that one cannot be certain which class a given property linked to a core record is intended to be linked to.
I really like this idea. Certainly the ALA has users who want a very simple way to select groupings of records across data providers. The group I hear this request from most are curators/researchers who ‘just’ want museum or herbarium specimens.
A couple of suggestions:
> Single organism sequenced (i.e. tissue from an NHM specimen)
Having an additional category for Tissue sample would be very useful, whether sequences have been derived or not.
Users of this category might be researchers seeking tissues for loan/destructive sampling who currently have to search BasisOfRecord = material sample plus Preparations pot luck.
Private sector data - do you mean data gathered by companies undertaking environmental impact assessments prior to approval of development/mining projects? If so, in Australia this would commonly be called “Proponent data” (being data from proponents of a development). If Private sector data means something else, perhaps could have both?
> Remember also that a "dataset" (as in Darwin-Core-archive-dataset) can be a mixed bag of "evidence records" (aka core record, eg. aka occurrences) of different categories -- if a category "tag" is designed to apply to all core records in a DwC-A
Thanks, @dagendresen. My thinking here was to try to decouple this from the class/basisOfRecord issues in Darwin Core, to be able to react to reporting/user needs quickly (e.g. introduce a new tag for datasets). Acknowledging that there can be "mixed bag" datasets, my intuition is that most users would appreciate broad filtering, e.g. "omit records that originate from datasets tagged as eDNA", even if there were a few entries in there that might be of some interest, or to produce reports (e.g. growth charts) based on data originating from datasets tagged as private-sector related. Does this seem reasonable?
> I really like this idea.
Thanks, @elywallis - I'll add your input to the list at the top now.
> Private sector data - do you mean data gathered by companies undertaking environmental impact assessments prior to approval of development/mining projects?
I believe that was the intention, yes. I don't know the details, but I'm aware the data management team is increasingly running reports on trends using categories like this. I'll add your comments in the top list, without proposing a final decision.
Slightly off-topic, but perhaps useful:
It may not be known to many, but GBIF is progressively moving vocabularies like this into our integrated vocabulary server. This will allow data managers (e.g. including node managers @dagendresen) to be involved in defining the concepts. Concepts can be hierarchical (e.g. finer categorizations of private data) and once a vocabulary version is released, it is picked up in the data processing pipelines. This is still evolving, but LifeStage is in production now.
What this means for this issue is that, as we find new requirements to categorise datasets for a new report or an emerging community, we'll have the tools in place to accommodate that without needing software developer involvement (it only requires a vocabulary change, followed by tagging datasets).
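As a rough illustration of how hierarchical concepts could behave (concept names are invented; the real definitions would live in the vocabulary server), a filter on a broad concept can also match its narrower children:

```python
# Rough illustration of hierarchical vocabulary concepts: a filter on a
# broad concept also matches its narrower children, so finer categories
# can be added later by editing only the vocabulary.
PARENT = {
    "private-sector": None,  # top-level concept
    "environmental-impact-assessment": "private-sector",
    "mining-baseline-survey": "environmental-impact-assessment",
}

def is_a(concept, ancestor):
    """Walk up the hierarchy to test broader/narrower concept matching."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False
```

Here `is_a("mining-baseline-survey", "private-sector")` holds while the reverse does not, so releasing a finer child concept never breaks an existing broad filter.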
> "mixed bag" datasets
@timrobertson100 I would (if asked) completely agree that best practice is to avoid "mixed bag" datasets, and that a "tag" enabling a filter for a _"purpose-of-reuse"_ would be very useful and welcome! And I believe we could live well with such functionality not applying 100% to "mixed bag" datasets :-)
(apropos -- GBIF Norway is "negotiating" with Norwegian data publishers to "break" up "mixed bag" datasets into smaller datasets that would be more homogenous)
@timrobertson100 wrote:
> Slightly off-topic, but perhaps useful:
> It may not be known to many, but GBIF is progressively moving vocabularies like this into our integrated vocabulary server. This will allow data managers (e.g. including node managers @dagendresen) to be involved in defining the concepts. Concepts can be hierarchical (e.g. finer categorizations of private data) and once a vocabulary version is released, it is picked up in the data processing pipelines. This is still evolving, but LifeStage is in production now.
> What this means relating to this issue, is that as we find new requirements to categorise datasets for a new report or community we see emerging, we'll have the tools in place to accommodate that without needing software developer involvement (only requires a vocabulary to be changed, and then proceed with tagging datasets).
Tim, can you see my