Doccano: Feature Request: Token level output

Created on 3 Oct 2019  ·  4 Comments  ·  Source: doccano/doccano

Feature description

doccano currently outputs only character-level annotations. However, some NLP workflows require their input as a list of words plus a parallel list of token labels:

Sample sentence: 
['Two', ',', 'Samsung', 'based', ',', 'electronic', 'cash', 'registers', 'were', 'reconstructed', 'in', 'order', 'to', 'expand', 'their', 'functions', 'and', 'adapt', 'them', 'for', 'networking', '.']

Sample sentence labels: 
['O', 'O', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

An earlier issue (#7) notes that the output is character-level because some languages are not space-separated. I think it would be good to offer token-level output as an option, so that users who annotate space-separated languages can feed it straight into their workflow.

Example taken from: https://github.com/microsoft/nlp/blob/master/examples/named_entity_recognition/ner_wikigold_bert.ipynb
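
A minimal sketch of what such a conversion could look like for space-separated text. It assumes character-level annotations arrive as `(start, end, label)` triples with an exclusive end offset and no overlaps. The function name `spans_to_token_labels`, the regex tokenizer, and the B-/I- prefixing are illustrative assumptions, not part of doccano.

```python
import re

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def spans_to_token_labels(text, spans):
    """Convert (start, end, label) character spans into token-level labels.

    Assumes spans do not overlap and that `end` is exclusive.
    """
    tokens, labels = [], []
    previous_span = None  # span that covered the previous token, if any
    for match in TOKEN_RE.finditer(text):
        tokens.append(match.group())
        covering = None
        for span in spans:
            start, end, _ = span
            # Only tokens lying entirely inside a span receive its label.
            if start <= match.start() and match.end() <= end:
                covering = span
                break
        if covering is None:
            labels.append("O")
        elif covering == previous_span:
            labels.append("I-" + covering[2])
        else:
            labels.append("B-" + covering[2])
        previous_span = covering
    return tokens, labels


text = "Two, Samsung based, electronic cash registers were reconstructed."
tokens, labels = spans_to_token_labels(text, [(5, 12, "ORG")])
```

The sample above tags the single-token entity as 'I-ORG'; this sketch emits 'B-ORG' for the first token of every entity instead, so the prefix convention would need to be a choice made by whoever implements the export.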

feature request

All 4 comments

Hi there! I wrote something similar for myself and I would love to contribute a PR :) However, I am not sure how to handle mislabeled tokens. Namely, what if a token was only partially marked? For my own purposes, I print out the mislabeled token as a warning to the script's user and drop the token annotation, but in production this is not the way to go.

The other thing is: what should the "non-entity" token be? 'O'? Then we should prevent users from adding such a label themselves, which might be misleading for some people. Or maybe we should create a form where the user can provide the token? Or leave it blank?

I think this feature request is great, but we should agree on how exactly to tackle this :smile: I would love to read your suggestions.
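
To make the partially marked token case concrete, here is a small sketch that only detects such tokens and leaves the policy (drop, snap to the token boundary, or reject) open. It assumes whitespace tokenization and `(start, end, label)` spans with an exclusive end offset; `find_partial_tokens` is a hypothetical helper, not doccano code.

```python
import re

# Assumption: tokens are maximal runs of non-whitespace characters.
TOKEN_RE = re.compile(r"\S+")


def find_partial_tokens(text, spans):
    """Return (token, label) pairs for tokens only partly covered by a span."""
    partial = []
    for match in TOKEN_RE.finditer(text):
        for start, end, label in spans:
            overlaps = start < match.end() and match.start() < end
            contained = start <= match.start() and match.end() <= end
            if overlaps and not contained:
                partial.append((match.group(), label))
    return partial


text = "George Washington went home"
# The annotator selected only "Washing" (characters 7 to 14).
partial = find_partial_tokens(text, [(7, 14, "PER")])
```

Reporting these pairs back to the annotator, rather than silently dropping them, is one possible way to handle the concern raised above.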

I have a plan to create a new Python package named doccano-transformer.
It will transform annotated documents into other formats such as https://github.com/chakki-works/doccano/issues/362, https://github.com/chakki-works/doccano/issues/454 and so on. So, the token level output should be included in doccano-transformer.

Any updates on this? I'd like to import a dataset simply as .txt (in which each line is a sentence):

George Washington went to Washington.
Sam Houston stayed home.

... and export it (after annotating) as follows, also in a .txt:

George B-PER
Washington I-PER
went O
to O
Washington B-LOC

Sam B-PER
Houston I-PER
stayed O
home O
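
The export layout shown here (one "token label" pair per line, a blank line between sentences) could be produced along these lines. This is a sketch under the assumption that tokens and labels are already available as parallel lists; `to_conll_text` is a hypothetical name.

```python
def to_conll_text(sentences):
    """Render [(tokens, labels), ...] as one 'token label' pair per line,
    with a blank line between sentences."""
    blocks = []
    for tokens, labels in sentences:
        if len(tokens) != len(labels):
            raise ValueError("tokens and labels must have the same length")
        blocks.append(
            "\n".join(f"{token} {label}" for token, label in zip(tokens, labels))
        )
    return "\n\n".join(blocks) + "\n"


sentences = [
    (["George", "Washington", "went", "to", "Washington"],
     ["B-PER", "I-PER", "O", "O", "B-LOC"]),
    (["Sam", "Houston", "stayed", "home"],
     ["B-PER", "I-PER", "O", "O"]),
]
output = to_conll_text(sentences)
```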

In other words, export in the well-known IOB annotation format. For this, Doccano would need to know automatically that an annotated entity comprising more than one token should be labeled with B (beginning) and I (inside) tags. There are also more sophisticated annotation schemes besides IOB, such as BIOES. Here, S (single) represents a chunk containing a single token, and E (end) marks the last token of a multi-token chunk. The BIOES annotation scheme would result in the following:

George B-PER
Washington E-PER
went O
to O
Washington S-LOC

Sam B-PER
Houston E-PER
stayed O
home O
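
Going from the IOB labels to the BIOES labels is mechanical once the token-level labels exist. The sketch below assumes well-formed input in which every entity starts with a B- tag; `iob2_to_bioes` is an illustrative name, not an existing doccano function.

```python
def iob2_to_bioes(labels):
    """Convert one sentence's IOB labels (B- on every entity start) to BIOES."""
    converted = []
    for i, label in enumerate(labels):
        if label == "O":
            converted.append(label)
            continue
        prefix, entity = label.split("-", 1)
        next_label = labels[i + 1] if i + 1 < len(labels) else "O"
        # The chunk continues only if the next label is I- with the same type.
        continues = next_label == "I-" + entity
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + entity)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + entity)
    return converted


bioes = iob2_to_bioes(["B-PER", "I-PER", "O", "O", "B-LOC"])
```

For the first sentence above, this yields the same labels as listed: B-PER, E-PER, O, O, S-LOC.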

It would be awesome if I could export annotated datasets in the IOB or BIOES (or other) formats. Many state-of-the-art libraries for NER require token-level annotation in order to train models (Flair from Zalando, Transformers from HuggingFace,...).

We released doccano-transformer, which handles data transformation. Currently, the only supported task is named entity recognition, and the supported formats are CoNLL2003 and spaCy.

We have a plan to extend tasks and formats.
Please look forward to it.
