Doccano: Can't upload a file with line break inside text

Created on 12 Aug 2019  ·  3 Comments  ·  Source: doccano/doccano

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Windows 10
  • Python version:
    3.6.4

Describe the problem

I'm trying to upload a file in which the texts are not single lines; they can contain line breaks. Even when I use the JSON file format, so that each text is a property rather than a line, Doccano still seems to split the text at line breaks on upload.

Source code / logs

For instance, this is a JSON file I'm trying to upload with a single text inside of it:

[{"text": "Processo 0000637-15.2012.8.12.0003 (003.12.000637-8) - Procedimento Comum - Inadimplemento Reqte: Fabiano Neves Gon\u00e7alves ADV: PAULO DE TARSO AZEVEDO PEGOLO (OAB 10789/MS) ADV: HENRIQUE LIMA (OAB 9979/MS) ADV: GUILHERME FERREIRA DE BRITO (OAB 9982/MS) ADV: RODRIGO LOUREIRO (OAB 13583/MS) ADV: FRANCIELLI SANCHEZ SALAZAR (OAB 15140/MS) ADV: JAC\u00d3 CARLOS SILVA COELHO (OAB 15155A/MS) ADV: IVONE CONCEI\u00c7\u00c3O SILVA (OAB 13609/MS) 1.\nCom o tr\u00e2nsito em julgado da senten\u00e7a de fl. 393 e satisfa\u00e7\u00e3o integral do cr\u00e9dito, o of\u00edcio jurisdicional acha-se cumprido e acabado, raz\u00e3o por que indefiro o pedido de digitaliza\u00e7\u00e3o do feito (fl. 416).\nAdemais, tramitam nessa unidade judici\u00e1ria milhares de processos e se for admitida a digitaliza\u00e7\u00e3o de todos os feitos finalizados, haver\u00e1 atraso injustificado nas atividades do cart\u00f3rio, pois \u00e9 necess\u00e1rio grande lapso temporal do servidor para este fim.\n2.\nDever\u00e1 o cart\u00f3rio promover a retifica\u00e7\u00e3o do advogado da Mafre Vida S/A no sistema SAJ, para futuras publica\u00e7\u00f5es e intima\u00e7\u00f5es, conforme declinado \u00e0 fl. 416.\nIntimem-se.\nAp\u00f3s, arquive-se."}]

The idea was to see the text with its line breaks during annotation, but instead Doccano turned every sentence of the text into a separate document. For comparison, this is how the same text looked after upload:

image

As the image shows, the text was broken at every line break, and every substring was treated as a separate document.

Most helpful comment

I managed to solve it with a .jsonl file. The data I showed previously was saved as follows:

{"text": "Processo 0000637-15.2012.8.12.0003 (003.12.000637-8) - Procedimento Comum - Inadimplemento Reqte: Fabiano Neves Gonçalves ADV: PAULO DE TARSO AZEVEDO PEGOLO (OAB 10789/MS) ADV: HENRIQUE LIMA (OAB 9979/MS) ADV: GUILHERME FERREIRA DE BRITO (OAB 9982/MS) ADV: RODRIGO LOUREIRO (OAB 13583/MS) ADV: FRANCIELLI SANCHEZ SALAZAR (OAB 15140/MS) ADV: JACÓ CARLOS SILVA COELHO (OAB 15155A/MS) ADV: IVONE CONCEIÇÃO SILVA (OAB 13609/MS) 1.\n\nCom o trânsito em julgado da sentença de fl. 393 e satisfação integral do crédito, o ofício jurisdicional acha-se cumprido e acabado, razão por que indefiro o pedido de digitalização do feito (fl. 416).\n\nAdemais, tramitam nessa unidade judiciária milhares de processos e se for admitida a digitalização de todos os feitos finalizados, haverá atraso injustificado nas atividades do cartório, pois é necessário grande lapso temporal do servidor para este fim.\n\n2.\n\nDeverá o cartório promover a retificação do advogado da Mafre Vida S/A no sistema SAJ, para futuras publicações e intimações, conforme declinado à fl. 416.\n\nIntimem-se.\n\nApós, arquive-se.", "labels": []}

I saved every document on a single line, with a "\n" for every line break. The line breaks don't show up in the "Dataset" section:

image

When annotating the examples, the line breaks are rendered successfully:

image

The same approach works with the CSV format, but unfortunately CSV requires at least one label and does not accept an empty label list. Because I don't want to send any label value, I had to use the JSONL format, which seems to be the only one that allows an empty label array. The plain text (txt) format expects one example per line, so it cannot support line breaks at all.
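
The conversion described above can be sketched in Python. This is only a minimal sketch, not Doccano code: it assumes the input is a list of objects with a "text" key, as in the JSON file from the question, and the function name `to_jsonl` is hypothetical. It relies on the fact that `json.dumps` escapes line breaks inside strings as "\n", so each document ends up on one physical line.

```python
import json


def to_jsonl(documents):
    """Serialize a list of {"text": ...} dicts as JSONL, one document per line.

    `to_jsonl` is a hypothetical helper for illustration, not part of Doccano.
    """
    lines = []
    for doc in documents:
        # Keep existing labels if present, otherwise use an empty label array.
        record = {"text": doc["text"], "labels": doc.get("labels", [])}
        # json.dumps escapes line breaks inside strings as \n, so the record
        # stays on one physical line; ensure_ascii=False keeps accented
        # characters readable instead of \uXXXX escapes.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Small in-memory sample; the text contains two real line breaks.
sample = [{"text": "1.\nIntimem-se.\nApós, arquive-se."}]
print(to_jsonl(sample), end="")
```

To convert a real file, the list passed to `to_jsonl` could come from `json.load` on the original JSON file, and the returned string could be written to a file opened with `encoding="utf-8"`.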

All 3 comments

We don't support text that includes line breaks. Please refer to the discussion at #34.

I copied your example and saved it as a.txt, but it does not work; the line breaks are still not rendered.

My project type is sequence labeling.
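
One thing that can be checked before uploading is whether every line of the file is a complete JSON record on its own. The following is a minimal sketch under that assumption, not Doccano code: it expects each non-empty line to be a JSON object with a string "text" field, and `check_jsonl_lines` is a hypothetical helper name.

```python
import json


def check_jsonl_lines(raw):
    """Return (line_number, problem) pairs for lines that are not usable records.

    `check_jsonl_lines` is a hypothetical helper for illustration only.
    """
    problems = []
    for number, line in enumerate(raw.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, e.g. the one after a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            # A raw (unescaped) line break inside a text cuts the record in
            # two, and neither half parses as JSON.
            problems.append((number, "not valid JSON: " + err.msg))
            continue
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            problems.append((number, 'missing a string "text" field'))
    return problems
```

An empty result means every line parsed as a record with a "text" string. It does not say anything about which upload format was selected, which, according to the comment above, also matters.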
