Nltk: word_tokenize replaces characters

Created on 15 Feb 2017  ·  5 Comments  ·  Source: nltk/nltk

When using the word_tokenize function the quotation marks get replaced with different quotation marks.

Example (German):

import nltk
sentence = "\"Ja.\""  # sentence[0] is "
tokens = nltk.word_tokenize(sentence)  # tokens[0] is ``
print(tokens[0] == sentence[0])  # prints False

Is this a bug, or is there a reason behind this behaviour?

All 5 comments

Yes, that is the expected output. The double-quote characters are changed to explicitly denote opening and closing double quotes: an opening " is converted to two backticks (``) and a closing " to two single quotes ('').

>>> from nltk import word_tokenize
>>> sent = '"this is a sentence inside double quotes."'
>>> word_tokenize(sent)
['``', 'this', 'is', 'a', 'sentence', 'inside', 'double', 'quotes', '.', "''"]
>>> word_tokenize(sent)[0]
'``'
>>> len(word_tokenize(sent)[0])
2
>>> word_tokenize(sent)[0] == '`'*2
True
>>> word_tokenize(sent)[-1]
"''"
>>> len(word_tokenize(sent)[-1])
2
>>> word_tokenize(sent)[-1] == "'" * 2
True

I'm not sure what the reason for this behavior is, though. Possibly it is meant to make opening and closing quotes explicit.

Thanks for the explanation.
But when I replace the double quotes with one (or two) single quotes or backticks, this behaviour doesn't occur.
And I think it is a little strange that the tokenizer swaps out parts of the original text, since that could lead to problems and is not really transparent.

I guess I'll have to keep it in mind, but I would prefer that the original elements of the string remain the same.

@mwess After some checking, the conversion from " to `` is an artifact of the original Penn Treebank word tokenizer.

It only happens with double quotes; the substitutions are done by the regex rules in the treebank tokenizer's STARTING_QUOTES and ENDING_QUOTES lists.

As for single quotes, the treebank tokenizer's STARTING_QUOTES regexes don't indicate directionality for them. I think this is kept to be consistent with Penn Treebank annotations.

I hope the clarification helps.
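
To make the mechanism concrete, here is a minimal sketch of how ordered regex rules can turn straight double quotes into directional tokens. It is a simplified approximation written for illustration, not NLTK's actual source: the patterns are reduced, `normalize_quotes` is a hypothetical helper, and the final whitespace split stands in for the rest of the tokenizer.

```python
import re

# Simplified, illustrative rules (an assumption, not copied from NLTK).
STARTING_QUOTES = [
    # A double quote at the very start of the text opens a quotation.
    (re.compile(r'^"'), r'`` '),
    # So does a double quote right after a space or an opening bracket.
    (re.compile(r'([ (\[{<])"'), r'\1 `` '),
]
ENDING_QUOTES = [
    # Any double quote still left over is treated as a closing quote.
    (re.compile(r'"'), " '' "),
]


def normalize_quotes(text):
    """Apply the starting rules first, then the ending rules, in order."""
    for pattern, replacement in STARTING_QUOTES + ENDING_QUOTES:
        text = pattern.sub(replacement, text)
    # Crude whitespace split, only to make the quote tokens visible.
    return text.split()


print(normalize_quotes('"Ja."'))   # ['``', 'Ja.', "''"]
print(normalize_quotes("'Ja.'"))   # ["'Ja.'"] (single quotes are left alone)
```

The order matters: because the starting rules run first, the catch-all ending rule only ever sees the double quotes that were not recognised as opening ones.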

Thank you very much. It actually helps a lot.

Altering the original text is not recommended in many applications. I wish word_tokenize had a flag to turn off altering the text.
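
Until such a flag exists, one possible workaround is to map the quote tokens back after tokenizing. The sketch below assumes the original text used only straight double quotes; `restore_double_quotes` is a hypothetical helper, and the token list is hand-written in the shape shown earlier in this thread rather than produced by a live call.

```python
def restore_double_quotes(tokens):
    """Map Treebank-style quote tokens (`` and '') back to a plain double quote."""
    treebank_quotes = {"``", "''"}
    return ['"' if tok in treebank_quotes else tok for tok in tokens]


# Hand-written tokens in the style word_tokenize returns for quoted text.
tokens = ['``', 'Ja', '.', "''"]
print(restore_double_quotes(tokens))  # ['"', 'Ja', '.', '"']
```

This is lossy in the other direction: if the input already contained a literal `` or '' token, it would also be turned into ", so it is only safe when you know the source text did not use them.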
