Nltk: word_tokenize replaces characters

Created on 15 Feb 2017 · 5 Comments · Source: nltk/nltk

When using the word_tokenize function, double quotation marks are replaced with different quotation characters.

Example (German):

import nltk
sentence = "\"Ja.\""  # sentence[0] == '"'
tokens = nltk.word_tokenize(sentence)  # tokens[0] == '``'
print(tokens[0] == sentence[0])  # Prints False.

Is this a bug, or is there a reason behind this behaviour?

mwess

All 5 comments

Yes, that is the expected output. Double quotes are changed to explicitly mark opening and closing quotes: an opening " is converted to two backticks (``) and a closing " to two single quotes ('').

>>> from nltk import word_tokenize
>>> sent = '"this is a sentence inside double quotes."'
>>> word_tokenize(sent)
['``', 'this', 'is', 'a', 'sentence', 'inside', 'double', 'quotes', '.', "''"]
>>> word_tokenize(sent)[0]
'``'

>>> len(word_tokenize(sent)[0])
2
>>> word_tokenize(sent)[0] == '`'*2
True

>>> len(word_tokenize(sent)[-1])
2
>>> word_tokenize(sent)[-1] == "'" * 2
True

I'm not sure what the reason for the behavior is, though. Possibly it's to be explicit about identifying opening and closing quotes.
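
If the converted tokens get in the way, one option is to map them back after tokenizing. This is only a minimal sketch: restore_double_quotes and TREEBANK_QUOTES are hypothetical names, not part of NLTK, and the token list is copied from the session above so the snippet runs without NLTK installed. Note that it would also rewrite a literal `` or '' that was already present in the input.

```python
# Hypothetical helper, not part of NLTK: map Treebank-style quote tokens
# back to the plain ASCII double quote.
TREEBANK_QUOTES = {"``": '"', "''": '"'}


def restore_double_quotes(tokens):
    """Return a new list with `` and '' tokens replaced by "."""
    return [TREEBANK_QUOTES.get(tok, tok) for tok in tokens]


# Token list copied from the word_tokenize session above.
tokens = ['``', 'this', 'is', 'a', 'sentence', 'inside', 'double', 'quotes', '.', "''"]
restored = restore_double_quotes(tokens)
print(restored[0] == '"', restored[-1] == '"')
```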

alvations on 15 Feb 2017

Thanks for the explanation.
But when I replace the double quotes with one (or two) single quotes or backticks, this behaviour doesn't occur.
And I find it a little strange that the tokenizer swaps out parts of the original text, since that could lead to problems and is not really transparent.

I guess I'll have to keep it in mind, but I would prefer that the original elements of the string remain the same.

mwess on 21 Feb 2017

@mwess After some checking, the conversion from " to `` is an artifact of the original Penn Treebank word tokenizer.

It only happens with double quotes; the regex rules that do the substitutions are at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L49

As for single quotes, the treebank tokenizer's STARTING_QUOTES regexes show that it doesn't indicate directionality for them. I think this is kept to be consistent with Penn Treebank annotations.
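
To make the idea concrete, here is a rough sketch of this kind of substitution. It is a simplified illustration written for this thread and is not a copy of the rules in treebank.py; the function name treebank_style_quotes and the exact patterns are assumptions.

```python
import re


# Simplified illustration only; these are NOT NLTK's exact rules.
def treebank_style_quotes(text):
    # A double quote at the very start of the text is an opening quote.
    text = re.sub(r'^"', "``", text)
    # A double quote right after a space or an opening bracket is also opening.
    text = re.sub(r'([ (\[{<])"', r"\1``", text)
    # Any double quote still left is treated as a closing quote.
    text = re.sub(r'"', "''", text)
    return text


print(treebank_style_quotes('"Ja."'))
print(treebank_style_quotes("'Ja.'"))
```

In this sketch only the double quotes are rewritten, so the second call returns its input unchanged, which matches the behaviour reported above for single quotes.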

I hope the clarification helps.

alvations on 5 May 2017

Thank you very much. It actually helps a lot.

mwess on 5 May 2017

Altering the original text is not recommended in many applications. I wish word_tokenize had a flag to turn this off.
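
Until something like that exists, one possible workaround is to align the tokens back to character offsets in the original string and read the untouched text from there. The following is a sketch under assumptions: token_spans is a hypothetical helper, not an NLTK function, and the token list is written by hand in the shape reported in this thread so that the snippet runs without NLTK.

```python
def token_spans(text, tokens):
    """Return a list of (start, end, original_substring), one per token."""
    spans = []
    pos = 0
    for tok in tokens:
        # Surface forms to look for: the token itself, plus '"' when the
        # token may be a converted double quote.
        candidates = [tok] + (['"'] if tok in ("``", "''") else [])
        hits = [(text.find(c, pos), c) for c in candidates]
        hits = [(i, c) for i, c in hits if i != -1]
        if not hits:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        # Take the earliest match so tokens stay in document order.
        start, surface = min(hits)
        spans.append((start, start + len(surface), surface))
        pos = start + len(surface)
    return spans


text = '"Ja."'
tokens = ['``', 'Ja', '.', "''"]  # assumed tokenizer output, written by hand
spans = token_spans(text, tokens)
print([surface for _, _, surface in spans])
```

The third element of each tuple is sliced from the original text, so the quotes come back exactly as they were typed.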