When using the word_tokenize function the quotation marks get replaced with different quotation marks.
Example (German):
import nltk
sentence = "\"Ja.\""  # sentence[0] == '"'
tokens = nltk.word_tokenize(sentence)  # tokens[0] == '``'
print(tokens[0] == sentence[0])  # Prints False.
Is this a bug or is there a reasoning behind this behaviour?
Yes, that is the expected output. Double quotation marks are converted so that opening and closing quotes are explicitly distinguished: an opening " becomes two backticks (``) and a closing " becomes two single quotes ('').
>>> from nltk import word_tokenize
>>> sent = '"this is a sentence inside double quotes."'
>>> word_tokenize(sent)
['``', 'this', 'is', 'a', 'sentence', 'inside', 'double', 'quotes', '.', "''"]
>>> word_tokenize(sent)[0]
'``'
>>> len(word_tokenize(sent)[0])
2
>>> word_tokenize(sent)[0] == '`'*2
True
>>> len(word_tokenize(sent)[-1])
2
>>> word_tokenize(sent)[-1] == "'" * 2
True
I'm not sure what the reason for this behavior is, though. Possibly it is meant to make opening and closing quotes explicit.
Thanks for the explanation.
But when I replace the double quotes with one (or two) single quotes or backticks, this behaviour doesn't occur.
And I think it is a little strange that the tokenizer swaps out parts of the original text, since this could lead to problems and is not really transparent.
I guess I'll have to keep it in mind, but I would prefer that the original elements of the string remain the same.
@mwess After some checking, the conversion from " to `` is an artifact of the original Penn Treebank word tokenizer.
It only happens for double quotes; the regex rules that do the substitution are at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L49
As for single quotes, looking at the treebank tokenizer's STARTING_QUOTES regexes, we see that they don't indicate directionality. I think this is kept to be consistent with Penn Treebank annotations.
I hope the clarification helps.
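To make the substitution concrete, here is a rough sketch of the idea using only the standard library. It is a simplified approximation and not NLTK's actual code: the function name convert_double_quotes and the three patterns are my own assumptions for illustration, and the real rules in treebank.py also handle punctuation, contractions and other cases that are left out here.

```python
import re

def convert_double_quotes(text):
    """Hypothetical, simplified imitation of Treebank-style quote handling."""
    # A double quote at the very start of the string is treated as opening.
    text = re.sub(r'^"', '`` ', text)
    # A double quote right after a space or opening bracket is also opening.
    text = re.sub(r'([ (\[{<])"', r'\1`` ', text)
    # Any double quote still left is treated as closing.
    text = re.sub(r'"', " '' ", text)
    # Plain whitespace split; unlike NLTK this does not split off punctuation.
    return text.split()

print(convert_double_quotes('"Ja."'))
print(convert_double_quotes("'Ja.'"))
```

Note that the input with single quotes comes back unchanged, which matches what was observed above: only double quotes get a directional replacement.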
Thank you very much. It actually helps a lot.
Altering the original text is not recommended in many applications. I wish word_tokenize had a flag to turn off altering the text.
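In the absence of such a flag, one possible workaround is to map the converted tokens back after tokenizing. This is only a sketch: restore_double_quotes is a hypothetical helper, not part of NLTK, and it assumes the original text used only the plain " character. If the text itself contained a literal `` or '', those tokens would be turned into " as well.

```python
def restore_double_quotes(tokens):
    """Hypothetical helper: turn `` and '' tokens back into a plain double quote."""
    # Assumes every `` or '' token came from a double quote in the original text.
    converted = {"``", "''"}
    return ['"' if tok in converted else tok for tok in tokens]

# Token list as shown for the example sentence earlier in this thread.
tokens = ['``', 'this', 'is', 'a', 'sentence', 'inside', 'double', 'quotes', '.', "''"]
print(restore_double_quotes(tokens))
```

The helper works on any list of strings, so it can be applied directly to the result of word_tokenize.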