With the current CoreNLPParser.tag(), the "retokenization" by Stanford CoreNLP is unexpected:
>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'),
('phone', 'O'),
('number', 'O'),
('is', 'O'),
('1111\xa01111\xa01111', 'NUMBER')]
The expected behavior should be:
>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111', 'DATE'), ('1111', 'DATE'), ('1111', 'DATE')]
The proposed solution is to allow a properties argument to be passed to .tag() and .tag_sents(), i.e. at https://github.com/nltk/nltk/blob/develop/nltk/parse/corenlp.py#L348, and to use properties = {'tokenize.whitespace':'true'} by default, because we are concatenating the tokens with spaces in tag_sents().
def tag_sents(self, sentences, properties=None):
    """
    Tag multiple sentences.

    Takes multiple sentences as a list where each sentence is a list of
    tokens.

    :param sentences: Input sentences to tag
    :type sentences: list(list(str))
    :rtype: list(list(tuple(str, str))
    """
    # Converting list(list(str)) -> list(str)
    sentences = (' '.join(words) for words in sentences)
    if properties == None:
        properties = {'tokenize.whitespace':'true'}
    return [sentences[0] for sentences in self.raw_tag_sents(sentences, properties)]

def tag(self, sentence, properties=None):
    """
    Tag a list of tokens.

    :rtype: list(tuple(str, str))

    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
    >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
    >>> parser.tag(tokens)
    [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
    ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
    >>> tokens = "What is the airspeed of an unladen swallow ?".split()
    >>> parser.tag(tokens)
    [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
    ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
    ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
    """
    return self.tag_sents([sentence], properties)[0]

def raw_tag_sents(self, sentences, properties=None):
    """
    Tag multiple sentences.

    Takes multiple sentences as a list where each sentence is a string.

    :param sentences: Input sentences to tag
    :type sentences: list(str)
    :rtype: list(list(list(tuple(str, str)))
    """
    default_properties = {'ssplit.isOneSentence': 'true',
                          'annotators': 'tokenize,ssplit,' }

    default_properties.update(properties or {})

    # Supports only 'pos' or 'ner' tags.
    assert self.tagtype in ['pos', 'ner']
    default_properties['annotators'] += self.tagtype
    for sentence in sentences:
        tagged_data = self.api_call(sentence, properties=default_properties)
        yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
               for tagged_sentence in tagged_data['sentences']]
That should preserve the list of string tokens supplied by the user.
If we allow .tag() to override the properties before they reach raw_tag_sents(), that will also let users easily handle cases like #1876.
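To make the override order concrete, here is a minimal sketch of how the properties would be merged under the proposal, written so it runs without a CoreNLP server. The function names build_properties and tagging_properties are hypothetical stand-ins for the logic inside raw_tag_sents() and tag_sents(); they are not part of NLTK.

```python
def build_properties(tagtype, properties=None):
    """Hypothetical helper mirroring the merge order in raw_tag_sents()."""
    if tagtype not in ('pos', 'ner'):
        raise ValueError("CoreNLP tagger supports only 'pos' or 'ner' tags.")
    merged = {
        'ssplit.isOneSentence': 'true',
        'annotators': 'tokenize,ssplit,',
    }
    # Caller-supplied properties win over the built-in defaults.
    merged.update(properties or {})
    merged['annotators'] += tagtype
    return merged


def tagging_properties(tagtype, properties=None):
    """Hypothetical helper mirroring the default applied in tag_sents()."""
    if properties is None:
        # The whitespace default is applied only when nothing is passed.
        properties = {'tokenize.whitespace': 'true'}
    return build_properties(tagtype, properties)


default = tagging_properties('ner')
custom = tagging_properties('ner', {'ssplit.isOneSentence': 'false'})

print(default)
print(custom)
```

One thing the sketch makes visible: as the proposal is written, passing any properties dict replaces the whitespace default instead of merging with it, so custom above no longer contains 'tokenize.whitespace'. Whether that is the desired behavior is a design choice worth settling in the PR.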
Looks good.
Just some minor comments. It should be if properties is None, not if properties == None. assert self.tagtype in ['pos', 'ner'] should be assert self.tagtype in ['pos', 'ner'], "CoreNLP tagger supports only 'pos' or 'ner' tags.".
I don't really like the idea of joining and splitting strings; maybe there could be a way to pass a list of words to CoreNLP as a sentence instead of a simple string.
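For anyone picking this up, the three review points can be illustrated in plain Python, with no server needed. This is only a sketch: AlwaysEqual and check_tagtype are hypothetical names made up for the demonstration, and the last part assumes the server splits purely on whitespace when tokenize.whitespace is set.

```python
class AlwaysEqual:
    """Hypothetical object whose __eq__ claims equality with anything."""

    def __eq__(self, other):
        return True


props = AlwaysEqual()
equals_none = (props == None)  # goes through __eq__, so it can be fooled
is_none = (props is None)      # identity check, cannot be overridden


def check_tagtype(tagtype):
    # The message after the comma is shown when the assertion fails.
    assert tagtype in ['pos', 'ner'], "CoreNLP tagger supports only 'pos' or 'ner' tags."
    return tagtype


message = None
try:
    check_tagtype('lemma')
except AssertionError as err:
    message = str(err)

# Joining on spaces and splitting again is lossy when a token
# itself contains a space.
tokens = ['New York', 'is', 'big']
round_trip = ' '.join(tokens).split(' ')

print(equals_none, is_none)
print(message)
print(round_trip)
```

The last part is the reason joining and splitting is fragile: the round trip only returns the original list when no token contains whitespace.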
Hi, I would like to take this up as my first issue.
It's great you are interested in the issue. If you have any questions, ask them here.