Nltk: CoreNLPParser.tag() should allow properties overloading.

Created on 10 Sep 2018 · 3 comments · Source: nltk/nltk

Currently, using CoreNLPParser.tag() triggers an unexpected "re-tokenization" by Stanford CoreNLP:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'),
 ('phone', 'O'),
 ('number', 'O'),
 ('is', 'O'),
 ('1111\xa01111\xa01111', 'NUMBER')]

The expected behavior should be:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111', 'DATE'), ('1111', 'DATE'), ('1111', 'DATE')]
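
A quick way to detect this kind of re-tokenization on the caller's side is to compare the words that come back with the tokens that went in. The sketch below uses only the two outputs quoted above; `retokenized` is a hypothetical helper name, not part of NLTK.

```python
def retokenized(tokens, tagged):
    """Return True if the tagged output no longer lines up with the input tokens.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    return [word for word, _tag in tagged] != list(tokens)


sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']

# Output reported in this issue: the three numbers were merged into one token.
observed = [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
            ('1111\xa01111\xa01111', 'NUMBER')]

# Output the issue expects: one tag per input token.
expected = [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
            ('1111', 'DATE'), ('1111', 'DATE'), ('1111', 'DATE')]

print(retokenized(sent, observed))
print(retokenized(sent, expected))
```

The first call reports a mismatch (seven tokens in, five out) and the second does not.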

The proposed solution is to allow the properties argument to be overloaded in .tag() and .tag_sents() (e.g. at https://github.com/nltk/nltk/blob/develop/nltk/parse/corenlp.py#L348) and to use properties = {'tokenize.whitespace':'true'} by default, because tag_sents() joins the tokens with spaces before sending them to CoreNLP.


    def tag_sents(self, sentences, properties=None):
        """
        Tag multiple sentences.

        Takes multiple sentences as a list where each sentence is a list of
        tokens.

        :param sentences: Input sentences to tag
        :type sentences: list(list(str))
        :rtype: list(list(tuple(str, str)))
        """
        # Converting list(list(str)) -> list(str)
        sentences = (' '.join(words) for words in sentences)
        if properties == None:
            properties = {'tokenize.whitespace':'true'}
        return [sentences[0] for sentences in self.raw_tag_sents(sentences, properties)]

    def tag(self, sentence, properties=None):
        """
        Tag a list of tokens.

        :rtype: list(tuple(str, str))

        >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
        >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
        >>> parser.tag(tokens)
        [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
        ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

        >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
        >>> tokens = "What is the airspeed of an unladen swallow ?".split()
        >>> parser.tag(tokens)
        [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
        ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
        ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
        """
        return self.tag_sents([sentence], properties)[0]

    def raw_tag_sents(self, sentences, properties=None):
        """
        Tag multiple sentences.

        Takes multiple sentences as a list where each sentence is a string.

        :param sentences: Input sentences to tag
        :type sentences: list(str)
        :rtype: list(list(list(tuple(str, str))))
        """
        default_properties = {'ssplit.isOneSentence': 'true',
                              'annotators': 'tokenize,ssplit,' }

        default_properties.update(properties or {})

        # Supports only 'pos' or 'ner' tags.
        assert self.tagtype in ['pos', 'ner']
        default_properties['annotators'] += self.tagtype
        for sentence in sentences:
            tagged_data = self.api_call(sentence, properties=default_properties)
            yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
                    for tagged_sentence in tagged_data['sentences']]
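
To see how the defaults and a caller's overrides would combine without running a CoreNLP server, the merging step from the proposed raw_tag_sents() can be pulled out into a standalone function. This is only a sketch: `merge_properties` is a hypothetical name, not part of NLTK, and it copies the merge logic from the code above.

```python
def merge_properties(tagtype, properties=None):
    """Mirror the property merge in the proposed raw_tag_sents().

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    # Same check as in the proposal: only 'pos' or 'ner' tags are supported.
    assert tagtype in ['pos', 'ner'], "CoreNLP tagger supports only 'pos' or 'ner' tags."
    merged = {'ssplit.isOneSentence': 'true',
              'annotators': 'tokenize,ssplit,'}
    # Caller-supplied properties win over the defaults.
    merged.update(properties or {})
    # The tag type is appended to whatever 'annotators' holds at this point.
    merged['annotators'] += tagtype
    return merged


print(merge_properties('ner', {'tokenize.whitespace': 'true'}))
```

One consequence of this design is that the tag type is always appended to the annotators string. A caller who overrides 'annotators' with a value that does not end in a comma, such as 'tokenize,ssplit,pos', would end up with 'tokenize,ssplit,posner', so overrides of that key need care.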

The solution should enforce the list of string tokens that the user passed in.

Details at https://stackoverflow.com/questions/52250268/why-do-corenlp-ner-tagger-and-ner-tagger-join-the-separated-numbers-together

Allowing .tag() to overload the properties before they reach raw_tag_sents would also let users easily handle cases such as #1876.

bug goodfirstbug stanford api

Source

alvations


Looks good.

A few minor comments. It should be if properties is None rather than if properties == None. assert self.tagtype in ['pos', 'ner'] should be assert self.tagtype in ['pos', 'ner'], "CoreNLP tagger supports only 'pos' or 'ner' tags."

I really don't like the idea of joining and splitting strings. There might be a way to pass a list of words to CoreNLP as a sentence instead of a plain string.

dimazest on 10 Sep 2018

👍2
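
The concern about joining and splitting strings can be illustrated without CoreNLP at all. The sketch below is plain Python with made-up tokens; it says nothing about how CoreNLP itself tokenizes, only that a space-joined string does not record the original token boundaries.

```python
# Made-up input: the first token contains a space of its own.
tokens = ['New York', 'is', 'big']

# Joining with spaces, as tag_sents() does, discards the token boundaries.
joined = ' '.join(tokens)

# Splitting on the same separator cannot tell the two kinds of space apart.
round_trip = joined.split(' ')

print(len(tokens), len(round_trip))
print(round_trip == tokens)

# Python's argument-less split() also treats a non-breaking space as whitespace,
# so a merged token like the one reported in this issue splits back into three.
print('1111\xa01111\xa01111'.split())
```

Three tokens go in and four come out, so the result can no longer be aligned one to one with the caller's list.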


Hi, I would like to take this issue as my first issue.

rishabhkumar296 on 6 Jan 2019

It's great that you are interested in the issue. If you have any questions, ask them here.

dimazest on 7 Jan 2019
