Nltk: CoreNLPParser.tag() should allow properties overloading

Created on 10 Sep 2018 · 3 Comments · Source: nltk/nltk

With CoreNLPParser.tag(), the "retokenization" done by Stanford CoreNLP is unexpected:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'),
 ('phone', 'O'),
 ('number', 'O'),
 ('is', 'O'),
 ('1111\xa01111\xa01111', 'NUMBER')]

The expected behavior is:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111', 'DATE'), ('1111', 'DATE'), ('1111', 'DATE')]

The proposed solution is to allow a properties overloading argument for .tag() and .tag_sents(), i.e. at https://github.com/nltk/nltk/blob/develop/nltk/parse/corenlp.py#L348, and to use properties = {'tokenize.whitespace':'true'} by default, since we are joining the tokens with spaces in tag_sents().


    def tag_sents(self, sentences, properties=None):
        """
        Tag multiple sentences.

        Takes multiple sentences as a list where each sentence is a list of
        tokens.

        :param sentences: Input sentences to tag
        :type sentences: list(list(str))
        :rtype: list(list(tuple(str, str)))
        """
        # Converting list(list(str)) -> list(str)
        sentences = (' '.join(words) for words in sentences)
        if properties == None:
            properties = {'tokenize.whitespace':'true'}
        return [sentences[0] for sentences in self.raw_tag_sents(sentences, properties)]

    def tag(self, sentence, properties=None):
        """
        Tag a list of tokens.

        :rtype: list(tuple(str, str))

        >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
        >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
        >>> parser.tag(tokens)
        [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
        ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

        >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
        >>> tokens = "What is the airspeed of an unladen swallow ?".split()
        >>> parser.tag(tokens)
        [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
        ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
        ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
        """
        return self.tag_sents([sentence], properties)[0]

    def raw_tag_sents(self, sentences, properties=None):
        """
        Tag multiple sentences.

        Takes multiple sentences as a list where each sentence is a string.

        :param sentences: Input sentences to tag
        :type sentences: list(str)
        :rtype: list(list(list(tuple(str, str))))
        """
        default_properties = {'ssplit.isOneSentence': 'true',
                              'annotators': 'tokenize,ssplit,' }

        default_properties.update(properties or {})

        # Supports only 'pos' or 'ner' tags.
        assert self.tagtype in ['pos', 'ner']
        default_properties['annotators'] += self.tagtype
        for sentence in sentences:
            tagged_data = self.api_call(sentence, properties=default_properties)
            yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
                    for tagged_sentence in tagged_data['sentences']]

That should enforce the list of string tokens that the user passed in.
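
The reasoning can be checked locally without a CoreNLP server. The sketch below is only an illustration: whitespace_tokenize is a hypothetical stand-in for a tokenizer that splits on whitespace only (it is not CoreNLP's tokenizer), and round_trips is a hypothetical helper that reports whether joining tokens with spaces and splitting again returns the original list.

```python
def whitespace_tokenize(text):
    """Hypothetical stand-in for a whitespace-only tokenizer (not CoreNLP itself)."""
    return text.split()


def round_trips(tokens):
    """Return True if joining with spaces and re-splitting yields the same tokens."""
    return whitespace_tokenize(' '.join(tokens)) == list(tokens)


sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
print(round_trips(sent))                        # True
print(round_trips(['New York', 'is', 'big']))   # False
```

The second call shows the limit of the approach: a token that itself contains whitespace does not survive the join, so the guarantee only holds for tokens without internal whitespace.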

Details at https://stackoverflow.com/questions/52250268/why-do-corenlp-ner-tagger-and-ner-tagger-join-the-separated-numbers-together

If we allow .tag() to overload the properties before they reach raw_tag_sents, it would also let users easily handle cases like #1876.
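
The override order in the proposal can be sketched without a server. build_properties below is a hypothetical helper, not part of NLTK; it only mirrors the defaulting in tag_sents() and the update() call in raw_tag_sents() from the proposed code, so the resulting dictionary can be inspected directly.

```python
def build_properties(tagtype, properties=None):
    """Hypothetical helper mirroring the proposed defaulting and merging logic."""
    # Default applied in tag_sents() when the caller passes nothing.
    if properties is None:
        properties = {'tokenize.whitespace': 'true'}

    # Defaults applied in raw_tag_sents(); caller-supplied keys override them.
    merged = {'ssplit.isOneSentence': 'true',
              'annotators': 'tokenize,ssplit,'}
    merged.update(properties)

    if tagtype not in ('pos', 'ner'):
        raise ValueError("Only 'pos' or 'ner' tags are supported.")
    merged['annotators'] += tagtype
    return merged


print(build_properties('ner'))
print(build_properties('pos', {'tokenize.whitespace': 'false'}))
```

One consequence worth noting under this sketch: passing any dictionary, even an empty one, replaces the tokenize.whitespace default rather than merging with it, so a caller who overrides one property has to set tokenize.whitespace again if it is still wanted.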

bug goodfirstbug stanford api

Source

alvations

Most helpful comment

Looks good.

Just a few minor comments. It should be if properties is None, not if properties == None. Also, assert self.tagtype in ['pos', 'ner'] should be assert self.tagtype in ['pos', 'ner'], "CoreNLP tagger supports only 'pos' or 'ner' tags."

I don't really like the idea of joining and splitting strings; maybe there is a way to send a list of words to CoreNLP as a sentence, rather than a plain string.

dimazest on 10 Sep 2018

👍2
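
The two review points can be illustrated in isolation. Everything in the sketch below is contrived for illustration: AlwaysEqual is a hypothetical class showing that == can be overridden while an identity check cannot, and check_tagtype is a hypothetical function showing the assertion with a message attached. Neither is NLTK code.

```python
class AlwaysEqual:
    """Contrived class whose == comparison claims equality with anything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()
print(obj == None)   # True: the overridden __eq__ is consulted
print(obj is None)   # False: identity cannot be overridden


def check_tagtype(tagtype):
    """Hypothetical check using an assert that explains the failure."""
    assert tagtype in ['pos', 'ner'], "CoreNLP tagger supports only 'pos' or 'ner' tags."
    return tagtype


try:
    check_tagtype('dep')
except AssertionError as err:
    print(err)
```

With the message attached, the AssertionError tells the user what went wrong instead of surfacing as a bare failure.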

All 3 comments

Hi, I would like to take this as my first issue.