Nltk: Splitting sentences fails in some corner cases

Created on 26 Aug 2019 · 3 comments · Source: nltk/nltk

I understand how difficult it is to split sentences that contain abbreviations, and that adding abbreviations can have pitfalls, as nicely explained in #2154. However, I have stumbled upon some corner cases that I would like to ask about. It looks like using any of the following

e.g.
i.e.
et al.

in a sentence will split the sentence in the wrong place.

Example for e.g. and i.e.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
                "(and therefore italicized), no need to put e.g. or i.e. in "
                "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
 'or i.e.',
 'in italics when they’re in abbreviated form.']

Example for et al.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
                "with the prototype. However, this is very unlikely because "
                "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
 'get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

On my laptop I am using nltk.__version__ 3.4.5.

As I see it, this issue is different from #2154 because these are well-known and commonly used abbreviations (especially in academic circles).

nice idea tokenizer

vezeli

👍2

Most helpful comment

Quick hack, following #2154

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely because they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.',
'However, this is very unlikely because they did not publish sufficiently last year.']

But maybe it is a good idea to have an improved sentence tokenizer (#2008, #1214), like what we did with the word tokenizer (#2355).

For example, as a first step we could easily add the entries of nltk.corpus.nonbreaking_prefixes to punkt._params.abbrev_types.
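A minimal sketch of the same hack extended to all three abbreviations from the issue. It assumes an untrained `PunktSentenceTokenizer` (so nothing has to be downloaded) instead of the pickled English model loaded above, which means everything except the registered abbreviations is left to Punkt's defaults. `make_tokenizer` and `EXTRA_ABBREVIATIONS` are names made up for this illustration, and `_params` is a private attribute that may change between NLTK versions.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Abbreviations from this issue, written the way Punkt stores them:
# lowercase, without the trailing period. "et al." is two tokens, so only
# the token that carries the period ("al") is registered.
EXTRA_ABBREVIATIONS = {"e.g", "i.e", "al"}


def make_tokenizer(extra_abbreviations):
    """Return an untrained Punkt tokenizer that knows the given abbreviations.

    Hypothetical helper for illustration; not part of NLTK.
    """
    tokenizer = PunktSentenceTokenizer()  # no training text, default parameters
    # Same private-attribute hack as in the comment above.
    tokenizer._params.abbrev_types.update(extra_abbreviations)
    return tokenizer


tokenizer = make_tokenizer(EXTRA_ABBREVIATIONS)

latin = ("Even though exempli gratia and id est are both Latin "
         "(and therefore italicized), no need to put e.g. or i.e. in "
         "italics when they’re in abbreviated form.")
et_al = ("If David et al. get the financing, we can move forward "
         "with the prototype. However, this is very unlikely because "
         "they did not publish sufficiently last year.")

latin_sentences = tokenizer.tokenize(latin)
et_al_sentences = tokenizer.tokenize(et_al)
print(latin_sentences)
print(et_al_sentences)
```

The same `abbrev_types.update(...)` call should also work on the tokenizer returned by `nltk.data.load('tokenizers/punkt/english.pickle')`, which keeps the trained model's other parameters.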

alvations 28 Aug 2019

👍2

All 3 comments

I was about to open an issue along these lines, but I see you beat me to it. If/when people get around to fixing this, I would suggest reviewing the list of Latin abbreviations here:
https://en.wikipedia.org/wiki/List_of_Latin_abbreviations
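If such a list were to be fed into Punkt, each entry would first have to be turned into the form that `abbrev_types` holds (lowercase, no trailing period, one whitespace-separated token at a time). The sketch below shows one way to do that; `to_punkt_abbrev_types` is a hypothetical helper, and the sample entries are hand-picked for illustration rather than taken from the linked page. As #2154 points out, adding abbreviations can have pitfalls, so the result is best treated as a candidate list to review.

```python
def to_punkt_abbrev_types(abbreviations):
    """Map written abbreviations to candidate entries for Punkt's abbrev_types.

    Hypothetical helper for illustration; not part of NLTK.
    """
    abbrev_types = set()
    for abbreviation in abbreviations:
        # Punkt looks at whitespace-separated tokens, so for a multi-word
        # abbreviation such as "et al." only tokens ending in a period matter.
        for token in abbreviation.split():
            if token.endswith("."):
                abbrev_types.add(token[:-1].lower())
    return abbrev_types


candidates = to_punkt_abbrev_types(["e.g.", "i.e.", "et al.", "cf.", "N.B."])
print(sorted(candidates))  # ['al', 'cf', 'e.g', 'i.e', 'n.b']
```

The resulting set could then be passed to `punkt._params.abbrev_types.update(...)` in the same way as the quick hack earlier in the thread.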

jrtuenge 20 Feb 2020

👍1

Technically speaking, those abbreviations, in their fully written-out, non-abbreviated form, represent _multi-word expressions_ (_MWEs_), no? I mean, okay, technically they (also) represent fixed _phrasal templates_, but that does not change their MWE status, so I wonder whether #2202 could help with this problem (although I suspect the answer will come out as "no").

(PS, fun fact: '& al.' is a legit shortening of 'et al.', just as '&c.' is a legit shortening of 'etc.')

no-identd 3 Jan 2021
