If we feed a sentence containing double quotes into TreebankWordTokenizer's span_tokenize function, errors occur. This is likely because the function passes the raw input string, together with the tokenized string, to align_tokens, without accounting for the fact that tokenize replaces double quotes with something else.
Thanks @albertauyeung for reporting the issue. Do you have an example of when you hit an error with TreebankWordTokenizer.span_tokenize()?
Do you mean something like this?
>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tbw = TreebankWordTokenizer()
>>> s = '''This is a sentence with "quotes inside" and alsom some 'single quotes', etc.'''
>>> print(s)
This is a sentence with "quotes inside" and alsom some 'single quotes', etc.
>>> tbw.span_tokenize(s)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/util.py", line 230, in align_tokens
    start = sentence.index(token, point)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/treebank.py", line 167, in span_tokenize
    return align_tokens(tokens, text)
  File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/util.py", line 232, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "``" not found in "This is a sentence with "quotes inside" and alsom some 'single quotes', etc."
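The failure mode is easy to see from a minimal sketch of what align_tokens does (judging by the traceback, the real function in nltk.tokenize.util essentially searches for each token in the raw sentence, left to right):

```python
def align_tokens_sketch(tokens, sentence):
    # Minimal sketch of nltk.tokenize.util.align_tokens: locate each
    # token in the raw sentence, scanning forward from the last match.
    point = 0
    spans = []
    for token in tokens:
        # Raises ValueError when the tokenizer rewrote the token
        # (e.g. '"' became '``'), so it no longer occurs in the raw text.
        start = sentence.index(token, point)
        point = start + len(token)
        spans.append((start, point))
    return spans

print(align_tokens_sketch(['This', 'is'], 'This is'))  # [(0, 4), (5, 7)]

try:
    align_tokens_sketch(['``', 'hi', "''"], '"hi"')
except ValueError as e:
    print('alignment failed:', e)  # '``' is not a substring of '"hi"'
```

So any token that tokenize normalizes away from its surface form will break the alignment.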
A suboptimal workaround:
>>> s = '''This is a sentence with `` quotes inside '' and alsom some 'single quotes', etc.'''
>>> tbw.span_tokenize(s)
[(0, 4), (5, 7), (8, 9), (10, 18), (19, 23), (24, 26), (27, 33), (34, 40), (41, 43), (44, 47), (48, 53), (54, 58), (59, 66), (67, 73), (73, 74), (74, 75), (76, 79), (79, 80)]
@alvations Yes, that is exactly the error I got. Right now it looks like we need to preprocess the sentence before passing it to span_tokenize.
A simple fix would be to replace the quotes before calling nltk.tokenize.util.align_tokens
at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L147
def span_tokenize(self, text):
    tokens = self.tokenize(text)
    tokens = ['"' if tok in ['``', "''"] else tok for tok in tokens]
    return align_tokens(tokens, text)
After the patch:
>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tbw = TreebankWordTokenizer()
>>> s = '''This is a sentence with "quotes inside" and alsom some 'single quotes', etc.'''
>>> print(s)
This is a sentence with "quotes inside" and alsom some 'single quotes', etc.
>>> tbw.span_tokenize(s)
[(0, 4), (5, 7), (8, 9), (10, 18), (19, 23), (24, 25), (25, 31), (32, 38), (38, 39), (40, 43), (44, 49), (50, 54), (55, 62), (63, 69), (69, 70), (70, 71), (72, 75), (75, 76)]
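As a sanity check, the spans can be mapped back onto the raw string with plain slicing (a hypothetical verification snippet, not part of the patch itself):

```python
s = '''This is a sentence with "quotes inside" and alsom some 'single quotes', etc.'''

# Spans produced by the patched span_tokenize above.
spans = [(0, 4), (5, 7), (8, 9), (10, 18), (19, 23), (24, 25), (25, 31),
         (32, 38), (38, 39), (40, 43), (44, 49), (50, 54), (55, 62),
         (63, 69), (69, 70), (70, 71), (72, 75), (75, 76)]

# Each span now slices out the literal token, quotes included.
tokens = [s[start:end] for start, end in spans]
print(tokens[:9])
# ['This', 'is', 'a', 'sentence', 'with', '"', 'quotes', 'inside', '"']
```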
@albertauyeung, would you like to submit a PR with the fix?
@alvations Yes, sure. Will do!
Fixed in #1751
Note that this fix still raises an exception for text containing both types of quotes:
nltk.TreebankWordTokenizer().span_tokenize('" ``')
Hi @alyaxey, what exception are you seeing?
I ran nltk.TreebankWordTokenizer().span_tokenize('" ``')
and got the following:
[(0, 1), (2, 4)]
Sorry, I provided the wrong test example. Please take a look at this:
import nltk
print(nltk.TreebankWordTokenizer().span_tokenize('``` "'))
The expected output would be [(0, 2), (2, 3), (4, 5)] if we follow the logic of the current tokenize method. [(0, 3), (4, 5)] would also be acceptable.
Here is my result on the develop branch:
Traceback (most recent call last):
  File "/Users/alyaxey/Downloads/nltk-develop/nltk/tokenize/util.py", line 254, in align_tokens
    start = sentence.index(token, point)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    print(nltk.TreebankWordTokenizer().span_tokenize('``` "'))
  File "/Users/alyaxey/Downloads/nltk-develop/nltk/tokenize/treebank.py", line 179, in span_tokenize
    return align_tokens(tokens, text)
  File "/Users/alyaxey/Downloads/nltk-develop/nltk/tokenize/util.py", line 256, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "`" not found in "``` ""
I would like to propose a different solution that would: 1) fix this and similar bugs, 2) give users more flexibility, 3) make the code clearer. We could add a boolean parameter to the tokenize method that enables or disables quote conversion. We could then disable quote conversion during span_tokenize, so that no manipulation unrelated to whitespace takes place.
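A sketch of the idea (the convert_quotes flag and the whitespace tokenizer below are hypothetical stand-ins, not the real Treebank rules):

```python
class SketchTokenizer:
    # Hypothetical illustration of the proposed boolean flag; the real
    # TreebankWordTokenizer has no convert_quotes parameter.
    def tokenize(self, text, convert_quotes=True):
        tokens = text.split()  # stand-in for the real Treebank regex rules
        if convert_quotes:
            # Treebank-style normalization: straight double quotes
            # become `` / ''; greatly simplified here.
            tokens = ['``' if t == '"' else t for t in tokens]
        return tokens

    def span_tokenize(self, text):
        # With conversion disabled, every token is a literal substring
        # of the input, so left-to-right alignment cannot fail.
        spans, point = [], 0
        for tok in self.tokenize(text, convert_quotes=False):
            start = text.index(tok, point)
            point = start + len(tok)
            spans.append((start, point))
        return spans

print(SketchTokenizer().span_tokenize('say " hi'))  # [(0, 3), (4, 5), (6, 8)]
```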
I ran into an exception in the current version of span_tokenize for strings that contain parentheses before quotes. I believe the regular expression is wrong, since it also matches parentheses and later replaces the quotes in "raw_tokens" with those parentheses. Or am I missing something?
Example:
s = ' ( see 6) Biotin " " affinity'
w_spans = TreebankWordTokenizer().span_tokenize(s)
Exception:
...
  File "/home/mp/miniconda3/envs/py36/lib/python3.6/site-packages/nltk/tokenize/treebank.py", line 179, in span_tokenize
    return align_tokens(tokens, text)
  File "/home/mp/miniconda3/envs/py36/lib/python3.6/site-packages/nltk/tokenize/util.py", line 256, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "(" not found in " ( see 6) Biotin " " affinity"
Proposed fix:
Change the regular expression in span_tokenize from r'[(``)(\'\')(")]+'
to r'(``)|(\'\')|(")'
OK, my bad, this was actually already fixed in commit 4b21300999e11ba6f91952c05a936ccec0673e2e and works like a charm in nltk-3.3
Oh, this is still a problem in nltk-3.3, like this:
  File "/home/users/----/.miniconda2/lib/python2.7/site-packages/nltk/tokenize/util.py", line 258, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "''" not found in "''Elton's been through a lot," he told The Sun newspaper."
@memeda I confirm that I can reproduce this error. The solution is to add another regular expression to match single quotes at the start of a string. Please take a look at my branch at https://github.com/albertauyeung/nltk/tree/hotfix-span-tokenizer
Confirmed:
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "enriched" not found in "The Hindu describing his Cricket, once said: `` His batting resembles very closely that of his father -dashing and carefree -and his cover-drive, a joy to watch, has amazing impetus...''And it added that he had ``enriched Madras sport as his father had''."
Hi, I also ran into this error, for example with the following text:
''Cosita Linda' - Lisandro (2013)\n\"El Clon (2010) .... Mohammed
The resulting error looks like this:
ValueError: substring "''" not found in "''Cosita Linda' - Lisandro (2013)
"El Clon (2010) .... Mohammed"
Are there any updates on this issue?