Nltk: Sentence splitting fails on some corner cases

Created on 26 Aug 2019  ·  3 Comments  ·  Source: nltk/nltk

๋‚˜๋Š” ์•ฝ์–ด๊ฐ€ ํฌํ•จ๋œ ๋ฌธ์žฅ์„ ๋ถ„ํ• ํ•˜๋Š” ๊ฒƒ์ด ์–ผ๋งˆ๋‚˜ ์–ด๋ ค์šด์ง€์™€ #2154์— ์ž˜ ์„ค๋ช…๋˜์–ด ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์•ฝ์–ด๋ฅผ ์ถ”๊ฐ€ํ•˜๋ฉด ํ•จ์ •์ด ์žˆ์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ๋‚˜๋Š” ๋‚ด๊ฐ€ ๋ฌป๊ณ  ์‹ถ์€ ๋ช‡ ๊ฐ€์ง€ ์ฝ”๋„ˆ ์ผ€์ด์Šค๋ฅผ ์šฐ์—ฐํžˆ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค. ๋‹ค์Œ ์ค‘ ํ•˜๋‚˜๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ž…๋‹ˆ๋‹ค.

  • ์˜ˆ
  • ์ฆ‰
  • et al.

๋ฌธ์žฅ์—์„œ ์ž˜๋ชป๋œ ๋ฐฉ์‹์œผ๋กœ ๋ฌธ์žฅ์„ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค.

ie ๋ฐ eg์˜ ์˜ˆ

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
...             "(and therefore italicized), no need to put e.g. or i.e. in "
...             "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
 'or i.e.',
 'in italics when they’re in abbreviated form.']

et al.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
...             "with the prototype. However, this is very unlikely because "
...             "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
 'get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

In my notebook, nltk.__version__ is 3.4.5.

As I see it, this issue differs from #2154 because these abbreviations are well known and commonly used (especially in academia).
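
One way to spot splits like the two reported here is to flag any fragment that ends in one of these abbreviations and is followed by a fragment starting with a lowercase letter. The sketch below is only a heuristic check on a tokenizer's output, not part of NLTK; the name `suspicious_splits` and the abbreviation tuple are made up for illustration.

```python
# Hypothetical helper (not an NLTK API): flags fragments that look like
# they were cut off right after a Latin abbreviation.
LATIN_ABBREVIATIONS = ("e.g.", "i.e.", "et al.")


def suspicious_splits(sentences, abbreviations=LATIN_ABBREVIATIONS):
    """Return (index, fragment) pairs that were probably split by mistake.

    A fragment is flagged when it ends with one of the abbreviations and
    the next fragment starts with a lowercase letter, which suggests the
    sentence actually continues.
    """
    flagged = []
    for index, fragment in enumerate(sentences[:-1]):
        ends_in_abbreviation = fragment.rstrip().lower().endswith(abbreviations)
        next_starts_lowercase = sentences[index + 1][:1].islower()
        if ends_in_abbreviation and next_starts_lowercase:
            flagged.append((index, fragment))
    return flagged


# Output of sent_tokenize as reported in this issue
reported = [
    "If David et al.",
    "get the financing, we can move forward with the prototype.",
    "However, this is very unlikely because they did not publish sufficiently last year.",
]
print(suspicious_splits(reported))  # [(0, 'If David et al.')]
```

A sentence that really does end in "et al." and is followed by a capitalized sentence is not flagged, so the check errs on the side of missing some bad splits.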

Labels: nice idea, tokenizer

All 3 comments

A quick hack, following #2154:

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')  # Punkt stores abbreviations lowercased, without the final period
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely because they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

But perhaps it would be a good idea to have an improved sentence tokenizer (#2008, #1214), as we did with the word tokenizer (#2355).

For example, as a first step, one could easily add everything in nltk.corpus.nonbreaking_prefixes to punkt._params.abbrev_types.
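
A rough sketch of that first step, assuming Punkt keeps abbreviations lowercased and without the final period (as the `'al'` entry in the hack above suggests). The helper names `normalize_abbreviation` and `extend_abbreviations` are hypothetical, and skipping annotated entries such as `No #NUMERIC_ONLY#` is an assumption about how the Moses-style prefix lists are laid out.

```python
def normalize_abbreviation(entry):
    """Return `entry` in the form Punkt is assumed to store, or None to skip it."""
    entry = entry.strip()
    # Skip blanks, comment lines, and annotated entries that contain
    # whitespace (e.g. "No #NUMERIC_ONLY#"); this rule is an assumption.
    if not entry or entry.startswith("#") or len(entry.split()) != 1:
        return None
    normalized = entry.rstrip(".").lower()
    return normalized or None


def extend_abbreviations(tokenizer, entries):
    """Add normalized entries to tokenizer._params.abbrev_types.

    Returns the set of abbreviations that were not already present.
    """
    added = set()
    for entry in entries:
        abbrev = normalize_abbreviation(entry)
        if abbrev is not None and abbrev not in tokenizer._params.abbrev_types:
            tokenizer._params.abbrev_types.add(abbrev)
            added.add(abbrev)
    return added


def demo():
    """Not called here: needs NLTK plus the Punkt model and the
    nonbreaking_prefixes corpus to be downloaded already."""
    import nltk
    from nltk.corpus import nonbreaking_prefixes

    punkt = nltk.data.load('tokenizers/punkt/english.pickle')
    extend_abbreviations(punkt, nonbreaking_prefixes.words('en'))
    return punkt
```

This still goes through the private `_params` attribute, so it stays a hack. Newer NLTK releases may also package the Punkt model differently from the pickle loaded in `demo()`.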

I was going to open an issue along these lines, but I see you beat me to it. When people attempt to solve this, it would be good to review the list of Latin abbreviations here:
https://en.wikipedia.org/wiki/List_of_Latin_abbreviations

Strictly speaking, these abbreviations, in their fully written, unabbreviated form, are _multi-word expressions_ (_MWEs_), no? I mean, okay, technically they are (also) fixed _syntactic templates_, but that doesn't change their MWE status, so I wonder whether #2202 could help with this issue. 🤔

(P.S. Fun fact: '& al.' is a legitimate abbreviation of 'et al.', and '&c.' is a legitimate abbreviation of 'etc.'.)
