Nltk: ๋„์™€์ฃผ์„ธ์š”, ์ œ๋ฐœ! nltk ๋ฐ standford nlp ํ†ตํ•ฉ

Created on 06 Jul 2018  ·  3 comments  ·  Source: nltk/nltk

I ran into a confusing problem while using the nltk integration with Stanford NLP.
My development environment is as follows:

  1. nltk 3.3
  2. Stanford NLP: stanford-segmenter 3.6.0 / 3.9.1

I try to create a StanfordSegmenter object like this:

    standfordNlpPath = self.projectPath + "\standford-nlp\stanford-segmenter-2015-12-09"
    stanfordSegmenter = StanfordSegmenter(
        path_to_jar=standfordNlpPath + "\stanford-segmenter-3.6.0.jar",
        path_to_slf4j=standfordNlpPath + "\slf4j-api.jar",
        path_to_sihan_corpora_dict=standfordNlpPath + "\data-2015",
        path_to_model=standfordNlpPath + "\data-2015\pku.gz",
        path_to_dict=standfordNlpPath + "\data-2015\dict-chris6.ser.gz")

It fails with the following output:

    ===========================================================================
    NLTK was unable to find stanford-segmenter.jar! Set the CLASSPATH
    environment variable.

    For more information, on stanford-segmenter.jar, see:

https://nlp.stanford.edu/software

๋ชจ๋“  ์ข…๋ฅ˜์˜ ํ•ญ์•„๋ฆฌ๊ฐ€ ์ •ํ™•ํžˆ ๊ฑฐ๊ธฐ์— ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. ๋‚ด ๊ฒฝ๋กœ๋‚˜ StanfordSegmenter ํด๋ž˜์Šค์— ๋„ฃ์€ ๋งค๊ฐœ๋ณ€์ˆ˜์— ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๊นŒ? ์˜ˆ์ œ๋Š” nltk 3.3 ๋ฌธ์„œ์—์„œ ์ฐพ์€ ๋งค์šฐ ์‰ฌ์› ์œผ๋ฉฐ "path_to_slf4j"๋ผ๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ํ•˜๋‚˜๋งŒ ๋„ฃ์—ˆ์Šต๋‹ˆ๋‹ค.
๊ทธ๋Ÿฌ๋‹ˆ ๋ˆ„๊ฐ€ ์ข€ ๋„์™€์ฃผ์„ธ์š” :-( !
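Not part of the original report, but one easy-to-miss pitfall in the snippet above: bare backslashes in ordinary Python string literals are escape-prone (`"\t"`, `"\n"`, and friends get interpreted). A hedged sketch of a safer construction, using `os.path.join` with illustrative path names:

```python
import os

# Illustrative root only; substitute your actual project path.
project_path = r"C:\projects\myapp"

# os.path.join avoids hand-written separators entirely, and raw strings
# (r"...") keep any remaining backslashes literal.
segmenter_home = os.path.join(
    project_path, "standford-nlp", "stanford-segmenter-2015-12-09")
jar_path = os.path.join(segmenter_home, "stanford-segmenter-3.6.0.jar")
model_path = os.path.join(segmenter_home, "data-2015", "pku.gz")

print(jar_path)
```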

resolved stanford api

Most helpful comment

Use the new CoreNLPParser interface.

First, update NLTK:

pip3 install -U nltk

๊ทธ๋Ÿฐ ๋‹ค์Œ ์—ฌ์ „ํžˆ ํ„ฐ๋ฏธ๋„์— ์žˆ์Šต๋‹ˆ๋‹ค.

# Get the CoreNLP package
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip stanford-corenlp-full-2018-02-27.zip
cd stanford-corenlp-full-2018-02-27/

# Download the properties for chinese language
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar 
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 

# Download the properties for arabic
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties


For Chinese:

# Start the server.
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-chinese.properties \
-preload tokenize,ssplit,pos,lemma,ner,parse \
-status_port 9001  -port 9001 -timeout 15000 & 

๊ทธ๋Ÿฐ ๋‹ค์Œ Python3์—์„œ:

>>> from nltk.parse import CoreNLPParser
>>> parser = CoreNLPParser('http://localhost:9001')
>>> list(parser.tokenize(u'ๆˆ‘ๅฎถๆฒกๆœ‰็”ต่„‘ใ€‚'))
['ๆˆ‘ๅฎถ', 'ๆฒกๆœ‰', '็”ต่„‘', 'ใ€‚']

์•„๋ž์–ด:

# Start the server.
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-arabic.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9005  -port 9005 -timeout 15000

Finally, start Python:

>>> from nltk.parse import CoreNLPParser
>>> parser = CoreNLPParser(url='http://localhost:9005')
>>> text = u'ุงู†ุง ุญุงู…ู„'
>>> parser.tokenize(text)
<generator object GenericCoreNLPParser.tokenize at 0x7f4a26181bf8>
>>> list(parser.tokenize(text))
['ุงู†ุง', 'ุญุงู…ู„']
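The CoreNLPParser calls above go over HTTP to the server started earlier. As a rough illustration (not the library's own code), the same round trip can be hand-rolled with the standard library against the server's documented API; the URL and port are assumptions matching the commands above, and the function simply returns None when no server is reachable:

```python
import json
from urllib import error, parse, request

def corenlp_tokenize(text, url="http://localhost:9001"):
    """Tokenize `text` via a running CoreNLP server's HTTP API.

    Minimal sketch of the request CoreNLPParser issues for us;
    returns None when no server is reachable at `url`.
    """
    props = json.dumps({"annotators": "tokenize,ssplit",
                        "outputFormat": "json"})
    full_url = url + "/?properties=" + parse.quote(props)
    try:
        # The server expects the raw text as the POST body.
        with request.urlopen(full_url, data=text.encode("utf-8"),
                             timeout=5) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (error.URLError, OSError, ValueError):
        return None
    return [token["word"]
            for sentence in payload.get("sentences", [])
            for token in sentence["tokens"]]
```

With the Chinese server from above running on port 9001, `corenlp_tokenize(u'ๆˆ‘ๅฎถๆฒกๆœ‰็”ต่„‘ใ€‚')` would return the same token list as the CoreNLPParser example.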

All 3 comments

@libingnan54321 Why don't you use the latest 3.9.1 version?

๋จผ์ € ์ด๊ฒƒ์„ ์‹œ๋„ํ•˜๊ณ  ์ถœ๋ ฅ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ?

import os

# standfordNlpPath as defined in your snippet above
segmenter_jar_file = os.path.join(standfordNlpPath, 'stanford-segmenter-2018-02-27/stanford-segmenter-3.9.1.jar')
assert os.path.isfile(segmenter_jar_file)
stanfordSegmenter = StanfordSegmenter(
    path_to_jar=segmenter_jar_file,
)
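If that assert passes and the wrapper still complains, the error text points at the CLASSPATH environment variable, which the NLTK Stanford wrappers fall back to when an explicit jar path does not resolve. A small diagnostic sketch (whatever entries it prints come from your own environment):

```python
import os

# List each CLASSPATH entry and whether it actually exists on disk.
classpath = os.environ.get("CLASSPATH", "")
print("CLASSPATH is", "set" if classpath else "empty")
for entry in filter(None, classpath.split(os.pathsep)):
    status = "exists" if os.path.exists(entry) else "missing"
    print(f"  {entry}: {status}")
```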


ํ˜„์žฌ ํ•ด๊ฒฐ๋œ ๋Œ€๋กœ ๋ฌธ์ œ ์ข…๋ฃŒ =)
์ถ”๊ฐ€ ๋ฌธ์ œ๊ฐ€ ์žˆ์œผ๋ฉด ์—ด์–ด์ฃผ์„ธ์š”.

์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰