Scikit-learn: The CountVectorizer and TfidfVectorizer documentation does not mention that token_pattern is ignored when a custom tokenizer is passed

Created on 29 Nov 2019  ·  3 comments  ·  Source: scikit-learn/scikit-learn

Description

The documentation for CountVectorizer and TfidfVectorizer is not clear about the interaction between token_pattern and passing a custom tokenizer. Currently, token_pattern is ignored whenever a tokenizer is passed. However, the docstring entry for the tokenizer parameter only says "Override the string tokenization step while preserving the preprocessing and n-grams generation steps." To me it was not immediately clear that this means token_pattern is not used at all.
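A minimal sketch of the behavior in question (whitespace_tokenizer is a made-up example tokenizer): once a custom tokenizer is passed, token_pattern has no effect, so hyphenated tokens survive intact instead of being split by the regex. On releases that include the warning discussed below (0.23+), this call also emits the UserWarning.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["state-of-the-art tokenization", "bag-of-words model"]

    # Example custom tokenizer: split on whitespace only.
    def whitespace_tokenizer(text):
        return text.split()

    # token_pattern is silently ignored because tokenizer is not None.
    vec = CountVectorizer(tokenizer=whitespace_tokenizer,
                          token_pattern=r"(?u)\b\w\w+\b")
    vec.fit(docs)
    print(sorted(vec.vocabulary_))
    # ['bag-of-words', 'model', 'state-of-the-art', 'tokenization']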

๋‹ค์Œ์€ ์ด๊ฒƒ์œผ๋กœ ์ธํ•ด ๋ฐœ์ƒํ•œ ์‚ฌ์šฉ์ž์ž…๋‹ˆ๋‹ค. Stackoverflow

๋‚ด๊ฐ€ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋Š” ๋ช‡ ๊ฐ€์ง€:

  • ์‚ฌ์šฉ์ž๊ฐ€ (๋น„ํ‘œ์ค€) ํ† ํฐ ํŒจํ„ด๊ณผ ์‚ฌ์šฉ์ž ์ •์˜ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์ „๋‹ฌํ•˜๋ฉด ๊ฒฝ๊ณ  ๋ฐœ์ƒ
  • ์ƒํ˜ธ ์ž‘์šฉ์— ๋Œ€ํ•ด ๋ช…์‹œ์ ์œผ๋กœ docstring์„ ์—…๋ฐ์ดํŠธํ•ฉ๋‹ˆ๋‹ค.

Most helpful comment

The warning is new; let's see how it goes.

All 3 comments

The warning should be in 0.23rc3. Would you like to try it out for us?

Confirming: the warning ( UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None' ) does indeed appear. My bad for not checking first. If you'd like, I can open a PR with a documentation edit explaining what is going on, but the warning is probably enough.
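A side note for readers who hit the new warning: assuming the check only fires when token_pattern is not None, explicitly passing token_pattern=None alongside the custom tokenizer should silence it and makes the intent clear:

    from sklearn.feature_extraction.text import CountVectorizer

    # Disabling token_pattern explicitly documents that the custom tokenizer
    # takes over tokenization, and avoids the UserWarning.
    vec = CountVectorizer(tokenizer=str.split, token_pattern=None)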

The warning is new; let's see how it goes.

์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰