Xgboost: [Discussion] Integration with PySpark

Created on 25 Oct 2016  ·  53 comments  ·  Source: dmlc/xgboost

I just noticed that there have been several requests for a PySpark integration (http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html).

๋‚˜๋Š” ๋˜ํ•œ ๊ฐ™์€ ์ฃผ์ œ์— ๋Œ€ํ•ด ํ† ๋ก ํ•˜๋Š” ์‚ฌ์šฉ์ž๋กœ๋ถ€ํ„ฐ ๋ช‡ ๊ฐ€์ง€ ์ด๋ฉ”์ผ์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค.

์ด ์ž‘์—…์„ ์–ธ์ œ ์‹œ์ž‘ํ• ์ง€ ์—ฌ๋ถ€์— ๋Œ€ํ•œ ๋…ผ์˜๋ฅผ ์—ฌ๊ธฐ์„œ ์‹œ์ž‘ํ•˜๊ณ  ์‹ถ์Šต๋‹ˆ๋‹ค.

@tqchen @terrytangyuan

Most helpful comment

@CodingCat @tqchen The data science community would definitely benefit from XGboost being implemented in PySpark, for the following reasons:

  • Python in general is the #3 language, with 10.2% popularity as of March 2017 (vs. 1.8% for Scala):
    http://redmonk.com/sogrady/2017/03/17/language-rankings-1-17/
    https://jobsquery.it/stats/language/group
  • Regarding the performance of PySpark vs. Scala, I assume it matters little here, since almost everything under Spark's hood is Scala anyway. Is that correct?
  • I personally know of at least 3 AI startups that use PySpark for production, and more broadly, Python is far more popular among data scientists and on the job market:
    http://r4stats.com/articles/popularity/
    My conclusion: XGboost implemented in PySpark would have the biggest impact on data science of all the implementations.
    (P.S. once the PySpark version is stable, check whether Cloudera will ship it)

All 53 comments

@CodingCat Do you know how large the PySpark community is? Most people only use the Scala API. It seems like a lot of things would have to be re-implemented in Python; correct me if I'm wrong.

I think PySpark is fairly widespread in the data scientist community, i.e. for rapid-prototyping scenarios and the like. I have heard of many cases where data scientists use pySpark to analyze large volumes of data.

On the other hand, most production-level scenarios are based on the Scala API (I know of only one case where people use PySpark in large-scale production).

Yes, I think the current Python API should be able to handle most prototyping needs. When things get more production-ready, I am personally more interested in Spark. Perhaps we should leave this discussion open here so people can discuss their needs. In the meantime, details on the approach/estimate/steps for the integration would be appreciated.

์ปค๋ฎค๋‹ˆํ‹ฐ์—์„œ ์ผ๋ถ€ ํ† ๋ก ์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค.

http://apache-spark-developers-list.1001551.n3.nabble.com/Blocked-PySpark-changes-td19712.html

PySpark์˜ ๊ฐœ๋ฐœ์ด ๋’ค์ฒ˜์ ธ ์žˆ๋Š” ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค...๋‹ค์šด์ŠคํŠธ๋ฆผ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋กœ์„œ ์ €๋Š” PySpark ํ†ตํ•ฉ์— ์ „๋…ํ•˜๊ธฐ ์œ„ํ•ด hold on ์— ํˆฌํ‘œํ•ฉ๋‹ˆ๋‹ค.....

๋„ค, ๋ฐœ์ƒํ•œ ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋Š” ๊ฒƒ๋„ ์–ด๋ ต์Šต๋‹ˆ๋‹ค(์ ์–ด๋„ ์ž‘๋…„์— ์‹œ๋„ํ–ˆ์„ ๋•Œ)...

The roadmap (#873) says distributed Python is implemented. Does that mean xgboost can run on a hadoop cluster with Python? (I don't mean pyspark)

์˜ˆ, ๋งํฌ์— ๊ฒŒ์‹œ๋œ ์˜ˆ๋ฅผ ์ฐธ์กฐํ•˜์‹ญ์‹œ์˜ค.

What is the difference between running xgboost on a hadoop cluster with python and running it on a hadoop cluster with the scala api? Is there a major performance difference?
It seems there are still plenty of people using pyspark even for production models.

@yiming-chen xgboost4j-spark์˜ ๋ชฉํ‘œ๋Š” ๋™์ผํ•œ ํŒŒ์ดํ”„๋ผ์ธ์—์„œ ETL๊ณผ ๋ชจ๋ธ ๊ต์œก์„ ํ†ตํ•ฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์งˆ๋ฌธ์€ ์‚ฌ์šฉ์ž๊ฐ€ ETL์„ ์ˆ˜ํ–‰ํ•  ๋•Œ ์–ด๋–ค ์–ธ์–ด๋ฅผ ์‚ฌ์šฉํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋‚ด ๊ด€์ฐฐ๊ณผ ๊ฒฝํ—˜์— ๋”ฐ๋ฅด๋ฉด 95%์˜ ์‚ฌ์šฉ์ž๊ฐ€ scala๋กœ ETL ์‹œ์Šคํ…œ์„ ๊ตฌ์ถ•ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

@CodingCat I don't know where you got that 95% statistic from, but PySpark is definitely widely used in my experience. For example, we are looking at integrating Airflow to schedule jobs for our pipelines, and Python is well suited to that situation.

@berch PySpark is widely used by you, and you are going to integrate it with Airflow... how is that relevant to what I said?

PR์„ ๋ณด๋‚ด์ฃผ์‹ญ์‹œ์˜ค. ๋น„์šฉ์„ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

To avoid going back and forth in this thread again, I'll wrap up the discussion with the following conclusions:

  • ๋‚˜๋Š” ๊ฐœ์ธ์ ์œผ๋กœ PySpark๋ฅผ ํ†ตํ•ฉํ•˜๊ธฐ ์œ„ํ•œ ์ด ๋…ธ๋ ฅ์„ ๊ณ„์†ํ•˜๊ธฐ ์œ„ํ•ด ํˆฌํ‘œํ•˜์ง€ ์•Š์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค(ํ˜„์žฌ๋กœ์„œ๋Š”)

  • ๋‹ค๋ฅธ ์‚ฌ๋žŒ์ด ์ด์— ๊ธฐ์—ฌํ•˜๋Š” ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์šฐ๋ฆฌ๋Š” ์ตœ์†Œํ•œ ๋‹ค์Œ ์‚ฌํ•ญ์„ ๊ณ ๋ คํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

    • ๋‹ค๋ฅธ ํŒŒ์ด์ฌ ํŒจํ‚ค์ง€๋ฅผ ์†Œ๊ฐœํ•˜์ง€ ๋งˆ์‹ญ์‹œ์˜ค

    • ํ†ตํ•ฉ์„ ๊ตฌํ˜„ํ•  ๋•Œ ํ˜„์žฌ Python API์— ๋Œ€ํ•œ ์ด์ „ ๋ฒ„์ „๊ณผ์˜ ํ˜ธํ™˜์„ฑ

    • pyspark ML์˜ ๋’ค์ณ์ง„ ๊ธฐ๋Šฅ ์ฒ˜๋ฆฌ

So does that mean we can't load an XGBoost-spark model using pyspark? @CodingCat

๋”ฐ๋ผ์„œ ์‹ค์ œ๋กœ scala XGBoost ๋Š” PySpark JavaEstimator API์—์„œ ๋œ ๊ณ ํ†ต์Šค๋Ÿฝ๊ฒŒ ๋ž˜ํ•‘๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‚˜๋Š” ์กฐ๊ธˆ ๋†€์•˜๊ณ  ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ”„๋กœํ† ํƒ€์ž…์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

from pyspark.ml.wrapper import JavaEstimator, JavaModel
from pyspark.ml.param.shared import *
from pyspark.ml.util import *
from pyspark.context import SparkContext

class XGBoost(JavaEstimator, JavaMLWritable, JavaMLReadable, HasRegParam, HasElasticNetParam):

    def __init__(self, paramMap=None):
        super(XGBoost, self).__init__()
        paramMap = paramMap or {}  # avoid a shared mutable default argument
        scalaMap = SparkContext._active_spark_context._jvm.PythonUtils.toScalaMap(paramMap)
        self._java_obj = self._new_java_obj(
            "ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
        self._defaultParamMap = paramMap
        self._paramMap = paramMap

    def setParams(self, paramMap=None):
        return self._set(paramMap or {})

    def _create_model(self, javaTrainingData):
        return JavaModel(javaTrainingData)

์•„์ง ์ž‘์—…์ด ํ•„์š”ํ•˜๋‹ค๊ณ  ์ƒ๊ฐํ•˜์ง€๋งŒ PySpark์—์„œ Xgboost ๋ฅผ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

Wieslaw, thanks for sharing the code snippet for the XGBoost PySpark wrapper. Could you also share the code that calls the XGBoost class with appropriate parameters?

Thanks

@wpopielarski That's great work. Could you share the code that calls XGBoost with the required parameters? It would be a big help!

์ด๊ฒƒ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

from app.xgboost import XGBoost

base_score = 0.5  # placeholder; set this to your own base score

xgboost_params = {
    "eta": 0.023,
    "max_depth": 10,
    "min_child_weight": 0.3,
    "subsample": 0.7,
    "colsample_bytree": 0.82,
    "colsample_bylevel": 0.9,
    "base_score": base_score,
    "eval_metric": "auc",
    "seed": 49,
    "silent": 1,
    "objective": "binary:logistic",
    "round": 10,
    "nWorkers": 2,
    "useExternalMemory": True
}
xgboost_estimator = XGBoost(xgboost_params)
# ...
model = xgboost_estimator.fit(data)

I'm getting close to making a PR with proper PySpark support.

@thesuperzapper, excellent!

How long do you think packaging it up will take? Please share any insights as you go.

Thanks!

Hi, in case anyone is interested, I wrote a simple version that can easily be customized with ParamGridBuilder:

  • 1. Create the package directory ml/dmlc/xgboost4j/scala under a valid PYTHONPATH directory (mkdir -p ml/dmlc/xgboost4j/scala)
  • 2. Copy the code below into ml/dmlc/xgboost4j/scala/spark.py
from pyspark import SparkContext
from pyspark.ml.classification import JavaClassificationModel
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasFeaturesCol, HasLabelCol, HasPredictionCol, HasRawPredictionCol
from pyspark.ml.util import JavaMLWritable, JavaMLReadable
from pyspark.ml.wrapper import JavaModel, JavaWrapper, JavaEstimator


class XGBParams(Params):
    '''Param definitions for the XGBoost estimator (descriptions taken from the XGBoost docs).'''
    eta = Param(Params._dummy(), "eta",
                "step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features. and eta actually shrinks the feature weights to make the boosting process more conservative",
                typeConverter=TypeConverters.toFloat)
    max_depth = Param(Params._dummy(), "max_depth",
                      "maximum depth of a tree, increase this value will make the model more complex / likely to be overfitting. 0 indicates no limit, limit is required for depth-wise grow policy.range: [0,โˆž]",
                      typeConverter=TypeConverters.toInt)
    min_child_weight = Param(Params._dummy(), "min_child_weight",
                             "minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will berange: [0,โˆž]",
                             typeConverter=TypeConverters.toFloat)
    max_delta_step = Param(Params._dummy(), "max_delta_step",
                           "Maximum delta step we allow each treeโ€™s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help making the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced. Set it to value of 1-10 might help control the update.",
                           typeConverter=TypeConverters.toInt)
    subsample = Param(Params._dummy(), "subsample",
                      "subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees and this will prevent overfitting.",
                      typeConverter=TypeConverters.toFloat)
    colsample_bytree = Param(Params._dummy(), "colsample_bytree",
                             "subsample ratio of columns when constructing each tree",
                             typeConverter=TypeConverters.toFloat)
    colsample_bylevel = Param(Params._dummy(), "colsample_bylevel",
                              "subsample ratio of columns for each split, in each level.",
                              typeConverter=TypeConverters.toFloat)
    max_leaves = Param(Params._dummy(), "max_leaves",
                       "Maximum number of nodes to be added. Only relevant for the โ€˜lossguideโ€™ grow policy.",
                       typeConverter=TypeConverters.toInt)

    def __init__(self):
        super(XGBParams, self).__init__()

class XGBoostClassifier(JavaEstimator, JavaMLWritable, JavaMLReadable, XGBParams,
                        HasFeaturesCol, HasLabelCol, HasPredictionCol, HasRawPredictionCol):
    def __init__(self, paramMap=None):
        super(XGBoostClassifier, self).__init__()
        paramMap = paramMap or {}  # avoid a shared mutable default argument
        scalaMap = SparkContext._active_spark_context._jvm.PythonUtils.toScalaMap(paramMap)
        self._java_obj = self._new_java_obj("ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
        self._defaultParamMap = paramMap
        self._paramMap = paramMap

    def setParams(self, paramMap=None):
        return self._set(paramMap or {})

    def _create_model(self, java_model):
        return XGBoostClassificationModel(java_model)


class XGBoostClassificationModel(JavaModel, JavaClassificationModel, JavaMLWritable, JavaMLReadable):

    def getBooster(self):
        return self._call_java("booster")

    def saveBooster(self, save_path):
        jxgb = JavaWrapper(self.getBooster())
        jxgb._call_java("saveModel", save_path)
  • 3. Play with it like a normal pyspark model!

@AakashBasuRZT @haiy, we are now working on this properly in issue #3370, with PR #3376 providing initial support.

@haiy ์ผ๋ถ€ ์ž„์˜ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ๋ถ„๋ฅ˜๊ธฐ์— ๋งž๋Š” ์ฝ”๋“œ ์Šค๋‹ˆํŽซ์„ ๋ณด์—ฌ ์ฃผ์‹œ๊ฒ ์Šต๋‹ˆ๊นŒ? ๋‚˜๋Š” ๋‹น์‹ ์ด ์š”์•ฝ ํ•œ 1๊ณผ 2๋ฅผ ๋”ฐ๋ž์ง€๋งŒ ์„ธ ๋ฒˆ์งธ ์š”์ ์„ ์ดํ•ดํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

@sagnik-rzt ์ด ์ƒ˜ํ”Œ์„ ํ™•์ธ

@haiy I'm trying to run this:

import pyspark
import pandas as pd
from dmlc.xgboost4j.scala.spark import XGBoostClassifier
from sklearn.utils import shuffle

sc = pyspark.SparkContext('local[2]')
spark = pyspark.sql.SparkSession(sc)
df = pd.DataFrame({'x1': range(10), 'x2': [10] * 10, 'y': shuffle([0 for i in range(5)] + [1 for i in range(5)])})
sdf = spark.createDataFrame(df)
X = sdf.select(['x1', 'x2'])
Y = sdf.select(['y'])
print(X.show(5))

params = {'objective' :'binary:logistic', 'n_estimators' : 10, 'max_depth' : 3, 'learning_rate' : 0.033}
xgb_model = XGBoostClassifier(params)

and it throws this exception:

Traceback (most recent call last):
  File "/home/sagnikb/PycharmProjects/auto_ML/pyspark_xgboost.py", line 20, in <module>
    xgb_model = XGBoostClassifier(params)
  File "/usr/lib/ml/dmlc/xgboost4j/scala/spark.py", line 47, in __init__
    self._java_obj = self._new_java_obj("ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable

Error in sys.excepthook:
Traceback (most recent call last):
  File "/home/sagnikb/PycharmProjects/auto_ML/pyspark_xgboost.py", line 20, in <module>
    xgb_model = XGBoostClassifier(params)
  File "/usr/lib/ml/dmlc/xgboost4j/scala/spark.py", line 47, in __init__
    self._java_obj = self._new_java_obj("ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
Exception ignored in: <bound method JavaParams.__del__ of XGBoostClassifier_4f9eb5d1388e9e1424a4>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py", line 105, in __del__
    SparkContext._active_spark_context._gateway.detach(self._java_obj)
  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1897, in detach
    java_object._detach()
AttributeError: 'NoneType' object has no attribute '_detach'

Environment:
Python 3.6
Spark 2.3
Scala 2.11

@sagnik-rzt
Not sure, but did you add xgboost-spark.jar with its deps to the spark classpath?

@wpopielarski Hi, no I haven't done that. Any idea where I can find that jar file?

@sagnik-rzt hi, download the jar here. Sorry, I just realized my jar was built on a mac, so please try building it yourself, and then put it in the spark deps dir, e.g. $SPARK_HOME/jars.

You need to build it yourself with maven and the assembly profile, which produces a fat jar somewhere around jvm-packages/xgboost-spark/target.

@sagnik-rzt ๋ฌด์—‡์„ ํ•˜๋ ค๋Š”์ง€ ํ™•์‹คํ•˜์ง€ ์•Š์ง€๋งŒ OS์šฉ ๋šฑ๋šฑํ•œ ํ•ญ์•„๋ฆฌ๋ฅผ ๋งŒ๋“ค๋ ค๋ฉด dmlc xgboost github ํ”„๋กœ์ ํŠธ๋ฅผ ๋ณต์ œํ•˜๊ณ  cd๋ฅผ jvm-packages๋กœ ๋ณต์ œํ•˜๊ณ  assemby ํ”„๋กœํ•„๋กœ mvn์„ ์‹คํ–‰ํ•˜์‹ญ์‹œ์˜ค. Gradle ๋นŒ๋“œ ํŒŒ์ผ์„ ์ž‘์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์ž˜ ๋ชจ๋ฆ…๋‹ˆ๋‹ค.

์ž, ์ข…์†์„ฑ์ด ์žˆ๋Š” ๋šฑ๋šฑํ•œ ํ•ญ์•„๋ฆฌ๋ฅผ ๋งŒ๋“  ๋‹ค์Œ $SPARK_HOME/jars์— ๋ณต์‚ฌํ•˜์—ฌ ๋ถ™์—ฌ๋„ฃ์—ˆ์Šต๋‹ˆ๋‹ค.
๊ทธ๋Ÿฌ๋‚˜ ๋™์ผํ•œ ์˜ˆ์™ธ๊ฐ€ ๊ณ„์† ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

Traceback (most recent call last):
  File "/home/sagnikb/PycharmProjects/xgboost/test_import.py", line 21, in <module>
    clf = xgb(params)
  File "/usr/lib/ml/dmlc/xgboost4j/scala/spark.py", line 48, in __init__
    self._java_obj = self._new_java_obj("dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable

Sorry, but are you running this on a cluster, or locally from some IDE project? If you are on a cluster,
it's better to use spark-submit and add the deps with the --jars switch.

I am currently working on rebasing #3376 onto the new spark branch. In the meantime, a few people have asked how to use the current code with XGBoost-0.72.

๋‹ค์Œ์€ XGBoost-0.72์šฉ pyspark ์ฝ”๋“œ๊ฐ€ ํฌํ•จ๋œ zip ํŒŒ์ผ์ž…๋‹ˆ๋‹ค.
๋‹ค์šด๋กœ๋“œ: sparkxgb.zip

๋‹ค์Œ ์ž‘์—…๋งŒ ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

  1. Add the normal Scala XGBoost jars and dependencies to your job (e.g. using --jars or the spark.jars config).
  2. Once the job has started, run the following in Python (the zip can also live at a local location that all executors can see):
sc.addPyFile("hdfs:///XXXX/XXXX/XXXX/sparkxgb.zip")
  3. Test it with the following code (assuming you have moved sample_binary_classification_data.txt to a reachable location; it normally lives at $SPARK_HOME/data/mllib/sample_binary_classification_data.txt):
from sparkxgb import XGBoostEstimator

# Load Data
dataPath = "sample_binary_classification_data.txt"
dataDF = spark.read.format("libsvm").load(dataPath)

# Split into Train/Test
trainDF, testDF = dataDF.randomSplit([0.8, 0.2], seed=1000)

# Define and train model
xgboost = XGBoostEstimator(
    # General Params
    nworkers=1, nthread=1, checkpointInterval=-1, checkpoint_path="",
    use_external_memory=False, silent=0, missing=float("nan"),

    # Column Params
    featuresCol="features", labelCol="label", predictionCol="prediction", 
    weightCol="weight", baseMarginCol="baseMargin", 

    # Booster Params
    booster="gbtree", base_score=0.5, objective="binary:logistic", eval_metric="error", 
    num_class=2, num_round=2, seed=None,

    # Tree Booster Params
    eta=0.3, gamma=0.0, max_depth=6, min_child_weight=1.0, max_delta_step=0.0, subsample=1.0,
    colsample_bytree=1.0, colsample_bylevel=1.0, reg_lambda=0.0, alpha=0.0, tree_method="auto",
    sketch_eps=0.03, scale_pos_weight=1.0, grow_policy='depthwise', max_bin=256,

    # Dart Booster Params
    sample_type="uniform", normalize_type="tree", rate_drop=0.0, skip_drop=0.0,

    # Linear Booster Params
    lambda_bias=0.0
)
xgboost_model = xgboost.fit(trainDF)

# Transform test set
xgboost_model.transform(testDF).show()

# Write model/classifier
xgboost.write().overwrite().save("xgboost_class_test")
xgboost_model.write().overwrite().save("xgboost_class_test.model")

Notes:

  • This only works on Spark 2.2+.
  • Pipelines and ParamGridBuilder are sort-of supported; use the modified pipeline objects via from sparkxgb.pipeline import XGBoostPipeline, XGBoostPipelineModel in the same way as the normal objects.
  • Due to a bug in XGboost-0.72, you must use float("+inf") rather than float("nan") for missing values to handle null values correctly.
  • Untrained model objects cannot be loaded back (see #3035).
  • This API will change with the full release of pyspark support.

@thesuperzapper I am trying to test this with pyspark in a jupyter notebook.

My system:
Python 3.6.1
xgboost 0.72
Spark 2.2.0
Java 1.8
Scala 2.12

When I try to load the XGBoostEstimator, I get the following:

Exception in thread "Thread-19" java.lang.NoClassDefFoundError: ml/dmlc/xgboost4j/scala/EvalTrait
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetPublicMethods(Class.java:2902)
    at java.lang.Class.getMethods(Class.java:1615)
    at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:345)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:305)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: ml.dmlc.xgboost4j.scala.EvalTrait
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 12 more

์ด๊ฒƒ์€ ๋ฒ„๊ทธ์ž…๋‹ˆ๊นŒ ์•„๋‹ˆ๋ฉด ๋ช‡ ๊ฐ€์ง€ ์š”๊ตฌ ์‚ฌํ•ญ์„ ๋†“์น˜๊ณ  ์žˆ์Šต๋‹ˆ๊นŒ?

@BogdanCojocar It looks like you are missing the xgboost library.

You need both of the jars for xgboost to work properly:

You can download the jars you need from the corresponding maven links.

Thanks @thesuperzapper, it works great. This pyspark integration is awesome!

Any suggestion on how to save the trained model as a booster, so it can be loaded in the Python module?

@ericwang915 ์ผ๋ฐ˜์ ์œผ๋กœ ๋‹ค๋ฅธ XGBoost ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ ์ƒํ˜ธ ์šด์šฉ๋˜๋Š” ๋ชจ๋ธ์„ ์–ป์œผ๋ ค๋ฉด ๋ชจ๋ธ ๊ฐœ์ฒด์— .booster.saveModel("XXX/XXX") ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ XXX๋Š” Spark ๋“œ๋ผ์ด๋ฒ„์˜ ๋กœ์ปฌ(๋น„ HDFS) ๊ฒฝ๋กœ์ž…๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ์ €์žฅ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋ฉด ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค(์ฐธ์กฐ: #2480).

However, I forgot to add a method that calls that save function in this version of the wrapper. I will add it tomorrow if I have time. (I live in New Zealand... so time zones)

๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๊ฑด ๊ทธ๋ ‡๊ณ , ํ›ˆ๋ จ ๊ณผ์ •์—์„œ ๋ฌด์Œ์ด 1๋กœ ์„ค์ •๋˜์–ด ์žˆ์–ด๋„ ํ‰๊ฐ€ ์ง€ํ‘œ์™€ ๋ถ€์ŠคํŒ… ๋ผ์šด๋“œ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๋กœ๊ทธ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค.

@thesuperzapper thanks for the instructions. I was able to follow them to train/save an xgboost model in pyspark. Any idea how to access other xgboost model functions, such as (scala) getFeatureScore()?

@ccdtzccdtz I am currently rewiring the pyspark wrapper, since the Spark API changed significantly in 0.8. Once it's done, I aim to have feature parity with the Spark Scala API.

I did not expose the native booster method in the initial pyspark wrapper, but if you use the Spark Scala API, you can call xgboost_model_object.nativeBooster.getFeatureScore() and use it as usual.

2๋ฒˆ ์ด์ƒ ์‹คํ–‰ํ•˜๋ฉด pyspark์˜ XGBoost๊ฐ€ ์ง€์†์ ์œผ๋กœ ์‹คํŒจํ•˜๋Š” ๊ฒƒ์„ ๋ณด์•˜์Šต๋‹ˆ๋‹ค. ๋™์ผํ•œ ์ฝ”๋“œ๋กœ ๋™์ผํ•œ ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ์‹คํ–‰ ์ค‘์ž…๋‹ˆ๋‹ค. ์ฒ˜์Œ์—๋Š” ์„ฑ๊ณตํ•˜์ง€๋งŒ ๋‘ ๋ฒˆ์งธ์—๋Š” ์‹คํŒจํ•ฉ๋‹ˆ๋‹ค. Spark 2.3์—์„œ XGBoost 0.72๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ž‘์—…์„ ๋‹ค์‹œ ์„ฑ๊ณต์ ์œผ๋กœ ์‹คํ–‰ํ•˜๋ ค๋ฉด pyspark ์…ธ์„ ๋‹ค์‹œ ์‹œ์ž‘ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๊ต์œก ๋ชฉ์ ์œผ๋กœ xgboost.trainWithDataFrame์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์ด ๋ฌธ์ œ๋ฅผ ๋ณธ ์‚ฌ๋žŒ์ด ์žˆ์Šต๋‹ˆ๊นŒ?

Hi @thesuperzapper,
What you prescribed works on a single worker node.
However, when trying to run pyspark xgboost with more than one worker (3 in this case), the executors go idle and are killed after a while.
Here is the code I am trying to run on the Titanic dataset (a small dataset):

from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

spark = SparkSession\
        .builder\
        .appName("PySpark XGBOOST Titanic")\
        .getOrCreate()

#spark.sparkContext.addPyFile("../sparkxgb.zip")

from automl.sparkxgb import XGBoostEstimator

schema = StructType(
  [StructField("PassengerId", DoubleType()),
    StructField("Survival", DoubleType()),
    StructField("Pclass", DoubleType()),
    StructField("Name", StringType()),
    StructField("Sex", StringType()),
    StructField("Age", DoubleType()),
    StructField("SibSp", DoubleType()),
    StructField("Parch", DoubleType()),
    StructField("Ticket", StringType()),
    StructField("Fare", DoubleType()),
    StructField("Cabin", StringType()),
    StructField("Embarked", StringType())
  ])

df_raw = spark\
  .read\
  .option("header", "true")\
  .schema(schema)\
  .csv("titanic.csv")


df = df_raw.na.fill(0)

sexIndexer = StringIndexer() \
    .setInputCol("Sex") \
    .setOutputCol("SexIndex") \
    .setHandleInvalid("keep")

cabinIndexer = StringIndexer() \
    .setInputCol("Cabin") \
    .setOutputCol("CabinIndex") \
    .setHandleInvalid("keep")

embarkedIndexer = StringIndexer() \
    .setInputCol("Embarked") \
    .setOutputCol("EmbarkedIndex") \
    .setHandleInvalid("keep")

vectorAssembler  = VectorAssembler()\
  .setInputCols(["Pclass", "SexIndex", "Age", "SibSp", "Parch", "Fare", "CabinIndex", "EmbarkedIndex"])\
  .setOutputCol("features")

xgboost = XGBoostEstimator(nworkers=2,
    featuresCol="features",
    labelCol="Survival",
    predictionCol="prediction"
)

pipeline = Pipeline().setStages([sexIndexer, cabinIndexer, embarkedIndexer, vectorAssembler, xgboost])
trainDF, testDF = df.randomSplit([0.8, 0.2], seed=24)

model = pipeline.fit(trainDF)
print(trainDF.schema)

๋‹ค์Œ์€ ์Šคํƒ ์ถ”์ ์ž…๋‹ˆ๋‹ค.
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.16.1.5, DMLC_TRACKER_PORT=9093, DMLC_NUM_WORKER=3}2018-09-04 08:52:55 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 192.168.49.43: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.2018-09-04 08:52:55 ERROR AsyncEventQueue:91 - Interrupted while posting to TaskFailedListener. Removing that listener.java.lang.InterruptedException: ExecutorLost during XGBoost Training: ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. at org.apache.spark.TaskFailedListener.onTaskEnd(SparkParallelismTracker.scala:116) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)

์‹คํ–‰๊ธฐ๊ฐ€ ๋‹ค์Œ์—์„œ ๋ฉˆ์ท„์Šต๋‹ˆ๋‹ค.
org.apache.spark.RDD.foreachPartition(RDD.scala:927) ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4$$anon$1.run(XGBoost.scala:348)

Environment: Python 3.5.4, Spark version 2.3.1, Xgboost 0.72

xgboost ๋ฐ spark ๊ตฌ์„ฑ์„ ๊ณต์œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ? ์–ผ๋งˆ๋‚˜
์ž‘์—…์ž(xgboost ์ž‘์—…์ž), ์ŠคํŒŒํฌ ์‹คํ–‰๊ธฐ, ์ฝ”์–ด ๋“ฑ

-๋‹ˆํ‹ด

@sagnik-rzt I'm not surprised that doesn't work, as that pyspark wrapper only supports XGboost 0.72. We are still working on the 0.8 one.

@thesuperzapper , ๊ท€ํ•˜๊ฐ€ ์ œ๊ณตํ•œ ๋ฒ„์ „์„ ๊ธฐ๋ฐ˜์œผ๋กœ xgboost 0.80์„ ์ง€์›ํ•˜๋„๋ก ์ผ๋ถ€ ๋ถ€๋ถ„์„ ๋‹ค์‹œ ์ˆ˜์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.
๊ทธ๋Ÿฌ๋‚˜ py4j.protocol.Py4JError: ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier does not exist in the JVM ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์— ์ „์ฒด ์„ค๋ช…์„ ์ œ๊ณตํ–ˆ์Šต๋‹ˆ๋‹ค. ์ข€ ๋ด์ฃผ์‹œ๊ฒ ์–ด์š”?

All the code is placed here.

Far more changes are needed than the ones I made to get it working on 0.8.

The main reason I haven't put a 0.8 version out is that I don't want to create an XGBoost-specific pipeline object like in 0.72, but instead use the default pipeline persistence.

@thesuperzapper, while using the 0.72 code, I used the XGBoostEstimator object directly instead of the XGBoostPipeline object, and noticed that training/fitting was not being distributed across the cluster's workers. Do I have to use XGBoostPipeline to distribute across the workers?

If not, do you know why training is not being distributed across the workers?

Update:
I also tried training with the XGBoostEstimator set as a stage in an XGBoostPipeline, but the problem persists: when run on the cluster, training is not distributed across workers, while it is for other pyspark-supported models.

Have you observed this behavior? How do you deal with it?

I have mostly re-coded the wrapper for XGBoost 0.8, but I can't easily test it in distributed mode, as my work cluster is still on 2.2 and my Dockerized Spark 2.3 cluster can't even train distributed Scala XGBoost models without missing-shuffle-location issues.

I think the issue @sagnik-rzt and other users are having is related to cluster configuration, or to a deeper issue with Spark-Scala XGBoost.

Are you able to train a model in Spark-Scala XGBoost?

@thesuperzapper ๋•๋ถ„์— ์…”ํ”Œ ์œ„์น˜๊ฐ€ ๋‚ด๋ถ€์ ์œผ๋กœ ์ฒ˜๋ฆฌ๋œ๋‹ค๊ณ  ์ƒ๊ฐํ–ˆ์Šต๋‹ˆ๋‹ค. ์ฆ‰, ํด๋Ÿฌ์Šคํ„ฐ ๊ตฌ์„ฑ๊ณผ ๋…๋ฆฝ์ ์œผ๋กœ ์ฒ˜๋ฆฌ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด stackoverflow ๊ฒŒ์‹œ๋ฌผ์„ ์ฐพ์•˜์œผ๋ฏ€๋กœ ์ด๋Ÿฌํ•œ ์ œ์•ˆ์„ ๊ตฌํ˜„ํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Also, could you share your 0.8 version once it's ready? I can test the distribution on my cluster; it has Spark 2.3.1 and Python 3.5.

๋ชจ๋ธ์„ ์ €์žฅํ•˜๊ณ  ๋กœ๋“œํ•œ ํ›„ ๋‹ค์Œ ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

IllegalArgumentException: u'requirement ์‹คํŒจ: ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ๋กœ๋“œ ์˜ค๋ฅ˜: ํด๋ž˜์Šค ์ด๋ฆ„ org.apache.spark.ml.Pipeline์ด ํ•„์š”ํ•˜์ง€๋งŒ ํด๋ž˜์Šค ์ด๋ฆ„ org.apache.spark.ml.PipelineModel์„ ์ฐพ์•˜์Šต๋‹ˆ๋‹ค.

์ด๊ฒƒ ์ข€ ๋„์™€์ฃผ์„ธ์š”. ๊ฐ์‚ฌ

import pyspark
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

# A SparkSession has to exist before addPyFile can be called
spark = SparkSession.builder.getOrCreate()
spark.sparkContext.addPyFile("sparkxgb.zip")
from sparkxgb import XGBoostEstimator

schema = StructType(
  [StructField("PassengerId", DoubleType()),
    StructField("Survival", DoubleType()),
    StructField("Pclass", DoubleType()),
    StructField("Name", StringType()),
    StructField("Sex", StringType()),
    StructField("Age", DoubleType()),
    StructField("SibSp", DoubleType()),
    StructField("Parch", DoubleType()),
    StructField("Ticket", StringType()),
    StructField("Fare", DoubleType()),
    StructField("Cabin", StringType()),
    StructField("Embarked", StringType())
  ])

df_raw = spark\
  .read\
  .option("header", "true")\
  .schema(schema)\
  .csv("train.csv")

df = df_raw.na.fill(0)

sexIndexer = StringIndexer()\
  .setInputCol("Sex")\
  .setOutputCol("SexIndex")\
  .setHandleInvalid("keep")

cabinIndexer = StringIndexer()\
  .setInputCol("Cabin")\
  .setOutputCol("CabinIndex")\
  .setHandleInvalid("keep")

embarkedIndexer = StringIndexer()\
  .setInputCol("Embarked")\
  .setOutputCol("EmbarkedIndex")\
  .setHandleInvalid("keep")

vectorAssembler = VectorAssembler()\
  .setInputCols(["Pclass", "SexIndex", "Age", "SibSp", "Parch", "Fare", "CabinIndex", "EmbarkedIndex"])\
  .setOutputCol("features")
xgboost = XGBoostEstimator(
    featuresCol="features", 
    labelCol="Survival", 
    predictionCol="prediction"
)

pipeline = Pipeline().setStages([sexIndexer, cabinIndexer, embarkedIndexer, vectorAssembler, xgboost])
model = pipeline.fit(df)
model.transform(df).select(col("PassengerId"), col("prediction")).show()

model.save("model_xgboost")
loadedModel = Pipeline.load("model_xgboost")


IllegalArgumentException: u'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.Pipeline but found class name org.apache.spark.ml.PipelineModel'


#predict2 = loadedModel.transform(df)

๋‹ค์Œ ์˜ต์…˜์„ ์‹œ๋„ํ–ˆ์Šต๋‹ˆ๋‹ค.

from pyspark.ml import PipelineModel
#model.save("model_xgboost")
loadedModel = PipelineModel.load("model_xgboost")

๋‹ค์Œ ์˜ค๋ฅ˜ ๋ฐœ์ƒ

No module named ml.dmlc.xgboost4j.scala.spark

๊ฐœ๋ฐœ์ž ๋‹ค์šด๋กœ๋“œ ๋งํฌ: sparkxgb.zip

์ด ๋ฒ„์ „์€ XGBoost-0.8์—์„œ ์ž‘๋™ํ•˜์ง€๋งŒ ํ…Œ์ŠคํŠธ๋‚˜ ์ด ์Šค๋ ˆ๋“œ์— ๊ธฐ์—ฌํ•˜๋Š” ๊ฒƒ ์™ธ์—๋Š” ์‚ฌ์šฉํ•˜์ง€ ๋งˆ์‹ญ์‹œ์˜ค. ๋ณ€๊ฒฝ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
(๋˜ํ•œ ์ฐธ๊ณ : Spark 2.2์˜ ๋ชจ๋“  ๋ฐฑํฌํŠธ๋ฅผ ์ œ๊ฑฐํ–ˆ์œผ๋ฏ€๋กœ Spark 2.3๋งŒ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.)

ํ•ด๋‹น ๋ฒ„์ „์—์„œ ๋‚ด๊ฐ€ ์•Œ๊ณ  ์žˆ๋Š” ์ฃผ์š” ๋ฌธ์ œ๋Š” ๋ถ„๋ฅ˜ ๋ชจ๋ธ์ด ์ €์žฅ๋œ ํ›„ ๋‹ค์‹œ ๋กœ๋“œ๋˜์ง€ ์•Š์•„ TypeError: 'JavaPackage' object is not callable ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด์ƒํ•˜๊ฒŒ๋„ XGBoostPipelineModel์€ XGBoost ๋ถ„๋ฅ˜ ๋‹จ๊ณ„์—์„œ ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ๋‚ด ๋ฌธ์ œ๋ผ๊ณ  ์ƒ๊ฐํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋ˆ„๊ตฐ๊ฐ€ ๋ถ„๋ฅ˜ ๋ชจ๋ธ์„ ์ฝ๋Š” ๊ฒƒ์ด ํšจ๊ณผ๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ?

๊ทธ๋Ÿผ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ , ์ €๋Š” DefaultParamsWritable ์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ตฌํ˜„ํ•˜๋ ค๊ณ  ์‹œ๋„ XGBoostPipeline ๋Œ€ํ•œ ํ•„์š”์„ฑ์„ ์ œ๊ฑฐํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋Š” ์žฅ๊ธฐ์ ์œผ๋กœ ์œ ์ง€ ๊ด€๋ฆฌํ•˜๊ธฐ๊ฐ€ ํ›จ์”ฌ ์‰ฌ์šฐ๋ฏ€๋กœ ์ฝ๊ธฐ/์“ฐ๊ธฐ ๋ฌธ์ œ๋Š” ์–ด์จŒ๋“  ๊ด€๋ จ์ด ์—†์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. (์ด๋ ‡๊ฒŒ ํ•˜๋ฉด CrossValidator์—์„œ ์ง€์†์„ฑ์ด ์ž‘๋™ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.)

์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰