Currently sklearn does not have a stratified group k-fold feature. We can use either stratification or group k-fold, but it would be good to have both.
I would like to implement it, if we decide to have it.
@TomDLT @NicolasHug What do you think?
It might be interesting in theory, but I'm not sure how useful it would be in practice. We can certainly keep the issue open and see how many people request this feature.
Do you assume each group is in a single class?
See also #9413.
@jnothman Yes, I had something similar in mind. However, that pull request still seems to be open. I meant that a group is never repeated across folds: if IDs are used as groups, the same ID will not occur in more than one fold.
I understand this is relevant to using RFECV.
Currently it defaults to a StratifiedKFold cv, and its fit() also takes groups=.
However, it appears that groups is not respected when fit() is executed, and no warning is raised (this might be considered a bug).
Grouping and stratification together are useful for quite imbalanced datasets with dependencies between records
(in my case, the same individual has multiple records, but there are still many groups (= people) relative to the number of splits; I imagine practical problems would arise as the number of unique groups in the minority class approaches the number of splits).
So: +1!
This would definitely be useful. For instance, when working with highly imbalanced time-series medical data, you want to keep patients separate while (approximately) balancing the imbalanced class in each fold.
I have also found that StratifiedKFold takes groups as a parameter but does not group according to it; that should probably be flagged.
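The observation above can be checked with a small sketch. It assumes a toy dataset (the arrays below are made up for illustration) and relies only on the documented `StratifiedKFold.split(X, y, groups)` signature, whose `groups` argument exists for API compatibility.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 12 samples, 2 classes, 4 groups of 3 samples each.
X = np.zeros((12, 1))
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
groups = np.repeat([0, 1, 2, 3], 3)

cv = StratifiedKFold(n_splits=3)
with_groups = [test.tolist() for _, test in cv.split(X, y, groups)]
without_groups = [test.tolist() for _, test in cv.split(X, y)]

# The argument is accepted, but the folds are the same with or without it.
same_folds = with_groups == without_groups

# Count the groups whose samples land in more than one test fold.
split_groups = sum(
    1
    for g in np.unique(groups)
    if sum(np.any(groups[test] == g) for test in with_groups) > 1
)
print("identical folds:", same_folds, "| groups split across folds:", split_groups)
```

If `split_groups` is non-zero, samples of one group are being evaluated against a model trained on their siblings, which is the leakage the commenters want to avoid.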
Another good use of this feature would be financial data, which is usually very imbalanced. In my case, I have a highly imbalanced dataset with several records for the same entity (at different points in time). We want to use GroupKFold to avoid leakage, but also to stratify, because with such a high imbalance we could end up with groups that have very few or no positives.
See also #14524.
Another use case for a stratified GroupShuffleSplit and GroupKFold is biological "repeated measures" designs, where you have multiple samples per subject or other parent biological unit. Many real-world datasets in biology also have class imbalance. Each group of samples has the same class, so it is important to stratify and keep groups together.
Hi, I think this would be very useful for medical ML. Is it already implemented?
@amueller Given that people are interested in this, do you think we should implement it?
I am very interested too... It would be really useful in spectroscopy, where you have several replicate measurements for each sample and they really need to stay in the same fold during cross-validation. And if you have several imbalanced classes that you are trying to classify, you really want to use stratification as well. So I vote for it too! Sorry, I am not good enough to take part in the development, but those who do can be sure it will be used :-)
Thumbs up to the whole team. Thanks!
Please look at the issues and PRs referenced in this thread, as work has already been attempted for at least StratifiedGroupKFold. I have already written a StratifiedGroupShuffleSplit in #15239; it just needs tests, but I have already used it quite a bit in my own work.
I think we should implement it, but I still don't know what we actually want. @hermidalc has the restriction that members of the same group must be of the same class. That's not the general case, right?
It would be great if people who are interested could describe their use case and what they really want out of this.
There are #15239, #14524 and #9413, which I remember all having different semantics.
@amueller I totally agree with you. I spent a few hours today looking through the different versions available (#15239, #14524 and #9413) but couldn't really tell whether any of them would fit my needs. So here is my use case, if it helps:
I have 1000 samples. Each sample has been measured 3 times with a NIR spectrometer, so each sample has 3 replicates that I want to keep together all the way through...
These 1000 samples belong to 6 different classes with very different numbers of samples:
class 1: 400 samples
class 2: 300 samples
class 3: 100 samples
class 4: 100 samples
class 5: 70 samples
class 6: 30 samples
I want to build a classifier for each class: class 1 vs. all the other classes, then class 2 vs. all the other classes, and so on.
To maximize the accuracy of each classifier, it is important that samples from all 6 classes are represented in each fold. My classes are not that different from one another, so always having all 6 classes in each fold really helps to draw accurate boundaries.
This is why I believe a stratified (all 6 of my classes always represented in each fold) group (the 3 replicate measurements of each sample always kept together) k-fold is very much what I am looking for here.
Any opinions?
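The two requirements in this use case (replicates stay together, every class shows up in every fold) can be written down as a check on a single train/test split. The sketch below uses only the standard library; `describe_fold` and the miniature dataset are hypothetical names and data introduced for illustration, not part of scikit-learn.

```python
def describe_fold(train_idx, test_idx, y, groups):
    """Return (groups present on both sides, classes missing from the test fold)."""
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    leaked = train_groups & test_groups
    missing = set(y) - {y[i] for i in test_idx}
    return leaked, missing

# Hypothetical miniature of the use case: 6 samples x 3 replicates, 3 classes.
groups = [sample for sample in range(6) for _ in range(3)]
y = [sample % 3 for sample in groups]   # replicates share their sample's class

# A split that respects both constraints: samples 0, 1 and 2 form the test fold.
good_test = [i for i, g in enumerate(groups) if g in (0, 1, 2)]
good_train = [i for i, g in enumerate(groups) if g not in (0, 1, 2)]

# A split that ignores them: the first four rows cut sample 1 apart.
bad_test = [0, 1, 2, 3]
bad_train = list(range(4, len(groups)))

good = describe_fold(good_train, good_test, y, groups)
bad = describe_fold(bad_train, bad_test, y, groups)
print("good split:", good)
print("bad split: ", bad)
```

A splitter with the semantics requested here would return an empty set for both items on every fold, data permitting.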
My use case, and the reason I wrote StratifiedGroupShuffleSplit, is to support repeated measures designs (https://en.wikipedia.org/wiki/Repeated_measures_design). In my use cases, members of the same group must be of the same class.
@fcoppey For you, samples within a group always have the same class, right?
@hermidalc I'm not very familiar with the terminology, but from Wikipedia it sounds like "repeated measures design" does not mean that the same group must be within the same class, since it says "A crossover trial has a repeated measures design in which each patient is assigned to a sequence of two or more treatments, of which one may be a standard treatment or a placebo."
Relating this to an ML setting, you could either try to predict from the measurements whether an individual received the treatment or a placebo, or try to predict an outcome given the treatment.
In either case, the class for the same individual could change, right?
Regardless of the naming, it sounds like you two have the same use case, whereas I was thinking of a case similar to the crossover study. Or perhaps something a bit simpler: a patient could become sick (or get better) over time, so the outcome for a patient could change.
In fact, the Wikipedia article you link explicitly says "Longitudinal analysis: Repeated measure designs allow researchers to monitor how participants change over time, both long- and short-term situations.", so I think a changing class is included.
If there is another term that means the measurements are taken under the same conditions, could we use that term?
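@amueller Yes, you are right. I realize I miswrote above: I meant to say "in my use cases of this design", not "in this use case in general".
There can be many fairly elaborate types of repeated measures designs, but in the two types for which I have needed StratifiedGroupShuffleSplit, the same-class-within-group restriction holds (longitudinal sampling before and after treatment when predicting treatment response, and multiple pre-treatment samples per subject at different body locations when predicting treatment response).
I needed something that worked right away, so I wanted to put it up for others to use and to get something started in sklearn. Also, if I'm not mistaken, the stratification logic is more complicated to design when class labels can differ within a group.
@amueller Yes, always. They are replicates of the same measurement, included so that the prediction accounts for the intra-device variability.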
@hermidalc Yes, this case is much easier. If it's a common need, I'm happy for us to add it. We should just make sure that the name makes it reasonably clear what it does, and we should think about whether these two versions should live in the same class.
It should be quite easy to make StratifiedKFold do this. There are two options: ensure that each fold contains a similar number of samples, or ensure that each fold contains a similar number of groups.
The second one is trivial (just pretend each group is a single point and pass that to StratifiedKFold). That looks like what you do in your PR.
I think GroupKFold heuristically trades off the two by adding to the smallest fold first. I'm not sure how that would translate to the stratified case, so I'm happy with using your approach.
Should we also add GroupStratifiedKFold in the same PR, or leave that for later?
The other PRs have slightly different goals. It would be good if someone could write down what the different use cases are (I probably don't have time right now).
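The "pretend each group is a single point" option can be sketched as follows. This is a minimal illustration under the stated restriction that every sample in a group has the same class; it is not the code from the PR, and `stratified_group_folds` is a hypothetical helper name.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_group_folds(y, groups, n_splits=3):
    """Yield (train_idx, test_idx), stratifying over groups rather than samples.

    Assumes every sample in a group carries the same class label.
    """
    y = np.asarray(y)
    groups = np.asarray(groups)
    # One representative per group: its id and the class of its first sample.
    unique_groups, first_idx = np.unique(groups, return_index=True)
    group_y = y[first_idx]
    cv = StratifiedKFold(n_splits=n_splits)
    dummy_X = np.zeros((len(unique_groups), 1))
    for _, test_g in cv.split(dummy_X, group_y):
        # Map the selected groups back to sample indices.
        test_mask = np.isin(groups, unique_groups[test_g])
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical data: 6 groups of class 0 and 3 groups of class 1, 2 samples each.
y = np.array([0] * 12 + [1] * 6)
groups = np.repeat(np.arange(9), 2)
folds = list(stratified_group_folds(y, groups, n_splits=3))
for train_idx, test_idx in folds:
    print("test groups:", np.unique(groups[test_idx]), "class counts:", np.bincount(y[test_idx]))
```

Because the stratification happens on groups, class proportions per fold are only preserved at the sample level when groups have similar sizes, which is the limitation raised later in the thread.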
+1 for handling separately the group constraint where all samples in a group have the same class.
> @hermidalc Yes, this case is much easier. If it's a common need, I'm happy for us to add it. We should just make sure that the name makes it reasonably clear what it does, and we should think about whether these two versions should live in the same class.
I don't totally understand this. A StratifiedGroupShuffleSplit and StratifiedGroupKFold that allow the members of a group to be of different classes should have exactly the same split behavior when the user specifies that all group members are of the same class. So couldn't we improve the internals later while the existing behavior stays the same?
> The second one is trivial (just pretend each group is a single point and pass that to StratifiedKFold). That looks like what you do in your PR. I think GroupKFold heuristically trades off the two by adding to the smallest fold first. I'm not sure how that would translate to the stratified case, so I'm happy with using your approach.
> Should we also add GroupStratifiedKFold in the same PR, or leave that for later?
> The other PRs have slightly different goals. It would be good if someone could write down what the different use cases are (I probably don't have time right now).
I will add StratifiedGroupKFold using the "each group is a single sample" approach I used.
> It would be great if people who are interested could describe their use case and what they really want out of this.
This is a very common use case in medicine and biology when you have repeated measures.
An example: assume you want to classify a disease, e.g. Alzheimer's disease (AD) vs. healthy controls, from MR images. For the same subject you may have several scans (from follow-up sessions or longitudinal data). Assume you have a total of 1000 subjects, 200 of whom are diagnosed with AD (imbalanced classes). Most subjects have one scan, but for some of them 2 or 3 images are available. When training/testing the classifier, you want to make sure that images from the same subject are always in the same fold, to avoid data leakage.
It is best to use a StratifiedGroupKFold for this: stratify to account for the class imbalance, with the group constraint that a subject must not appear in different folds.
NB: It would be nice to make it repeatable.
Below is an example implementation, inspired by a kaggle kernel.
import numpy as np
from collections import Counter, defaultdict
from sklearn.utils import check_random_state

class RepeatedStratifiedGroupKFold():

    def __init__(self, n_splits=5, n_repeats=1, random_state=None):
        self.n_splits = n_splits
        self.n_repeats = n_repeats
        self.random_state = random_state

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def split(self, X, y=None, groups=None):
        k = self.n_splits

        def eval_y_counts_per_fold(y_counts, fold):
            # Tentatively add the group to `fold`, score the balance, then undo.
            y_counts_per_fold[fold] += y_counts
            std_per_label = []
            for label in range(labels_num):
                label_std = np.std(
                    [y_counts_per_fold[i][label] / y_distr[label] for i in range(k)]
                )
                std_per_label.append(label_std)
            y_counts_per_fold[fold] -= y_counts
            return np.mean(std_per_label)

        rnd = check_random_state(self.random_state)
        for repeat in range(self.n_repeats):
            labels_num = np.max(y) + 1
            y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
            y_distr = Counter()
            for label, g in zip(y, groups):
                y_counts_per_group[g][label] += 1
                y_distr[label] += 1

            y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
            groups_per_fold = defaultdict(set)

            groups_and_y_counts = list(y_counts_per_group.items())
            rnd.shuffle(groups_and_y_counts)

            # Place the most lopsided groups first, each into its best fold.
            for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
                best_fold = None
                min_eval = None
                for i in range(k):
                    fold_eval = eval_y_counts_per_fold(y_counts, i)
                    if min_eval is None or fold_eval < min_eval:
                        min_eval = fold_eval
                        best_fold = i
                y_counts_per_fold[best_fold] += y_counts
                groups_per_fold[best_fold].add(g)

            all_groups = set(groups)
            for i in range(k):
                train_groups = all_groups - groups_per_fold[i]
                test_groups = groups_per_fold[i]

                train_indices = [idx for idx, g in enumerate(groups) if g in train_groups]
                test_indices = [idx for idx, g in enumerate(groups) if g in test_groups]

                yield train_indices, test_indices
Comparison of RepeatedStratifiedKFold (samples of the same group can appear in both folds) with RepeatedStratifiedGroupKFold:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection

def plot_cv_indices(cv, X, y, group, ax, n_splits, lw=10):
    # One row per CV iteration: test samples in one color, training samples in another.
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        ax.scatter(range(len(indices)), [ii + .5] * len(indices),
                   c=indices, marker='_', lw=lw, cmap=plt.cm.coolwarm,
                   vmin=-.2, vmax=1.2)

    # Two extra rows at the bottom: class labels and group labels.
    ax.scatter(range(len(X)), [ii + 1.5] * len(X), c=y, marker='_',
               lw=lw, cmap=plt.cm.Paired)
    ax.scatter(range(len(X)), [ii + 2.5] * len(X), c=group, marker='_',
               lw=lw, cmap=plt.cm.tab20c)

    yticklabels = list(range(n_splits)) + ['class', 'group']
    ax.set(yticks=np.arange(n_splits+2) + .5, yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits+2.2, -.2], xlim=[0, 100])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)

# demonstration
np.random.seed(1338)
n_splits = 4
n_repeats = 5

# Generate the class/group data
n_points = 100
X = np.random.randn(100, 10)

percentiles_classes = [.4, .6]
y = np.hstack([[ii] * int(100 * perc) for ii, perc in enumerate(percentiles_classes)])

# Evenly spaced groups
g = np.hstack([[ii] * 5 for ii in range(20)])

fig, ax = plt.subplots(1, 2, figsize=(14, 4))

cv_nogrp = model_selection.RepeatedStratifiedKFold(n_splits=n_splits,
                                                   n_repeats=n_repeats,
                                                   random_state=1338)
cv_grp = RepeatedStratifiedGroupKFold(n_splits=n_splits,
                                      n_repeats=n_repeats,
                                      random_state=1338)

plot_cv_indices(cv_nogrp, X, y, g, ax[0], n_splits * n_repeats)
plot_cv_indices(cv_grp, X, y, g, ax[1], n_splits * n_repeats)

plt.show()
+1 for stratifiedGroupKfold. I am trying to detect falls of elderly people using sensors from a Samsung watch. Since I don't have much fall data, I run simulations with different watches, which yield different classes. I also augment the data before training: from each data point I create 9 points, and these form a group. As explained above, it is important that a group does not end up in both train and test.
I would like to be able to use StratifiedGroupKFold as well. I'm looking at a dataset for predicting financial crises, where the years before, during and after each crisis form their own group. During training and cross-validation, members of a group should not leak between folds.
What about the multilabel scenario (Multilabel_stratifiedGroupKfold)?
+1 for this. We're analyzing user accounts for spam, so we want to group by user, but also stratify because the incidence of spam is relatively low. In our use case, any user who spams once is flagged as a spammer in all data, so the members of a group always have the same label.
Thank you for providing a classic use case to frame the documentation, @philip-iv!
I added a StratifiedGroupKFold implementation to the same PR, #15239, as StratifiedGroupShuffleSplit.
As you can see in the PR, the logic for both is much simpler than https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-557802602, because mine only attempts to preserve the percentage of groups for each class (not the percentage of samples), so that I can leverage the existing StratifiedKFold and StratifiedShuffleSplit code by passing it the unique group information. Both implementations, however, produce folds in which the samples of each group stay together in the same fold.
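The difference between "percentage of groups" and "percentage of samples" per class only matters when group sizes vary. The following standard-library sketch makes that concrete; `class_shares` and the data are hypothetical and assume one class per group.

```python
from collections import Counter

def class_shares(y, groups):
    """Return (share of samples, share of groups) per class as two dicts."""
    sample_counts = Counter(y)
    # Label each group by the class of its first sample (one class per group assumed).
    group_label = {}
    for label, g in zip(y, groups):
        group_label.setdefault(g, label)
    group_counts = Counter(group_label.values())
    n_samples, n_groups = len(y), len(group_label)
    by_samples = {c: sample_counts[c] / n_samples for c in sample_counts}
    by_groups = {c: group_counts[c] / n_groups for c in group_counts}
    return by_samples, by_groups

# Hypothetical data: class 1 has only two groups, but one of them is large.
y      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
groups = [1, 2, 3, 4, 5, 5, 5, 5, 5, 6]
by_samples, by_groups = class_shares(y, groups)
print("share of samples per class:", by_samples)
print("share of groups per class: ", by_groups)
```

Here class 1 accounts for 60% of the samples but only a third of the groups, so a splitter that balances groups can still produce folds with quite different sample-level class proportions.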
That said, I would vote for a more sophisticated method based on https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-557802602.
Here are full-fledged versions of StratifiedGroupKFold and RepeatedStratifiedGroupKFold using the code @mrunibe provided, which I simplified further and changed in a few places. These classes also follow the design of the other sklearn CV classes of the same type.
# Imports needed to run the two classes below on their own.
from collections import Counter, defaultdict

import numpy as np

from sklearn.model_selection._split import _BaseKFold, _RepeatedSplits
from sklearn.utils.validation import check_random_state


class StratifiedGroupKFold(_BaseKFold):
    """Stratified K-Folds iterator variant with non-overlapping groups.

    This cross-validation object is a variation of StratifiedKFold that returns
    stratified folds with non-overlapping groups. The folds are made by
    preserving the percentage of samples for each class.

    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).

    The difference between GroupKFold and StratifiedGroupKFold is that
    the former attempts to create balanced folds such that the number of
    distinct groups is approximately the same in each fold, whereas
    StratifiedGroupKFold attempts to create folds which preserve the
    percentage of samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    shuffle : bool, default=False
        Whether to shuffle each class's samples before splitting into batches.
        Note that the samples within each split will not be shuffled.

    random_state : int or RandomState instance, default=None
        When `shuffle` is True, `random_state` affects the ordering of the
        indices, which controls the randomness of each fold for each class.
        Otherwise, leave `random_state` as `None`.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import StratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = StratifiedGroupKFold(n_splits=3)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 6 6 7]
           [1 1 1 0 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 8 8]
           [0 0 1 1 1 0 0]
    TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
           [0 0 1 1 1 1 0 0 0 0 0 0]
     TEST: [2 2 6 6 7]
           [1 1 0 0 0]
    TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
           [0 0 1 1 1 1 1 0 0 0 0 0]
     TEST: [4 5 5 5 5]
           [1 0 0 0 0]

    See also
    --------
    StratifiedKFold: Takes class information into account to build folds which
        retain class distributions (for binary or multiclass classification
        tasks).

    GroupKFold: K-fold iterator variant with non-overlapping groups.
    """

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def _iter_test_indices(self, X, y, groups):
        labels_num = np.max(y) + 1
        y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
        y_distr = Counter()
        for label, group in zip(y, groups):
            y_counts_per_group[group][label] += 1
            y_distr[label] += 1

        y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
        groups_per_fold = defaultdict(set)

        groups_and_y_counts = list(y_counts_per_group.items())
        rng = check_random_state(self.random_state)
        if self.shuffle:
            rng.shuffle(groups_and_y_counts)

        # Place the most lopsided groups first, each into the fold that keeps
        # the per-class shares most even across folds.
        for group, y_counts in sorted(groups_and_y_counts,
                                      key=lambda x: -np.std(x[1])):
            best_fold = None
            min_eval = None
            for i in range(self.n_splits):
                y_counts_per_fold[i] += y_counts
                std_per_label = []
                for label in range(labels_num):
                    std_per_label.append(np.std(
                        [y_counts_per_fold[j][label] / y_distr[label]
                         for j in range(self.n_splits)]))
                y_counts_per_fold[i] -= y_counts
                fold_eval = np.mean(std_per_label)
                if min_eval is None or fold_eval < min_eval:
                    min_eval = fold_eval
                    best_fold = i
            y_counts_per_fold[best_fold] += y_counts
            groups_per_fold[best_fold].add(group)

        for i in range(self.n_splits):
            test_indices = [idx for idx, group in enumerate(groups)
                            if group in groups_per_fold[i]]
            yield test_indices


class RepeatedStratifiedGroupKFold(_RepeatedSplits):
    """Repeated Stratified K-Fold cross validator.

    Repeats Stratified K-Fold with non-overlapping groups n times with
    different randomization in each repetition.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    n_repeats : int, default=10
        Number of times cross-validator needs to be repeated.

    random_state : int or RandomState instance, default=None
        Controls the generation of the random states for each repetition.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import RepeatedStratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = RepeatedStratifiedGroupKFold(n_splits=2, n_repeats=2,
    ...                                   random_state=36851234)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
    TRAIN: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
     TEST: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
    TRAIN: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]
     TEST: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
    TRAIN: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
     TEST: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]

    Notes
    -----
    Randomized CV splitters may return different results for each call of
    split. You can make the results identical by setting `random_state`
    to an integer.

    See also
    --------
    RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.
    """

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        super().__init__(StratifiedGroupKFold, n_splits=n_splits,
                         n_repeats=n_repeats, random_state=random_state)
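The core of `_iter_test_indices` above is a greedy assignment: each group goes to the fold that leaves the per-class shares most even across folds. A standard-library restatement of that idea is sketched below. It is an illustration under assumptions, not the posted code: `greedy_group_folds` is a hypothetical name, `statistics.pstdev` stands in for `np.std`, only labels that actually occur are scored, and tie-breaking may therefore differ from the class above.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

def greedy_group_folds(y, groups, n_splits):
    """Assign each group to the fold that keeps class shares most even."""
    labels = sorted(set(y))
    totals = Counter(y)                        # samples per class overall
    per_group = defaultdict(Counter)           # class counts inside each group
    for label, g in zip(y, groups):
        per_group[g][label] += 1

    fold_counts = [Counter() for _ in range(n_splits)]
    fold_groups = [set() for _ in range(n_splits)]

    def score(candidate, counts):
        """Mean over classes of the spread of that class's share across folds."""
        spreads = []
        for label in labels:
            shares = [
                (fold_counts[f][label] + (counts[label] if f == candidate else 0))
                / totals[label]
                for f in range(n_splits)
            ]
            spreads.append(pstdev(shares))
        return mean(spreads)

    # Most lopsided groups first, mirroring the sort key used above.
    order = sorted(
        per_group.items(),
        key=lambda item: -pstdev([item[1][label] for label in labels]),
    )
    for g, counts in order:
        best = min(range(n_splits), key=lambda f: score(f, counts))
        fold_counts[best].update(counts)
        fold_groups[best].add(g)
    return fold_groups

y = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
groups = [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]
folds = greedy_group_folds(y, groups, n_splits=3)
print(folds)
```

Each group ends up in exactly one fold by construction; how evenly the classes are spread depends on the group sizes, since a greedy pass cannot revisit earlier choices.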
@hermidalc When I look back at this from time to time, I'm quite confused about what we have resolved. (Unfortunately my time is not what it used to be!) Could you tell me what you would recommend including in scikit-learn?
> @hermidalc When I look back at this from time to time, I'm quite confused about what we have resolved. (Unfortunately my time is not what it used to be!) Could you tell me what you would recommend including in scikit-learn?
I wanted to do a better implementation than the one in #15239. The implementation in that PR works, but it stratifies on the groups to keep the logic simple, which isn't ideal.
So what I did above (thanks to @mrunibe and the kaggle kernel from jakubwasikowski) is a better implementation of StratifiedGroupKFold that stratifies on the samples. I want to port the same logic to a better StratifiedGroupShuffleSplit, and then it will be ready. I will put the new code in #15239 to replace the older implementation.
I apologize for my PRs being unfinished; I'm getting my PhD, so I never have time!
Thanks @hermidalc and @mrunibe for providing the implementation. I have also been looking for a StratifiedGroupKFold method to handle medical data with strong class imbalance and a widely varying number of samples per subject. GroupKFold on its own creates training data subsets that contain only one class.
> I want to port the same logic to a better StratifiedGroupShuffleSplit, and then it will be ready.
We could consider merging StratifiedGroupKFold before StratifiedGroupShuffleSplit is ready.
> I apologize for my PRs being unfinished; I'm getting my PhD, so I never have time!
Let us know if you need support!
And good luck with your PhD work.
> Here are full-fledged versions of StratifiedGroupKFold and RepeatedStratifiedGroupKFold using the code @mrunibe provided, which I simplified further and changed in a few places. These classes also follow the design of the other sklearn CV classes of the same type.
Is it possible to try this out? I tried cutting and pasting it along with some of the various dependencies, but I never got to the end of them. I would love to try this class in my project; I'm just trying to see whether there is a way to do that now.
@hermidalc I hope your PhD work went well!
My PhD work in geoscience needs this stratification feature with group control, so I'm looking forward to seeing this implementation completed. I spent a few hours implementing this idea of splitting manually in my project, but I gave up on finishing it for the same reason... PhD progress. So I can fully understand how PhD work can eat up a person's time. LOL, no pressure. For now, I use GroupShuffleSplit as an alternative.
Cheers
@bfeeny @dispink It's very easy to use the two classes I wrote above. Create a file, e.g. split.py, with the following contents. Then, in your user code, if your script is in the same directory as split.py, you simply import them with from split import StratifiedGroupKFold, RepeatedStratifiedGroupKFold.
from collections import Counter, defaultdict

import numpy as np

from sklearn.model_selection._split import _BaseKFold, _RepeatedSplits
from sklearn.utils.validation import check_random_state


class StratifiedGroupKFold(_BaseKFold):
    """Stratified K-Folds iterator variant with non-overlapping groups.

    This cross-validation object is a variation of StratifiedKFold that returns
    stratified folds with non-overlapping groups. The folds are made by
    preserving the percentage of samples for each class.

    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).

    The difference between GroupKFold and StratifiedGroupKFold is that
    the former attempts to create balanced folds such that the number of
    distinct groups is approximately the same in each fold, whereas
    StratifiedGroupKFold attempts to create folds which preserve the
    percentage of samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    shuffle : bool, default=False
        Whether to shuffle each class's samples before splitting into batches.
        Note that the samples within each split will not be shuffled.

    random_state : int or RandomState instance, default=None
        When `shuffle` is True, `random_state` affects the ordering of the
        indices, which controls the randomness of each fold for each class.
        Otherwise, leave `random_state` as `None`.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import StratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = StratifiedGroupKFold(n_splits=3)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 6 6 7]
           [1 1 1 0 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 8 8]
           [0 0 1 1 1 0 0]
    TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
           [0 0 1 1 1 1 0 0 0 0 0 0]
     TEST: [2 2 6 6 7]
           [1 1 0 0 0]
    TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
           [0 0 1 1 1 1 1 0 0 0 0 0]
     TEST: [4 5 5 5 5]
           [1 0 0 0 0]

    See also
    --------
    StratifiedKFold: Takes class information into account to build folds which
        retain class distributions (for binary or multiclass classification
        tasks).

    GroupKFold: K-fold iterator variant with non-overlapping groups.
    """

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def _iter_test_indices(self, X, y, groups):
        labels_num = np.max(y) + 1
        y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
        y_distr = Counter()
        for label, group in zip(y, groups):
            y_counts_per_group[group][label] += 1
            y_distr[label] += 1

        y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
        groups_per_fold = defaultdict(set)

        groups_and_y_counts = list(y_counts_per_group.items())
        rng = check_random_state(self.random_state)
        if self.shuffle:
            rng.shuffle(groups_and_y_counts)

        for group, y_counts in sorted(groups_and_y_counts,
                                      key=lambda x: -np.std(x[1])):
            best_fold = None
            min_eval = None
            for i in range(self.n_splits):
                y_counts_per_fold[i] += y_counts
                std_per_label = []
                for label in range(labels_num):
                    std_per_label.append(np.std(
                        [y_counts_per_fold[j][label] / y_distr[label]
                         for j in range(self.n_splits)]))
                y_counts_per_fold[i] -= y_counts
                fold_eval = np.mean(std_per_label)
                if min_eval is None or fold_eval < min_eval:
                    min_eval = fold_eval
                    best_fold = i
            y_counts_per_fold[best_fold] += y_counts
            groups_per_fold[best_fold].add(group)

        for i in range(self.n_splits):
            test_indices = [idx for idx, group in enumerate(groups)
                            if group in groups_per_fold[i]]
            yield test_indices


class RepeatedStratifiedGroupKFold(_RepeatedSplits):
    """Repeated Stratified K-Fold cross validator.

    Repeats Stratified K-Fold with non-overlapping groups n times with
    different randomization in each repetition.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    n_repeats : int, default=10
        Number of times cross-validator needs to be repeated.

    random_state : int or RandomState instance, default=None
        Controls the generation of the random states for each repetition.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import RepeatedStratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = RepeatedStratifiedGroupKFold(n_splits=2, n_repeats=2,
    ...                                   random_state=36851234)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
    TRAIN: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
     TEST: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
    TRAIN: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]
     TEST: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
    TRAIN: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
     TEST: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]

    Notes
    -----
    Randomized CV splitters may return different results for each call of
    split. You can make the results identical by setting `random_state`
    to an integer.

    See also
    --------
    RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.
    """

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        super().__init__(StratifiedGroupKFold, n_splits=n_splits,
                         n_repeats=n_repeats, random_state=random_state)
@hermidalc Thanks for the positive reply!
I quickly adopted it as you described. However, I only get splits that have data in either the training set or the test set. As far as I understand the code description, there is no parameter to specify the proportion between the training and test sets, right?
I know there is a conflict between stratification, group control and the proportion of the data sets... That's why I gave up on continuing... But maybe we can still find a compromise to work around it.
Sincerely
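On the question of the train/test proportion: in any k-fold scheme the test share is not a separate parameter but follows from the number of folds, at roughly 1/n_splits. The sketch below shows that arithmetic; `n_splits_for_test_fraction` is a hypothetical helper, and with groups of unequal size the realized share per fold will only approximate the target.

```python
def n_splits_for_test_fraction(test_fraction):
    """Approximate fold count for a desired test share (test share is about 1/k)."""
    if not 0 < test_fraction <= 0.5:
        raise ValueError("test_fraction must be in (0, 0.5] for k-fold splitting")
    return max(2, round(1 / test_fraction))

for fraction in (0.5, 0.25, 0.2, 0.1):
    k = n_splits_for_test_fraction(fraction)
    print(f"target test share {fraction:.2f} -> n_splits={k}, "
          f"train share about {1 - 1 / k:.2f}")
```

If a single split with an exact proportion is needed rather than k folds, a shuffle-split style iterator (such as the GroupShuffleSplit mentioned earlier in the thread) is the closer fit.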
> @hermidalc Thanks for the positive reply!
> I quickly adopted it as you described. However, I only get splits that have data in either the training set or the test set. As far as I understand the code description, there is no parameter to specify the proportion between the training and test sets, right?
> I know there is a conflict between stratification, group control and the proportion of the data sets... That's why I gave up on continuing... But maybe we can still find a compromise to work around it.
To test this, I created split.py and ran the example below in ipython, and it works. I have been using these custom CV iterators in my work for a long time and have had no problems with them. By the way, I'm using scikit-learn 0.22.2, not 0.23.x, so I'm not sure whether that is the cause of the issue. Could you run the example below and see whether you can reproduce it? If you can, the problem might be something in the y and groups of your own data.
In [6]: import numpy as np
   ...: from split import StratifiedGroupKFold
   ...:
   ...: X = np.ones((17, 2))
   ...: y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
   ...: groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
   ...: cv = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=777)
   ...: for train_idxs, test_idxs in cv.split(X, y, groups):
   ...:     print("TRAIN:", groups[train_idxs])
   ...:     print("      ", y[train_idxs])
   ...:     print(" TEST:", groups[test_idxs])
   ...:     print("      ", y[test_idxs])
   ...:
TRAIN: [2 2 4 5 5 5 5 6 6 7]
       [1 1 1 0 0 0 0 0 0 0]
 TEST: [1 1 3 3 3 8 8]
       [0 0 1 1 1 0 0]
TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
       [0 0 1 1 1 1 0 0 0 0 0 0]
 TEST: [2 2 6 6 7]
       [1 1 0 0 0]
TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
       [0 0 1 1 1 1 1 0 0 0 0 0]
 TEST: [4 5 5 5 5]
       [1 0 0 0 0]
There seems to be regular interest in this feature, @hermidalc, and we could probably find someone to finish it off, if you don't mind.
@hermidalc "You have to make sure that every sample in the same group has the same class label." Obviously that's the problem: my samples in the same group do not share the same class. Hmm... that seems to be another branch of the development.
Thank you very much anyway.
> @hermidalc "You have to make sure that every sample in the same group has the same class label." Obviously that's the problem: my samples in the same group do not share the same class. Hmm... that seems to be another branch of the development.
> Thank you very much anyway.
Yes, this has been discussed in various threads here. It's another, more complex use case that is useful, but many people like me don't currently need it; we need something that keeps groups together yet stratifies on samples. The requirement of the code above is that all samples in each group belong to the same class.
Actually, @dispink, I was wrong: this algorithm does not require that all members of a group belong to the same class. For example:
In [2]: X = np.ones((17, 2))
   ...: y = np.array([0, 2, 1, 1, 2, 0, 0, 1, 2, 1, 1, 1, 0, 2, 0, 1, 0])
   ...: groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
   ...: cv = StratifiedGroupKFold(n_splits=3)
   ...: for train_idxs, test_idxs in cv.split(X, y, groups):
   ...:     print("TRAIN:", groups[train_idxs])
   ...:     print("      ", y[train_idxs])
   ...:     print(" TEST:", groups[test_idxs])
   ...:     print("      ", y[test_idxs])
   ...:
TRAIN: [1 1 2 2 3 3 3 4 8 8]
       [0 2 1 1 2 0 0 1 1 0]
 TEST: [5 5 5 5 6 6 7]
       [2 1 1 1 0 2 0]
TRAIN: [1 1 4 5 5 5 5 6 6 7 8 8]
       [0 2 1 2 1 1 1 0 2 0 1 0]
 TEST: [2 2 3 3 3]
       [1 1 2 0 0]
TRAIN: [2 2 3 3 3 5 5 5 5 6 6 7]
       [1 1 2 0 0 2 1 1 1 0 2 0]
 TEST: [1 1 4 8 8]
       [0 2 1 1 0]
So I'm not quite sure what is happening with your data; even with your screenshots, one cannot really see what your data layout is or what might be going on. I would suggest first reproducing the example I showed here, to make sure it is not a scikit-learn version issue (since I'm using 0.22.2). If you can reproduce it, I would suggest testing with smaller parts of your data. Troubleshooting with ~104k samples is tough.
@hermidalc Thanks for the reply!
I can actually reproduce the result above, so I'm now troubleshooting with smaller data.
+1
Does anyone mind if I pick this issue up?
It seems that #15239, together with https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-600894432, already contains an implementation, and only unit tests are left to do.