Granted, I have not performed extensive hyperparameter tuning or further testing, and this might be a very particular case dependent on the data and hyperparameters, but still, I was wondering: does there exist some literature (I cannot find any) on the out-of-distribution robustness of Random Forest vs. boosting algorithms which might explain this behavior? Because intuitively it might make sense that the variance reduction obtained by bagging would help even out of distribution, as some learners might still have learnt something relevant, but I am not sure that is enough.

PS As a sanity check I also tried a logistic regression and a Gaussian NB, which show the same consistent decrease in performance (0.7 down to 0.45-0.6).

I don't know of publications, but as I read your description I had no surprises: that's exactly what I would have expected from random forests and GBDTs on time series with a distribution shift (very common, at least on the datasets I work with). You already alluded to it: one could try some regularization hyperparameter tuning. Bagging is not the only way to control variance, and the best way usually depends on the data. You can tune the number of trees, depth, minimum number of elements in a leaf, etc. of the models (both RF and GBDT) to make them behave better on your shifted distribution (by using a validation set as far in the future as you are allowed). Btw, worth checking out the recent TensorFlow Decision Forests: it has various variants of RF and GBDT all in one package (and works with the Keras hyperparameter tuner), including some recent interesting work, like sparse oblique projections.
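The "validate as far in the future as you are allowed" advice can be sketched as follows: instead of tuning on a random split of the training year, pick the hyperparameter that scores best on a later, shifted year. Everything here (the synthetic data, the shift amounts, the candidate `min_samples_leaf` grid) is illustrative, not the poster's actual setup.

```python
# Sketch: tune a regularization knob (min_samples_leaf, one of those mentioned
# in the reply) against a *later* year used as validation, on synthetic data
# with a deliberate distribution shift. All numbers are made up for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_year(shift, n=1500):
    # Shift the feature means to mimic the year-to-year drift described above.
    X = rng.normal(loc=shift, size=(n, 5))
    y = (X[:, 0] - X[:, 1] + rng.normal(size=n) > 0).astype(int)
    return X, y

X_train, y_train = make_year(0.0)   # e.g. the 2018 data
X_valid, y_valid = make_year(0.5)   # e.g. 2019: shifted, "in the future"

best_leaf, best_auc = None, -1.0
for leaf in [1, 5, 20, 50]:
    rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=leaf,
                                random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_valid, rf.predict_proba(X_valid)[:, 1])
    if auc > best_auc:
        best_leaf, best_auc = leaf, auc

print("picked min_samples_leaf =", best_leaf, "validation AUC =", round(best_auc, 3))
```

The same loop works for a GBDT: swap in the model and the regularization knobs you care about (depth, learning rate, subsampling). The point is only that the model selection criterion is computed on shifted, future data.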
I'm working on a classification task where I have data from a certain company for the years between 2017 and 2020. Training different models (Random Forest, XGBoost, LightGBM, CatBoost, Explainable Boosting Machines) separately on one year at a time from 2017 to 2019 and looking at the results on 2020, I see a curious behavior, and I would like to understand whether it is normal in the literature or dependent on the particular data. In particular, when training with data from 2019, all the boosting algorithms obtain better performance than random forest (0.78-0.79 AUC vs 0.76). This changes dramatically when I train a model on 2017 or 2018 data and test on 2020. That data is slightly out of distribution: there is certainly label shift, and the data is quite different. But here Random Forest still learns to generalize decently (for 2020 data, an AUC of 0.704 if trained on 2017 and 0.706 if trained on 2018), while the boosting algorithms have worse performance on average, with a big difference for LightGBM between the two training years (trained on 2017: XGBoost 0.567, LightGBM 0.565, CatBoost 0.639, EBM 0.5; trained on 2018: XGBoost 0.661, LightGBM 0.734 (?), CatBoost 0.639, EBM 0.685). (And the learned models' feature importances/PDPs are quite different between the years.)

If you want to get started with random forests, you can do so with scikit-learn's RandomForestClassifier; if you want to get started with boosted trees, check out XGBoost. In practice, boosting seems to work better most of the time, as long as you tune and evaluate properly to avoid overfitting.
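The per-year protocol described above can be sketched like this: fit each model on one year's data and compare AUCs on the held-out final year. The data here is synthetic, with an artificial mean shift standing in for the real drift between years; the exact AUC values will of course differ from the poster's.

```python
# Sketch of the train-on-one-year, test-on-a-later-year protocol, using
# synthetic drifting data (illustrative only, not the poster's dataset).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_year(shift, n=2000):
    # Binary classification data; `shift` moves the feature means and the
    # base rate to mimic covariate and label shift between years.
    X = rng.normal(loc=shift, scale=1.0, size=(n, 5))
    logits = X[:, 0] - 0.5 * X[:, 1] + shift
    y = (logits + rng.normal(size=n) > 0).astype(int)
    return X, y

years = {2017: make_year(0.0), 2018: make_year(0.3), 2019: make_year(0.6)}
X_test, y_test = make_year(1.0)  # "2020": furthest from the training years

results = {}
for year, (X_tr, y_tr) in years.items():
    for name, model in [
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gbdt", GradientBoostingClassifier(random_state=0)),
    ]:
        model.fit(X_tr, y_tr)
        results[(year, name)] = roc_auc_score(
            y_test, model.predict_proba(X_test)[:, 1])

for key, auc in sorted(results.items()):
    print(key, round(auc, 3))
```

XGBoost, LightGBM, CatBoost, or EBM models can be dropped into the same loop, since they all expose a scikit-learn-style `fit`/`predict_proba` interface.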
- With bagging: more trees do not lead to more overfitting.
- With boosting: more trees eventually lead to overfitting.

Having provided these rules of thumb, you can also try both in parallel to find out which performs better for you!
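The two rules of thumb can be checked empirically by tracking held-out AUC as trees are added: boosting via `staged_predict_proba`, and the bagging side by growing a random forest incrementally with `warm_start`. This is a sketch on synthetic data; whether the boosting curve actually turns down depends on the data and the learning rate.

```python
# Track held-out AUC as a function of the number of trees, for boosting
# (staged predictions) vs. bagging (a forest grown incrementally).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(size=(1500, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1500) > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:1000], X[1000:], y[:1000], y[1000:]

gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.2,
                                random_state=0).fit(X_tr, y_tr)
boost_auc = [roc_auc_score(y_te, p[:, 1])
             for p in gb.staged_predict_proba(X_te)]

rf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
bag_auc = []
for n in range(10, 310, 10):
    rf.set_params(n_estimators=n)  # warm_start: only the new trees are fit
    rf.fit(X_tr, y_tr)
    bag_auc.append(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))

# Typically the boosting curve peaks early and can drift down with more
# stages, while the bagging curve flattens out without degrading.
print("boosting AUC: 10 trees %.3f, 300 trees %.3f" % (boost_auc[9], boost_auc[-1]))
print("bagging  AUC: 10 trees %.3f, 300 trees %.3f" % (bag_auc[0], bag_auc[-1]))
```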