In a recent, well-received question, Tim asks: when is unbalanced data really a problem in machine learning? The premise of the question is that there is a lot of machine learning literature discussing class balance and the problem of imbalanced classes. The idea is that datasets with an imbalance between the positive and negative class cause problems for some machine learning classification algorithms (I'm including probabilistic models here), and that methods should be sought to "balance" the dataset, restoring a perfect 50/50 split between the positive and negative classes.

The general sense of the upvoted answers is that "it's not, at least if you are thoughtful in your modeling". Henry L., in an up-voted comment to an accepted answer, states:

"There isn't a low level problem with using unbalanced data. In my experience, the advice to 'avoid unbalanced data' is either algorithm-specific, or inherited wisdom. I agree with AdamO that in general, unbalanced data poses no conceptual problem to a well-specified model."

AdamO argues that the "problem" with class balance is really one of class rarity:

"Therefore, at least in regression (but I suspect in all circumstances), the only problem with imbalanced data is that you effectively have small sample size. If any method is suitable for the number of people in the rarer class, there should be no issue if their proportion membership is imbalanced."

If this is the true issue at hand, it leaves an open question: what is the purpose of all the resampling methods intended to balance the dataset (oversampling, undersampling, SMOTE, etc.)? Clearly they don't address the problem of implicitly having a small sample size; you can't create information out of nothing!

Consider the case where you have a dataset with 99 data points labeled as positive and only one labeled as negative. During training, your model can discover that if it classifies everything as positive, it gets away with it: it is still correct on 99 of the 100 points.
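The 99-to-1 case can be made concrete with a short sketch. It is only a toy illustration under stated assumptions: the "classifier" is a hard-coded rule that always predicts positive, the labels are synthetic, and every name (`labels`, `recall`, `balanced_acc`, and so on) is made up for this example. It shows that plain accuracy rewards the always-positive rule, that a class-averaged score does not, and that randomly oversampling the rare class balances the counts without adding any new negative observations.

```python
import random

# Synthetic labels for the 99-to-1 case: 1 = positive, 0 = negative.
labels = [1] * 99 + [0]

# Hypothetical degenerate classifier: predict positive for every point.
predictions = [1] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, cls):
    """Fraction of the points whose true class is `cls` that were predicted as `cls`."""
    hits = [p == cls for t, p in zip(y_true, y_pred) if t == cls]
    return sum(hits) / len(hits)


acc = accuracy(labels, predictions)
pos_recall = recall(labels, predictions, 1)
neg_recall = recall(labels, predictions, 0)
# Balanced accuracy: the mean of the per-class recalls.
balanced_acc = (pos_recall + neg_recall) / 2

print(f"accuracy          = {acc:.2f}")
print(f"recall (positive) = {pos_recall:.2f}")
print(f"recall (negative) = {neg_recall:.2f}")
print(f"balanced accuracy = {balanced_acc:.2f}")

# Random oversampling: draw the rare class with replacement until both
# classes have the same count. The seed is arbitrary.
rng = random.Random(0)
minority_idx = [i for i, y in enumerate(labels) if y == 0]
majority_idx = [i for i, y in enumerate(labels) if y == 1]
oversampled_minority = rng.choices(minority_idx, k=len(majority_idx))
balanced_labels = [labels[i] for i in majority_idx + oversampled_minority]

# The class counts are now equal, but every negative row is a copy of
# the same single observation.
distinct_minority = len(set(oversampled_minority))
print(f"negatives after oversampling = {balanced_labels.count(0)}")
print(f"distinct negative rows       = {distinct_minority}")
```

The choice of balanced accuracy here is just one convenient class-averaged score for the illustration; the point about oversampling holds regardless of the metric, since duplicating the one negative point does not change how many distinct negatives the model has seen.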