Title: | Balancing Multiclass Datasets for Classification Tasks |
---|---|
Description: | Imbalanced training datasets impede many popular classifiers. To balance training data, a combination of oversampling minority classes and undersampling majority classes is useful. This package implements the SCUT (SMOTE and Cluster-based Undersampling Technique) algorithm as described in Agrawal et al. (2015) <doi:10.5220/0005595502260234>. Their paper uses model-based clustering and synthetic oversampling to balance multiclass training datasets, although other resampling methods are provided in this package. |
Authors: | Keenan Ganz [aut, cre] |
Maintainer: | Keenan Ganz <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.0 |
Built: | 2025-02-02 04:25:25 UTC |
Source: | https://github.com/s-kganz/scutr |
An imbalanced dataset with a minority class centered at the origin and a majority class surrounding it.
bullseye
a data.frame with 1000 rows and 3 columns.
https://gist.github.com/s-kganz/c2534666e369f8e19491bb29d53c619d
An imbalanced dataset with randomly placed normal distributions around the origin. The nth class has n * 10 observations.
imbalance
a data.frame with 2100 rows and 11 columns
https://gist.github.com/s-kganz/d08473f9492d48ea0e56c3c8a3fe1a74
Oversample a dataset by SMOTE.
oversample_smote(data, cls, cls_col, m, k = NA)
data | Dataset to be oversampled. |
cls | Class to be oversampled. |
cls_col | Column containing class information. |
m | Desired number of samples in the oversampled data. |
k | Number of neighbors used in |
The oversampled dataset.
table(iris$Species)
smoted <- oversample_smote(iris, "setosa", "Species", 100)
nrow(smoted)
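SMOTE (Chawla et al. 2002) builds synthetic observations by interpolating between a minority observation and one of its nearest same-class neighbors. The following base-R sketch illustrates that idea only; `smote_sketch` and its arguments are hypothetical names and this is not the package's implementation.

```r
# Hypothetical helper: a base-R sketch of SMOTE-style interpolation.
# x: numeric data from ONE class; n_new: number of synthetic rows to create.
smote_sketch <- function(x, n_new, k = 5) {
  x <- as.matrix(x)
  k <- min(k, nrow(x) - 1)
  d <- as.matrix(dist(x))          # pairwise Euclidean distances
  diag(d) <- Inf                   # a point is not its own neighbor
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(x),
                  dimnames = list(NULL, colnames(x)))
  for (i in seq_len(n_new)) {
    base <- sample.int(nrow(x), 1)            # random starting observation
    nn <- order(d[base, ])[seq_len(k)]        # its k nearest neighbors
    nb <- nn[sample.int(length(nn), 1)]       # pick one of them
    gap <- runif(1)                           # position along the segment
    synth[i, ] <- x[base, ] + gap * (x[nb, ] - x[base, ])
  }
  synth
}

set.seed(1)
setosa <- iris[iris$Species == "setosa", 1:4]
new_rows <- smote_sketch(setosa, n_new = 50)
dim(new_rows)
```

Because each synthetic row lies on a segment between two real rows, every value stays within the observed range of its column.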
This function is used to resample a dataset by randomly removing or duplicating rows. It is usable for both oversampling and undersampling.
resample_random(data, cls, cls_col, m)
data | Dataframe to be resampled. |
cls | Class that should be randomly resampled. |
cls_col | Column containing class information. |
m | Desired number of samples. |
Resampled dataframe containing only cls.
set.seed(1234)
only2 <- resample_random(wine, 2, "type", 15)
Stratified index sample of different values in a vector.
sample_classes(vec, tot_sample)
vec | Vector of values to sample from. |
tot_sample | Total number of samples. |
A vector of indices that can be used to select a balanced population of values from vec.
vec <- sample(1:5, 30, replace = TRUE)
table(vec)
sample_ind <- sample_classes(vec, 15)
table(vec[sample_ind])
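As an illustration of what a stratified index sample looks like, the base-R sketch below draws the same number of indices for each distinct value. The name `balanced_index_sketch` is hypothetical, and the sketch assumes that tot_sample divides evenly among the classes and that every class has enough members; the package may handle those edge cases differently.

```r
# Hypothetical sketch: draw an equal number of indices for each distinct value.
# Assumes an even split and enough members per class (see the text above).
balanced_index_sketch <- function(vec, tot_sample) {
  groups <- split(seq_along(vec), vec)        # indices grouped by value
  per_class <- tot_sample %/% length(groups)
  picked <- lapply(groups, function(idx) {
    idx[sample.int(length(idx), per_class)]   # sample without replacement
  })
  unlist(picked, use.names = FALSE)
}

set.seed(42)
vec <- rep(1:5, times = c(10, 8, 6, 4, 4))
ind <- balanced_index_sketch(vec, 15)
table(vec[ind])
```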
This function balances multiclass training datasets. In a dataframe with n classes and m rows, the resulting dataframe will have m / n rows per class. SCUT_parallel() distributes each over/undersampling task across multiple cores. Speedup usually occurs only if there are many classes using one of the slower resampling techniques (e.g. undersample_mclust()). Note that SCUT_parallel() will always run on one core on Windows.
SCUT(
  data,
  cls_col,
  oversample = oversample_smote,
  undersample = undersample_mclust,
  osamp_opts = list(),
  usamp_opts = list()
)

SCUT_parallel(
  data,
  cls_col,
  ncores = detectCores() %/% 2,
  oversample = oversample_smote,
  undersample = undersample_mclust,
  osamp_opts = list(),
  usamp_opts = list()
)
data | Numeric data frame. |
cls_col | The column in |
oversample | Oversampling method. Must be a function with the signature |
undersample | Undersampling method. Must be a function with the signature |
osamp_opts | List of options passed to the oversampling function. |
usamp_opts | List of options passed to the undersampling function. |
ncores | Number of cores to use with |
Custom functions can be used to perform under/oversampling (see the required signature below). Parameters represented by ... should be passed via osamp_opts or usamp_opts as a list.
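A custom resampler might look like the sketch below. It assumes the required signature has the same shape as the package's built-in resamplers, function(data, cls, cls_col, m, ...), and that the function returns only rows of cls; `undersample_head` is a hypothetical name used for illustration.

```r
# Hypothetical custom undersampler: keep the first m rows of class `cls`.
# Assumption: custom functions take (data, cls, cls_col, m, ...) like the
# built-in resamplers and return only rows belonging to `cls`.
undersample_head <- function(data, cls, cls_col, m, ...) {
  rows <- data[data[[cls_col]] == cls, , drop = FALSE]
  rows[seq_len(min(m, nrow(rows))), , drop = FALSE]
}

out <- undersample_head(iris, "setosa", "Species", 15)
nrow(out)
```

Under that assumption, such a function could be supplied as the undersample argument, with any extra parameters given through usamp_opts.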
A dataframe with equal class distribution.
Agrawal A, Viktor HL, Paquet E (2015). 'SCUT: Multi-class imbalanced data classification using SMOTE and cluster-based undersampling.' In 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), volume 01, 226-234.
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002). 'SMOTE: Synthetic Minority Over-sampling Technique.' Journal of Artificial Intelligence Research, 16, 321-357. ISSN 1076-9757, doi:10.1613/jair.953, https://www.jair.org/index.php/jair/article/view/10302.
ret <- SCUT(iris, "Species", undersample = undersample_hclust,
            usamp_opts = list(dist_calc = "manhattan"))
ret2 <- SCUT(chickwts, "feed", undersample = undersample_kmeans)
table(ret$Species)
table(ret2$feed)
# SCUT_parallel fires a warning if ncores > 1 on Windows and will run on
# one core only.
ret <- SCUT_parallel(wine, "type", ncores = 1, undersample = undersample_kmeans)
table(ret$type)
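The m / n target described for SCUT() can be worked through by hand. The base-R sketch below builds a small imbalanced subset of iris and computes the per-class target; it assumes that classes above the target are undersampled and classes below it are oversampled, and it uses a case where m / n is a whole number.

```r
# Sketch of the per-class target: m rows and n classes give m / n rows per class.
imbalanced <- iris[c(1:50, 51:80, 101:110), ]   # 50 setosa, 30 versicolor, 10 virginica
tab <- table(imbalanced$Species)
counts <- setNames(as.vector(tab), names(tab))  # plain named vector of class sizes
target <- nrow(imbalanced) / length(counts)     # 90 rows / 3 classes

# Assumed rule: larger classes shrink to the target, smaller ones grow to it.
plan <- ifelse(counts > target, "undersample",
               ifelse(counts < target, "oversample", "keep"))
data.frame(class = names(counts), n = counts, target = target,
           action = plan, row.names = NULL)
```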
Undersample a dataset by hierarchical clustering.
undersample_hclust(data, cls, cls_col, m, k = 5, h = NA, ...)
data | Dataset to be undersampled. |
cls | Majority class that will be undersampled. |
cls_col | Column in data containing class memberships. |
m | Number of samples in undersampled dataset. |
k | Number of clusters to derive from clustering. |
h | Height at which to cut the clustering tree. |
... | Additional arguments passed to |
Undersampled dataframe containing only cls.
table(iris$Species)
undersamp <- undersample_hclust(iris, "setosa", "Species", 15)
nrow(undersamp)
Undersample a dataset by kmeans clustering.
undersample_kmeans(data, cls, cls_col, m, k = 5, ...)
data | Dataset to be undersampled. |
cls | Class to be undersampled. |
cls_col | Column containing class information. |
m | Number of samples in undersampled dataset. |
k | Number of centers in clustering. |
... | Additional arguments passed to |
The undersampled dataframe containing only instances of cls.
table(iris$Species)
undersamp <- undersample_kmeans(iris, "setosa", "Species", 15)
nrow(undersamp)
Undersample a dataset by expectation-maximization clustering.
undersample_mclust(data, cls, cls_col, m, ...)
data | Data to be undersampled. |
cls | Class to be undersampled. |
cls_col | Class column. |
m | Number of samples in undersampled dataset. |
... | Additional arguments passed to |
The undersampled dataframe containing only instances of cls.
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)
undersamp <- undersample_mclust(setosa, "setosa", "Species", 15)
nrow(undersamp)
Undersample a dataset by iteratively removing the observation with the lowest total distance to its neighbors of the same class.
undersample_mindist(data, cls, cls_col, m, ...)
data | Dataset to undersample. Aside from |
cls | Class to be undersampled. |
cls_col | Column containing class information. |
m | Desired number of observations after undersampling. |
... | Additional arguments passed to |
An undersampled dataframe.
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)
undersamp <- undersample_mindist(setosa, "setosa", "Species", 50)
nrow(undersamp)
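The iterative removal described for undersample_mindist() can be sketched in base R as follows. This is an illustration, not the package's code: `mindist_sketch` is a hypothetical name, and the sketch treats every other remaining row of the class as a neighbor, which is an assumption about what "neighbors" means here.

```r
# Hypothetical sketch: repeatedly drop the row whose summed distance to the
# other remaining rows is smallest, until m rows remain.
# x: numeric matrix or data frame holding ONE class.
mindist_sketch <- function(x, m, method = "euclidean") {
  d <- as.matrix(dist(x, method = method))   # diagonal is 0, so it adds nothing
  keep <- seq_len(nrow(d))
  while (length(keep) > m) {
    totals <- rowSums(d[keep, keep, drop = FALSE])
    keep <- keep[-which.min(totals)]          # remove the most "central" row
  }
  x[keep, , drop = FALSE]
}

setosa_num <- iris[iris$Species == "setosa", 1:4]
reduced <- mindist_sketch(setosa_num, 15)
nrow(reduced)
```

Removing the row with the lowest total distance thins out the densest part of the class first, so the rows that remain tend to be spread more evenly.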
A Tomek link is a pair of instances, one from a minority class and one from a majority class, that are each other's nearest neighbor. This function removes enough instances of cls that belong to Tomek links to leave m instances of cls. If the data contain too few Tomek links, the remaining excess can optionally be discarded at random to yield m rows.
undersample_tomek(data, cls, cls_col, m, tomek = "minor", force_m = TRUE, ...)
data | Dataset to be undersampled. |
cls | Majority class to be undersampled. |
cls_col | Column in data containing class memberships. |
m | Desired number of samples in undersampled dataset. |
tomek | Definition used to determine if a point is considered a minority in the Tomek link definition. |
force_m | If |
... | Additional arguments passed to |
Undersampled dataframe containing only cls.
table(iris$Species)
undersamp <- undersample_tomek(iris, "setosa", "Species", 15, tomek = "diff", force_m = TRUE)
nrow(undersamp)
undersamp2 <- undersample_tomek(iris, "setosa", "Species", 15, tomek = "diff", force_m = FALSE)
nrow(undersamp2)
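To make the Tomek link definition concrete, the base-R sketch below finds pairs of rows that are mutual nearest neighbors and carry different labels. It uses the simpler "different class" criterion rather than a minority/majority test, which is an assumption, and `tomek_links_sketch` is a hypothetical name.

```r
# Hypothetical sketch: Tomek links as mutual nearest neighbors with different labels.
tomek_links_sketch <- function(x, labels) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf                          # ignore self-distances
  nn <- apply(d, 1, which.min)            # nearest neighbor of every row
  i <- seq_along(nn)
  mutual <- nn[nn] == i                   # row i is its neighbor's nearest neighbor
  differs <- labels[i] != labels[nn]      # the pair spans two classes
  pairs <- cbind(i, nn)[mutual & differs & i < nn, , drop = FALSE]
  unname(pairs)                           # one row per link, smaller index first
}

# Tiny 1-D example: rows 2 and 3 sit close together but have different labels.
x <- matrix(c(0, 1, 1.2, 5), ncol = 1)
labels <- c("a", "a", "b", "b")
links <- tomek_links_sketch(x, labels)
links
```

In this example the only link is the pair of rows 2 and 3; undersampling a class would remove its member of such a pair.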
This function checks that the given column is present in the data and that all columns besides the class column are numeric.
validate_dataset(data, cls_col)
data | Dataframe to validate. |
cls_col | Column with class information. |
NA
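The two checks described for validate_dataset() could be written in base R roughly as follows. This is a sketch under assumptions: `validate_sketch` is a hypothetical name, and the error messages and the invisible NULL return are illustrative choices, not the package's behavior.

```r
# Hypothetical sketch of the two checks: the class column exists, and every
# other column is numeric.
validate_sketch <- function(data, cls_col) {
  if (!cls_col %in% names(data)) {
    stop("Column '", cls_col, "' not found in data.")
  }
  features <- data[, names(data) != cls_col, drop = FALSE]
  is_num <- vapply(features, is.numeric, logical(1))
  if (!all(is_num)) {
    stop("Non-numeric columns: ",
         paste(names(features)[!is_num], collapse = ", "))
  }
  invisible(NULL)
}

validate_sketch(iris, "Species")   # passes silently
# Using "weight" as the class column leaves the factor column "feed" as a feature.
msg <- tryCatch(validate_sketch(chickwts, "weight"),
                error = function(e) conditionMessage(e))
msg
```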
Type and chemical analysis of three different kinds of wine.
wine
a data.frame with 178 rows and 14 columns
https://archive.ics.uci.edu/ml/datasets/Wine