Package 'scutr' reference manual

Title:	Balancing Multiclass Datasets for Classification Tasks
Description:	Imbalanced training datasets impede many popular classifiers. To balance training data, a combination of oversampling minority classes and undersampling majority classes is useful. This package implements the SCUT (SMOTE and Cluster-based Undersampling Technique) algorithm as described in Agrawal et. al. (2015) <doi:10.5220/0005595502260234>. Their paper uses model-based clustering and synthetic oversampling to balance multiclass training datasets, although other resampling methods are provided in this package.
Authors:	Keenan Ganz [aut, cre]
Maintainer:	Keenan Ganz <[email protected]>
License:	MIT + file LICENSE
Version:	0.2.0
Built:	2025-04-03 04:34:42 UTC
Source:	https://github.com/s-kganz/scutr

An imbalanced dataset with a minor class centered around the origin with a majority class surrounding the center.

Description

An imbalanced dataset with a minor class centered around the origin with a majority class surrounding the center.

Usage

bullseye
bullseye

Format

a data.frame with 1000 rows and 3 columns.

Source

https://gist.github.com/s-kganz/c2534666e369f8e19491bb29d53c619d

An imbalanced dataset with randomly placed normal distributions around the origin. The nth class has n * 10 observations.

Description

An imbalanced dataset with randomly placed normal distributions around the origin. The nth class has n * 10 observations.

Usage

imbalance
imbalance

Format

a data.frame with 2100 rows and 11 columns

Source

https://gist.github.com/s-kganz/d08473f9492d48ea0e56c3c8a3fe1a74

Oversample a dataset by SMOTE.

Description

Oversample a dataset by SMOTE.

Usage

oversample_smote(data, cls, cls_col, m, k = NA)
oversample_smote(data, cls, cls_col, m, k = NA)

Arguments

`data`	Dataset to be oversampled.
`cls`	Class to be oversampled.
`cls_col`	Column containing class information.
`m`	Desired number of samples in the oversampled data.
`k`	Number of neighbors used in `SMOTE()` to generate synthetic minority instances. This value must be smaller than the number of minority instances already present for a given class. If `NA`, `min(5, n-1)` is chosen, where n is the number of instances of the minority class.

Value

The oversampled dataset.

Examples

table(iris$Species)
smoted <- oversample_smote(iris, "setosa", "Species", 100)
nrow(smoted)
table(iris$Species)
smoted <- oversample_smote(iris, "setosa", "Species", 100)
nrow(smoted)

Randomly resample a dataset.

Description

This function is used to resample a dataset by randomly removing or duplicating rows. It is usable for both oversampling and undersampling.

Usage

resample_random(data, cls, cls_col, m)
resample_random(data, cls, cls_col, m)

Arguments

`data`	Dataframe to be resampled.
`cls`	Class that should be randomly resampled.
`cls_col`	Column containing class information.
`m`	Desired number of samples.

Value

Resampled dataframe containing only cls.

Examples

set.seed(1234)
only2 <- resample_random(wine, 2, "type", 15)
set.seed(1234)
only2 <- resample_random(wine, 2, "type", 15)

Stratified index sample of different values in a vector.

Description

Stratified index sample of different values in a vector.

Usage

sample_classes(vec, tot_sample)
sample_classes(vec, tot_sample)

Arguments

`vec`	Vector of values to sample from.
`tot_sample`	Total number of samples.

Value

A vector of indices that can be used to select a balanced population of values from vec.

Examples

vec <- sample(1:5, 30, replace = TRUE)
table(vec)
sample_ind <- sample_classes(vec, 15)
table(vec[sample_ind])
vec <- sample(1:5, 30, replace = TRUE)
table(vec)
sample_ind <- sample_classes(vec, 15)
table(vec[sample_ind])

SMOTE and cluster-based undersampling technique.

Description

This function balances multiclass training datasets. In a dataframe with n classes and m rows, the resulting dataframe will have m / n rows per class. SCUT_parallel() distributes each over/undersampling task across multiple cores. Speedup usually occurs only if there are many classes using one of the slower resampling techniques (e.g. undersample_mclust()). Note that SCUT_parallel() will always run on one core on Windows.

Usage

SCUT(
  data,
  cls_col,
  oversample = oversample_smote,
  undersample = undersample_mclust,
  osamp_opts = list(),
  usamp_opts = list()
)

SCUT_parallel(
  data,
  cls_col,
  ncores = detectCores()%/%2,
  oversample = oversample_smote,
  undersample = undersample_mclust,
  osamp_opts = list(),
  usamp_opts = list()
)
SCUT(
  data,
  cls_col,
  oversample = oversample_smote,
  undersample = undersample_mclust,
  osamp_opts = list(),
  usamp_opts = list()
)

SCUT_parallel(
  data,
  cls_col,
  ncores = detectCores()%/%2,
  oversample = oversample_smote,
  undersample = undersample_mclust,
  osamp_opts = list(),
  usamp_opts = list()
)

Arguments

`data`	Numeric data frame.
`cls_col`	The column in `data` with class membership.
`oversample`	Oversampling method. Must be a function with the signature `foo(data, cls, cls_col, m, ...)` that returns a data frame, one of the `⁠oversample_*⁠` functions, or `resample_random()`.
`undersample`	Undersampling method. Must be a function with the signature `foo(data, cls, cls_col, m, ...)` that returns a data frame, one of the `⁠undersample_*⁠` functions, or `resample_random()`.
`osamp_opts`	List of options passed to the oversampling function.
`usamp_opts`	List of options passed to the undersampling function.
`ncores`	Number of cores to use with `SCUT_parallel()`.

Details

Custom functions can be used to perform under/oversampling (see the required signature below). Parameters represented by ... should be passsed via osamp_opts or usamp_opts as a list.

Value

A dataframe with equal class distribution.

References

Agrawal A, Viktor HL, Paquet E (2015). 'SCUT: Multi-class imbalanced data classification using SMOTE and cluster-based undersampling.' In 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), volume 01, 226-234.

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002). 'SMOTE: Synthetic Minority Over-sampling Technique.' Journal of Artificial Intelligence Research, 16, 321-357. ISSN 1076-9757, doi:10.1613/jair.953, https://www.jair.org/index.php/jair/article/view/10302.

Examples

ret <- SCUT(iris, "Species", undersample = undersample_hclust,
            usamp_opts = list(dist_calc="manhattan"))
ret2 <- SCUT(chickwts, "feed", undersample = undersample_kmeans)
table(ret$Species)
table(ret2$feed)
# SCUT_parallel fires a warning if ncores > 1 on Windows and will run on
# one core only.
ret <- SCUT_parallel(wine, "type", ncores = 1, undersample = undersample_kmeans)
table(ret$type)
ret <- SCUT(iris, "Species", undersample = undersample_hclust,
            usamp_opts = list(dist_calc="manhattan"))
ret2 <- SCUT(chickwts, "feed", undersample = undersample_kmeans)
table(ret$Species)
table(ret2$feed)
# SCUT_parallel fires a warning if ncores > 1 on Windows and will run on
# one core only.
ret <- SCUT_parallel(wine, "type", ncores = 1, undersample = undersample_kmeans)
table(ret$type)

Undersample a dataset by hierarchical clustering.

Description

Undersample a dataset by hierarchical clustering.

Usage

undersample_hclust(data, cls, cls_col, m, k = 5, h = NA, ...)
undersample_hclust(data, cls, cls_col, m, k = 5, h = NA, ...)

Arguments

`data`	Dataset to be undersampled.
`cls`	Majority class that will be undersampled.
`cls_col`	Column in data containing class memberships.
`m`	Number of samples in undersampled dataset.
`k`	Number of clusters to derive from clustering.
`h`	Height at which to cut the clustering tree. `k` must be `NA` for this to be used.
`...`	Additional arguments passed to `dist()`.

Value

Undersampled dataframe containing only cls.

Examples

table(iris$Species)
undersamp <- undersample_hclust(iris, "setosa", "Species", 15)
nrow(undersamp)
table(iris$Species)
undersamp <- undersample_hclust(iris, "setosa", "Species", 15)
nrow(undersamp)

Undersample a dataset by kmeans clustering.

Description

Undersample a dataset by kmeans clustering.

Usage

undersample_kmeans(data, cls, cls_col, m, k = 5, ...)
undersample_kmeans(data, cls, cls_col, m, k = 5, ...)

Arguments

`data`	Dataset to be undersampled.
`cls`	Class to be undersampled.
`cls_col`	Column containing class information.
`m`	Number of samples in undersampled dataset.
`k`	Number of centers in clustering.
`...`	Additional arguments passed to `kmeans()`

Value

The undersampled dataframe containing only instances of cls.

Examples

table(iris$Species)
undersamp <- undersample_kmeans(iris, "setosa", "Species", 15)
nrow(undersamp)
table(iris$Species)
undersamp <- undersample_kmeans(iris, "setosa", "Species", 15)
nrow(undersamp)

Undersample a dataset by expectation-maximization clustering

Description

Undersample a dataset by expectation-maximization clustering

Usage

undersample_mclust(data, cls, cls_col, m, ...)
undersample_mclust(data, cls, cls_col, m, ...)

Arguments

`data`	Data to be undersampled.
`cls`	Class to be undersampled.
`cls_col`	Class column.
`m`	Number of samples in undersampled dataset.
`...`	Additional arguments passed to `Mclust()`

Value

The undersampled dataframe containing only instance of cls.

Examples

setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)
undersamp <- undersample_mclust(setosa, "setosa", "Species", 15)
nrow(undersamp)
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)
undersamp <- undersample_mclust(setosa, "setosa", "Species", 15)
nrow(undersamp)

Undersample a dataset by iteratively removing the observation with the lowest total distance to its neighbors of the same class.

Description

Undersample a dataset by iteratively removing the observation with the lowest total distance to its neighbors of the same class.

Usage

undersample_mindist(data, cls, cls_col, m, ...)
undersample_mindist(data, cls, cls_col, m, ...)

Arguments

`data`	Dataset to undersample. Aside from `cls_col`, must be numeric.
`cls`	Class to be undersampled.
`cls_col`	Column containing class information.
`m`	Desired number of observations after undersampling.
`...`	Additional arguments passed to `dist()`.

Value

An undersampled dataframe.

Examples

setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)
undersamp <- undersample_mindist(setosa, "setosa", "Species", 50)
nrow(undersamp)
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)
undersamp <- undersample_mindist(setosa, "setosa", "Species", 50)
nrow(undersamp)

Undersample a dataset by removing Tomek links.

Description

A Tomek link is a minority instance and majority instance that are each other's nearest neighbor. This function removes sufficient Tomek links that are an instance of cls to yield m instances of cls. If desired, samples are randomly discarded to yield m rows if insufficient Tomek links are in the data.

Usage

undersample_tomek(data, cls, cls_col, m, tomek = "minor", force_m = TRUE, ...)
undersample_tomek(data, cls, cls_col, m, tomek = "minor", force_m = TRUE, ...)

Arguments

`data`	Dataset to be undersampled.
`cls`	Majority class to be undersampled.
`cls_col`	Column in data containing class memberships.
`m`	Desired number of samples in undersampled dataset.
`tomek`	Definition used to determine if a point is considered a minority in the Tomek link definition. `minor`: Minor classes are all those with fewer than `m` instances. `diff`: Minor classes are all those that aren't `cls`.
`force_m`	If `TRUE`, uses random undersampling to discard samples if insufficient Tomek links are present to yield `m` rows of data.
`...`	Additional arguments passed to `dist()`.

Value

Undersampled dataframe containing only cls.

Examples

table(iris$Species)
undersamp <- undersample_tomek(iris, "setosa", "Species", 15, tomek = "diff", force_m = TRUE)
nrow(undersamp)
undersamp2 <- undersample_tomek(iris, "setosa", "Species", 15, tomek = "diff", force_m = FALSE)
nrow(undersamp2)
table(iris$Species)
undersamp <- undersample_tomek(iris, "setosa", "Species", 15, tomek = "diff", force_m = TRUE)
nrow(undersamp)
undersamp2 <- undersample_tomek(iris, "setosa", "Species", 15, tomek = "diff", force_m = FALSE)
nrow(undersamp2)

Validate a dataset for resampling.

Description

This functions checks that the given column is present in the data and that all columns besides the class column are numeric.

Usage

validate_dataset(data, cls_col)
validate_dataset(data, cls_col)

Arguments

`data`	Dataframe to validate.
`cls_col`	Column with class information.

Value

Type and chemical analysis of three different kinds of wine.

Description

Type and chemical analysis of three different kinds of wine.

Usage

wine
wine

Format

a data.frame with 178 rows and 14 columns

Source

https://archive.ics.uci.edu/ml/datasets/Wine

Package 'scutr'

Help Index

An imbalanced dataset with a minor class centered around the origin with a majority class surrounding the center.

Description

Usage

Format

Source

An imbalanced dataset with randomly placed normal distributions around the origin. The nth class has n * 10 observations.

Description

Usage

Format

Source

Oversample a dataset by SMOTE.

Description

Usage

Arguments

Value

Examples

Randomly resample a dataset.

Description

Usage

Arguments

Value

Examples

Stratified index sample of different values in a vector.

Description

Usage

Arguments

Value

Examples

SMOTE and cluster-based undersampling technique.

Description

Usage

Arguments

Details

Value

References

Examples

Undersample a dataset by hierarchical clustering.

Description

Usage

Arguments

Value

Examples

Undersample a dataset by kmeans clustering.

Description

Usage

Arguments

Value

Examples

Undersample a dataset by expectation-maximization clustering

Description

Usage

Arguments

Value

Examples

Undersample a dataset by iteratively removing the observation with the lowest total distance to its neighbors of the same class.

Description

Usage

Arguments

Value

Examples

Undersample a dataset by removing Tomek links.

Description

Usage

Arguments

Value

Examples

Validate a dataset for resampling.

Description

Usage

Arguments

Value

Type and chemical analysis of three different kinds of wine.

Description

Usage

Format

Source