Sweep factor levels to group similar levels together

This function attempts to automatically standardise input labels that should have been the same. It assumes that high-frequency levels are correct and that low-frequency levels may contain typos or alternative representations of other existing levels.

Usage

sai_fct_sweep(
  .f,
  known = NULL,
  wrong = NULL,
  nlevels_max = length(unique(.f)) - length(wrong),
  nlevels_min = length(unique(known)) + 1,
  nlevels_top = round(nlevels_max * 0.25),
  nlevels_bottom = 0,
  n_min = 1L,
  ...
)

sai_lvl_sweep(
  .f,
  known = NULL,
  wrong = NULL,
  nlevels_max = length(unique(.f)) - length(wrong),
  nlevels_min = length(unique(known)) + 1,
  nlevels_top = round(nlevels_max * 0.25),
  nlevels_bottom = 0,
  n_min = 1L,
  ...
)

Arguments

.f: A factor
known: A character vector of the levels that are known to be correct. If none are provided, it is assumed that no correct values are known. If an element is named, the name is taken to be the value as recorded and the element's value is the label it should have.
wrong: A character vector of the levels that are known to be wrong and should be grouped with another level.
nlevels_max: The maximum number of levels. Defaults to the number of unique values in .f minus the number of levels in wrong.
nlevels_min: The minimum number of levels. Defaults to the number of unique values in known plus one.
nlevels_top: The number of levels assumed to be correct based on the top frequencies, excluding levels with fewer than n_min observations. Defaults to 25% of nlevels_max, rounded.
nlevels_bottom: The number of levels assumed to be incorrect based on the bottom frequencies, excluding levels with fewer than n_min observations. Defaults to 0.
n_min: The minimum number of observations for each level. The default is 1.
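
To make the defaults concrete, the sketch below evaluates the default expressions from the Usage section by hand on a small made-up vector. It does not call the package. The data and the variable names f, known, wrong and known_named are assumptions for illustration only, and the final recoding step is one plausible reading of the named-element convention for known, not a statement of how the function applies it.

```r
# Hypothetical messy input; the values are invented for this example.
f <- factor(c("apple", "apple", "apple", "aple",
              "banana", "banana", "bananna", "kiwi"))
known <- c("apple", "banana")
wrong <- c("aple")

# Default expressions copied from the Usage section.
nlevels_max <- length(unique(f)) - length(wrong)  # 5 unique values - 1 wrong
nlevels_min <- length(unique(known)) + 1          # 2 known levels + 1
nlevels_top <- round(nlevels_max * 0.25)          # 25% of nlevels_max

# Named form of `known`: name = value as recorded, value = intended label.
# The recoding below is an assumed interpretation, done in base R.
known_named <- c(aple = "apple", bananna = "banana")
recorded <- as.character(f)
hit <- recorded %in% names(known_named)
recorded[hit] <- unname(known_named[recorded[hit]])
```

With this input, nlevels_max is 4, nlevels_min is 3 and nlevels_top is 1, and recorded is left with only the labels apple, banana and kiwi.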

Details

Be warned that this function is experimental and may not work as intended.
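
The frequency assumption described at the top of this page can be illustrated with a small base R sketch. This is not the implementation of sai_fct_sweep(). It swaps in a simple edit-distance rule via utils::adist(), and the function name sweep_sketch and its parameters n_top and max_dist are hypothetical.

```r
# Illustrative only: NOT how sai_fct_sweep() works internally.
# Trust the n_top most frequent levels, then fold each remaining level
# into its closest trusted level if the edit distance is small enough.
sweep_sketch <- function(x, n_top = 2L, max_dist = 2L) {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  trusted <- names(counts)[seq_len(min(n_top, length(counts)))]
  others <- setdiff(names(counts), trusted)

  # Generalised Levenshtein distance: rows are rare levels,
  # columns are trusted levels.
  d <- utils::adist(others, trusted)

  # Start with every level mapped to itself.
  map <- stats::setNames(names(counts), names(counts))
  for (i in seq_along(others)) {
    j <- which.min(d[i, ])
    if (d[i, j] <= max_dist) map[[others[i]]] <- trusted[j]
  }
  factor(unname(map[x]))
}

messy <- c("apple", "apple", "apple", "aple",
           "banana", "banana", "bananna", "kiwi")
swept <- sweep_sketch(messy)
levels(swept)
```

Here "aple" and "bananna" are each one edit away from a trusted level and are merged into it, while "kiwi" is too far from either and stays as its own level.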
