
This function attempts to automatically standardise input labels that should have been identical, based on a few assumptions. In particular, it assumes that high-frequency levels are correct and that low-frequency levels may contain typos or alternative representations of other existing levels.
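The frequency assumption can be illustrated in base R without calling the function itself. The sketch below uses a made-up vector of country labels (hypothetical data, not taken from the package) and an arbitrary cut-off of 5 observations. It only tabulates the levels and does not reproduce how the function decides which levels to merge.

```r
# Hypothetical messy labels: two frequent spellings and a few rare variants.
x <- factor(c(
  rep("Australia", 10), rep("New Zealand", 8),
  "Austalia", "australia", "NZ"
))

# Count observations per level, most frequent first.
freq <- sort(table(x), decreasing = TRUE)

# Under the stated assumption, frequent levels are treated as correct and
# rare levels as candidates for typos or alternative representations.
# The cut-off of 5 is an arbitrary choice for this illustration.
likely_correct <- names(freq)[as.vector(freq) >= 5]
candidates     <- names(freq)[as.vector(freq) < 5]

likely_correct
candidates
```

Here "Australia" and "New Zealand" end up in likely_correct, while the three single-observation spellings are the candidates that a sweep would try to fold into an existing level.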

Usage

sai_fct_sweep(
  .f,
  known = NULL,
  wrong = NULL,
  nlevels_max = length(unique(.f)) - length(wrong),
  nlevels_min = length(unique(known)) + 1,
  nlevels_top = round(nlevels_max * 0.25),
  nlevels_bottom = 0,
  n_min = 1L,
  ...
)

sai_lvl_sweep(
  .f,
  known = NULL,
  wrong = NULL,
  nlevels_max = length(unique(.f)) - length(wrong),
  nlevels_min = length(unique(known)) + 1,
  nlevels_top = round(nlevels_max * 0.25),
  nlevels_bottom = 0,
  n_min = 1L,
  ...
)
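To make the default values concrete, the following sketch evaluates the default expressions from the usage above on a small made-up input. The fruit labels and the choices of known and wrong are hypothetical. The expressions are run outside the function, so this shows only the arithmetic, not the function's behaviour.

```r
# Hypothetical inputs, used only to show how the default expressions evaluate.
.f <- factor(c("apple", "apple", "appel", "banana", "banana", "bananna",
               "cherry", "chery", "kiwi", "n/a"))
known <- c("apple", "banana")
wrong <- c("n/a")

# The default expressions, copied from the usage section.
nlevels_max <- length(unique(.f)) - length(wrong)  # 8 distinct labels - 1 wrong
nlevels_min <- length(unique(known)) + 1           # 2 known labels + 1
nlevels_top <- round(nlevels_max * 0.25)           # round(1.75)

c(nlevels_max = nlevels_max, nlevels_min = nlevels_min, nlevels_top = nlevels_top)
# nlevels_max = 7, nlevels_min = 3, nlevels_top = 2
```

Because R evaluates default arguments lazily, the default for nlevels_top is derived from whatever nlevels_max ends up being, so supplying a different nlevels_max would normally change the default nlevels_top as well.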

Arguments

.f

A factor.

known

A character vector of the levels that are known to be correct. If none are provided, it is assumed that no correct values are known. If an element has a name associated with it, it is assumed that the name is what is recorded and the value is what the actual label should be.

wrong

A character vector of the levels that are known to be wrong and should be grouped with another level.

nlevels_max

The maximum number of levels.

nlevels_min

The minimum number of levels.

nlevels_top

The number of levels assumed to be correct based on the top frequencies, excluding levels that have fewer than n_min observations.

nlevels_bottom

The number of levels assumed to be incorrect based on the bottom frequencies, excluding levels that have fewer than n_min observations.

n_min

The minimum number of observations for each level. The default is 1.
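The naming convention for known (name = label as recorded, value = label it should become) can be sketched in base R as a lookup table. The vector and labels below are hypothetical, and treating unnamed elements as already correct is an assumption of this sketch. It illustrates only the convention, not how the function applies it internally.

```r
# Hypothetical named vector: name = label as recorded, value = intended label.
# "France" is unnamed, so this sketch assumes it is recorded as it should be.
known <- c("UK" = "United Kingdom", "USA" = "United States", "France")

# Build a lookup from recorded label to intended label.
recorded <- names(known)
unnamed  <- recorded == ""
recorded[unnamed] <- unname(known)[unnamed]
lookup <- setNames(unname(known), recorded)

# Apply the lookup to some recorded labels.
x <- c("UK", "France", "USA", "UK")
relabelled <- unname(lookup[x])
relabelled
# "United Kingdom" "France" "United States" "United Kingdom"
```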

Details

Be warned that this function is experimental and may not work as intended.

See also