This function attempts to automatically standardise input labels that should have been identical, under a few assumptions: levels with high frequency are assumed to be correct, while low-frequency levels may contain typos or alternative representations of other existing levels.
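The frequency assumption can be illustrated with a small base R sketch. The data and the cut-off of 5 observations are hypothetical, chosen only for illustration; this is not the function's internal implementation.

```r
# Hypothetical data: three frequent labels plus a few rare variants.
f <- factor(c(rep("apple", 10), rep("banana", 8), rep("orange", 6),
              "aple", "bannana", "Orange"))

# Count the observations per level, most frequent first.
freq <- sort(table(f), decreasing = TRUE)
counts <- as.vector(freq)

# Assumed cut-off (illustration only): frequent levels are taken as
# correct, rare levels are candidates for typos or alternative spellings.
likely_correct <- names(freq)[counts >= 5]
candidates <- names(freq)[counts < 5]

print(freq)
```

Here `likely_correct` holds the three frequent labels, and `candidates` holds the rare variants that would be considered for merging into them.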
Usage
sai_fct_sweep(
  .f,
  known = NULL,
  wrong = NULL,
  nlevels_max = length(unique(.f)) - length(wrong),
  nlevels_min = length(unique(known)) + 1,
  nlevels_top = round(nlevels_max * 0.25),
  nlevels_bottom = 0,
  n_min = 1L,
  ...
)
sai_lvl_sweep(
  .f,
  known = NULL,
  wrong = NULL,
  nlevels_max = length(unique(.f)) - length(wrong),
  nlevels_min = length(unique(known)) + 1,
  nlevels_top = round(nlevels_max * 0.25),
  nlevels_bottom = 0,
  n_min = 1L,
  ...
)
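To see what the default arguments in the signatures above evaluate to, the same expressions can be computed by hand in base R. The input vectors here are hypothetical; only the default expressions are taken from the usage.

```r
# Hypothetical inputs, for illustration only.
f <- factor(c(rep("apple", 10), rep("banana", 8), rep("orange", 6),
              "aple", "bannana", "Orange", "orang", "n/a"))
known <- c("apple", "banana")
wrong <- c("n/a")

# Default expressions copied from the usage section.
nlevels_max <- length(unique(f)) - length(wrong)  # 8 distinct labels minus 1 wrong
nlevels_min <- length(unique(known)) + 1          # 2 known labels plus 1
nlevels_top <- round(nlevels_max * 0.25)          # a quarter of nlevels_max, rounded

print(c(max = nlevels_max, min = nlevels_min, top = nlevels_top))
```

With these inputs the defaults are 7, 3 and 2 respectively.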
Arguments
- .f
A factor
- known
A character vector of the levels that are known to be correct. If none are provided, it is assumed that no correct values are known. If an element has a name associated with it, it is assumed that the name is what is recorded and the value is what the actual label should be.
- wrong
A character vector of the levels that are known to be wrong and should be grouped with another level.
- nlevels_max
The maximum number of levels.
- nlevels_min
The minimum number of levels.
- nlevels_top
The number of levels treated as correct based on the top frequencies, excluding levels that have fewer than n_min observations.
- nlevels_bottom
The number of levels treated as incorrect based on the bottom frequencies, excluding levels that have fewer than n_min observations.
- n_min
The minimum number of observations for each level. The default is 1.
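The naming convention for `known` (name is what is recorded, value is the intended label) can be sketched in base R as follows. The vectors and the lookup logic are hypothetical and only illustrate the convention; they do not reproduce how the function applies it internally.

```r
# Hypothetical `known` vector: unnamed elements are already correct labels;
# the named element says the recorded label "aple" should be "apple".
known <- c("apple", "banana", aple = "apple")

# The recorded form is the name if there is one, otherwise the value itself.
recorded <- ifelse(names(known) == "", known, names(known))

# Hypothetical observed labels, mapped to their intended labels.
x <- c("apple", "aple", "banana")
corrected <- unname(known[match(x, recorded)])

print(corrected)
```

In this sketch the recorded label `"aple"` is mapped to `"apple"`, while the unnamed entries map to themselves.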