I would like to identify a sensible method to detect and select the 'principal' text-containing columns from my data.frames.
These columns contain open-ended survey responses, so they hold heterogeneous strings composed chiefly of letter characters. Ideally, this method would:
- remove all factor, numeric, date and logical columns
- remove sparsely-populated text columns
- remove text columns with few unique elements
- be able to deal with non-standard characters
Here is an example of what I'd like to achieve:
Input data
   v1   v2   v3   v4   v5 v6 v7 v8     v9 v10 v11 v12 v13   v14
 1 Na   Gu   Rx   Ll bird  a  a  1 88,626   1   1   ç   a  TRUE
 2 Ue   Ho   Iy <NA> bird  b  b  2 48,666   2   2   é   b FALSE
 3 Vk   Lv <NA> <NA> bird  a  c  3 12,559   3   1   ë   ç  TRUE
 4 Pd   Hk <NA> <NA> bird  b  d  4  3,794   4   2   õ   d FALSE
 5 Ay   Nd <NA> <NA> <NA>  a  e  5 75,239   5   1   ï   é  TRUE
 6 Xj <NA> <NA> <NA> <NA>  b  a  6 44,559   6   2   í   f FALSE
 7 Zn <NA> <NA> <NA> <NA>  a  b  7 21,100   7   1   ð   g  TRUE
 8 Mw <NA> <NA> <NA> <NA>  b  c  8  7,790   8   2   ø   h FALSE
 9 Yx <NA> <NA> <NA> <NA>  a  d  9 84,470   9   1   ö   i  TRUE
10 Oj <NA> <NA> <NA> <NA>  b  e 10 45,724  10   2   ò   j FALSE
Desired output
   v1   v2 v7 v12 v13
 1 Na   Gu  a   c   a
 2 Ue   Ho  b   e   b
 3 Vk   Lv  c   e   c
 4 Pd   Hk  d   o   d
 5 Ay   Nd  e   i   e
 6 Xj <NA>  a   i   f
 7 Zn <NA>  b   d   g
 8 Mw <NA>  c   o   h
 9 Yx <NA>  d   o   i
10 Oj <NA>  e   o   j
Here is the code for the input data:
# made-up data
df <- data.frame(
  stringsAsFactors = FALSE,
  v1  = paste0(sample(LETTERS, 10, replace = TRUE), sample(letters, 10, replace = TRUE)),
  v2  = c(paste0(sample(LETTERS, 5, replace = TRUE), sample(letters, 5, replace = TRUE)), rep(NA, 5)),
  v3  = c(paste0(sample(LETTERS, 2, replace = TRUE), sample(letters, 2, replace = TRUE)), rep(NA, 8)),
  v4  = c(paste0(sample(LETTERS, 1, replace = TRUE), sample(letters, 1, replace = TRUE)), rep(NA, 9)),
  v5  = c(rep("bird", 4), rep(NA, 6)),
  v6  = factor(rep(c("a", "b"), 5)),
  v7  = rep(c("a", "b", "c", "d", "e"), 2),
  v8  = 1:10,
  v9  = paste0(sample(1:99, 10, replace = TRUE), ",", sample(1:999, 10, replace = TRUE)),
  v10 = as.character(1:10),
  v11 = factor(rep(c(1, 2), 5)),
  v12 = c('ç', 'é', 'ë', 'õ', 'ï', 'í', 'ð', 'ø', 'ö', 'ò'),
  v13 = c('a', 'b', 'ç', 'd', 'é', letters[6:10]),
  v14 = rep(c(TRUE, FALSE), 5)
)
So far I've been able to isolate the character columns:
# drop = FALSE keeps a data.frame even if only one column is character
df <- df[, sapply(df, is.character), drop = FALSE]
And convert all the characters to Latin-ASCII, to replace non-standard letters:
df[] <- lapply(df, stringi::stri_trans_general, "Latin-ASCII")
But I am trying to find a sensible/reliable solution to remove sparsely-populated (like v3 and v4), highly-repetitive (like v5), or would-be numeric data formatted as characters (like v9 and v10). What's a good approach?
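To make the question concrete, here is a rough sketch of the kind of filter I have in mind. The function name `select_text_cols` and the three thresholds (`min_filled`, `min_unique`, `min_alpha`) are placeholders I made up, not values taken from any package, and the 0.5 defaults are guesses that would need tuning on real survey data. It uses base R only and assumes the Latin-ASCII conversion has already been applied, since `[[:alpha:]]` is locale-dependent for accented letters.

```r
# Hypothetical helper: keep only character columns that look like free text.
# All thresholds are assumptions to tune, not established defaults.
select_text_cols <- function(df,
                             min_filled = 0.5,   # min share of non-missing, non-blank values
                             min_unique = 0.5,   # min share of distinct values among filled ones
                             min_alpha  = 0.5) { # min share of letters among non-space characters
  keep <- vapply(df, function(x) {
    # Drops factor, numeric, Date and logical columns
    if (!is.character(x)) return(FALSE)

    # Values that are neither NA nor blank
    filled <- x[!is.na(x) & nzchar(trimws(x))]
    if (length(filled) == 0) return(FALSE)

    # Sparsely populated columns (like v3, v4)
    if (length(filled) / length(x) < min_filled) return(FALSE)

    # Highly repetitive columns (like v5)
    if (length(unique(filled)) / length(filled) < min_unique) return(FALSE)

    # Would-be numeric columns stored as text (like v9, v10):
    # compare the number of letters with the number of non-space characters
    joined     <- paste(filled, collapse = "")
    n_nonspace <- nchar(gsub("[[:space:]]", "", joined))
    n_letters  <- nchar(gsub("[^[:alpha:]]", "", joined))
    isTRUE(n_letters / n_nonspace >= min_alpha)
  }, logical(1))

  df[, keep, drop = FALSE]
}
```

Called as `select_text_cols(df)` after the Latin-ASCII step, I'd expect this to keep v1, v2, v7, v12 and v13 for the example above, but v2, v7 and v12 sit exactly on the 0.5 cut-offs, which is part of why I'm unsure that fixed proportions are a reliable rule.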