Winnow a data.frame to its principal text-containing columns?

I would like to identify a sensible method to detect and select the ‘principal’ text-containing columns from my data.frames.

These columns contain open-ended survey responses, so they hold heterogeneous strings composed chiefly of letter characters. Ideally, this method would

  1. remove all factor, numeric, date and logical columns
  2. remove sparsely-populated text columns
  3. remove text columns with few unique elements
  4. be able to deal with non-standard characters

Here is an example of what I'd like to achieve:

Input data

   v1   v2   v3   v4   v5 v6 v7 v8     v9 v10 v11 v12 v13   v14
1  Na   Gu   Rx   Ll bird  a  a  1 88,626   1   1   ç   a  TRUE
2  Ue   Ho   Iy <NA> bird  b  b  2 48,666   2   2   é   b FALSE
3  Vk   Lv <NA> <NA> bird  a  c  3 12,559   3   1   ë   ç  TRUE
4  Pd   Hk <NA> <NA> bird  b  d  4  3,794   4   2   õ   d FALSE
5  Ay   Nd <NA> <NA> <NA>  a  e  5 75,239   5   1   ï   é  TRUE
6  Xj <NA> <NA> <NA> <NA>  b  a  6 44,559   6   2   í   f FALSE
7  Zn <NA> <NA> <NA> <NA>  a  b  7 21,100   7   1   ð   g  TRUE
8  Mw <NA> <NA> <NA> <NA>  b  c  8  7,790   8   2   ø   h FALSE
9  Yx <NA> <NA> <NA> <NA>  a  d  9 84,470   9   1   ö   i  TRUE
10 Oj <NA> <NA> <NA> <NA>  b  e 10 45,724  10   2   ò   j FALSE

Desired output

    v1   v2 v7 v12 v13
1  Na   Gu  a   c   a
2  Ue   Ho  b   e   b
3  Vk   Lv  c   e   c
4  Pd   Hk  d   o   d
5  Ay   Nd  e   i   e
6  Xj <NA>  a   i   f
7  Zn <NA>  b   d   g
8  Mw <NA>  c   o   h
9  Yx <NA>  d   o   i
10 Oj <NA>  e   o   j

Here is the code for the input data:

# made-up data (sample() output varies from run to run unless a seed is set, e.g. set.seed(1))
df <- data.frame(stringsAsFactors = FALSE,
  v1 = paste0(sample(LETTERS, 10, replace = TRUE), sample(letters, 10, replace = TRUE)),
  v2 = c(paste0(sample(LETTERS, 5, replace = TRUE), sample(letters, 5, replace = TRUE)), rep(NA, 5)),
  v3 = c(paste0(sample(LETTERS, 2, replace = TRUE), sample(letters, 2, replace = TRUE)), rep(NA, 8)),
  v4 = c(paste0(sample(LETTERS, 1, replace = TRUE), sample(letters, 1, replace = TRUE)), rep(NA, 9)),
  v5 = c(rep("bird", 4), rep(NA, 6)),
  v6 = factor(rep(c("a", "b"), 5)),
  v7 = rep(c("a", "b", "c", "d", "e"), 2),
  v8 = 1:10,
  v9 = paste0(sample(1:99, 10, replace = TRUE), ",", sample(1:999, 10, replace = TRUE)),
  v10 = as.character(1:10),
  v11 = factor(rep(c(1, 2), 5)),
  v12 = c('ç','é','ë','õ','ï','í','ð','ø','ö','ò'),
  v13 = c('a','b','ç','d','é', letters[6:10]),
  v14 = as.logical(rep(c("TRUE", "FALSE"), 5)))

So far I've been able to isolate the character columns:

df <- df[, sapply(df, is.character)]

And convert all the characters to Latin-ASCII, to replace the non-standard letters:

df[] <- lapply(df, stringi::stri_trans_general, "Latin-ASCII")  
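
As a quick sanity check, that transliteration should map the accented letters in v12 onto plain ones (the result shown in the comment is what I expect from the Latin-ASCII transform, and it matches the desired output above):

stringi::stri_trans_general(c("ç", "é", "ë", "õ", "ø"), "Latin-ASCII")
# expecting: "c" "e" "e" "o" "o"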

But I am still looking for a sensible and reliable way to drop columns that are sparsely populated (like v3 and v4), highly repetitive (like v5), or really numeric data stored as character (like v9 and v10). What's a good approach?
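
To make the kind of thing I'm after concrete, here is a rough sketch of the heuristic I've been toying with. The function name keep_text_cols, the thresholds min_fill, min_unique and max_numeric, and the "numeric-looking" regex are all placeholders of my own, not anything settled:

# sketch of a heuristic filter; the thresholds are arbitrary placeholders
keep_text_cols <- function(df, min_fill = 0.5, min_unique = 5, max_numeric = 0.5) {
  # 1. keep only character columns (drops factor, numeric, date, logical)
  df <- df[, vapply(df, is.character, logical(1)), drop = FALSE]

  # 4. normalise non-standard letters to plain ASCII
  df[] <- lapply(df, stringi::stri_trans_general, "Latin-ASCII")

  keep <- vapply(df, function(x) {
    filled  <- mean(!is.na(x))                          # 2. share of non-missing cells
    n_uniq  <- length(unique(na.omit(x)))               # 3. number of distinct values
    num_ish <- mean(grepl("^[0-9.,]+$", na.omit(x)))    # share that is numeric-looking
    filled >= min_fill && n_uniq >= min_unique && num_ish <= max_numeric
  }, logical(1))

  df[, keep, drop = FALSE]
}

keep_text_cols(df)  # on the example data this should keep v1, v2, v7, v12 and v13

With these particular cut-offs it happens to reproduce the desired output above, but I have no idea whether they would generalise to other data sets, which is really what I'm asking.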

