I am building a data analysis package that scrapes data from a variety of sources. In many cases, columns holding the same type of data are named inconsistently across data sources.
In my case, I'm looking to create a rename function that looks up a dictionary stored as a tbl_df and renames the columns.
This example works when the dictionary has a new name for every original column, but not when the data has an additional column that isn't part of the dictionary in the tbl_df:
library(tidyverse)
df1 <- tibble::tribble(
  ~name, ~birthday, ~height,
  "John Smith", "01/20/1990", "5'10"
)
df2 <- tibble::tribble(
  ~name, ~score, ~grade,
  "John Smith", 95, 8
)
# column renaming dictionaries
people_dictionary <- tibble::tribble(
  ~namePeople, ~nameActual,
  "name", "studentName",
  "birthday", "dob",
  "height", "height"
)
test_dictionary <- tibble::tribble(
  ~nameTest, ~nameActual,
  "name", "studentName",
  "score", "examScore",
  "grade", "schoolGrade"
)
rename_function <- function(data, data_source = "people") {
  # find dictionary based on data source
  if (data_source == "people") {
    actual_names_df <- people_dictionary
  }
  if (data_source == "test") {
    actual_names_df <- test_dictionary
  }
  # get column names of data
  original_names <- colnames(data)
  # create column name filter depending on data source
  name_columns <- case_when(
    data_source == "people" ~ "namePeople",
    data_source == "test" ~ "nameTest"
  )
  # Match Original Names to Renamed Column Names
  actual_names <-
    seq_along(original_names) %>%
    purrr::map_chr(function(x) {
      actual <-
        actual_names_df %>%
        # rlang used to unquote dynamic name column
        filter((!!rlang::sym(name_columns)) == original_names[x]) %>%
        .$nameActual
    })
  # rename columns
  data <- data %>%
    purrr::set_names(actual_names)
}
renamed_df1 <- df1 %>% rename_function(data_source = "people")
# original df
df1
#> # A tibble: 1 x 3
#>   name       birthday   height
#>   <chr>      <chr>      <chr>
#> 1 John Smith 01/20/1990 5'10
# renamed columns of df1
renamed_df1
#> # A tibble: 1 x 3
#>   studentName dob        height
#>   <chr>       <chr>      <chr>
#> 1 John Smith  01/20/1990 5'10
# additional column not named in dictionary
df3 <- tibble::tribble(
  ~name, ~birthday, ~height, ~weight,
  "John Smith", "01/20/1990", "5'10", 165
)
df3 %>% rename_function(data_source = "people")
#> Error: Result 4 must be a single string, not a character vector of length 0
Created on 2020-02-06 by the reprex package (v0.3.0)
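As far as I can tell, the error comes from the lookup itself: filtering the dictionary for a column it doesn't list returns zero rows, so the function passed to map_chr() yields character(0) instead of a single string. A minimal check of that assumption, with people_dictionary redefined so the snippet runs on its own (missing_lookup is just an illustrative name):

```r
library(dplyr)
library(tibble)

people_dictionary <- tribble(
  ~namePeople, ~nameActual,
  "name", "studentName",
  "birthday", "dob",
  "height", "height"
)

# "weight" is not listed in the dictionary, so the filter keeps zero rows
missing_lookup <- people_dictionary %>%
  filter(namePeople == "weight") %>%
  pull(nameActual)

# an empty character vector, which map_chr() cannot turn into one string
length(missing_lookup)
#> [1] 0
```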
There are a few pieces of my rename function that I believe could be improved upon:
- Is there a better way to call the right dictionary (tbl_df) that isn't a ton of if statements? Would I be able to create a tbl_df or csv that has a column for data_source and another column that lists the name of the tbl_df within the package?
tibble::tribble(
~data_source, ~dictionary_name,
"people", "people_dictionary",
"test", "test_dictionary"
)
- How can I re-work my function to follow a similar workflow without hitting the rename error when not all columns have new names in the dictionary?
- Is there a better process for storing "dictionaries" in an R package for my use case?
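To make the first two points concrete, here is a rough sketch of the direction I have in mind, not a tested solution. It assumes the dictionaries can be stacked into one long table with a data_source column, which would replace the if statements with a filter, and it uses match() so that columns missing from the dictionary keep their original names. The names column_dictionary, nameOriginal and rename_with_dictionary are hypothetical and not part of my package.

```r
library(tibble)

# All dictionaries in one long table, keyed by data_source (hypothetical layout)
column_dictionary <- tribble(
  ~data_source, ~nameOriginal, ~nameActual,
  "people", "name", "studentName",
  "people", "birthday", "dob",
  "people", "height", "height",
  "test", "name", "studentName",
  "test", "score", "examScore",
  "test", "grade", "schoolGrade"
)

rename_with_dictionary <- function(data, data_source = "people",
                                   dictionary = column_dictionary) {
  # keep only the dictionary rows for the requested source
  source_rows <- dictionary[dictionary$data_source == data_source, ]
  if (nrow(source_rows) == 0) {
    stop("No dictionary entries for data source: ", data_source)
  }
  original_names <- colnames(data)
  # match() gives NA for columns that are not in the dictionary
  idx <- match(original_names, source_rows$nameOriginal)
  # fall back to the original name wherever there was no match
  new_names <- ifelse(is.na(idx), original_names, source_rows$nameActual[idx])
  colnames(data) <- new_names
  data
}

# the data frame with the extra weight column from above
df3 <- tribble(
  ~name, ~birthday, ~height, ~weight,
  "John Smith", "01/20/1990", "5'10", 165
)
renamed_df3 <- rename_with_dictionary(df3, data_source = "people")
colnames(renamed_df3)
#> [1] "studentName" "dob"         "height"      "weight"
```

For the third point, I assume a table like this could be kept as internal package data, for example through usethis::use_data() with internal = TRUE, but I don't know whether that is the recommended approach.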