Channel: Active questions tagged r - Stack Overflow

How do I create a function that renames column names based on specified dictionary?


I am in the midst of creating a data analysis package that scrapes data from a variety of sources. In many cases, each data source uses a different column name for the same type of data.

In my case, I'm looking to create a rename function that looks up a dictionary stored as a tbl_df and renames the columns accordingly.

This example works when the dictionary has a new name for every original column, but not when the data contains an additional column that isn't listed in the dictionary tbl_df.

library(tidyverse)

df1 <- tibble::tribble(
  ~name, ~birthday, ~height,
  "John Smith", "01/20/1990", "5'10"
)

df2 <- tibble::tribble(
  ~name, ~score, ~grade,
  "John Smith", 95, 8
)


# column renaming dictionaries
people_dictionary <- tibble::tribble(
  ~namePeople, ~nameActual,
  "name", "studentName",
  "birthday", "dob",
  "height", "height"
)

test_dictionary <- tibble::tribble(
  ~nameTest, ~nameActual,
  "name", "studentName",
  "score", "examScore",
  "grade", "schoolGrade"
)

rename_function <- function(data, data_source = "people") {
  # find dictionary based on data source
  if (data_source == "people") {
    actual_names_df <- people_dictionary
  }
  if (data_source == "test") {
    actual_names_df <- test_dictionary
  }

  # get column names of data
  original_names <- colnames(data)

  # create column name filter depending on data source
  name_columns <- case_when(
    data_source == "people" ~ "namePeople",
    data_source == "test" ~ "nameTest"
  )

  # Match Original Names to Renamed Column Names
  actual_names <-
    seq_along(original_names) %>%
    purrr::map_chr(function(x) {
      actual <-
        actual_names_df %>%
        # rlang used to unquote dynamic name column
        filter((!!rlang::sym(name_columns)) == original_names[x]) %>%
        .$nameActual
    })

  # rename columns
  data <- data %>%
    purrr::set_names(actual_names)
}

renamed_df1 <- df1 %>% rename_function(data_source = "people")

# original df
df1
#> # A tibble: 1 x 3
#>   name       birthday   height
#>   <chr>      <chr>      <chr> 
#> 1 John Smith 01/20/1990 5'10

# renamed columns of df1
renamed_df1
#> # A tibble: 1 x 3
#>   studentName dob        height
#>   <chr>       <chr>      <chr> 
#> 1 John Smith  01/20/1990 5'10

# additional column not named in dictionary
df3 <- tibble::tribble(
  ~name, ~birthday, ~height, ~weight,
  "John Smith", "01/20/1990", "5'10", 165
)

df3 %>% rename_function(data_source = "people")
#> Error: Result 4 must be a single string, not a character vector of length 0

Created on 2020-02-06 by the reprex package (v0.3.0)
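
My understanding of the error, as a minimal sketch: for a column with no dictionary entry, the lookup returns a zero-length vector, while purrr::map_chr() expects exactly one string per element. The snippet below uses a base R data.frame in place of the tbl_df purely to keep it self-contained; that substitution is an assumption for illustration.

```r
# Same content as people_dictionary above, as a plain data.frame
people_dictionary <- data.frame(
  namePeople = c("name", "birthday", "height"),
  nameActual = c("studentName", "dob", "height"),
  stringsAsFactors = FALSE
)

# "weight" has no row in the dictionary, so the lookup returns nothing
hit <- people_dictionary$nameActual[people_dictionary$namePeople == "weight"]
length(hit)
#> [1] 0
```

So the fourth column ("weight") produces a character vector of length 0, which appears to be what the "Result 4" message refers to.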

There are a few pieces of my rename function that I believe could be improved upon:

  1. Is there a better way to call the right dictionary (tbl_df) that isn't a ton of if statements? Would I be able to create a tbl_df or csv that has a column for data_source and another column that lists the name of the tbl_df within the package?
tibble::tribble(
  ~data_source,    ~dictionary_name,
      "people", "people_dictionary",
        "test",   "test_dictionary"
  )
  2. How can I rework my function to follow a similar workflow without hitting the rename error when some columns have no new name in the dictionary?
  3. Is there a better process for storing "dictionaries" in an R package for my use case?
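
To make the questions above concrete, here is a rough sketch of the direction I have in mind, not a finished implementation. It assumes the dictionaries live in one named list keyed by data source, and that every dictionary shares a common lookup column, here called nameSource. The names dictionaries, nameSource, and rename_with_dictionary are hypothetical and not part of my current code. Columns without a dictionary entry keep their original name.

```r
# Hypothetical layout: one named list of dictionaries, keyed by data source
dictionaries <- list(
  people = data.frame(
    nameSource = c("name", "birthday", "height"),
    nameActual = c("studentName", "dob", "height"),
    stringsAsFactors = FALSE
  ),
  test = data.frame(
    nameSource = c("name", "score", "grade"),
    nameActual = c("studentName", "examScore", "schoolGrade"),
    stringsAsFactors = FALSE
  )
)

rename_with_dictionary <- function(data, data_source = "people") {
  # look the dictionary up by name instead of chaining if statements
  dictionary <- dictionaries[[data_source]]
  if (is.null(dictionary)) {
    stop("No dictionary found for data source: ", data_source)
  }

  original_names <- colnames(data)

  # position of each column in the dictionary; NA when it is not listed
  idx <- match(original_names, dictionary$nameSource)

  # keep the original name wherever the dictionary has no entry
  new_names <- ifelse(is.na(idx), original_names, dictionary$nameActual[idx])

  colnames(data) <- new_names
  data
}

# additional column ("weight") not named in the dictionary
df3 <- data.frame(
  name = "John Smith", birthday = "01/20/1990", height = "5'10", weight = 165,
  stringsAsFactors = FALSE
)

colnames(rename_with_dictionary(df3, "people"))
#> [1] "studentName" "dob"         "height"      "weight"
```

For storage, one option might be to save the dictionaries list as internal package data (R/sysdata.rda, for example via usethis::use_data(dictionaries, internal = TRUE)), but I'm not sure whether that is the recommended approach for this use case.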
