Channel: Active questions tagged r - Stack Overflow

Recursively ensuring tibbles instead of data frames when parsing/manipulating nested JSON


I have to deal with JSON documents that contain nested documents and, at some level, have an array whose individual documents conceptually map back to "data frame rows" when reading/parsing the JSON in R.


First order problem/question

I'm looking for a way to ensure that

  • either all data frames are always turned into tibbles

  • or that at least the "leaf data frames" become tibbles while the "parent data frames" are allowed to become lists

for arbitrary nested structures, either directly upon parsing via {jsonlite} or afterwards via {purrr}.
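To make the first option concrete, here is a minimal sketch of a recursive conversion, assuming only {tibble} is available. The helper name `as_tibble_recursive()` and the hand-built test data are hypothetical and are not part of {jsonlite}, {purrr} or {tibble}; the data only mimics the kind of structure `fromJSON()` returns (a data frame column plus a list column of data frames).

```r
library(tibble)

# Hypothetical helper: recursively turn every data frame into a tibble.
as_tibble_recursive <- function(x) {
  if (is.data.frame(x)) {
    # Convert the columns first, then rebuild the data frame as a tibble
    tibble::as_tibble(lapply(x, as_tibble_recursive))
  } else if (is.list(x)) {
    # Plain list (e.g. a list column): descend into each element
    lapply(x, as_tibble_recursive)
  } else {
    # Atomic vectors and everything else are returned unchanged
    x
  }
}

# Hand-built stand-in for a parsed document (assumption, not real API output)
inner <- data.frame(id = c("a", "b"), stringsAsFactors = FALSE)
inner$rows <- list(data.frame(x = 1:2), data.frame(x = 3:4))
outer <- data.frame(key = c("k1", "k2"), stringsAsFactors = FALSE)
outer$inner <- inner

res <- as_tibble_recursive(outer)
class(res$inner)            # class of the nested data frame column
class(res$inner$rows[[1]])  # class of a data frame inside the list column
```

This variant keeps the parent data frames as (tibble) data frame columns rather than turning them into lists, so it addresses the first bullet only.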

Second order problem/question

How do I traverse lists and apply map recursively with {purrr} "the right way"?
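One possible shape for such a traversal is sketched below, under the assumption that "recursive map" means: apply a function to every element matching a predicate and descend into plain lists otherwise. The helper `map_where()` and its arguments `pred` and `fn` are hypothetical names chosen for illustration, not {purrr} functions.

```r
library(purrr)
library(tibble)

# Hypothetical helper: apply `fn` wherever `pred` is TRUE, descending into
# plain lists; data frames that do not match `pred` are left untouched.
map_where <- function(x, pred, fn) {
  if (pred(x)) {
    fn(x)
  } else if (is.list(x) && !is.data.frame(x)) {
    # An anonymous function avoids name clashes with map()'s own arguments
    purrr::map(x, function(el) map_where(el, pred, fn))
  } else {
    x
  }
}

# Hand-built nested list (assumption, only for illustration)
nested <- list(
  a = 1:3,
  b = list(
    c = data.frame(x = 1:2),
    d = list(e = data.frame(y = "z", stringsAsFactors = FALSE))
  )
)

out <- map_where(nested, is.data.frame, tibble::as_tibble)
class(out$b$d$e)  # class of the most deeply nested data frame
```

Note that this sketch stops at data frames, so it does not reach data frames stored inside the columns of another data frame. Recent {purrr} releases (1.0.0 and later) also provide `purrr::modify_tree()` for this kind of traversal, which by default treats data frames as leaves.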

Example

Example data

json <- '[
  {
    "_id": "1234",
    "createdAt": "2020-01-13 09:00:00",
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 1,
            "z": true
          },
          {
            "x": "B",
            "y": 2,
            "z": false
          }
          ]
      }
    },
    "schema": "0.0.1"
  },
  {
    "_id": "5678",
    "createdAt": "2020-01-13 09:01:00",
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 1,
            "z": true
          },
          {
            "x": "B",
            "y": 2,
            "z": false
          }
          ]
      }
    },
    "schema": "0.0.1"
  }
]'

Result after parsing and turning into tibble

library(magrittr)  # provides the %>% pipe used throughout

x <- jsonlite::fromJSON(json) %>% 
  tibble::as_tibble()

x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  5 variables:
#  $ _id      : chr  "1234" "5678"
#  $ createdAt: chr  "2020-01-13 09:00:00" "2020-01-13 09:01:00"
#  $ labels   :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne :'data.frame':    2 obs. of  1 variable:
#   ..$ levelTwo:'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#  $ schema   : chr  "0.0.1" "0.0.1"
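For the "directly upon parsing" route, one option is to stop {jsonlite} from building data frames at all and to create tibbles explicitly from the arrays of records. The sketch below assumes {purrr} 1.0.0 or later (for `list_rbind()`); the shortened JSON and the helper `records_to_tibble()` are hypothetical and only mirror the shape of the example data.

```r
library(jsonlite)
library(purrr)
library(tibble)

# Shortened, hypothetical document with the same shape as the example data
json_small <- '[
  {"_id": "1234",
   "levelOne": {"levelTwo": {"levelThree": [
     {"x": "A", "y": 1, "z": true},
     {"x": "B", "y": 2, "z": false}
   ]}}}
]'

# With simplifyDataFrame = FALSE, arrays of objects stay lists of named
# lists instead of being collapsed into data frames
docs <- jsonlite::fromJSON(json_small, simplifyDataFrame = FALSE)

# Hypothetical helper: build one tibble from a list of records
records_to_tibble <- function(records) {
  purrr::list_rbind(purrr::map(records, tibble::as_tibble))
}

leaf <- records_to_tibble(docs[[1]]$levelOne$levelTwo$levelThree)
str(leaf)
```

The trade-off is that the location of the record arrays has to be known (or detected) explicitly, whereas the default parsing simplifies them automatically.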

Desired result

x <- jsonlite::fromJSON(json) %>% 
  tidy_nested_data_frames() %>% 
  tibble::as_tibble()

x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  5 variables:
#  $ _id      : chr  "1234" "5678"
#  $ createdAt: chr  "2020-01-13 09:00:00" "2020-01-13 09:01:00"
#  $ labels   :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne :List of 2
#   ..$ levelTwo:List of 1
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   ..$ levelTwo:List of 1
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#  $ schema   : chr  "0.0.1" "0.0.1"

Current solution

I have something that works, but it seems both too complicated and too brittle, as it was designed with one specific use case/JSON structure in mind:

tidy_nested_data_frames <- function(
  x
) {
  is_data_frame_that_should_be_list <- function(x) {
    # all() reduces the per-column check to a single logical, as && requires
    is.data.frame(x) && all(purrr::map_lgl(x, is.data.frame))
  }
  y <- x %>%
    purrr::map_if(is_data_frame_that_should_be_list, as.list)

  # Check for next data frame columns to handle:
  false <- function(.x) FALSE
  class_info <- y %>%
    purrr::map_if(is.list, ~.x %>% purrr::map(is.data.frame), .else = false)

  trans_to_tibble <- function(x) {
    x %>% purrr::map(tibble::as_tibble)
  }
  purrr::map2(class_info, y, function(.x, .y) {
    go_deeper <- .x %>% as.logical() %>% all()

    if (go_deeper) {
      # Continue if data frame columns have been detected:

      tidy_nested_data_frames(.y[go_deeper])
    } else {
      # Handle data frames that have list columns that themselves carry the data
      # frames we want to turn into tibbles:

      # NOTE:
      # This probably does not generalize well yet as the logic seems too much
      # tied to my current use case!

      if (.y %>% is.data.frame()) {
        .y %>%
          purrr::map_if(is.list, trans_to_tibble)
      } else {
        .y
      }
    }
  })
}
