I have to deal with JSON documents that contain nested documents and, at some level, an array whose elements are individual documents that conceptually map back to "data frame rows" when reading/parsing the JSON in R.
First order problem/question
I'm looking for a way to ensure that
- either all data frames are always turned into tibbles
- or that at least the "leaf data frames" become tibbles while the "parent data frames" are allowed to become lists
for arbitrary nested structures, either directly upon parsing via {jsonlite} or afterwards via {purrr}.
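To make the second variant (parents as lists, leaves as tibbles) more concrete, here is a minimal sketch of one possible direction, not a definitive solution. It assumes parsing with simplifyDataFrame = FALSE (an existing argument of jsonlite::fromJSON) so that parents stay plain lists, and then row-binds every array of flat objects. The names is_record_array, leaves_to_tibbles and json_small are hypothetical, and the sketch assumes all objects in a leaf array share the same fields and contain no null values.

```r
library(jsonlite)
library(purrr)
library(tibble)

# Hypothetical predicate: TRUE for an unnamed list whose elements are all
# flat named lists of scalars, i.e. a JSON array of flat objects.
is_record_array <- function(x) {
  is_flat_record <- function(rec) {
    is.list(rec) && !is.null(names(rec)) &&
      all(lengths(rec) == 1L) &&
      !any(purrr::map_lgl(rec, is.list))
  }
  is.list(x) && is.null(names(x)) && length(x) > 0 &&
    all(purrr::map_lgl(x, is_flat_record))
}

# Hypothetical helper: turn leaf record arrays into tibbles and leave every
# other list (the "parents") as a plain list.
leaves_to_tibbles <- function(x) {
  if (is_record_array(x)) {
    # Assumption: every record has the same fields, so rbind() lines them up.
    rows <- lapply(x, as.data.frame, optional = TRUE, stringsAsFactors = FALSE)
    tibble::as_tibble(do.call(rbind, rows))
  } else if (is.list(x)) {
    purrr::map(x, leaves_to_tibbles)
  } else {
    x
  }
}

# Reduced version of the example document, for illustration only.
json_small <- '[
  {
    "_id": "1234",
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {"x": "A", "y": 1, "z": true},
          {"x": "B", "y": 2, "z": false}
        ]
      }
    }
  }
]'

parsed <- jsonlite::fromJSON(json_small, simplifyDataFrame = FALSE)
result <- leaves_to_tibbles(parsed)
str(result)
```

Note that with this approach the top level is a list of records rather than a data frame, so it would still have to be reshaped if a top-level tibble is needed.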
Second order problem/question
How do I traverse lists and apply map recursively with {purrr} "the right way"?
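For reference, the kind of recursion I have in mind could look roughly like the sketch below, which covers the first variant (every data frame becomes a tibble). The helper name as_tibble_recursive and the reduced json_small input are hypothetical; the sketch only relies on purrr::map() iterating over the columns of a data frame and on tibble::as_tibble() accepting a named list of equally sized columns. Edge cases such as zero-column data frames are not handled.

```r
library(jsonlite)
library(purrr)
library(tibble)

# Hypothetical helper: walk a nested structure and turn every data frame
# (including data frame columns and data frames inside list columns)
# into a tibble.
as_tibble_recursive <- function(x) {
  if (is.data.frame(x)) {
    # Recurse into the columns first, then rebuild the frame as a tibble.
    tibble::as_tibble(purrr::map(x, as_tibble_recursive))
  } else if (is.list(x)) {
    # Plain lists (e.g. list columns) are traversed element by element.
    purrr::map(x, as_tibble_recursive)
  } else {
    # Atomic vectors are returned unchanged.
    x
  }
}

# Reduced version of the example document, for illustration only.
json_small <- '[
  {
    "_id": "1234",
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {"x": "A", "y": 1, "z": true},
          {"x": "B", "y": 2, "z": false}
        ]
      }
    }
  }
]'

result <- as_tibble_recursive(jsonlite::fromJSON(json_small))
str(result)
```

Whether this is "the right way" with {purrr} is exactly what I am unsure about; the explicit if/else recursion works, but it bypasses most of the map_if()/map_at() family.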
Related
- https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/
- Ensure that data frames become tibbles when reading MongoDB data with {mongolite}
Example
Example data
json <- '[
  {
    "_id": "1234",
    "createdAt": "2020-01-13 09:00:00",
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 1,
            "z": true
          },
          {
            "x": "B",
            "y": 2,
            "z": false
          }
        ]
      }
    },
    "schema": "0.0.1"
  },
  {
    "_id": "5678",
    "createdAt": "2020-01-13 09:01:00",
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 1,
            "z": true
          },
          {
            "x": "B",
            "y": 2,
            "z": false
          }
        ]
      }
    },
    "schema": "0.0.1"
  }
]'
Result after parsing and turning into tibble
library(magrittr)  # provides the %>% pipe used throughout

x <- jsonlite::fromJSON(json) %>%
  tibble::as_tibble()
x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 5 variables:
# $ _id : chr "1234" "5678"
# $ createdAt: chr "2020-01-13 09:00:00" "2020-01-13 09:01:00"
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne :'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# $ schema : chr "0.0.1" "0.0.1"
Desired result
x <- jsonlite::fromJSON(json) %>%
  tidy_nested_data_frames() %>%
  tibble::as_tibble()
x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 5 variables:
# $ _id : chr "1234" "5678"
# $ createdAt: chr "2020-01-13 09:00:00" "2020-01-13 09:01:00"
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne :List of 2
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# $ schema : chr "0.0.1" "0.0.1"
Current solution
I have something that works, but it seems both too complicated and too brittle, since it was designed with one specific use case/JSON structure in mind:
tidy_nested_data_frames <- function(x) {
  # A data frame whose columns are all data frames is a pure "parent" and
  # should become a list. `&&` needs a length-one condition, hence all().
  is_data_frame_that_should_be_list <- function(x) {
    is.data.frame(x) && all(purrr::map_lgl(x, is.data.frame))
  }
  y <- x %>%
    purrr::map_if(is_data_frame_that_should_be_list, as.list)

  # Check for next data frame columns to handle:
  false <- function(.x) FALSE
  class_info <- y %>%
    purrr::map_if(is.list, ~ .x %>% purrr::map(is.data.frame), .else = false)

  trans_to_tibble <- function(x) {
    x %>% purrr::map(tibble::as_tibble)
  }

  purrr::map2(class_info, y, function(.x, .y) {
    go_deeper <- .x %>% as.logical() %>% all()
    if (go_deeper) {
      # Continue if data frame columns have been detected:
      tidy_nested_data_frames(.y[go_deeper])
    } else {
      # Handle data frames that have list columns that themselves carry the
      # data frames we want to turn into tibbles:
      # NOTE:
      # This probably does not generalize well yet as the logic seems too much
      # tied to my current use case!
      if (.y %>% is.data.frame()) {
        .y %>%
          purrr::map_if(is.list, trans_to_tibble)
      } else {
        .y
      }
    }
  })
}