오렌지 맨숀🍊 - unnest() : Flatten back out into regular columns

Today's Function: unnest

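Today's function is unnest() from the tidyr package.

unnest() takes a nested data frame, that is, a data frame with a list-column of data frames, and expands it back out into regular rows and columns.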
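
As a quick orientation before the details, the sketch below builds a nested tibble with nest() and flattens it again with unnest(). The toy data frame and its column names (g, v, data) are made up for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: two groups, three rows
df <- tibble(g = c("a", "a", "b"), v = 1:3)

# nest() packs the v values of each group into a list-column of tibbles
nested <- nest(df, data = v)

# unnest() flattens that list-column back into regular rows and columns
restored <- unnest(nested, data)
restored
```

The nested tibble has one row per group, and unnesting it returns the original three rows.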

Usage

unnest(
  data,
  cols,
  ...,
  keep_empty = FALSE,
  ptype = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  .drop = deprecated(),
  .id = deprecated(),
  .sep = deprecated(),
  .preserve = deprecated()
)

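Arguments

data : A data.frame or tibble.
cols : The columns to unnest. You can select them with a tidy-select expression.
keep_empty : By default, unnest() produces one output row for each element of the column being unnested. If an element is NULL or empty, the whole row is dropped from the output. To keep every row, set keep_empty = TRUE.
names_sep : Controls how the unnested columns are named. If NULL (the default), the inner names are kept as they are. If a string, the outer column name and the inner names are pasted together with names_sep as the separator.
names_repair : Used to check that the output data frame has valid names.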

Example

library(tidyverse)

# Build a nested tibble with tibble().
df1 <- tibble(
  x = 1:3,
  y = list(
    NULL,
    tibble(a = 1, b = 2),
    tibble(a = 1:3, b = 3:1)
  )
)

df1

# A tibble: 3 × 2
      x y               
  <int> <list>          
1     1 <NULL>          
2     2 <tibble [1 × 2]>
3     3 <tibble [3 × 2]>

# Flatten the nested tibble with unnest().
df1 |> unnest(y)

# A tibble: 4 × 3
      x     a     b
  <int> <dbl> <dbl>
1     2     1     2
2     3     1     3
3     3     2     2
4     3     3     1

# With keep_empty = TRUE, row 1, which held NULL, is kept in the output.
df1 |> unnest(y, keep_empty = TRUE)

# A tibble: 5 × 3
      x     a     b
  <int> <dbl> <dbl>
1     1    NA    NA
2     2     1     2
3     3     1     3
4     3     2     2
5     3     3     1
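
keep_empty is not specific to NULL. As far as the documented behavior goes, any size-0 element, such as an empty vector, is dropped in the same way. The sketch below checks this with a made-up tibble, df_empty, whose list-column holds plain vectors instead of tibbles.

```r
library(tidyr)
library(tibble)

# Hypothetical example: the first element of y is a zero-length vector
df_empty <- tibble(
  x = 1:3,
  y = list(integer(), 1:2, 3L)
)

# Default: the row with the empty element disappears
dropped <- unnest(df_empty, y)

# keep_empty = TRUE: that row is kept, with y filled in as NA
kept <- unnest(df_empty, y, keep_empty = TRUE)

nrow(dropped)
nrow(kept)
```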

names

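Next, let's look at how column names are decided when unnest() flattens nested data. We load the penguin data from the palmerpenguins package and, for each species, compute the quartiles of four measurements (bill depth, bill length, flipper length, and body mass).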

library(palmerpenguins)

penguins |> 
  select(c(species, bill_depth_mm, bill_length_mm, flipper_length_mm, body_mass_g)) |>
  group_by(species) |>
  summarise_all(.funs = function(x) list(enframe(
    quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE))))

# A tibble: 3 × 5
  species   bill_depth_mm    bill_length_mm   flipper_length_mm body_mass_g     
  <fct>     <list>           <list>           <list>            <list>          
1 Adelie    <tibble [3 × 2]> <tibble [3 × 2]> <tibble [3 × 2]>  <tibble [3 × 2]>
2 Chinstrap <tibble [3 × 2]> <tibble [3 × 2]> <tibble [3 × 2]>  <tibble [3 × 2]>
3 Gentoo    <tibble [3 × 2]> <tibble [3 × 2]> <tibble [3 × 2]>  <tibble [3 × 2]>

For each penguin species, the quartiles of the four measurements are stored as separate tibbles. Let's flatten them with unnest().

penguins |> 
  select(c(species, bill_depth_mm, bill_length_mm, flipper_length_mm, body_mass_g)) |>
  group_by(species) |>
  summarise_all(.funs = function(x) list(enframe(
    quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)))) |>
  unnest()

# A tibble: 9 × 9
  species   name  value name1 value1 name2 value2 name3 value3
  <fct>     <chr> <dbl> <chr>  <dbl> <chr>  <dbl> <chr>  <dbl>
1 Adelie    25%    17.5 25%     36.8 25%      186 25%    3350 
2 Adelie    50%    18.4 50%     38.8 50%      190 50%    3700 
3 Adelie    75%    19   75%     40.8 75%      195 75%    4000 
4 Chinstrap 25%    17.5 25%     46.3 25%      191 25%    3488.
5 Chinstrap 50%    18.4 50%     49.6 50%      196 50%    3700 
6 Chinstrap 75%    19.4 75%     51.1 75%      201 75%    3950 
7 Gentoo    25%    14.2 25%     45.3 25%      212 25%    4700 
8 Gentoo    50%    15   50%     47.3 50%      216 50%    5000 
9 Gentoo    75%    15.7 75%     49.6 75%      221 75%    5500

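There is a problem. Every unnested column is called name or value, with only a numeric suffix to tell them apart, so we can no longer see which measurement a column belongs to. This is what names_repair and names_sep are for. names_repair defaults to "check_unique", which only checks that the output names are unique and raises an error if they are not; it does not rename anything. The name1, value1 style suffixes above come from the legacy name repair that tidyr falls back to when unnest() is called without cols, a call that also triggers a warning because cols is now required. What we really want is to know which measurement each column holds, and for that we use names_sep. Given a separator, unnest() pastes the outer column name, the separator, and the inner column name together to form each new column name.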

# Set names_sep = "_"
penguins |> 
  select(c(species, bill_depth_mm, bill_length_mm, flipper_length_mm, body_mass_g)) |>
  group_by(species) |>
  summarise_all(.funs = function(x) list(enframe(
    quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)))) |>
  unnest(
    cols = c(bill_depth_mm, bill_length_mm, flipper_length_mm, body_mass_g),
    names_sep = "_"
  )

# A tibble: 9 × 9
  species   bill_depth…¹ bill_…² bill_…³ bill_…⁴ flipp…⁵ flipp…⁶ body_…⁷ body_…⁸
  <fct>     <chr>          <dbl> <chr>     <dbl> <chr>     <dbl> <chr>     <dbl>
1 Adelie    25%             17.5 25%        36.8 25%         186 25%       3350 
2 Adelie    50%             18.4 50%        38.8 50%         190 50%       3700 
3 Adelie    75%             19   75%        40.8 75%         195 75%       4000 
4 Chinstrap 25%             17.5 25%        46.3 25%         191 25%       3488.
5 Chinstrap 50%             18.4 50%        49.6 50%         196 50%       3700 
6 Chinstrap 75%             19.4 75%        51.1 75%         201 75%       3950 
7 Gentoo    25%             14.2 25%        45.3 25%         212 25%       4700 
8 Gentoo    50%             15   50%        47.3 50%         216 50%       5000 
9 Gentoo    75%             15.7 75%        49.6 75%         221 75%       5500 
# … with abbreviated variable names ¹bill_depth_mm_name, ²bill_depth_mm_value,
#   ³bill_length_mm_name, ⁴bill_length_mm_value, ⁵flipper_length_mm_name,
#   ⁶flipper_length_mm_value, ⁷body_mass_g_name, ⁸body_mass_g_value
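
To see the naming options side by side without the penguin pipeline, here is a minimal sketch. The tibble df_names and its columns (id, p, q) are invented for illustration; the only assumption is that both list-columns hold tibbles with the same inner names, as in the quartile example.

```r
library(tidyr)
library(tibble)

# Hypothetical data: two list-columns whose tibbles share the names name/value
df_names <- tibble(
  id = 1:2,
  p = list(tibble(name = "lo", value = 1), tibble(name = "lo", value = 2)),
  q = list(tibble(name = "hi", value = 10), tibble(name = "hi", value = 20))
)

# Default names_repair = "check_unique": duplicated names raise an error
default_result <- tryCatch(
  unnest(df_names, c(p, q)),
  error = function(e) "error"
)

# names_sep prefixes each inner name with its outer column name
with_sep <- unnest(df_names, c(p, q), names_sep = "_")
names(with_sep)

# names_repair = "unique" keeps the inner names but makes them unique
repaired <- unnest(df_names, c(p, q), names_repair = "unique")
names(repaired)
```

names_sep is usually the more readable choice, because the resulting names say where each column came from.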

lists of lists

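To flatten a data frame in which several list-columns are nested side by side, you can call unnest() twice. For more deeply nested data, hoist(), unnest_wider(), and unnest_longer() are a better fit. These three functions are used for so-called rectangling, and I will cover them in a separate post.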

df2 <- tibble(
  a = list(c("a", "b"), "c"),
  b = list(1:2, 3),
  c = c(11, 22)
)

df2

# A tibble: 2 × 3
  a         b             c
  <list>    <list>    <dbl>
1 <chr [2]> <int [2]>    11
2 <chr [1]> <dbl [1]>    22

# unnest() can flatten several columns at the same time.
df2 |> unnest(c(a, b))

# A tibble: 3 × 3
  a         b     c
  <chr> <dbl> <dbl>
1 a         1    11
2 b         2    11
3 c         3    22
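
Unnesting several columns at once only works when the elements in each row have compatible sizes. The sketch below illustrates this with two made-up tibbles, ok and bad; it assumes the usual tidyverse recycling rule, where a size-1 element is repeated to match its neighbor.

```r
library(tidyr)
library(tibble)

# Row 1: sizes 2 and 2. Row 2: sizes 2 and 1, so the single value is recycled.
ok <- tibble(
  a = list(c("a", "b"), c("c", "d")),
  b = list(1:2, 3L)
)
flat <- unnest(ok, c(a, b))
flat

# Sizes 2 and 3 in the same row cannot be lined up, so unnest() fails
bad <- tibble(a = list(c("a", "b")), b = list(1:3))
bad_result <- tryCatch(
  unnest(bad, c(a, b)),
  error = function(e) "error"
)
bad_result
```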

# Unnesting one column at a time instead gives every combination of a and b within each row.
df2 |> unnest(a) |> unnest(b)

# A tibble: 5 × 3
  a         b     c
  <chr> <dbl> <dbl>
1 a         1    11
2 a         2    11
3 b         1    11
4 b         2    11
5 c         3    22
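
The rectangling helpers mentioned earlier deserve their own post, but a rough sketch shows what each one does. The tibble df_list and its fields (id, info, name, scores) are hypothetical; the point is that the list-column holds named lists instead of tibbles.

```r
library(tidyr)
library(tibble)

# Hypothetical list-column whose elements are named lists
df_list <- tibble(
  id = 1:2,
  info = list(
    list(name = "a", scores = c(1, 2)),
    list(name = "b", scores = 3)
  )
)

# unnest_wider(): each named component becomes its own column
wide <- unnest_wider(df_list, info)

# unnest_longer(): each element of a list-column becomes its own row
long <- unnest_longer(wide, scores)

# hoist(): pull out selected components only and leave the rest in place
hoisted <- hoist(df_list, info, name = "name")

wide
long
hoisted
```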