library(dplyr)
library(tidyr)
<- data.frame(
input_df id = c(1, 2, 3, 4),
text = c("apple, banana", "apple", "banana ,orange", "pear")
)
%>% knitr::kable() input_df
id | text |
---|---|
1 | apple, banana |
2 | apple |
3 | banana ,orange |
4 | pear |
August 30, 2023
Code to convert a dataframe text column with delimited text substrings (e.g., CSV) into multiple Boolean (or numeric) columns specifying presence or absence of that substring. Requires an identifier column and a delimited text column.
One use case is data collected from multiple checkboxes, but with the results of all the checked boxes concatenated into a delimited list.
Create a dataframe with 2 columns and 4 rows. Each row has a column with a unique ID and another with a CSV list of fruit.
(I intentionally put some white space before and after the comma delimiter to make sure it can be properly handled.)
library(dplyr)
library(tidyr)
input_df <- data.frame(
id = c(1, 2, 3, 4),
text = c("apple, banana", "apple", "banana ,orange", "pear")
)
input_df %>% knitr::kable()
id | text |
---|---|
1 | apple, banana |
2 | apple |
3 | banana ,orange |
4 | pear |
The following code snippet
delimeter <- ","
prefix <- ""
output_df <- input_df %>%
separate_rows(text, sep = delimeter) %>% # `text` is the delimited string column
mutate(text = trimws(text, which = "both")) %>% # remove white space around delimiter
distinct() %>% # remove any repeat substrings
pivot_wider(
names_from = text,
names_prefix = prefix,
values_from = text,
values_fn = ~ as.numeric(!is.na(.x)), # output as 0/1 numeric
values_fill = 0
# values_fn = ~ !is.na(.x), # output as Boolean
# values_fill = FALSE
)
Here’s how the output looks, with the original text column added back for comparison.