library(tidyverse)
Code snippets to deterministically anonymize identifiers and dates, while maintaining intra-individual time series relationships, to facilitate sharing examples of clinical dashboards.
Sharing examples of clinical dashboards is difficult because of the presence of protected health information. Anonymizing the information leads to potential challenges as well:
- converting identifiers (like names or medical record numbers) to a random string of characters is not easily human readable
- converting dates randomly loses intra-individual time series relationships, like the trend of a lab value or weight over time
- a deterministic, rather than fully random, method of anonymizing identifiers and dates, is useful if new data for individuals is acquired and needs to be linked to previously anonymized data
Below are a pair of functions that deterministically anonymize identifiers and dates.
Convert an identifier into a human-readable name
The identifier is concatenated to a salt, hashed, converted into two integers, and indexed into the babynames
database to yield a human readable name. If the salt is generated randomly (and not saved), this essentially results in full randomization of the identifier. But if the salt (and identifier) are known, then the conversion into names can be reproduced.
#' Deterministically anonymize an identifier into a name
#'
#' Deterministically convert an identifier into a concatenation of two names from the babynames database.
#' NOT guaranteed to be unique, but there should be 9.47 billion unique combinations as of 2022-10.
#'
#' @param id vector of hashable identifier elements
#' @param salt `salt` is concatenated to `id` prior to hashing
#' @export
<- function(id, salt = "") {
random_names # for each `id`, return a deterministically "random" concatenation of two names from the babynames database
library(babynames)
<- unique(babynames$name)
namelist
<- vapply(
hashes X = paste0(id, salt[1]), # apply salt to the inputs
FUN = rlang::hash, # rlang::hash() returns 32-hex digits
FUN.VALUE = "chr",
USE.NAMES = FALSE
)
# take the first and second 10 digits of the hex and convert to a decimal integer
# use as indices into the unique babynames (use `%%` modulo to generate proper range)
# return the concatenated pair of names
return(
paste(
as.numeric(paste0("0x", substr(hashes, 1, 10))) %% length(namelist) + 1],
namelist[as.numeric(paste0("0x", substr(hashes, 11, 20))) %% length(namelist) + 1]
namelist[
)
) }
random_names(c("Joe", "Hadley", "Garrett", "Yihui", "Tibshirani"), salt = "secret_salt") # known salt
[1] "Blaza Navraj" "Kreigh Kimyra" "Charlita Delishia"
[4] "Rashana Chantina" "Kentavia Dajonna"
random_names(c("Joe", "Hadley", "Garrett", "Yihui", "Tibshirani"), salt = runif(1)) # random salt
[1] "Shanice Daymien" "Achilles Adlene" "Yanal Mizell"
[4] "Chakakhan Ziaire" "Hortencia Shaleia"
random_names(c("Joe", "Hadley", "Garrett", "Yihui", "Tibshirani"), salt = runif(1)) # random salt
[1] "Matheson Davyne" "Twanna Sheddrick" "Kindel Mckinleigh"
[4] "Buzz Deem" "Athene Almalik"
random_names(c("Joe", "Hadley", "Garrett", "Yihui", "Tibshirani"), salt = "secret_salt") # reproduce with known salt
[1] "Blaza Navraj" "Kreigh Kimyra" "Charlita Delishia"
[4] "Rashana Chantina" "Kentavia Dajonna"
Shift a POSIXct date-time to within a target year
An identifier is concatenated to a salt, hashed, and converted into an integer, which is used to select a random minute within the target year. For any given individual, a baseline POSIXct
date-time (e.g., date of birth or date of admission) is used to calculate the amount of “shift” that should be applied to all date-times for that individual, which maintains relative times for that individual.
#' Deterministically shift POSIXct to within a given year
#'
#' Deterministically return a shift (in seconds) of a POSIXct vector to within a given year, to
#' anonymize. The shift is determined by a hash of the id and salt. If the same shift is applied
#' to all POSIXct for an individual, relative times can be preserved, for time series analysis.
#'
#' @param id Shift generated by hash of `id` + `salt`
#' @param salt `salt` is concatenated to `id` prior to hashing
#' @param year Target year to randomize the dates to
#' @export
<- function(base_date, id, salt = "", year = 2000) {
random_timeshift # deterministically return a shift (in seconds) to a `base_date` POSIXct column to shift to within a given year
# - deterministic based on `id` + `salt`
# - shifts to the minute (no seconds)
# - vectorizes successfully across `year`
<- as.POSIXct(paste0(year, "-01-01"))
start <- as.POSIXct(paste0(year+1, "-01-01"))
end <- as.numeric(difftime(end, start), units = "secs") # will account for leap years and leap seconds
seconds_in_year <- as.integer(seconds_in_year / 60)
minutes_in_year
# map_chr because rlang::hash is NOT vectorized
= map_chr(id, .f = function(x) substr(rlang::hash(paste0(x, salt)), 1, 8))
hash <- start + 60 * as.numeric(paste0("0x", hash)) %% minutes_in_year
hash_date <- as.numeric(difftime(hash_date, base_date), units = "secs")
shift_seconds return(shift_seconds)
}
Example of use
Generate a dataframe with identifiers and dates of birth:
<- tibble(
df id = 1:8,
dob = as.POSIXct("2022-01-01") + (id-1)*60*60*24 # sequential days
)
::kable(df) knitr
id | dob |
---|---|
1 | 2022-01-01 |
2 | 2022-01-02 |
3 | 2022-01-03 |
4 | 2022-01-04 |
5 | 2022-01-05 |
6 | 2022-01-06 |
7 | 2022-01-07 |
8 | 2022-01-08 |
Then, anonymize that dataframe by converting the identifier into a random name and shifting the dates of birth “randomly”:
%>%
df mutate(
name = random_names(id, salt = Sys.time()),
shift = random_timeshift(dob, name, salt = "", year = 2000),
dob_shift = dob + shift
%>%
) ::kable() knitr
id | dob | name | shift | dob_shift |
---|---|---|---|---|
1 | 2022-01-01 | Damiun Issabell | -679764420 | 2000-06-17 09:33:00 |
2 | 2022-01-02 | Onesti Miyona | -665991540 | 2000-11-24 18:21:00 |
3 | 2022-01-03 | Tristram Faatima | -665407200 | 2000-12-02 12:40:00 |
4 | 2022-01-04 | Evodio Dejuane | -678727920 | 2000-07-02 09:28:00 |
5 | 2022-01-05 | Yazin Desai | -665049480 | 2000-12-08 16:02:00 |
6 | 2022-01-06 | Jessamarie Quillen | -664541520 | 2000-12-15 13:08:00 |
7 | 2022-01-07 | Carra Raimy | -665548620 | 2000-12-04 21:23:00 |
8 | 2022-01-08 | Marcandrew Kayelee | -676239960 | 2000-08-04 04:34:00 |
Example converting multiple date-time columns
If there were multiple time series columns associated with each id
, they could all be shifted by the appropriate shift
to maintain relative timing.
load("biliData.Rda")
<- enclist %>% # start with fully-identified dataset
enclist_anon select(
# select patient identifier
PatientID, contains("DTS") # select multiple date-time columns
%>%
) mutate(
# convert all columns containing "DTS" from string to POSIXct
across(contains("DTS"), function(x) { as.POSIXct(strptime(x, format = '%Y-%m-%d %H:%M:%S')) } )
%>%
) mutate(
PatientID = random_names(PatientID, salt = runif(1)), # anonymize identifiers
.shift = random_timeshift(BirthDTS, PatientID, salt = runif(1)), # generate per-individual time shift
across(contains("DTS"), function(x) {x + .shift} ) # apply time shifts
%>%
) select(-.shift) # remove the time shifts
The resulting dataset has now been anonymized, but the relative timing between the dates of birth, hospital admission, and hospital discharge are all maintained.
> summary(enclist_anon)
PatientID BirthDTS HospitalAdmitDTS HospitalDischargeDTS
Length:23649 Min. :2000-01-01 01:14:00 Min. :2000-01-01 01:14:00 Min. :2000-01-02 05:27:00
Class :character 1st Qu.:2000-04-03 00:10:00 1st Qu.:2000-04-17 01:18:30 1st Qu.:2000-04-20 02:58:30
Mode :character Median :2000-07-02 12:04:00 Median :2000-07-20 13:24:00 Median :2000-07-23 11:45:00
Mean :2000-07-02 02:26:36 Mean :2000-07-21 18:41:51 Mean :2000-07-24 22:54:20
3rd Qu.:2000-09-30 21:23:00 3rd Qu.:2000-10-21 10:54:30 3rd Qu.:2000-10-25 00:41:30
Max. :2000-12-31 23:25:00 Max. :2001-10-21 07:39:00 Max. :2001-10-21 20:21:00
NA's :450 NA's :662
> sample_n(enclist_anon, 8) %>% arrange(BirthDTS)
PatientID BirthDTS HospitalAdmitDTS HospitalDischargeDTS
1 Larod Dorita 2000-02-07 14:48:00 2000-02-07 14:48:00 2000-02-09 01:57:00
2 Indalecio Ninon 2000-03-01 02:04:00 2000-03-06 04:18:00 2000-03-06 12:23:00
3 Connal Abron 2000-04-19 13:53:00 2000-08-17 03:37:00 2000-08-17 05:03:00
4 Kaytlynne Lorren 2000-05-12 17:52:00 2000-05-12 17:52:00 2000-05-20 04:54:00
5 Turkessa Montse 2000-06-30 04:41:00 2000-06-30 04:41:00 2000-07-02 09:14:00
6 Venesta Sofie 2000-07-18 00:22:00 2000-07-18 00:22:00 2000-07-19 14:12:00
7 Edisson Teric 2000-07-24 21:42:00 2000-12-25 11:09:00 2000-12-25 13:58:00
8 Latonia Santrez 2000-10-17 09:27:00 2000-10-21 07:47:00 2000-10-21 20:51:00