case_when() Lacks Safe Handling for Unexpected Values #7653

ja-ortiz-uniandes · 2025-02-07T00:00:29Z

Currently, case_when() does not provide a built-in way to validate categorical inputs and throw an error when an unexpected value is encountered. The function requires all return values to have the same type, which makes it hard to use safely when an unexpected value may appear. It is also incompatible with stop() in most cases: as the example below shows, .default = stop() raises the error even when every value is matched.

This makes case_when() unsafe in cases where developers need both:

A normal transformation for known values
A hard error for unknown values

Reproducible Example:

library(dplyr)


replace_func <- function(x) {
  
  case_when(
    x == "A" ~ 1,
    x == "B" ~ 2,
    x == "C" ~ 3,
    
    # If there is a different value I want the function to throw an error
    # and stop the execution
    .default = stop(paste0("Invalid value", x))
  )
}

data <- tibble(x = c("A", "B", "A", "C"))

# This will throw an error - even though all values are specified in the function
data %>% mutate(new_x = replace_func(x))
# Expected behavior would be to return something like:

# A tibble: 4 x 2
#   x     new_x
#   <chr> <dbl>
# 1 A         1
# 2 B         2
# 3 A         1
# 4 C         3


# But for it to fail if there is a value not specified in the function
data1 <- tibble(x = c("A", "B", "A", "C", "D"))


# This should throw an error because the default value is stop() and the value
# "D" is not specified in the function
data1 %>% mutate(new_x = replace_func(x))

Currently, the only alternatives for handling unknown values in case_when() are:

A manual check after executing case_when(), which is an imperfect solution with unnecessary complexity or
Leaving .default = NA, which can lead to silent failures—an unknown value that should have been handled explicitly might be mistakenly transformed into NA instead of triggering an error.

Neither of these solutions is ideal.

Proposed Solution

I believe the default behavior should be something along the lines of .default = stop(paste0("Unknown value: ", x)). This would force users to explicitly handle unknown values within their program, ensuring safer data transformations. If users want to allow unknown values to default to NA, they should be required to specify it explicitly by using .default = NA or TRUE ~ NA. This approach would provide better safety by default, preventing unintended NA values from propagating due to missing mappings in case_when().

Would love to hear your thoughts on this!

philibe · 2025-02-07T16:26:26Z

In case_when() the .default parameter is expected to be a value, not a function.
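A small sketch of why .default = stop(...) cannot act as a guard: case_when() works on whole vectors, so the .default expression is evaluated up front rather than only for rows that fail to match. The vector x and the name result are made up for this illustration, and the error is captured with tryCatch() so the script keeps running.

```r
library(dplyr)

x <- c("A", "B")

# Both values are covered by a condition, yet the stop() call in .default
# is still evaluated, so the whole case_when() call errors.
result <- tryCatch(
  case_when(
    x == "A" ~ 1,
    x == "B" ~ 2,
    .default = stop("Invalid value")
  ),
  error = function(e) e
)

# TRUE when case_when() raised an error instead of returning c(1, 2)
inherits(result, "error")
```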

In SQL, a CASE WHEN ... END expression without an ELSE clause returns NULL by default.

https://en.wikipedia.org/wiki/SQL_syntax#Conditional_(CASE)_expressions
SQL tests WHEN conditions in the order they appear in the source. If the source does not specify an ELSE expression, SQL defaults to ELSE NULL.

Same in case_when():

https://dplyr.tidyverse.org/reference/case_when.html
case_when() is an R equivalent of the SQL "searched" CASE WHEN statement.

In dplyr::left_join() there is na_matches :

na_matches : Should two NA or two NaN values match?

na, the default, treats two NA or two NaN values as equal, like %in%, match(), and merge().

never treats two NA or two NaN values as different, and will never match them together or to any other values. This is similar to joins for database sources and to base::merge(incomparables = NA).

PS: I'm a simple user.

philibe · 2025-02-10T14:43:23Z

library(dplyr)

replace_func <- function(x) {
  
  res <- case_when(
    x == "A" ~ 1,
    x == "B" ~ 2,
    x == "C" ~ 3,
    
    # Unknown values fall through to NA here; the check below turns them
    # into an error
    .default = NA
  )
  
  if (any(is.na(res))) {
    stop(cat(paste0("Invalid value : ", x[which(is.na(res))]), sep="\n", fill=TRUE))
  }
  
  res
}

data  <- tibble(x = c("A", "B", "A", "C"))
data1 <- tibble(x = c("A", "B", "A", "C", "D"))
data2 <- tibble(x = c("A", "B", "E" ,"A", "C" , "D" ,"F"))

data  %>% mutate(new_x = replace_func(x))
#> # A tibble: 4 × 2
#>   x     new_x
#>   <chr> <dbl>
#> 1 A         1
#> 2 B         2
#> 3 A         1
#> 4 C         3

data1 %>% mutate(new_x = replace_func(x))
#> Invalid value : D
#> Error in `mutate()`:
#> ℹ In argument: `new_x = replace_func(x)`.
#> Caused by error in `replace_func()`:

data2 %>% mutate(new_x = replace_func(x))
#> Invalid value : E
#> Invalid value : D
#> Invalid value : F
#> Error in `mutate()`:
#> ℹ In argument: `new_x = replace_func(x)`.
#> Caused by error in `replace_func()`:

ja-ortiz-uniandes · 2025-02-10T21:58:40Z

@philibe Thank you. You are correct, this does have the expected behavior, but I think it's more of a workaround than a proper solution.

NA values are common in many data sets. Checking whether a specific value is in a data set does indeed throw an error, and it behaves as I described in the expected behavior section. However, what about the case where you do want to manipulate NA too? What will you do then? Assign a different default value and then check for that new default? I think this is not ideal. Not only is it cumbersome, requiring edits to different logic steps, it also makes the code harder to read and understand, and finally it is also, subjectively, inelegant. Being able to do .default = stop() would make understanding and editing the code simpler.

In these cases, what I ended up doing is using a named vector to do my replacements instead and then checking if there are any values not in the names of that vector. This is more code than it would be under case_when(), but it makes it clear to anyone reading the code that a value not in the vector will throw an error because it is not in the vector, not because it was the catch-all at the end. This makes maintaining the code easier: you only need to add new values to the vector, without altering any downstream or upstream logic (as opposed to using a default value, which requires you to alter 1. the value in case_when(), 2. the value in .default and 3. the value in the error check).
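The named-vector approach described above might look roughly like this. It is only a minimal sketch: lookup and replace_with_lookup() are hypothetical names, not something from dplyr, and NA inputs are treated as unknown values here.

```r
# Hypothetical lookup table: names are the known inputs, values the outputs.
lookup <- c(A = 1, B = 2, C = 3)

# Hypothetical helper: errors on any input that is not a name of `lookup`.
replace_with_lookup <- function(x, lookup) {
  unknown <- setdiff(unique(x), names(lookup))
  if (length(unknown) > 0) {
    stop("Unknown value(s): ", paste(unknown, collapse = ", "))
  }
  # Index by name, then drop the names so a plain vector is returned.
  unname(lookup[x])
}

known <- replace_with_lookup(c("A", "B", "A", "C"), lookup)

# An input outside the lookup table raises an error; it is captured here
# so the script keeps running.
unknown_result <- tryCatch(
  replace_with_lookup(c("A", "D"), lookup),
  error = function(e) conditionMessage(e)
)
```

Adding a new category then only means adding one element to lookup.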

Lastly, this workaround works fine inside a function, but in more direct scenarios it becomes harder to implement. For example, when doing mutate(new_x = case_when(...)) you would necessarily have to wrap case_when() in another function with either:

mutate(new_x = (function(x) {
  x <- case_when(...)
  if (any(x == 'default value')) {
    stop(...)
  }
  x
})(x))

or:

mutate(new_x = (\(x) {
  x <- case_when(...)
  if (any(x == 'default value')) {
    stop(...)
  }
  x
})(x))

or similar. Which when compared to

mutate(new_x = case_when(..., .default = stop(...)))

seems particularly convoluted.

philibe · 2025-02-13T12:29:56Z

Anyway, I wouldn't want the stop you are asking for to become the default behavior, only an optional feature. I like to choose case by case, as I need. :)

In a reprex or a small R Shiny app I understand, but in my big app with many menus I don't want everything to stop. Where that would otherwise happen, I create functions at the lowest level of my application that wrap the original function in a conditional tryCatch().

RaymondBalise · 2025-02-14T16:27:40Z

I am also dealing with this. I have data on surgical complications that are coming in as NA = "No surgery yet", 1 = "Yes", 2 = "No". My code to handle it looks like this:

library(dplyr)

demo <- data.frame(
  complications = c(2, 2, 1, 3, NA)
)

demo |>
  mutate(
    complications_lgl = case_when(
      is.na(complications) ~ NA,
      complications == 1 ~ TRUE,
      complications == 2 ~ FALSE
    )
  ) %>% # base R pipe does not work here
  {
    unmatched <- 
      .$complications[is.na(.$complications_lgl) & !is.na(.$complications)] 
    unmatched <- unique(unmatched)
    if (length(unmatched) > 0) {
      stop("Unmatched cases for values: ", paste(unmatched, collapse = ", "))
    }
    .
  }

Turning this into a function makes my head spin, and the current version will make my novice students' heads explode. So add me to the list of people looking for a built-in solution.
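One possible way to package the check above into a reusable step is sketched below. stop_if_unmatched() is a hypothetical helper, not part of dplyr, and it keeps the same assumption as the code above: an NA output for a non-NA input means nothing matched. A mapping that deliberately turns a real value into NA would be flagged as well.

```r
library(dplyr)

# Hypothetical helper: errors when a non-missing input ended up as NA.
stop_if_unmatched <- function(result, input) {
  unmatched <- unique(input[is.na(result) & !is.na(input)])
  if (length(unmatched) > 0) {
    stop("Unmatched cases for values: ", paste(unmatched, collapse = ", "))
  }
  result
}

demo_ok <- data.frame(complications = c(2, 2, 1, NA))

checked <- demo_ok |>
  mutate(
    complications_lgl = case_when(
      complications == 1 ~ TRUE,
      complications == 2 ~ FALSE
    ) |>
      stop_if_unmatched(complications)
  )
```

Because the helper takes the result as its first argument, it works with the base R pipe directly inside mutate(). A value such as 3 in complications would stop the pipeline with the same message as above.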

ja-ortiz-uniandes · 2025-02-19T21:22:25Z

@RaymondBalise thanks appreciate the support!
@philibe this would be a breaking change yes, but I would hope you are using renv, anaconda or similar to make sure your code is robust to updates. If your project is large as you say, chances are something will break eventually. Nobody wants to re-write their code base, but if this becomes default behaviour in future updates, I believe it would help all future cases. Future developers would avoid cases in which their data is replaced with NAs without them realizing it. Errors that run perfectly without even a warning are particularly insidious.

philibe · 2025-02-21T00:03:08Z

Yes, you're right, my concern is primarily that it is a breaking change.

In Rust (I don't really use it, I have only done beginner proofs of concept), everything is very strict, but the rules are there from the beginning. Base R and dplyr < 1.1 are permissive and leave the responsibility to the developer; since dplyr 1.1, more and more seat belts are being locked.

It is a paradigm change: for me it should have been a major version 2. My big app has been evolving for 7 years, and I won't rewrite everything. :)

philibe · 2025-02-21T00:16:06Z

But I understand the need for stricter control for the sake of robustness, just not in a minor version.

RaymondBalise · 2025-02-21T15:30:19Z

I completely agree that making this a breaking change would be a bad idea.

I think the way to handle this is to add an option to throw an error (or warning) if there are untrapped (unexpected) values/conditions. The default should be set to have that option be FALSE.

That would allow people to continue to have the same behavior. That is, untrapped conditions will continue to be set to NA by default. For people like me, who are trying to enumerate all the legal values, we could trap bad or unexpected data if a warn_on_unexpected_value or error_on_unexpected_value argument is set to TRUE.
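To make the opt-in idea concrete, here is a rough sketch of a user-level wrapper. Everything in it is an assumption for illustration: case_when_checked() and its .error_on_unmatched argument do not exist in dplyr, and the wrapper evaluates the left-hand sides a second time using rlang helpers, which a real implementation would presumably avoid.

```r
library(dplyr)

# Hypothetical wrapper, not part of dplyr. `.error_on_unmatched` is an
# illustrative argument name modelled on the proposal above.
case_when_checked <- function(..., .error_on_unmatched = FALSE) {
  if (.error_on_unmatched) {
    # Keep only the formula arguments (ignores e.g. a named .default).
    formulas <- Filter(rlang::is_formula, list(...))
    # Evaluate each left-hand side in the environment of its formula.
    conditions <- lapply(formulas, function(f) {
      eval(rlang::f_lhs(f), rlang::f_env(f))
    })
    # A row counts as matched when at least one condition is TRUE;
    # NA conditions count as "no match".
    matched <- Reduce(`|`, lapply(conditions, function(cond) cond %in% TRUE))
    if (!all(matched)) {
      stop("No condition matched for ", sum(!matched), " row(s).")
    }
  }
  case_when(...)
}

x <- c("A", "B", "D")

# Default: same behavior as case_when(), "D" silently becomes NA.
lenient <- case_when_checked(x == "A" ~ 1, x == "B" ~ 2)

# Opt-in: the unmatched "D" raises an error, captured here as its message.
strict <- tryCatch(
  case_when_checked(x == "A" ~ 1, x == "B" ~ 2, .error_on_unmatched = TRUE),
  error = function(e) conditionMessage(e)
)
```

With the flag left at FALSE, existing code would keep working unchanged, which is the point of making the check optional.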

Alejandro-Ortiz-WBG · 2025-02-21T15:56:02Z

@RaymondBalise No one likes breaking changes, and their gravity is not lost in this discussion. The thing is, the issue here goes beyond a simple inconvenience: the point is to combat insidious silent errors. In typical tidyverse manner, perhaps case_when() could be superseded by a new function like value_map() with the safe handling. Users would get a warning to use the new function for several revisions.

RaymondBalise · 2025-02-21T16:16:48Z

@Alejandro-Ortiz-WBG I totally agree that the current design is dangerous. It reminds me of the old Excel "design feature" where it would silently null character values if they first showed up after the first eight records.

A new function would be a wonderful gift but it raises some interesting engineering issues because I think the guts of the new function would be very similar to case_when().

The social engineering is going to be a beast, but I also totally agree that steps should be taken to help guide people toward the new functionality. It would go a long way toward having people fall into a "pit of success" rather than stumbling along with unexpected missing values.

ja-ortiz-uniandes changed the title ~~case_when() does not fail safely, making it unsafe~~ case_when() Lacks Safe Handling for Unexpected Values Feb 7, 2025
