[R] filtering a dataset down to a single partition (dplyr interface) prior to collection still accesses other partitions, leading to slow reads #44725

debrouwere · 2024-11-14T09:45:36Z

Describe the bug, including details regarding any error messages, version, and platform.

I have a hive-style partitioned Parquet dataset, and each partition consists of only a single file, part-0.parquet. When I filter down to a single partition using the dplyr interface, open_dataset still ends up unnecessarily accessing 10 to 15 files, which leads to unexpectedly slow loads. The filtering itself is correct; it's the performance I'm concerned about.

I am using R 4.4.1 on macOS (Intel), with a precompiled version of Arrow, 17.0.0.1. I have generated the partitioned dataset with the same version of Arrow and the arrow R package that is used to read it in. It was written using the default version (2.4 as far as I know) and zstd compression, though I have just now rewritten the entire dataset with version 2.6 and am still getting the same behavior.

This takes about 20 seconds for me:

library(arrow)
library(dplyr)  # provides starts_with() and the dplyr verbs used below

since <- Sys.time()
assessments <- read_parquet('build/pisa.rx/cycle=2022/country=Belgium/part-0.parquet', col_select = starts_with('w_'))
until <- Sys.time()
until - since

whereas this takes about 50 seconds:

since <- Sys.time()
assessments <- open_dataset('build/pisa.rx') |>
  filter(country == 'Belgium', cycle == 2022) |>
  select(starts_with('w_')) |>
  collect()
until <- Sys.time()
until - since

and even this takes 40 seconds:

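# schema taken from the data frame collected in the previous snippet
rx_schema <- assessments |> schema()

# `partitioning` is not defined in this snippet; it is assumed to be an
# object holding the partition specification, created beforehand
since <- Sys.time()
assessments <- open_dataset('build/pisa.rx',
                            hive_style = TRUE,
                            partitioning = partitioning,
                            unify_schemas = FALSE,
                            format = 'parquet',
                            schema = rx_schema) |>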
  filter(country == 'Belgium', cycle == 2022) |>
  select(starts_with('w_')) |>
  collect()
until <- Sys.time()
until - since

and in fact not filtering down to a single partition seems to be faster, 35 seconds, even though it's reading 8 times as much data:

since <- Sys.time()
assessments <- open_dataset('build/pisa.rx') |>
  filter(country == 'Belgium') |>
  select(starts_with('w_')) |>
  collect()
until <- Sys.time()
until - since

I know open_dataset is reading in so many files because I get about 10 or more (innocuous) "invalid metadata$r" errors (see #40423) for a single call to open_dataset. Possibly it is trying to unify the schemas by peeking into these other files, even though unify_schemas = FALSE. Or perhaps it is ignoring (part of) the hive-style partitioning and resorting to scanning the rows in order to filter? There are 8 cycles and over 30 countries in the dataset, though, so it doesn't look like it is going through the entire dataset either.

I realize that open_dataset has some overhead relative to read_parquet because it has to walk the directory structure etc., but if I filter down to a single partition surely open_dataset should only access that particular partition?
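To make that expectation concrete, here is a minimal base R sketch of which files a fully pruned read would need to open, derived purely from the hive-style directory names. The helper `expected_partition_files()` and the demo directory are hypothetical and not part of arrow; the sketch also assumes partition values appear unencoded in the paths.

```r
# Hypothetical helper: given a hive-style dataset root and partition values,
# return the relative paths of the files a fully pruned read would touch.
expected_partition_files <- function(root, ...) {
  wanted <- list(...)
  files <- list.files(root, pattern = "\\.parquet$", recursive = TRUE)
  keep <- vapply(files, function(f) {
    # Directory segments look like "cycle=2022"; the file name is dropped.
    segments <- strsplit(dirname(f), "/", fixed = TRUE)[[1]]
    kv <- strsplit(segments, "=", fixed = TRUE)
    values <- setNames(
      vapply(kv, `[`, character(1), 2),
      vapply(kv, `[`, character(1), 1)
    )
    # A file matches when every requested key has the requested value.
    all(vapply(names(wanted), function(k) {
      identical(unname(values[k]), as.character(wanted[[k]]))
    }, logical(1)))
  }, logical(1))
  unname(files[keep])
}

# Demo layout with empty placeholder files (2 cycles x 3 countries).
root <- file.path(tempdir(), "pisa_demo")
for (cycle in c(2018, 2022)) {
  for (country in c("Belgium", "France", "Japan")) {
    dir <- file.path(root, paste0("cycle=", cycle), paste0("country=", country))
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    file.create(file.path(dir, "part-0.parquet"))
  }
}

# Filtering on both keys should leave exactly one file to read.
expected_partition_files(root, cycle = 2022, country = "Belgium")

# Filtering on country alone leaves one file per cycle.
expected_partition_files(root, country = "Belgium")
```

Comparing the length of this list against the number of "invalid metadata$r" messages would show how many extra files are being opened beyond the one that is strictly needed.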

I'm not sure, but I don't think "invalid metadata$r" itself is the problem: although the metadata contains the schema, as you can see above I have also tried to load the data with a valid schema pre-specified.
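Since the real dataset cannot be shared here, the following is a sketch of a self-contained reproduction on synthetic data. The column names (`w_total`, `score`), the sizes, and the temporary directory are made up for illustration; the dataset is far too small to reproduce the timings above, so this only shows the shape of the comparison between a direct file read and a filtered dataset read.

```r
library(arrow)
library(dplyr)

# Synthetic stand-in for the real data; column names are hypothetical.
demo <- expand.grid(
  cycle = c(2018L, 2022L),
  country = c("Belgium", "France", "Japan"),
  row = 1:100,
  stringsAsFactors = FALSE
)
demo$w_total <- runif(nrow(demo))
demo$score <- rnorm(nrow(demo))

# Write a hive-style partitioned dataset (cycle=.../country=.../part-0.parquet).
root <- file.path(tempdir(), "rx_demo")
write_dataset(demo, root, format = "parquet",
              partitioning = c("cycle", "country"))

ds <- open_dataset(root)
length(ds$files)  # number of files the dataset discovered

# Direct read of the single partition file.
t_direct <- system.time(
  direct <- read_parquet(
    file.path(root, "cycle=2022", "country=Belgium", "part-0.parquet"),
    col_select = starts_with("w_")
  )
)

# Same rows via the dataset interface, filtered down to one partition.
t_dataset <- system.time(
  filtered <- ds |>
    filter(country == "Belgium", cycle == 2022) |>
    select(starts_with("w_")) |>
    collect()
)

# Both routes should return the same number of rows.
stopifnot(nrow(direct) == nrow(filtered))
rbind(direct = t_direct, dataset = t_dataset)[, "elapsed", drop = FALSE]
```

Scaling up the number of partitions and rows in `demo` may be needed before any difference between the two routes becomes measurable.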

Component(s)

R

debrouwere added the Type: bug label Nov 14, 2024

github-actions bot added the Component: R label Nov 14, 2024
