[R] filtering a dataset down to a single partition (dplyr interface) prior to collection still accesses other partitions, leading to slow reads
Describe the bug, including details regarding any error messages, version, and platform.
I have a hive-style partitioned Parquet dataset, and each partition consists of only a single file, `part-0.parquet`. When I filter down to a single partition using the dplyr interface, `open_dataset` still ends up unnecessarily accessing 10 to 15 files, which leads to unexpectedly slow loads. The filtering itself is correct; it's the performance I'm concerned about.

I am using R 4.4.1 on macOS (Intel), with a precompiled version of Arrow, 17.0.0.1. I generated the partitioned dataset with the same version of Arrow and the arrow R package that is used to read it in. It was written using the default format version (2.4, as far as I know) and zstd compression, though I have just now rewritten the entire dataset with version 2.6 and am still getting the same behavior.
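For reference, the dataset was written along these lines (a sketch: `dataset_root` is a placeholder path, and the partition columns `cycle` and `country` are assumed names for illustration):

```r
library(arrow)

# write one Parquet file per hive-style partition,
# e.g. dataset_root/cycle=2021/country=DE/part-0.parquet
write_dataset(df, "dataset_root",
              partitioning = c("cycle", "country"),
              compression = "zstd",
              version = "2.6")
```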
This takes about 20 seconds for me:
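That is, reading the partition's single file directly with `read_parquet()`, roughly (the path is illustrative):

```r
library(arrow)

# read the one file in the target partition directly;
# note the partition columns themselves are not stored in the file
df <- read_parquet("dataset_root/cycle=2021/country=DE/part-0.parquet")
```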
whereas this takes about 50 seconds:
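That is, opening the dataset and filtering down to the same single partition with dplyr verbs before collecting, roughly:

```r
library(arrow)
library(dplyr)

# hive partitioning is auto-detected from the directory names
df <- open_dataset("dataset_root") |>
  filter(cycle == 2021, country == "DE") |>
  collect()
```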
and even this takes 40 seconds:
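That is, the same filtered read, but with a complete schema pre-specified so nothing needs to be inferred (`my_schema` stands in for the actual `schema()` object, which includes the partition columns):

```r
df <- open_dataset("dataset_root",
                   schema = my_schema,
                   unify_schemas = FALSE) |>
  filter(cycle == 2021, country == "DE") |>
  collect()
```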
and in fact not filtering down to a single partition seems to be faster, 35 seconds, even though it's reading 8 times as much data:
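Something like this, filtering on only one of the two partition keys so that all 8 cycles of a single country are read (a guess at the shape of the call):

```r
# all 8 cycles for one country: 8x the data, yet faster
df <- open_dataset("dataset_root") |>
  filter(country == "DE") |>
  collect()
```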
I know `open_dataset` is reading in so many files because I get about 10 or more (innocuous) "invalid metadata$r" errors (see #40423) for a single call to `open_dataset`. Possibly it is trying to unify the schemas by peeking into these other files, even though `unify_schemas = FALSE`, or perhaps it is ignoring (part of) the hive-style partitioning and resorting to scanning rows in order to filter? There are 8 cycles and over 30 countries in the dataset, though, so it doesn't look like it is going through the entire dataset either.

I realize that `open_dataset` has some overhead relative to `read_parquet` because it has to walk the directory structure etc., but if I filter down to a single partition, surely `open_dataset` should only access that particular partition?

I'm not sure, but I don't think "invalid metadata$r" itself is the problem: although the metadata contains the schema, as you can see above, I have also tried to load the data with a valid schema pre-specified.
Component(s)
R