BlakeOrth
Contributor

Which issue does this PR close?

This does not fully close, but is an incremental building block component for:

The full context of how this code is likely to progress can be seen in the POC for this effort:

Rationale for this change

Instruments more of the methods that the instrumented object store does not yet cover.

What changes are included in this PR?

  • Adds instrumentation around put_opts
  • Adds instrumentation around put_multipart
  • Adds tests for newly instrumented methods
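
For readers who have not followed the POC, the general pattern behind these changes can be sketched roughly as below. This is a simplified, synchronous stand-in and not the code in this PR: the real object store methods are async, and `Store`, `NullStore`, `InstrumentedStore`, `RequestDetails`, and `take_requests` are hypothetical names chosen for illustration. Only the recorded fields (operation, path, timestamp, duration, size) are modeled on the trace output.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant, SystemTime};

/// Hypothetical operation kinds; other variants are omitted in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    Put,
}

/// Hypothetical record of one request against the store.
#[derive(Debug, Clone)]
pub struct RequestDetails {
    pub op: Operation,
    pub path: String,
    pub timestamp: SystemTime,
    pub duration: Option<Duration>,
    pub size: Option<usize>,
}

/// Stand-in for the wrapped store (assumption: a single synchronous `put`).
pub trait Store {
    fn put(&self, path: &str, payload: &[u8]) -> Result<(), String>;
}

/// Trivial inner store that accepts every payload, used only for illustration.
pub struct NullStore;

impl Store for NullStore {
    fn put(&self, _path: &str, _payload: &[u8]) -> Result<(), String> {
        Ok(())
    }
}

/// Wrapper that times each call to the inner store and records the details.
pub struct InstrumentedStore<S: Store> {
    inner: S,
    requests: Mutex<Vec<RequestDetails>>,
}

impl<S: Store> InstrumentedStore<S> {
    pub fn new(inner: S) -> Self {
        Self { inner, requests: Mutex::new(Vec::new()) }
    }

    /// Forwards to the inner store; only successful calls are recorded here.
    pub fn put(&self, path: &str, payload: &[u8]) -> Result<(), String> {
        let timestamp = SystemTime::now();
        let start = Instant::now();
        self.inner.put(path, payload)?;
        let elapsed = start.elapsed();
        self.requests.lock().unwrap().push(RequestDetails {
            op: Operation::Put,
            path: path.to_string(),
            timestamp,
            duration: Some(elapsed),
            size: Some(payload.len()),
        });
        Ok(())
    }

    /// Drains and returns everything recorded so far.
    pub fn take_requests(&self) -> Vec<RequestDetails> {
        std::mem::take(&mut *self.requests.lock().unwrap())
    }
}
```

Keeping the recorded requests behind a `Mutex` is one possible design choice that lets the wrapper be called through a shared reference; the actual implementation may differ.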

Are these changes tested?

Yes. Unit tests have been added for the new methods.

Example output:

DataFusion CLI v50.2.0
> CREATE EXTERNAL TABLE
test(a bigint, b bigint)
STORED AS parquet LOCATION '../../test_table/';
0 row(s) fetched.
Elapsed 0.003 seconds.

> \object_store_profiling trace
ObjectStore Profile mode set to Trace
> INSERT INTO test values (1, 2), (3, 4);
+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Trace, inner: LocalFileSystem(file:///)
2025-10-17T19:02:15.440246215+00:00 operation=List path=home/blake/open_source_src/datafusion-BlakeOrth/test_table
2025-10-17T19:02:15.444096012+00:00 operation=Put duration=0.000249s size=815 path=home/blake/open_source_src/datafusion-BlakeOrth/test_table/a9pjKBxSOtXZobJO_0.parquet

Summaries:
List
count: 1

Put
count: 1
duration min: 0.000249s
duration max: 0.000249s
duration avg: 0.000249s
size min: 815 B
size max: 815 B
size avg: 815 B
size sum: 815 B

>
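
The min/max/avg/sum figures in the summaries above could be derived along these lines. This is only a sketch of the arithmetic, assuming the recorded durations and sizes have already been collected into slices; `DurationSummary`, `SizeSummary`, `summarize_durations`, `summarize_sizes`, and `format_duration` are hypothetical names and not the ones used in the PR.

```rust
use std::time::Duration;

/// Hypothetical summary of recorded durations for one operation type.
#[derive(Debug, Clone, PartialEq)]
pub struct DurationSummary {
    pub count: usize,
    pub min: Duration,
    pub max: Duration,
    pub avg: Duration,
}

/// Hypothetical summary of recorded payload sizes, in bytes.
#[derive(Debug, Clone, PartialEq)]
pub struct SizeSummary {
    pub count: usize,
    pub min: usize,
    pub max: usize,
    pub avg: usize,
    pub sum: usize,
}

/// Returns `None` when nothing was recorded, so no division by zero can occur.
pub fn summarize_durations(durations: &[Duration]) -> Option<DurationSummary> {
    let min = *durations.iter().min()?;
    let max = *durations.iter().max()?;
    let sum: Duration = durations.iter().sum();
    let count = durations.len();
    Some(DurationSummary { count, min, max, avg: sum / count as u32 })
}

/// Same idea for sizes; the average uses integer division.
pub fn summarize_sizes(sizes: &[usize]) -> Option<SizeSummary> {
    let min = *sizes.iter().min()?;
    let max = *sizes.iter().max()?;
    let sum: usize = sizes.iter().sum();
    let count = sizes.len();
    Some(SizeSummary { count, min, max, avg: sum / count, sum })
}

/// Formats a duration as seconds with microsecond precision.
pub fn format_duration(d: Duration) -> String {
    format!("{:.6}s", d.as_secs_f64())
}
```

With a single recorded request, as in the output above, min, max, and avg are all the same value.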

(Note: I have no idea how to exercise or show a multipart put operation, or whether DataFusion even uses multipart puts for large files.)

Are there any user-facing changes?

No-ish

cc @alamb

op: Operation::Put,
path: location.clone(),
timestamp,
duration: Some(elapsed),
Contributor Author

I'm feeling a bit torn on using a duration here. Unlike list(), this duration is accurate for what's happening; however, I fear it may be misleading. The duration for a multipart put is only the time spent initiating a multipart put session with the backing store. It can't capture the true duration of uploading any data, which is what I think a user would expect.
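
To make the caveat concrete, here is a toy model of it. It deliberately uses a fake logical clock instead of real time so the numbers are deterministic. Everything in it (`FakeClock`, `MultipartUpload`, `instrumented_put_multipart`, and the tick costs) is hypothetical and only illustrates where the measurement stops; it does not reproduce the object store API.

```rust
use std::cell::Cell;

/// Hypothetical logical clock: simulated store calls advance it by a fixed cost.
pub struct FakeClock {
    now: Cell<u64>,
}

impl FakeClock {
    pub fn new() -> Self {
        Self { now: Cell::new(0) }
    }

    pub fn now(&self) -> u64 {
        self.now.get()
    }

    pub fn advance(&self, ticks: u64) {
        self.now.set(self.now.get() + ticks);
    }
}

/// Assumed costs, in ticks, chosen only for illustration.
pub const INITIATE_COST: u64 = 1;
pub const PART_COST: u64 = 10;

/// Handle returned once the multipart session has been initiated.
pub struct MultipartUpload<'a> {
    clock: &'a FakeClock,
    pub parts: usize,
}

impl<'a> MultipartUpload<'a> {
    /// Uploading a part happens after the instrumented call has returned.
    pub fn put_part(&mut self) {
        self.clock.advance(PART_COST);
        self.parts += 1;
    }
}

/// Times only the initiation step and returns the handle plus the recorded duration.
pub fn instrumented_put_multipart(clock: &FakeClock) -> (MultipartUpload<'_>, u64) {
    let start = clock.now();
    clock.advance(INITIATE_COST); // simulated round trip to open the session
    let upload = MultipartUpload { clock, parts: 0 };
    let recorded = clock.now() - start;
    (upload, recorded)
}
```

In this model, uploading three parts after initiation moves the clock to 31 ticks while the recorded duration stays at 1 tick, which is the gap a user would likely find surprising.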

Contributor

How about we log a ticket to try to make the accounting more accurate?

In general, I think getting this level of timing might be a better case for https://github.com/datafusion-contrib/datafusion-tracing

Contributor Author

Yes, that's probably a good way to address this. I noted that you made one for tracking duration for list and this probably falls into a similar category.

Prior to closing the overarching issue for these PRs, should we document some of the current caveats for duration metrics?

Contributor

Prior to closing the overarching issue for these PRs, should we document some of the current caveats for duration metrics?

That sounds good to me -- I suggest putting it in the code (not just the docs) so it is more discoverable -- maybe a note after the summary output?


Object Store Profiling
....
Put
count: 1
duration min: 0.000249s
duration max: 0.000249s
duration avg: 0.000249s
size min: 815 B
size max: 815 B
size avg: 815 B
size sum: 815 B

*** NEW ***
Note: Duration for multipart PUT is time spent initiating a multipart PUT session with the backing store. It does not include the time to actually upload data.
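
One way the suggestion above might be wired into the summary output is sketched below. The note text is taken from the comment; `MULTIPART_DURATION_NOTE`, `render_put_summary`, and the `include_note` flag are hypothetical, and whether the note is printed always or only when relevant is left open here.

```rust
/// Note text as proposed in the review comment.
pub const MULTIPART_DURATION_NOTE: &str = "Note: Duration for multipart PUT is time spent initiating a multipart PUT session with the backing store. It does not include the time to actually upload data.";

/// Renders the Put section of the summary and optionally appends the note.
/// `summary_lines` are the already formatted lines such as "count: 1".
pub fn render_put_summary(summary_lines: &[&str], include_note: bool) -> String {
    let mut out = String::from("Put\n");
    for line in summary_lines {
        out.push_str(line);
        out.push('\n');
    }
    if include_note {
        // Blank line between the statistics and the note, as in the mock-up.
        out.push('\n');
        out.push_str(MULTIPART_DURATION_NOTE);
        out.push('\n');
    }
    out
}
```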

@alamb alamb left a comment

Looks good to me -- thank you @BlakeOrth

alamb commented Oct 17, 2025

I merged up from main to resolve a conflict

@alamb alamb added this pull request to the merge queue Oct 18, 2025
Merged via the queue into apache:main with commit 93f136c Oct 18, 2025
28 checks passed