Duplicated rows in Parquet files #163
Comments
Thanks for the report! In principle all Postgres threads should run in the same transaction snapshot as explained here. However, not all Postgres servers support this, and the feature is disabled for e.g. AWS Aurora. If you repeat the run with the debug option enabled you can verify whether the snapshot is actually being set up. As a work-around you can then disable multi-threading.
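For reference, a single-threaded re-run can be forced from the DuckDB side before the export; the connection string and names below are placeholders, not values from this thread:

```sql
-- Attach the Postgres database read-only (placeholder DSN).
ATTACH 'dbname=mydb host=localhost user=me' AS pg (TYPE POSTGRES, READ_ONLY);

-- Limit DuckDB to a single thread so only one Postgres connection/scan is used.
SET threads TO 1;

-- Re-run the export single-threaded (placeholder table and file names).
COPY (SELECT * FROM pg.public.some_table)
  TO 'some_table.parquet' (FORMAT PARQUET);
```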
Thanks - we're running on regular Postgres (hosted by Crunchybridge) and I can see those statements executing when I turn on the debug option.
This job runs nightly and possibly overlaps with a full logical backup. Is there anything that could interact there?
Interestingly, I just took last night's dump, which was a single-threaded operation that output a single Parquet file for each table, and found the same problem:

```sql
SELECT id, ROW_NUMBER() OVER (PARTITION BY id)
FROM read_parquet('table.parquet')
QUALIFY ROW_NUMBER() OVER (PARTITION BY id) > 1;
```
┌──────────┬─────────────────────────────────────┐
│ id │ row_number() OVER (PARTITION BY id) │
│ int64 │ int64 │
├──────────┼─────────────────────────────────────┤
│ 61122606 │ 2 │
│ 61150142 │ 2 │
│ 61009086 │ 2 │
│ 58830824 │ 2 │
│ 60938241 │ 2 │
│ 60944502 │ 2 │
│ 61145921 │ 2 │
│ 61181739 │ 2 │
│ 60944526 │ 2 │
│ 61131057 │ 2 │
│ 60944530 │ 2 │
│ 59597315 │ 2 │
│ 61135999 │ 2 │
│ 61150647 │ 2 │
│ 61150686 │ 2 │
│ 60944567 │ 2 │
│ 61157702 │ 2 │
│ 61181778 │ 2 │
│ 60922176 │ 2 │
│ 60944457 │ 2 │
│ 60921690 │ 2 │
│ 61129287 │ 2 │
│ 61150691 │ 2 │
│ 60370107 │ 2 │
│ 61132247 │ 2 │
│ 60366945 │ 2 │
│ 60938573 │ 2 │
│ 60944543 │ 2 │
│ 59268777 │ 2 │
│ 60938539 │ 2 │
│ 60938545 │ 2 │
│ 60923318 │ 2 │
│ 60938808 │ 2 │
│ 61130376 │ 2 │
│ 60938569 │ 2 │
│ 61181722 │ 2 │
│ 61161559 │ 2 │
│ 60944418 │ 2 │
├──────────┴─────────────────────────────────────┤
│ 38 rows 2 columns │
└────────────────────────────────────────────────┘

Performing a similar query against production produces zero results:

```sql
SELECT
  id,
  row_number
FROM (
  SELECT
    id,
    row_number() OVER (PARTITION BY id) AS row_number
  FROM
    table) t
WHERE
  row_number > 1
```
That's interesting. How many rows are in the table, and what kind of operations are running over the system in parallel (only insertions, or also updates/deletes)? For sanity - could you perhaps try writing to a DuckDB table and checking whether that produces the same problem, to ensure this is not triggering an issue in the Parquet writer?
Sure thing - what's the best way to execute that to achieve the same effect? I know how to do it one way, but I'm not sure if that tests the same path?
Yes, that tests the same path from the Postgres reader's point of view, so that would be great. Perhaps try running with a single thread again to isolate as much as possible?
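A minimal sketch of that kind of test, assuming the Postgres database is attached as `pg` and using placeholder names (not the exact commands from this thread):

```sql
-- Single-threaded, so only one Postgres connection/scan is involved.
SET threads TO 1;

-- Materialize the Postgres table into a local DuckDB table instead of Parquet.
CREATE OR REPLACE TABLE local_copy AS
SELECT * FROM pg.public.some_table;

-- Check for duplicated ids in the DuckDB table.
SELECT id, COUNT(*) AS n
FROM local_copy
GROUP BY id
HAVING COUNT(*) > 1;
```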
I just ran that end to end twice into a DuckDB database and found no duplicates either time. The data is moving constantly with inserts/deletes/updates; I estimate 20% of rows change on a daily basis. But it sounds like maybe the Parquet writer might be at fault here?
That is possible. It could also be some connection between the Postgres scanner and the Parquet writer. Could you try dumping the contents of the DuckDB file to Parquet and seeing if there are still no duplicate rows?
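A sketch of that round trip, reusing the placeholder names from the sketch above:

```sql
-- Export the DuckDB copy of the table back out to Parquet.
COPY local_copy TO 'local_copy.parquet' (FORMAT PARQUET);

-- Re-run the duplicate check against the freshly written Parquet file.
SELECT id, COUNT(*) AS n
FROM read_parquet('local_copy.parquet')
GROUP BY id
HAVING COUNT(*) > 1;
```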
I had to
@Mytherin any more thoughts on this? 🙏
Could you try
These have been running with
I don't know if it's useful, but I'd be happy to send you one of the sample Parquet files privately.
Unfortunately this is still present in DuckDB 0.10.0. |
Could you send over one of the broken Parquet files ([email protected])?
Another thing I can think of - could you perhaps export the result of
I have this issue as well. From what I can tell it is only an issue with rows that were updated during the COPY process. I debugged the queries that are executed on Postgres from DuckDB and found that when multiple connections are used, the first connection starts a transaction with:
All other connections use:
A read committed transaction can see updates committed by other transactions, so I think this is what is causing the issue. Furthermore, this will be a problem even if we limit threads, and also if we limit the number of Postgres connections.
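For background, the difference matters because of how Postgres snapshots work (the statements below are illustrative, not the extension's exact queries):

```sql
-- Read committed: every statement takes a fresh snapshot, so a row moved by a
-- concurrent UPDATE can be returned by two different ctid-range scans issued
-- in the same transaction, i.e. appear twice in the export.
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED READ ONLY;
SELECT count(*) FROM some_table;  -- snapshot A
SELECT count(*) FROM some_table;  -- snapshot B, may already see new changes
COMMIT;

-- Repeatable read: one snapshot is pinned for the whole transaction, so every
-- scan sees the same consistent view of the table.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY;
SELECT count(*) FROM some_table;  -- snapshot X
SELECT count(*) FROM some_table;  -- still snapshot X
COMMIT;
```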
That's interesting - thanks for investigating! Considering the problem also occurred in single-threaded operation, the isolation level could definitely be a likely culprit. I've pushed a PR that switches the main transaction to use repeatable read here - #207. Perhaps you could give that a shot and see if it resolves the issue?
Nice @Mytherin, it's in my interest to get a potential fix out asap, so I'm happy to test it out. What's the easiest way to test your PR? Or did you mean waiting for the nightly build?
You should be able to fetch the artifacts from the PR itself - see https://github.com/duckdb/postgres_scanner/actions/runs/8601970047?pr=207 (scroll down to "Artifacts").
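A minimal sketch of loading such an artifact as an unsigned extension, assuming the DuckDB CLI and a placeholder path for the downloaded build:

```sql
-- Start the CLI with unsigned extensions allowed: duckdb -unsigned
-- Then load the extension binary downloaded from the PR artifacts:
LOAD '/path/to/postgres_scanner.duckdb_extension';
```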
Thanks a lot @Mytherin, I was able to load the unsigned extension and try this out. I ran a single connection without any threading to maximise the likelihood of changes being applied during the different COPY statements with ctid ranges, and concurrently ran a script to update a bunch of rows in Postgres. I was able to verify that with the current extension in 0.10.1 duplicates from the different transactions are created, but with the new extension from the PR this was not the case. Let me know if you need anything else here - I'd really like to help push this out asap.
Thanks so much for spotting the underlying issue here @noppaz!
Thanks - I've merged the PR. I can push out a new version of the extension tomorrow after the nightly builds.
I've published the nightly.
What happens?
We run a nightly process to dump some Postgres tables into Parquet files. Sometimes we see a handful of rows duplicated in the output. In last night's example, we saw 33 rows out of 5,396,400 with a duplicate copy. This might be unavoidable with PER_THREAD_OUTPUT enabled against a moving data set, but it might be worth documenting this. Does it also mean some rows might be missing?
To Reproduce
I think this might be very difficult to reproduce, but our script looks something like this:
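Roughly (the connection string, table, and output names below are placeholders rather than the exact production script):

```sql
-- Attach the production Postgres database read-only (placeholder DSN).
ATTACH 'dbname=mydb host=db.example.com user=readonly' AS pg
  (TYPE POSTGRES, READ_ONLY);

-- Dump each table to Parquet, one file per thread (placeholder names).
COPY (SELECT * FROM pg.public.some_table)
  TO 'some_table'
  (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);
```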
Then:
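A duplicate check against the resulting files, along the lines of the query shown earlier in the thread (placeholder path):

```sql
SELECT id, ROW_NUMBER() OVER (PARTITION BY id)
FROM read_parquet('some_table/*.parquet')
QUALIFY ROW_NUMBER() OVER (PARTITION BY id) > 1;
```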
OS:
Linux
PostgreSQL Version:
14.7
DuckDB Version:
0.9.2
DuckDB Client:
CLI
Full Name:
Tom Taylor
Affiliation:
Breakroom
Have you tried this on the latest main branch?
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?