Skip to content

Please add support for AWS Redshift (based on PostgreSQL 8.0.2) #181

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 tasks done
CedricYauLBD opened this issue Feb 14, 2024 · 14 comments · Fixed by #194
Open
2 tasks done

Please add support for AWS Redshift (based on PostgreSQL 8.0.2) #181

CedricYauLBD opened this issue Feb 14, 2024 · 14 comments · Fixed by #194

Comments

@CedricYauLBD
Copy link

What happens?

When I try to use the postgres extension with AWS Redshift I get an error message:

Error: Invalid Error: Failed to execute query "SELECT CURRENT_SETTING('server_version'), (SELECT COUNT(*) FROM pg_settings WHERE name LIKE 'rds%')": ERROR:  must be superuser to examine "server_version"

Based on https://stackoverflow.com/questions/51422236/what-role-permission-needed-for-the-user-to-get-the-server-version-in-amazon-red

I can use SELECT version(); on Redshift.

cyau@redshift:dev> SELECT version();
+---------------------------------------------------------------------------------------------------------------------------+
| version                                                                                                                   |
|---------------------------------------------------------------------------------------------------------------------------|
| PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.62878 |
+---------------------------------------------------------------------------------------------------------------------------+
SELECT 1
Time: 0.043s

This appears to be caused by the query here:
https://github.com/duckdb/postgres_scanner/blob/883a8f1a8a487264855a5166f7df1f46ad386434/src/postgres_connection.cpp#L122

To Reproduce

v0.10.0 20b1486d11
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D attach 'dbname=mydb user=username host=redshift.server.address password=password port=5439' as db (TYPE postgres);
Error: Invalid Error: Failed to execute query "SELECT CURRENT_SETTING('server_version'), (SELECT COUNT(*) FROM pg_settings WHERE name LIKE 'rds%')": ERROR:  must be superuser to examine "server_version"

OS:

Ubuntu 22.04 LTS

PostgreSQL Version:

Redshift (based on PostgreSQL 8.0.2)

DuckDB Version:

v0.10.0 20b1486d11

DuckDB Client:

CLI

Full Name:

Cedric Yau

Affiliation:

LinebackerData

Have you tried this on the latest main branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@minaguib
Copy link

minaguib commented Mar 1, 2024

Would LOVE to see this happen (especially if the version check is the only real blocker)

@Mytherin
Copy link
Contributor

I've modified the version check to use version() instead in #194. I've tried to get a redshift server up and running myself but didn't manage in a reasonable time frame. If you could give that PR a shot and see if it fixes the issue that would be great.

@minaguib
Copy link

@Mytherin I downloaded and built from source, but am running into an issue when invoking ATTACH:

connection to server at "ZZZZ, port 5439 failed: FATAL:  no PostgreSQL user name specified in startup packet

This happens both in main and redshift branches, so I'm 100% sure it's before the actual version check occurs and might be something related to my local env (google checks hints at some openssl macos VS homebrew issues).

I'll figure it out then confirm the logical fix.

Mytherin added a commit that referenced this issue Mar 19, 2024
Fix #181 - use version() instead, and if version query fails default to unknown version
@Mytherin
Copy link
Contributor

I've merged it into main - you should be able to get a nightly build for v0.10.1 after CI completes (in ~1 hour maybe?) with the following command:

force install postgres_scanner from 'http://nightly-extensions.duckdb.org';

@minaguib
Copy link

@Mytherin Success! ATTACH succeeds, no errors :)

Unfortunately I can't 1. introspect the schema (via .schema) or 2. run a simple query (which internally wants to introspect schema)

": ERROR:  relation "pg_enum" does not exist

If this is something you'd like to pursue supporting, LMK if you'd like additional details here or in a new ticket.

@Mytherin
Copy link
Contributor

I've pushed a fix for that here - #197 - that should disable pg_enum for old postgres versions (before 8.3 when support for pg_enum was added). If you could try it out that would be great.

@minaguib
Copy link

@Mytherin Close! But no cigar.

  • ATTACH works!
  • .schema introspection works!
  • Simple query fails
D select count(*) from redshift.foo.tab1;

Invalid Error: Failed to prepare COPY "
	COPY (SELECT NULL FROM "foo"."tab1" ) TO STDOUT (FORMAT binary);
	": ERROR:  syntax error at or near "("
LINE 2:  COPY (SELECT NULL FROM "foo"."tab1" ) TO ...
              ^

I feel this'll be a bit of a yak-shaving exercise. I don't think RedShift supports PG's COPY ... TO STDOUT

@Mytherin Mytherin reopened this Mar 20, 2024
@Mytherin
Copy link
Contributor

Unfortunately this is harder to solve - all of the current result conversion code is written for the binary copy. In order to solve this we would need to make an alternative result conversion function that uses the (regular) text protocol and add support for that for all data types. I will leave the issue open for now as I don't have time to pick this up at this point.

@minaguib
Copy link

minaguib commented Mar 20, 2024

@Mytherin Very appreciated. Thank you. Was hoping for quick wins, but I completely understand where it's at now.

Also (thinking out loud):
RedShift is often used as a medium-big analytical DB. Without reasonable predicate-pushdown or the ability to hand-craft a query to run verbatim in destination, the naive "select *" for each query/subquery will also have limitations on practical usefulness.

@Mytherin
Copy link
Contributor

That makes sense - although data transfer can always be useful even if the performance is not fantastic. We do also support the postgres_query method.

@misteliy
Copy link

@Mytherin so is it possible to use the PostgreSQL Extension and copy command with redshift? https://duckdb.org/docs/extensions/postgres#running-sql-queries-in-postgres-with-postgres_query

@minaguib
Copy link

@misteliy You can test it, but my hunch based on the above is that it's not yet possible to transfer real data from redshift over the wire to the duckdb instance.

@danvinci
Copy link

danvinci commented Nov 20, 2024

I've recently tried using DuckDB (with postgres_scanner extension) to work with a Redshift instance: it attaches the DB correctly, can list its tables and describe them - but a simple "select * from <table> limit 10" query would inevitably fail with Error: Failed to prepare COPY

In my case I'd need to periodically query/dump data locally for ad-hoc analyses, even having basic querying working would be quite useful to automate my workflow.

Have there been any developments on Redshift support?


EDIT

The workaround I found so far is to UNLOAD data from Redshift to Parquet on S3 (using ODBC) and then using DuckDB to query the Parquet files.

@misteliy
Copy link

The AWS Sagemaker Lakehouse setup might bring some changes, but I still believe Redshift support through Postgres would be really useful, especially for low-latency analytics use cases (rather than traditional big data scenarios with Redshift). It would offer the best of both worlds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants