Support HTTP compression #700

Open · 3 tasks
amotl opened this issue Feb 21, 2025 · 11 comments

@amotl (Member) commented Feb 21, 2025

About

CrateDB’s HTTP interface supports gzip and deflate compressed requests, but the crate-python client currently does not utilize this capability. Adding request compression would reduce bandwidth usage, improve performance for large queries and bulk inserts, and align crate-python with best practices seen in other database clients.

As a user, I want the option to send compressed requests to CrateDB to improve performance on congested networks.

Requirements:

  • Add a configuration option to enable request compression (gzip or deflate) when sending requests to CrateDB.
  • Compression should be enabled by default.
  • TBD: Introduce a size threshold to determine when compression is applied (see the sketch after this list).
    Context: Compressing every request and sending a Content-Encoding header regardless of payload size adds unnecessary overhead, so compression should only be applied when the request size exceeds a configurable threshold (e.g., 1 KB, 2 KB, or 4 KB, similar to other libraries).
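A minimal sketch of the threshold idea; the helper name and the default are purely illustrative and not part of crate-python:

import gzip

def maybe_compress(body: bytes, threshold: int = 2048):
    """Return (payload, extra_headers); compress only above the threshold."""
    if len(body) < threshold:
        return body, {}
    return gzip.compress(body), {"Content-Encoding": "gzip"}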

Warning

This is primarily about request encoding / compression. HTTP response compression is vulnerable to BREACH and therefore requires additional countermeasures.


@proddata said:

It seems like CrateDB's HTTP interface accepts gzip / deflate compressed data.
It might also be interesting to add this capability to crate-python.

@surister said:

import gzip
import json
import requests

# 200,000 bulk-insert parameter rows.
objects = [[1, "test"] for _ in range(200_000)]

body = {
    "stmt": "INSERT INTO t VALUES (?, ?)",
    "bulk_args": objects,
}

# Uncompressed request.
response = requests.post('http://192.168.88.251:4200/_sql', json=body)
print(response.request.headers.get('content-length'))

# gzip-compressed request.
response = requests.post('http://192.168.88.251:4200/_sql',
                         data=gzip.compress(json.dumps(body).encode('utf8')),
                         headers={'Content-Encoding': 'gzip',
                                  'Content-Type': 'application/gzip; charset=utf-8'})
print(response.request.headers.get('content-length'))

Output:

2600054
5149
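For this highly repetitive payload, that is a reduction from about 2.6 MB to 5 KB, roughly a factor of 500.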


amotl added the feature label Feb 21, 2025
@amotl (Member, Author) commented Feb 25, 2025

@proddata: Thanks for your feedback. The requirements sound reasonable, if we really need the ability to turn the feature off on demand.

Naming things

I know this driver uses a standard protocol, while most other low-level database drivers use protocols of their own, mostly binary or otherwise proprietary to their needs, where compression might be handled differently or is enabled from the start.

To adhere to relevant (naming) conventions as well as possible when making the feature configurable as proposed, can we look into how others are doing it, primarily to learn about their interfaces and parameter names? Specifically:

a) how other Python DBAPI libraries handle this situation, i.e. how they make the relevant parameters configurable (names, values, units), and
b) in particular, how PostgreSQL clients are doing it, both in Python land and beyond.

... of course, this only applies if compression is also a concern there. I think it might be, but I might also be too naive.

Rationale

How are others doing it?

In particular, I am not exclusively concerned about the DBAPI driver here, because it rarely has direct exposure other than being used from applications. However, SQLAlchemy is a different animal, because its connection string is exposed to a wider audience of people and machines, being used in downstream applications of many kinds: standalone or cloud-based, library-shaped or ephemerally hosted, or not. You name it.

In this spirit, we aim to standardize on naming conventions here, so I am asking that we do the same for the compression feature: look at how others are naming their parameters.

@proddata (Member) commented:
In the Elasticsearch Python client, HTTP compression is controlled by a simple on/off switch:
Configuration Reference
Implementation Details

However, it’s important to distinguish between the two aspects of compression support in the client: sending and receiving compressed data. Only the content itself is compressed, not the overall request structure. For example, a simple query like SELECT 1 results in approximately 250 bytes transmitted in the request body. Enabling compression in such cases provides little benefit in terms of data reduction but, in initial tests, introduced a slight increase in latency (a few milliseconds per request).
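For reference, a minimal sketch of that switch in elasticsearch-py; the host URL is an assumption:

from elasticsearch import Elasticsearch

# Single on/off switch; the transport then gzip-compresses request bodies.
es = Elasticsearch("http://localhost:9200", http_compress=True)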

b) in particular, how PostgreSQL (clients) are doing it, both in Python lands, and beyond.

PostgreSQL does not support request compression in the same way. Neither the PostgreSQL wire protocol nor common client implementations (in Python or other languages) provide native compression mechanisms similar to Elasticsearch’s HTTP-based approach.

@amotl (Member, Author) commented Feb 26, 2025

Thanks, I've added the information about Elasticsearch to the table below.

With the PostgreSQL wire protocol, compression can be enabled, if your OpenSSL library supports zlib, by toggling the connection parameter sslcompression=1. It was enabled by default from its introduction in version 9.2, but the default was disabled again in version 11.

So far, I have not been able to spot any threshold parameters other than Oracle's. Of course, this enumeration is neither exhaustive nor deep; it just tries to tap a little into the topic of proper "naming things" and "exploring the landscape".

| Database | Client docs | Parameter names (driver/server/SQL) | Parameter names (JDBC, ODBC) | Parameter names (SQLAlchemy) |
| --- | --- | --- | --- | --- |
| Elasticsearch | Python client: HTTP compression | api argument: http_compress=True | | http_compress |
| MongoDB | MongoDB Network Compression: A Win-Win | api argument: compressors='zstd' | n/a | |
| MySQL | 6.2.8 Connection Compression Control | setting: protocol_compression_algorithms; cli: --compression-algorithms; cli: --compress | | compress |
| Oracle | Advanced Network Compression Whitepaper | SQLNET.COMPRESSION, SQLNET.COMPRESSION_LEVELS, SQLNET.COMPRESSION_THRESHOLD | | |
| PostgreSQL | 31.1. Database Connection Control Functions » 31.1.2. Parameter Key Words (via SO) | connection: sslcompression=1; env: PGSSLCOMPRESSION=1 | | Did not discover anything. Maybe generic pass-through? |

@proddata (Member) commented:
With the PostgreSQL wire protocol, compression can be enabled, if your OpenSSL library supports zlib, by toggling the connection parameter sslcompression=1, which has been enabled by default starting with its introduction on version 9.2, but this default has been disabled again with version 11.

TLS compression was removed in TLS 1.3 due to CRIME.


HTTP compression is somewhat vulnerable to BREACH, so it’s important to differentiate between request and response encoding. Ideally, a client should support both, but they should be managed with separate settings, as request compression generally doesn’t present the same risks. That is also partly why I initially talked only about request encoding.

@amotl (Member, Author) commented Feb 26, 2025

I see. Thank you very much.

So, in order to use other implementations and their parameterizations as blueprints for "naming things", we eventually need to focus on databases that specifically use HTTP as the communication protocol, because those details (e.g. request vs. response compression parameters) are only present, and of concern, in such environments.

In this case, selecting Elasticsearch is a perfect choice [1]. However, the relevant parameter set seems pretty thin in this regard: http_compress neither provides the option to discriminate between request and response, nor conveys any threshold details. Maybe OpenSearch offers a better way to adjust those parameters?

Footnotes

[1] Can you think of any other database servers that use HTTP?

@proddata (Member) commented:
OpenSearch Python Client

The OpenSearch Python Low-Level Client supports HTTP compression for request bodies:

http_compress = True  # Enables gzip compression for request bodies

Contrary to ES, the documentation mentions only request bodies (I haven't checked the actual implementation).
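A minimal connection sketch, assuming the opensearch-py low-level client and a local node:

from opensearchpy import OpenSearch

# http_compress=True gzip-compresses request bodies.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}],
                    http_compress=True)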

ClickHouse Python Client

The ClickHouse Python Client (with limited SQLAlchemy support) also provides compression settings.
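If I read the clickhouse-connect documentation correctly, its compress setting accepts a boolean or a specific algorithm; a minimal sketch, with the host assumed:

import clickhouse_connect

# `compress` takes True/False or an algorithm name such as 'lz4' or 'zstd'.
client = clickhouse_connect.get_client(host="localhost", compress="lz4")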

ClickHouse Java Client

The ClickHouse Java Client uses the HTTP interface and provides three compression-related settings:

| Setting | Effect |
| --- | --- |
| compressServerResponse(boolean enabled) | Sets whether the server should compress its responses. |
| compressClientRequest(boolean enabled) | Sets whether the client should compress its requests. |
| useHttpCompression(boolean enabled) | Enables HTTP compression for client/server communication if the corresponding options are enabled. |

ClickHouse JavaScript Client

The ClickHouse JavaScript Client differentiates between request and response compression.

@amotl (Member, Author) commented Feb 28, 2025

Thank you. What do you think about those parameter names, for both the DB API's .connect() function and SQLAlchemy's engine or query options? A usage sketch follows the list.

  • compress_request=true vs. compress_response=true
  • compress_request_threshold=2k
  • compress_request_encoding= vs. compress_response_encoding= (one of lz4, zstd, br, gzip, deflate, when applicable)
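A hypothetical usage sketch; none of these parameters exist in crate-python yet, the names just mirror the proposal above:

from crate import client

conn = client.connect(
    "https://localhost:4200",
    compress_request=True,             # hypothetical: compress request bodies
    compress_request_threshold="2k",   # hypothetical: only above ~2 KB
    compress_request_encoding="gzip",  # hypothetical: gzip or deflate
)

# Likewise as SQLAlchemy query options (hypothetical):
# crate://localhost:4200/?compress_request=true&compress_request_threshold=2k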

@proddata (Member) commented:
Since these settings primarily deal with content encoding, we might consider naming them accordingly and aligning with HTTP conventions, such as:

  • encoding_request='gzip' | 'none' | <any others in the future>
  • encoding_response='none' | 'gzip' | <any others in the future>

Semantically, this would also remove the need for a dedicated on/off parameter.

WDYT?

For the threshold, I think it would make sense to run some benchmarks; potentially, this parameter might not be needed after all.

@amotl
Copy link
Member Author

amotl commented Feb 28, 2025

Hi. When possible, I would like to use a naming scheme that is largely independent of the underlying protocol, focusing on the semantic meaning of the compress label, like the other database vendors do, because everyone will understand what that is about, contrary to encoding.

Maybe let's drop request/response and use client/server instead, like ClickHouse does?

@proddata (Member) commented Feb 28, 2025

Then maybe:

  • compress = 'request' | 'response' | 'both' / 'all' | 'none'
  • or compress = 'client' | 'server' | 'both' / 'all' | 'none'
    which in the future would allow for:
  • compress_encoding = 'gzip' | 'zstd' | 'lz4' | 'br' | 'deflate'

Or alternatively:

  • compress = True
  • compress_mode = 'request' | 'response' | 'both' | 'none'
    which in the future would likewise allow for:
  • compress_encoding = 'gzip' | 'zstd' | 'lz4' | 'br' | 'deflate'

@amotl (Member, Author) commented Apr 14, 2025

Slightly off-topic: discovered via a Dependabot update, it looks like Prometheus also introduced zstd compression support recently.
