Support HTTP compression #700

Open · 3 tasks
amotl opened this issue Feb 21, 2025 · 11 comments

@amotl (Member) commented Feb 21, 2025

About

CrateDB’s HTTP interface supports gzip and deflate compressed requests, but the crate-python client currently does not utilize this capability. Adding request compression would reduce bandwidth usage, improve performance for large queries and bulk inserts, and align crate-python with best practices seen in other database clients.

As a user, I want the option to send compressed requests to CrateDB to improve performance on congested networks.

Requirements:

  • Add a configuration option to enable request compression (gzip or deflate) when sending requests to CrateDB.
  • Compression should be enabled by default.
  • TBD: Introduce a size threshold to determine when compression is applied (see the sketch after this list).
    Context: Compressing every request and sending a Content-Encoding header regardless of payload size adds unnecessary overhead, so compression should only be applied when the request size exceeds a configurable threshold (e.g., 1 KB, 2 KB, or 4 KB, similar to other libraries).
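A minimal sketch of the threshold idea; the helper name and the default are purely illustrative and not part of crate-python:

import gzip

def maybe_compress(body: bytes, threshold: int = 2048):
    """Return (payload, extra_headers); compress only above the threshold."""
    if len(body) < threshold:
        return body, {}
    return gzip.compress(body), {"Content-Encoding": "gzip"}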

Warning

This is primarily about request encoding / compression. HTTP response compression is vulnerable to BREACH and therefore requires additional countermeasures.


@proddata said:

It seems like CrateDB's HTTP interface accepts gzip / deflate compressed data.
It might also be interesting to add this capability to crate-python.

@surister said:

import gzip
import json
import requests

# 200,000 bulk-insert parameter rows.
objects = [[1, "test"] for _ in range(200_000)]

body = {
    "stmt": "INSERT INTO t VALUES (?, ?)",
    "bulk_args": objects,
}

# Uncompressed request.
response = requests.post('http://192.168.88.251:4200/_sql', json=body)
print(response.request.headers.get('content-length'))

# gzip-compressed request.
response = requests.post('http://192.168.88.251:4200/_sql',
                         data=gzip.compress(json.dumps(body).encode('utf8')),
                         headers={'Content-Encoding': 'gzip',
                                  'Content-Type': 'application/gzip; charset=utf-8'})
print(response.request.headers.get('content-length'))

Output:

2600054
5149
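For this highly repetitive payload, that is a reduction from about 2.6 MB to 5 KB, roughly a factor of 500.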


amotl added the feature label Feb 21, 2025
@amotl (Member, Author) commented Feb 25, 2025

@proddata: Thanks for your feedback. The requirements sound reasonable, if we really need the ability to turn the feature off on demand.

Naming things

I know this driver uses a standard protocol, while most other low-level database drivers use protocols of their own, mostly binary or otherwise proprietary to their needs, where compression might be handled differently or is enabled from the start.

To adhere to relevant (naming) conventions as well as possible when making the feature configurable as proposed, can we look into how others are doing it, primarily to learn about their interfaces and parameter names? Specifically:

a) how other Python DBAPI libraries handle this situation, i.e. how they make the relevant parameters configurable (names, values, units), and
b) in particular, how PostgreSQL clients are doing it, both in Python land and beyond.

... of course, this only applies if compression is also a concern there. I think it might be, but I might also be too naive.

Rationale

How are others doing it?

In particular, I am not exclusively concerned about the DBAPI driver here, because it rarely has direct exposure other than being used from applications. However, SQLAlchemy is a different animal, because its connection string is exposed to a wider audience of people and machines, being used in downstream applications of many kinds: standalone or cloud-based, library-shaped or ephemerally hosted, or not. You name it.

In this spirit, we aim to standardize on naming conventions here, so I am asking that we do the same for the compression feature: look at how others are naming their parameters.

@proddata (Member) commented:
In the Elasticsearch Python client, HTTP compression is controlled by a simple on/off switch:
Configuration Reference
Implementation Details

However, it’s important to distinguish between the two aspects of compression support in the client: sending and receiving compressed data. Only the content itself is compressed, not the overall request structure. For example, a simple query like SELECT 1 results in approximately 250 bytes transmitted in the request body. Enabling compression in such cases provides little benefit in terms of data reduction but, in initial tests, introduced a slight increase in latency (a few milliseconds per request).
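For reference, a minimal sketch of that switch in elasticsearch-py; the host URL is an assumption:

from elasticsearch import Elasticsearch

# Single on/off switch; the transport then gzip-compresses request bodies.
es = Elasticsearch("http://localhost:9200", http_compress=True)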

b) in particular, how PostgreSQL (clients) are doing it, both in Python lands, and beyond.

PostgreSQL does not support request compression in the same way. Neither the PostgreSQL wire protocol nor common client implementations (in Python or other languages) provide native compression mechanisms similar to Elasticsearch’s HTTP-based approach.

@amotl (Member, Author) commented Feb 26, 2025

Thanks, I've added the information about Elasticsearch to the table below.

With the PostgreSQL wire protocol, compression can be enabled, if your OpenSSL library supports zlib, by toggling the connection parameter sslcompression=1. It was enabled by default from its introduction in version 9.2, but the default was disabled again in version 11.

So far, I have not been able to spot any threshold parameters other than Oracle's. Of course, this enumeration is neither exhaustive nor deep; it just tries to tap a little into the topic of proper "naming things" and "exploring the landscape".

| Database | Client docs | Parameter names (driver/server/SQL) | Parameter names (JDBC, ODBC) | Parameter names (SQLAlchemy) |
| --- | --- | --- | --- | --- |
| Elasticsearch | Python client: HTTP compression | api argument: http_compress=True | | http_compress |
| MongoDB | MongoDB Network Compression: A Win-Win | api argument: compressors='zstd' | n/a | |
| MySQL | 6.2.8 Connection Compression Control | setting: protocol_compression_algorithms; cli: --compression-algorithms; cli: --compress | | compress |
| Oracle | Advanced Network Compression Whitepaper | SQLNET.COMPRESSION, SQLNET.COMPRESSION_LEVELS, SQLNET.COMPRESSION_THRESHOLD | | |
| PostgreSQL | 31.1. Database Connection Control Functions » 31.1.2. Parameter Key Words (via SO) | connection: sslcompression=1; env: PGSSLCOMPRESSION=1 | | Did not discover anything. Maybe generic pass-through? |

@proddata (Member) commented:
With the PostgreSQL wire protocol, compression can be enabled, if your OpenSSL library supports zlib, by toggling the connection parameter sslcompression=1, which has been enabled by default starting with its introduction on version 9.2, but this default has been disabled again with version 11.

TLS compression was removed in TLS 1.3 due to CRIME.


HTTP compression is somewhat vulnerable to BREACH, so it’s important to differentiate between request and response encoding. Ideally, a client should support both, but they should be managed with separate settings, as request compression generally doesn’t present the same risks. That is also partly why I initially talked only about request encoding.

@amotl (Member, Author) commented Feb 26, 2025

I see. Thank you very much.

So, in order to use other implementations and their parameterizations as blueprints for "naming things", we eventually need to focus on databases that specifically use HTTP as the communication protocol, because those details (e.g. request vs. response compression parameters) are only present, and of concern, in such environments.

In this case, selecting Elasticsearch is a perfect choice [1]. However, the relevant parameter set seems pretty thin in this regard: http_compress neither provides the option to discriminate between request and response, nor conveys any threshold details. Maybe OpenSearch offers a better way to adjust those parameters?

Footnotes

[1] Can you think of any other database servers that use HTTP?

@proddata (Member) commented:
OpenSearch Python Client

The OpenSearch Python Low-Level Client supports HTTP compression for request bodies:

http_compress = True  # Enables gzip compression for request bodies

Contrary to ES, the documentation mentions only request bodies (I haven't checked the actual implementation).
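A minimal connection sketch, assuming the opensearch-py low-level client and a local node:

from opensearchpy import OpenSearch

# http_compress=True gzip-compresses request bodies.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}],
                    http_compress=True)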

ClickHouse Python Client

The ClickHouse Python Client (with limited SQLAlchemy support) also provides compression settings.
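If I read the clickhouse-connect documentation correctly, its compress setting accepts a boolean or a specific algorithm; a minimal sketch, with the host assumed:

import clickhouse_connect

# `compress` takes True/False or an algorithm name such as 'lz4' or 'zstd'.
client = clickhouse_connect.get_client(host="localhost", compress="lz4")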

ClickHouse Java Client

The ClickHouse Java Client uses the HTTP interface and provides three compression-related settings:

| Setting | Effect |
| --- | --- |
| compressServerResponse(boolean enabled) | Sets whether the server should compress its responses. |
| compressClientRequest(boolean enabled) | Sets whether the client should compress its requests. |
| useHttpCompression(boolean enabled) | Enables HTTP compression for client/server communication if the corresponding options are enabled. |

ClickHouse JavaScript Client

The ClickHouse JavaScript Client differentiates between request and response compression.

@amotl (Member, Author) commented Feb 28, 2025

Thank you. What do you think about those parameter names, for both the DB API's .connect() function and SQLAlchemy's engine or query options? A usage sketch follows the list.

  • compress_request=true vs. compress_response=true
  • compress_request_threshold=2k
  • compress_request_encoding= vs. compress_response_encoding= (one of lz4, zstd, br, gzip, deflate, when applicable)
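A hypothetical usage sketch; none of these parameters exist in crate-python yet, the names just mirror the proposal above:

from crate import client

conn = client.connect(
    "https://localhost:4200",
    compress_request=True,             # hypothetical: compress request bodies
    compress_request_threshold="2k",   # hypothetical: only above ~2 KB
    compress_request_encoding="gzip",  # hypothetical: gzip or deflate
)

# Likewise as SQLAlchemy query options (hypothetical):
# crate://localhost:4200/?compress_request=true&compress_request_threshold=2k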

@proddata (Member) commented:
Since these settings primarily deal with content encoding, we might consider naming them accordingly and aligning with HTTP conventions, such as:

  • encoding_request='gzip' | 'none' | <any others in the future>
  • encoding_response='none' | 'gzip' | <any others in the future>

Semantically, this would also remove the need for a dedicated on/off parameter.

WDYT?

For the threshold, I think it would make sense to run some benchmarks; potentially, this parameter might not be needed after all.

@amotl
Copy link
Member Author

amotl commented Feb 28, 2025

Hi. When possible, I would like to use a naming scheme that is largely independent of the underlying protocol, focusing on the semantic meaning of the compress label, like the other database vendors do, because everyone will understand what that is about, contrary to encoding.

Maybe let's drop request/response and use client/server instead, like ClickHouse does?

@proddata (Member) commented Feb 28, 2025

Then maybe:

  • compress = 'request' | 'response' | 'both' / 'all' | 'none'
  • or compress = 'client' | 'server' | 'both' / 'all' | 'none'
    which in the future would allow for:
  • compress_encoding = 'gzip' | 'zstd' | 'lz4' | 'br' | 'deflate'

Or alternatively:

  • compress = True
  • compress_mode = 'request' | 'response' | 'both' | 'none'
    which in the future would likewise allow for:
  • compress_encoding = 'gzip' | 'zstd' | 'lz4' | 'br' | 'deflate'

@amotl (Member, Author) commented Apr 14, 2025

Slightly off-topic: discovered via a Dependabot update, it looks like Prometheus also introduced zstd compression support recently.
