Provide performance benchmarks with and without Proxy #574
Comments
Hey @phoenix2x, thanks for the issue. Some follow-up questions:
Slightly unrelated, but are you using the built-in traces?
Thanks for the quick follow-up:)
We do use the built-in traces, but since we can't use them for the no-cloudsqlconn test, we added our own custom span in the dialer for both tests, just to compare apples to apples. Here they are:
Thanks, @phoenix2x, this is really helpful. How many instances does each dialer connect to? For some background info, the Dialer does this:
It's a single Postgres instance in this test.
These numbers are surprising to me. Let me talk with the backend folks to see if we can shed some light on what's going on. Are you doing manual load testing for this? How are you doing it?
Also, if you have a support account, you might consider opening a case so the backend team can look at your instance.
Yes, we use nightly load testing to prevent regressions. As soon as we switched from RDS to Cloud SQL we noticed this issue. The load test is just a script that hits our services. Thank you.
Those latency numbers are much higher than I'd expect. We've been thinking about publishing some baseline numbers as part of a benchmark, and this helps increase the priority of that work. Otherwise, there might be some insight the backend team can add.
Meanwhile, I'm going to make this an issue for publishing benchmark numbers with and without the Dialer.
Related to GoogleCloudPlatform/cloud-sql-proxy#1871.
Just to circle back here and provide some information for others who run into this issue: one thing to keep in mind is that Auto IAM AuthN is limited to 3,000 login requests per instance per minute. When traffic spikes hit that threshold, latency can jump way up (as we see above). Generally, though, we expect p99 latency to be much, much lower, while still accounting for the network hops (app with connector -> proxy server -> instance, plus sometimes a call to verify the IAM user).
@phoenix2x FYI it's possible to use auto IAM authn without the Go Connector. You'll need to ensure a few things:
We're working on making this path easier for folks, but for now I'll share the mechanics for visibility. Assuming you're using pgx, you can do this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// Use the instance IP + the native port (5432).
	// For best security, use client certificates + the server cert in the DSN.
	config, err := pgxpool.ParseConfig("host=INSTANCE_IP user=postgres password=empty sslmode=require")
	if err != nil {
		panic(err)
	}
	config.BeforeConnect = func(ctx context.Context, cfg *pgx.ConnConfig) error {
		// This gets called before a connection is created and lets you
		// refresh the OAuth2 token as needed. A fancier implementation would
		// cache the token and refresh it only when it's about to expire.
		cfg.Password = "mycooltoken"
		return nil
	}
	pool, err := pgxpool.NewWithConfig(context.Background(), config)
	if err != nil {
		panic(err)
	}
	conn, err := pool.Acquire(context.Background())
	if err != nil {
		panic(err)
	}
	defer conn.Release()

	row := conn.QueryRow(context.Background(), "SELECT NOW()")
	var t time.Time
	if err := row.Scan(&t); err != nil {
		panic(err)
	}
	fmt.Println(t)
}
```
Hi @enocom, this is very interesting, thank you:) Is this supposed to target the server-side proxy on port 3307, or the native 5432?
Native port. We're working on making this more obvious and possibly even providing some helper functions.
Nice, we should definitely give it a try. |
Question
Hi there,
We're trying to migrate from RDS to Cloud SQL. The application is running on GKE. We're using cloud-sql-go-connector v1.3.0 to connect to a Postgres instance like this:
We noticed that it takes significantly more time for the Dialer to finish: ~300ms, as opposed to ~30ms for RDS. We assumed this was caused by the additional work the Cloud SQL proxy server does, and we confirmed it by connecting directly to the Cloud SQL Postgres instance via IP:5432, where latency stayed at ~70ms.
Since we use connection pooling, this works with no errors most of the time. But during traffic spikes, or after many connections have died under heavy load, the added latency causes a lot of errors. What happens is that when we try to open a significant number of connections at the same time, the dialer latency spikes with every new connection (up to ~4s), causing the application to open even more connections because it can't get enough to fulfill incoming requests. The end result is that with cloudsqlconn the application opens connections up to the pool limit (500 in this specific test) and produces a lot of errors, while it opens only ~100 connections with no errors when connecting to IP:5432 directly. Both tests use the same traffic numbers.
Is there anything we can do to mitigate the issue?
Sorry for the long write-up, just want to make sure I give enough info:)
Code
No response
Additional Details
No response