Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blind tunneling breaks with Google Chrome and TLS Kyber cipher #11758

Open
fsecure-kilkanen opened this issue Sep 9, 2024 · 8 comments
Open
Assignees
Labels

Comments

@fsecure-kilkanen
Copy link

fsecure-kilkanen commented Sep 9, 2024

Executive summary

When TLS Client Hello is split into two TCP segments and segments arrive to TrafficServer in different parts, blind tunneling flow breaks.

Details

Use case

Trafficserver runs as a transparent forward proxy for TLS traffic. A custom plugin uses SSL_CERT_HOOK to determine destination hostname of the TLS connection.

If the destination is accepted, connection is converted into a blind tunnel by calling TSVConnTunnel(sslvc) and TSVConnReenable(sslvc) in the hook callback function.

Affected versions

Reproduced with TrafficServer 8.x branch and TrafficServer 9.2.4. Issue should also exist in 10.x branch because affected code paths have not changed.

The issue

When a connection is started with Kyber cipher, the TLS Client Hello packet contains more key material than previous ciphers. With Kyber cipher, the TLS Client Hello packet is 1700-2100 bytes in size.

Since the network MTU is usually 1500, it means that the TLS Client Hello is split into two TCP segments. In some conditions, clients' network connectivity is slow such that the gap between the first and second segment of Client Hello is big enough to trigger the bug.

https://github.com/apache/trafficserver/blob/master/src/iocore/net/SSLNetVConnection.cc#L423 is the place where TLS Client Hello is read. This is a call to read data from socket that is for the incoming TCP connection. There is no guarantee that this call will return both segments of the Client Hello. If data for complete Client Hello is received here, the blind tunnel operation is succesful.

However, if only first segment of Client Hello is received, then the following sequence of events happen:

https://github.com/apache/trafficserver/blob/master/src/iocore/net/SSLNetVConnection.cc#L1365 OpenSSL ssl_accept() is called. It returns SSL_ERROR_WANT_READ, which is returned as SSL_HANDSHAKE_WANT_READ.

https://github.com/apache/trafficserver/blob/master/src/iocore/net/SSLNetVConnection.cc#L661 is the branch taken in this case. Processing goes to update_rbio at https://github.com/apache/trafficserver/blob/master/src/iocore/net/SSLNetVConnection.cc#L529 with move_to_socket set to true.

In update_rbio, OpenSSL BIO is set to the file descriptor of the connection, and existing handshake buffers are deallocated.

Now, when the second segment of Client Hello arrives, the blind tunnel branch https://github.com/apache/trafficserver/blob/master/src/iocore/net/SSLNetVConnection.cc#L624 is not taken, because this->handShakeReader is nullptr after it was deallocated previously in update_rbio.

To fix the issue, the case of SSL_HANDSHAKE_WANT_READ should be handled so that the second TCP segment of handshake is read into the existing SSL handshake buffer, and then ssl_accept() is called with that buffer.

I have tried a couple of workarounds to do this. However, since I don't fully understand the context why current implementation is like this, I am not confident of making big changes, because they might affect other use cases.

@fsecure-kilkanen
Copy link
Author

I will add more detailed reproduction instructions.

@fsecure-kilkanen
Copy link
Author

These are relevant debug logs that show what happens.

caesars-with-kyber-fail.txt
caesars-without-kyber-slow.txt

In the caesars-with-kyber-fail.txt one can see how 1291 bytes of client handshake are first received, and then OpenSSL ssl_accept() returns WANT_READ. Second part is received by OpenSSL and processed properly. However, then processing gets stuck with probe needs data, read.. step.

In caesars-without-kyber-slow.txt, the Kyber cipher was disabled. There, 655 bytes of handshake data is read, and OpenSSL ssl_accept() processes it properly.

Here one can see that after probe needs data, read.. step the action moves to HttpSessionAccept correcly, and operation finishes succesfully.

@bryancall bryancall self-assigned this Sep 9, 2024
@fsecure-kilkanen
Copy link
Author

https://github.com/fsecure-kilkanen/kyber-handshaketest/blob/main/kyber-handshaketest.py is a simple Python script that sends a previously recorded TLS Client Hello packet with Kyber cipher to a chosen destination.

It sends the client hello packet in two ways:

  1. One call to send(). This means TCP stack will do TCP segmentation.
  2. Client Hello is split into two parts, and the parts are sent separately with slight delay. This simulates what happens in the error case above.

Both cases succeed when running directly against a web server.

With ATS, first case succeeds depending on network conditions. Second case breaks always, as ATS receives the ClientHello always in two separate socket read calls.

@fsecure-kilkanen
Copy link
Author

https://tldr.fail/ is a website that covers these classes of bugs.

@fsecure-kilkanen
Copy link
Author

fsecure-kilkanen commented Oct 15, 2024

@bryancall Are there any plans to fix this bug?

@masaori335
Copy link
Contributor

masaori335 commented Oct 25, 2024

Hello, Tero. Thanks for report.

I recently start taking a look at PQTLS and tested ATS with tldr_fail_test.py. If I run ATS as a reverse proxy, there're no issue with large Client Hello nor separated Client Hello. However, as you pointed out, the Blind Tunnel case is not working with separated Client Hello.

To fix the issue, the case of SSL_HANDSHAKE_WANT_READ should be handled so that the second TCP segment of handshake is read into the existing SSL handshake buffer, and then ssl_accept() is called with that buffer.

I agree with we should read all packets that has Client Hello and forward them to origin server as tunnel.

I have tried a couple of workarounds to do this. However, since I don't fully understand the context why current implementation is like this, I am not confident of making big changes, because they might affect other use cases.

There're several test cases around this code. If you open a PR, they should run automatically :)

@fsecure-kilkanen
Copy link
Author

I have some questions about the implementation.

What is the reason for switching OpenSSL to read from the socket after reading first packet?

My plan is to check the length of handshake from TLS client hello header, and then compare that to amount of data that was read. If the whole handshake was not read, then further read of handshake needs to be triggered.

How do I properly trigger a second read at this point using the async event system in ATS?

The handshake data is read into IOBufferBlock:

IOBufferBlock *b = this->handShakeBuffer->first_write_block();

If I do a second read to this IOBufferBlock, will the data be stored in continuous memory block? So that when I get this->handShakeBuffer->start() and this->handShakeBuffer->end(), it will have the full handshake data without any extra data?

Thanks for the assistance.

@fsecure-kilkanen
Copy link
Author

Hi @shinrich I see that you have worked with this code path, could you give some input on my questions above?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants