HAProxy - Speeding up SSL

Posted on Apr 27, 2017

I have been an HAProxy user for quite a few years now, even running snapshots in production for a long time, especially after support was added to terminate SSL connections directly in HAProxy. Getting rid of stunnel was so nice.

For a long time this setup served me really well. But over time more and more services were put behind HAProxy, and both the connection rate and the total number of connections went up. We started to see performance issues, which at first sounds weird: if you look at the benchmarks on the HAProxy website, it can do thousands, if not hundreds of thousands, of connections per second.

So what happened?

HAProxy is, at its core, a single-process, event-driven program. While it handles one connection and does the computations for that request, no other connection gets handled. So every bit of code needs to be as fast as possible, so the process can quickly switch to the next connection. In general this model works really well and is widely used.
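A toy sketch (mine, not HAProxy code) of why this matters: in a strictly sequential event loop, one expensive handler delays every event queued behind it. The 50 ms sleep is a stand-in for a CPU-heavy computation like an RSA handshake.

```python
import time

def handle_cheap():
    pass  # e.g. forwarding a plain-HTTP request, nearly free

def handle_expensive():
    time.sleep(0.05)  # stand-in for an expensive handshake computation

# one expensive event followed by 100 cheap ones, handled one at a time
queue = [handle_expensive] + [handle_cheap] * 100

start = time.monotonic()
for handler in queue:
    handler()  # nothing else runs while this executes
elapsed = time.monotonic() - start

# all 100 cheap requests waited behind the one expensive handshake
print(f"{elapsed:.3f}s for {len(queue)} events")
```

The cheap requests are not slow themselves; they simply cannot be served until the expensive one is done.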

Once SSL comes into the picture, things get tricky. SSL handshakes are quite expensive computations, so in the end your requests per second are limited by the number of handshakes you can do per second.

Need for speed

You now have two options to get more speed: faster computation (meaning faster CPUs or crypto accelerators) or spreading the work over more CPUs. The first is not always an option; per-core performance is still growing, but not by much. Which leaves us with option 2.
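A quick back-of-envelope for option 2, using illustrative numbers (the per-core figure is an assumption, roughly in line with the single-process baseline measured later in this post; real numbers vary wildly with CPU and key size):

```python
# assumed: one core manages ~250 full RSA handshakes per second
handshakes_per_core = 250
# plan: dedicate 6 of 7 processes to SSL termination
ssl_processes = 6

# rough ceiling on new TLS connections/s, ignoring session resumption
ceiling = handshakes_per_core * ssl_processes
print(ceiling)
```

Session resumption and TLS tickets raise the effective number considerably, but the ceiling for *fresh* handshakes scales with the cores you throw at them.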

HAProxy has the nbproc directive, but the documentation discourages its use. So step 1 was to ask upstream about it. The answer, in short: this problem is the only real use case for it. I tried to argue that the warning could perhaps be relaxed, but without it people would probably use the directive even in situations where SSL is not the problem; I learned that lesson during my time with lighttpd. At least I now had some steps to proceed with.

Our base configuration:

global
  log /dev/log daemon
  maxconn 32768
  chroot /var/lib/haproxy
  user haproxy
  group haproxy
  stats socket /var/lib/haproxy/stats user haproxy group haproxy mode 0640 level operator

  tune.bufsize 32768
  tune.ssl.default-dh-param 2048

  ssl-default-bind-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
  ssl-default-bind-options ssl-min-ver TLSv1.2 ssl-max-ver TLSv1.3
  ssl-default-server-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
  ssl-default-server-options ssl-min-ver TLSv1.2 ssl-max-ver TLSv1.3

defaults
  log     global
  mode    http
  option  log-health-checks
  option  log-separate-errors
  option  dontlognull
  option  httplog
  option  splice-auto
  option  socket-stats
  retries 3
  option  redispatch
  maxconn 10000
  timeout connect     5s
  timeout client     60s
  timeout server    450s

frontend http
  bind 0.0.0.0:80    tfo
  bind :::80 v6only  tfo

  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/

  acl is_ssl ssl_fc

  default_backend nginx

backend nginx
  option forwardfor
  server nginx 127.0.0.1:81 check inter 2s

A few comments about the config:

  • We define the ciphers globally, so we don’t have to specify them on each bind statement later.
  • IPv6 sockets get bound with v6only so we don’t handle IPv4 connections on the IPv6 sockets.

So let’s find out what our baseline speed is. Both benchmarks ran in parallel.

$ ab2 -c 20 -n 10000 https://myhost.mydomain.tld/
[snip]
Requests per second:    247.93 [#/sec] (mean)
Time per request:       80.668 [ms] (mean)
Time per request:       4.033 [ms] (mean, across all concurrent requests)
Transfer rate:          55.45 [Kbytes/sec] received
[snip]

$ ab2 -c 20 -n 10000 http://myhost.mydomain.tld/
[snip]
Requests per second:    630.70 [#/sec] (mean)
Time per request:       31.711 [ms] (mean)
Time per request:       1.586 [ms] (mean, across all concurrent requests)
Transfer rate:          141.04 [Kbytes/sec] received
[snip]

The properties of the SSL connection: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,4096,256

The Solution

The changes to the global options and defaults are minor. First we configure seven processes in total: one for request routing and plain HTTP access, and six for handling SSL. The bind-process setting in the defaults block ensures that any block without its own bind-process statement is owned by the first process.

global
[snip]
  nbproc 7

defaults
[snip]
  bind-process 1

Next is our new SSL-only frontend. All it does is terminate SSL and forward the traffic back to our routing frontend. The listen block is bound to processes 2-7.

listen ssl
  bind-process 2-7

The intuitive idea for the bind statements would be:

  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 2-7
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 2-7

Each bind statement is assigned all six processes. You could also leave out the process statement at the end of the bind lines; then all processes assigned to this block would be used.

This creates two sockets shared among the six processes. When the kernel signals a connection on those sockets, each process tries to grab it, but only one process gets it. This can lead to unbalanced work.

Another option, shown below, is to create one socket per process with SO_REUSEPORT. In this case the kernel distributes the load more fairly over all the processes involved. Our configuration block then looks like this:

  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 2
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 3
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 4
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 5
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 6
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 7

  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 2
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 3
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 4
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 5
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 6
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 7
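The SO_REUSEPORT mechanism behind these per-process bind lines can be sketched in a few lines of Python (Linux 3.9+; the port is picked by the kernel here, not taken from the article). Several sockets bind the same address and port, and the kernel load-balances incoming connections across them:

```python
import socket

# first socket: let the kernel pick a free port on loopback
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
first.bind(("127.0.0.1", 0))
port = first.getsockname()[1]
first.listen(128)

# five more sockets on the *same* port, one per would-be SSL process;
# without SO_REUSEPORT these binds would fail with EADDRINUSE
listeners = [first]
for _ in range(5):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    listeners.append(s)

print(len(listeners), "sockets listening on port", port)
for s in listeners:
    s.close()
```

Each accepted connection is delivered to exactly one of the sockets, and the kernel spreads them roughly evenly, which is exactly the fairness property we want across the six SSL processes.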

The SSL session cache is shared between all processes, and the TLS tickets are encrypted using a key that is generated before the child processes are forked.

A small optimization for the internal traffic is the tcp-smart-connect option. With it, HAProxy directly sends the data (i.e. the proxy protocol header and request data) in the first packet. This ensures that the HTTP backend has the request available immediately and saves it from having to poll for the data.

  option tcp-smart-connect

As the last step in the listen block, we configure the forwarding to the next hop. We use send-proxy-v2 here so the HTTP backend knows the real remote IP. In theory we could use Unix domain sockets here, but there is no splice support for Unix domain sockets yet. *sad face*

  server http 127.0.0.1:84 send-proxy-v2
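For the curious, here is a rough Python sketch (mine, not HAProxy's code) of the binary header that send-proxy-v2 prepends, and how a backend recovers the real client address from it. It covers only the TCP-over-IPv4 case, and the example addresses are made up:

```python
import socket
import struct

# fixed 12-byte proxy protocol v2 signature
SIG = b"\r\n\r\n\x00\r\nQUIT\n"

def parse_proxy_v2(data):
    """Parse a PROXYv2 header for TCP over IPv4; return (src_ip, src_port)."""
    assert data[:12] == SIG, "not a proxy protocol v2 header"
    ver_cmd, fam_proto, addr_len = struct.unpack("!BBH", data[12:16])
    assert ver_cmd == 0x21    # version 2, PROXY command
    assert fam_proto == 0x11  # AF_INET, SOCK_STREAM
    src, dst, sport, dport = struct.unpack("!4s4sHH", data[16:16 + addr_len])
    return socket.inet_ntoa(src), sport

# hand-built example header: client 192.0.2.10:51234 -> server 203.0.113.5:443
hdr = SIG + struct.pack("!BBH", 0x21, 0x11, 12)
hdr += socket.inet_aton("192.0.2.10") + socket.inet_aton("203.0.113.5")
hdr += struct.pack("!HH", 51234, 443)
print(parse_proxy_v2(hdr))  # ('192.0.2.10', 51234)
```

The HTTP frontend's accept-proxy keyword does this parsing for us; the sketch just shows why the real remote IP survives the extra loopback hop.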

Last but not least, we define the additional listening socket for our internal routing. We cannot use the normal port 80 sockets, as we want to allow the proxy protocol only on this dedicated socket.

frontend http
[snip]
  #
  # the socket for routing the requests
  #
  bind 127.0.0.1:84  tfo accept-proxy

  acl is_ssl fc_rcvd_proxy
[snip]

One important last point is the is_ssl ACL. Normally you would use ssl_fc (ssl frontend connection) to check whether a connection was received on an SSL socket. As we moved SSL out of this scope, we cannot use it anymore. But the connections from the SSL frontend also forward the connection data via the proxy protocol. As we only have one socket accepting this protocol, we can safely assume that those connections came from our SSL frontend and are therefore secure.

The complete configuration with inline comments for all the changes:

global
  log /dev/log daemon
  maxconn 32768
  chroot /var/lib/haproxy
  user haproxy
  group haproxy
  stats socket /var/lib/haproxy/stats user haproxy group haproxy mode 0640 level operator

  tune.bufsize 32768
  tune.ssl.default-dh-param 2048

  ssl-default-bind-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
  ssl-default-bind-options ssl-min-ver TLSv1.2 ssl-max-ver TLSv1.3
  ssl-default-server-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
  ssl-default-server-options ssl-min-ver TLSv1.2 ssl-max-ver TLSv1.3

  # launch 7 processes
  # process 1 for plain routing
  # process 2-7 for SSL
  nbproc 7

defaults
  log     global
  mode    http
  option  log-health-checks
  option  log-separate-errors
  option  dontlognull
  option  httplog
  option  splice-auto
  option  socket-stats
  retries 3
  option  redispatch
  maxconn 10000
  timeout connect     5s
  timeout client     60s
  timeout server    450s
  # by default use process 1
  bind-process 1

# our new SSL-only frontend. All we do here is terminate SSL and then
# forward the traffic back to our routing frontend
listen ssl
  # bind processes 2-7
  bind-process 2-7
  #
  # Intuitive idea:
  # bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 2-7
  # bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 2-7
  #
  # Each bind statement is assigned all six processes. You could also leave
  # out the process statement at the end of the bind lines; then all
  # processes assigned to this block would be used.
  #
  # This creates 2 sockets shared among the 6 processes. When the kernel
  # signals a connection on those sockets, each process tries to grab it,
  # but only 1 process gets it. This can lead to unbalanced work.
  #
  # Another option is shown below where we get multiple sockets with
  # SO_REUSEPORT. In this case the kernel distributes the load more fairly
  # over all the processes involved.
  #
  # The SSL session cache is shared between all processes and the TLS
  # tickets are encrypted using a key that is generated before the
  # child processes are forked.
  #
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 2
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 3
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 4
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 5
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 6
  bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 7

  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 2
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 3
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 4
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 5
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 6
  bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/ process 7

  #
  # Optimization:
  #
  # Directly send the data (i.e. the proxy protocol header and request data) in
  # the first packet. This ensures that the http backend has the request
  # available immediately and saves it from having to poll for the data.
  #
  option tcp-smart-connect

  # forward to the next hop.
  # We use send-proxy-v2 here so the http backend knows the real remote IP.
  #
  # In theory we could use unix domain sockets here, but there is no splice
  # support for unix domain sockets yet. *sad face*
  #
  server http 127.0.0.1:84 send-proxy-v2

frontend http
  bind 0.0.0.0:80    tfo
  bind :::80 v6only  tfo

  # bind 0.0.0.0:443   tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/
  # bind :::443 v6only tfo ssl alpn h2,http/1.1 npn h2,http/1.1 crt /etc/ssl/services/

  #
  # the socket for routing the requests
  #
  bind 127.0.0.1:84  tfo accept-proxy

  acl is_ssl fc_rcvd_proxy

  default_backend nginx

backend nginx
  option forwardfor
  server nginx 127.0.0.1:81 check inter 2s

And we can validate the different listens:

$ ss -tplen | grep haproxy
LISTEN     0      128    127.0.0.1:84    *:*    users:(("haproxy",pid=8194,fd=21)) ino:281097 sk:f <->
LISTEN     0      128          *:443   *:*    users:(("haproxy",pid=8200,fd=11)) ino:281087 sk:10 <->
LISTEN     0      128          *:443   *:*    users:(("haproxy",pid=8199,fd=10)) ino:281086 sk:11 <->
LISTEN     0      128          *:443   *:*    users:(("haproxy",pid=8198,fd=9)) ino:281085 sk:12 <->
LISTEN     0      128          *:443   *:*    users:(("haproxy",pid=8197,fd=8)) ino:281084 sk:13 <->
LISTEN     0      128          *:443   *:*    users:(("haproxy",pid=8196,fd=7)) ino:281083 sk:14 <->
LISTEN     0      128          *:443   *:*    users:(("haproxy",pid=8195,fd=6)) ino:281082 sk:15 <->
LISTEN     0      128          *:80    *:*    users:(("haproxy",pid=8194,fd=19)) ino:281095 sk:16 <->
LISTEN     0      128         :::443  :::*    users:(("haproxy",pid=8200,fd=17)) ino:281093 sk:17 v6only:1 <->
LISTEN     0      128         :::443  :::*    users:(("haproxy",pid=8199,fd=16)) ino:281092 sk:18 v6only:1 <->
LISTEN     0      128         :::443  :::*    users:(("haproxy",pid=8198,fd=15)) ino:281091 sk:19 v6only:1 <->
LISTEN     0      128         :::443  :::*    users:(("haproxy",pid=8197,fd=14)) ino:281090 sk:1a v6only:1 <->
LISTEN     0      128         :::443  :::*    users:(("haproxy",pid=8196,fd=13)) ino:281089 sk:1b v6only:1 <->
LISTEN     0      128         :::443  :::*    users:(("haproxy",pid=8195,fd=12)) ino:281088 sk:1c v6only:1 <->
LISTEN     0      128         :::80   :::*    users:(("haproxy",pid=8194,fd=20)) ino:281096 sk:1d v6only:1 <->

In the example above we see:

  • pid 8194 handles *:80, [::]:80 and 127.0.0.1:84.
  • pids 8195-8200 handle *:443 and [::]:443.

And this really helped?

$ ab2 -c 20 -n 10000 https://myhost.mydomain.tld/
[snip]
Requests per second:    678.84 [#/sec] (mean)
Time per request:       29.462 [ms] (mean)
Time per request:       1.473 [ms] (mean, across all concurrent requests)
Transfer rate:          151.81 [Kbytes/sec] received
[snip]

One could say … yes. With the burden of SSL removed from the main process, we get a sharp increase in performance here as well:

$ ab2 -c 20 -n 10000 http://myhost.mydomain.tld/
[snip]
Requests per second:    19338.32 [#/sec] (mean)
Time per request:       1.034 [ms] (mean)
Time per request:       0.052 [ms] (mean, across all concurrent requests)
Transfer rate:          4324.68 [Kbytes/sec] received
[snip]
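Putting numbers on the two benchmark pairs above:

```python
# requests/s before and after the multi-process split, from the ab runs above
https_before, https_after = 247.93, 678.84
http_before, http_after = 630.70, 19338.32

https_speedup = round(https_after / https_before, 2)
http_speedup = round(http_after / http_before, 2)
print(https_speedup, http_speedup)  # 2.74 30.66
```

Roughly a 2.7x gain for TLS from the six dedicated SSL processes, and a 30x gain for plain HTTP now that process 1 no longer burns its time on handshakes.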