HAProxy - a journey into multithreading (and SSL)

HAProxy - a journey into multithreading (and SSL)

I'm running some load balancers which are using HAProxy to distribute HTTP traffic to multiple systems.

While using SSL with HAProxy is possible since some time, it wasn't in the early days. So we decided for some customers, which was in need to provide encryption, to offload it with Apache.
When later HAProxy got added SSL support this also had benefits when keeping this setup for larger sites, because HAProxy had a single process model and doing encryption is indeed way more resource consuming.
Still using Apache for SSL offloading was a good choice because it comes with the Multi-Processing Modules worker and event that are threading capable. We did choose the event mpm cause it should deal better with the 'keep alive problem' in HTTP. So far so good.

Last year some large setups started to suffer accepting new connections out of the blue. Unfortunately I found nothing in the logs and also couldn't reproduce this behaviour. After some time I decided to try using another Apache mpm and switched over to the worker model. And guess what ... the connection issues vanished.
Some days later I surprisingly learned about the Apache Bug in Debian BTS "Event MPM listener thread may get blocked by SSL shutdowns" which was an exact description of my problem.

While being back in safe waters I thought it would be good to have a look into HAProxy again and learned that threading support was added in version 1.8 and in 1.9 got some more improvements.
So we started to look into it on a system with a couple of real CPUs:

# grep processor /proc/cpuinfo | tail -1
processor	: 19

At first we needed to install a newer version of HAProxy, since 1.8.x is available via backports but 1.9.x is only available via haproxy.debian.net. I thought I should start with a simple configuration and keep 2 spare CPUs for other tasks:

        # one process
        nbproc 1
        # 18 threads
        nbthread 18
        # mapped to the first 18 CPU cores
        cpu-map auto:1/1-18 0-17

Now let's start:

# haproxy -c -V -f /etc/haproxy/haproxy.cfg
# service haproxy reload
# pstree haproxy
No processes found.
# grep "worker #1" /var/log/haproxy.log | tail -2
Mar 20 13:06:51 lb13 haproxy[22156]: [NOTICE] 078/130651 (22156) : New worker #1 (22157) forked
Mar 20 13:06:51 lb13 haproxy[22156]: [ALERT] 078/130651 (22156) : Current worker #1 (22157) exited with code 139 (Segmentation fault)

Okay .. cool! ;) So I started lowering the used CPUs since without threading I did not experiencing segfaults. With 17 threads it seems to be better:

# service haproxy restart
# pstree haproxy
# grep "worker #1" /var/log/haproxy.log | tail -2
Mar 20 13:06:51 lb13 haproxy[22156]: [ALERT] 078/130651 (22156) : Current worker #1 (22157) exited with code 139 (Segmentation fault)
Mar 20 13:14:33 lb13 haproxy[27001]: [NOTICE] 078/131433 (27001) : New worker #1 (27002) forked

Now I started to move traffic from Apache to HAProxy slowly and watching logs carefully. With shifting more and more traffic over, the amount of SSL handshake failure entries went up. While there was the possibility this were just some clients not supporting our ciphers and/or TLS versions I had some doubts, but our own monitoring was unsuspicious. So I started to have a look on external monitoring and after some time I cought some interesting errors:

error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error
error:140943FC:SSL routines:ssl3_read_bytes:sslv3 alert bad record mac

The last time I had issues I lowered the thread count so I did this again. And you might guessed it already, this worked out. With 12 threads I had no issues anymore:

        # one process
        nbproc 1
        # 12 threads
        nbthread 12
        # mapped to the first 12 CPU cores (with more then 17 cpus haproxy segfaults, with 16 cpus we have a high rate of ssl errors)
        cpu-map auto:1/1-12 0-11

So we got rid of SSL offloading and the proxy on localhost, with the downside that HAProxy is failing 1/146 h2spec test, which is a conformance testing tool for HTTP/2 implementation, where Apache was failing not a single test.