Four machines online hit OOM at the same time: what happened?

Crime scene

Last night we suddenly received a flood of alerts from APM (Application Performance Management), the system we built to monitor and alert on the performance and reliability of our online applications.

Right after that, Operations called to say that all four machines in the online deployment had hit OOM (out of memory) and the service was completely unavailable. Time to investigate, fast!

Troubleshooting

First, Operations restarted the machines to get the online service available again. We then went through the online logs carefully, and the service was indeed unavailable because of OOM.

My first thought was to dump the memory state at the time of the crash, but since Operations had restarted the machines to bring the service back as quickly as possible, the memory at the moment of the incident could no longer be dumped. So I turned to the JVM monitoring charts in our APM instead.

Voiceover: if one approach doesn't work, attack from another angle! And again, monitoring matters: good monitoring can reconstruct the scene of the incident and makes locating the problem much easier.

The number of threads in the application had been climbing steadily since 16:00, reaching roughly 30,000, and even after the restart (blue arrow) it kept growing. Under normal circumstances the thread count is around 600! In other words, the threads being created were never dying. Checking the release records, there was only one suspicious code diff: an extra evictExpiredConnections configuration added when initializing HttpClient.

That localized the problem: it had to be this configuration! (The time the thread count started rising matched the release time exactly.) After rolling out a new build without it, the thread count returned to normal. So what does evictExpiredConnections do that makes the thread count rise continuously? And what problem was this configuration added to solve in the first place? I went to the colleagues involved to understand the whole story.
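For reference, the change in question looked roughly like this. This is a minimal sketch assuming Apache HttpClient 4.4+; everything except evictExpiredConnections itself is illustrative:

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class HttpClientFactory {

    // Roughly what the diff did: evictExpiredConnections() tells the builder
    // to start a background "connection evictor" thread for this client instance.
    public static CloseableHttpClient newClient() {
        return HttpClients.custom()
                .evictExpiredConnections()
                .build();
    }
}
```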

Reconstructing what happened

Recently there had been many NoHttpResponseException errors in production, and the configuration above was added to fix them. So what causes this exception?

Before we get into that, we need to understand HTTP's keep-alive mechanism.

First, look at the life cycle of a normal TCP connection:

As you can see, every TCP connection needs a three-way handshake to establish the connection before data is sent, and a four-way close to tear it down. If the TCP connection is torn down immediately after the server returns each response, then multiple HTTP requests mean the connection is created and torn down over and over, which is clearly expensive when there are many HTTP requests. If instead the server does not close the TCP connection right after responding, but reuses it for the next HTTP request, a large part of the create/teardown overhead disappears, which is obviously a big performance win.

As shown in the figure below, the left side shows multiple HTTP requests without TCP reuse and the right side shows the same requests with TCP reuse. With three HTTP requests, reuse saves the overhead of two TCP setups/teardowns. In theory an application only needs to open one TCP connection and let all other HTTP requests reuse it, so n HTTP requests save the overhead of creating and tearing down TCP n-1 times. That is clearly a huge performance improvement.

Looking back at it, what keep-alive (also known as persistent connections or connection reuse) does is exactly this: reuse connections and keep them alive.

Voiceover: keep-alive has been supported and enabled by default since HTTP/1.1, and most sites run HTTP/1.1 today, which means most of them support connection reuse out of the box.

There is no free lunch. Keep-alive avoids a lot of unnecessary handshakes and teardowns, but because the connection is held open for a long time, a connection with no HTTP requests sits idle, occupies system resources, and can sometimes cost more than it saves. So we usually set a timeout for keep-alive: if the connection stays idle (no data transferred) for the whole timeout period, it is released once the timeout expires, saving system resources.
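On the client side, one way to approximate such a timeout is to cap how long a pooled connection may live. A minimal sketch with Apache HttpClient; the 30-second value is an arbitrary assumption:

```java
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class KeepAliveClient {

    // Limit how long a pooled (kept-alive) connection may be reused.
    // setConnectionTimeToLive caps the total lifetime of a pooled connection;
    // setKeepAliveStrategy could be used instead to honor the server's
    // Keep-Alive header, but a fixed TTL keeps this sketch simple.
    public static CloseableHttpClient newClient() {
        return HttpClients.custom()
                .setConnectionTimeToLive(30, TimeUnit.SECONDS)
                .build();
    }
}
```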

Adding a timeout to keep-alive seems perfect, but it introduces a new problem (no sooner has one wave subsided than another rises!). Consider the following scenario:

The server closes the connection and sends a FIN packet (note: if the server receives no request from the client within the configured timeout, it initiates the close with a FIN to release resources). During the window when the FIN has been sent but has not yet reached the client, if the client keeps reusing the TCP connection and sends another HTTP request, the server, already in the middle of the four-way close, will not accept the message and replies with an RST. When the client receives the RST it reports an error, and that error is the NoHttpResponseException.

Let's use a flow chart to walk through the cause of this NoHttpResponseException more clearly:

After all that effort we finally know why NoHttpResponseException occurs. So how do we solve it? There are two strategies (see the sketch after them):

Retry. When the exception is received, retry once or twice; after the retry the client will pick a valid connection for the request, which avoids the problem. But keep the retry count small to avoid causing an avalanche!

Set up a timer thread that cleans up the idle connections described above at a fixed interval. The interval can be set to half of the keep-alive timeout to make sure the connection is reclaimed before it expires.
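Both strategies can be sketched with Apache HttpClient. The retry count, the 30-second idle limit, and the class name are assumptions for illustration (the idle limit assumes a keep-alive timeout of about 60 seconds on the server):

```java
import java.util.concurrent.TimeUnit;

import org.apache.http.NoHttpResponseException;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class NoHttpResponseWorkarounds {

    public static CloseableHttpClient newClient() {
        return HttpClients.custom()
                // Strategy 1: retry once when a stale connection triggers
                // NoHttpResponseException (keep the retry count small).
                .setRetryHandler((exception, executionCount, context) ->
                        executionCount <= 1 && exception instanceof NoHttpResponseException)
                // Strategy 2: a background thread that evicts expired and
                // long-idle connections from the pool at a fixed interval.
                .evictExpiredConnections()
                .evictIdleConnections(30, TimeUnit.SECONDS)
                .build();
    }
}
```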

evictExpiredConnections uses the second strategy. See the official usage description:

Make this HttpClient instance use a background thread to actively evict idle connections from the connection pool.

Calling this method spawns only one background eviction thread, so why did the application's thread count keep growing? Because we were creating a new HttpClient for every request! Each HttpClient instance created this way calls evictExpiredConnections, so as many requests as were made, that many eviction threads were created!
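The leak pattern, roughly (a hypothetical sketch of the same mistake): every request built a brand-new client, and each client started its own evictor thread that was never stopped:

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class LeakyCaller {

    // Anti-pattern: a new HttpClient per request. With evictExpiredConnections()
    // each build() starts one more background evictor thread, and since the
    // client is never closed here, that thread never exits, so the thread
    // count grows with request volume.
    public String fetch(String url) throws IOException {
        CloseableHttpClient client = HttpClients.custom()
                .evictExpiredConnections()
                .build();
        try (CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            return response.getStatusLine().toString();
        }
        // client.close() is never called, so the evictor thread leaks
    }
}
```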

One more question: why did all four machines online hang at almost exactly the same time? Because of load balancing, the four machines had the same weight and the same hardware configuration, so the requests they received were essentially equivalent. As a result, the background threads created along with each HttpClient peaked on all four machines at about the same time, and they all went OOM at the same time.

Solving the problem

For the problem above, we first changed the HttpClient to a singleton, which guarantees that only one periodic cleanup thread exists after the service starts. In addition, we asked Operations to monitor the application's thread count and raise an alarm when it exceeds a threshold, so the problem can be caught and handled before the application OOMs.
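The fix, in sketch form: build the client once and share it, so only one evictor thread exists for the lifetime of the service (class and method names are illustrative):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public final class SharedHttpClient {

    // One client, and therefore one background evictor thread, per JVM.
    private static final CloseableHttpClient INSTANCE = HttpClients.custom()
            .evictExpiredConnections()
            .build();

    private SharedHttpClient() {
    }

    public static CloseableHttpClient get() {
        return INSTANCE;
    }
}
```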

Voiceover: Again, monitoring is very important to nip the problem in the bud!

Conclusion

This article walked step by step from the symptom of four machines going OOM at the same time to the root cause. As you can see, when we use a library we first need to understand it thoroughly (creating the HttpClient above without a singleton is an obvious problem), and we also need the necessary networking knowledge. To be a competent programmer it is not enough to understand the language itself; networking, databases and the like matter too, and they help greatly with troubleshooting and performance tuning. Once again, good monitoring is very important: by alerting on a threshold in advance, you can kill the problem in the cradle!
