I’m not able to reproduce this, but I’d love to dig in and see what this is! I see something on our end, but what you’re seeing is so much more frequent that I’m not at all confident the two are related.
We’ve had a small-but-larger-than-I-would-expect number of timeouts for a while. We’ve been able to whittle it down to a pretty small number (but still larger than I would expect). It’s been tricky to shrink it any further due to reproducibility and logging issues, but if this is nice and reproducible for you, that could be really helpful!
First, I checked to see if we were seeing failed requests or timeouts. Our load balancer gets requests and forwards them to an application server, which processes the request and returns the output to the load balancer, which then responds to the client. I grabbed the last few hours of logs, and the load balancer is seeing many of these requests, but all of the ones in the logs were responded to with a success (200). They have pretty consistent internal response times, too. Many of them are 1 minute apart from each other, but there are definitely frequent gaps.
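For anyone following along, that check was essentially the sketch below. The log path and field positions are assumptions (a combined-log-style access log with the status code in field 9), not necessarily our actual setup:

```shell
# Assumed path and format: combined-style access log, status code in field 9.
# Print every logged request that did NOT get a 200 response.
awk '$9 != 200 { print }' /var/log/loadbalancer/access.log
```

In our case that printed nothing for the requests in question, which is what pointed us upstream of the load balancer.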
This implies that the issue is happening before the load balancer successfully processes this as an HTTPS request, i.e., it’s more likely that something is going wrong somewhere between your computer and the load balancer’s “yup, got this request, thanks!”.
As part of reducing timeouts, we set up monitoring where StatusCake tries a variety of pages every few seconds from around the world. StatusCake recommends confirming “downtime” from more than one location before counting it as downtime. We do that for our alerts, but we also have a second copy of the monitors that counts a single failed request from a single location as downtime.
The “check from 2 locations every minute, and if they’re both down, mark it as downtime” monitor lists the last downtime as November 30th, 2020. (Woooo!) The “check from a single site and scream if even a single one has a problem” monitor is a little more interesting.
May 1: 100%
May 2: 100%
May 3: 100%
May 4: 100%
May 5: 100%
May 6: 99.92%
May 7: 100%
May 8: 100%
May 9: 100%
May 10: 100%
May 11: 99.92%
May 12: 99.92%
May 13: 100%
May 14: 100%
May 15: 100%
May 16: 100%
May 17: 100%
May 18: 99.92%
May 19: 99.92%
May 20: 99.47%
May 21: 99.69%
May 22: 99.66%
May 23: 99.71%
May 24: 99.56%
All of these issues are marked “Timeout/Connection Refused”.
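To put those percentages in perspective: assuming the single-location monitors also check once a minute (an assumption; that’s the cadence of our paired check), there are 1440 checks per day, so the math works out to roughly one failed check on a 99.92% day:

```shell
# 1 check per minute (assumed) = 1440 checks/day.
# A 99.92% day is about one failed check:
awk 'BEGIN { printf "%.2f\n", 1440 * (1 - 0.9992) }'   # → 1.15
# The worst day, May 24th at 99.56%, is about six:
awk 'BEGIN { printf "%.2f\n", 1440 * (1 - 0.9956) }'   # → 6.34
```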
They don’t show up in the load balancer logs, and they don’t correlate with times of increased traffic or resource utilization on the load balancer. We haven’t changed anything in our DNS or in our load balancer since before May 1st, and if this were related to an application change, I would really, really, really expect our load balancer to log the incoming requests and mark that there was an error getting the response back from an application server. (We do see this when that does happen.)
I’ll work on creating a ticket with support at our host about that increase since May 18th.
From what I’m seeing in the load balancer logs, you’re having issues more like 50% of the time, not the 0.5% that StatusCake is seeing.
Just to sanity-check all of this, I grabbed my goals.json 1000 times, with 1 second between requests, and all of them succeeded.
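If you want to replicate that check, it was roughly the loop below. The URL is a placeholder, so swap in your own goals.json endpoint (and keep your auth token out of anything you paste here):

```shell
# Placeholder URL: substitute your actual goals.json endpoint.
URL="https://example.com/api/v1/users/me/goals.json"
fails=0
for i in $(seq 1 1000); do
  # -s: quiet, -o /dev/null: discard body, -w: print only the HTTP status.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$URL")
  [ "$code" = "200" ] || { fails=$((fails + 1)); echo "run $i: got $code"; }
  sleep 1
done
echo "$fails failures out of 1000"
```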
OK, that’s a lot of words to say: “Thanks! We’ve had issues with network timeouts in the past, but have resolved many of them. We’re not seeing errors in our logs for your requests, even at the very front of the load balancer, but we are seeing gaps that presumably correspond to your timeouts. There’s possibly a systemic issue that started on the 18th of May that is increasing our timeouts to nearly 0.5%, but you’re seeing a lot more than 0.5% go missing!”
First question: Can you run curl -v maybe ten times and take a look at the output? Do the requests that fail look any different from the ones that succeed? (Make sure not to post your auth token here!)
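If it helps, something like this captures all ten verbose transcripts so a failing run can be compared side by side with a succeeding one (placeholder URL again; use the endpoint that’s timing out for you):

```shell
# Placeholder URL; substitute the endpoint that's timing out for you.
# curl -v writes its transcript to stderr, so redirect that to a per-run file.
URL="https://example.com/api/v1/users/me/goals.json"
for i in $(seq 1 10); do
  curl -v -o /dev/null --max-time 30 "$URL" 2> "curl-run-$i.log"
  echo "run $i: curl exit code $?"
done
# Then diff a failing transcript against a succeeding one, e.g.:
#   diff curl-run-3.log curl-run-4.log
```

Before posting any of those transcripts, scrub the Authorization header (and any cookies) out of them.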