Production httpbin site randomly timing out #12

@dmuth

A few days ago I noticed that https://httpbin.dmuth.org/ had started hanging intermittently with no obvious cause. My dashboards would look like this:

[Two dashboard screenshots (Dropbox Capture)]

...and I started seeing errors like these in the logs from fly.io:

could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)

I then SSHed into the instance and saw that uvicorn was using 100% of the CPU.

I also poked around in /proc/ and saw that there were only about a dozen file descriptors open, so file descriptor exhaustion does not appear to be the cause.
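
For anyone wanting to repeat the /proc/ check, here is a minimal sketch of how the open file descriptor count could be compared against the process limit. It assumes a Linux host with /proc mounted; the function names `count_open_fds` and `soft_fd_limit` are made up for illustration and are not part of this project.

```python
from pathlib import Path


def count_open_fds(pid="self"):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(list((Path("/proc") / str(pid) / "fd").iterdir()))


def soft_fd_limit(pid="self"):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits.

    Returns None if the limit is 'unlimited' or the row is missing.
    """
    text = (Path("/proc") / str(pid) / "limits").read_text()
    for line in text.splitlines():
        if line.startswith("Max open files"):
            # Columns after the three-word name: soft limit, hard limit, units
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


if __name__ == "__main__":
    # Replace "self" with the Uvicorn PID when inspecting the server process.
    print("open fds:", count_open_fds(), "soft limit:", soft_fd_limit())
```

A count that stays far below the soft limit, as in the dozen or so descriptors seen here, points away from descriptor exhaustion.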

I have tried the following so far, but none of it has resolved the problem:

  • ✅ Restarting the VM
  • ✅ Scaling the machine count to 0 and then back to 1 with the fly scale command to spin up a new machine
  • ✅ Running fly deploy again
  • ✅ Turning off Fly's raw TCP check, in case it was tripping up Uvicorn somehow

I am continuing to investigate, and have a few other things to try:

  • ✅ Turning off the HTTP check from Fly.io
  • ✅ Adjusting the URLs that NodePing is hitting
  • ✅ Upgrading FastAPI to the latest version and redeploying (this is in progress)
  • ✅ Increasing the number of workers to 3
  • Seeing if I can capture log output from Uvicorn by setting an environment variable.
  • Changing the server to Hypercorn
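
As a rough sketch of the log-capture idea above, the level of Uvicorn's loggers could be driven by an environment variable. The variable name `APP_LOG_LEVEL` and the helper `configure_uvicorn_logging` are hypothetical; the logger names `uvicorn`, `uvicorn.error`, and `uvicorn.access` are the ones Uvicorn uses.

```python
import logging
import os

# Logger names used by Uvicorn itself.
UVICORN_LOGGERS = ("uvicorn", "uvicorn.error", "uvicorn.access")

# Accepted level names; anything else falls back to INFO.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def configure_uvicorn_logging(environ=os.environ, var="APP_LOG_LEVEL"):
    """Set Uvicorn's logger levels from a (hypothetical) environment variable.

    Returns the numeric level that was applied.
    """
    level = LEVELS.get(environ.get(var, "INFO").upper(), logging.INFO)
    for name in UVICORN_LOGGERS:
        logging.getLogger(name).setLevel(level)
    return level
```

Uvicorn applies its own logging configuration at startup, so a helper like this would probably need to run afterwards (for example from an application startup hook); otherwise the server's own log level setting is likely to take precedence.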

Labels: In Progress (actively being worked on)