Putting Antenna in production¶
High-level things¶
Antenna is a Python WSGI application that uses Gunicorn as the WSGI server.
We use nginx in front of that, but you might be able to use other web servers.
Note
Make sure to use HTTPS–don’t send your users’ crash reports over HTTP.
Health endpoints¶
Antenna exposes several URL endpoints to help you run it at scale.
/__lbheartbeat__
Always returns an HTTP 200. This tells the load balancer that this Antenna instance is still handling connections.
/__heartbeat__
This endpoint returns more detailed health data. Depending on how you have Antenna configured, this might write a test file to the Google Cloud Storage bucket or run other checks. It returns its findings in the HTTP response body.
/__version__
Returns information related to the git sha being run in the HTTP response body.
For example:
{ "commit": "5cc6f5170973c7af2e4bb2097679a82ae5659466", "version": "", "source": "https://github.com/mozilla-services/antenna", "build": "https://circleci.com/gh/mozilla-services/antenna/331" }
/__broken__
Intentionally throws an unhandled exception. This helps with testing Sentry or other error monitoring.
This is a nuisance if abused, so it's worth setting up your web server to require basic auth or something similar for this endpoint.
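For example, a quick smoke test of these endpoints from a monitoring script might look like the following sketch. It assumes Antenna is reachable at http://localhost:8000 (adjust for your deployment) and uses the third-party requests library.

import requests

BASE_URL = "http://localhost:8000"  # adjust for your deployment

# The load balancer check should always return 200.
resp = requests.get(BASE_URL + "/__lbheartbeat__", timeout=5)
print("/__lbheartbeat__:", resp.status_code)

# The heartbeat check runs deeper verification (e.g. bucket access) and
# reports its findings in the response body.
resp = requests.get(BASE_URL + "/__heartbeat__", timeout=5)
print("/__heartbeat__:", resp.status_code, resp.text)

# The version endpoint returns build information as JSON.
resp = requests.get(BASE_URL + "/__version__", timeout=5)
print("/__version__:", resp.json())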
Environments¶
Will this run in Heroku? Probably, but you’ll need to do some Heroku footwork to set it up.
Will this run on GCP? Yes–that’s what we do.
Will this run on [insert favorite environment]? I have no experience with other systems, but you can probably get it to work. If you can't save crashes to Google Cloud Storage, you can always write your own storage class to save them somewhere else. A rough sketch of what that might look like follows.
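If you do go that route, the general shape is a class that knows how to persist a crash report. This is only an illustrative sketch, not Antenna's actual interface; check the Antenna repository for the real crash storage base class and method names. The class name, method name, and parameters below are hypothetical.

# Hypothetical crash storage backend that writes crash reports to the local
# filesystem. Antenna's real base class and signatures live in the repository;
# this only illustrates the general idea.
import json
import os


class FileSystemCrashStorage:
    def __init__(self, root_dir):
        self.root_dir = root_dir

    def save_crash(self, crash_id, raw_crash, dumps):
        # Write the crash annotations as JSON and each memory dump as a
        # separate binary file, keyed by crash id.
        crash_dir = os.path.join(self.root_dir, crash_id)
        os.makedirs(crash_dir, exist_ok=True)
        with open(os.path.join(crash_dir, "raw_crash.json"), "w") as fp:
            json.dump(raw_crash, fp)
        for name, data in dumps.items():
            with open(os.path.join(crash_dir, name), "wb") as fp:
                fp.write(data)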
EC2 instance size and scaling¶
Circa 2017
We’re setting up Antenna to run on Amazon EC2 x4.large nodes. Antenna isn’t very CPU intensive, but it is very network intensive (it’s essentially an upload server) and it queues things in memory.
Our cluster is set to autoscale on network in. When network in hits 600,000,000, it adds another node to the cluster. Network in is used as a proxy for the number of incoming crashes.
For Socorro, our median incoming crash is about 400 KB and our typical load is around 1,500 crashes/min.
Antenna on two x4.large nodes can handle 1,500 crashes/min without a problem and without scaling up.
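As a rough sanity check on those numbers (assuming the 400 KB median crash size), the back-of-the-envelope bandwidth works out like this:

# Back-of-the-envelope incoming bandwidth, assuming ~400 KB per crash.
median_crash_bytes = 400 * 1024
crashes_per_minute = 1500

bytes_per_minute = median_crash_bytes * crashes_per_minute
print(bytes_per_minute)             # ~614,400,000 bytes/min of crash payloads
print(bytes_per_minute / 60 / 1e6)  # ~10 MB/s, i.e. modest for a single NIC

That is in the same ballpark as the 600,000,000 network-in autoscaling threshold above, if that metric is sampled per minute (an assumption).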
What happens after Antenna collects a crash?¶
Antenna saves the crash to the crash storage system you specify. We save our crashes to Google Cloud Storage.
Then it publishes the crash to the designated crash queue. We queue crashes for processing with Google Cloud Pub/Sub. The processor pulls crash report ids to process from Pub/Sub.
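This isn't Antenna's actual code, but a minimal sketch of that two-step flow using the google-cloud-storage and google-cloud-pubsub client libraries can make it concrete. The bucket, project, topic, and object paths here are placeholders.

# Illustrative sketch of "save to crash storage, then publish to the queue".
# Names and object paths are placeholders; Antenna's real implementation differs.
import json

from google.cloud import pubsub_v1, storage

BUCKET_NAME = "example-crash-bucket"  # placeholder
PROJECT_ID = "example-project"        # placeholder
TOPIC_NAME = "example-crash-topic"    # placeholder


def save_and_publish(crash_id, raw_crash, dump_data):
    # Step 1: save the crash report to Google Cloud Storage.
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob(f"raw_crash/{crash_id}").upload_from_string(json.dumps(raw_crash))
    bucket.blob(f"dump/{crash_id}").upload_from_string(dump_data)

    # Step 2: publish the crash id to Pub/Sub so the processor picks it up.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_NAME)
    publisher.publish(topic_path, data=crash_id.encode("utf-8")).result()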
Troubleshooting¶
Antenna doesn’t start up¶
Antenna won’t start up if it’s configured wrong.
Things to check:
If you’re using Sentry and it’s set up correctly, then Antenna will send startup errors to Sentry and you can see it there.
Check the logs for startup errors. They'll have the string "Unhandled startup exception". A quick way to scan for that string is sketched after this list.
Is the configuration correct?
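For example, a quick scan for that string might look like this; "antenna.log" is a placeholder for wherever your deployment writes its logs.

# Pull any startup errors out of a log file.
with open("antenna.log") as fp:  # placeholder path
    for line in fp:
        if "Unhandled startup exception" in line:
            print(line.rstrip())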
Google Cloud Storage bucket permission issues¶
At startup, Antenna will try to write to the Google Cloud Storage bucket and, if that fails, will refuse to start up. It does this so that it doesn't start up, receive a crash, and then fail to save it because of permission issues; at that point, you'd have lost the crash.
If you’re seeing errors like:
2024-07-15 23:15:30,532 DEBUG - antenna - antenna.app - Verifying GcsCrashStorage.verify_write_to_bucket
[2024-07-15 23:15:30 +0000] [9] [ERROR] Worker (pid:10) exited with code 3
it means that the credentials that Antenna is using don’t have the right permissions to the Google Cloud Storage bucket.
Things to check:
Check the bucket that Antenna is configured with. It’ll be in the logs when Antenna starts up.
Check that Antenna has the right GCP credentials.
Try using the credentials that Antenna is using to access the bucket yourself; a minimal write test is sketched below.
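A minimal manual check, assuming you run it with the same service account credentials Antenna uses (for example, via GOOGLE_APPLICATION_CREDENTIALS) and Antenna's configured bucket name:

# Verify the credentials can write to (and delete from) the bucket.
# BUCKET_NAME is a placeholder; use the bucket Antenna is configured with.
from google.cloud import storage

BUCKET_NAME = "example-crash-bucket"

blob = storage.Client().bucket(BUCKET_NAME).blob("permission-check/test.txt")
blob.upload_from_string(b"test")  # raises a Forbidden error if permissions are wrong
blob.delete()
print("write and delete to", BUCKET_NAME, "succeeded")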
Logs are getting lost / StatsD data is getting lost¶
Depending on how you’re collecting logs and StatsD data, it’s possible that you might lose this data if Antenna is under so much load that it’s saturating the network interface.
You might see evidence of this in the logs: lines saying a crash was saved with no corresponding line saying it was received, or vice versa. A rough way to compare those counts from a log file is sketched after the list below.
You might see evidence of this in StatsD when the incoming crash count and the saved crash count are off by a large number.
Things to check:
What’s the network out amount for this node? Is it too low?
What happens if you increase the node's capacity? Or, if the node is in a cluster, what happens if you add more nodes?
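One rough way to check for dropped log lines is to count the "received" lines against the "saved" lines over the same window. The substrings and path below are placeholders; match them to whatever your Antenna log lines actually say.

# Rough check for dropped log lines: compare received vs. saved counts.
received = saved = 0
with open("antenna.log") as fp:  # placeholder path
    for line in fp:
        if "received" in line:   # placeholder substring
            received += 1
        elif "saved" in line:    # placeholder substring
            saved += 1
print("received:", received, "saved:", saved, "difference:", received - saved)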
The save_queue_size is > 0 and climbing¶
This means Antenna is having trouble keeping up with incoming crashes.
Things to check:
Increase or decrease the concurrent_crashmovers configuration variable. Too many concurrent crashmovers will cause a single crash to take longer to save. Too few will reduce the efficiency gained from parallelizing around network I/O slowness. If you've already tuned this configuration variable, skip this step. A sketch of this tradeoff follows the list.
Increase the number of nodes in the cluster to better share the load.
Increase the node capacity so that it has more network out bandwidth.
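To see why this knob matters, here is a small standalone sketch (not Antenna code) of saving crashes with a bounded pool of concurrent savers, using asyncio. More concurrency hides network latency by overlapping saves, but past a point each individual save competes for the same bandwidth and takes longer.

# Standalone sketch (not Antenna code): save crashes with a bounded number of
# concurrent "crashmovers". CONCURRENCY plays the role of concurrent_crashmovers.
import asyncio

CONCURRENCY = 8  # illustrative value


async def save_crash(crash_id):
    # Stand-in for the network I/O of uploading a crash to storage.
    await asyncio.sleep(0.1)
    print("saved", crash_id)


async def main(crash_ids):
    semaphore = asyncio.Semaphore(CONCURRENCY)

    async def bounded_save(crash_id):
        async with semaphore:
            await save_crash(crash_id)

    await asyncio.gather(*(bounded_save(cid) for cid in crash_ids))


asyncio.run(main([f"crash-{i}" for i in range(100)]))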