But there always seem to be things that just aren't covered.
We use Zabbix, StatsD, CollectD and MMS.
But we still find things that aren't handled the way we want.
Currently working on an API monitoring solution that can:
- do chains of calls that need to happen in order
- call into various levels of our stack (we have a lot of redundancy which can mask individual component failures... but we'd obviously still like to know about them)
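A rough sketch of how those two requirements might look in Python. Everything here is hypothetical: `run_chain`, `check_layers`, and the step/probe callables are names made up for illustration, not part of any of the tools mentioned above. Each step is a plain callable so the HTTP client of your choice can be plugged in.

```python
from typing import Callable, Dict, List, Tuple

# A step is (name, callable). The callable receives the shared context dict
# built up by earlier steps and returns new values to merge into it.
Step = Tuple[str, Callable[[Dict], Dict]]


def run_chain(steps: List[Step]) -> Dict:
    """Run steps in order, stopping at the first one that raises."""
    context: Dict = {}
    for name, step in steps:
        try:
            context.update(step(context))
        except Exception as exc:
            # Report which step broke and what had succeeded before it.
            return {"ok": False, "failed_step": name,
                    "error": str(exc), "context": context}
    return {"ok": True, "failed_step": None, "error": None, "context": context}


def check_layers(probes: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Probe every layer independently so redundancy cannot hide a failure."""
    results: Dict[str, bool] = {}
    for layer, probe in probes.items():
        try:
            results[layer] = bool(probe())
        except Exception:
            # A probe that blows up counts as an unhealthy layer.
            results[layer] = False
    return results
```

The chain stops at the first failure because later calls usually depend on earlier ones (a login token, a created resource ID), while the layer check deliberately runs every probe so one healthy load balancer does not mask a dead backend.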
The general solution should also be able to support 'app specific' monitoring.
For example:
- use PyMongo to query various MongoDB values... like whether the balancer is enabled.
- use the Python Requests module to query RESTful endpoints on our HAProxy instances & ELBs to confirm they are up and healthy.
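Roughly what those two app-specific checks could look like. This is a sketch under assumptions: `balancer_enabled` and `endpoint_healthy` are made-up names, the balancer state is assumed to live in the `config.settings` collection as a `stopped` flag (newer MongoDB versions also offer a `balancerStatus` command), and "healthy" is assumed to mean an HTTP 200. The client and session are passed in, so a `pymongo.MongoClient` and a `requests.Session` are what you would hand over in practice.

```python
def balancer_enabled(client) -> bool:
    """Return True if the sharding balancer is enabled.

    `client` is expected to be a pymongo.MongoClient connected to a mongos.
    Assumption: the state is stored in config.settings under _id "balancer".
    """
    doc = client["config"]["settings"].find_one({"_id": "balancer"})
    # No document, or no "stopped" flag, means the balancer is on by default.
    return not (doc or {}).get("stopped", False)


def endpoint_healthy(session, url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200.

    `session` is expected to be a requests.Session (or anything exposing
    a compatible .get(url, timeout=...) method).
    """
    try:
        response = session.get(url, timeout=timeout)
    except Exception:
        # Connection errors and timeouts both count as "not healthy" here.
        return False
    return response.status_code == 200
```

With the real libraries this would be called as something like `balancer_enabled(pymongo.MongoClient(...))` and `endpoint_healthy(requests.Session(), url)`, where the connection string and health-check URL depend on your setup.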
And cuz this feels like an exceedingly verbose and not visually appealing post, here is a link to something on devops reactions:
"Before diving into Legacy Code"