Graphite is a great tool for collecting metrics on anything, and it performs well on its own: we collect over 300K metrics per second on a single virtual server. But for the hardware we have, that’s close to the limit, and we start to lose metrics from time to time. So, weekend exercise: how to scale Graphite to multiple servers.
The goals are:
- increase the number of servers to allow more scalability
- keep a single interface to query metrics
- keep a single interface to publish metrics
- keep things as simple as possible (but no more)
- have everything deployed with Puppet
The result looks more or less like this (based on PlantUML):
Good documentation already exists, so I will not repeat it. But I was not able to find a complete code example. My implementation is expressed as a few Vagrant VMs and is available on GitHub.
Still to be done
- adding more servers to the Graphite cluster means moving existing data around; some scripts need to be integrated for that
- scripts to clean up corrupted data need to be added to the puppet-carbon module
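To illustrate why adding a server means moving data, here is a minimal Python sketch of consistent hashing, the general technique a relay can use to decide which server stores a metric. It is a simplified illustration and not carbon's own implementation: the `HashRing` class, the `moved_metrics` helper, the node names (`graphite-1`, ...) and the metric names are all hypothetical. It shows that when a node joins the ring, a share of the existing metrics now maps to the new node, so their stored data has to follow.

```python
import bisect
import hashlib


class HashRing:
    """Simplified consistent-hash ring (illustrative, not carbon's own code)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node, to even out the spread
        self._ring = []           # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(key):
        # Map a string to an integer position on the ring.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Place several virtual points for this node on the ring.
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._position(f"{node}:{i}"), node))

    def get_node(self, metric):
        # First ring entry at or after the metric's position, wrapping around.
        index = bisect.bisect_left(self._ring, (self._position(metric), ""))
        return self._ring[index % len(self._ring)][1]


def moved_metrics(metrics, old_nodes, new_nodes):
    """Return the metrics whose owning node differs between the two clusters."""
    old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
    return [m for m in metrics if old_ring.get_node(m) != new_ring.get_node(m)]


if __name__ == "__main__":
    # Hypothetical metric and node names, for illustration only.
    metrics = [f"servers.web{i}.cpu.load" for i in range(1000)]
    moved = moved_metrics(
        metrics,
        ["graphite-1", "graphite-2"],
        ["graphite-1", "graphite-2", "graphite-3"],
    )
    print(f"{len(moved)} of {len(metrics)} metrics change server")
```

With this kind of scheme only the metrics that land on the new node need to be moved; the rest stay where they are. That is the list a data-migration script would have to work from.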