Category: performance

Scaling Graphite

Graphite is a great tool for collecting metrics on just about anything, and it performs very well on its own: we collect over 300K metrics per second on a single virtual server. But for the hardware we have, that is close to the limit, and we start to lose metrics from time to time. So, weekend exercise: how to scale Graphite to multiple servers.

The goals are:

  • spread the load over more servers so we can keep scaling
  • keep a single interface to consult metrics
  • keep a single interface to publish metrics
  • keep things as simple as possible (but not simpler)
  • have everything deployed with Puppet

The result looks more or less like this (diagram drawn with PlantUML):

[Diagram: graphite-components]

There is already good documentation, I will not repeat it. But I was not able to find a complete code example. My implementation is expressed as a few Vagrant VMs and is available on GitHub.
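
To give an idea of the moving parts, here is a minimal sketch of the relay configuration, the piece that keeps a single publishing interface in front of several carbon-cache nodes. The IP addresses and instance names are placeholders, not the values from my setup:

# carbon.conf on the relay node: accept metrics on the usual line port
# and fan them out to the cache nodes using consistent hashing
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
RELAY_METHOD = consistent-hashing
DESTINATIONS = 10.0.0.1:2004:a, 10.0.0.2:2004:b

On the consultation side, graphite-web keeps its single interface by listing the other nodes in CLUSTER_SERVERS in local_settings.py.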

Still to be done

  • adding more servers to the Graphite cluster means moving existing data around; some scripts need to be integrated for that
  • scripts to clean up corrupted data need to be added to the puppet-carbon module



The cost of HTTP redirects

We all know HTTP redirects are costly. But how costly?

I had this conversation at work. To serve different images for different platforms (desktop, mobile, tablet, …) we have to introduce a level of indirection between the image URL and the actual file. For example, the URL http://mysite.net/images/123/mobile will resolve to a file called 456.png. Pretty standard, should be easy. Unfortunately, we use a framework that would make us write pretty ugly code to achieve this server side. So the question was asked: “Could we not use HTTP redirects? That is a standard web technique…”.

I don't think we should ever expose crap to our clients, even if it allows us to write cleaner code. In this world, appearance is key, not inner beauty. Still, the question was asked: “How much would those redirects cost?”. Should be simple to test…

An evening of coding in front of the TV later, I have a very simple test case, which is probably wrong on so many levels that Christopher Nolan might have written it himself, but still, I have numbers:

Downloading 10 images to my desktop with an HTTP redirect each time takes on the order of 450 ms, while downloading the same images without the redirects takes about 250 ms. On my mobile phone over a 3G connection, the numbers climb to 1200 ms and 750 ms respectively. Not the end of the world, but still, we could do a lot of better things with those hundreds of milliseconds.

The test is available on CloudBees, go see for yourself (your numbers may vary). The source is available on GitHub. The implementation is very simple:

  • a servlet redirecting all requests (sketched just after this list)
  • a servlet that always serves the same image (at a different URL each time, so no caching)
  • an HTML page loading the images with JavaScript and publishing the time it took
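
For illustration, a minimal version of the redirecting servlet could look like the code below. This is a sketch, not the exact code from the repository: the URL pattern and the nanosecond timestamp used to produce a unique image URL each time are my assumptions.

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class RedirectServlet extends HttpServlet {
    @Override
    protected void doGet(final HttpServletRequest req,
            final HttpServletResponse resp) throws IOException {
        // answer with a 302 pointing to a unique image URL; the extra
        // round-trip is exactly what the test measures
        resp.sendRedirect(req.getContextPath() + "/image/"
                + System.nanoTime() + ".png");
    }
}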

Graphing performance of your datasource

One of the projects I am working on is extremely dependent on database performance. It executes an extremely high number of small SQL queries, each usually taking less than one millisecond. If the average query time goes from 1 [ms] to 1.5 [ms] then, after a complex calculation, you can see that we degrade performance by 50%.

Performance on the database side is still considered good. We need a way to measure it very precisely, to see whether performance degradations actually come from the database or whether we need to look for the problem somewhere else. And we need a way to graph that data so we can identify problems quickly.

We already use Graphite and Statsd, and we already use Tomcat with its tomcat-jdbc pool. The tomcat-jdbc pool provides an interceptor mechanism that can be used in our case.

The implementation of our interceptor is minimal. As a first step, let's get the timings of all operations on the connections. For that, we just need to override the invoke() method:

@Override
public final Object invoke(final Object proxy, final Method method,
        final Object[] args) throws Throwable {
    String methodName = method.getName();
    long start = System.nanoTime();

    // delegate to the rest of the interceptor chain (and ultimately the driver)
    Object result = super.invoke(proxy, method, args);

    // publish one counter and one timer per JDBC method
    increment(methodName + ".count");
    timing(methodName + ".timing", System.nanoTime() - start);
    return result;
}

The increment() and timing() methods are taken almost directly from the Statsd Java example. The actual implementation is slightly more complex than what is shown here, to ensure that we also proxy the statements created from the connection and get their timings as well.
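
For the curious, that statement proxying could be done roughly like this, inserted after the call to super.invoke() in the method above (assuming java.lang.reflect and java.sql.Statement are imported). This is a sketch of the idea, not the exact code from the repository:

// wrap whatever createStatement()/prepareStatement() returned in a
// dynamic proxy that applies the same counting and timing logic
if (result instanceof Statement) {
    final Object statement = result;
    result = Proxy.newProxyInstance(
            statement.getClass().getClassLoader(),
            statement.getClass().getInterfaces(),
            new InvocationHandler() {
                @Override
                public Object invoke(final Object p, final Method m,
                        final Object[] a) throws Throwable {
                    long begin = System.nanoTime();
                    try {
                        return m.invoke(statement, a);
                    } catch (InvocationTargetException e) {
                        // rethrow the real cause, not the reflection wrapper
                        throw e.getCause();
                    } finally {
                        increment("statement." + m.getName() + ".count");
                        timing("statement." + m.getName() + ".timing",
                                System.nanoTime() - begin);
                    }
                }
            });
}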

We need to give the interceptor some information: the hostname and port of our Statsd server, the sampling rate to make sure we don't hurt performance too badly, and a prefix to organize our data.
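
To illustrate what those parameters are used for, the helpers could look something like the following. Again, this is a sketch adapted from the Statsd example code; the field names (hostname, port, sampleRate, prefix) and the pre-opened DatagramSocket are assumptions:

// hypothetical helpers following the Statsd UDP wire format
private void increment(final String key) {
    send(prefix + "." + key + ":1|c");
}

private void timing(final String key, final long nanos) {
    // Statsd timers are expressed in milliseconds
    send(prefix + "." + key + ":"
            + TimeUnit.NANOSECONDS.toMillis(nanos) + "|ms");
}

private void send(final String stat) {
    // honor the sampling rate so that we do not flood the network
    if (Math.random() > sampleRate) {
        return;
    }
    String sampled = stat + "|@" + sampleRate;
    try {
        byte[] data = sampled.getBytes("UTF-8");
        socket.send(new DatagramPacket(data, data.length,
                InetAddress.getByName(hostname), port));
    } catch (IOException e) {
        // metrics must never break the application: ignore and move on
    }
}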

Activating the interceptor is done by modifying your Tomcat datasource:

<Resource name="jdbc/TestDB"
          type="javax.sql.DataSource"
          factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
          jdbcInterceptors="ch.ledcom.tomcat.interceptors.StatsdInterceptor(hostname=localhost,port=8125,sampleRate=0.1,prefix=myapplication.jdbc)"
          username="myusername"
          password="password"
          [...]
          driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost:3306/mysql"/>

Conclusion: JDBC interceptors are a great and easy way to get interesting information out of your datasource!

The full code can be found on GitHub; binaries can be downloaded from the repo1 Maven repository.

SiteSpeed.io

I just found out about Sitespeed.io. Basically, it is an integration between a site scraper, a standalone headless WebKit implementation (PhantomJS) and YSlow. It sounds like a simple idea, but the execution is really good! There are plenty of browser plugins and online solutions to measure the client-side performance of your website, but they are not easy to automate. Sitespeed.io allows me to add a simple step to my integration tests and get metrics on the quality of my app each time a developer makes a commit. This should be implemented in my apps next week.

What could be improved?

No, we don’t live in a perfect world, so yes, there is always something that can be improved.

  1. Find a way to collect metrics over time,
  2. integrate with Maven,
  3. support Java 6.

Metrics over time:

I am deeply suspicious of any metric taken on its own. I don't really care if my test coverage is 60% or 80% or 110%. What I care about is that my metrics tell me where I can improve my application, and that my application is getting better over time, not worse. At my workplace, we use Sonar for that: we can make sure that test coverage improves over time, or easily identify which classes are too complex and need to be refactored. I'd love to have the same kind of metrics for client-side code quality as well.

Maven integration:

Maven integration would let us make the collection of metrics part of the standard build process. Integrating SiteSpeed.io in a Continuous Integration tool is pretty easy, so this is definitely a minor point, but a better integration with our build tool would make it even easier for developers to collect metrics. Even if I am not personally a big fan of this approach, it would also allow us to fail the build if a minimum quality level is not reached.

Java 6 support:

Yes, I know, Java 7 is out and is the third best thing in the world (just after sliced bread and portioned coffee). But for those of us stuck in the corporate world, Java 6 is still the norm (when it's not Java 5). It should be pretty easy to compile the dependent JARs for Java 6. I'll have a go at that as soon as I can…

Overall, a great project! I'll tell you more as soon as I have a bit more experience with it.