
A Synthetic Monitoring Case Study

Client-side web application monitoring comes in two main flavors. Real User Monitoring, or RUM, uses an agent that runs within each web page and reports page-load data for every request, typically drawn from the browser’s performance timing API. Gilt uses New Relic for RUM, and it provides a good overview of the application ecosystem. In the old forest/tree metaphor, it’s the forest.

Synthetic monitoring does not run in the web application. Rather, synthetic monitoring vendors provide remote hardware that hits a web site periodically and records what it sees on each request. It’s “synthetic” because it’s not your users’ data; requests are made specifically to collect data on your page loads. But it’s also controlled: the requests are made from predictable hardware over predictable connections. If RUM is the forest, synthetic monitoring is the trees.

Gilt has been using Rigor for synthetic monitoring. It’s interesting to see how the data from each kind of monitoring provides a different perspective on the health of an application.

Different Perspectives

RUM collects data on every web request from every customer. So it’s big data, and our view of that data is going to be broad and not deep. For example, average page load times include every hit to the page, cached requests and uncached requests. These averages make it more difficult to see performance problems when they happen. The following is the performance of our sale listing page for a month in New Relic:

[Image: New Relic RUM chart of sale listing page load times over a month]

Unlike RUM, synthetic monitoring is always dealing with a single uncached request from consistent hardware. This makes it much easier to see when a change affects performance. The following is a single request to the same sale listing page, for the same time period.

[Image: Rigor synthetic chart of single requests to the sale listing page over the same period]

Looks different, doesn’t it? There are obvious spikes and valleys. When we layer the two images on top of each other, we see that one really significant spike in the synthetic chart doesn’t even register on the RUM chart:

[Image: the RUM and synthetic charts overlaid]

So which type of monitoring should we use? The answer is YES. Both RUM and synthetic monitoring give different views of our performance, and are useful for different things. RUM helps us understand long-term trends, and synthetic monitoring helps us diagnose and solve shorter-term performance problems.

Case Study

Around October 1, I noticed an uptick in our sale listing page load time in Rigor. Incidentally, this is the same spike that, as shown above, didn’t really show up on the New Relic RUM chart.

[Image: Rigor chart showing the load-time uptick around October 1]

Looking through the git log of the repo, I didn’t see anything too suspicious. I then isolated the deployed git tag in which the change happened and performed a diff between that tag and the previous tag, comparing their frozen package.json files (containing the fully resolved AMD module versions). That showed a couple of possibilities, so I started looking into Rigor’s data.

Synthetic monitoring tools generally give access to the waterfall charts for every page load. So, I isolated a few waterfalls prior to the uptick and a few following the uptick. What I found: an additional 25 images and a couple of additional requests. I was pretty sure I had my smoking gun.

[Image: waterfall comparison showing the additional image requests after the uptick]

The culprit was most likely a new AMD module, called component.trending_products. Its job is to request products that our customers are currently buying in real time. This is a great feature for our customers, but at this point seemed to be adding upwards of a second of load time to our pages.

Confirming the Suspicion

Fortunately, synthetic monitoring is very good at collecting data on things customers aren’t seeing. I was able to create an A/B test using our configuration service and use it to disable the feature for any visitor with a certain query parameter. This allowed me to run a synthetic test for a few days, which made it very clear that this new feature was the problem. You can see that the lines move parallel to each other; the one slight exception is a load-time outlier that affected the average and 90%/95%/99% lines.

[Image: A/B test chart of load times with and without the trending products feature]
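As an illustration of how a gate like this can work on the client, here is a minimal sketch that checks a query parameter before loading the component. The parameter name, the module’s init() entry point, and the wiring to our configuration service are all hypothetical; it shows the pattern, not Gilt’s actual implementation:

function trendingProductsEnabled() {
  // e.g. visiting /sale/women?disable_trending=true turns the feature off for that request
  return !/[?&]disable_trending=true(&|$)/.test(window.location.search);
}

if (trendingProductsEnabled()) {
  // Load the AMD module and let it render as usual (init() is a hypothetical entry point).
  require(['component.trending_products'], function (trending) {
    trending.init();
  });
}

Running the synthetic test with and without the parameter is what produces the two lines in the chart above.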

Solving the Problem

Happily, issues like these are relatively simple to improve.

The first problem is the number of images. The component fetches JSON data from a service endpoint, runs it through a Handlebars template, and injects the resulting HTML into the page. The data source can contain any number of products to show to the user, typically around 25-30. But the product images are displayed in a carousel:

[Image: the trending products carousel]
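For context, here is a simplified sketch of that fetch-and-render flow; the endpoint, field names, and markup are assumptions for illustration, not the component’s real code:

define(['jquery', 'handlebars'], function ($, Handlebars) {
  // Compile the (hypothetical) carousel item template once.
  var template = Handlebars.compile(
    '<ul class="trending">' +
      '{{#each products}}<li><img src="{{imageUrl}}" alt="{{name}}" /></li>{{/each}}' +
    '</ul>'
  );

  return {
    // Fetch the JSON, render it through the template, and inject the HTML.
    render: function (target) {
      return $.getJSON('/api/trending-products').then(function (data) {
        target.append(template(data));
      });
    }
  };
});

With a plain src attribute on every item, all 25-30 product images download on page load, which is exactly the cost the waterfalls showed.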

As much as we’d like to tell ourselves otherwise, most users are not going to advance the carousel to look at additional products, so there’s really no need to load the off-screen images up front. This is easily accomplished with a template like this:

<img data-gilt-src="/path/to/image" />

This way, the image URI is in the rendered HTML. Of course, since none of the images have src attributes, I didn’t have any images showing on page load. The next step: activate the visible ones immediately after the template renders and is injected. (In the code below, the pageSize variable was created using window.matchMedia and was equal to the number of carousel items showing at each screen resolution.)

// Copy data-gilt-src into src for only the images on the carousel's first
// visible page, so those appear immediately after the template is injected.
target.find('img[data-gilt-src]').each(function (i) {
  if (i < pageSize) {
    $(this).attr('src', $(this).attr('data-gilt-src'));
    $(this).removeAttr('data-gilt-src');
  }
});
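For reference, pageSize itself might be computed along these lines with window.matchMedia; the breakpoints and item counts here are illustrative guesses, not our actual responsive values:

// Number of carousel items visible at the current viewport width.
function getPageSize() {
  if (window.matchMedia('(min-width: 1024px)').matches) {
    return 5; // desktop: five items visible
  }
  if (window.matchMedia('(min-width: 768px)').matches) {
    return 3; // tablet: three items visible
  }
  return 2;   // phone: two items visible
}

var pageSize = getPageSize();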

The images were also missing their quality parameter. We use a service provided by our CDN to compress our JPGs slightly, which saves roughly 30% on the file transfer. Adding ?oq=85 to each image URI immediately reduces the bytes going over the wire.
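A tiny sketch of what appending that parameter might look like before the URI goes into the template; the helper name and where it gets applied are assumptions:

// Append the CDN quality parameter to an image URI.
function withQualityParam(uri) {
  return uri + (uri.indexOf('?') === -1 ? '?' : '&') + 'oq=85';
}

// e.g. product.imageUrl = withQualityParam(product.imageUrl);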

Finally, I just needed a little bit of JavaScript to load additional images when the user interacts with the carousel. I was able to use our carousel’s API to do this (element is already in scope as the carousel container):

// When the user pages the carousel, load every remaining deferred image.
carousel.subscribe('elementSwitched', function (data) {
  element.find('img[data-gilt-src]').each(function () {
    $(this).attr('src', $(this).attr('data-gilt-src'));
    $(this).removeAttr('data-gilt-src');
  });
});

Loading all of the images at this point makes sense no matter how many carousel pages there are. By the time the user is interacting with the carousel, it’s a fairly safe assumption that the page is done loading and that there isn’t too much other HTTP traffic.

All these changes can’t eliminate the load time involved in adding a new feature to a page. However, you can certainly tell when the fixes were released from the chart below, which continues the earlier chart of the A/B test variants. The overall decrease in load times is actually unrelated; it’s the convergence of the two lines, with and without the trending products feature, that shows the improvement.

[Image: continuation of the A/B test chart, with the two lines converging after the fixes were released]

As time has passed, this new feature has settled into adding about 150ms to the page load, which is reasonable for the additional customer benefit it provides.

Synthetic monitoring provides a perspective on your application that RUM can’t. The controlled environment makes problems more visible, and also gives you the opportunity to test your hypotheses and make corrections without requiring customer traffic. This case study confirmed for us the usefulness of synthetic monitoring, and we’re excited to integrate the tool into our workflow more in the future, including testing the performance of new features before we send customer traffic to them.
