Monitoring network traffic and service chatter with Boundary

We recently published a case study with Boundary on how we, at Gilt Groupe, are using their product. I wanted to give some additional details about our decision process: what we were looking for, what we looked at, and why we decided that going with Boundary was the best choice for us moving forward.

Gilt Groupe’s architecture is now very much a micro-service architecture. We have hundreds of JVM-based HTTP services interacting with each other or with backend systems such as PostgreSQL, MongoDB, RabbitMQ, Kafka, Zookeeper, and many more third-party solutions over various data interchange formats and protocols.

A few months ago, we felt we needed more insight into the amount of traffic going in and out of every service or backend system. When various teams are working on new features that require more communication patterns and data exchange, capacity planning becomes difficult if you don’t know where you currently stand.

Moreover, in our experience, we have seen that most features generally go from a normal usage pattern for months to a sudden very large adoption by our business operations. The amount of data can suddenly grow 1 to 2 orders of magnitude, which does not generally go without its own set of challenges.

To get better insight into the amount of data exchanged, we started by monitoring the data transferred out of our HTTP services (we use Jetty) using the excellent Metrics library from Coda Hale. This can be done trivially by extending the existing Metrics InstrumentedHandler for Jetty:


public class CustomInstrumentedHandler extends InstrumentedHandler { 

  private final Meter bytesTransferred = Metrics.newMeter(handler.getClass(), "bytes-transferred", "responses", TimeUnit.SECONDS);
  
  private final Histogram bytesResponse = Metrics.newHistogram(handler.getClass(), "bytes-responses"); 
  
  // constructor omitted for blog readability
  @Override
  public void handle(String target, Request request, HttpServletRequest httpRequest, HttpServletResponse httpResponse) throws IOException, ServletException {
    final AsyncContinuation continuation = request.getAsyncContinuation();
    try {
      super.handle(target, request, httpRequest, httpResponse);
    } finally {
      if (continuation.isInitial()) {
        long count = request.getResponse().getContentCount();
        bytesResponse.update(count);
        bytesTransferred.mark(count);
      }
    }
  }
}

The client-side effort would be a bit more challenging, however. In our JVM-based services we end up using a menagerie of HTTP clients: AsyncHTTPClient with the Netty 3.x provider, Apache HttpComponents 4.x, Apache Commons HTTPClient 3.x, and the venerable JDK HttpURLConnection.

This is the reality of having to deal with various third-party integrations. It makes things more complicated than we would like, but it would be a bit annoying to rewrite or extend existing SDKs just to use one and only one HTTP client across the platform (especially when they are non-extensible or, worse, closed-source).

The immediate problem faced is effectively how to instrument *all* those clients.

AsyncHTTPClient can be instrumented easily using a RequestFilter and an AsyncHandler. The code would be something similar to the snippet below. There is not much overhead in doing so, as you just need to count the chunk sizes as the HttpResponseBodyPart objects are received.

public class InstrumentedAsyncHttpClientRequestFilter implements RequestFilter {
  private final Meter bytesTransferred;
  private final Histogram bytesResponses;

  // ... initialization omitted for readability

  public FilterContext filter(FilterContext ctx) throws FilterException {
    // Wrap the caller's handler so that every response body part is counted
    return new FilterContext.FilterContextBuilder(ctx)
        .asyncHandler(new MetricsAsyncHandler(ctx.getRequest(), ctx.getAsyncHandler()))
        .build();
  }

  public class MetricsAsyncHandler<T> implements AsyncHandler<T> {
    private AsyncHandler<T> delegate;
    private long totalBytesTransferred = 0;

    // ... initialization omitted for readability

    public STATE onBodyPartReceived(HttpResponseBodyPart bodyPart) throws Exception {
      long bytes = bodyPart.getBodyPartBytes().length;
      totalBytesTransferred += bytes;
      metrics.bytesTransferred.mark(bytes);
      return delegate.onBodyPartReceived(bodyPart);
    }

    public T onCompleted() throws Exception {
      T o = delegate.onCompleted();
      // Record the total response size once the whole body has been received
      metrics.bytesResponses.update(totalBytesTransferred);
      return o;
    }
  }
}

Note that we tend to give each service client a name that maps to a Metrics scope, which makes it easy to tell the metrics of each client apart (some services use a dozen clients).

For all the other clients, instrumentation is too intrusive to be practical. And it doesn’t address how to monitor the traffic going in and out directly through the Socket API, as with Zookeeper, Play Framework (Netty server), MongoDB, JDBC drivers, etc.
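To give an idea of what counting bytes at the stream level would involve, here is a minimal sketch using only the JDK. CountingInputStream and CountingStreamDemo are hypothetical names made up for this post, not code we run in production. The sketch also assumes you can get hold of the stream a client or driver reads from, which is precisely what most of them do not let you do:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper: counts every byte read through it into a shared counter.
class CountingInputStream extends FilterInputStream {
    private final AtomicLong bytesRead;

    CountingInputStream(InputStream in, AtomicLong bytesRead) {
        super(in);
        this.bytesRead = bytesRead;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead.incrementAndGet();
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // read(byte[]) in FilterInputStream delegates here, so bytes are counted once
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead.addAndGet(n);
        }
        return n;
    }
}

class CountingStreamDemo {
    public static void main(String[] args) throws IOException {
        AtomicLong counter = new AtomicLong();
        // A 1500-byte in-memory payload stands in for a socket's input stream
        try (InputStream in = new CountingInputStream(
                new ByteArrayInputStream(new byte[1500]), counter)) {
            byte[] buf = new byte[512];
            while (in.read(buf) != -1) {
                // drain the stream
            }
        }
        System.out.println(counter.get()); // prints 1500
    }
}
```

The counter could just as well be a Metrics meter. The hard part is not the counting; it is injecting such a wrapper into every library that opens its own sockets.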

Another solution would be to write a JVM Java Agent via the java.lang.instrument API to instrument some well-known libraries (NewRelic uses a similar technique, but doesn’t track traffic). While it may look like the least intrusive solution, it is also a fairly significant undertaking to develop instrumentation for several third-party libraries, which you then have to maintain over time.
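For readers who have never written one, the skeleton of such an agent is small. The sketch below is only an illustration under stated assumptions: TrafficAgent, LoggingTransformer and the list of watched packages are hypothetical, and the transformer only records which classes it would rewrite instead of touching any bytecode. The real work (and the maintenance burden) lies in the bytecode rewriting that is left out here.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical agent entry point. The JVM calls premain() before main() when started
// with -javaagent:<jar>, provided the jar manifest has a Premain-Class entry.
class TrafficAgent {
    // Illustrative package prefixes (JVM internal names) of libraries one might instrument
    static final String[] WATCHED = { "org/apache/http/", "com/ning/http/client/" };

    // Classes seen so far that a real agent would have rewritten
    static final Set<String> seen = ConcurrentHashMap.newKeySet();

    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LoggingTransformer());
    }

    static boolean isWatched(String internalName) {
        for (String prefix : WATCHED) {
            if (internalName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}

class LoggingTransformer implements ClassFileTransformer {
    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined, ProtectionDomain domain,
                            byte[] classfileBuffer) {
        if (className != null && TrafficAgent.isWatched(className)) {
            TrafficAgent.seen.add(className);
            // A real agent would return rewritten bytecode here (omitted in this sketch)
        }
        return null; // null tells the JVM to load the class unchanged
    }
}
```

Every library version that changes its internals means revisiting the rewriting logic, which is why we considered this approach a significant undertaking.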

Also, we were looking with interest at adding systems such as Riak and Redis, and possibly various asynchronous drivers, all while having to deal with multiple versions of Scala. This would have been a cool project to work on technically, but maybe not excessively practical.

What we needed was something similar to nethogs minus the text interface. A tool capable of grouping the bandwidth by process, but ideally it would have some features also found in Wireshark.

We did not find anything matching those requirements.

Until a week or two later, when Cliff Moon, Co-Founder and CTO of Boundary, visited our New York office to present Boundary and do a Tech Talk on Distributed Systems (which we blogged about).

We installed Boundary on some of our servers to get a better idea. This was truly a revelation. The installation was painless, with just a single command, and as soon as the agent was up it started reporting data to the dashboard within a second.

image

Each line represents the traffic volume on a given port/protocol across all nodes at a 1-second resolution. Traffic can easily be broken down. For example, you can group servers, either manually or dynamically using pattern matching, which makes it easy to separate your front-end machines from your backend machines and see the traffic flowing between those groups (this is where a descriptive naming policy for your machines comes in handy).

You can further segment your traffic by port / protocol. For example TCP 5432 would be the traffic to/from PostgreSQL. You can then easily analyze the traffic that is going from your backend machines (or a subset of those) to your PostgreSQL. Same thing could be done to know the chatter around our messaging infrastructure on RabbitMQ.
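Conceptually, this segmentation is a group-by on protocol and port. The sketch below illustrates the idea on made-up data and says nothing about how Boundary implements it: Flow and TrafficByPort are hypothetical names, and the byte counts are invented. Port 5432 is PostgreSQL as above; 5672 is the default AMQP port that RabbitMQ listens on.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical flow record: one observed transfer on a given protocol and port.
final class Flow {
    final String protocol;
    final int port;
    final long bytes;

    Flow(String protocol, int port, long bytes) {
        this.protocol = protocol;
        this.port = port;
        this.bytes = bytes;
    }
}

class TrafficByPort {
    // Sums the bytes of all flows sharing the same "PROTOCOL port" key.
    static Map<String, Long> bytesByProtocolPort(List<Flow> flows) {
        Map<String, Long> totals = new TreeMap<>();
        for (Flow f : flows) {
            totals.merge(f.protocol + " " + f.port, f.bytes, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Flow> flows = Arrays.asList(
            new Flow("TCP", 5432, 1000L),  // backend -> PostgreSQL
            new Flow("TCP", 5432, 2000L),
            new Flow("TCP", 5672, 500L));  // backend -> RabbitMQ
        System.out.println(bytesByProtocolPort(flows)); // prints {TCP 5432=3000, TCP 5672=500}
    }
}
```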

Many more details on how all of this can be done are available in the YouTube video ‘Isolate your traffic with filters and conversations’.

There is currently one shortcoming for us: we are effectively losing a bit of visibility into our conversations. Traffic to our services always goes through a set of dedicated service load balancers. For example, we reach a service via a canonical URL such as http://svc-product, and the load balancer balances between node1:7501, node2:7501 and node3:7501. This means that traffic between the caller and the load balancer happens on port 80, while traffic between the load balancer and the callee is on port 7501.

caller ← port 80 → svc-lb ← port 7501 → callee 

As a result, the traffic flowing on port 80 is basically the aggregate of all service traffic. We cannot see the traffic directly from caller to callee, only the aggregate from caller to svc-lb and from svc-lb to callee.

This is something that may be alleviated a bit in the future as we are thinking about removing the load balancer and having applications doing the load balancing themselves using information from Zookeeper.

Boundary settings on the dashboard can be driven entirely from their REST API, which provides the added convenience of being able to integrate with your own configuration management system such as Puppet or Chef and a set of backend applications which may contain metadata about your environment.

The REST API is useful for defining application aliases, which give names to a protocol:port pair (e.g. ‘svc-product’ for TCP 7501), for sending deployment events, or for integrating with other systems (it can subscribe to NewRelic events via RSS).

We have only scratched the surface of Boundary so far, and we are very excited about the direction it is taking and what is being developed. It has already proved extremely useful in identifying traffic volume and patterns between services and databases. Something that would have required a lot more tedious investigative work can now be done in a few minutes, with much more flexibility than we initially imagined and with no direct investment.

I hope that this (long) blog post will be helpful to some people who are facing the same challenges of not having enough visibility in their network traffic. If however you know of any interesting tool in that space, feel free to drop a note.

On a slightly unrelated note, we are also users of a nice library from Boundary called Ordasity. It is a great way to distribute workload across nodes via Zookeeper. It was brought to our attention during Scott Andreas’s tech talk at Gilt Groupe (another one!), and it might be the topic of another blog post.

— stephane

Mobile Web How-To: Inspect Elements On Android’s Internet Browser

I’m building Gilt’s new Android app and a good portion of the website is an Android WebView. As you may or may not know, this WebView uses the default Android Internet Browser to render webpages. You probably know this app best by its logo in the lower right hand corner of this screenshot:

image

When you use Google Chrome on the Android device, inspection is very straightforward — I’ll cover this in a later post. But for Android Internet Browser, there is not to my knowledge a good way to inspect and manipulate the DOM.

I needed to inspect the DOM because I inherited a JavaScript file that allows us to mimic scrolling events on mobile devices via webpages. The scrolling library works as expected in all other browsers on all other devices. So, I needed to better understand what was happening in the Android Internet Browser.

The tool that bridged my device to an inspector tool is Adobe Inspect. To get going, you have to install 3 components:

1. Adobe Inspect on your computer: http://html.adobe.com/edge/inspect/

2. Google Chrome Extension Adobe Inspect: https://chrome.google.com/webstore/detail/adobe-edge-inspect/ijoeapleklopieoejahbpdnhkjjgddem?hl=en

3. Google Play Store Adobe Inspect: https://play.google.com/store/apps/details?id=com.adobe.shadow.android&hl=en


Once you have installed everything, connect your Android device to your computer and make sure you’re on the same WiFi. On your Android Device, open Adobe Inspect and click the plus sign in the upper-right hand corner to get to this screen:

image

Get the IP address of your computer and enter it in Adobe Inspect on your Android device. To skip ahead, you can find the IP address by opening Google Chrome on your computer and clicking on the Adobe Inspect icon in the nav bar; it is shown there as well. After you enter the IP address on the Android device, you’ll see this screen:

image

Now, open Google Chrome and open the url of your choice with the Adobe Inspect extension enabled. In the upper right hand corner, you’ll see the Adobe Inspect icon with a green plus sign.

image

Click on this icon to reveal a menu that displays your computer, your IP address, and your device name.

image

Enter the Passcode you received from your Android device.

image

Now that you’re connected, click on this button next to your device name:

image

Then, a new window should open that looks like this:

image

You can see that this is a standard Google Chrome inspector with the name of your device and the url that is currently being inspected. Click on this link and then click on Elements.

image

Here, you have a standard inspection workflow similar to what you would use for the full-screen experience. You can use the console and other features, although in a more limited way than on the full-screen experience. And Adobe Inspect will highlight on the device which DOM elements are being inspected:

image

There is much more you can do but, hopefully, you’re now set up to debug the Android Internet Browser (not that you needed to debug anything in the first place).

Gilt Mobile Web Nav Redesigned

Hello!

Yesterday (April 23, 2013), I sent to production the third phase of the Gilt Mobile Web (m.gilt.com) redesign. In this phase, I updated the primary and secondary nav on the mobile web so that there is (hopefully!) a much better user experience. And like the first and second phase of the redesign, the goal with the m.gilt.com nav redesign was to incorporate learnings taken from the Gilt iPhone App experience.

Here are some before and after screenshots:

image

On the Sales Listing Page, you’ll see that I moved the stores menu from above the primary nav to below it. This gives the overall navigation a hierarchy that was not present earlier. You’ll also notice that the nav feels less busy because the buttons in the primary nav are more concise.

image

In the Product Listing Page, you’ll see that the back button was changed and that the title of the sale occupies the secondary nav. This allows the user to easily know in which store she is. In the future, the secondary nav on the product listing page will disappear as the user scrolls down.

image

On the Product Detail Page, I tightened the primary nav a little more and there is still no secondary nav.

image

image

On the checkout page, I changed the styling of the Cart and Submit buttons. The colors now match the style that persists throughout the entire Gilt Mobile Web experience. And the lock is gone. Our site is still secure regardless of whether the lock is there or not.

Let me know what you think!

Greg

Gilt Mobile Web Redesigned - 10 Views Compared

Over the last three months, we overhauled the front end (jsp, html, handlebars, less/css, javascript, zepto) for Gilt’s mobile web experience (http://m.gilt.com). The redesign was inspired by learnings acquired from our iPhone App and the design is meant to replicate a lot of those features.

Here are 10 side by side comparisons with notes inline and at the bottom. Let us know what you think!

image

In the previous design, there was an assumption that the Gilt shopper uses the mobile experience quickly and wants to see as many sales as possible in a short amount of space and time. But according to our iPhone App results, our users want to see larger pictures. Having two sales on the first view compared to four has resulted in increased sales, largely due to our amazing imagery.

image

We dropped the black background on the product details because most of our imagery adopted a white background.

image

A nice feature of the redesign is that the add to cart button is visible on the bottom while scrolling on the product detail page. If you are shopping on m.gilt.com and want to purchase an item, you should have the add to cart accessible at all times. But, this doesn’t always work on all devices.

image

We created a more streamlined view for our sign in and registration pages by simplifying the experience.

image

Our cart page features a similar button to what existed on the product detail page. Again, we are trying to make it as easy as possible to shop on a mobile device.

image

The checkout experience mimics the iPhone App experience while trying to keep a style that is more generic so that it looks great on Android as well.

image

While scrolling to the bottom of the checkout page, the submit order button stays fixed. In our older version, the user had to scroll to the button to checkout.

image

We are reusing elements in our account page that we used in the checkout flow.

image

Our footer introduces more spacing as well as elements that were previously found in the checkout and account flow.

image

We have an amazing customer service team.

Here are 5 learnings from redesigning the Gilt Mobile Web experience.

1. Visual Components: Abstract away visual components that can be reused throughout the mobile experience. For example, the buttons in the footer are also used in the account page as well as in checkout. We only need to supply text and an optional image to create one.

2. Cross-Browser Compatibility: Handle browser-specific issues like fixed-positioning on a per-device basis. The fixed “submit button” appears at the bottom of the screen on the product detail page, cart, and checkout page for modern browsers. But for early versions of Android, for example, the buttons appear inline.

3. Iterating: Gilt Mobile Web is a complete experience in that it includes everything from the sales listing page through account and checkout. When trying to roll releases out to production, we searched for places where the old and new designs could coexist for a limited period of time. With this strategy, we didn’t have to redesign everything at once.

4. Cross-Device Compatibility: We tested m.gilt.com against several Android and Apple devices (as well as IE6 and IE7) to make sure it looked good on many different screen sizes. In many instances, DOM element sizes are determined as a percentage of the viewable screen. But since the majority of our users are accessing us from Safari on iPhone, development starts on the iOS Simulator.

5. If it doesn’t work, roll it back: I deployed our second phase to production and immediately upon doing so realized that “sold out” hover states were not… hovering. It was something that we missed and instead of making a quick fix, we rolled it back and spent a little more time testing. We pushed to production again a few days later with bug fixes in place and confidence that the experience was stable. Sometimes, we can’t catch it all on staging environments.

Codecademy ♥ Gilt

It’s with great pleasure that we’re announcing the awesome Gilt API lessons on Codecademy!

Be ready to take your Javascript skills to the next level and learn how to find the latest and most beautiful fashion products with the Gilt API.

Begin to learn our APIs now!

Users can now browse the Gilt daily sales in a native application for the Windows Mobile platform! codeArcher’s Glitter features gilt.com’s flash sales, making use of the Public API to pull in information on the curated sales as well as the products in those sales. If you’re rocking a Lumia or other Windows phone, check this app out and review it in the Windows Phone Marketplace!

NYC Fashion Hack Day 2012: June 16

What’s the NYC Fashion Hack Day? On Saturday, June 16, 2012, Gilt Tech is joining forces with two giants of their respective industries: Apigee, in the business of APIs and cloud platforms, and Tumblr, in the business of hosting and spreading the word about art and creativity and blogs about dogs on top of things. Together, we’re excited to host a 12-hour hackathon event at Gilt HQ in Midtown Manhattan in search of individuals and small teams who want to spend the day building a brand new application: web, desktop, mobile… whatever suits your fancy, we’d love to see you come out and make something!

All three companies will have people on hand to help you get up and running with our APIs & tools if that’s your thing, or else you can just come and work on a project if the idea of making something while meeting and networking with the NYC Tech scene is appealing to you.

At the end of the day we’ll be giving out prizes for best usage of the hosts’ APIs, most fashion-conscious application, and coolest technical accomplishment! Tickets are limited, so sign up today to reserve your spot!

Friday Fun with Gilt Games

The Gilt Public API is a great tool for developing new ways for people to shop quickly, efficiently, or on new platforms. But developers using the API are also finding unexpected things to do with Gilt data: they’re using it to make games.

So in the spirit of Friday, why not spend a few minutes playing with a couple of games that have been built on top of the Public API?

Gilt Memory was built by Karl Norling, a developer here at Gilt. See how quickly you can find the matching pairs of products, then challenge your friends to beat your score!

The Price is Right on Gilt was built overnight at the NYC Powered by MongoDB Hackathon last weekend. This live multiplayer game lets you and up to 3 other friends play a price-guessing game to see who knows Gilt’s products the best. Developer Yufei Liu built the project using Node.js and MongoDB.

Happy Friday, everyone!

Gilt Tech @ NYC Powered By MongoDB Hackathon

What’s better than spending 24 hours in Manhattan hacking on sweet projects using MongoDB and the Gilt API? Doing it while supporting HackNY, a New York City organization that promotes innovation and entrepreneurship among the city’s up-and-coming students and hackers. And that’s exactly what you can do next weekend, April 27-28, at 10gen’s MongoDB Hackathon. A lot of great people, and great companies, are going to be there, and we’re excited to see you come down and make the next cool project. On Saturday night an esteemed panel of judges will choose winning projects and award all manner of prizes, including iPads, Xbox 360s, and more.

We’ll be at the event to share how to get started with the Public API and answer any questions you may have, give you ideas on what you could build, and maybe give you a sneak peek into what we’re working on for the future of the API, so come find us!

For questions about the event, you can send a tweet to @MongoDB. For questions about the API, get in touch @GiltTech or send an email to api@gilt.com!