
AWS SDK for Java, version 2.0

Sean Sullivan cloud

The Capital Region AWS User Group met on January 18th at the Nanotech Complex in Albany, New York. CommerceHub hosted the meeting at their main office.

The topic of this month’s meetup was the AWS SDK for Java. At HBC, our development teams use the SDK to access AWS services such as DynamoDB, S3, CloudWatch, and SNS. The v1 SDK has been a core building block at HBC since 2014.

In June 2017, Amazon released a new implementation of the SDK for Java.

The version 2.0 SDK is available as a developer preview. HBC is evaluating the new SDK and we look forward to using it in production later this year.

Our engineering team has already started incorporating the v2 SDK into our helper libraries:

The v2 API uses java.util.concurrent.CompletableFuture to encapsulate the result of an AWS service call. HBC’s Scala libraries will use FutureConverters to convert Java CompletableFuture objects into Scala Future objects.
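
For illustration, here is a minimal sketch of that conversion using the scala-java8-compat FutureConverters; the fetchItem call is a hypothetical stand-in for an async v2 SDK call, not our actual library code:

import java.util.concurrent.CompletableFuture
import scala.compat.java8.FutureConverters
import scala.concurrent.Future

object FutureConversionSketch {
  // Hypothetical stand-in for an AWS SDK v2 async call (e.g. a DynamoDB read)
  def fetchItem(): CompletableFuture[String] =
    CompletableFuture.completedFuture("item-from-dynamodb")

  // Convert the Java CompletableFuture into a Scala Future
  val result: Future[String] = FutureConverters.toScala(fetchItem())
}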

If you want to learn more about the v2 SDK, review my slide deck or watch Kyle Thomson’s re:Invent 2017 presentation.


Career Structure. It doesn't matter. Until it matters.

Adrian Trenaman leadership

In this article, I’m going to talk about career structure, career development, and career titles in a tech organisation. This post is more about organisational development than it is about technology; however, on the grounds that the health of your architecture and technology choices will be somewhat isomorphic to the health of your organisation, I believe this to be a worthwhile read for any engineering leader. I’m also writing this for anyone who is considering joining HBC Tech, so that they can understand our approach, and the meaning we give to our titles.

I’ll cover our motivation and need for a career structure, the core principles that drove our solution, and, perhaps most importantly, our experience in transitioning to that solution in a Dunbar-sized technology organisation at HBC Tech.

“I don’t care about my title.”

Titles are an amazing thing. When I joined Gilt in 2011, I didn’t care about titles; I was happy to shed my shiny ‘Distinguished Engineer’ title from IONA Technologies and join Gilt as a Senior Engineer. It was easy: the general rule was that every engineer at Gilt apart from our CTO was a Senior Engineer, and the democratic, meritocratic, startup culture obviated the need for titles.

However, over time something interesting happened, not quite to me directly, but to the organisation as a whole. Blips started to appear in the organisational title space - a new hire would arrive as a ‘Principal’, and peers would wonder “what does that mean?”. Or, someone would become a team lead, or Director of Engineering, which raised the question: is that higher than an Engineer, or somehow equivalent? And then blips started attracting attention from outside of our organisational radar: our team would see their peers in the industry getting promotions and fancy titles, and wonder whether they were somehow stagnating in their own careers.

Organisations develop and grow over time; the tech leadership team at Gilt realised in February of 2013 that the organisation needed and wanted a clear career structure, something different from our original loose, startup-y approach. The career structure we landed on at Gilt served us well for the subsequent three years. More recently, in 2017, we set out to create a career structure for the larger HBC Tech organisation; we wanted to design it carefully so that it would align with our cultural values and build on our experience. We started with a few principles to guide our thoughts.

The principles

Here’s what we felt was important:

  • Meritocracy (or, you earn your title): No-one is ‘entitled’ to a title: just because you’ve been here a while doesn’t mean you qualify for a promotion. Ever. We expect you to perform at the next level for a sustained period - at least six months - before you can get promoted. A good metric of your success in this regard is that your promotion should be a surprise to nobody; the only surprise should be a colleague saying “Hey, I thought you were at that level already!”.

  • Equivalence of Individual Contributors and People Leads: We feel strongly about this. We hire talented people who design and build great things, and we wanted to create a career development path that encourages development of the skills that make you great. No one should ever feel that they have to become “a manager” to get promoted. Equivalence should apply not just to compensation but also to day-to-day operations: for example, in our Dublin office we extended our regular “Dublin Leads Meeting” to a “Dublin Leads & Principals Meeting”.

  • We want Leaders, not Managers: When you have talented individuals working in small, autonomous, well-knit teams, you don’t need to manage. We develop our leaders from within: for us, leaders are people who love their craft, but also get a real kick out of leading people: showing the way, setting direction, getting alignment, and making the team successful.

  • Encourage people to try leadership out, with a route back to Individual Contributor: The idea of putting yourself forward as a team lead is daunting for many individual contributors; some of that may be down to humility (“Why do I think I’m better equipped to lead than everyone else on my team?”), or fear of failure (“What happens if I’m just not good at this?”). Perhaps the biggest fear is the idea of losing your grasp on the skills and talents that make you great as an engineer in the first place. We wanted to create an environment where individual contributors could experiment with leadership roles, and have an avenue back to individual contributor later should they so desire. I must confess: the first time I saw a team lead return to an individual contributor engineering role and be replaced by another more junior team member as lead, I honestly thought it would never work. I was wrong. The original lead (now engineer) and new lead just got stuck in at what they were good at, and the team was great.

With these principles in place, the question became “What kind of career structure can support this?”

What do other organisations do about career titles?

We triangulated ourselves off our understanding of organisations like Netflix, Facebook and Google, and how they manage their career structures. Organisations don’t often publish their career structures; however, pulling together information from a number of sources and contacts, we saw some interesting patterns:

  • Netflix: Netflix operates a very simple structure. Ultimately, every engineer is a Senior Engineer, and there is a separate track for leadership, including Manager, Director, VP.

  • Google: Google has a multi-tier ‘levelled’ system, with individual contributor levels ranging from Level 1: Software Engineer I to Level 10: Senior Google Fellow, and a parallel engineering management track that ranges from level 5 to level 10.

  • Facebook: Facebook also has a multi-tier ‘levelled’ system. Engineers range from Level 3 (Software Engineer) up to Level 9, with associated Manager and Director roles on a parallel track above Level 5. Anecdotally, once an individual elects into the management track, the intent is that they will remain in that track.

  • Gilt: At Gilt, pre-2013 we had a structure similar to Netflix (see above!), and we found that this structure didn’t give a compelling career prospect for individual contributors in our engineering teams. Ironically, for a tech team with a strong startup mentality, we looked to the industry and adopted what outwardly seems to be a fairly traditional approach. Based on the Radford methodology, we created a career path and approach that had the following qualities:

    • Recognised different levels of ability, impact, contribution, scope and influence across individual contributor and leadership tracks; and,

    • Encouraged folk to jump across the Individual Contributor / Leadership boundary as their career develops.

The roll-out of the approach at Gilt was positive: it became clearer to staff who their peers were, and what ‘the next level’ really looked like. Career development discussions went from the abstract to the concrete. Perhaps the biggest result was that a generation of individual contributors, who had resisted the idea of taking on a leadership role, gave leadership a try and discovered they were good at it. More so, on our engineering teams we developed a culture of ‘leaders-who-code’: we found that engineering leads up to the level of Director and Senior Director continued to make significant impacts in terms of engineering contributions. Result!

The HBC Tech Career Path

After HBC acquired Gilt in 2016, the subsequent merger of the tech teams presented a challenge in terms of career paths: which career structure should we adopt, the existing HBC career structure, or the Gilt career structure? Rather than just pick one of the existing structures, we took the opportunity to rethink both, and landed on a structure loosely modelled on ideas from Radford, Google and Gilt, as per the table below.

Level  Individual Contributor   Leadership
1      Software Engineer I      -
2      Software Engineer II     -
3      Software Engineer III    -
4      Senior Engineer          Lead Engineer
5      Staff Engineer           Senior Lead Engineer
6      Principal Engineer       Director of Engineering
7      Distinguished Engineer   Senior Director of Engineering
8      Fellow                   VP Engineering
9      -                        SVP Engineering

This approach has all the qualities we were looking for: equality of levels, the ability to experiment with leadership at the senior levels, and support for our culture of valuing leadership and autonomy rather than management and rule.

For each level, we worked out key indicators of what it means to be at that level. For example:

  • Level of knowledge: how much domain knowledge / expertise is required and expected at a particular level?

  • Job complexity

  • Supervision: how much supervision / hand-holding should we expect the individual to need at a particular level?

  • Experience.

  • Sphere of influence. This is one of my favourite indicators: at early levels, the sphere of influence is really ‘self’. As we get to subsequent levels, we expect individual contributors and leads to influence their teams, department, tech, the wider organisation, and the industry & community.

  • Team size (for leads). There are some good rules of thumb in terms of team size:

Lead Level  Title            Team size
4           Lead Engineer    7 ± 2 (a pizza-size team)
5           Senior Lead      10 ± 2 (a large team, or team of teams)
6           Director         20 ± 4 (a classroom-size team of teams)
7           Senior Director  25+
8           VP Engineering   ~80 (e.g. an engineering site / office)
9           SVP Engineering  ~150 (a Dunbar)

  • Accountability: what systems do you own or are responsible for, and how critical are they to the business?

One big learning is that these indicators are only guidelines. I dislike the idea of career advancement being a box-ticking exercise: there is a qualitative judgement that we as leaders need to apply, and sometimes a candidate’s excellence in one area may make up for a deficiency in another, or make up for organisational blockers. For example, we once had a Director of Engineering with a small team of four or five reports; you could argue that this Director didn’t have a large enough team to warrant the title. However, when we considered his technical ability, the scope of the role, the level of accountability, and the fact that we would most likely never have a classroom-size team in this area, we felt great about promoting him.

Qualitative judgement of levels is, of course, hard. One idea we found helpful was to look at peer groups, to ensure that staff at the same level are in the same peer group. If we find that a member of staff has ended up grouped with colleagues who seem to be at a different level, we re-examine the case to make sure our evaluation has been correct and fair.

Rolling out a Career Path

With our new career path in place we then had to figure out how to roll it out. Ultimately, rolling out a new career structure means you may be changing your people’s titles: we wanted to make sure that we did that in a sensitive and proactive way. For example: if someone’s existing title is ‘Software Engineer’, then the question is what level they are in the new system (I, II, III). How will someone feel if they think they’re a III, but we think they’re a II?

We settled on a couple of core ideas to help us make a smooth transition:

  • There are no salary or bonus adjustments as part of transitioning to the new structure. No one gets a pay cut, no one gets a pay rise: the focus is on getting the track and level right.

  • We apply the level as is: there are no promotions as part of the levelling. We didn’t want to use this exercise as a way for folk to canvass for a promotion.

  • We involved each individual in the selection of their level, through 1:1s with their current directors and leads. At the end of the exercise, everyone knows their level (1-9) and track (Individual Contributor / Lead), and has been part of the process.

  • Individuals can adopt their new title publicly if they wish, or retain their current title. Whatever about our internal career titles, we didn’t feel it was right to force people to change their LinkedIn profiles!

  • Everyone gets evaluated at their level and track going forward. This is a nice feedback loop back to having individuals involved in the selection of their role! Sure, maybe you did persuade your lead that you’re a Software Engineer III: if so, then you’ll be evaluated against your peers… you better be up to it :)

So then, we began the roll out! With a Dunbar-sized subset of the organisation we formulated a simple plan:

  • Communicate: let everyone know what’s going on! You really can’t communicate enough on these things: despite your best intentions, there will always be someone who “didn’t get the memo” or wasn’t there when you talked about it at an All Hands meeting. Communicate often across multiple channels, and be mindful of people who may have missed out.

  • Educate: after the initial communications, we moved to educate our folk in groups of about 20, with a one-hour walkthrough of the career path, where it came from, why it’s important, and backed up with all the materials. This is a nice ‘classroom’ size: big enough to scale the communications, but small enough that folk feel comfortable talking about the issues, and asking questions.

  • Get personal: after the education sessions, we ran our 1:1s with folk to settle levels.

  • Close the loop with HR! In almost every organisation I’ve ever worked for, the system HR uses to store titles and levels has been different from the spreadsheet I’ve been working off. It’s crazy after all this levelling work to think that six months later someone might get an official letter with the wrong title! As a general rule in life, it’s always good to hammer the nail all the way in.

“It’s nice to have a title, so then you don’t have to care about it any more.”

The irony of all of this is that, really, still, I don’t care much about titles! When I was a tech consultant in a previous job we always had this saying: “You’re only ever as good as your last gig.” Likewise, when it comes to titles, there’s no resting on your laurels, proudly shining and polishing your trophy title: you must deliver, every day, for your team and for the organisation. That said, there are a couple of real benefits that we’ve seen from this work:

  • Having clarity on titles means that any fear, uncertainty, envy or distrust related to previous title confusion is now gone. Everyone knows where they are, and can just get down to work.

  • Career development conversations just got a whole lot more interesting. Now we can have meaningful conversations with our staff on where they are, where they want to go, and, how they can get there.

One final piece of advice: title systems and career paths are frameworks that exist to help us; they’re a tool that should enable us, not restrict us or box us in. While we have leaders-who-code (e.g. directors of engineering optimising our caching architecture), we also have coders-who-lead on some of our teams (e.g. principal engineers leading teams working on a technical area like ElasticSearch); and, this is as it should be. We flex the framework to our needs. And, if we find that this framework breaks, we’ll fix it, and perhaps write another post to let you know what we learnt.


Sundial AWS EMR Integration

Giovanni Gargiulo aws data machine learning

AWS Elastic Map Reduce on Sundial

Today I want to talk about a recent improvement we implemented in Sundial, an Open Source product launched by Gilt in early 2016. With Sundial 2.0.0 it’s now possible to schedule AWS Elastic Map Reduce jobs.

For those of you who are not familiar with it, Sundial is a batch job scheduler, developed by the Gilt Personalization Team, that works with Amazon ECS and Amazon Batch.

Before jumping into the nitty gritty details, it’s worth taking a deeper dive into the current batch job processing setup in Gilt and the challenges we have recently started to face.

We will quickly cover the following areas:

  • the current batch jobs setup
  • batch job scalability

Batch processing today

Every night, the Gilt Aster data warehouse (DW) is locked down in order to update it with the latest data coming from the relevant areas of the business. During this lock, Extract-Transform-Load (ETL) suites, or ELT as we prefer to call them, are run. When all the jobs complete, the DW is unlocked and normal access to Aster resumes. A number of client systems rely on the DW; the most relevant are BI tools (e.g. Looker) and Sundial. Sundial in particular is used in personalization for scheduling additional jobs and building Machine Learning models. Since there is no synchronization between Aster and Sundial, when Aster occasionally takes longer to complete, Sundial jobs fail because the DW is still locked down or the data is stale.

Performance degradation

Because Aster is a shared resource, and the number of jobs relying on it is increasing day by day, in the past few weeks we’ve experienced significant performance degradation. This issue is particularly amplified at a specific time of the week, when BI reports are generated. The result is that batch jobs and reports are taking longer and longer to complete. This, of course, affects developer experience and productivity.

EMR adoption

On top of all the nuisances above, additional operational time is spent restarting failed jobs. Furthermore, when developing a new model, most of the time is spent extracting and massaging data, rather than focusing on the actual job logic.

It became clear that Aster was no longer a good fit for us and that we needed to migrate to a better, more elastic platform.

The solution we were looking for should:

  • work with multiple data formats
  • be scalable
  • be owned by the team
  • be easy to integrate with our scheduling solution

We didn’t have to look far to find a great candidate to solve our problems: Spark running on AWS EMR (Elastic Map Reduce). Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

A complete list of open source applications (or components) running on AWS EMR can be found here.

AWS EMR also offers a convenient SDK to spin up a new EMR cluster dynamically, run a job and tear down resources on the fly, plus per-second billing, which makes the whole platform very cost-efficient.
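
As a rough illustration, here is a minimal sketch of that spin-up/run/tear-down flow using the EMR client from the AWS SDK for Java (v1) in Scala; the cluster, jar and class names simply reuse the example values that appear later in this post, and this is a sketch rather than Sundial’s actual implementation:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
import com.amazonaws.services.elasticmapreduce.model._

object EmrJobSketch {
  val emr = AmazonElasticMapReduceClientBuilder.defaultClient()

  // Submit the Spark job as a single step via EMR's command-runner
  val sparkStep = new StepConfig()
    .withName("MyJobName1")
    .withActionOnFailure("TERMINATE_CLUSTER")
    .withHadoopJarStep(new HadoopJarStepConfig()
      .withJar("command-runner.jar")
      .withArgs("spark-submit", "--class", "com.company.job.spark.core.MainClass",
                "s3://my-spark-job-release-bucket/my-job-spark-v1-0-0.jar"))

  val request = new RunJobFlowRequest()
    .withName("My Cluster Name")
    .withReleaseLabel("emr-5.11.0")
    .withApplications(new Application().withName("Spark"))
    .withServiceRole("EMR_DefaultRole")
    .withJobFlowRole("EMR_EC2_DefaultRole")
    .withInstances(new JobFlowInstancesConfig()
      .withMasterInstanceType("m4.large")
      .withSlaveInstanceType("m4.xlarge")
      .withInstanceCount(3)
      // tear the cluster down automatically once the step completes
      .withKeepJobFlowAliveWhenNoSteps(false))
    .withSteps(sparkStep)

  def main(args: Array[String]): Unit =
    println(emr.runJobFlow(request).getJobFlowId)
}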

The last two perks of using AWS EMR are:

  • AWS Spot Instances: running hardware at a discounted price
  • Large variety of hardware: most of our ELT jobs run on commodity hardware, while some ML jobs require intensive GPU computation; EMR offers hardware options for all of our use cases.

The Sundial EMR Integration

Since we were already using Sundial for most of our ETL and ML heavy lifting, we decided to extend the Sundial task_definition and add a new executable: the emr_command.

Features we’ve implemented are:

  • running a Spark EMR job on a pre-existing cluster
  • running a Spark EMR job on a new cluster created on the fly (with automatic tear-down of resources)
  • choosing between on_demand and spot instances
  • live logs

In the next two sections I will go through two Sundial EMR task definition examples: the first is a Spark EMR job running on a pre-existing cluster; the second is the same job running on a dynamically created cluster instead.

Running a job on a pre-existing EMR Cluster

Launching an EMR job on a pre-existing cluster is really simple: all you need are some job details and the cluster_id of the cluster where you want the job to run.

 "executable":{
    "emr_command":{
       "emr_cluster":{
          "existing_emr_cluster":{
             "cluster_id":"j-123ABC456DEF9"
          }
       },
       "job_name":"MyJobName1",
       "region":"us-east-1",
       "class":"com.company.job.spark.core.MainClass",
       "s3_jar_path":"s3://my-spark-job-release-bucket/my-job-spark-v1-0-0.jar",
       "spark_conf":[
          "spark.driver.extraJavaOptions=-Denvironment=production"
       ],
       "args":[
          "arg1", "arg2"
       ],
       "s3_log_details":{
          "log_group_name":"spark-emr-log-group",
          "log_stream_name":"spark-emr-log-stream"
       }
    }
 }

The other properties are:

  • class: the fully qualified main class of the job, e.g. “com.company.job.spark.core.MainClass”
  • s3_jar_path: the s3 path to the job jar file e.g “s3://my-spark-job-release-bucket/my-job-spark-v1-0-0.jar”
  • spark_conf: this is a list of attributes that you can pass to the spark driver, like memory or Java Opts (as per above example)
  • args: another list of params that will be passed to the MainClass as arguments (as per above example)
  • s3_log_details: Cloudwatch Log Group and Stream names for your job. See the EMR Logs section below

EMR Logs

One nice feature of Sundial is the ability to view jobs’ live logs. While AWS Elastic Container Service (ECS) and Batch natively offer a way to access live logs, EMR only pushes logs to S3 every five minutes, so S3 cannot be used as a feed for live logs. Since there isn’t a straightforward way of fixing this, it is the developer’s responsibility to implement the code that streams a job’s logs to AWS Cloudwatch Logs. One way of achieving this is via the log4j-cloudwatch-appender.
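
To give a flavour of what that streaming involves, here is a bare-bones sketch using the CloudWatch Logs client from the AWS SDK for Java (v1) in Scala, reusing the log group and stream names from the example above. Note that this only covers the happy path: after the first call, putLogEvents also requires the sequence token returned by the previous call.

import com.amazonaws.services.logs.AWSLogsClientBuilder
import com.amazonaws.services.logs.model.{InputLogEvent, PutLogEventsRequest}

object LiveLogSketch {
  val logs = AWSLogsClientBuilder.defaultClient()

  def main(args: Array[String]): Unit = {
    // Push a single log line to the stream that Sundial tails for live logs
    val event = new InputLogEvent()
      .withMessage("spark job started")
      .withTimestamp(System.currentTimeMillis())

    logs.putLogEvents(new PutLogEventsRequest()
      .withLogGroupName("spark-emr-log-group")
      .withLogStreamName("spark-emr-log-stream")
      .withLogEvents(event))
  }
}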

The downside of having jobs running on static AWS EMR clusters is that you will be paying for them even when no jobs are running. For this reason it would be ideal if we could spin up an EMR cluster on the fly, run a Spark job and then dispose of all the resources.

If you want to know more, well, keep reading!

Running a job on a dynamic EMR Cluster

The Sundial task definition that uses a dynamic cluster is somewhat more complex and gives you fine-grained control when provisioning your cluster. At the same time, if your jobs don’t require very specific configurations (e.g. permissions, AWS market type), sensible defaults have been provided to simplify the task definition where possible.

Let’s dig into the different sections of the json template.

"emr_cluster":{
  "new_emr_cluster":{
     "name":"My Cluster Name",
     "release_label":"emr-5.11.0",
     "applications":[
        "Spark"
     ],
     "s3_log_uri":"s3://cluster-log-bucket",
     "master_instance":{
        "emr_instance_type":"m4.large",
        "instance_count":1,
        "aws_market":{
           "on_demand":"on_demand"
        }
     },
     "core_instance":{
        "emr_instance_type":"m4.xlarge",
        "instance_count":2,
        "aws_market":{
           "on_demand":"on_demand"
        }
     },
     "emr_service_role":{
        "default_emr_service_role":"EMR_DefaultRole"
     },
    "emr_job_flow_role": {
      "default_emr_job_flow_role": "EMR_EC2_DefaultRole"
    },
     "ec2_subnet":"subnet-a123456b",
     "visible_to_all_users":true
  }
}

The json object name for a dynamic EMR cluster is new_emr_cluster. It is composed of the following attributes:

  • name: The name that will appear on the AWS EMR console
  • release_label: The EMR version of the cluster to create. Each EMR version maps to specific versions of the applications that can run in the EMR cluster. Additional details are available on the AWS EMR components page
  • applications: The list of applications to launch on the cluster. For a comprehensive list of available applications, visit the AWS EMR components page
  • s3_log_uri: The s3 bucket where the EMR cluster puts its log files. These include both cluster logs and the stdout and stderr of the EMR job
  • master_instance: The master node hardware details (see below for more details.)
  • core_instance: The core node hardware details (see below for more details.)
  • task_instance: The task node hardware details (see below for more details.)
  • emr_service_role: The IAM role that Amazon EMR assumes to access AWS resources on your behalf. For more information, see Configure IAM Roles for Amazon EMR
  • emr_job_flow_role: (Also called instance profile and EC2 role.) Accepts an instance profile that’s associated with the role that you want to use. All EC2 instances in the cluster assume this role. For more information, see Create and Use IAM Roles for Amazon EMR in the Amazon EMR Management Guide
  • ec2_subnet: The subnet in which to spin up the EMR cluster. (Optional if the account has only the standard VPC)
  • visible_to_all_users: Indicates whether the instances in the cluster are visible to all IAM users in the AWS account. If you specify true, all IAM users can view and (if they have permissions) manage the instances. If you specify false, only the IAM user that created the cluster can view and manage it

Master, core and task instances

An EMR cluster is composed of exactly one master instance, at least one core instance, and any number of task instances.

A detailed explanation of the different instance types is available in the AWS EMR plan instances page.

For simplicity I’ll paste a snippet of the AWS official documentation:

  • master node: The master node manages the cluster and typically runs master components of distributed applications. For example, the master node runs the YARN ResourceManager service to manage resources for applications, as well as the HDFS NameNode service. It also tracks the status of jobs submitted to the cluster and monitors the health of the instance groups. Because there is only one master node, the instance group or instance fleet consists of a single EC2 instance.
  • core node: Core nodes are managed by the master node. Core nodes run the Data Node daemon to coordinate data storage as part of the Hadoop Distributed File System (HDFS). They also run the Task Tracker daemon and perform other parallel computation tasks on data that installed applications require.
  • task node: Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors. Task nodes don’t run the Data Node daemon, nor do they store data in HDFS.

The json below describes configuration details of an EMR master instance:

 "master_instance":{
    "emr_instance_type":"m4.large",
    "instance_count":1,
    "aws_market":{
       "on_demand":"on_demand"
    }
 },

Please note that there can be only exactly one master node; if a different value is specified in instance_count, it is ignored. For the other instance group types, instance_count represents, as the name suggests, the number of EC2 instances to launch for that group.

Other attributes are:

  • emr_instance_type: the EC2 instance type to use when launching the EMR instance
  • aws_market: the marketplace to provision instances for this group. It can be either on_demand or spot

An example of an EMR instance group using spot instances is:

"aws_market": {
    "spot": {
      "bid_price": 0.07
    }
 }

Where bid_price is the Spot bid price in dollars.

Limitations

Because of some AWS EMR implementation details, Sundial has two major limitations when it comes to EMR job scheduling.

The first limitation is that Sundial is not able to stop EMR jobs running on pre-existing clusters. Jobs on an EMR cluster are scheduled via YARN, and AWS did not build any API on top of it, so in order to kill a job scheduled on an existing EMR cluster one would have to SSH onto the EC2 instance where the master node is running, query YARN to find the correct application ID, and issue a yarn kill command. We decided not to implement this feature because it would have greatly overcomplicated the job definition. Jobs running on dynamic clusters are affected by the same issue, but there we still managed to implement the feature, by simply killing the whole EMR cluster.

The second limitation concerns live logs. As previously mentioned, live logs are not implemented out of the box: developers are required to stream logs to Cloudwatch Logs and to set the log group and log stream name in the task definition.


Revitalize Gilt City's Order Processing with Serverless Architecture

Liyu Ma aws

Instant Vouchers Initiative

Gilt City is Gilt’s high-end voucher portal that offers localised discounts on exclusive lifestyle experiences in dining, entertainment, beauty, fitness, etc., to our 3.4 million members across 13 U.S. cities. Gilt City’s legacy order processing backend is a scheduled-job based architecture in which functionality such as fraud scanning, payment authorisation and order fulfillment is assigned to independent jobs that process orders in batches according to order status. Though this architecture can scale to meet peak-time workload and provides some level of resilience (failed orders are retried the next time the job runs), it inevitably includes some idle time, i.e. waiting for the next job to pick up an order from the previous job. The resulting average processing time could add up to 15 minutes.

Since many of Gilt City’s offers are impulse purchases and time-sensitive, long processing times became a clear bottleneck in the user experience. Team Marconi at Gilt has been driving the work on the Instant Vouchers Initiative for the past few months, in an effort to re-architect the backend of order processing using the latest cloud technologies. We believe that reducing this wait time will significantly boost the overall shopping experience and enable immediate use of vouchers, which in turn allows for new features such as location-based push notifications.

An Event Driven, Serverless Architecture

It is never easy to rewrite (or replace) a mission critical system. In our case, we have to keep the existing monolithic Ruby on Rails app running while spinning up a new pipeline. We took the strangler pattern (see this Martin Fowler article for an explanation) and built a new API layer for processing individual orders around the existing batch-processing, job-based system in the same Rails app. With this approach, the legacy job-based system gradually receives less traffic and becomes a fallback safety net to catch and retry failed orders from the instant processing pipeline.

The new instant order pipeline starts with the checkout system publishing a notification to an SNS topic whenever it creates an order object. An order notification contains the order ID to allow event subscribers to look up the order object in the order key-value store. An AWS Lambda application order-notification-dispatcher subscribes to this SNS topic and kicks off the processing by invoking an AWS Step Functions resource. See below a simplified architecture diagram of the order processing system.

The architecture leverages Lambda and Step Functions from the AWS Serverless suite to build several key components. At HBC, different teams have started embracing a serverless paradigm to build production applications. There are many benefits of adopting a serverless paradigm, such as abstraction from infrastructure, out-of-the-box scalability, and an on-demand cost model just to name a few. Compared to the alternative of building and maintaining an array of EC2/container instances, a serverless architecture goes a step beyond microservices to allow an even faster development iteration cycle. With the use of Step Functions as an orchestration engine, it is much easier to facilitate interaction between Lambda applications.

[Diagram: simplified architecture of the instant order processing system]

AWS Step Functions for Lambda Orchestration

As mentioned above, AWS Step Functions is an orchestration service that makes it easy to coordinate stateless Lambda applications by establishing a specification for transitioning application state. Behind the scenes, it is modelled as a state machine defined in the JSON-based Amazon States Language. See below a sample execution from the order-processing step function.

[Screenshot: a sample execution of the order-processing state machine]

Inside Step Functions

At the top level the specification includes various types of States, such as Task, Choice and Wait, which are used to compose simple business logic for transitioning application state. Inside a Task State, an AWS Lambda ARN can be specified to be invoked. The output of the Lambda is directed as input to the next State. This is an excerpt from the order-processing state machine:

{
  "Comment": "Order processing state machine",
  "StartAt": "ChangeOrderStatus",
  "States": {
    "ChangeOrderStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:1234567890:function:start-order-processing:2",
      "TimeoutSeconds": 30,
      "Next": "FraudScan"
    },
    "FraudScan": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:1234567890:function:fraud-scan:2",
      "TimeoutSeconds": 30,      
      "Next": "IsFraudOrder"
    },
    "IsFraudOrder": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.fraud_verdict",
          "StringEquals": "cleared",
          "Next": "AuthorizePayment"
        },
        {
          "Variable": "$.fraud_verdict",
          "StringEquals": "fraud",
          "Next": "FraudOrderTerminal"
        }
        ...
      ]      
    },    
    "AuthorizePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:1234567890:function:authorize-payments:2",
      "TimeoutSeconds": 30,      
      "Next": "WarehouseChoice"
    },
    "FraudOrderTerminal": {
      "Type": "Pass",      
      "Result": "This is the ending state for a fraud order",
      "End": true
    }
    ...
  }
}

Polling and Retry on Errors

A serverless paradigm fits really well in situations where computation completes within a short time (ideally seconds). However, sometimes we still need to run a task that takes slightly longer. For example, in our pipeline we need to keep polling a service endpoint for a fraud-scan result, since it is an async process. We implemented this by defining a retry counter get_fraud_status_retries within a Choice state and setting a max attempt count of 60 to terminate retries.

"IsFraudOrder": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.fraud_verdict",
      "StringEquals": "cleared",
      "Next": "AuthorizePayment"
    },
    {
      "Variable": "$.fraud_verdict",
      "StringEquals": "fraud",
      "Next": "FraudOrderTerminal"
    },        
    {
      "Variable": "$.get_fraud_status_retries",
      "NumericLessThanEquals": 60,
      "Next": "FraudScanWait"
    },
    {
      "Variable": "$.get_fraud_status_retries",
      "NumericGreaterThan": 60,
      "Next": "FraudStatusUnavailableTerminal"
    }
  ]      
}

It is also critical to make cloud applications resilient to errors such as network timeouts. Step Functions provides error handling to allow catching/retrying of some predefined errors as well as customised Lambda error types. You can specify different retry strategies with properties such as MaxAttempts and BackoffRate. See the below example where we implemented a retry mechanism for different errors in the Task state to create redemption codes:

"CreateRedemptionCode": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:1234567890:function:create-redemption-code:3",
  "TimeoutSeconds": 30,
  "Next": "FulfillElectronicOrder",
  "Retry": [
    {
      "ErrorEquals": [ "GatewayTimeoutError" ],
      "IntervalSeconds": 5,
      "MaxAttempts": 2
    }
  ],
  "Catch": [            
    {
      "ErrorEquals": [ "States.ALL" ],
      "Next": "CatchMissingRedemptionCode"
    }
  ]
}

Immutable Deployment & Partial Rollout

Deploying a mission critical service to a production environment is always a nerve-wracking process. At HBC we advocate immutable deployments whenever possible and leverage A/B testing to help us roll out new features to customers in a gradual manner. In a serverless world, things are a little different, since most of the infrastructure management is abstracted away.

Lambda Versioning

AWS Lambda’s versioning feature provides the ability to make Lambda functions immutable by taking a snapshot of the function (aka publishing a version). We really like this, since it ensures that the Lambda function artifact as well as its environment variables remain immutable once published. Note that in the above code snippets of state machine JSON, the ARN specified for each Lambda resource is a Lambda version ARN instead of a function ARN. We also use Lambda’s aliasing feature to have a prod alias mapped to the current production version, with immutable environment variables:

[Screenshot: the prod alias mapped to an immutable published Lambda version]

With aliasing we can easily roll back to a previous Lambda version in case of an unexpected production failure.

Blue/Green Stacks

So we have immutable Lambda functions, but we still want to make our Step Functions (SF) immutable. We decided to create a new SF resource every time we release, while the old SF resource remains unchanged. Since AWS does not currently provide a versioning feature for Step Functions, we included semantic versioning in the SF name, e.g. order-processing-v0.0.6. With both new and old versions (including historical SFs) we are able to apply a blue/green deployment and rollback procedure.

To route orders to either the blue or green stack, we make the order-notification-dispatcher Lambda the de facto router by providing the blue/green versions of the SF as its environment variables. Here is the Node.js code to read the stack environment variables:

const stateMachineBlueVer = process.env.STATE_MACHINE_BLUE_VER;
const stateMachineGreenVer = process.env.STATE_MACHINE_GREEN_VER;

With the fetched state machine version we can compose the Step Function ARN in a predefined format, then start a new execution via the AWS SDK Step Functions API:

const AWS = require('aws-sdk');

const stateMachineVersion = ... // Read from environment vars
function dispatch(orderJson) {
  const orderId = orderJson.order_id;
  const stateMachine = preProcessingStepFunctionPrefix + stateMachineVersion;
  const params = {
    stateMachineArn: stateMachine,
    name: orderId.toString(),
    input: JSON.stringify(orderJson)
  };
  return new AWS.StepFunctions().startExecution(params).promise();
}

Partial Rollout

We make the order-notification-dispatcher query our a/b test engine for simple routing logic for each order notification, so that it can shift traffic to either the blue or green Step Function stack according to the test/control group the order falls into. Also note that AWS recently released a nice traffic shifting feature for Lambda applications. However, we didn’t use it, as our a/b test engine provides finer-grained control which allows us to target certain groups such as HBC’s internal employees. Here is a diagram depicting the partial rollout process for new Step Function resources:

[Diagram: partial rollout across the blue/green Step Function stacks]

Conclusion

What We Have Achieved

As of today all of Gilt City’s orders are directed to the instant processing pipeline, which shortens the processing time of the majority of orders from over 15 minutes to a few seconds. We are looking to expand the system to take over more workload, including physical products, to bring the instant order user experience to a wider customer base.

Step Functions Limitations

From our development experience with AWS Step Functions we discovered some limitations of the service. First of all, it lacks a feature like a Map state, which would take a list of input objects and transform it into another list of result objects. A possible solution could be allowing invocation of a sub-SF multiple times. In our case, an order object can be split into multiple order objects depending on the items in the original order. Unfortunately, SF does not offer a State type that can map over a dynamic number of elements. We eventually worked around this by creating an order-pre-processing SF and making it invoke the order-processing SF multiple times to process those ‘split’ orders.

Secondly, we hope AWS can provide versioning/aliasing for Step Functions so we can gain immutability out of the box instead of forcing immutability on our side. Any support for blue/green deployment would be even better.

Also, we expect AWS to provide better filtering/searching abilities on the Step Functions dashboard so we can gain some fundamental data analytics from historical executions. This could be achieved by declaring some “searchable” fields and their corresponding types in the SF definition.

In the context of AWS Enterprise Support, we (Team Marconi) had a productive meeting directly with the AWS Step Functions Product Manager, during which we suggested our list of improvements. It was gratifying to hear that most of these are already on, or will be included in, their development roadmap.

Future Work

From an architecture perspective, we are trying to standardize a continuous delivery process for our serverless components. At the moment, what we have is “poor man’s CI/CD” - some bash/node scripts that use the AWS CloudFormation SDK to provision resources. There are various tools available from AWS and the serverless community, such as Terraform and CodePipeline, that we are trying to integrate with to provide a frictionless path to production.


Presentations we love: 2017

HBC Tech presentations

2017 was a year of growth and learning at HBC Tech. Our organization embraced new technologies and new ways of building application software.

As the year comes to an end, let’s recognize some notable technical presentations from 2017.

Kubernetes Project update

Kelsey Hightower (@kelseyhightower) at KubeCon 2017

Production: Designing for testability

Mike Bryzek (@mbryzek) at QCon New York 2017

Streaming Microservices: Contracts & Compatibility

Gwen Shapira (@gwenshap) at QCon New York 2017

Spinnaker and the Culture Behind the Tech

Dianne Marsh (@dmarsh) at KubeCon 2017

Embracing Change without breaking the world

Jim Flanagan and Kyle Thomson at AWS re:invent 2017

Developing Applications on AWS in the JVM

Kyle Thomson (@kiiadi) at AWS re:invent 2017

Chaos Engineering at Netflix

Nora Jones (@nora_js) at AWS re:invent 2017

apibuilder

Sean Sullivan (@tinyrobots) at Scala Up North 2017

Managing Data in Microservices

Randy Shoup (@randyshoup) at QCon New York 2017

Crushing Tech Debt Through Automation at Coinbase

Rob Witoff (@rwitoff) at QCon London 2017

Gilt’s iOS codebase evolution

Evan Maloney (@_emaloney_) at the Brooklyn Swift Developers Meetup

Apache Struts and the Equifax Data Breach

Sean Sullivan (@tinyrobots) at the Portland Java User Group

Promcon 2017

Giovanni Gargiulo (@giannigar) at Promcon 2017 (Munich)


Dublin Scala Spree

Gregor Heine open source

Dublin Scala Spree

This Friday the Gilt/HBC Digital Dublin office will be hosting the first ever Dublin Scala Spree, a day-long Scala Open Source Hackathon. The event is organized by the Dublin Scala Usergroup in cooperation with Dublin Functional Kubs and the Scala Center at EPFL in Lausanne, Switzerland.

  • Date & Time: Friday, 15th September, 10am - 4pm
  • Location: Gilt/HBC Digital Office, Shelbourne Rd., Dublin 4, Ireland
  • Sign-Up: Please register for the event via the Dublin Scala Users Group
  • Organizers: Dublin Scala Meetup and Dublin Functional Kubs in cooperation with the Scala Center @ EPFL in Lausanne

What is a Scala Spree?

Scala Spree is a free community event aiming to popularize Open Source Software. It brings together Open Source authors, maintainers and software engineers willing to contribute to OSS projects. Under the guidance of seasoned experts, newcomers learn about the inner workings of some popular tools and Scala libraries, and contribute to making them even better. For library authors, it’s an opportunity to improve their tools and get fresh feedback. For attendees it is a unique opportunity to learn more about Scala, contribute to Open Source Software and expand their skills. And for everyone it’s a great opportunity to meet and have fun!

For this week’s Spree we have the following special guests and their OSS projects:

If you have a Scala open source project that you would like to feature at the Spree, please get in touch with the Dublin Scala Users Group organizers.

Like all Dublin Scala Community events, Scala Spree is free of charge and the only real requirement is an open mind and the will to contribute! (Apart from bringing your own computer, but chances are you figured that out already.)

Duration and pace

To begin with, maintainers gather in front of all the contributors to briefly explain their projects and tickets in one minute. The idea is to give a good high-level explanation to motivate participants without going into too much detail. When they are done, participants approach the projects they are most interested in and get in contact with the maintainers. At this point, maintainers usually listen to the participants’ experience and provide personal guidance on tickets that would suit them. Then, the fun begins! Participants start hacking on their projects and maintainers review PRs as they come, assisting participants when they ask for help. We encourage maintainers to merge as many PRs as possible on the spot, for two reasons: participants get a small token of appreciation from the Scala Center, and it increases their motivation. If participants get their first PR merged, they are invited to continue solving issues until they are happy with their work! In the middle of the spree, we will provide free lunch and refreshments. Participants can leave the event at any time they want. As the end approaches, everyone starts to wrap up: participants finish their PRs while maintainers finish their reviews, and the organizers of the spree give away swag.

Places will be strictly limited and will be allocated on a first-come, first-served basis. Registration through the Dublin Scala Users Group is required and only successful RSVPs can attend.


Team Rookie 2017

Team Rookie internship

Who We Are

Team Rookie 2017 (and we pride ourselves on being the most awesome team ever) spent the summer improving the browsing experience for Gilt users, as well as collecting data for our personalization team. The end result of our project included a crafted front-end user experience and a back-end service for data processing.

Project Ideation

The final project idea rose to the top through countless meetings and discussions with various teams in the organization. When the initially chosen problem-solution proved unworkable, our team, along with all of our mentors, worked to come up with a new solution to the given problem using the limited resources we had. This immersive process, at the very beginning of the program, ensured we understood the engineering problem and set our project up for success.

To arrive at the best possible solution, we spent time learning the technology stack end-to-end. We went through many tutorials and labs with our mentors on the technologies we would eventually use, namely Scala, Android, and the Play framework. As we gained familiarity with these tools and technologies, we quickly finalized our ideas and the project finally took off.

Problem Space:

So let’s talk about the problem. With a growing user base, the Gilt platform needs to better understand users’ interests in order to tailor unique shopping experiences to different user groups. Currently, users are able to “shop the look.” This feature allows a user to browse a complete outfit, such as the combination of a shirt, a pair of jeans, and shoes. It saves users the hassle of discovering these items separately; they can find them all at once and make a single purchase. At the moment, these complete looks are selected by stylists who understand them. While stylists may provide the highest quality pairings, we are unable to scale human labor to the entire catalog. As fashion trends change, we need to update our pairings accordingly. Therefore, we aim to continuously collect user opinions on possible pairings. With these we can develop machine learning models to infer item compatibility. This is an ambitious goal, but not an unachievable one. We just need a steady supply of data.

Solution:

To tackle this problem, we proposed creating a fun and engaging experience for users while they shop: completing their own outfits. One key requirement for this experience is that it cannot interfere with the current purchase flow, meaning that if a user is closing in on a purchase, that process should not be interrupted. Therefore, rather than inserting the experience within the current workflow, we decided to include the feature on the search page, where users are able to favorite items they like. This is shown in the figure below.

To minimize disruption to the current workflow, we added an additional hover link on the favorite button, which directs users to our experience.

We provide the users with additional items that can potentially be paired with the initially favorited item to form complete looks. These products, limited by category and price based on the favorited item, are presented to the users for individual selection. The users can let their imaginations go wild and pick what they think are the best combinations. During this process, we collect this data and persist it through our back-end API to the database.

Finally, in order to complete the experience and make it as engaging as possible, we decided to allow the users to immediately purchase the selected items if they wish. Since these items are what they specifically picked out from a pool of products, they have a greater likelihood of conversion.

So, in a nutshell, this is the completed project of a 10-week internship filled with hard work, grind, sweat (mostly from our daily trips to the Equinox right downstairs), and a whole lot of fun.

Intern Activities

While we were not busy being awesome engineers, Team Rookie spent most of our leisure time exploring New York and staying cool. Here are some of the highlights.

Mentorship

Team Rookie would like to give a huge shout-out to all of our mentors who helped us along the way and made this project possible (you know who you are)! A special thanks to Doochan and Mike, who led the intern committee through all of our battles and came out on the other end with a solid victory. The complete-the-look experience would not have been possible without you guys.


HBC Tech Talks: February 2017 through July 2017

HBC Tech conferences

We’ve had a busy 2017 at HBC. The great work of our teams has created opportunities to share what we’ve learned with audiences around the world. This year our folks have been on stage in Austin, Sydney, Portland, Seattle, San Diego, Boston, London, Israel and on our home turf in NYC and Dublin. The talks have covered deep learning, design thinking, data streaming and developer experience to name just a few.

Lucky for you, if you haven’t been able to check out our talks in person, we’ve compiled the decks and videos from a bunch of our talks right here. Enjoy!

February

March

April

May

June

July

  • Sean Sullivan spoke at Scala Up North and the Portland Java User Group about ApiBuilder.
  • Sophie Huang spoke at the Customer Love Summit in Seattle.
  • Kyla Robinson gave a keynote on Key to Success: Creating a Mobile-First Mentality.
  • Sera Chin and Yi Cao spoke at the NYC Scrum User Group about HBC’s Design Sprints.

Sundial or AWS Batch, Why not both?

Kevin O'Riordan data

Sundial on AWS Batch

About a year ago, we (the Gilt/HBC personalization team) open sourced Sundial (https://github.com/gilt/sundial), a batch job orchestration system leveraging Amazon EC2 Container Service.

We built Sundial to provide the following features on top of the standard ECS setup:

  • Streaming Logs (to Cloudwatch and S3 and live in Sundial UI)
  • Metadata collection (through Graphite and displayed live in Sundial UI)
  • Dependency management between jobs
  • Retry strategies for failed jobs
  • Cron style scheduling for jobs
  • Email status reporting for jobs
  • Pagerduty integration for notifying team members about failing critical jobs

Other solutions available at the time didn’t suit our needs. Solutions we considered included Chronos, which lacked the features we needed and required a Mesos cluster; Spotify’s Luigi; and Airbnb’s Airflow, which was immature at the time.

At the time, we chose ECS because we hoped to take advantage of AWS features such as autoscaling in order to save costs by scaling the cluster up and down with demand. In practice, this required too much manual effort and too many moving parts, so we lived with a long-running cluster scaled to handle peak load.

Since then, our needs have grown and we now have jobs ranging in size from a couple of hundred MB of memory to 60GB of memory. Having a cluster scaled to handle peak load across all these job sizes has become too expensive. Most job failure noise has been due to cluster resources not being available, or to smaller jobs taking up space on instances meant to be dedicated to bigger jobs. (ECS is weak when it comes to task placement strategies.)

Thankfully AWS have come along with their own enhancements on top of ECS in the form of AWS Batch.

What we love about Batch

  • Managed compute environment. This means AWS handles scaling up and down the cluster in response to workload.
  • Heterogeneous instance types (useful when we have outlier jobs taking large amounts of CPU/memory resources)
  • Spot instances (saving over half compared to on-demand instance costs)
  • Easy integration with Cloudwatch Logs (stdout and stderr captured automatically)

What sucks

  • Not being able to run “linked” containers (we relied on this for the metadata service and log upload to S3)
  • Needing a custom AMI to configure extra disk space on the instances.

What we’d love for Batch to do better

  • Make disk space on managed instances configurable. Currently the workaround is to create a custom AMI with the disk space you need if you have jobs that store a lot of data on disk (Not uncommon in a data processing environment). Gilt has a feature request open with Amazon on this issue.

Why not dump Sundial in favour of using Batch directly?

Sundial still provides features that Batch doesn’t provide:

  • Email status reporting
  • PagerDuty integration
  • An easy transition: processes can be a mixed workload of jobs running on ECS and on Batch
  • Configurable backoff strategies for job retries
  • Time limits for jobs: if a job hangs, we can kill it and retry after a certain period of time
  • A nice dashboard of processes (at a glance, see what's green and what's red); a hypothetical configuration sketch follows this list
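
To make these features concrete, here is a hypothetical sketch of registering a process definition with Sundial over HTTP. The endpoint and every field name below are illustrative only, not Sundial's actual schema; see the GitHub project for the real API.

    import requests

    # Hypothetical process definition; field names are illustrative, not
    # Sundial's real schema.
    process = {
        "process_name": "nightly-recommendations",
        "schedule": "0 2 * * *",                  # cron-style scheduling
        "notifications": ["email", "pagerduty"],  # status reporting hooks
        "tasks": [{
            "name": "train-model",
            "backend": "batch",            # mixed ECS/Batch workloads
            "max_attempts": 3,             # retried with a backoff strategy
            "max_runtime_minutes": 120,    # hung jobs are killed and retried
            "dependencies": [],
        }],
    }
    requests.post("http://sundial.example.internal/api/processes", json=process)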


Admittedly, some of the above can be wired up yourself by hooking together Lambdas, SNS messages, and so on, but Sundial gives it to you out of the box.

What next?

Sundial with the AWS Batch backend now works great for the use cases we encounter doing personalization. We may consider enhancements such as Prometheus push-gateway integration (to replace the Graphite service we had with ECS and to keep track of metrics over time) and UI improvements to Sundial.

In the long term we may consider other open source solutions, since maintaining a job system is technical debt that distracts from product-focused work. The HBC data team, who have very similar requirements to ours, have started adopting Airflow (by Airbnb). As part of that adoption, they have contributed to an open source effort to make Airflow support Batch as a backend: https://github.com/gilt/incubator-airflow/tree/aws_batch. If it works well, this is a solution we may adopt in the future.

batch 1 aws 15 tech 22 personalization 15
Tech
Gilt Tech

Visually Similar Recommendations

Chris Curro personalization

Previously we’ve written about Tiefvision, a technical demo showcasing the ability to automatically find dresses similar to a particular one of interest. For example:

Since then, we’ve worked on taking the ideas at play in Tiefvision and making them usable in a scalable, production-ready way that lets us roll out to new product categories beyond dresses quickly and efficiently. Today, we’re excited to announce that we’ve rolled out visually similar recommendations on Gilt for all dresses, t-shirts, and handbags, as well as women’s shoes, women’s denim, women’s pants, and men’s outerwear.

Let’s start with a brief overview. Consider the general task at hand. We have a landing page for every product on our online stores; for the Gilt store, we refer to this as the product detail page (PDP). On the PDP we would like to offer the user a variety of alternatives to the product they are viewing, so that they can best make a purchasing decision. There are many approaches to selecting which other products to display as alternatives; a particularly popular one, collaborative filtering, leverages purchase history across users to make recommendations. However, this approach is what we call content-agnostic: it has no knowledge of what a particular garment looks like. Instead, we’d like to look at the photographs of garments and recommend similar-looking garments within the same category.

Narrowing our focus a little, our task is to take a photograph of a garment and find similar-looking photographs. First, we need to come up with a similarity measure for photographs; then we need to be able to quickly query for the most similar photographs from our large catalog.

This is something we need to do numerically. Recall that we can represent a photograph as a tensor $x \in [0, 1]^{H \times W \times 3}$ (in other words, a three-dimensional array with entries between 0 and 1). Given that we have a numerical representation for a photograph, you might think we could do something simple to measure the similarity between two photographs $x$ and $y$. Consider:

$$d(x, y) = \lVert x - y \rVert_F = \sqrt{\sum_{i,j,k} \left( x_{ijk} - y_{ijk} \right)^2}$$

which we’d refer to as the Frobenius norm of the difference between the two photographs. The problem with this measure, although it is simple, is that we’re not comparing semantically meaningful features. Consider these three dresses: a red floral print, pink stripes, and a blue floral print.

With this “pixel-space” approach, the red floral print and the pink stripes are more likely to be recognized as similar than the red floral print and the blue floral print, because they have pixels of similar colors at similar locations. The pixel-space approach ignores locality and global reasoning, and has no insight into semantic concepts.
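
As a minimal numpy sketch, the pixel-space distance above amounts to:

    import numpy as np

    def pixel_distance(x: np.ndarray, y: np.ndarray) -> float:
        """Frobenius norm of the difference of two H x W x 3 images in [0, 1]."""
        return float(np.sqrt(np.sum((x - y) ** 2)))  # == np.linalg.norm(x - y)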

What we’d like to do is find some function $f$ that extracts semantically meaningful features. We can then compute our similarity metric in the feature space rather than the pixel space. Where do we get this $f$? In our case, we leverage deep neural networks (deep learning) for this function. Neural networks are hierarchical functions, typically composed of sequential connections of simple building blocks. This structure allows us to take a neural network trained for a specific task, like arbitrary object recognition, and pull features from some intermediate point in the network. For example, say we take a network, trained to recognize objects in the ImageNet dataset, composed of building blocks $f_1, f_2, \ldots, f_n$:

$$f(x) = \left( f_n \circ f_{n-1} \circ \cdots \circ f_1 \right)(x)$$

We might take the output of an intermediate block $f_k$ and call those our features:

$$\phi(x) = \left( f_k \circ \cdots \circ f_1 \right)(x)$$

In the case of convolutional networks like the VGG, Inception, or ResNet families, our output features would lie in some vector space $\mathbb{R}^{H' \times W' \times C}$. The first two dimensions correspond to the original spatial dimensions (at some reduced resolution), while the third corresponds to a set of $C$ feature types. In other words, if one of our feature types detects a human face, we might see a high numerical value at the spatial position near where a person’s face appears in the photograph. In our use cases, we’ve determined that this spatial information isn’t nearly as important as the feature types themselves, so at this point we aggregate over the spatial dimensions to get a vector in $\mathbb{R}^C$. A simple way to do this aggregation is with an arithmetic mean, but other methods work as well.
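
As a rough illustration (not our production code), here is how one could compute such an embedding with TensorFlow's Keras API and a pretrained ResNet50, averaging over the spatial dimensions:

    import numpy as np
    import tensorflow as tf

    # include_top=False drops the ImageNet classifier head; pooling="avg" takes
    # the arithmetic mean over the two spatial dimensions, leaving one value
    # per feature type (a 2048-dimensional vector for ResNet50).
    model = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", pooling="avg")

    def embed(path: str) -> np.ndarray:
        img = tf.keras.preprocessing.image.load_img(path, target_size=(224, 224))
        x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]
        x = tf.keras.applications.resnet50.preprocess_input(x)
        return model.predict(x)[0]  # shape: (2048,)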

From there we could build up a matrix $E \in \mathbb{R}^{N \times C}$, where $N$ is the number of items in a category of interest and each row is one item’s feature vector. We could then construct an $N \times N$ similarity matrix $S$.

Then, to find the items most similar to a query item $i$, we look at the locations of the highest values in row $i$ of the matrix.
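
The post doesn't pin down the similarity function; assuming cosine similarity, the brute-force version is a few lines of numpy:

    import numpy as np

    def similarity_matrix(E: np.ndarray) -> np.ndarray:
        """E is N x C, one embedding per row; returns the N x N cosine-similarity matrix."""
        E = E / np.linalg.norm(E, axis=1, keepdims=True)  # L2-normalize rows
        return E @ E.T

    def most_similar(S: np.ndarray, i: int, k: int = 5) -> np.ndarray:
        return np.argsort(-S[i])[1:k + 1]  # skip position 0, the item itself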

This approach becomes infeasible as $N$ grows large, since both the computational and the space complexity are $O(N^2)$. To alleviate this issue, we can leverage a variety of approximate nearest neighbor methods. We find empirically that approximate neighbors are sufficient. Moreover, since our feature space is an arbitrary learned embedding with no guarantee of any particular notion of optimality, there is no grounded reason to insist on exact nearest neighbor searches.
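
The post doesn't name the approximate nearest neighbor library in use; as one concrete possibility, Spotify's open source Annoy library exposes exactly this trade-off:

    from annoy import AnnoyIndex

    dim = 2048                          # embedding dimensionality
    index = AnnoyIndex(dim, "angular")  # angular distance tracks cosine similarity
    for i, vec in enumerate(embeddings):  # `embeddings`: the N x C matrix from above
        index.add_item(i, vec)
    index.build(50)            # 50 trees: more trees, better recall, bigger index
    index.save("dresses.ann")  # memory-mapped file, cheap to load on serving nodes

    neighbors = index.get_nns_by_item(0, 10)  # the 10 nearest items to item 0

The number of trees at build time and the number of nodes inspected at query time are the knobs that trade accuracy against speed, which maps directly onto the parameter selection discussed later in this post.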

How do we do it?

We leverage several open source technologies, as well as established results from published research, to serve visually similar garments. On the open source side, we use TensorFlow and (our very own) Sundial. Below you can see a block diagram of our implementation:

Let’s walk through this process. First, we have a Sundial job that accomplishes two tasks: it checks for new products, then computes embeddings using TensorFlow and a pretrained network of a particular type for particular categories of products, persisting the embeddings on AWS S3. Second, we have another Sundial job, again with two tasks: it filters the set of products down to those of particular interest and generates a nearest-neighbor index for fast look-ups, persisting the index on AWS S3 when it completes. Finally, we wrap a cluster of servers in a load balancer; our product recommendation service queries these nodes for visually similar recommendations as desired.
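
As a small sketch of the hand-off between the two jobs (the bucket and key names are placeholders), the embeddings can be serialized straight to S3:

    import io
    import boto3
    import numpy as np

    def save_embeddings(E: np.ndarray, bucket: str, key: str) -> None:
        """Persist the N x C embedding matrix to S3 for the index-building job."""
        buf = io.BytesIO()
        np.save(buf, E)  # serialize in .npy format
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

    # e.g. save_embeddings(E, "recs-embeddings", "dresses/latest.npy")  # placeholders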

Now we can take a deeper dive into the thought process behind some of the decisions we make as we roll out to new categories. First, and perhaps most important, is which network type to use and where to tap it off so that we can compute embeddings. Recalling that neural networks produce hierarchical representations, we can deduce (and observe empirically) that deeper tap points (more steps removed from the input) produce embeddings that pick up on “higher level” concepts rather than “low level” textures. So, for example, if we wish to pick up on basic fabric textures we might pull from near the input, and if we wish to pick up on something higher level, like silhouette type, we might pull from deeper in the network.
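
In Keras terms, choosing a tap point is just choosing which layer's output to expose. The layer name below follows TensorFlow 2.x's ResNet50 naming and is only an example of a shallower tap, not the layer we actually use:

    import tensorflow as tf

    base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet")
    # A shallower tap point picks up low-level texture; a deeper one (e.g. the
    # final block) picks up higher-level concepts such as silhouette.
    shallow_tap = tf.keras.Model(
        inputs=base.input,
        outputs=base.get_layer("conv3_block4_out").output)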

The filtering step before we generate an index is also critically important. At this point we can narrow our products down to one particular category, or even some finer sub-categorization, leveraging the deep knowledge of fashion present at HBC.

Finally, we must select the parameters for index generation, which control the trade-off between error rate and performance in the approximate nearest neighbor search. We select these parameters empirically, once again utilizing our knowledge of fashion to determine a good operating point.

What’s next?

We’ll be working to roll out to more and more categories, and even to do some cross-category recommendations, perhaps completing outfits based on their visual compatibility.

machine learning 11 deep learning 5 personalization 15 recommendation 2
Tech