The Gilt technology organization. We make gilt work.

Gilt Tech

Sundial AWS EMR Integration

Giovanni Gargiulo

AWS Elastic Map Reduce on Sundial

Today I want to talk about a recent improvement we made to Sundial, an open source product launched by Gilt in early 2016. With Sundial 2.0.0 it’s now possible to schedule AWS Elastic MapReduce (EMR) jobs.

For those of you who are not familiar with it, Sundial is a batch job scheduler, developed by the Gilt Personalization Team, that works with Amazon ECS and Amazon Batch.

Before jumping into the nitty gritty details, it’s worth taking a deeper dive into the current batch job processing setup in Gilt and the challenges we have recently started to face.

We will quickly cover the following areas:

  • the current batch jobs setup
  • batch job scalability

Batch processing today

Every night, the Gilt Aster data warehouse (DW) is locked down in order to update it with the latest data coming from the relevant areas of the business. During this lock, Extract-Transform-Load (ETL) suites, or ELT as we prefer to call it, are run. When all the jobs complete, the DW gets unlocked and normal access to Aster resumes. There are a number of client systems relying on the DW; the most relevant are BI tools (e.g. Looker) and Sundial. Sundial in particular is used in personalization for scheduling additional jobs and for building Machine Learning models. Since there is no synchronization between Aster and Sundial, occasionally, when Aster takes longer to complete, Sundial jobs fail because the DW is still locked down or its data is stale.

Performance degradation

Because Aster is a shared resource, and the number of jobs relying on it increases day by day, in the past few weeks we’ve experienced significant performance degradation. The issue is particularly amplified at a specific time of the week, when BI reports are generated. The result is that batch jobs and reports take longer and longer to complete, which of course affects developer experience and productivity.

EMR adoption

Because of all the issues above, additional operational time is spent restarting failed jobs. Furthermore, when developing a new model, most of the time is spent extracting and massaging data, rather than focusing on the actual job logic.

It’s easy to see that Aster was no longer a good fit for us and that we needed to migrate to a better, more elastic platform.

The solution we were looking for should:

  • work with multiple data formats
  • be scalable
  • be owned by the team
  • be easy to integrate with our scheduling solution

We didn’t have to look far to find a great candidate to solve our problems: Spark running on AWS EMR (Elastic Map Reduce). Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

A complete list of open source applications (or components) running on AWS EMR can be found here.

AWS EMR also offers a nice SDK to spin up a new dynamic EMR cluster, run a job and tear down resources on the fly, as well as per-second billing, which makes the whole platform very cost efficient.
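With the SDK, launching such a transient cluster boils down to a single RunJobFlow call; a minimal Node.js sketch (names and values are illustrative, not Sundial's actual code):

```javascript
// Build the parameters for a transient EMR cluster that runs one Spark
// step and terminates itself when the step finishes.
function buildRunJobFlowParams(jobName, jarPath, mainClass, args) {
  return {
    Name: jobName,
    ReleaseLabel: 'emr-5.8.0',
    Applications: [{ Name: 'Spark' }],
    ServiceRole: 'EMR_DefaultRole',
    JobFlowRole: 'EMR_EC2_DefaultRole',
    Instances: {
      InstanceCount: 3,
      MasterInstanceType: 'm4.large',
      SlaveInstanceType: 'm4.large',
      KeepJobFlowAliveWhenNoSteps: false  // tear the cluster down when done
    },
    Steps: [{
      Name: jobName,
      ActionOnFailure: 'TERMINATE_CLUSTER',
      HadoopJarStep: {
        // command-runner.jar is the standard EMR entry point for spark-submit
        Jar: 'command-runner.jar',
        Args: ['spark-submit', '--class', mainClass, jarPath].concat(args)
      }
    }]
  };
}

// With the AWS SDK for JavaScript the cluster would then be launched with:
//   new AWS.EMR().runJobFlow(buildRunJobFlowParams(...)).promise();
```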

The last two perks of using AWS EMR are:

  • AWS Spot Instances: running hardware at a discounted price
  • Large variety of hardware: most ETL jobs run on commodity hardware, but some ML jobs require intensive GPU computation, and EMR offers hardware options for all of our use cases.

The Sundial EMR Integration

Since we were already using Sundial for most of our ETL and ML heavy lifting, we decided to extend the Sundial task_definition and add a new executable: the emr_command.

Features we’ve implemented are:

  • running a Spark EMR job on a pre-existing cluster
  • running a Spark EMR job on a cluster created on the fly (with automatic tear-down of resources)
  • choosing between on_demand and spot instances
  • live logs

In the next two sections I will go through two Sundial EMR task definition examples: the first is a Spark EMR job running on a pre-existing cluster; the second is the same job running on a dynamically created cluster.

Running a job on a pre-existing EMR Cluster

Launching an EMR job on a pre-existing cluster is really simple: all you need are some job details and the cluster_id of the cluster where you want the job to run. The task definition looks roughly like this (values, and the exact nesting, are illustrative):

    "emr_command": {
      "cluster_id": "j-XXXXXXXXXXXXX",
      "class": "com.example.MySparkJob",
      "s3_jar_path": "s3://my-spark-job-release-bucket/my-job-spark-v1-0-0.jar",
      "spark_conf": ["spark.driver.memory=3G"],
      "args": ["arg1", "arg2"],
      "s3_log_details": {
        "log_group_name": "my-log-group",
        "log_stream_name": "my-log-stream"
      }
    }

The other properties are:

  • class: the fully qualified main class of the job
  • s3_jar_path: the s3 path to the job jar file e.g “s3://my-spark-job-release-bucket/my-job-spark-v1-0-0.jar”
  • spark_conf: this is a list of attributes that you can pass to the spark driver, like memory or Java Opts (as per above example)
  • args: another list of params that will be passed to the MainClass as arguments (as per above example)
  • s3_log_details: Cloudwatch Log Group and Stream names for your job. See EMR Logs paragraph

EMR Logs

One nice feature of Sundial is the ability to view jobs’ live logs. While AWS Elastic Container Service (ECS) and Batch natively offer a way to access live logs, EMR only uploads logs to S3 every five minutes, so they cannot be used as a feed for live logs. Since there isn’t a straightforward way around this, it is the developer’s responsibility to implement the code that streams a job’s logs to AWS CloudWatch Logs. One way of achieving this is via the log4j-cloudwatch-appender.

The downside of having jobs run on static AWS EMR clusters is that you pay for them even when no jobs are running. For this reason it would be ideal if we could spin up an EMR cluster on the fly, run a Spark job and then dispose of all the resources.

If you want to know more, well, keep reading!

Running a job on a dynamic EMR Cluster

The Sundial task definition that uses a dynamic cluster is somewhat more complex and gives you fine-grained control when provisioning your cluster. At the same time, if your jobs don’t require very specific configurations (e.g. permissions, aws market type), sensible defaults are provided to simplify the task definition where possible.

Let’s dig into the different sections of the json template. A sketch of the cluster section (values are illustrative; the instance group definitions, covered below, are elided):

    "new_emr_cluster": {
      "name": "My Cluster Name",
      "release_label": "emr-5.8.0",
      "applications": ["Spark"],
      "s3_log_uri": "s3://my-emr-logs-bucket/",
      "master_instance": { ... },
      "core_instance": { ... },
      "emr_service_role": { ... },
      "emr_job_flow_role": {
        "default_emr_job_flow_role": "EMR_EC2_DefaultRole"
      },
      "visible_to_all_users": true
    }
The json object name for a dynamic EMR cluster is new_emr_cluster. It is composed of the following attributes:

  • name: The name that will appear on the AWS EMR console
  • release_label: The EMR version of the cluster to create. Each EMR version maps to specific version of the applications that can run in the EMR cluster. Additional details are available on the AWS EMR components page
  • applications: The list of applications to launch on the cluster. For a comprehensive list of available applications, visit the AWS EMR components page
  • s3_log_uri: The s3 bucket where the EMR cluster puts its log files. These include both cluster logs and the stdout and stderr of the EMR job
  • master_instance: The master node hardware details (see below for more details.)
  • core_instance: The core node hardware details (see below for more details.)
  • task_instance: The task node hardware details (see below for more details.)
  • emr_service_role: The IAM role that Amazon EMR assumes to access AWS resources on your behalf. For more information, see Configure IAM Roles for Amazon EMR
  • emr_job_flow_role: (Also called instance profile and EC2 role.) Accepts an instance profile that’s associated with the role that you want to use. All EC2 instances in the cluster assume this role. For more information, see Create and Use IAM Roles for Amazon EMR in the Amazon EMR Management Guide
  • ec2_subnet: The subnet in which to spin up the EMR cluster. (Optional if the account has only the standard VPC)
  • visible_to_all_users: Indicates whether the instances in the cluster are visible to all IAM users in the AWS account. If you specify true, all IAM users can view and (if they have permissions) manage the instances. If you specify false, only the IAM user that created the cluster can view and manage it

Master, core and task instances

An EMR cluster is composed of exactly one master instance, at least one core instance and any number of task instances.

A detailed explanation of the different instance types is available in the AWS EMR plan instances page.

For simplicity I’ll paste a snippet of the AWS official documentation:

  • master node: The master node manages the cluster and typically runs master components of distributed applications. For example, the master node runs the YARN ResourceManager service to manage resources for applications, as well as the HDFS NameNode service. It also tracks the status of jobs submitted to the cluster and monitors the health of the instance groups. Because there is only one master node, the instance group or instance fleet consists of a single EC2 instance.
  • core node: Core nodes are managed by the master node. Core nodes run the Data Node daemon to coordinate data storage as part of the Hadoop Distributed File System (HDFS). They also run the Task Tracker daemon and perform other parallel computation tasks on data that installed applications require.
  • task node: Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors. Task nodes don’t run the Data Node daemon, nor do they store data in HDFS.

The json below describes configuration details of an EMR master instance:
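A plausible shape for this section, based on the attributes described below (values are illustrative; the aws_market block is covered separately):

```
"master_instance": {
  "instance_count": 1,
  "emr_instance_type": "m4.large",
  "aws_market": { ... }
}
```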


Please note that there can be only exactly one master node; if a different value is specified in instance_count, it is ignored. For the other instance group types, instance_count represents, as the name suggests, the number of EC2 instances to launch for that group.

Other attributes are:

  • emr_instance_type: the EC2 instance type to use when launching the EMR instance
  • aws_market: the marketplace to provision instances for this group. It can be either on_demand or spot
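The composition rules above (exactly one master, at least one core) amount to a small normalization step; a hypothetical sketch, not Sundial's actual code:

```javascript
// Normalize the instance groups of an EMR cluster definition:
// a master group always launches exactly one instance (any requested
// count is ignored), and a core group must have at least one instance.
function normalizeInstanceGroups(groups) {
  return groups.map(function (g) {
    if (g.type === 'master' && g.instance_count !== 1) {
      // the requested value is ignored for master groups
      return Object.assign({}, g, { instance_count: 1 });
    }
    if (g.type === 'core' && g.instance_count < 1) {
      throw new Error('core group needs at least one instance');
    }
    return g;  // task groups may have any count, including zero
  });
}
```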

An example of an EMR instance group using spot instances is:

"aws_market": {
  "spot": {
    "bid_price": 0.07
  }
}

Where bid_price is the Spot bid price in dollars.


Limitations

Because of some AWS EMR implementation details, Sundial has two major limitations when it comes to EMR job scheduling.

The first limitation is that Sundial cannot stop EMR jobs running on pre-existing clusters. Jobs on an EMR cluster are scheduled via YARN, and AWS does not expose an API on top of it; killing a job on an existing cluster would require ssh-ing into the EC2 instance running the master node, querying YARN for the correct application id and issuing a yarn kill command. We decided not to implement this because it would have greatly overcomplicated the job definition. Jobs running on dynamic clusters are affected by the same issue, but there we were still able to implement the feature by simply killing the whole EMR cluster.

The second limitation concerns live logs. As previously mentioned, live logs are not available out of the box: developers need to stream logs to CloudWatch Logs themselves and set the log group and log stream names in the task definition.


Revitalize Gilt City's Order Processing with Serverless Architecture

Liyu Ma

Instant Vouchers Initiative

Gilt City is Gilt’s high-end voucher portal, offering localised discounts on exclusive lifestyle experiences in dining, entertainment, beauty, fitness, etc. to our 3.4 million members across 13 U.S. cities. Gilt City’s legacy order-processing backend is a scheduled-job-based architecture in which functionality such as fraud scanning, payment authorisation and order fulfillment is assigned to independent jobs that process orders in batches according to order status. Though this architecture can scale to meet peak-time workload and provides some level of resilience (failed orders are retried the next time the job runs), it inevitably includes some idle time, i.e. waiting for the next job to pick up an order from the previous one. The resulting average processing time could add up to 15 minutes.

Since many of Gilt City’s offers are of an impulsive nature and time-sensitive, long processing time becomes a clear bottleneck to user experience. Team Marconi in Gilt have been driving the work on the Instant Vouchers Initiative for the past few months, in an effort to re-architect the order-processing backend using the latest cloud technologies. We believe that reducing this wait time will significantly boost the overall shopping experience and enable immediate use of vouchers, which in turn allows for new features such as location-based push notifications.

An Event Driven, Serverless Architecture

It is never easy to rewrite (or replace) a mission critical system. In our case, we have to keep the existing monolithic Ruby on Rails app running while spinning up a new pipeline. We took the strangler pattern (see this Martin Fowler article for an explanation) and built a new API layer for processing individual orders around the existing batch-processing, job-based system in the same Rails app. With this approach, the legacy job-based system gradually receives less traffic and becomes a fallback safety net to catch and retry failed orders from the instant processing pipeline.

The new instant order pipeline starts with the checkout system publishing a notification to an SNS topic whenever it creates an order object. An order notification contains the order ID to allow event subscribers to look up the order object in the order key-value store. An AWS Lambda application order-notification-dispatcher subscribes to this SNS topic and kicks off the processing by invoking an AWS Step Functions resource. See below a simplified architecture diagram of the order processing system.
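The dispatcher's first step is just unwrapping the SNS envelope to recover the order ID; a simplified sketch (the event shape is the standard SNS-to-Lambda delivery format):

```javascript
// Extract the order ID from an SNS event as delivered to a Lambda handler.
// The order notification itself is a JSON string inside the SNS envelope.
function orderIdFromSnsEvent(event) {
  const message = JSON.parse(event.Records[0].Sns.Message);
  return message.order_id;
}

// Inside the Lambda handler this ID is used to look up the order in the
// order key-value store before starting the Step Functions execution.
```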

The architecture leverages Lambda and Step Functions from the AWS Serverless suite to build several key components. At HBC, different teams have started embracing a serverless paradigm to build production applications. There are many benefits of adopting a serverless paradigm, such as abstraction from infrastructure, out-of-the-box scalability, and an on-demand cost model just to name a few. Compared to the alternative of building and maintaining an array of EC2/container instances, a serverless architecture goes a step beyond microservices to allow an even faster development iteration cycle. With the use of Step Functions as an orchestration engine, it is much easier to facilitate interaction between Lambda applications.


AWS Step Functions for Lambda Orchestration

As mentioned above, AWS Step Functions is an orchestration service that makes it easy to coordinate stateless Lambda applications by establishing a specification for transitioning application state. Behind the scenes, it is represented as a state machine written in the JSON-based Amazon States Language. See below a sample execution from the order-processing step function.


Inside Step Functions

At the top level, the specification includes various types of States, such as Task, Choice and Wait, which compose simple business logic to transition application state. Inside a Task state, an AWS Lambda ARN can be specified for invocation; the output of the Lambda is passed as input to the next state. This is an excerpt from the order-processing state machine:

{
  "Comment": "Order processing state machine",
  "StartAt": "ChangeOrderStatus",
  "States": {
    "ChangeOrderStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:1234567890:function:start-order-processing:2",
      "TimeoutSeconds": 30,
      "Next": "FraudScan"
    },
    "FraudScan": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:1234567890:function:fraud-scan:2",
      "TimeoutSeconds": 30,
      "Next": "IsFraudOrder"
    },
    "IsFraudOrder": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.fraud_verdict",
          "StringEquals": "cleared",
          "Next": "AuthorizePayment"
        },
        {
          "Variable": "$.fraud_verdict",
          "StringEquals": "fraud",
          "Next": "FraudOrderTerminal"
        }
      ]
    },
    "AuthorizePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:1234567890:function:authorize-payments:2",
      "TimeoutSeconds": 30,
      "Next": "WarehouseChoice"
    },
    "FraudOrderTerminal": {
      "Type": "Pass",
      "Result": "This is the ending state for a fraud order",
      "End": true
    }
  }
}

Polling and Retry on Errors

A serverless paradigm fits really well in situations where computation completes within a short time (ideally seconds). However, sometimes we still need to run a task that takes slightly longer. For example, in our pipeline we need to keep polling a service endpoint for a fraud-scan result, since it is an async process. We implemented this by defining a retry counter, get_fraud_status_retries, within a Choice state and setting a maximum of 60 attempts before terminating retries.

"IsFraudOrder": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.fraud_verdict",
      "StringEquals": "cleared",
      "Next": "AuthorizePayment"
    },
    {
      "Variable": "$.fraud_verdict",
      "StringEquals": "fraud",
      "Next": "FraudOrderTerminal"
    },
    {
      "Variable": "$.get_fraud_status_retries",
      "NumericLessThanEquals": 60,
      "Next": "FraudScanWait"
    },
    {
      "Variable": "$.get_fraud_status_retries",
      "NumericGreaterThan": 60,
      "Next": "FraudStatusUnavailableTerminal"
    }
  ]
}
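A sketch of the Lambda that drives this polling loop (field names beyond fraud_verdict and get_fraud_status_retries are hypothetical):

```javascript
// One pass of the fraud-status polling Lambda: carry the retry counter in
// the state so the Choice state can decide to wait again or give up.
function pollFraudStatus(state, fetchVerdict) {
  const verdict = fetchVerdict(state.order_id);  // service call (stubbed synchronously here)
  return Object.assign({}, state, {
    fraud_verdict: verdict,                      // e.g. "cleared", "fraud", or "pending"
    get_fraud_status_retries: (state.get_fraud_status_retries || 0) + 1
  });
}
```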

It is also critical to make cloud applications resilient to errors such as network timeouts. Step Functions provides error handling that allows catching and retrying on predefined errors as well as on customised Lambda error types. You can specify different retry strategies with properties such as MaxAttempts and BackoffRate. See the example below, where we implement a retry mechanism for different errors in the Task state that creates redemption codes:

"CreateRedemptionCode": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:1234567890:function:create-redemption-code:3",
  "TimeoutSeconds": 30,
  "Next": "FulfillElectronicOrder",
  "Retry": [
    {
      "ErrorEquals": [ "GatewayTimeoutError" ],
      "IntervalSeconds": 5,
      "MaxAttempts": 2
    }
  ],
  "Catch": [
    {
      "ErrorEquals": [ "States.ALL" ],
      "Next": "CatchMissingRedemptionCode"
    }
  ]
}
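For intuition, the waits Step Functions applies under such a policy follow a geometric schedule: the first retry comes after IntervalSeconds, and each subsequent wait is multiplied by BackoffRate (which defaults to 2.0). A small sketch:

```javascript
// Compute the wait (in seconds) before each retry attempt under a
// Step Functions retry policy: IntervalSeconds, then multiplied by
// BackoffRate (default 2.0) for every subsequent attempt.
function retrySchedule(intervalSeconds, maxAttempts, backoffRate) {
  const rate = backoffRate === undefined ? 2.0 : backoffRate;
  const waits = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    waits.push(intervalSeconds * Math.pow(rate, attempt));
  }
  return waits;
}

// For the GatewayTimeoutError policy above (IntervalSeconds: 5, MaxAttempts: 2):
// retrySchedule(5, 2) -> [5, 10]
```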

Immutable Deployment & Partial Rollout

Deploying a mission critical service to a production environment is always a nervous process. At HBC we advocate immutable deployments whenever possible and leverage A/B testing to help us roll out new features to customers in a gradual manner. In a serverless world, it is a little different, since most of the infrastructure management is abstracted away.

Lambda Versioning

AWS Lambda’s versioning feature provides the ability to make Lambda functions immutable by taking a snapshot of the function (aka publishing a version). We really like this, since it ensures the Lambda function artifact, as well as its environment variables, remains immutable once published. Note that in the state machine JSON snippets above, the ARN specified for each Lambda resource is a Lambda version ARN instead of a function ARN. We also use Lambda’s aliasing feature to have a prod alias mapped to the current production version, with immutable environment variables:


With aliasing we can easily roll back to a previous Lambda version in case of an unexpected production failure.

Blue/Green Stacks

So we have immutable Lambda functions, but we still want our Step Functions (SF) to be immutable too. We decided to create a new SF resource on every release, while the old SF resource remains unchanged. Since AWS does not currently provide a versioning feature for Step Functions, we include a semantic version in the SF name, e.g. order-processing-v0.0.6. With both new and old versions (including historical SFs) available, we are able to apply a blue/green deployment and rollback procedure.

To route orders to either the blue or green stack, we make the order-notification-dispatcher Lambda the de facto router by providing the blue and green SF versions as its environment variables. Here is the Node.js code that reads them:

const stateMachineBlueVer = process.env.STATE_MACHINE_BLUE_VER;
const stateMachineGreenVer = process.env.STATE_MACHINE_GREEN_VER;

With the fetched state machine version we can compose the Step Function ARN in a predefined format, then start a new execution via the AWS SDK Step Functions API:

const stateMachineVersion = ... // Read from environment vars

function dispatch(orderJson) {
  const orderId = orderJson.order_id;
  const stateMachine = preProcessingStepFunctionPrefix + stateMachineVersion;
  const params = {
    stateMachineArn: stateMachine,
    name: orderId.toString(),
    input: JSON.stringify(orderJson)
  };
  return new AWS.StepFunctions().startExecution(params).promise();
}

Partial Rollout

We make the order-notification-dispatcher query our a/b test engine to apply simple routing logic to each order notification, shifting traffic to either the blue or green Step Function stack according to the test/control group the order falls into. Note that AWS recently released a nice traffic-shifting feature for Lambda applications; we didn’t use it because our a/b test engine provides finer-grained control, such as targeting certain groups like HBC’s internal employees. Here is a diagram depicting the partial rollout process for new Step Function resources:
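The routing itself reduces to picking one of the two configured versions per order; a sketch with a hypothetical a/b test client:

```javascript
// Route an order to the blue or green Step Function stack based on the
// a/b test group the order's user falls into. The abTestClient here is a
// stand-in for our real a/b test engine.
function chooseStackVersion(userId, abTestClient, blueVer, greenVer) {
  // groupFor returns e.g. "control" or "test" for a given user.
  return abTestClient.groupFor(userId) === 'test' ? greenVer : blueVer;
}
```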



What We Have Achieved

As of today, all of Gilt City’s orders are directed to the instant processing pipeline, which shortens processing time for the majority of orders from over 15 minutes to a few seconds. We are looking to expand the system to take over more workload, including physical products, to bring the instant-order user experience to a wider customer base.

Step Functions Limitations

From our development experience with AWS Step Functions we discovered some limitations of the service. First of all, it lacks a feature like a Map state, which would take a list of input objects and transform it into a list of result objects. A possible solution could be to allow invoking a sub-SF multiple times. In our case, an order object can be split into multiple order objects depending on the items in the original order. Unfortunately, SF does not offer a state type that can map over a dynamic number of elements. We eventually worked around this by creating an order-pre-processing SF and making it invoke the order-processing SF multiple times to process those ‘split’ orders.
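The fan-out workaround hinges on the pre-processing step splitting one order into several; the split itself can be sketched as a pure function (the order shape here is hypothetical):

```javascript
// Split an order into one sub-order per item, so each can be fed to a
// separate order-processing Step Function execution.
function splitOrder(order) {
  return order.items.map(function (item, index) {
    return {
      order_id: order.order_id + '-' + (index + 1),  // derived sub-order id
      parent_order_id: order.order_id,
      items: [item]
    };
  });
}

// The pre-processing Step Function then calls startExecution once for
// each element returned here.
```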

Secondly, we hope AWS can provide versioning/aliasing for Step Functions so we can gain immutability out of the box instead of forcing immutability on our side. Any support for blue/green deployment would be even better.

Also, we expect AWS to provide better filtering/searching abilities on the Step Functions dashboard so we can derive some fundamental data analytics from historical executions. This could be achieved by declaring some “searchable” fields and their types in the SF definition.

In the context of AWS Enterprise Support, we (Team Marconi) had a productive meeting with the AWS Step Functions Product Manager, during which we suggested our list of improvements. It was gratifying to hear that most of these are already on, or will be added to, their development roadmap.

Future Work

From an architecture perspective, we are trying to standardize a continuous delivery process for our serverless components. At the moment, what we have is a “poor man’s CI/CD”: some bash/node scripts that use the AWS CloudFormation SDK to provision resources. There are various tools available, either from AWS or from the serverless community, such as Terraform and CodePipeline, that we are trying to integrate to provide a frictionless path to production.


Presentations we love: 2017

HBC Tech

2017 was a year of growth and learning at HBC Tech. Our organization embraced new technologies and new ways of building application software.

As the year comes to an end, let’s recognize some notable technical presentations from 2017.

Kubernetes Project update

Kelsey Hightower (@kelseyhightower) at KubeCon 2017

Kelsey Hightower video

Production: Designing for testability

Mike Bryzek (@mbryzek) at QCon New York 2017

Mike Bryzek video

Streaming Microservices: Contracts & Compatibility

Gwen Shapira (@gwenshap) at QCon New York 2017


Spinnaker and the Culture Behind the Tech

Dianne Marsh (@dmarsh) at KubeCon 2017

Dianne Marsh video

Embracing Change without breaking the world

Jim Flanagan and Kyle Thomson at AWS re:invent 2017

AWS Embracing Change

Developing Applications on AWS in the JVM

Kyle Thomson (@kiiadi) at AWS re:invent 2017


Chaos Engineering at Netflix

Nora Jones (@nora_js) at AWS re:invent 2017

Chaos Engineering at Netflix


Sean Sullivan (@tinyrobots) at Scala Up North 2017


Managing Data in Microservices

Randy Shoup (@randyshoup) at QCon New York 2017

Randy Shoup - Managing Data

Crushing Tech Debt Through Automation at Coinbase

Rob Witoff (@rwitoff) at QCon London 2017

Rob Witoff - Tech Debt

Gilt’s iOS codebase evolution

Evan Maloney (@_emaloney_) at the Brooklyn Swift Developers Meetup


Apache Struts and the Equifax Data Breach

Sean Sullivan (@tinyrobots) at the Portland Java User Group


Promcon 2017

Giovanni Gargiulo (@giannigar) at Promcon 2017 (Munich)



Dublin Scala Spree

Gregor Heine

Dublin Scala Spree

This Friday the Gilt/HBC Digital Dublin office will be hosting the first ever Dublin Scala Spree, a day-long Scala Open Source Hackathon. The event is organized by the Dublin Scala Usergroup in cooperation with Dublin Functional Kubs and the Scala Center at EPFL in Lausanne, Switzerland.

  • Date & Time: Friday, 15th September, 10am - 4pm
  • Location: Gilt/HBC Digital Office, Shelbourne Rd., Dublin 4, Ireland
  • Sign-Up: Please register for the event via the Dublin Scala Users Group
  • Organizers: Dublin Scala Meetup and Dublin Functional Kubs in cooperation with the Scala Center @ EPFL in Lausanne

What is a Scala Spree?

Scala Spree is a free community event aiming to popularize Open Source Software. It brings together Open Source authors, maintainers and software engineers willing to contribute to OSS projects. Under the guidance of seasoned experts, newcomers learn about the inner workings of some popular tools and Scala libraries, and contribute to making them even better. For library authors, it’s an opportunity to improve their tools and get fresh feedback. For attendees, it is a unique opportunity to learn more about Scala, contribute to Open Source Software and expand their skills. And for everyone it’s a great opportunity to meet and have fun!

For this week’s Spree we have the following special guests and their OSS projects:

If you have a Scala open source project that you would like to feature at the Spree, please get in touch with the Dublin Scala Users Group organizers.

Like all Dublin Scala Community events, Scala Spree is free of charge, and the only real requirement is an open mind and the will to contribute! (Apart from bringing your own computer to use, but chances are you figured that out already.)

Duration and pace

To begin with, maintainers gather in front of all the contributors to briefly explain their projects and tickets in one minute each. The idea is to give a good high-level explanation to motivate participants without going into too much detail. When they are done, participants approach the projects they are most interested in and get in contact with the maintainers. At this point, maintainers usually listen to the participants’ experience and provide personal guidance on tickets that would suit them. Then, the fun begins! Participants start hacking on their projects and maintainers review PRs as they come in, assisting participants when they ask for help.

We encourage maintainers to merge as many PRs as possible on the spot, for two reasons:

  • participants get a small token of appreciation from the Scala Center
  • it increases the motivation of the participants

If participants get their first PR merged, they are invited to continue solving issues until they are happy with their work! In the middle of the spree, we will provide free lunch and refreshments. Participants can leave the event at any time they want. As the end approaches, everyone starts to wrap up: participants finish their PRs while maintainers finish their reviews, and the organizers of the spree give away swag.

Places are strictly limited and will be allocated on a first come, first served basis. Registration through the Dublin Scala Users Group is required and only successful RSVPs can attend.


Team Rookie 2017

Team Rookie

Who We Are

Team Rookie 2017, which prides itself on being the most awesome team ever, spent the summer improving the browsing experience for Gilt users and collecting data for our personalization team. The end result of our project included a crafted front-end user experience and a back-end service for data processing.

Project Ideation

The final project idea rose to the top through countless meetings and discussions with various teams in the organization. When the initially chosen problem and solution proved unworkable, our team, along with all of our mentors, worked to come up with a new solution to the given problem with the limited resources we had. This immersive process, at the very beginning of the program, ensured we understood the engineering problem and set the project up for success.

To arrive at the best possible solution, we spent time learning the technology stack end to end. We went through many tutorials and labs with our mentors on the technologies we would eventually use, namely Scala, Android, and the Play framework. As we gained familiarity with these tools and technologies, we quickly finalized our ideas and the project took off.

Problem Space:

So let’s talk about the problem. With a growing user base, the Gilt platform needs to better understand what users’ interests are in order to tailor unique shopping experiences to different user groups. Currently, users are able to “shop the look.” This feature allows a user to browse a complete outfit, such as the combination of a shirt, a pair of jeans, and shoes. It spares users the hassle of discovering these items separately: they can find them all at once and make a single purchase. At the moment, these completed looks are selected by stylists. While stylists may provide the highest-quality pairings, we are unable to scale human labor to the entire catalog. As fashion trends change, we need to update our pairings accordingly. Therefore, we aim to continuously collect user opinions on possible pairings. With these we can develop machine learning models to infer item compatibility. This is an ambitious goal, but not an unachievable one. We just need a steady supply of data.


To tackle this problem, we proposed to create a fun and engaging experience for users while they shop: completing their own outfits. One key requirement for this experience is that it cannot interfere with the current purchase flow: if a user is closing in on a purchase, that process should not be interrupted. Therefore, rather than inserting the experience within the current workflow, we decided to include the feature on the search page, where users are able to favorite items they like. This is shown in the figure below.

For our experience, to minimize disruption to the current workflow, we’ve added an additional hover link on the favorite button, and this will direct the users to our experience.

We provide users with additional items that can potentially be paired with the initially favorited item to form completed looks. These products, limited by category and price based on the favorited item, are presented to the users for individual selection. The users can let their imaginations go wild and pick what they think are the best combinations. During this process, we collect this data and persist it through our back-end API to the database.

Finally, in order to complete the experience and make it as engaging as possible, we’ve decided to allow the users to immediately purchase the selected items if they wish. Since these items are what they specifically picked out from a pool of products, they will have a greater likelihood for conversion.

So, in a nutshell, this is the completed project of a 10-week internship filled with hard work, grind, sweat (mostly from our daily trips to the Equinox right downstairs), and a whole lot of fun.

Intern Activities

While we were not busy being awesome engineers, team-rookie spent most of our leisure time exploring New York and staying cool. Here are some of the highlights.


Team Rookie would like to give a huge shout-out to all of our mentors who helped us along the way and made this project possible (you know who you are)! Special thanks to Doochan and Mike, who led the intern committee through all of our battles and came out on the other end with a solid victory. The complete-the-look experience would not have been possible without you guys.

Gilt Tech

HBC Tech Talks: February 2017 through July 2017

HBC Tech conferences

We’ve had a busy 2017 at HBC. The great work of our teams has created opportunities to share what we’ve learned with audiences around the world. This year our folks have been on stage in Austin, Sydney, Portland, Seattle, San Diego, Boston, London, Israel and on our home turf in NYC and Dublin. The talks have covered deep learning, design thinking, data streaming and developer experience to name just a few.

Lucky for you, if you haven’t been able to check out our talks in person, we’ve compiled the decks and videos from a bunch of our talks right here. Enjoy!







  • Sean Sullivan spoke at Scala Up North and the Portland Java User Group about ApiBuilder.
  • Sophie Huang spoke at the Customer Love Summit in Seattle.
  • Kyla Robinson gave a keynote on Key to Success: Creating A Mobile–First Mentality.
  • Sera Chin and Yi Cao spoke at the NYC Scrum User Group about HBC’s Design Sprints.
Gilt Tech

Sundial or AWS Batch, Why not both?

Kevin O'Riordan data

Sundial on AWS Batch

About a year ago, we (the Gilt/HBC personalization team) open sourced Sundial, a batch job orchestration system leveraging Amazon EC2 Container Service (ECS).

We built Sundial to provide the following features on top of the standard ECS setup:

  • Streaming Logs (to Cloudwatch and S3 and live in Sundial UI)
  • Metadata collection (through Graphite and displayed live in Sundial UI)
  • Dependency management between jobs
  • Retry strategies for failed jobs
  • Cron style scheduling for jobs
  • Email status reporting for jobs
  • Pagerduty integration for notifying team members about failing critical jobs
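To make these features concrete, here is a purely hypothetical sketch of what a job entry exercising them might look like; the field names are invented for illustration and are not Sundial's actual schema.

```python
# Hypothetical sketch of a job entry combining the features above:
# cron scheduling, dependencies, retries, and notification hooks.
# Field names are invented for illustration; they are NOT Sundial's real schema.
job = {
    "name": "build-recommendation-model",
    "schedule": "0 4 * * *",                   # cron-style: daily at 04:00
    "depends_on": ["load-warehouse-extract"],  # dependency management
    "retries": {"max_attempts": 3},            # retry strategy for failures
    "notifications": {
        "email": ["team@example.com"],         # status reporting
        "pagerduty_on_failure": True,          # paging for critical jobs
    },
}

def is_critical(job):
    """A job is 'critical' here if a failure should page someone."""
    return job["notifications"].get("pagerduty_on_failure", False)
```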

alt text

Other solutions available at the time didn’t suit our needs. Those we considered included Chronos, which lacked the features we needed and required a Mesos cluster; Spotify’s Luigi; and Airbnb’s Airflow, which was immature at the time.

At the time, we chose ECS because we hoped to take advantage of AWS features such as autoscaling to save costs by scaling the cluster up and down with demand. In practice, this required too much manual effort and too many moving parts, so we lived with a long-running cluster scaled to handle peak load.

Since then, our needs have grown and we have jobs ranging in size from a couple of hundred MB of memory to 60GB of memory. Having a cluster scaled to handle peak load with all these job sizes had become too expensive. Most job failure noise has been due to cluster resources not being available or smaller jobs taking up space on instances meant to be dedicated to bigger jobs. (ECS is weak when it comes to task placement strategies).

Thankfully AWS have come along with their own enhancements on top of ECS in the form of AWS Batch.

What we love about Batch

  • Managed compute environment. This means AWS handles scaling up and down the cluster in response to workload.
  • Heterogeneous instance types (useful when we have outlier jobs taking large amounts of CPU/memory resources)
  • Spot instances (save over half on on-demand instance costs)
  • Easy integration with Cloudwatch Logs (stdout and stderr captured automatically)
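As a sketch of what driving Batch programmatically looks like, the snippet below builds a `submit_job` request for boto3; the job, queue, and definition names are placeholders, and the request is kept as a plain dict so the actual AWS call stays optional.

```python
# Sketch: build an AWS Batch submit_job request for boto3.
# The queue/definition names below are placeholders; memory is in MiB.
def build_submit_job_request(name, queue, definition, memory_mib, vcpus):
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "containerOverrides": {
            "memory": memory_mib,
            "vcpus": vcpus,
            "environment": [{"name": "JOB_NAME", "value": name}],
        },
    }

req = build_submit_job_request("nightly-etl", "personalization-queue",
                               "etl-job-def:3", memory_mib=4096, vcpus=2)
# To actually submit (requires AWS credentials):
#   import boto3
#   boto3.client("batch").submit_job(**req)
```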

What sucks

  • Not being able to run “linked” containers (We relied on this for metadata service and log upload to S3)
  • Needing a custom AMI to configure extra disk space on the instances.

What we’d love for Batch to do better

  • Make disk space on managed instances configurable. Currently the workaround is to create a custom AMI with the disk space you need if you have jobs that store a lot of data on disk (Not uncommon in a data processing environment). Gilt has a feature request open with Amazon on this issue.

Why not dump Sundial in favour of using Batch directly?

Sundial still provides features that Batch doesn’t provide:

  • Email reporting
  • Pagerduty integration
  • Easy transition, processes can be a mixed workload of jobs running on ECS and Batch.
  • Configurable backoff strategy for job retries.
  • Time limits for jobs. If a job hangs, we can kill and retry after a certain period of time
  • Nice dashboard of processes (At a glance see what’s green and what’s red)
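Two of those features, configurable retry backoff and job time limits, are easy to picture. The sketch below shows the general logic with invented parameter values, not Sundial's actual implementation:

```python
def backoff_seconds(attempt, base=60, factor=2.0, cap=3600):
    """Exponential backoff: 60s, 120s, 240s, ... capped at an hour.
    Parameter values here are invented for illustration."""
    return min(base * factor ** attempt, cap)

def should_kill(started_at, now, time_limit_seconds):
    """Kill (and later retry) a job that has run past its time limit."""
    return (now - started_at) > time_limit_seconds
```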

alt text

Admittedly, some of the above can be configured by hooking up Lambdas, SNS messages, etc., but Sundial gives it to you out of the box.

What next?

Sundial with AWS Batch backend now works great for the use cases we encounter doing personalization. We may consider enhancements such as Prometheus push gateway integration (to replace the Graphite service we had with ECS and to keep track of metrics over time) and UI enhancements to Sundial.

In the long term we may consider other open source solutions, as maintaining a job system counts as technical debt that is a distraction from product-focused tasks. The HBC data team, who have very similar requirements to us, have started adopting Airflow (by Airbnb). As part of their adoption, they have contributed to an open source effort to make Airflow support Batch as a backend. If it works well, this is a solution we may adopt in the future.

Gilt Tech

Visually Similar Recommendations

Chris Curro personalization

Previously we’ve written about Tiefvision, a technical demo showcasing the ability to automatically find dresses similar to a particular one of interest. For example:

Since then, we’ve worked on taking the ideas at play in Tiefvision, and making them usable in a production scalable way, that allows us to roll out to new product categories besides dresses quickly and efficiently. Today, we’re excited to announce that we’ve rolled out visually similar recommendations on Gilt for all dresses, t-shirts, and handbags, as well as to women’s shoes, women’s denim, women’s pants, and men’s outerwear.

Let’s start with a brief overview. Consider the general task at hand. We have a landing page for every product on our online stores. For the Gilt store, we refer to this as the product detail page (PDP). On the PDP we would like to offer the user a variety of alternatives to the product they are looking at, so that they can best make a purchasing decision. There exist a variety of approaches to selecting other products to display as alternatives; a particularly popular approach is called collaborative filtering which leverages purchase history across users to make recommendations. However this approach is what we call content-agnostic – it has no knowledge of what a particular garment looks like. Instead, we’d like to look at the photographs of garments and recommend similar looking garments within the same category.

Narrowing our focus a little bit, our task is to take a photograph of a garment and find similar looking photographs. First, we need to come up with some similarity measure for photographs, then we will need to be able to quickly query for the most similar photographs from our large catalog.

This is something we need to do numerically. Recall that we can represent a photograph as some tensor X in [0, 1]^(h × w × 3) (in other words, a three-dimensional array with entries between 0 and 1). Given that we have a numerical representation for a photograph, you might think we could do something simple to measure the similarity between two photographs. Consider:

d(X1, X2) = ||X1 − X2||_F

which we’d refer to as the Frobenius norm of the difference between the two photographs. The problem with this approach, although it is simple, is that we’re not measuring the difference between semantically meaningful features. Consider these three dresses: a red floral print, pink stripes, and a blue floral print.

With this “pixel-space” approach the red floral print and the pink stripes are more likely to be recognized as similar than the red floral print and the blue floral print, because they have pixels of similar colors at similar locations. The “pixel-space” approach ignores locality and global reasoning, and has no insight into semantic concepts.
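For illustration, the pixel-space comparison is a one-liner in NumPy, which is exactly why it is tempting; the toy 2×2 "photographs" below stand in for the dresses:

```python
import numpy as np

def pixel_space_distance(x1, x2):
    """Frobenius norm of the difference between two image tensors."""
    return np.linalg.norm(x1 - x2)

# Toy stand-ins for the three dresses (2 x 2 pixels, RGB channels).
red_floral = np.zeros((2, 2, 3)); red_floral[..., 0] = 1.0    # all red
pink = np.full((2, 2, 3), 0.8); pink[..., 0] = 1.0            # pinkish
blue_floral = np.zeros((2, 2, 3)); blue_floral[..., 2] = 1.0  # all blue

# Pixel-space says red is closer to pink than to blue, even though
# the two florals are semantically more similar.
assert pixel_space_distance(red_floral, pink) < pixel_space_distance(red_floral, blue_floral)
```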

What we’d like to do is find some function f that extracts semantically meaningful features. We can then compute our similarity metric in the feature-space rather than the pixel-space. Where do we get this f? In our case, we leverage deep neural networks (deep learning) for this function. Neural networks are hierarchical functions, typically composed of sequential connections of simple building blocks. This structure allows us to take a neural network trained for a specific task, like arbitrary object recognition, and pull from some intermediate point in the network. For example, say we take a network, trained to recognize objects in the ImageNet dataset, composed of a sequence of building blocks:

We might take the output of one of the intermediate building blocks and call it our features.

In the case of convolutional networks like the VGG, Inception, or ResNet families, our output features would lie in some vector space R^(h′ × w′ × c). The first two dimensions correspond to the original spatial dimensions (at some reduced resolution) while the third corresponds to some set of feature types. So in other words, if one of our feature types detects a human face, we might see a high numerical value in the spatial position near where a person’s face is in the photograph. In our use cases, we’ve determined that this spatial information isn’t nearly as important as the feature types that we detect, so at this point we aggregate over the spatial dimensions to get a vector in R^c. A simple way to do this aggregation is with an arithmetic mean, but other methods work as well.
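The spatial aggregation step is just a mean over the two spatial axes. A minimal sketch, using a random array as a stand-in for a network's intermediate feature map:

```python
import numpy as np

# Stand-in for an intermediate conv feature map: h' x w' x c.
feature_map = np.random.rand(7, 7, 512)

# Aggregate away the spatial dimensions, keeping one value per feature type.
embedding = feature_map.mean(axis=(0, 1))

assert embedding.shape == (512,)
```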

From there we could build up some feature matrix F of shape N × c, where N is the number of items in a category of interest and c is the feature dimension. We could then construct an N × N similarity matrix S whose entries are pairwise similarities (e.g. dot products) between rows of F.

Then to find the most similar items to a query item i, we look at the locations of the highest values in row i of the similarity matrix.
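With a feature matrix in hand (one embedding per row), exact retrieval is a matrix product and a sort. A small NumPy sketch with random embeddings standing in for real ones:

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 100, 32
F = rng.normal(size=(N, c))   # N item embeddings, one per row

S = F @ F.T                   # N x N similarity matrix (dot products)
query, k = 7, 5

# Highest values in the query's row, excluding the item itself.
row = S[query].copy()
row[query] = -np.inf
top_k = np.argsort(row)[::-1][:k]
assert query not in top_k and len(top_k) == k
```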

This approach is infeasible as N becomes large: computing the similarity matrix has computational complexity O(N²c), and storing it has space complexity O(N²). To alleviate this issue, we can leverage a variety of approximate nearest neighbor methods. We find empirically that approximate neighbors are sufficient. Also, when we consider that our feature space represents some arbitrary embedding with no guarantees of any particular notion of optimality, it becomes clear there’s no grounded reason to warrant exact nearest neighbor searches.
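One family of approximate methods is random-projection hashing: bucket items by the sign pattern of a few random projections, then score only the items in the query's bucket. This is a minimal sketch of the idea, not the index we run in production:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
N, c, n_planes = 1000, 32, 8
F = rng.normal(size=(N, c))

planes = rng.normal(size=(c, n_planes))   # random hyperplanes

def bucket(v):
    """Hash a vector to the sign pattern of its projections."""
    return tuple((v @ planes > 0).astype(int))

index = defaultdict(list)
for i, v in enumerate(F):
    index[bucket(v)].append(i)

# Query: only score candidates that share the query's bucket;
# on average only ~N / 2**n_planes items get scored exactly.
q = F[0]
candidates = [i for i in index[bucket(q)] if i != 0]
```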

How do we do it?

We leverage several open source technologies, as well as established results from published research to serve visually similar garments. As far as open source technology is concerned, we use Tensorflow, and (our very own) Sundial. Below you can see a block diagram of our implementation:

Let’s walk through this process. First, we have a Sundial job that accomplishes two tasks. We check for new products, and then we compute embeddings using Tensorflow and a pretrained network of a particular type for particular categories of products. We persist the embeddings on AWS S3. Second, we have another Sundial job, again with two tasks. This job filters the set of products to ones of some particular interest and generates a nearest neighbors index for fast nearest neighbor look-ups. The job completes, persisting the index on AWS S3. Finally, we wrap a cluster of servers in a load balancer. Our product recommendation service can query these nodes to get visually similar recommendations as desired.
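Stripping away the AWS machinery, the shape of the pipeline is two batch jobs handing artifacts through a shared store. In this sketch a dict stands in for S3, and a random projection stands in for the pretrained network:

```python
import numpy as np

store = {}                       # stands in for AWS S3
rng = np.random.default_rng(2)
W = rng.normal(size=(100, 32))   # stand-in "pretrained network": one projection

def embedding_job(products):
    """Job 1: compute embeddings for products, persist them to the store."""
    store["embeddings"] = {pid: px @ W for pid, px in products.items()}

def index_job(category):
    """Job 2: filter to one category, build the (here: exact) index."""
    store["index"] = {pid: e for pid, e in store["embeddings"].items()
                      if pid.startswith(category)}

products = {f"dress-{i}": rng.normal(size=100) for i in range(5)}
products["bag-0"] = rng.normal(size=100)
embedding_job(products)
index_job("dress")
assert len(store["index"]) == 5 and "bag-0" not in store["index"]
```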

Now, we can take a bit of a deeper dive into the thought process behind some of the decisions we make as we roll out to new categories. First, and perhaps the most important, is what network type and where to tap it off so that we can compute embeddings. If we recall that neural networks produce hierarchical representations, we can deduce (and notice empirically) that deeper tap-points (more steps removed from the input) produce embeddings that pick up on “higher level” concepts rather than “low level” textures. So, for example, if we wish to pick up on basic fabric textures we might pull from near the input, and if we wish to pick up something higher level like silhouette type we might pull from deeper in the network.

The filtering step before we generate an index is also critically important. At this point we can narrow our products down to only one particular category, or even some further sub-categorization, to leverage the deep knowledge of fashion present at HBC.

Finally, we must select the parameters for index generation, which control the error-rate/performance trade-off in the approximate nearest neighbor search. We select these parameters empirically, utilizing our knowledge of fashion once again to determine a good operating point.

What’s next?

We’ll be working to roll out to more and more categories, and even do some cross category elements, perhaps completing outfits based on their visual compatibility.

Gilt Tech

How Large Is YOUR Retrospective?

Dana Pylayeva agile

Can you recall the size and length of your typical retrospective? If your team operates by The Scrum Guide, your retrospectives likely have fewer than ten people in one room and last about an hour for a two-week Sprint.

What if your current team is larger than a typical Scrum team and the retrospective period is longer than a month? What if the team members are distributed across locations, countries, time zones and multiple third-party vendors? Is this retrospective doomed to fail? Not quite. These factors just add complexity and call for a different facilitation approach.

Last month at HBC we facilitated a large-scale mid-project retrospective for a 60-person project team. While this project certainly didn’t start as an agile project, bringing in an agile retrospective practice helped identify significant improvements. Here is how we did it.

From Inquiry to Buy-in

This all started with one of the project sub-teams reaching out with an inquiry: “Can you facilitate a retrospective for us?” That didn’t sound like anything major. We’ve been advocating for and facilitating retrospectives on various occasions at HBC: regular Sprint retrospectives, process retrospectives, new hire onboarding retrospectives etc.

Further digging into the list of participants revealed that this retro would be unlike any other. We were about to pull together a group of 60 people from HBC and five consulting companies(!) In spite of working on the same project for a long time, these people had never had a chance to step back and reflect on how they could work together differently.

In order to make it successful, we needed buy-in from the leadership team to bring the entire team (including consultants) into the retrospective. Our first intent was to bring everyone into the same space (physical and virtual) and facilitate a retrospective with Open Space Technology. Initial response wasn’t promising:

“We have another problem with this retro […] is concerned that it is all day and that the cost of doing this meeting is like $25K-$50K”

We had to go back and re-think the retrospective approach. How can we reduce the cost of this event without affecting the depth and breadth of the insights?

Options we considered

Thanks to the well-documented large retrospectives experiments by other agile practitioners, there was a number of options to evaluate:

1) Full project team, full day, face-to-face, Open Space-style retro
2) Decentralized, theme-based retros with learnings collected over a period of time and shared with the group
3) Decentralized retrospectives using the Innovation Games Online platform
4) Overall retrospective (LeSS framework)

Around the same time, I was fortunate to join the Retrospective Facilitator’s Gathering (RFG2017), an annual event that brings together the most experienced retrospective facilitators from around the world. Learning from their experience as well as brainstorming together on the possible format was really helpful. Thank you Tobias Baier, Allan Jepsen, Joanne Perold, George Dinwiddie and many others for sharing your insights! I was especially grateful for the in-depth conversation with Diana Larsen, in which she advised:

“Clarify the goal and commitment of the key stakeholders before you start designing how to run the retrospective.”

Back to the drawing board again! More conversations, clarifications and convincing… With some modifications and adjustments, we finally were able to get the buy-in and moved forward with the retrospective.

What worked for us – a tiered format.

Tiered Retro

Individual team-level retrospectives

We had a mix of co-located and distributed sub-teams on this project and chose to enlist some help from multiple facilitators. To simplify data consolidation, each facilitator received a data gathering format along with a sample retrospective facilitation plan. Each individual sub-team was asked to identify two types of action items: ones that they felt were in their power to address and others that required a system-level thinking and the support from the larger project community. The former were selected by the sub-teams and put in motion by their respective owners. The latter were passed to the main facilitator for analysis and aggregation to serve as a starting point for the final retrospective.

Final retrospective

For the final retrospective we brought together two types of participants:

1) Leads and delegates from individual sub-teams, who participated actively at all times.
2) Senior leaders of the organization, who joined in the last hour to review and support the team’s recommendations.

The goal of this workshop was to review the ideas from sub-teams, explore system level improvements and get the support from senior leadership to put the system-level changes into motion.

Retrospective plans

Each retrospective was structured according to the classic five-steps framework and included a number of activities selected from Retromat.

Example of an in-room sub-team retrospective (1 - 1.5 hours)

Set the Stage

We used a happiness histogram to get things started and get a sense for how the people felt about the overall project. Happiness Histogram

Instead of reading the Prime Directive once at the beginning with the team, we opted for displaying it in the room on a large poster as a visible reminder throughout the retrospective.

Gather Data

Everyone was instructed to think about the things they liked about the project (What worked well?) and the ones that could’ve been better (What didn’t work so well?). In a short time-boxed silent brainstorming each team member had to come up with at least two items in each category.

Next we facilitated a pair-share activity in a “speed dating” format. Forming two lines, we asked participants to face each other and take turns discussing what each of them wrote on their post-its. After two minutes the partners were switched and the new pairs were formed to continue discussions with the new set of people.

Pair Share

At the end of the timebox, we asked the last pairs to walk together to the four posters on the wall and place their post-its into the respective categories:
1) Worked well / Can’t control
2) Worked well / Can control
3) Didn’t work so well / Can’t control
4) Didn’t work so well / Can control

After performing an affinity mapping and a dot-voting the group selected top three issues that they felt were in their control to address.

Generate Insights/Decide What To Do

Every selected issue got picked up by a self-organized sub-group. Using a template, each sub-group designed a team-level experiment defining the action they proposed to take, an observable behavior they expected to see after taking that action, and the specific measurement that would confirm the success of the experiment.


Close the Retro

We closed the retro by getting feedback on the retro format and taking photos of the insights generated by the team. These were passed on to the main facilitator for further analysis and preparation for the final retrospective event.

Modifications for distributed teams

For those teams that had remote team members or were fully distributed, we used the FunRetro tool. The flexibility to configure columns and the number of votes, along with an easy user interface, fun colors and zero cost, made this tool a good substitute for an in-room retrospective.

Fun Retro

Final Retrospective (3 hours)

Once all the individual sub-team retrospectives were completed, we consolidated the project-level improvement proposals. These insights were reviewed, analyzed for trends and systemic issues, and then shared during the Tier 2 Final Retrospective.

Set the stage

We used story cubes to reflect on and share how each of the participants felt about this project. This is a fun way to run a check-in activity, equally effective with introverted and extroverted participants. The result is a collection of images that builds a shared story about the project:

Story Cubes

We also reviewed an aggregated happiness histogram from each individual sub-teams to learn about the mood of 60 people on this project.

Gather data

Since the retrospective period was very long, building a timeline together was really helpful in reconstructing the full view of the project. We asked participants to sort events into those that had a positive impact on the project (placing them above the timeline) and those that had a negative impact (placing them below the timeline). The insights we gained from this exercise alone were invaluable!


Generate Insights

Next we paired the participants and asked them to walk to the consolidated recommendations posters. As a pair, they were tasked with selecting the most pressing issues and bringing them back for a follow up discussion at their table.

What Worked What Didn't

Each table used the Lean Coffee format to vote on the selected issues, prioritize them into a discussion backlog and explore as many of them as the timebox allowed. Participants used Roman voting to decide whether they were ready to move on to the next topic or needed more discussion of the current one. Closing each discussion, participants recorded their recommended action. At the end of the timebox, all actions from each table were shared with the rest of the group for feedback.


Decide What To Do/Close

In the final hour of the retrospective the action owners shared their proposed next steps with the senior leadership team and reviewed the insights from the consolidated teams’ feedback.


Was this experiment successful? Absolutely! One of the biggest benefits of this retrospective was this collective experience of working across sub-teams and designing organizational improvements together.

Could we have done it better? You bet! As the project continues, we will be looking to run the retrospectives more frequently and will take into account things we learnt in this experiment.

What did we learn?

  • Designing a retrospective of this size is a project in itself. You need to be clear about the vision, the stakeholders and the success criteria for the retrospective.
  • Do your research, tap into the knowledge of agile community and get inspired by the experience of others. Take what you like and then adapt to make it work in the context of your organization.
  • Ask for help. Involve additional facilitators to get feedback, speed up the execution and create a safe space for individual sub-teams.
  • Inclusion trumps exclusion. Invite consultants as well as full-time employees into your retrospective to better understand the project dynamic.
  • Beware of potential confusion around retrospective practice. Be ready to explain the benefits and highlight the differences between a retrospective and a postmortem.
  • Bringing senior leaders into the last hour of final retrospective can negatively affect the dynamics of the discussions. Either work on prepping them better or plan on re-establishing the safe space after they join.

What would we like to do next?

  • Continue promoting the retrospective practice across the organization.
  • Offer a retrospective facilitator training to Scrum Masters, Agile Project Managers and anyone who is interested in learning how to run an effective retro.
  • Establish retrospective facilitator circle to help maintain and improve the practice for all teams.

Inspired by our experiment? Have your own experience worth sharing? We’d love to hear from you and learn what works in your environment. Blog about it and tweet your questions at @hbcdigital.

World Retrospective Day

Whether you are a retrospective pro, have never tried one, or your experience is anywhere in between, please do yourself a favor and mark February 6, 2018 on your calendar. A group of experienced retrospective facilitators is currently planning a record-breaking World Retrospective Day with live local workshops on every continent and in every time zone, along with many online learning opportunities. We are engaging with industry thought leaders to make this one of the best and most engaging learning experiences. We hope to see you there!

Gilt Tech

Advanced tips for building an iOS Notification Service Extension

Kyle Dorman ios

The Gilt iOS team is officially rolling out support for “rich notifications” in the coming days. By “rich notifications”, I mean the ability to include media (images/gifs/video/audio) with push notifications. Apple announced rich notifications as a part of iOS 10 at WWDC last year (2016). For a mobile first e-commerce company with high quality images, adding media to push notifications is an exciting way to continue to engage our users.

alt image

This post details four helpful advanced tips I wish I had when I started building a Notification Service Extension (NSE) for the iOS app. Although all of this information is available through different blog posts and Apple documentation, I am putting it all in one place in the context of building a NSE, in the hopes that it saves someone the time I spent hunting and testing this niche feature. Specifically, I will go over things I learned after the point where I was actually seeing modified push notifications on a real device (even something as simple as appending MODIFIED to the notification title).

If you’ve stumbled upon this post, you’re most likely about to start building a NSE or started already and have hit an unexpected roadblock. If you have not already created the shell of your extension, I recommend reading the official Apple documentation and some other helpful blog posts found here and here. These posts give a great overview of how to get started receiving and displaying push notifications with media.

Tip 0: Sending notifications

When working with NSEs it is extremely helpful to have a reliable way of sending yourself push notifications. Whether you use a third party push platform or a home grown platform, validate that you can send yourself test notifications before going any further. Additionally, validate that you have the ability to send modified push payloads.

Tip 1: Debugging

Being able to debug your code while you work is paramount. If you’ve ever built an app extension this tip may be old hat to you but as a first time extension builder it was a revelation to me! Because a NSE is not actually a part of your app, but an extension, it does not run on the same process id as your application. When you install your app on an iOS device from Xcode, the Xcode debugger and console are only listening to the process id of your application. This means any print statements and break points you set in the NSE won’t show up in the Xcode console and won’t pause the execution of your NSE.

alt image

You actually can see all of your print statements in the Mac Console app, but the Console also includes every print/log statement of every process running on your iOS device, and filtering these events is more pain than it’s worth.

alt image

Fortunately, there is another way. You can have Xcode listen to any of the processes running on your phone, including low-level processes like wifid; Xcode just happens to default to your application.

alt image

To attach to the NSE, you first need to send your device a notification to start up the NSE. Once you receive the notification, in Xcode go to the “Debug” tab, scroll down to “Attach to Process” and look to see if your NSE is listed under “Likely Targets”.

alt image

If you don’t see it, try sending another notification to your device. If you do, attach to it! If you successfully attached to your NSE process, you should see it grayed out when you go back to Debug > Attach to Process.

alt image

You should also be able to select the NSE from the Xcode debug area.

alt image

To validate that both the debugger and print statements are working, add a breakpoint and a print statement to your NSE. Note: every time you rebuild the app, you will unfortunately have to repeat the process of sending yourself a notification before attaching to the NSE process.

Amazing! Your NSE development experience will now be 10x faster than my own. I spent two days appending “print statements” to the body of the actual notification before I discovered the ability to attach to multiple processes.

alt image

Tip 2: Sharing data between your application and NSE

Although your NSE is bundled with your app, it is not part of your app, does not run on the same process id (see above), and does not have the same bundle identifier. Because of this, your application and NSE cannot talk to each other and cannot use the same file system. If you have any information you would like to share between the app and the NSE, you will need to add them both to an App Group. For the specifics of adding an app group check out Apple’s Sharing Data with Your Containing App.

This came up in Gilt’s NSE because we wanted to have the ability to get logs from the NSE and include them with the rest of the app. For background, the Gilt iOS team uses our own open sourced logging library, CleanroomLogger. The library writes log files in the app’s allocated file system. To collect the log files from the NSE in the application, we needed to save the log files from the NSE to the shared app group.

Another feature you get once you set up the App Group is the ability to share information using the app group’s NSUserDefaults. We aren’t using this feature right now, but might in the future.

Tip 3: Using frameworks in your NSE

If you haven’t already realized, rich notifications don’t send actual media, just links to media which your NSE will download. If you’re a bolder person than me, you might decide to forgo the use of an HTTP framework in your extension and re-implement any functions/classes you need. For the rest of us, it’s a good idea to include additional frameworks in your NSE. In the simplest case, adding a framework to an NSE is the same as including a framework in another framework or your container app. Unfortunately, not all frameworks can be used in an extension.

alt image

To use a framework in your NSE, the framework must have the “App Extensions” box checked.

alt image

Most popular open source frameworks are already set up to work with extensions, but it’s something you should look out for. The Gilt iOS app has one internal framework which we weren’t able to use in extensions, so I had to re-implement a few functions in the NSE. If you come across a framework that you think should work in an extension, but doesn’t, check out Apple’s Using an Embedded Framework to Share Code.
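Whichever networking approach you choose, the core download-and-attach flow in the NSE looks roughly like the sketch below. The payload key "media-url" is an assumption about your push payload, not the actual Gilt key, and plain URLSession stands in for whatever HTTP framework you embed:

```swift
import UserNotifications

class NotificationService: UNNotificationServiceExtension {

    var contentHandler: ((UNNotificationContent) -> Void)?
    var bestAttemptContent: UNMutableNotificationContent?

    override func didReceive(_ request: UNNotificationRequest,
                             withContentHandler contentHandler: @escaping (UNNotificationContent) -> Void) {
        self.contentHandler = contentHandler
        bestAttemptContent = request.content.mutableCopy() as? UNMutableNotificationContent

        // "media-url" is a hypothetical payload key -- use whatever key
        // your backend puts the image link under.
        guard let bestAttemptContent = bestAttemptContent,
              let urlString = request.content.userInfo["media-url"] as? String,
              let mediaURL = URL(string: urlString) else {
            contentHandler(request.content)
            return
        }

        // Attachments must point at local file URLs, so download first.
        // In practice you may need to rename the temporary file with an
        // image extension (or pass a type-hint option) so iOS can infer
        // the media type.
        URLSession.shared.downloadTask(with: mediaURL) { location, _, _ in
            if let location = location,
               let attachment = try? UNNotificationAttachment(identifier: "media",
                                                              url: location,
                                                              options: nil) {
                bestAttemptContent.attachments = [attachment]
            }
            contentHandler(bestAttemptContent)
        }.resume()
    }

    override func serviceExtensionTimeWillExpire() {
        // Called just before iOS kills the extension; deliver the best
        // attempt we have so the user still sees the notification.
        if let contentHandler = contentHandler, let bestAttemptContent = bestAttemptContent {
            contentHandler(bestAttemptContent)
        }
    }
}
```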

Tip 4: Display different media for thumbnail and expanded view

When the rich notification comes up on the device, users see a small thumbnail image beside the notification title and message.

alt image

And when the user expands the notification, iOS shows a larger image.

alt image

In the simple case (example above), you might just have a single image to use as both the thumbnail and the large image. In that case, setting a single attachment is fine. In the Gilt app, we came across a case where we wanted to show a specific square image as the thumbnail and a specific rectangular image when the notification is expanded. This is possible because UNMutableNotificationContent allows you to set a list of UNNotificationAttachments. Although this is not a documented feature, it works.

guard let bestAttemptContent = request.content.mutableCopy() as? UNMutableNotificationContent else { return }
let expandedAttachment = try UNNotificationAttachment(identifier: "expanded", url: expandedURL, options: [UNNotificationAttachmentOptionsThumbnailHiddenKey: true])
let thumbnailAttachment = try UNNotificationAttachment(identifier: "thumbnail", url: thumbnailURL, options: [UNNotificationAttachmentOptionsThumbnailHiddenKey: false])
bestAttemptContent.attachments = [expandedAttachment, thumbnailAttachment]

This code snippet sets two attachments on the notification. This may be confusing because, currently, iOS only allows an app to show one attachment at a time. If we can only show one attachment, then why set two attachments on the notification? Because we want to show different images in the collapsed and expanded notification views. The first attachment in the array, expandedAttachment, is hidden in the collapsed view (UNNotificationAttachmentOptionsThumbnailHiddenKey: true). The second attachment, thumbnailAttachment, is not. In the collapsed view, iOS will select the first attachment where UNNotificationAttachmentOptionsThumbnailHiddenKey is false. But when the notification is expanded, the first attachment in the array, in this case expandedAttachment, is displayed. If that is confusing, see the example images below. Notice that this is not one rectangular image cropped for the thumbnail.

alt image

alt image

Note: There is a way to specify a clipping rectangle using the UNNotificationAttachmentOptionsThumbnailClippingRectKey option, but our backend system doesn’t include cropping-rectangle information, and we do have multiple appropriate crops of product/sale images available.
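If your backend does supply crop information, the clipping option looks roughly like the sketch below. The rect is expressed in the image’s normalized coordinate space, and imageURL is a placeholder for a local file URL to the downloaded image:

```swift
import CoreGraphics
import UserNotifications

func makeClippedAttachment(imageURL: URL) throws -> UNNotificationAttachment {
    // Hypothetical crop: half of the image, in normalized (0...1) coordinates.
    let clippingRect = CGRect(x: 0, y: 0, width: 1, height: 0.5)
    return try UNNotificationAttachment(
        identifier: "clipped",
        url: imageURL,
        options: [UNNotificationAttachmentOptionsThumbnailClippingRectKey:
                    clippingRect.dictionaryRepresentation])
}
```

Since the option dictionary is typed as `[AnyHashable: Any]`, the rect has to be passed as its CFDictionary representation rather than as a CGRect directly.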


That’s it! I hope this post was helpful and that you will now fly through building a Notification Service Extension for your app. If there is anything you think I missed and should add to the blog, please let us know.

alt image
