The Gilt technology organization. We make gilt.com work.

Sundial PagerDuty Integration

Giovanni Gargiulo

Sundial

A few months ago, Gilt Tech announced Sundial, an open source batch job scheduler for Amazon ECS. Over the last few months, Sundial has seen significant adoption both inside and outside of Gilt.

Until Sundial v0.0.10, email was the only way to be notified of job failures.

At the beginning, when the number of jobs running on Sundial was small (and so was the number of failures!), it was fairly easy to spot emails about failed jobs and act accordingly.

Lately, though, Sundial schedules about a thousand job executions per day in the Personalization Team, and it’s easy to imagine the amount of noise that job notifications generate in our inbox.

Besides the noise, failures of critical jobs have gone unnoticed more than once. This was of course unacceptable.

Since PagerDuty is the de facto standard at Gilt for on-call procedures, and since it offers a nice and reliable events API, we’ve redesigned the notification mechanism and integrated PagerDuty with Sundial.

Configuring PagerDuty on Sundial

Configuring your job to support both email and PagerDuty notifications is straightforward: add the following JSON snippet to your job definition:

{
  "notifications": [
    {
      "email": {
        "name": "name",
        "email": "email",
        "notify_when": "on_state_change_and_failures"
      }
    },
    {
      "pagerduty": {
        "service_key": "my_pd_service_key",
        "num_consecutive_failures": 3,
        "api_url": "https://events.pagerduty.com"
      }
    }
  ]
}

Where

  • notify_when defines when email notifications will be sent. Possible values are:
    • always, Always notify when a process completes
    • on_failure, Notify when a process fails
    • on_state_change, Notify when a process goes from succeeding to failing and vice versa
    • on_state_change_and_failures, Notify when a process goes from failing to succeeding and on each failure
    • never, Never notify
  • service_key (my_pd_service_key in the example) is the key obtained from the Service page in PagerDuty
  • num_consecutive_failures is the number of consecutive failures after which Sundial will trigger an alert in PagerDuty
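The notification rules described above can be sketched in a few lines of Python. This is only an illustration of the documented semantics, not Sundial's actual implementation: the function names should_email and should_page are hypothetical, the first run of a process is assumed not to count as a state change, and the PagerDuty check is assumed to stay true for every failure at or beyond the threshold.

```python
from typing import Optional, Sequence

# The notify_when values listed in the job definition docs above.
NOTIFY_WHEN_VALUES = {
    "always",
    "on_failure",
    "on_state_change",
    "on_state_change_and_failures",
    "never",
}


def should_email(notify_when: str,
                 previous_succeeded: Optional[bool],
                 succeeded: bool) -> bool:
    """Decide whether a completed run triggers an email.

    previous_succeeded is None when there is no earlier run
    (assumption: that case is not treated as a state change).
    """
    if notify_when not in NOTIFY_WHEN_VALUES:
        raise ValueError(f"unknown notify_when: {notify_when!r}")
    state_changed = (previous_succeeded is not None
                     and previous_succeeded != succeeded)
    if notify_when == "always":
        return True
    if notify_when == "on_failure":
        return not succeeded
    if notify_when == "on_state_change":
        return state_changed
    if notify_when == "on_state_change_and_failures":
        # Every failure notifies, plus the recovery from failing to succeeding.
        return state_changed or not succeeded
    return False  # "never"


def should_page(results: Sequence[bool], num_consecutive_failures: int) -> bool:
    """True when the most recent num_consecutive_failures runs all failed.

    results is ordered oldest to newest; True means the run succeeded.
    """
    if num_consecutive_failures < 1:
        raise ValueError("num_consecutive_failures must be >= 1")
    recent = results[-num_consecutive_failures:]
    return len(recent) == num_consecutive_failures and not any(recent)
```

With num_consecutive_failures set to 3 as in the example, a history of one success followed by three failures would page, while two failures in a row would not.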

Please note that the subscriptions object in the Process Definition JSON has been deprecated, so if you’ve already adopted Sundial and want to start using the new notifications, you will have to update your JSON accordingly.
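When updating existing definitions, a small pre-flight check can help spot leftovers. The sketch below is a hypothetical helper, not part of Sundial: it assumes subscriptions and notifications sit at the top level of the definition, and it treats service_key and num_consecutive_failures as required PagerDuty fields purely for illustration.

```python
from typing import List


def check_notifications(process_definition: dict) -> List[str]:
    """Return a list of human-readable problems; empty means nothing was found."""
    problems = []
    # Assumption: the deprecated object is a top-level key.
    if "subscriptions" in process_definition:
        problems.append(
            "'subscriptions' is deprecated; move entries to 'notifications'")
    for index, entry in enumerate(process_definition.get("notifications", [])):
        # Each entry in the example wraps exactly one notification type.
        if not isinstance(entry, dict) or len(entry) != 1:
            problems.append(f"notifications[{index}] must have exactly one key")
            continue
        kind, settings = next(iter(entry.items()))
        if kind not in ("email", "pagerduty"):
            problems.append(f"notifications[{index}] has unknown type {kind!r}")
        elif kind == "pagerduty":
            if not settings.get("service_key"):
                problems.append(f"notifications[{index}] is missing service_key")
            failures = settings.get("num_consecutive_failures")
            if not isinstance(failures, int) or failures < 1:
                problems.append(
                    f"notifications[{index}] needs num_consecutive_failures >= 1")
    return problems
```

Running it over the example definition from this post returns an empty list, while a definition that still carries a subscriptions key yields a single deprecation message.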

More details can be found on the Sundial v0.0.10 release page.
