Modern Monitoring and Observability with Thoughtful Chaos Experiments by Datadog and Gremlin (webinar)

Published Wednesday, June 12, 2019

Notes from the “Chaos engineering” webinar with Datadog and Gremlin.

Presenters:

  • Ana Medina
    • Chaos engineer @ Gremlin (“Chaos engineering as a service”)
    • @ana_m_medina
  • Jason Yee
    • Senior tech evangelist @ Datadog
    • @gitbisect

Three Types of Data

1. Work Metrics (Business)

  • Throughput
  • Success
  • Performance
    • Latency
    • Perceived performance

2. Resource Metrics (Services)

  • Utilization
  • Saturation
  • Availability

3. Events

Things that influence how our systems behave:

  • Code changes
  • Scaling events
  • CHAOS
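
The work and resource metrics above can be made concrete with a small calculation. This is a minimal sketch, not anything shown in the webinar: the `Request` record, the function names, and the exact definitions (median as the latency summary, mean queue depth as saturation, passing health probes as availability) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One handled request (hypothetical record shape, for illustration only)."""
    duration_ms: float
    ok: bool


def work_metrics(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize throughput, success, and latency over one time window."""
    total = len(requests)
    succeeded = sum(1 for r in requests if r.ok)
    durations = [r.duration_ms for r in requests]
    return {
        "throughput_rps": total / window_seconds,
        "success_rate": succeeded / total if total else 1.0,
        "median_latency_ms": median(durations) if durations else 0.0,
    }


def resource_metrics(
    busy_seconds: float,
    window_seconds: float,
    queue_depths: list[int],
    probe_results: list[bool],
) -> dict[str, float]:
    """Summarize one resource; each definition here is an assumed, simplified one."""
    return {
        # Utilization: fraction of the window the resource was busy.
        "utilization": busy_seconds / window_seconds,
        # Saturation: average amount of work waiting in the queue.
        "saturation": sum(queue_depths) / len(queue_depths),
        # Availability: fraction of health probes that succeeded.
        "availability": sum(probe_results) / len(probe_results),
    }
```

Work metrics describe what users experience, while resource metrics describe the components underneath, so it helps to compute both over the same window and compare them side by side.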

Monitoring Tools

1. Logs

  • Information about an event
  • A snapshot of a single event, not an aggregate

2. Metrics

  • Context around an event
  • Helps us see trends

3. Traces

  • Causes leading to an event

Example

https://github.com/jyee/guestbook

Chaos Engineering

Evaluate your monitoring by running chaos experiments.

  • Start in a dev or staging environment
  • “Contain the blast radius” - keep the experiment safely scoped
  • Start with a small experiment
    • Form a hypothesis in advance
    • Test
    • Re-evaluate
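
The hypothesis, test, and re-evaluate loop above can be sketched as a small simulation. This is an illustrative sketch only: it does not call Gremlin or Datadog, and the `Hypothesis` class, `simulated_latencies`, `run_experiment`, and the numbers in the usage lines are all assumptions. The blast radius is modeled as the percentage of requests that receive the injected delay.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """What we expect to stay true during the experiment (hypothetical shape)."""
    description: str
    max_latency_ms: float


def simulated_latencies(
    n_requests: int, base_ms: float, added_ms: float, blast_radius_percent: int
) -> list[float]:
    """Return per-request latencies with extra delay injected into a subset."""
    affected = n_requests * blast_radius_percent // 100
    return [
        base_ms + added_ms if i < affected else base_ms
        for i in range(n_requests)
    ]


def run_experiment(hypothesis: Hypothesis, latencies: list[float]) -> dict:
    """Compare the observed worst-case latency against the hypothesis."""
    worst = max(latencies)
    return {
        "hypothesis": hypothesis.description,
        "worst_latency_ms": worst,
        "held": worst <= hypothesis.max_latency_ms,
    }


# Usage: form the hypothesis first, then test with a small blast radius.
hypothesis = Hypothesis("Latency stays under 300 ms with 150 ms of added delay", 300.0)
small = run_experiment(hypothesis, simulated_latencies(100, 100.0, 150.0, 10))
# Re-evaluate: if the hypothesis held, try a harsher fault; if not, fix and rerun.
harsher = run_experiment(hypothesis, simulated_latencies(100, 100.0, 250.0, 10))
```

In a real experiment the latencies would come from monitoring rather than a simulation, which is what makes the experiment a test of the monitoring as well as of the system.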

Datadog + Gremlin Integration

dtdg.co/gremlin-datadog