Ops Roadmap for Learn.co

Published on Thursday, March 22, 2018

Notes from the Flatiron School engineering team’s code reading on future ops work.

Outline

  1. Review current setup
  2. Share upcoming challenges/priorities
  3. Share roadmap + new concepts/tools

Priorities

  1. Ensure new collaborative ventures are successful
  2. Support our team as we grow (make ops more automated + manageable)

Requirements

  • Move to AWS for hosting
  • Need a high degree of infrastructure + environment automation and orchestration (Terraform)
  • Scaling
  • Security
  • Lower maintenance costs as our team grows

Current Setup

  • Hosted on Digital Ocean
  • Self-hosted services:
    • Postgres
    • Redis
    • Elasticsearch
    • Memcached
    • Pushstream
  • Our virtual servers are on a private network in a DO region

Pain points

  • Communication between services is not automated (no robust tooling available)
  • Our servers are “pets not cattle”
  • High maintenance costs
    • Lots of outages
    • Infrastructure is not self-healing (no robust tooling available)
  • Low security
  • Noisy alerts (Nagios)
  • Relying on manual (aka more brittle) deployment and provisioning processes (Chef)
  • Our virtual servers are on shared machines, so vulnerable to leaks / attacks

Roadmap

Security

Guiding principle: Principle of Least Privilege (limit surface area / attack vectors)

More on AWS Virtual Private Cloud

  • Public and Private subnets
  • Services that don’t need to be exposed to the internet (Redis, etc.) will live in a private subnet
  • NAT Gateway and routing rules to manage outbound traffic from the private subnet
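
As a rough illustration of the public/private split, here is a minimal Python sketch. The CIDR range, the subnet sizes, and the service inventory are all assumptions made up for the example; the notes above do not specify them.

```python
import ipaddress

# Hypothetical address range for illustration; the notes give no CIDR blocks.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Carve the VPC range into /24 blocks and take the first two.
public_subnet, private_subnet = list(VPC_CIDR.subnets(new_prefix=24))[:2]

# Assumed service inventory: True means the service must be reachable
# from the internet, False means it only talks to other internal services.
SERVICES = {
    "web": True,
    "postgres": False,
    "redis": False,
    "elasticsearch": False,
    "memcached": False,
}


def place(services):
    """Put only internet-facing services in the public subnet."""
    return {
        name: public_subnet if internet_facing else private_subnet
        for name, internet_facing in services.items()
    }


placement = place(SERVICES)
for name, subnet in placement.items():
    print(f"{name:15} -> {subnet}")
```

Everything that is not explicitly marked as internet-facing ends up in the private range, which is the Principle of Least Privilege applied to network placement.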

Scaling

All about automation

  • Managed services instead of self-hosting
  • Migrate DNS from Dyn to AWS Route 53
  • Terraform for “Infrastructure as Code” orchestration and automation
  • Packer for automated AMI builds (images for Amazon instances)

  • Additional things we’re thinking about:
    • Deployments
    • Alerting
    • Monitoring
    • Logs
    • Containerization / Kubernetes (way down the road)

More about Terraform

Infrastructure as code: Terraform changes your environment to match what your config files declare (declarative code)

  • Source controlled code
  • Reduces documentation (self-documenting system)
  • Support for multiple cloud providers
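
To make the declarative idea concrete, here is a toy Python sketch of the "compare desired state with actual state, then act on the difference" loop. It is not how Terraform is implemented, and the resource names and attributes are hypothetical; it only illustrates the model.

```python
def plan(desired, actual):
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, config in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != config:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("destroy", name))
    return actions


# Hypothetical desired state, as it might be declared in a config file.
desired = {
    "redis": {"size": "small"},
    "worker": {"size": "medium", "count": 2},
}

# Hypothetical state of what is actually running today.
actual = {
    "redis": {"size": "small"},
    "worker": {"size": "medium", "count": 1},
    "legacy-cron": {"size": "small"},
}

for action, name in plan(desired, actual):
    print(action, name)
```

The config only says what should exist. The tool works out the steps, so running it again against an environment that already matches produces no actions.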

Next steps

  • Port Redis (tested)
  • Port workers and SQS/RabbitMQ (spike in progress)
