Dev Ops Crash Course - Day Two
Year In Review Review
Q: What is DevOps?
A: When first created, more of a concept than a well-defined position. Even today, the term is still fairly broad, but in general refers to:
...a set of practices emphasizing the collaboration & communication of both software developers and operations professionals, while automating the process of software delivery and infrastructure changes.
- Read-Only replica: all transactions are mirrored to 2nd server
- Hot standby in case of failure
- Easy failover strategy
- Lowers risk of downtime
Backups are stored in /var/lib/postgres, along with the backup config.
How restore works:
- WAL-E (continuous archiving) takes base image of our database once a day and stores in S3.
- Postgres creates binary transaction logs (write-ahead logs) that have to complete before transaction succeeds. That way, can re-run if database crashes before completing transaction. Over the course of the day, we send up these write-ahead log files to S3.
- For restore, first imports base image to db, then applies deltas (write-ahead logs).
- WAL-E has an interface for deleting backups. We go in every 6 months or so and remove old files.
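The restore flow above can be sketched as a toy model. This is not WAL-E's API: the `Archive` class and the function names are hypothetical, a dict stands in for the database, and a list stands in for the write-ahead log files in S3. It only shows the idea of "base image first, then replay the deltas recorded after it".

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Hypothetical stand-in for the S3 bucket."""
    base_image: dict = field(default_factory=dict)
    base_position: int = 0          # how much of the log the base image already covers
    wal: list = field(default_factory=list)


def push_wal(archive, key, value):
    # Over the course of the day, every change is shipped as a delta.
    archive.wal.append((key, value))


def push_base_backup(archive, live_db):
    # Once a day: snapshot the whole database and remember the log position.
    archive.base_image = dict(live_db)
    archive.base_position = len(archive.wal)


def restore(archive):
    # Step 1: import the base image. Step 2: apply the deltas recorded after it, in order.
    db = dict(archive.base_image)
    for key, value in archive.wal[archive.base_position:]:
        db[key] = value
    return db
```

Usage: take a base backup, make a couple of changes while pushing each one to the log, and `restore(archive)` should return the same state as the live database.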
/var/lib/postgres/9.3/main/recovery.conf lives on the replica. When the designated trigger file is created on the replica, it stops replaying write-ahead logs from the primary, finishes recovery, and becomes the primary.
Kick this off by running script (see operations wiki article).
- Set up another stand-by replica
- Use something like Octopus gem to handle replication, configured through yaml file
Load balancer in front of all of our hosts. Default in front of Learn. Separate ones for Rabbit, Elasticsearch, etc.
We get automatic failover from keepalived
When healthcheck fails, keepalived runs master script that reassigns the floating IP.
See operations wiki for more info on HAProxy Automatic Failover.
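A rough sketch of the failover idea described above, as a toy model only. Real keepalived speaks VRRP and runs notify scripts; nothing here is its API. The `FloatingIP` class, the `monitor` function, the host names, and the "2 failed checks in a row" threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FloatingIP:
    owner: str  # which load balancer currently holds the IP


def monitor(ip, standby, checks, fall=2):
    """Feed in healthcheck results for the current owner.

    After `fall` consecutive failures, reassign the floating IP to the standby.
    Returns the owner once the checks are exhausted or failover has happened.
    """
    failures = 0
    for healthy in checks:
        failures = 0 if healthy else failures + 1
        if failures >= fall:
            ip.owner = standby  # this is the job of the "master script"
            break
    return ip.owner
```

Requiring consecutive failures means a single dropped check does not move the IP; only a sustained outage does.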
- ssh into load balancer (see DO floating ips for what’s active)
- config is in
- config set by chef recipe
listen admin: creates an admin portal (not accessible right now because it is bound locally)
frontend: where the initial connection happens (e.g. learn.co) and where we terminate SSL
backend: our actual web servers
The haproxy backend makes an httpchk GET request to * that hits a healthcheck route defined in the Rails app's routes.rb. Which brings us to…
When Elasticsearch is down, healthchecks will fail, and that server will be pulled out of pool. Temporary fix was to use load balancer to fail over to another Elasticsearch server, but issue remains if all Elasticsearch servers + load balancer are taken down (for example, during DO maintenance on 3/24).
Proposed solution: pull Elasticsearch out of healthchecks.
Healthchecks initializer, model, controller.
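One way to picture the proposed solution: split dependencies into critical and non-critical, and only let critical ones fail the healthcheck. This is a sketch in Python rather than our Rails code, and the `healthcheck` function, the check names, and the 200/503 status choice are assumptions, not what the app actually does.

```python
def healthcheck(checks, critical):
    """Run every dependency check; only critical ones affect the HTTP status.

    checks:   dict of name -> zero-argument callable returning True/False
    critical: set of names that must pass for the server to stay in the pool
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that blows up counts as a failure, not a crash.
            results[name] = False
    healthy = all(results.get(name, False) for name in critical)
    return (200 if healthy else 503), results
```

With Elasticsearch left out of `critical`, an Elasticsearch outage still shows up in the results but the server stays in rotation.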
Get Your Bearings
Lookup current hostnames for db/environments: Digital Ocean > Networking > Floating IPs (or lookit Chef Nodes)
VIP: “virtual IP”
- entry point into your balancing strategy
- in our case, the floating IP
/var holds variable data; logs usually live under /var/log.
/etc is usually config directory.
Users and Permissions
Get a list of all users:
Get list of all users in groups:
Every user has an id. Groups have ids, too.
Every service that runs on server should have its own user. Reduces vulnerabilities.
We store dev team keys under single user to maintain separate concerns.
We also have a separate group for users with root access (all other users on box DO NOT have root access).
| Special     | User         | Group          | World          |
|:------------|:-------------|:---------------|:---------------|
| (d)ir       | (r)ead == 4  | (w)rite == 2   | e(x)ecute == 1 |
|             | rwx          | rwx            | rwx            |
|             | 101          | 001            | 111            |
|             | 5            | 1              | 7              |
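The arithmetic in the table (read = 4, write = 2, execute = 1, summed per user/group/world) can be sketched like this. The function name `rwx_to_octal` is made up for illustration.

```python
def rwx_to_octal(perms):
    """Convert a 9-character permission string (user, group, world) to octal digits."""
    if len(perms) != 9:
        raise ValueError("expected 9 characters, e.g. 'rwxr-xr-x'")
    digits = []
    for start in range(0, 9, 3):
        digit = 0
        # Each triplet is read/write/execute, worth 4/2/1 when set.
        for flag, weight, char in zip("rwx", (4, 2, 1), perms[start:start + 3]):
            if char == flag:
                digit += weight
            elif char != "-":
                raise ValueError(f"unexpected character {char!r}")
        digits.append(str(digit))
    return "".join(digits)
```

The table's example is binary 101 001 111, i.e. `r-x--xrwx`, so `rwx_to_octal("r-x--xrwx")` gives `"517"`.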
Live Debugging: Server Goes Down
- Gather info
- Check memory on Librato
- Check logs on server (
- Lookit Passenger processes:
sudo passenger-status --show=requests
- In this case (3/21), requests suggest that problem might be with Elasticsearch (showing a lot of search uris)
    root@ironboard08:/var/log/apache2# rvmsudo passenger-status --show=requests | grep uri
    uri = <search endpoint>
    uri = <search endpoint>
    uri = <search endpoint>
    uri = <search endpoint>
    uri = <search endpoint>
    uri = <search endpoint>
    uri = <search endpoint>
    uri = <search endpoint>
    uri = <search endpoint>
    uri = <search endpoint>
    uri = /api/v1/users/me?ile_login=true
    uri = *
    uri = *
- Hypothesis: do we have a timeout on Elasticsearch? Are we DDOSing ourselves?
- Test hypothesis: hit endpoint from browser
- Confirmed: DDOSing endpoint brought down Learn.co
- Restart all servers to bring back up
- Healthcheck: when Elasticsearch is down, healthcheck fails and load balancer takes all servers out of rotation, 500ing the site
- Searchkick: when Elasticsearch is down, Searchkick indexing fails and 500s the site
- Passenger: long search requests don’t time out and overload the request queue
- Decouple Elasticsearch from Learn (bringing down Elasticsearch should not bring down site)
- Add timeouts to Elasticsearch
- Throttle Elasticsearch requests from client- and server-side (so we’re not sending 1 character queries)
- Upgrade Elasticsearch version (?)
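A minimal sketch of the timeout and throttling ideas from the list above, in Python rather than our Rails/Searchkick code. `guarded_search`, `MIN_QUERY_LENGTH`, and the 2-second default are all assumptions for illustration; `search_fn` stands in for whatever actually calls Elasticsearch.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

MIN_QUERY_LENGTH = 3  # assumed threshold; skips 1-character queries


def guarded_search(search_fn, query, timeout_seconds=2.0):
    """Run a search with a minimum query length and a hard wait limit.

    Returns an empty result list instead of raising, so a slow or broken
    search backend degrades the feature rather than the whole request.
    """
    query = query.strip()
    if len(query) < MIN_QUERY_LENGTH:
        return []  # throttle: not worth a round trip
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(search_fn, query)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return []  # search took too long; stop waiting
    except Exception:
        return []  # search backend is down or errored
    finally:
        executor.shutdown(wait=False)
```

Note that this only stops the caller from waiting; the worker thread keeps running until the search returns. A timeout set on the Elasticsearch client itself would be needed to actually free the connection.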