Docker, Mesos, Marathon, and the End of Pets

A Few Words on Pets and Cattle

The pets vs. cattle metaphor is not a new one, nor, with apologies to my vegan friends, mine, but it bears briefly repeating. Essentially, ssh-ing into a machine, configuring it just so, and naming it after something cute or an erudite allusion – that’s a pet. You’re sad if she dies, gets sick, or has her blinky red light go black. Cattle, on the other hand, sit in numbered racks and are referenced by number, perhaps concatenated with a machine class, like cow_with_big_ram_and_ssd_00003. If a member of the herd dies, a replacement is provisioned and launched automagically. This analogy applies both in the cloud and in corporate datacenters.

By now most people agree that cattle, the metaphor that is, are better for all sorts of reasons. The ugly truth, however, is that it’s much less embraced than many people realize and than most of us are willing to admit. So many shops, including this one, have datacenters or cloud deployments that are full of pets. True, they may be virtualized, but they’re pets nonetheless.

Enter the “Too Important” Services

I just claimed that almost everyone still runs lots of pets. This is true because sysops considers some services “Too Important” or “Too Core”. In some of the more enlightened shops, these are limited to things like bind, dhcp, haproxy, zookeeper, etcd, and in a twist of irony, a chef/puppet server, all the complicated stuff that controls frameworks like openstack, hadoop namenodes, and so on.

At Factual, I’ll admit, we still have a fair bit of this, and it’s the reason I put some of my spare time into researching our options, trying some of these ideas out among our sysops and engineering teams, and eventually writing this post. The temptation, and ultimately the trap, is that services that are “Too Core” seem too important to trust to arguably immature metal-abstraction software and, at the same time, too important to risk cohabiting with other services that might run away and eat up all of the RAM, CPU, or IO on a box they share. The tragic result of this is lots of little pets, each of which is a point of failure and/or an extra bit of complexity – at best a numbered machine that’s hard to remember and at worst, an obscure lesser deity.

The first thing we’ll do is knock down some of this complexity. In order to do this, we need to set some ground rules for pet-free core services. In no particular order, these are:

  • we put only the things we absolutely need to bootstrap into "core services"
  • automated, code-driven (as opposed to ssh-and-type-stuff) provisioning
  • ability to run multiple services on a small number of beefier (pun intended) machines
  • hostname and physical host independence
  • isolation with constraints for RAM and CPU per service
  • redundancy for all core services

The LXC Container, Resource Manager, and the Container Manager Wars

There’s a contest afoot for the hearts and minds of our servers and our devops. At this point, I think it’s safe to say that Docker is winning the container war. Most people who are into such things, and most who aren’t, have embraced Docker, warts and all, as the new hotness, because it mostly works, enough, and is fairly ubiquitous. Betting on Dockerizing our services feels safe, and the CoreOS kerfuffle seems unlikely to force a change in the short term.

The Resource Manager wars are more complicated. The Hadoop ecosystem has all but gone to YARN. It’s hard to use for non-hadoopy things, but for the most part that suits us fine, because our hadoopy machines follow a certain model – 2U machines with 12 large spinning drives. They’re really good for Hadoop-ecosystem things, not so general-purpose by design, and are nearly 100% utilized. Your situation may differ and push you into a Hadoop-on-Mesos or a Kubernetes-on-YARN implementation to better utilize your metal.

There are many other services we need outside of the Hadoop world and, for managing the metal that runs those, I like Mesos. This is the part where I duck and wait for all of the vitriolic and “but think of the children” speeches to die down. Why Mesos? It has a straightforward high-availability setup based on a venerable and much-used zookeeper, some heavyweight adoption, ok documentation, and I like the amplab … and it might not matter that much – I’ll explain presently.

The Container Manager wars also seem quite undecided. Kubernetes has a Clintonesque inevitability to it, and we can reasonably expect to adopt it eventually given the momentum, particularly since the Mesosphere folks are offering up a world where it can be just another Mesos framework. For now, I’ll demonstrate and write up an implementation based on Marathon because it’s easier to provision and document. I duck again. Anyway, here’s the thing: if you compare the configs, they look almost the same – some JSON with the Docker image, how many instances, how much CPU and RAM, some stuff with ports and cohort services, and slightly different jargon. As I said about the Resource Manager, I really don’t think it matters that much. Ducking some more… Ok, yes, I’m oversimplifying, but if I stare at a Kubernetes service config, a Marathon one, and the others, I feel like I can switch to the better one if I ever figure out which that is. If that’s the case, instead of waiting for the one that finally unites the seven kingdoms, I’m just picking one, and I honestly don’t feel like my inevitably wrong choice is irreversible forever. In fact, the beauty of these abstractions is that they can coexist as you shift resources and attention to the better one.

That’s my thesis – Docker is where most of the real configuration effort and commitment lies, and there isn’t a strong contender. Even if one emerges, it will almost certainly need a story for easy migration away from Docker. Meanwhile, the Container Manager wars may go on for another lifetime, and that’s ok. I’m not waiting for them to play out to start enjoying the benefits of this approach.

The Docker Pets

Before I get to the meat (pun opportunity: unsuccessfully resisted) of the post, I want to point out one thing about Docker. It’s only ever ok to use Dockerfiles to build Docker images. If you attach or ssh in to painstakingly handcraft an image and ever, ever type ‘docker push’ afterward, that’s a pet, a Docker pet. In some countries they still eat dog. If that’s culturally acceptable, so be it. Manually making images? Whoa. Not. Okay. If you can’t go back and undo some config change you made 20 commands ago or swap out the base OS, and this is important, by changing code, then you’re making the Docker equivalent of a pet.

End of My Stream Of Consciousness Opinion-fest

So, that’s the motivation for setting things up the way I have. What follows is a step-by-step guide to get the core services all running inside Docker containers with essentially nothing tied to machine-specific configuration. Once we have a minimal set of core services, we can spin up everything else in a way that makes for easy deployment and scaling while giving us the ability to achieve extremely efficient machine utilization.

Finally Building The Cluster

The cluster we’re building consists of two sets of machines: the “core” machines I alluded to earlier and then a bunch of Mesos slaves – our cattle. For simplicity, I’m intentionally not tackling a few issues in my sample configs, namely encryption, authentication, and rack awareness, but if you’re doing production things on your own infrastructure, don’t skimp on them.

Also, some of the machine setup commands will just be expressed as straight bash commands, but in practice, this should be done via PXE boot and some sort of ansible/chef/puppet thing.

On All Machines

After installing vanilla Ubuntu or your preferred alternative…

enable swap and memory accounting

sudo sed -i 's/^GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"/' /etc/default/grub
sudo update-grub
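
The new flags only land on the kernel command line at the next boot, so reboot (now or when convenient) before counting on memory limits actually being enforced:

sudo reboot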

add docker repo and latest package

echo deb https://get.docker.com/ubuntu docker main > /etc/apt/sources.list.d/docker.list
apt-get update
apt-get install lxc-docker
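
It doesn’t hurt to sanity-check that the daemon actually came up before moving on (busybox here is just a convenient throwaway image):

docker version
docker run --rm busybox echo "docker is alive"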

The Core Machines

In setting up the machines that run our core services, we select 3 somewhat powerful machines to run everything we need to bootstrap our system. A typical set of services we’ll put on these systems includes bind, dhcp, zookeeper, mesos, and marathon. Each of these services, other than docker itself, will run in a container, and each container is allocated a reasonable amount of RAM and CPU to operate its service. The key here is that the instances of each service share a container image and have essentially identical configs, where things like hostname, ip, and peers are simply passed in via dynamic environment variables. This gives us redundancy and, just as importantly, isolation and portability, even for our “core” services.

In our setup, bind and dhcp are stateless masters rather than having a master/slave failover strategy. Instead, each instance pulls its config from an nginx server that syncs from a git repo.
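
As a rough sketch of what that looks like for bind (this isn’t our actual script, and config.mydomain.com is a made-up name for the nginx config server), the container entrypoint just pulls the rendered config before starting the daemon in the foreground:

#!/bin/bash
#hypothetical bind entrypoint: fetch the latest rendered config from the
#nginx config server, then run named in the foreground so docker can supervise it
set -e
curl -fsS -o /etc/bind/named.conf http://config.mydomain.com/bind/named.conf
exec /usr/sbin/named -g -u bind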

zookeeper

set up data volumes

on each node, do this once

mkdir -p /disk/ssd/data/zookeeper
mkdir -p /disk/ssd/log/zookeeper

docker run -d -v /disk/ssd/data/zookeeper/data:/data --name zookeeper-data boritzio/docker-base true
docker run -d -v /disk/ssd/log/zookeeper:/data-log --name zookeeper-data-log boritzio/docker-base true

start zookeeper on each node

#this just assigns the zookeeper node id based on our numbering scheme, which is machine1x0 -> x+1
MACHINE_NUMBER=`hostname | perl -nle '$_ =~ /(\d+)$/; print (($1+10-100)/10)'`
docker run -e ZK_SERVER_ID=$MACHINE_NUMBER --restart=on-failure:10 --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 -e HOSTNAME=`hostname` -e HOSTS=ops100,ops110,ops120 -m 2g --volumes-from zookeeper-data --volumes-from zookeeper-data-log boritzio/docker-zookeeper
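
Once all three are up, zookeeper’s four-letter-word commands are a quick way to confirm the ensemble formed and elected a leader (assuming nc is installed on the host):

echo ruok | nc ops100 2181                #a healthy node answers "imok"
echo stat | nc ops100 2181 | grep Mode    #one node should report leader, the others follower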

mesos master

set up data volumes

mkdir -p /disk/ssd/data/mesos/workdir
docker run -d -v /disk/ssd/data/mesos/workdir:/workdir --name mesos-workdir boritzio/docker-base true

start mesos master

docker run --restart=on-failure:10 --name mesos-master -p 5050:5050 -m 1g -e MESOS_ZK=zk://ops100:2181,ops110:2181,ops120:2181/mesos -e MESOS_CLUSTER=factual-mesosphere -e MESOS_WORK_DIR=/workdir -e MESOS_LOG_DIR=/var/log/mesos/ -e MESOS_QUORUM=2 -e HOSTNAME=`hostname` -e IP=`hostname -i` --volumes-from mesos-workdir boritzio/docker-mesos-master
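
With all three masters started, the same state endpoint the HAProxy health check will use later is a quick way to confirm that a leader was elected; only the elected master reports an elected_time:

curl -s http://ops100:5050/master/state.json | grep -o elected_time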

marathon

start marathon

#use host network for now...
docker run --restart=on-failure:10 --net host --name marathon -m 1g -e MARATHON_MASTER=zk://ops100:2181,ops110:2181,ops120:2181/mesos -e MARATHON_ZK=zk://ops100:2181,ops110:2181,ops120:2181/marathon -e HOSTNAME=`hostname` boritzio/docker-marathon
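
Since marathon is on the host network, it listens on its default port 8080, so hitting the API (the same /v2/apps endpoint we’ll POST app definitions to later) makes for an easy smoke test:

curl -s http://`hostname`:8080/v2/apps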

The Mesos Slaves

This is our cluster of cattle machines. In practice, it’s important to divide these into appropriate racks and roles.

add mesosphere repo and latest package

echo "deb http://repos.mesosphere.io/ubuntu/ trusty main" > /etc/apt/sources.list.d/mesosphere.list
apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
apt-get update && apt-get -y install mesos

set up slave

prevent zookeeper and mesos master from starting on slaves

sudo stop zookeeper
echo manual | sudo tee /etc/init/zookeeper.override

sudo stop mesos-master
echo manual | sudo tee /etc/init/mesos-master.override

now copy zookeeper config and set local ip on slaves

echo "zk://ops100:2181,ops110:2181,ops120:2181/mesos" > /etc/mesos/zk

HOST_IP=`hostname -i`
echo $HOST_IP > /etc/mesos-slave/ip

#add docker to the list of containerizers
echo 'docker,mesos' > /etc/mesos-slave/containerizers

#this gives it time to pull a large container
echo '5mins' > /etc/mesos-slave/executor_registration_timeout

#this gives the mesos slave access to the main storage device (/disk/ssd/)
mkdir -p /disk/ssd/mesos-slave-workdir
echo '/disk/ssd/mesos-slave-workdir' > /etc/mesos-slave/work_dir

#ok, now we start
start mesos-slave
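
Once a slave starts, it registers with the elected master and shows up in the master’s slave list. Something like this, pointed at whichever master currently holds the election (or at the mesos.mydomain.com alias we set up below), confirms the count:

curl -s http://ops100:5050/master/state.json | python -c 'import json,sys; print(len(json.load(sys.stdin)["slaves"]))'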

The HAProxy Proxy

To make this system truly transparent with respect to the underlying machines, it’s necessary to run a load balancer. Marathon ships with a simple bash script that pulls the set of services from the marathon api and updates an HAProxy config. This is great, but because of the way marathon works, each accessible service is assigned an outward facing “service port”, which is what the HAProxy config exposes. The problem is that everybody wants to expose their web app on port 80, so another proxy is needed to sit in front of the marathon-configured HAProxy.
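
A minimal sketch of that front proxy (myapp.mydomain.com and the 10001 service port are made up for illustration, and the servers are whichever nodes run the marathon-configured HAProxy): match on the Host header and forward to the service port Marathon assigned to the app.

frontend public
  bind *:80
  mode http
  acl is_myapp hdr(host) -i myapp.mydomain.com
  use_backend myapp if is_myapp

backend myapp
  mode http
  balance roundrobin
  #the marathon-configured haproxy instances, on the service port assigned to myapp
  server ops100 10.20.6.123:10001 check
  server ops110 10.20.40.203:10001 check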

setting up aliases for services and masters

Mesos Master, Marathon, and Chronos all have a web interface and a RESTful API. However, these services implement slightly different flavors of a multi-master setup. In some cases, like Marathon and Chronos, talking to any node will actually result in the request being proxied to the current leader. In the case of the Mesos Master, the web app on a non-leader redirects to the leader.

This is a bad user experience – expecting your users to know which server happens to be the elected master. Instead, we’ll want to provide an alias to a load balanced proxy so that we can have mesos.mydomain.com, marathon.mydomain.com, and chronos.mydomain.com.

haproxy for mesos master

First we point dns for mesos.mydomain.com to our haproxy. We then add all of our mesos-master nodes to the backend. Now comes the trick. We only want haproxy to proxy requests to the elected master. To get this to work, we want the non-master nodes to fail their health check. This approach is resilient to a new election, always pointing to the currently elected master.

The trick here is that when looking at /master/state.json, the elected master has a slightly different response. In this case, we match the string ‘elected_time’, which only the current master returns.

backend mesos
  mode http
  balance roundrobin
  option httpclose
  option forwardfor
  option httpchk /master/state.json
  http-check expect string elected_time
  timeout check 10s
  server ops100 10.20.6.123:5050 check inter 10s fall 1 rise 3
  server ops110 10.20.40.203:5050 check inter 10s fall 1 rise 3
  server ops120 10.20.51.2:5050 check inter 10s fall 1 rise 3

This results in two backends that fail their health checks and one healthy one: the currently elected master.
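
Marathon and Chronos are simpler: any node proxies to the current leader, so the backend for marathon.mydomain.com can round-robin across all three with an ordinary health check. A sketch, assuming Marathon’s default port 8080 and reusing /v2/apps as the check URI:

backend marathon
  mode http
  balance roundrobin
  option httpclose
  option forwardfor
  option httpchk GET /v2/apps
  timeout check 10s
  server ops100 10.20.6.123:8080 check inter 10s fall 2 rise 3
  server ops110 10.20.40.203:8080 check inter 10s fall 2 rise 3
  server ops120 10.20.51.2:8080 check inter 10s fall 2 rise 3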

Our First Services

The approach here is to get just enough infrastructure in place to run the rest of our services on top of our new abstraction. Now that we’ve bootstrapped, the rest of our services can be managed by these frameworks. To illustrate this point, we’re actually going to run Chronos on top of Marathon inside a Docker container. Not only is this the right way to illustrate the point, in my opinion, it’s actually the right way to run it.

Launch Script

We can install the marathon gem or just make a simple launch script. For example:

#!/bin/bash

if [ "$#" -ne 1 ]; then
  echo "script takes json file as an argument"
  exit 1
fi

curl -X POST -H "Content-Type: application/json" marathon.mycompany.com/v2/apps -d @"$1"

And after we launch a few services, they’ll all show up in the Marathon UI, along with their instance counts and resource allocations.

Chronos

Chronos is basically cron-on-mesos. It’s a Mesos framework to run scheduled tasks with containers. And we’re going to provision several instances of it using Marathon.

We can describe the service with this json config.

{
  "id": "chronos",
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "boritzio/docker-chronos",
      "network": "HOST",
      "parameters": [
        { "key": "env", "value": "MESOS_ZK=zk://ops100:2181,ops110:2181,ops120:2181/mesos" },
        { "key": "env", "value": "CHRONOS_ZK=ops100:2181,ops110:2181,ops120:2181" }
      ]
    }
  },
  "instances": 3,
  "cpus": 1,
  "mem": 1024,
  "ports": [4400]
}

And then we launch it by posting the json to our Marathon api server.

~/launch.sh chronos.json

Which gives us three Chronos instances, scheduled by Marathon somewhere on the cluster.
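
The Marathon API tells the same story as the UI; querying the app we just posted returns its instance count and where the tasks landed:

curl -s marathon.mycompany.com/v2/apps/chronos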

Launch Away!

So, that’s it. This should give you a good path for setting up a relatively pet-free environment without compromising on the “Too Core” services too much. Once you adopt an approach like this, sometimes ad hoc things take longer, but on the whole, everyone is better off, especially when the hard drive fails on hephaestus.

Gratuitous Promotion-y Stuff

If you’ve read all of this and have strong opinions on what you see here and want to come help us further automate and optimize our devops, take a look at our job openings, including: http://www.factual.com/jobs/oweb0fwF/Lead-Systems-DevOps-Engineer.

I seldom tweet, but when I do, I might call my next semi-substantiated opinionfest to your attention at @shimanovsky.

All The Code

I’ve included the automated-build ready Dockerfiles and marathon service configs for many of my examples below. Hopefully they address any details I missed in the post.

Core Services

Zookeeper

https://github.com/bfs/docker-zookeeper

Mesos Master

https://github.com/bfs/docker-mesos-master

Marathon

https://github.com/bfs/docker-marathon

Run These on Marathon

Chronos

https://github.com/bfs/docker-chronos

Some Sample Apps with Marathon-Friendly Configs

Ruby Web App (Rails or Sinatra) From Github

Here’s a setup that will pull a specified branch from Github and provision it as a ruby web app. We use this to launch several internal Sinatra apps.

Docker: https://github.com/bfs/docker-github-rubyapp

Marathon Config: https://gist.github.com/bfs/b34b7e09b0a2360e60e1

Postgresql with Postgis

Here’s a setup for a Postgresql server with Postgis that can be launched on Marathon.

Docker: https://github.com/bfs/docker-postgis

Marathon Config: https://gist.github.com/bfs/be77416a19bec481b584

-Boris Shimanovsky, VP of Engineering

Discuss this post on Hacker News.
