Kubernetes Production Patterns and Anti-Patterns
In this post, we explore helpful techniques for improving the resiliency and high availability of Kubernetes deployments and take a look at some common mistakes to avoid when working with Docker and Kubernetes.
Installation
First, follow the installation instructions on GitHub.
Anti-Pattern: Mixing build environment and runtime environment
Let's take a look at this Dockerfile:
FROM ubuntu:14.04
RUN apt-get update
RUN apt-get install -y gcc
COPY hello.c /build/hello.c
RUN gcc /build/hello.c -o /hello
ENTRYPOINT ["/hello"]
It compiles and runs a simple "hello world" program:
$ cd prod/build
$ docker build -t prod .
$ docker run prod
Hello World
There are a couple of problems with the resulting image:
Size
$ docker images | grep prod
prod latest b2c197180350 14 minutes ago 293.7 MB
That's almost 300 megabytes to host a few kilobytes of C program! We are bringing in a package manager, a C compiler and lots of other tools that are not required to run this program.
Which leads us to the second problem:
Security
We distribute the whole build toolchain. In addition to that, we ship the source code inside the image:
$ docker run --entrypoint=cat prod /build/hello.c
#include <stdio.h>

int main()
{
    printf("Hello World\n");
    return 0;
}
Splitting build environment and runtime environment
We are going to use the "buildbox" pattern: build an image containing the build environment, then use a much smaller runtime image to run our program.
$ cd prod/build-fix
$ docker build -f build.dockerfile -t buildbox .
NOTE: We used the new -f flag to specify the Dockerfile we want to build from.
Now we have a buildbox image that contains our build environment. We can use it to compile the C program:
$ docker run -v $(pwd):/build buildbox gcc /build/hello.c -o /build/hello
We did not use docker build this time; instead, we mounted the source code and ran the compiler directly.
NOTE: Docker will soon support this pattern natively by introducing build stages into the build process.
UPDATE: Multi-stage builds are now available in Docker CE.
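With multi-stage builds, the same split can be expressed in a single Dockerfile. A minimal sketch (the build stage name and the -static flag are our choices here, not taken from the example repo):
FROM ubuntu:14.04 AS build
RUN apt-get update && apt-get install -y gcc
COPY hello.c /build/hello.c
RUN gcc -static /build/hello.c -o /hello

# only the compiled binary is copied into the final image
FROM quay.io/gravitational/debian-tall:0.0.1
COPY --from=build /hello /hello
ENTRYPOINT ["/hello"]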
We can now use a much simpler (and smaller) Dockerfile to run our image:
FROM quay.io/gravitational/debian-tall:0.0.1
ADD hello /hello
ENTRYPOINT ["/hello"]
$ docker build -f run.dockerfile -t prod:v2 .
$ docker run prod:v2
Hello World
$ docker images | grep prod
prod v2 ef93cea87a7c 17 seconds ago 11.05 MB
prod latest b2c197180350 45 minutes ago 293.7 MB
NOTE: Please be aware that you should either provide the needed shared libraries in the runtime image or statically link your binaries so that they include everything they need.
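For example, you can statically link the hello program by adding the -static flag to the earlier buildbox command, so the resulting binary has no runtime library dependencies:
$ docker run -v $(pwd):/build buildbox gcc -static /build/hello.c -o /build/hello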
Anti-Pattern: Zombies and orphans
NOTE: This example will only work on Linux.
Orphans
It is quite easy to leave orphaned processes running in the background. Let's start a container that just sleeps:
docker run busybox sleep 10000
Now, let's open a separate terminal and locate the process
ps uax | grep sleep
sasha 14171 0.0 0.0 139736 17744 pts/18 Sl+ 13:25 0:00 docker run busybox sleep 10000
root 14221 0.1 0.0 1188 4 ? Ss 13:25 0:00 sleep 10000
As you see, there are in fact two processes: docker run on the host, and sleep 10000 running inside the container.
Let's send a kill signal to docker run (just as a CI/CD job would do to a long-running process):
kill 14171
The docker run process is gone, but the sleep process is still running!
ps uax | grep sleep
root 14221 0.0 0.0 1188 4 ? Ss 13:25 0:00 sleep 10000
Yelp engineers give a good explanation of why this happens:
The Linux kernel applies special signal handling to processes which run as PID 1. When processes are sent a signal on a normal Linux system, the kernel will first check for any custom handlers the process has registered for that signal, and otherwise fall back to default behavior (for example, killing the process on SIGTERM).
However, if the process receiving the signal is PID 1, it gets special treatment by the kernel; if it hasn't registered a handler for the signal, the kernel won't fall back to default behavior, and nothing happens. In other words, if your process doesn't explicitly handle these signals, sending it SIGTERM will have no effect at all.
To solve this (and other) issues, you need a simple init system with proper signal handlers. Luckily, Yelp engineers built one: the simple and lightweight dumb-init.
docker run quay.io/gravitational/debian-tall /usr/bin/dumb-init /bin/sh -c "sleep 10000"
Now you can simply stop the docker run process with SIGTERM, and dumb-init will handle the shutdown properly.
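To make this the default behavior, you can bake dumb-init into the image as PID 1. A minimal sketch, assuming a dumb-init binary (marked executable) sits in the build context:
FROM quay.io/gravitational/debian-tall:0.0.1
# dumb-init becomes PID 1 and forwards signals to the command it runs
ADD dumb-init /usr/bin/dumb-init
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["sleep", "10000"]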
Anti-Pattern: Direct Use Of Pods
A Kubernetes Pod is a building block that is not itself durable.
Do not use Pods directly in production: they won't be rescheduled, won't retain their data, and provide no durability guarantees.
Instead, use a Deployment with a replication factor of 1, which guarantees that pods will be rescheduled and will survive eviction or node loss.
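A minimal sketch of such a Deployment (the name, labels and command are illustrative; on clusters from this era the apiVersion may be extensions/v1beta1 instead of apps/v1):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      containers:
      - name: server
        image: busybox
        command: ["/bin/sh", "-c", "sleep 100000"]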
Anti-Pattern: Using background processes
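The image in this example starts its HTTP server as a background process and then blocks. The actual script lives in prod/background; a hypothetical start.sh illustrating the mistake could look like this:
#!/bin/bash
# the typo (http.serve instead of http.server) makes the background
# process die immediately, but the script itself keeps running
python -m http.serve 5000 &
sleep 100000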
$ cd prod/background
$ docker build -t $(minikube ip):5000/background:0.0.1 .
$ docker push $(minikube ip):5000/background:0.0.1
$ kubectl create -f crash.yaml
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
crash 1/1 Running 0 5s
The container appears to be running, but let's check if our server is running there:
$ kubectl exec -ti crash /bin/bash
root@crash:/#
root@crash:/#
root@crash:/# ps uax
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 21748 1596 ? Ss 00:17 0:00 /bin/bash /start.sh
root 6 0.0 0.0 5916 612 ? S 00:17 0:00 sleep 100000
root 7 0.0 0.0 21924 2044 ? Ss 00:18 0:00 /bin/bash
root 11 0.0 0.0 19180 1296 ? R+ 00:18 0:00 ps uax
root@crash:/#
Using Probes
We made a mistake: the HTTP server is not running, but there is no indication of this because the parent process is still alive.
The first obvious fix is to use a proper init system and monitor the status of the web service. However, let's use this as an opportunity to introduce liveness probes:
apiVersion: v1
kind: Pod
metadata:
  name: fix
  namespace: default
spec:
  containers:
  - command: ["/start.sh"]
    image: localhost:5000/background:0.0.1
    name: server
    imagePullPolicy: Always
    livenessProbe:
      httpGet:
        path: /
        port: 5000
      timeoutSeconds: 1
$ kubectl create -f fix.yaml
The liveness probe will fail and the container will get restarted.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
crash 1/1 Running 0 11m
fix 1/1 Running 1 1m
Production Pattern: Logging
Set up your logs to go to stdout:
$ kubectl create -f logs/logs.yaml
$ kubectl logs logs
hello, world!
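A minimal sketch of what logs/logs.yaml might contain (the actual manifest is in the repo):
apiVersion: v1
kind: Pod
metadata:
  name: logs
spec:
  restartPolicy: Never
  containers:
  - name: logs
    image: busybox
    command: ["/bin/sh", "-c", "echo hello, world!"]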
Kubernetes and Docker have a system of plugins to make sure logs sent to stdout and stderr will get collected, forwarded and rotated.
NOTE: This is one of the patterns of The Twelve Factor App and Kubernetes supports it out of the box!
Production Pattern: Immutable containers
Every time you write something to a container's filesystem, it activates the copy-on-write strategy.
A new storage layer is created using a storage driver (devicemapper, overlayfs or others). Under heavy use, this can put a lot of load on the storage driver, especially in the case of devicemapper or BTRFS.
Make sure your containers write data only to volumes. You can use tmpfs for small temporary files (tmpfs stores everything in memory):
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: busybox
    name: test-container
    volumeMounts:
    - mountPath: /tmp
      name: tempdir
  volumes:
  - name: tempdir
    emptyDir: {}
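Note that a plain emptyDir is backed by the node's disk; to get an actual in-memory tmpfs mount, set the medium to Memory:
  volumes:
  - name: tempdir
    emptyDir:
      medium: Memory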
Anti-Pattern: Using the latest tag
Do not use the latest tag in production. It creates ambiguity, as it's not clear which actual version of the app is running.
It is OK to use latest for development purposes, but make sure you set imagePullPolicy to Always, so that Kubernetes pulls the latest version when creating a pod:
apiVersion: v1
kind: Pod
metadata:
  name: always
  namespace: default
spec:
  containers:
  - command: ["/bin/sh", "-c", "echo hello, world!"]
    image: busybox:latest
    name: server
    imagePullPolicy: Always
Production Pattern: Pod Readiness
Imagine a situation where your container takes some time to start. To simulate this, we are going to write a simple script:
#!/bin/bash
echo "Starting up"
sleep 30
echo "Started up successfully"
python -m http.server 5000
Push the image and start service and deployment:
$ cd prod/delay
$ docker build -t $(minikube ip):5000/delay:0.0.1 .
$ docker push $(minikube ip):5000/delay:0.0.1
$ kubectl create -f service.yaml
$ kubectl create -f deployment.yaml
Start a curl container inside the cluster and make sure it all works:
kubectl run -i -t --rm cli --image=tutum/curl --restart=Never
curl http://delay:5000
<!DOCTYPE html>
...
You will notice a "connection refused" error when you try to access it during the first 30 seconds.
Update the deployment to simulate a new deploy:
$ docker build -t $(minikube ip):5000/delay:0.0.2 .
$ docker push $(minikube ip):5000/delay:0.0.2
$ kubectl replace -f deployment-update.yaml
In the next window, let's see if we get any service downtime:
curl http://delay:5000
curl: (7) Failed to connect to delay port 5000: Connection refused
We've got a production outage despite setting maxUnavailable: 0 in our rolling update strategy!
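For reference, the relevant fragment of the deployment spec presumably looks like this (a sketch, not copied from the repo):
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0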
This happened because Kubernetes did not know about the startup delay, and so did not know when the service was ready.
Let's fix that by adding a readiness probe:
readinessProbe:
  httpGet:
    path: /
    port: 5000
  timeoutSeconds: 1
  periodSeconds: 5
The readiness probe indicates whether the pod's containers are ready to serve traffic, and Kubernetes takes this into account when rolling out a deployment:
$ kubectl replace -f deployment-fix.yaml
This time we will get no downtime.
Anti-Pattern: Unbound quickly failing jobs
Kubernetes provides a useful new tool for scheduling containers that perform one-time tasks: jobs.
However, there is a problem:
apiVersion: batch/v1
kind: Job
metadata:
  name: bad
spec:
  template:
    metadata:
      name: bad
    spec:
      restartPolicy: Never
      containers:
      - name: box
        image: busybox
        command: ["/bin/sh", "-c", "exit 1"]
$ cd prod/jobs
$ kubectl create -f job.yaml
You are going to observe the job controller racing to create hundreds of containers as the job retries forever:
$ kubectl describe jobs
Name: bad
Namespace: default
Image(s): busybox
Selector: controller-uid=18a6678e-11d1-11e7-8169-525400c83acf
Parallelism: 1
Completions: 1
Start Time: Sat, 25 Mar 2017 20:05:41 -0700
Labels: controller-uid=18a6678e-11d1-11e7-8169-525400c83acf
job-name=bad
Pods Statuses: 1 Running / 0 Succeeded / 24 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: bad-fws8g
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: bad-321pk
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: bad-2pxq1
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: bad-kl2tj
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: bad-wfw8q
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: bad-lz0hq
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: bad-0dck0
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: bad-0lm8k
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: bad-q6ctf
1m 1s 16 {job-controller } Normal SuccessfulCreate (events with common reason combined)
Probably not the result you expected. Over time, the load on the nodes and on Docker becomes quite substantial, especially if the job is failing very quickly.
Let's clean up the busy failing job first:
$ kubectl delete jobs/bad
Now let's use activeDeadlineSeconds to limit how long the job is allowed to keep retrying:
apiVersion: batch/v1
kind: Job
metadata:
  name: bound
spec:
  activeDeadlineSeconds: 10
  template:
    metadata:
      name: bound
    spec:
      restartPolicy: Never
      containers:
      - name: box
        image: busybox
        command: ["/bin/sh", "-c", "exit 1"]
$ kubectl create -f bound.yaml
Now you will see that after 10 seconds, the job has failed:
11s 11s 1 {job-controller } Normal DeadlineExceeded Job was active longer than specified deadline
NOTE: Sometimes it makes sense to retry forever. In that case, make sure to set a proper pod restart policy to protect your cluster from an accidental DDoS.
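On newer Kubernetes versions you can also cap retries explicitly with backoffLimit and let failed containers restart in place with restartPolicy: OnFailure. A sketch (not from the original repo):
apiVersion: batch/v1
kind: Job
metadata:
  name: capped
spec:
  backoffLimit: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: box
        image: busybox
        command: ["/bin/sh", "-c", "exit 1"]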
Production Pattern: Circuit Breaker
In this example, our web application is an imaginary web server for email. To render the page, our frontend has to make two requests to the backend:
- Talk to the weather service to get current weather
- Fetch current mail from the database
If the weather service is down, the user would still like to read their email, so the weather service is auxiliary while the mail service is critical.
Here are our frontend, weather and mail services, written in Python:
Weather
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return '''Pleasanton, CA
Saturday 8:00 PM
Partly Cloudy
12 C
Precipitation: 9%
Humidity: 74%
Wind: 14 km/h
'''

if __name__ == "__main__":
    app.run(host='0.0.0.0')
Mail
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/")
def hello():
    return jsonify([
        {"from": "<[email protected]>", "subject": "lunch at noon tomorrow"},
        {"from": "<[email protected]>", "subject": "compiler docs"}])

if __name__ == "__main__":
    app.run(host='0.0.0.0')
Frontend
from flask import Flask
import requests
from datetime import datetime

app = Flask(__name__)

@app.route("/")
def hello():
    weather = "weather unavailable"
    try:
        print "requesting weather..."
        start = datetime.now()
        r = requests.get('http://weather')
        print "got weather in %s ..." % (datetime.now() - start)
        if r.status_code == requests.codes.ok:
            weather = r.text
    except:
        print "weather unavailable"

    print "requesting mail..."
    # reset the timer so mail timing does not include the weather call
    start = datetime.now()
    r = requests.get('http://mail')
    mail = r.json()
    print "got mail in %s ..." % (datetime.now() - start)

    out = []
    for letter in mail:
        out.append("<li>From: %s Subject: %s</li>" % (letter['from'], letter['subject']))

    return '''<html>
<body>
<h3>Weather</h3>
<p>%s</p>
<h3>Email</h3>
<p>
<ul>
%s
</ul>
</p>
</body>
''' % (weather, '<br/>'.join(out))

if __name__ == "__main__":
    app.run(host='0.0.0.0')
Let's create our deployments and services:
$ cd prod/cbreaker
$ docker build -t $(minikube ip):5000/mail:0.0.1 .
$ docker push $(minikube ip):5000/mail:0.0.1
$ kubectl apply -f service.yaml
deployment "frontend" configured
deployment "weather" configured
deployment "mail" configured
service "frontend" configured
service "mail" configured
service "weather" configured
Check that everything is running smoothly:
$ kubectl run -i -t --rm cli --image=tutum/curl --restart=Never
$ curl http://frontend
<html>
<body>
<h3>Weather</h3>
<p>Pleasanton, CA
Saturday 8:00 PM
Partly Cloudy
12 C
Precipitation: 9%
Humidity: 74%
Wind: 14 km/h
</p>
<h3>Email</h3>
<p>
<ul>
<li>From: <[email protected]> Subject: lunch at noon tomorrow</li><br/><li>From: <[email protected]> Subject: compiler docs</li>
</ul>
</p>
</body>
Let's introduce a weather service that crashes:
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    raise Exception("I am out of service")

if __name__ == "__main__":
    app.run(host='0.0.0.0')
Build and redeploy:
$ docker build -t $(minikube ip):5000/weather-crash:0.0.1 -f weather-crash.dockerfile .
$ docker push $(minikube ip):5000/weather-crash:0.0.1
$ kubectl apply -f weather-crash.yaml
deployment "weather" configured
Let's make sure that it is crashing:
$ kubectl run -i -t --rm cli --image=tutum/curl --restart=Never
$ curl http://weather
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
However, our frontend should still be fine:
$ kubectl run -i -t --rm cli --image=tutum/curl --restart=Never
curl http://frontend
<html>
<body>
<h3>Weather</h3>
<p>weather unavailable</p>
<h3>Email</h3>
<p>
<ul>
<li>From: <[email protected]> Subject: lunch at noon tomorrow</li><br/><li>From: <[email protected]> Subject: compiler docs</li>
</ul>
</p>
</body>
Everything is working as expected! There is one problem, though: we have only observed a service that crashes quickly. Let's see what happens when the weather service is slow instead. This happens much more often in production, e.g. due to network or database overload.
To simulate this failure we are going to introduce an artificial delay:
from flask import Flask
import time

app = Flask(__name__)

@app.route("/")
def hello():
    time.sleep(30)
    raise Exception("System overloaded")

if __name__ == "__main__":
    app.run(host='0.0.0.0')
Build and redeploy:
$ docker build -t $(minikube ip):5000/weather-crash-slow:0.0.1 -f weather-crash-slow.dockerfile .
$ docker push $(minikube ip):5000/weather-crash-slow:0.0.1
$ kubectl apply -f weather-crash-slow.yaml
deployment "weather" configured
Just as expected, our weather service is timing out:
curl http://weather
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
The problem, though, is that now every request to the frontend takes 30 seconds as well, because the frontend calls the weather service without a timeout:
curl http://frontend
This is a much more common type of outage: users leave in frustration because the service appears hung. To fix this issue we are going to introduce a special proxy with a circuit breaker.
A circuit breaker is special middleware designed to provide a fallback action when a service has degraded. It is very helpful for preventing cascading failures, where the failure of one service leads to the failure of another. A circuit breaker observes request statistics and trips when a special error condition is met.
Here is our simple circuit breaker, written in Python:
from flask import Flask
import requests
from datetime import datetime, timedelta
from threading import Lock
import logging, sys

app = Flask(__name__)

circuit_tripped_until = datetime.now()
mutex = Lock()

def trip():
    # open the circuit for 30 seconds
    global circuit_tripped_until
    mutex.acquire()
    try:
        circuit_tripped_until = datetime.now() + timedelta(0, 30)
        app.logger.info("circuit tripped until %s" % (circuit_tripped_until))
    finally:
        mutex.release()

def is_tripped():
    global circuit_tripped_until
    mutex.acquire()
    try:
        return datetime.now() < circuit_tripped_until
    finally:
        mutex.release()

@app.route("/")
def hello():
    try:
        if is_tripped():
            return "circuit breaker: service unavailable (tripped)"
        app.logger.info("requesting weather...")
        start = datetime.now()
        # a 1 second timeout guards against a slow upstream
        r = requests.get('http://localhost:5000', timeout=1)
        app.logger.info("got weather in %s ..." % (datetime.now() - start))
        if r.status_code == requests.codes.ok:
            return r.text
        else:
            trip()
            return "circuit breaker: service unavailable (tripping 1)"
    except:
        app.logger.info("exception: %s", sys.exc_info()[0])
        trip()
        return "circuit breaker: service unavailable (tripping 2)"

if __name__ == "__main__":
    app.logger.addHandler(logging.StreamHandler(sys.stdout))
    app.logger.setLevel(logging.DEBUG)
    app.run(host='0.0.0.0', port=6000)
Let's build and redeploy the circuit breaker:
$ docker build -t $(minikube ip):5000/cbreaker:0.0.1 -f cbreaker.dockerfile .
$ docker push $(minikube ip):5000/cbreaker:0.0.1
$ kubectl apply -f weather-cbreaker.yaml
deployment "weather" configured
$ kubectl apply -f weather-service.yaml
service "weather" configured
The circuit breaker will detect the service outage, and the auxiliary weather service will no longer bring our webmail frontend down:
curl http://frontend
<html>
<body>
<h3>Weather</h3>
<p>circuit breaker: service unavailable (tripped)</p>
<h3>Email</h3>
<p>
<ul>
<li>From: <[email protected]> Subject: lunch at noon tomorrow</li><br/><li>From: <[email protected]> Subject: compiler docs</li>
</ul>
</p>
</body>
NOTE: There are production-level proxies that natively support the circuit breaker pattern, such as Vulcand, Nginx Plus or Envoy.
Production Pattern: Sidecar For Rate and Connection Limiting
In the previous example we used the sidecar pattern: a special proxy, local to the Pod, that adds extra logic to the service, such as error detection, TLS termination and other features.
Here is an example of a sidecar nginx proxy that adds rate and connection limits.
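The actual nginx configuration lives in prod/sidecar; a hypothetical fragment of its http section could look like this (the directive names are real nginx, the zone names and limits are illustrative):
limit_req_zone $binary_remote_addr zone=ratelimit:10m rate=1r/s;
limit_conn_zone $binary_remote_addr zone=connlimit:10m;

server {
    listen 80;
    location / {
        # allow at most 1 request/second and 10 connections per client
        limit_req zone=ratelimit;
        limit_conn connlimit 10;
        proxy_pass http://localhost:5000;
    }
}
Build, push and deploy the sidecar: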
$ cd prod/sidecar
$ docker build -t $(minikube ip):5000/sidecar:0.0.1 -f sidecar.dockerfile .
$ docker push $(minikube ip):5000/sidecar:0.0.1
$ docker build -t $(minikube ip):5000/service:0.0.1 -f service.dockerfile .
$ docker push $(minikube ip):5000/service:0.0.1
$ kubectl apply -f sidecar.yaml
deployment "sidecar" configured
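The essence of the pattern is that the proxy and the service run as two containers in the same pod and talk over localhost. A sketch of the pod template fragment sidecar.yaml presumably contains (names and ports are illustrative):
    spec:
      containers:
      - name: sidecar
        # nginx proxy; receives traffic first and enforces the limits
        image: localhost:5000/sidecar:0.0.1
        ports:
        - containerPort: 80
      - name: service
        # the actual application, reachable by the proxy on localhost:5000
        image: localhost:5000/service:0.0.1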
Try to hit the service faster than once per second and you will see the rate limiting in action:
$ kubectl run -i -t --rm cli --image=tutum/curl --restart=Never
curl http://sidecar
Istio is an example of a platform that embodies this design, among other things.
Additional Reading: How to SSH into Docker Container