Elasticsearch on Kubernetes
January 28, 2019

Vulcan's EarthRanger™ is a software platform that collects information on activity in a protected area—the animals and assets being protected, the rangers protecting them, and threats or potential poaching—into a single, integrated, real-time visualized operational platform (“EarthRanger”).  With multiple deployed sites, the EarthRanger engineering team needs a consolidated source of log activity across all deployments to facilitate system-wide troubleshooting.  We turned to Elasticsearch®, a search engine based on the Lucene library (“Elasticsearch”), for searching and analyzing logs from the various deployments and from the Kubernetes® (k8s) orchestration system and environments where they run (“Kubernetes”).  Volumes are non-trivial at up to 200,000 documents ingested per hour received via a common Logstash instance, but query activity is very light, with only occasional usage via Kibana for application troubleshooting.  As with most of our applications, Elasticsearch is deployed to a Google® Kubernetes Engine (GKE)-managed k8s cluster in the cloud.  In this blog I'll try to capture some of our recent lessons learned from this experience.

Running on Kubernetes

There are countless examples on the web of how to spin up Elasticsearch on Kubernetes.  Like many, our current approach is a combination and refinement of what others have done before us.  First, the basics...

We're running Elasticsearch version 6.3.2 with 3 master nodes*, 3 data nodes, and 2 ingest nodes.  The master and data nodes run as statefulsets, while the ingest nodes run as a deployment.  The GKE cluster hosting it has 4 n1-standard-2 compute instances running k8s version 1.10.6-gke.13 (as of this writing).  Data and master nodes are backed by 100GB and 10GB (respectively) Google Compute Engine (GCE) persistent disks.  The Elasticsearch cluster is exposed via an Nginx ingress controller.

* In this post, "master nodes" refers to the master-eligible nodes, not just the elected master, unless otherwise noted.
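
The manifests that follow also reference a few Services by name: elasticsearch-master for node discovery, elasticsearch-data as the data statefulset's governing service, and elasticsearch for HTTP traffic on port 9200.  A minimal sketch of the headless discovery Service for the master nodes might look like the following; the selector labels are illustrative and need to match whatever labels your master pods carry:

apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-master
  labels:
    app: elasticsearch-master
    role: master
spec:
  # Headless service: gives each master pod a stable DNS record for discovery
  clusterIP: None
  selector:
    app: elasticsearch-master
    role: master
  ports:
  - name: transport
    port: 9300
    protocol: TCP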

Persistent Disks.  One of our design goals for all of our k8s deployments is to provision persistent storage that can survive destruction of the cluster.  This provides a handy troubleshooting option of simply destroying a misbehaving cluster, then re-creating and re-deploying to it.  Persistent disks survive this operation, so a re-deployed application can pick up where it left off.  But this gets tricky with stateful sets that have more than one replica (pod), because sometimes we want to ensure that a re-created pod mounts the same persistent disk as its predecessor.  While this is less important for Elasticsearch, it's critically important for Kafka, for example, which has a very tight binding between nodes and their disks.  We address this by explicitly creating GCE persistent disks with fixed names, such as elasticsearch-data-0, and Persistent Volumes and Persistent Volume Claims to match.

Persistent Volumes (PV) to represent the GCE persistent disks:

apiVersion: "v1"
kind: "PersistentVolume"
metadata:
  annotations:
      volume.beta.kubernetes.io/mount-options: "discard"
  name: elasticsearch-data-0
  labels:
    component: elasticsearch-data-0
spec:
  capacity:
    storage: "100Gi"
  accessModes:
    "ReadWriteOnce"
  persistentVolumeReclaimPolicy: Retain
  storageClassName: "ssd"
  gcePersistentDisk:
    fsType: "ext4"
    pdName: "elasticsearch-data-0"

Persistent Volume Claims (PVC) to request the PVs:

apiVersion: "v1"
kind: "PersistentVolumeClaim"
metadata:
  name: "elasticsearch-data-elasticsearch-data-0"
spec:
  accessModes:
    "ReadWriteOnce"
  resources:
    requests:
      storage: "100Gi"
  storageClassName: "ssd"
  selector:
    matchLabels:
      component: "elasticsearch-data-0"

Stateful Sets whose pods bind to those Persistent Volume Claims:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-data
  labels:
    app: elasticsearch-data
    role: data
spec:
  serviceName: elasticsearch-data
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch-data
  template:
    metadata:
      labels:
        app: elasticsearch-data
        role: data
    spec:
      initContainers:
      - name: init-sysctl
        image: busybox:1.27.2
        command:
        - sysctl
        - -w
        - vm.max_map_count=262144
        securityContext:
          privileged: true
      containers:
      - name: elasticsearch-data
        securityContext:
          privileged: true
          capabilities:
            add:
              - IPC_LOCK
        image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.2
        imagePullPolicy: "IfNotPresent"
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: "CLUSTER_NAME"
          value: "es-cluster"
        - name: NODE_MASTER
          value: "false"
        - name: DISCOVERY_SERVICE
          value: elasticsearch-master
        - name: NODE_INGEST
          value: "false"
        - name: HTTP_ENABLE
          value: "false"
        - name: "ES_JAVA_OPTS"
          value: "-Xms800m -Xmx800m"
        resources:
          requests:
            memory: 1Gi
        ports:
        - containerPort: 9300
          name: transport
          protocol: TCP
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: elasticsearch-data
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "ssd"
      resources:
        requests:
          storage: 100Gi
      selector:
        matchLabels:
          component: null

K8s appends an ordinal number to the end of the name of each of the three pods generated from this statefulset (e.g., elasticsearch-data-0).  When each pod is created, k8s matches the pod with the PVC of the same name/ordinal combination that we created above.  In this way, pod elasticsearch-data-0 always binds to PVC elasticsearch-data-0, which always claims PV elasticsearch-data-0, which always binds to disk elasticsearch-data-0.

A note about that last volumeClaimTemplate selector: the purpose of a volume claim template is to help k8s automatically create a PVC if no matching PVC is found.  As noted above, however, we are being very explicit about which PVC a pod binds to.  By using component: null as the match label, we prohibit k8s from binding that pod to any other disk.

This is not strictly necessary.  Kubernetes can create Claims, Volumes, and even disks automatically and guarantee that a pod that restarts will receive the same claim-volume-disk combination as its predecessor.  The extra work we're doing above is to ensure that this consistent assignment survives complete destruction of the cluster, which is beyond the scope of guarantees that k8s provides.  It has disadvantages: namely that we cannot scale the stateful set up to more pods if needed without first provisioning more disks, Persistent Volume Claims, and Persistent Volumes.  But so far, this has not been a limitation.

Persistent Storage for Master Nodes

Curiously, many Elasticsearch/k8s examples in the wild create the master nodes as k8s Deployments, which gives them only ephemeral storage.  For reasons well-explained here, master nodes should have persistent storage, or else you'll risk data loss in some plausible scenarios.  As such, our master nodes are deployed as a statefulset, just like the data nodes shown above.
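
The master statefulset mirrors the data-node manifest; a trimmed sketch is below, assuming the same image's NODE_* environment switches (init container, resource requests, and JVM settings omitted for brevity; label names are illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-master
  labels:
    app: elasticsearch-master
    role: master
spec:
  serviceName: elasticsearch-master
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch-master
  template:
    metadata:
      labels:
        app: elasticsearch-master
        role: master
    spec:
      containers:
      - name: elasticsearch-master
        image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.2
        env:
        - name: CLUSTER_NAME
          value: "es-cluster"
        # Master-eligible only: this node holds cluster state, not data
        - name: NODE_MASTER
          value: "true"
        - name: NODE_DATA
          value: "false"
        - name: NODE_INGEST
          value: "false"
        - name: DISCOVERY_SERVICE
          value: elasticsearch-master
        ports:
        - containerPort: 9300
          name: transport
          protocol: TCP
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: elasticsearch-master
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-master
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "ssd"
      resources:
        requests:
          storage: 10Gi
      # Same trick as the data nodes: never auto-bind to an unexpected disk
      selector:
        matchLabels:
          component: null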

Settings

The Elasticsearch documentation notes that "Elasticsearch ships with good defaults," and we agree, but there are still a few settings that really, really should be changed.  Most of these are widely recommended, but we'll repeat them here:

  1. CLUSTER_NAME - New Elasticsearch nodes will automatically attempt to join clusters of the same name, the default name being elasticsearch.  If you don't change from the default cluster name, any new node that can see your cluster may attempt to join it.
  2. ES_JAVA_OPTS - use this setting to change the Java heap size, because the default is too small for most applications.
  3. NUMBER_OF_MASTERS - The default is 1, which is fine for getting started with a single-node "cluster".  But with more than one master-eligible node, always set this to a quorum of your master-eligible nodes, (N / 2) + 1, to avoid split brain; for our three masters that's 2.  The image we use applies it as discovery.zen.minimum_master_nodes.  A minimal example follows this list.
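
With the image we use, all three are plain environment variables on the Elasticsearch containers.  A fragment of the master-node spec might look like this (the heap size shown matches our data nodes and is illustrative):

env:
- name: CLUSTER_NAME
  value: "es-cluster"
- name: ES_JAVA_OPTS
  value: "-Xms800m -Xmx800m"   # set min and max heap to the same value
- name: NUMBER_OF_MASTERS
  value: "2"                   # quorum of 3 master-eligible nodes: (3 / 2) + 1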

Set the Sharding Level

Sizing an Elasticsearch cluster is hard due to the number of variables involved.  One important factor to keep in mind is that the number of shards in your cluster can push your capacity limits as much as the number of documents can.  By default, indices in Elasticsearch get a sharding factor of 5 and replication factor of 1.  This means that every Elasticsearch index actually manifests as 10 Lucene indices, each with its own resource demands.  Since time-series data like logs are often fed to a new index per day to make index curation easier, with multiple applications feeding logs to our cluster, we observed an explosion of shards.  Our first experience with cluster resource exhaustion was caused not by the number or size of the indexed documents, but by the sheer quantity of shards our cluster was managing.

By reducing the sharding factor to equal the number of data nodes (3), we were able to reduce the number of shards on our cluster by 40% while still ensuring that Elasticsearch could distribute primary and replica shards across data nodes to provide the same level of performance and failure protection.  Unfortunately, the sharding factor cannot be set by any configuration file or command line argument; it is applied as part of an index template by POSTing to the cluster's API:

POST _template/default
{
  "template": ["*"]
  "order": -1
  "settings": {
    "number_of_shards""3"
  }
}

This template will be applied to all indices created after the template is created.  Existing indices will retain their previous sharding factor.

Unfortunately, this one-time step cannot be performed until the cluster is created and healthy.  While this is easy to do by hand, it becomes an extra chore in an infrastructure automation environment.  Without a human in the loop, we have to write and maintain a separate automation step that detects when a new cluster is healthy and then POSTs the index template.  We have an open question about this, but so far there doesn't seem to be a way to set the sharding factor at cluster provisioning time.
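
As a rough sketch of what that step can look like, the Job below polls the cluster health endpoint and then POSTs the template from the previous section.  The curl image is an assumption; the service name and port match the ones used elsewhere in this post:

apiVersion: batch/v1
kind: Job
metadata:
  name: elasticsearch-default-template
spec:
  backoffLimit: 10
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: post-template
        # Any small curl-capable image works; this particular one is an assumption
        image: curlimages/curl:7.72.0
        command:
        - /bin/sh
        - -c
        - |
          # Poll until the cluster responds and reports at least yellow health
          until curl -sf "http://elasticsearch:9200/_cluster/health?wait_for_status=yellow&timeout=30s" > /dev/null; do
            echo "waiting for Elasticsearch..."
            sleep 10
          done
          # Apply the default index template that sets the sharding factor
          curl -sf -XPOST "http://elasticsearch:9200/_template/default" \
            -H "Content-Type: application/json" \
            -d '{"index_patterns":["*"],"order":-1,"settings":{"number_of_shards":"3"}}'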

Speaking of Curation

When getting started with any data store, we have a natural tendency to pat ourselves on the back and move on once it's up and running.  But if you leave the faucet on, eventually the bathtub overflows.  Don't forget to add some kind of curation step to prevent your indices from growing without bounds.  If you're not paying attention, Elasticsearch will exhaust its JVM heap, at which point only careful surgery has a chance of returning your cluster to a healthy state.  One way to cull the herd is to close old indices, which saves on resource consumption while preserving the data if needed later.  In our case, deletion is acceptable.  How do we automate this?

We deploy a k8s CronJob that runs the Elasticsearch Curator every morning at 1 a.m. to delete indices more than 30 days old, based on the index name.

apiVersion: v1
kind: ConfigMap
metadata:
  name: curator-config
data:
  action_file.yml: |-
    ---
    actions:
      1:
        action: delete_indices
        description: "Clean up ES by deleting old indices"
        options:
          timeout_override:
          continue_if_exception: False
          disable_action: False
          ignore_empty_list: True
        filters:
        - filtertype: age
          source: name
          direction: older
          timestring: '%Y.%m.%d'
          unit: days
          unit_count: 30
          field:
          stats_result:
          epoch:
          exclude: False
  config.yml: |-
    ---
    client:
      hosts:
        - elasticsearch
      port: 9200
      url_prefix:
      use_ssl: False
      certificate:
      client_cert:
      client_key:
      ssl_no_validate: False
      http_auth:
      timeout: 30
      master_only: False
    logging:
      loglevel: INFO
      logfile:
      logformat: default
      blacklist: ['elasticsearch', 'urllib3']
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: elasticsearch-curator
spec:
  schedule: '0 1 * * *'
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          imagePullSecrets:
            - name: infra-secret
          containers:
          - name: elasticsearch-curator
            image: quay.io/pires/docker-elasticsearch-curator:5.5.1
            args:
            - --config
            - /etc/config/config.yml
            - /etc/config/action_file.yml
            volumeMounts:
              - name: config-volume
                mountPath: /etc/config
          volumes:
            - name: config-volume
              configMap:
                name: curator-config
          restartPolicy: Never


Observability

In spite of our best efforts, Elasticsearch can still go sideways.  As with most systems, we have found that the best way to prevent big problems is to know when they're coming.  We have developed a standard Grafana/Alertmanager/Prometheus (GAP) stack that deploys with every k8s cluster to monitor and alert on the health of the cluster itself and its workloads.  For clusters running Elasticsearch, we add custom Grafana dashboards and Prometheus alerts to help us monitor, detect, and act on brewing problems.  Informed by this excellent series of posts from Datadog, we watch and alert on (among other items):

  • Overall cluster status (green/yellow/red)
  • JVM heap usage
  • Garbage collection, index, and query performance

We deploy the Justwatch Elasticsearch exporter to expose Prometheus metrics about the cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: elasticsearch-exporter
spec:
  selector:
    matchLabels:
      component: elasticsearch-exporter
  template:
    metadata:
      labels:
        component: elasticsearch-exporter
      annotations:
        prometheus.io/port: '9108'
    spec:
      imagePullSecrets:
        - name: infra-secret
      containers:
      - image: justwatch/elasticsearch-exporter:1.0.2
        name: elasticsearch-exporter
        args:
        - '-es.uri=http://elasticsearch:9200'
        - '-es.all=true'
        - '-es.timeout=30s'
        resources:
          requests:
            cpu: "10m"
        ports:
        - containerPort: 9108
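
The alerts themselves are ordinary Prometheus rules over the metrics this exporter exposes.  A sketch of two of them might look like the following; the thresholds are illustrative, not our production values, and the metric names should be checked against your exporter version:

groups:
- name: elasticsearch
  rules:
  - alert: ElasticsearchClusterNotGreen
    # The exporter publishes one series per status color, set to 1 for the current status
    expr: elasticsearch_cluster_health_status{color="green"} == 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch cluster health has not been green for 15 minutes"
  - alert: ElasticsearchHeapUsageHigh
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.9
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch JVM heap usage is above 90%"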

And we visualize the metrics with a Grafana dashboard customized from Kristian Jensen's work.

Conclusion

Adopting Elasticsearch has been like skiing: easy to get started, but a little harder to master.  While much of its out-of-the-box behavior is sufficient for many applications at first, there are a number of operational considerations that can't be ignored for long.  However, none of them are too difficult, so running Elasticsearch on Kubernetes has been a reliable and resilient solution for us.




All source code in this post copyright Vulcan Inc., 2019, and licensed under Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0). 

About the Author
Matt S.
Senior Software Engineer
Matt has been working on the Vulcan developer experience for almost two years and is completing his master's degree in Computer Science at the Georgia Institute of Technology.

Category Tags
Developer Experience
General
OnlyAtVulcan