2016

Kafka — Accelerating!

Quick tips and insights on how to make Apache Kafka
work faster!

Hardware

  • CPU doesn’t matter that much.
  • Memory helps a lot (a lot) with performance, since Kafka leans heavily on
    the file system cache.
  • SSDs are not required, since most operations are sequential reads and writes.
  • If possible, run on bare metal.

Linux

  • Tune the virtual memory settings, for example:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 80
vm.swappiness = 1
  • Assuming you are using ext4, don’t waste space with reserved blocks:
tune2fs -m 0 -i 0 -c -1 /dev/device
  • Mount with noatime:
/dev/device       /mountpoint       ext4    defaults,noatime
  • Keep an eye on the number of free inodes:
tune2fs -l /dev/device | grep -i inode
  • Increase limits, for example, using systemd:
$ cat /etc/systemd/system/kafka.service.d/limits.conf
[Service]
LimitNOFILE=10000
  • Tweak your network settings, for example:
net.core.somaxconn = 1024
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_syncookies = 1
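
These kernel settings can be persisted so they survive a reboot; a minimal
sketch, assuming an arbitrary file name:

# Save the vm.* and net.* settings above in /etc/sysctl.d/99-kafka.conf
# (the file name is arbitrary), then load them without rebooting:
sysctl --system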

Kafka

  • log.dirs accepts a comma-separated list of disks and will distribute
    partitions across them; however:
  1. It doesn’t rebalance: some disks could be full while others are empty.
  2. It doesn’t tolerate any disk failure; more info in KIP-18.
  3. RAID 10 is probably the best middle ground between performance and
    reliability.
  • num.io.threads: the number of I/O threads that the server uses for executing
    requests. You should have at least as many threads as you have disks.
  • num.network.threads: the number of network threads that the server uses for
    handling network requests. Increase it based on the number of
    producers/consumers and the replication factor.
  • Use Java 1.8 and the G1 garbage collector:
-XX:MetaspaceSize=96m
-XX:+UseG1GC            # use G1
-XX:MaxGCPauseMillis=20 # target max GC pause
-XX:InitiatingHeapOccupancyPercent=35
-XX:G1HeapRegionSize=16M
-XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80
  • KAFKA_HEAP_OPTS: a 5–8 GB heap should be enough for most deployments; the
    file system cache is way more important. LinkedIn runs a 5 GB heap on
    servers with 32 GB of RAM (see the example after this list).
  • pcstat can help you understand how well the system is caching:
./pcstat /kafka/data/*
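
As a sketch of how the heap and GC flags above can be wired in, Kafka’s startup
scripts read them from environment variables (the sizes here are illustrative,
not a recommendation):

# Illustrative sizes; tune to your hardware.
export KAFKA_HEAP_OPTS="-Xms5g -Xmx5g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M"
bin/kafka-server-start.sh config/server.properties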

Any comments or suggestions are welcome!

Stability Patterns — Circuit Breaker

A circuit breaker is an automatically operated electrical switch designed to
protect an electrical circuit from damage caused by overcurrent, overload, or a
short circuit. Its basic function is to interrupt current flow after protective
relays detect a fault. A circuit breaker can be reset (either manually or
automatically) to resume normal operation.

The software analogue, as described in Release It! chapter 5.2, can prevent
repeated calls to a failing service by detecting issues and providing a
fallback; by using this pattern it is possible to avoid cascading failures.
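
A minimal sketch of the pattern in shell, wrapping a hypothetical health check
(the endpoint, thresholds, and state file below are all arbitrary choices):

#!/usr/bin/env bash
# Circuit-breaker sketch: after THRESHOLD consecutive failures the circuit
# "opens" and callers fail fast until OPEN_FOR seconds have passed.
STATE=/tmp/breaker.failures
THRESHOLD=5
OPEN_FOR=30

failures=$(cat "$STATE" 2>/dev/null || echo 0)
if [ "$failures" -ge "$THRESHOLD" ]; then
    age=$(( $(date +%s) - $(stat -c %Y "$STATE") ))
    if [ "$age" -lt "$OPEN_FOR" ]; then
        echo "circuit open, using fallback" >&2
        exit 1   # the caller falls back to a cache, a default, etc.
    fi
    # State file is old enough: fall through and allow one trial call
    # (the "half-open" state).
fi

if curl -sf https://service.example.com/health > /dev/null; then  # hypothetical service
    echo 0 > "$STATE"                  # success closes the circuit
else
    echo $((failures + 1)) > "$STATE"  # record another failure
fi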

Ansible vs PCI

If you need to comply with PCI requirements like:

Requirement 2: Maintain an inventory of system components in scope for PCI DSS
to support effective scoping practices.

You will find that using public-key authentication is sometimes forbidden, as
it’s almost impossible to ensure employees are rotating their keys, keeping the
private key safe, and protecting it with a strong password.

Using Ansible without SSH key-based authentication is painful if you need to
run a playbook against hundreds of servers, as you will need to enter your
password ad nauseam.
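
One workaround, sketched below (the playbook and inventory names are
hypothetical), is letting Ansible prompt for the passwords once per run:

# Prompts once per playbook run for the SSH and privilege-escalation passwords:
ansible-playbook -i inventory/production site.yml --ask-pass --ask-become-pass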

Docker: Configuration files

Things no one tells you about.

One of Docker’s killer features is environment parity, yet it feels like one
little detail was left untold: how to handle configuration files.

Unless you are using the same configuration across development, quality,
production, etc., you will end up with different endpoints, API keys, secret
tokens, and feature switches for each environment.

Available Options

There are a couple of different ways to handle configuration in Docker. Below,
you will find a non-exhaustive list.
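
For instance, one common option is injecting settings as environment variables
at run time; a sketch, where the image name and variables are hypothetical:

# Hypothetical image and variables:
docker run \
    -e API_ENDPOINT=https://api.example.com \
    -e FEATURE_X_ENABLED=true \
    myapp:latest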

Collectd plugins made easy

Collectd is a Unix daemon that collects, transfers, and stores performance
data from computers and network equipment. The acquired data is meant to help
system administrators maintain an overview of available resources to detect
existing or looming bottlenecks.

Collectd provides a long list of plugins available out of the box. However, if
you need to collect additional metrics, one of the easiest ways to do so is
using the exec plugin.

To use the exec plugin, create a Collectd configuration file:
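
A minimal sketch (the user, script path, and metric below are examples, not
fixed names):

LoadPlugin exec
<Plugin exec>
    # The exec plugin refuses to run scripts as root; use an unprivileged user.
    Exec "nobody:nogroup" "/usr/local/bin/my-metric.sh"
</Plugin>

The script itself loops forever and prints values in the PUTVAL format that
Collectd understands:

#!/usr/bin/env bash
# Collectd exports these variables to exec scripts.
HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -f)}"
INTERVAL="${COLLECTD_INTERVAL:-10}"
while sleep "$INTERVAL"; do
    value=$(wc -l < /var/spool/myapp/queue)   # hypothetical metric source
    echo "PUTVAL \"$HOSTNAME/exec-myapp/gauge-queue_size\" interval=$INTERVAL N:$value"
done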

Chef Custom Resources

Lately I’ve been writing some Chef code. One of the best things about Chef
is custom resources:
https://docs.chef.io/custom_resources.html

Let’s see an example of how to create a Kafka topic using Chef and how to make
it idempotent.

Before writing any Chef code, it is important to understand how to manage a
topic (a grouping of messages of a similar type).

From the Kafka install directory, first check if the topic already exists:

bin/kafka-topics.sh \
    --zookeeper localhost:2181 \
    --describe \
    --topic <name>

If it doesn’t exist, create it:
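
A sketch of the create call (the partition and replication counts here are
illustrative; pick values that fit your cluster):

# Partition/replication counts below are illustrative.
bin/kafka-topics.sh \
    --zookeeper localhost:2181 \
    --create \
    --topic <name> \
    --partitions 3 \
    --replication-factor 2

Guarding this create with the describe check above is what makes the Chef
custom resource idempotent: the topic is only created when it does not already
exist.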