Introduction to Apache Kafka

The Next Gen Event Streaming System

James Ward | @_JamesWard
Developer @ Salesforce.com

Data Integration Today

Integration Complexity

No System of Record
Synchronization is hard
Scaling ETL is hard
Processing is error-prone

Events not Tables

Streams as Ledgers

First-class Partitioning

Why not message systems?

Ordering?
Horizontal Scaling?
Push?

Kafka = Event Ledger - Distributed & Redundant

A Distributed Commit Log

Linear Clustering Performance

Near network speeds

Kafka Fundamentals

Messaging System Semantics
Clustering is Core
Durability & Ordering Guarantees

Use Cases

Modern ETL / CDC
Data Pipelines
Big Data Ingest

Demo!

Records

Key, Value, Timestamp
Immutable
Append Only
Persisted

AKA: A Log

Producers & Consumers

Broker = Node in the cluster
Producer writes records to a broker
Consumer reads records from a broker
Leader / Follower for cluster distribution

Topics & Partitions

Topic = Logical name with 1 or more partitions
Partitions are replicated
Ordering is guaranteed for a partition

Offsets

Unique sequential ID (per partition)
Consumers track offsets
Benefits: Replay, Different Speed Consumers, etc

Producer Partitioning

Writes are to the leader of a partition
Partitioning can be done manually or based on a key
Replication Factor is Topic-based
Auto-Rebalancing

Consumer Groups

Logical name for 1 or more consumers
Message consumption is load balanced across all consumers in a group

Delivery Guarantees

Producer

Async (No Guarantee)
Committed to Leader
Committed to Leader & Quorum

Consumer

At-least-once (Default)
At-most-once
Effectively-once
Exactly Once (Maybe)

Cool Features

Log Compaction
Disk not Heap
Pagecache to Socket
Balanced Partitions & Leaders
Producer & Consumer Quotas
Heroku Kafka

Clients

JVM is official
Most other platforms via the community
Polling Based

Basic Connection

Producer

class Producer {
    public void send(ProducerData<K,V> producerData);
}

Consumer

class SimpleConsumer {
    public ByteBufferMessageSet fetch(FetchRequest request);

    public long[] getOffsetsBefore(String topic, int partition, long time, int maxNumOffsets);
}

interface ConsumerConnector {
    public Map<String,List<KafkaStream>> createMessageStreams(Map<String,Int> topicCountMap);
}

Akka Streams

Impl of Reactive Streams
Source / Sink Stream Programming
Back-pressure, etc
Kafka Adapter:
https://github.com/akka/reactive-kafka

Akka Streams

val source = Source.repeat("hello, world")
val sink = Sink.foreach(println)
val flow = source to sink
flow.run()

Code!

https://github.com/jamesward/koober

Apache Flink

Real-time Data Analytics

Bounded & Unbounded Data Sets
Stream processing (map, fold, filter, etc)
Distributed Core
Fault Tolerant
Flexible Windowing

Flink

Continuous Processing for Unbounded Datasets

Flink - Windowing

Bounding with Time, Count, Session, or Data

Code!

https://github.com/jamesward/koober

Heroku Kafka

https://www.heroku.com/kafka

Questions?

Reach out: @_JamesWard