kafka – use cases

1.2 Use Cases

Here is a description of a few of the popular use cases for Apache Kafka™. For an overview of a number of these areas in action, see this blog post.

Messaging

Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc.). In comparison to most messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message processing applications.

In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.

In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

Website Activity Tracking

The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.

Activity tracking is often very high volume as many activity messages are generated for each user page view.
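
A minimal sketch of the publish side of such a pipeline, written in Scala against the standard Kafka producer API. The topic name "page-views", the broker address, and the JSON payload are assumptions made up for illustration, not anything prescribed by Kafka:

object PageViewProducer extends App {

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")  // assumed broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)

  // One topic per activity type: page views go to "page-views",
  // keyed by user id so one user's events land in the same partition.
  producer.send(new ProducerRecord[String, String]("page-views", "user-42", """{"page": "/home"}"""))

  producer.close()
}

Consumers for real-time monitoring or for loading into Hadoop would simply subscribe to the same topic.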

Metrics

Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

Log Aggregation

Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.

Stream Processing

Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. For example, a processing pipeline for recommending news articles might crawl article content from RSS feeds and publish it to an “articles” topic; further processing might normalize or deduplicate this content and publish the cleansed article content to a new topic; a final processing stage might attempt to recommend this content to users. Such processing pipelines create graphs of real-time data flows based on the individual topics. Starting in 0.10.0.0, a lightweight but powerful stream processing library called Kafka Streams is available in Apache Kafka to perform such data processing as described above. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.
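
As a rough sketch of the middle (cleansing) stage of such a pipeline, using the current Kafka Streams StreamsBuilder API from Scala. The topic names and the trivial "normalization" step are invented for illustration:

object ArticleCleanser extends App {

  import java.util.Properties
  import org.apache.kafka.common.serialization.Serdes
  import org.apache.kafka.streams.kstream.ValueMapper
  import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "article-cleanser")   // assumed application id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // assumed broker address
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder()

  // Consume raw articles, normalize them (a trim/lower-case stand-in here),
  // and publish the cleansed content to a new topic for the next stage.
  builder
    .stream[String, String]("articles")
    .mapValues(new ValueMapper[String, String] {
      override def apply(body: String): String = body.trim.toLowerCase
    })
    .to("cleansed-articles")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
}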

Event Sourcing

Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka’s support for very large stored log data makes it an excellent backend for an application built in this style.

Commit Log

Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. The log compaction feature in Kafka helps support this usage. In this usage Kafka is similar to the Apache BookKeeper project.

cassandra – write consistency

The consistency level determines the number of replicas on which the write must succeed before returning an acknowledgment to the client application.

Write Consistency Levels

ALL
Description: A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition key.
Usage: Provides the highest consistency and the lowest availability of any other level.

EACH_QUORUM
Description: Strong consistency. A write must be written to the commit log and memtable on a quorum of replica nodes in all data centers.
Usage: Used in multiple data center clusters to strictly maintain consistency at the same level in each data center. For example, choose this level if you want a read to fail when a data center is down and the QUORUM cannot be reached on that data center.

QUORUM
Description: A write must be written to the commit log and memtable on a quorum of replica nodes.
Usage: Provides strong consistency if you can tolerate some level of failure.

LOCAL_QUORUM
Description: Strong consistency. A write must be written to the commit log and memtable on a quorum of replica nodes in the same data center as the coordinator node. Avoids the latency of inter-data-center communication.
Usage: Used in multiple data center clusters with a rack-aware replica placement strategy, such as NetworkTopologyStrategy, and a properly configured snitch. Use to maintain consistency locally (within the single data center). Can be used with SimpleStrategy.

ONE
Description: A write must be written to the commit log and memtable of at least one replica node.
Usage: Satisfies the needs of most users because consistency requirements are not stringent.

TWO
Description: A write must be written to the commit log and memtable of at least two replica nodes.
Usage: Similar to ONE.

THREE
Description: A write must be written to the commit log and memtable of at least three replica nodes.
Usage: Similar to TWO.

LOCAL_ONE
Description: A write must be sent to, and successfully acknowledged by, at least one replica node in the local data center.
Usage: In multiple data center clusters, a consistency level of ONE is often desirable, but cross-DC traffic is not. LOCAL_ONE accomplishes this. For security and quality reasons, you can use this consistency level in an offline data center to prevent automatic connection to online nodes in other data centers if an offline node goes down.

ANY
Description: A write must be written to at least one node. If all replica nodes for the given partition key are down, the write can still succeed after a hinted handoff has been written. If all replica nodes are down at write time, an ANY write is not readable until the replica nodes for that partition have recovered.
Usage: Provides low latency and a guarantee that a write never fails. Delivers the lowest consistency and highest availability.

SERIAL
Description: Achieves linearizable consistency for lightweight transactions by preventing unconditional updates.
Usage: You cannot configure this level as a normal consistency level (the consistency level field at the driver level); you configure it using the serial consistency field as part of the native protocol operation. See failure scenarios.

LOCAL_SERIAL
Description: Same as SERIAL, but confined to the data center. A write must be written conditionally to the commit log and memtable on a quorum of replica nodes in the same data center.
Usage: Same as SERIAL. Used for disaster recovery. See failure scenarios.

Even at low consistency levels, the write is still sent to all replicas for the written key, even replicas in other data centers. The consistency level just determines how many replicas are required to respond that they received the write.
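
A minimal sketch of setting the level per write, using the DataStax Java driver 4.x from Scala; the keyspace, table, and values are made up for illustration:

object QuorumWrite extends App {

  import com.datastax.oss.driver.api.core.{ConsistencyLevel, CqlSession}
  import com.datastax.oss.driver.api.core.cql.SimpleStatement

  // Connects to 127.0.0.1:9042 by default; point it at a real cluster as needed.
  val session = CqlSession.builder().build()

  // The coordinator acknowledges this write only after a quorum of replicas
  // in the local data center have written it to their commit log and memtable.
  val insert = SimpleStatement
    .builder("INSERT INTO demo.users (id, name) VALUES (?, ?)")
    .addPositionalValues(java.util.UUID.randomUUID(), "alice")
    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)
    .build()

  session.execute(insert)
  session.close()
}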


cassandra – writes

About writes

To manage and access data in Cassandra, it is important to understand how Cassandra writes and reads data, the hinted handoff feature, and the areas of conformance and non-conformance to the ACID (atomic, consistent, isolated, durable) database properties. In Cassandra, consistency refers to how up-to-date and synchronized a row of data is on all of its replicas.

Cassandra includes client utilities and application programming interfaces (APIs) for developing applications for data storage and retrieval.

scala – List – apply, indices

One reason why random element selection is less popular for lists than for arrays is that xs(n) takes time proportional to the index n. In fact, apply is simply defined by a combination of drop and head:

xs apply n equals (xs drop n).head
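
For example, with a small throwaway list, the two expressions pick out the same element:

val xs = List('a', 'b', 'c', 'd')

println(xs(2))              // c
println((xs drop 2).head)   // c – the same element, reached by dropping the first two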


The indices method returns a list consisting of all valid indices of a given list:

object Lists_Random extends App {

  val lst = List(1, 2, 3, 4)

  println(lst.apply(1))
  println(lst.indices)

}


2
Range(0, 1, 2, 3)

The value returned by indices is a scala.collection.immutable.Range.

scala – List – take, drop, splitAt

The expression “xs take n” returns the first n elements of the list xs.

If n is greater than xs.length, the whole list xs is returned.


The operation “xs drop n” returns all elements of the list xs except the first n ones.

If n is greater than xs.length, the empty list is returned.


The splitAt operation splits the list at a given index, returning a pair of two lists. It is defined by the equality:
xs splitAt n equals (xs take n, xs drop n)


object List_TakeDropSplitAt extends App {

  val v1 = List(1, 2, 3)

  println(v1.take(1))
  println(v1.take(10))

  println(v1.drop(1))

  println(v1.splitAt(1))

}

List(1)
List(1, 2, 3)
List(2, 3)
(List(1),List(2, 3))

scala – List patterns

List(…) is an instance of a library-defined extractor pattern.

The “cons” pattern x :: xs is a special case of an infix operation pattern. You know already that, when seen as an expression, an infix operation is equivalent to a method call.


When seen as a pattern, an infix operation such as p op q is equivalent to op(p, q). That is, the infix operator op is treated as a constructor pattern. In particular, a cons pattern such as x :: xs is treated as ::(x, xs).

There is a class named :: that corresponds to the pattern constructor. It is named scala.:: and is exactly the class that builds nonempty lists.

So :: exists twice in Scala, once as a name of a class in package scala, and again as a method in class List. The effect of the method :: is to produce an instance of the class scala.::.
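
A small sketch pulling both pattern forms together (the list and the printed messages are made up for illustration):

object List_Patterns extends App {

  val xs = List(1, 2, 3)

  // The cons pattern x :: rest is shorthand for the constructor pattern ::(x, rest).
  xs match {
    case Nil       => println("empty list")
    case x :: rest => println(s"head $x, tail $rest")   // prints: head 1, tail List(2, 3)
  }

  // List(...) is a library-defined extractor that matches lists of exactly that length.
  val List(a, b, c) = xs
  println(a + b + c)                                     // prints: 6

}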