How Apache Druid Becomes a Game Changer in Big Data Pipelines

Teepika R M
6 min read · Jul 28, 2022


About Apache Druid

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics (“OLAP” queries) on large data sets. It is one of the most popular open source solutions for OLAP. It is designed to serve both real-time data (from streaming sources like Kafka and Kinesis) and historical data (from batch sources like HDFS and S3).

Special features of Apache Druid

  • Supports time-based partitioning, column-oriented storage, and search indexes in a single system, combining ideas from timeseries databases, column-oriented analytic databases, and search systems.
  • Supports real-time data ingestion
  • Sub-second query performance
  • High concurrency
  • High uptime
  • Supports complex OLAP queries on very large data sets

Let’s see each term in detail for better understanding.

Time-based partitioning:

In Druid, data is stored in datasources, which are similar to tables in an RDBMS. Each datasource is partitioned by time and, if needed, can be further partitioned (secondary partitioning) by other attributes or dimensions. Each partitioned time range is called a “chunk”, and each chunk holds one or more segments depending on the size of the segment files. Please refer to the diagram below to understand how data is partitioned by time and stored in Druid.
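To make this concrete, here is a rough sketch of the part of an ingestion spec that controls time partitioning, written as a Python dict (the field names follow Druid’s native ingestion spec; the interval shown is only an illustrative placeholder):

# Hypothetical granularitySpec fragment of a Druid ingestion spec.
# segmentGranularity decides the size of each time chunk (one chunk per day here),
# queryGranularity decides how finely timestamps are truncated within segments.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "day",              # primary (time) partitioning: one chunk per day
    "queryGranularity": "hour",               # timestamps floored to the hour
    "rollup": False,
    "intervals": ["2022-01-01/2022-01-08"],   # illustrative interval only
}

Secondary partitioning on another dimension can be configured separately (for example through a partitionsSpec), but the primary partitioning is always time.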

Though Druid takes ideas from timeseries database concepts, it offers more functionality than a typical timeseries database like InfluxDB or Prometheus. Timeseries databases generally partition data by time and provide aggregations over numeric values, but not complex analytics. Druid, being an analytics engine at heart, offers multi-dimensional group-bys on non-time columns, fast slice-and-dice analytics, and fast search and filtering through inverted indexes.

Column-oriented storage:

Druid uses column-oriented storage. This enables retrieving only the relevant columns while querying, which greatly improves query performance. Depending on each column’s type, a suitable compression method is chosen to reduce the space needed to store that column. For example, storing strings directly is costly; instead, they can be dictionary-encoded.
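A toy illustration of why this helps, in plain Python (this is not Druid’s storage format, just the idea): with a column-oriented layout, an aggregation over one column never has to touch the bytes of the other columns.

# Row-oriented layout: every full row must be read even if we only need "clicks".
rows = [
    {"timestamp": "2011-01-01T00:00:00", "domain": "bieber.com", "clicks": 1},
    {"timestamp": "2011-01-01T00:00:01", "domain": "kesha.com", "clicks": 0},
]
total_clicks_row_oriented = sum(r["clicks"] for r in rows)

# Column-oriented layout: each column is stored (and compressed) separately,
# so SUM(clicks) scans only the "clicks" column.
columns = {
    "timestamp": ["2011-01-01T00:00:00", "2011-01-01T00:00:01"],
    "domain": ["bieber.com", "kesha.com"],
    "clicks": [1, 0],
}
total_clicks_column_oriented = sum(columns["clicks"])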

Search Indexes:

Druid uses inverted indexes (in particular, compressed bitmaps) for fast searching and filtering. As mentioned earlier, each chunk has one or more segments in it. Within segment files, data is stored partitioned by time. Each segment file has an internal structure in which every column is laid out in its own set of data structures.

Fig 1. Internal structure of Data

Timestamp and metric column values are compressed with LZ4. Metric columns usually hold numeric values and are used in aggregations and computations. Dimension columns are different, since they support filter and group-by operations. Each dimension is stored with the help of three data structures:

Please find below the three data structures for the column “Page” in the Dimensions section of Fig 1.

  1. A dictionary that maps values to IDs
Dictionary that encodes column values 
{ “Justin Bieber”: 0, “Ke$ha”: 1 }

2. A list of the column’s values, encoded using the dictionary’s IDs (one entry per row).

Column data [0, 0, 1, 1]

3. A bitmap for each distinct column value, indicating which rows contain that value.

Bitmaps — one for each unique value of the column 
value=”Justin Bieber”: [1,1,0,0]
value=”Ke$ha”: [0,0,1,1]

These bitmaps are inverted indexes that enable faster querying and reduce storage space through compression.
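The following plain-Python sketch (not Druid’s actual code) shows how all three structures could be built for the example column above:

# Build the dictionary (#1), the encoded column data (#2) and the per-value
# bitmaps (#3) for the "Page" column values used in the example.
page_values = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]

dictionary = {}      # 1. value -> integer ID
column_data = []     # 2. dictionary-encoded column (one ID per row)
for value in page_values:
    dictionary.setdefault(value, len(dictionary))
    column_data.append(dictionary[value])

bitmaps = {value: [0] * len(page_values) for value in dictionary}  # 3. one bitmap per value
for row, value in enumerate(page_values):
    bitmaps[value][row] = 1

# dictionary  == {"Justin Bieber": 0, "Ke$ha": 1}
# column_data == [0, 0, 1, 1]
# bitmaps     == {"Justin Bieber": [1, 1, 0, 0], "Ke$ha": [0, 0, 1, 1]}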

For filter queries with conditions like AND/OR/NOT, answers can be retrieved with the help of data structures #1 and #3.

#1 to look up the encoded IDs of the filter values.
#3 to fetch the rows that contain those values.

For group-by queries with filters, answers can be retrieved with the help of data structures #1, #2 and #3.

#1 to look up the encoded IDs of the filter values.
#3 to fetch the rows that contain those values.
#2 to group the matching rows by their column values.

With the help of these indexes, Druid gives faster results for filter/group-by operations by accessing only the relevant rows & columns.
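As a small continuation of the sketch above (plain Python, only to illustrate the idea), an OR filter is just a row-wise OR of two bitmaps, and a filtered group-by only touches the matching rows:

# The three structures for the example "Page" column.
dictionary = {"Justin Bieber": 0, "Ke$ha": 1}
column_data = [0, 0, 1, 1]
bitmaps = {"Justin Bieber": [1, 1, 0, 0], "Ke$ha": [0, 0, 1, 1]}

# Filter: Page = "Justin Bieber" OR Page = "Ke$ha"
# -> combine the two bitmaps with a row-wise OR (data structures #1 and #3).
matching = [a | b for a, b in zip(bitmaps["Justin Bieber"], bitmaps["Ke$ha"])]
# matching == [1, 1, 1, 1]

# Group-by on the filtered rows: use the encoded column data (#2) to count
# rows per value, visiting only the rows whose bit is set.
id_to_value = {v: k for k, v in dictionary.items()}
counts = {}
for row, bit in enumerate(matching):
    if bit:
        value = id_to_value[column_data[row]]
        counts[value] = counts.get(value, 0) + 1
# counts == {"Justin Bieber": 2, "Ke$ha": 2}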

Real time data ingestion:

Druid supports data ingestion from streaming sources like Apache Kafka, Amazon Kinesis, etc. It reads events directly from these sources as they arrive.
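As a hedged sketch of what that looks like in practice (the field names follow Druid’s Kafka supervisor spec, but the datasource, topic and broker address below are placeholders), streaming ingestion is typically started by submitting a supervisor spec:

import json
import requests

# Hypothetical Kafka ingestion supervisor spec, expressed as a Python dict.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",                              # placeholder datasource
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["domain", "gender"]},
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "minute"},
        },
        "ioConfig": {
            "topic": "clickstream-events",                            # placeholder Kafka topic
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submit the spec to the Overlord (reachable through the Router on its default port
# in a typical quickstart setup); Druid then launches tasks that read the topic continuously.
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor",
    data=json.dumps(supervisor_spec),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()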

Sub-second query performance:

Druid is designed for cases where fast queries and fast ingestion really matter. Good query performance is achieved through features like column-oriented storage (better compression and access to only the needed columns), inverted indexes that prevent unnecessary scans, pre-aggregation through roll-ups, caching, and time-based partitioning (which powers time-based queries).

Druid uses three different techniques to maximize query performance:

  • Pruning the set of segments accessed for a query.
  • Within each segment, using indexes to identify which rows must be accessed.
  • Within each segment, only reading the specific rows and columns that are relevant to a particular query.
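For instance, a query along the lines of the sketch below (submitted to Druid’s SQL endpoint; the datasource and column names are placeholders) benefits from all three techniques: the filter on __time lets Druid skip every segment whose time chunk falls outside the interval, the inverted index on the gender dimension narrows down the rows inside the surviving segments, and only the referenced columns are read.

import requests

# A Druid SQL query with a time filter and a dimension filter, posted to the SQL API.
sql = """
SELECT domain, SUM(clicks) AS total_clicks
FROM clickstream
WHERE __time >= TIMESTAMP '2022-07-01 00:00:00'
  AND __time <  TIMESTAMP '2022-07-02 00:00:00'
  AND gender = 'Female'
GROUP BY domain
"""

resp = requests.post("http://localhost:8888/druid/v2/sql", json={"query": sql})
print(resp.json())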

High concurrency:

Druid supports queries from hundreds or thousands of concurrent users.

High uptime:

Druid is designed to have no single point of failure. Its architecture involves different types of nodes that are fairly independent of each other with minimal interaction. So, intra-cluster communication failures have minimal impact on data availability.

Roll-ups:

Roll-up is a feature that can be enabled or disabled during data ingestion to dramatically reduce the size of the data that needs to be stored. It in turn saves storage resources and yields faster query results.

In many cases, individual events are not needed for analysis, because trillions of events may be generated and a summary of the data gives better insight than the raw events. This summarization is made possible in Druid by a process called “roll-up”. An example will make this clearer.

Consider the following example, where each event has a timestamp, domain, gender, and whether a click happened or not. The left table represents individual events recorded from Jan 1st 2011, 00:00:00 to Jan 1st 2011, 01:00:00. These events, recorded at second-level granularity, do not add much value individually, so roll-up can be enabled during ingestion to summarize them at the hour level. The right table shows the summarized data: the number of clicks per gender and domain. With roll-up enabled, however, we lose the individual events and they can no longer be queried. The roll-up granularity is the minimum granularity at which you will be able to explore the data, and events are floored to this granularity.

Source: Interactive Exploratory Analytics with Druid | DataEngConf SF ’17. Left table: individual events. Right table: summarized events.
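A hedged sketch of how roll-up is configured at ingestion time (the field names follow Druid’s ingestion spec; the metric and column names are taken from the example above):

# Hypothetical ingestion-spec fragments enabling roll-up at hour granularity.
# Incoming events are floored to the hour, and rows with the same
# (timestamp, domain, gender) combination are collapsed into one pre-aggregated row.
granularity_spec = {
    "segmentGranularity": "day",
    "queryGranularity": "hour",   # the roll-up granularity: events are floored to the hour
    "rollup": True,
}

metrics_spec = [
    {"type": "count", "name": "count"},                              # how many raw events were collapsed
    {"type": "longSum", "name": "clicks", "fieldName": "clicked"},   # total clicks per group
]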

Use case examples:

  • Understanding user activity and behavior: Druid is used for streams like clickstreams, viewstreams, and activity streams. It helps in understanding user behavior such as usage patterns and engagement, and in comparing user activity by gender, age, etc.
  • IoT and device metrics: Druid is often used as a time series solution. Data generated by devices can be ingested in real time for ad-hoc analytics, and Druid lets you search and filter on tag values faster than traditional timeseries databases.
  • OLAP and business intelligence: Known for its high concurrency and sub-second queries, Druid is a common choice for OLAP and BI use cases. Interactive data exploration through a UI makes it suitable for visual analytics.

Conclusion:

Druid is often used to power UIs where an interactive, consistent user experience is desired, and it can be deployed on commodity hardware, both in the cloud and on premises. Combining ideas from timeseries databases, column-oriented analytic databases, and search systems in a single real-time database is what makes Apache Druid so powerful in big data systems.

