How Apache Druid differs from other Big Data Giants

Teepika R M
4 min read · Aug 1, 2022

For a better understanding of a tool/technology, it is important to compare & contrast it with similar tools in the market. That gives us a clear picture of where & when to use the tool. Apache Druid is a real-time analytics database used by many companies such as Netflix, Lyft, Salesforce, Pinterest, Walmart, Airbnb, and Instacart. In this post, we will compare it against other big data giants like Spark and Hive, and against other types of databases like time-series databases and data warehouses, to get hold of where Apache Druid fits in a big data pipeline.

Most Data Engineers, and many Software Engineers, have either used or heard of Apache Spark for large-scale data processing and Apache Hive as a distributed, fault-tolerant data warehouse system. Let's start with the compare & contrast of Spark & Druid. Before jumping into the technical differences, I want to give a heads-up that this is an apples-to-oranges comparison.

Apache Druid vs Apache Spark

Druid & Spark can complement each other in a big data solution for a great user experience. Apache Spark is built to clean & transform huge volumes of data as per business requirements. Being an in-memory data engine, Spark can quickly perform processing tasks on very large data sets. The major benefits of using Spark are speedy data processing by exploiting in-memory computing and other optimizations, easy-to-use APIs that enable painless development, and a unified engine supporting SQL queries, streaming data, machine learning, and graph processing. Druid, on the other hand, is known for special features like time-based partitioning, column-oriented storage & search indexes in a single system, real-time data ingestion, sub-second query performance, high concurrency, and high uptime.

Basically, Spark is a large-scale data processing framework, in most cases used to build ETL pipelines, whereas Druid is a real-time analytics database designed for fast slice-and-dice analytics ("OLAP" queries) on large data sets.
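
To make the "unified engine" point concrete, here is a minimal PySpark sketch (the file name and column are hypothetical) where one SparkSession serves both the DataFrame API and SQL:

```python
from pyspark.sql import SparkSession

# A minimal sketch, assuming a local Spark install and a hypothetical
# events.csv with a "country" column: the same SparkSession serves
# both DataFrame transformations and SQL queries.
spark = SparkSession.builder.appName("unified-demo").getOrCreate()

df = spark.read.option("header", True).csv("events.csv")
df.createOrReplaceTempView("events")

# Same aggregation, two APIs: the DataFrame API and Spark SQL.
df.groupBy("country").count().show()
spark.sql("SELECT country, COUNT(*) AS cnt FROM events GROUP BY country").show()

spark.stop()
```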

Apache Druid vs Apache Hive

Apache Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It supports the analysis of large datasets stored in file systems like HDFS, S3, and Google File System without the need to implement queries against low-level Java APIs; instead, they are expressed in HiveQL (SQL-like queries). The key contrast: Hive compiles queries into batch jobs over files, so latencies run from seconds to minutes, while Druid is built for sub-second interactive queries over pre-ingested, indexed data.
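
For a feel of HiveQL, here is a hedged sketch using the third-party PyHive client (the host, database, and table names are hypothetical; install with `pip install "pyhive[hive]"`):

```python
from pyhive import hive  # third-party Python client for HiveServer2

# Connect to a hypothetical HiveServer2 endpoint.
conn = hive.Connection(host="hive-server.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# HiveQL looks like SQL but compiles to a batch job over files in
# HDFS/S3, so expect seconds-to-minutes latency, not Druid's sub-second.
cursor.execute("""
    SELECT country, COUNT(*) AS page_views
    FROM web_logs
    GROUP BY country
""")
for country, views in cursor.fetchall():
    print(country, views)
conn.close()
```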

How does Druid differ from other OLAP data warehouses?

Druid combines ideas not only from traditional data warehouses but also from other streams like time-series databases and search systems. It can handle more use cases than a traditional data warehouse system.

Compared to a traditional data warehouse, Druid provides sub-second latency for both complex querying and data ingestion. It supports batch data sources as well as streaming data sources like Kafka and Kinesis for ingestion. With time-based partitioning, it supports performant time-based queries. With indexing, it provides fast search & filter. It also supports semi-structured and nested data.
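
As an illustration of streaming ingestion, here is a hedged sketch of submitting a Kafka supervisor spec to Druid's Overlord API (hosts, topic, datasource, and column names are hypothetical; the field names follow the Druid docs and may vary by version):

```python
import requests

# A minimal Kafka ingestion (supervisor) spec: where to read from
# (ioConfig), how to interpret rows (dataSchema), and time-based
# partitioning (granularitySpec) -- the feature called out above.
spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "web_logs",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "page", "user_id"]},
            "granularitySpec": {
                "segmentGranularity": "HOUR",   # time-based partitioning
                "queryGranularity": "MINUTE"
            }
        },
        "ioConfig": {
            "topic": "web-logs",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "kafka.example.com:9092"}
        },
        "tuningConfig": {"type": "kafka"}
    }
}

# Hypothetical Overlord host; /druid/indexer/v1/supervisor is Druid's
# standard endpoint for submitting streaming ingestion supervisors.
resp = requests.post(
    "http://druid-overlord.example.com:8081/druid/indexer/v1/supervisor",
    json=spec, timeout=30)
resp.raise_for_status()
print(resp.json())  # returns the supervisor id on success
```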

How does it differ from time-series databases?

Though Druid borrows ideas from time-series databases, it offers more complex functionality than a time-series database like InfluxDB or Prometheus. Time-series databases generally partition data by time and provide aggregations over numeric measures, but not complex analytics. Druid, being an analytics engine at heart, offers multi-dimensional group-bys on non-time columns, fast slice-and-dice analytics, and fast search & filter through inverted indexes.
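
To illustrate a multi-dimensional group-by on non-time columns, here is a hedged sketch posting a Druid SQL query to the standard /druid/v2/sql/ endpoint (the host, datasource, and columns are hypothetical):

```python
import requests

# Group by two non-time dimensions over the last day -- the kind of
# slice-and-dice most time-series databases don't offer. __time is
# Druid's built-in timestamp column.
query = """
SELECT country, browser, COUNT(*) AS views
FROM web_logs
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY country, browser
ORDER BY views DESC
LIMIT 10
"""

resp = requests.post(
    "http://druid-router.example.com:8888/druid/v2/sql/",
    json={"query": query}, timeout=30)
resp.raise_for_status()
for row in resp.json():  # one JSON object per result row by default
    print(row)
```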

Where does Apache Druid fit in a big data pipeline?

[Figure: Sample big data pipeline]

To simplify the explanation with a use case, consider Apache Spark being used to extract data from various sources like MySQL, HDFS, HBase, and Kafka, clean & transform it as per business requirements, and write the results to a file system like HDFS or S3, from where the data is loaded into Apache Druid for complex querying with sub-second latency. Data served from Druid then powers business intelligence tools like Superset for data exploration and visualization.
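
Here is a hedged PySpark sketch of the Spark leg of that pipeline (paths and columns are hypothetical): read raw events, clean & transform, and write Parquet to shared storage that a Druid batch ingestion task can then index:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean-web-logs").getOrCreate()

# Extract: raw JSON events landed in HDFS (hypothetical path).
raw = spark.read.json("hdfs:///raw/web_logs/")

# Transform: keep successful requests, parse the event time, project columns.
clean = (raw
         .filter(F.col("status") == 200)
         .withColumn("event_time", F.to_timestamp("timestamp"))
         .select("event_time", "country", "page", "user_id"))

# Load: write Parquet to shared storage; a Druid batch ingestion task
# (e.g., index_parallel) can then index these files, and Superset
# connects to Druid for exploration and visualization.
clean.write.mode("overwrite").parquet("hdfs:///clean/web_logs/")

spark.stop()
```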
