Spark Structured Streaming — Fault Tolerance

Teepika R M
4 min read · Mar 22, 2022

Real-time data processing is the process where input flows continuously from sources, is processed in real time or near real time, and output is produced constantly over time. It enables real-time decision making and helps solve critical business requirements instantly, without having to wait for the complete batch of data to become available before running analytics.

Real-time stream processing systems are expected to run 24/7 because of the continuous nature of their input sources, which increases the chances of system failure. There are many real-time stream processing systems available in the market, such as Spark, Flink, and Storm. In this post, let's look at how fault tolerance is handled in Apache Spark. Spark Streaming is a scalable, fault-tolerant real-time data processing engine that natively supports both batch and streaming workloads.

Apache Spark works on a driver-and-executor model. The driver is the process where the main method runs, and it holds full information about the task executors at all times. The driver is the central coordinator and communicates with all the executors to complete the Spark job. The mode (client mode or cluster mode) in which the Spark job is submitted decides where the driver is instantiated. Executors are worker processes that execute the individual tasks and collectively produce the end result for the given Spark job. Since two parties are involved in completing a job, failures are possible either at the driver end or at the executor end. Spark Streaming supports recovery from both scenarios. Concepts like checkpointing and Write Ahead Logs (WAL, or journaling) help make the system fault tolerant.
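For orientation, here is a minimal sketch of a streaming driver program; the app name, batch interval, and socket source are assumptions for illustration. The main method below is the driver; whether it runs on your machine (client mode) or on a node inside the cluster (cluster mode) depends on the deploy mode chosen when the job is submitted, while the per-record work of each micro-batch runs on the executors.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingApp {
  def main(args: Array[String]): Unit = {
    // Driver side: builds the configuration and the StreamingContext (the DAG of DStream operations).
    val conf = new SparkConf().setAppName("fault-tolerance-demo")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches (assumed interval)

    // Executor side: receiving and processing the actual records of each micro-batch.
    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```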

Checkpointing

Checkpointing covers two failure scenarios: driver program failures and stateful transformation failures. Since we already know what the driver process is, let's look at stateful transformation failures.

What are stateful transformations? They are transformations in which processing each micro-batch depends, fully or partially, on previous batches of data. Preserving the state is therefore vital for long-running or complex stateful transformations: if the state is not saved, a failure forces re-computation of all the previous states the current batch depends on. A minimal example is sketched below.
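As a hedged sketch of what such a transformation can look like in the DStream API (the source, key type, and state logic are assumptions for illustration): a running word count keeps a per-key counter that every new micro-batch updates, so each batch depends on the accumulated state of all previous ones.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stateful-demo")
val ssc  = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///tmp/checkpoints") // required for stateful operations (path is an assumption)

// Running word count: the state kept for each word is the total seen so far.
val words  = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))
val counts = words.updateStateByKey[Int] { (newValues: Seq[Int], runningTotal: Option[Int]) =>
  Some(newValues.sum + runningTotal.getOrElse(0))
}
counts.print()
```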

There are two types of checkpointing:

Metadata checkpointing: this handles driver process failures. With metadata checkpointing, all the information needed to restart the streaming computation is saved to a persistent storage system such as HDFS. Metadata here means data about the streaming application itself: the configuration used to create the streaming application, the DAG (Directed Acyclic Graph) generated for its DStream operations, and the incomplete batches that are queued but not yet completed.
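A hedged sketch of how metadata checkpointing is typically wired up: the restarted driver calls StreamingContext.getOrCreate, which rebuilds the context from the checkpoint directory if one exists and otherwise creates a fresh context. The directory path and the setup function below are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // assumed path on fault-tolerant storage

// Called only when no checkpoint exists yet (first run).
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("metadata-checkpoint-demo")
  val ssc  = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)   // enable checkpointing
  // ... define the DStream operations here ...
  ssc
}

// On a driver restart, the configuration, the DStream DAG and the incomplete
// batches are recovered from the checkpoint directory instead of being rebuilt.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```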

Data checkpointing: this handles stateful transformation failures. Here the generated RDDs themselves are saved to reliable storage as part of checkpointing. Because stateful transformations depend on the computations of previous batches, the dependency chain keeps growing over time; without data checkpointing, recovering a failed micro-batch would mean recomputing that whole chain, which greatly increases recovery time.
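As a small, hedged illustration (the interval value is an assumption): the DStream API lets you control how often the generated RDDs are persisted, so the dependency chain is cut at regular intervals.

```scala
import org.apache.spark.streaming.Seconds

// Persist the RDDs of this stateful stream every 30 seconds, so recovery
// never has to replay more than roughly 30 seconds of lineage.
counts.checkpoint(Seconds(30)) // `counts` is the stateful DStream from the earlier sketch
```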

Write Ahead Logs:

Write Ahead Logs also help with driver program failures. We already have metadata checkpointing for that case, so what exactly do Write Ahead Logs add? Let's look in detail to understand the difference.

Whenever the driver program crashes, it kills all the executors, along with their memory and the data buffered in them. Some sources, such as S3, HDFS, or other persistent storage systems, are themselves fault tolerant and can replay the data when the driver program crashes; for such sources, the data that was buffered in executor memory can simply be re-read after a driver failure. But for receiver-based sources like Kafka or Flume, data that has been received and buffered in executor memory but not yet processed can be lost. That buffered data cannot be recovered, which breaks the zero-data-loss guarantee. Write Ahead Logs were introduced to solve this problem and ensure zero data loss.

When Write Ahead Logs are enabled, all the data received from the source is first saved to log files in a fault-tolerant storage system such as HDFS. This makes the received data recoverable after any failure in Spark Streaming. If the receiver is also reliable, it acknowledges the data to the source system only after the data has been written to the log files, so the source can replay just the unacknowledged data after a failure. Thus data can either be recovered from the logs or be resent by the source.
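A hedged sketch of enabling the WAL for a receiver-based stream (the checkpoint path, source, and storage-level choice are assumptions for illustration; the WAL needs a checkpoint directory on fault-tolerant storage):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-demo")
  // Write every received block to the write-ahead log before it is acknowledged.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///tmp/wal-checkpoint") // WAL files are stored under this directory

// With the WAL on, in-memory replication is redundant, so a serialized,
// non-replicated storage level is commonly used for the input stream.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
```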

Conclusion:

The objective of fault tolerance is to keep the system running despite a single point of failure, which in turn ensures high availability and business continuity. Building fault-tolerant systems has become one of the basic requirements for mission-critical applications. As we saw in this post, Spark Streaming has the features needed to make real-time stream processing systems fault tolerant.
