Apache Spark Shines Brighter with Project Tungsten

Teepika R M
May 22, 2022

What is the purpose of Project Tungsten?

Spark workloads are often slowed down by memory or CPU bottlenecks rather than by IO or network communication. The purpose of Project Tungsten is to speed up Spark applications by improving their CPU efficiency and reducing memory pressure.

Why is the CPU the new bottleneck?

Improved hardware has become common:

Aggregate IO bandwidth has grown increasingly large; e.g., instances with 10 Gbps links have become quite common.

High-bandwidth SSDs and striped HDD arrays for storage have also become common.

Spark IO operations have been optimized:

Spark limits the movement of data by pruning (cutting back) unneeded input, which avoids unnecessary disk IO.

Data formats have improved:

Efficient data formats for big files, such as Parquet and other binary formats, have become available.

Serialization and hashing add to the CPU workload.

What is serialization & deserialization?

Serialization is the process of converting an object into a sequence of bytes, which can be persisted to disk or a database, or sent over a stream. The reverse process, creating an object from a sequence of bytes, is called deserialization. In Spark, both happen whenever data is shuffled between nodes.
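As a quick illustration, the round trip can be sketched in a few lines of Python using the standard `pickle` module (a stand-in for Java or Kryo serialization on the JVM; the record shown is made up):

```python
import pickle

# A plain Python object standing in for a record Spark would shuffle.
record = {"id": 42, "name": "abcd"}

# Serialization: object -> bytes, ready to persist or send over the network.
payload = pickle.dumps(record)
assert isinstance(payload, bytes)

# Deserialization: bytes -> an equivalent object.
restored = pickle.loads(payload)
assert restored == record
```

Every such round trip costs CPU cycles, which is exactly the overhead Tungsten tries to shrink.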

What is hashing?

Hashing is used to index and retrieve information from a database because it accelerates lookups: it is much faster to find an item by its short hashed key than by its original value.
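A toy hash table makes the idea concrete (the bucket count and keys here are purely illustrative):

```python
# Hash-based lookup: the hash of a key selects a bucket, so finding a value
# never requires comparing the key against every stored entry.
NUM_BUCKETS = 8

def bucket_of(key):
    return hash(key) % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]
for key, value in [("apple", 1), ("banana", 2), ("cherry", 3)]:
    buckets[bucket_of(key)].append((key, value))

def lookup(key):
    # Only the entries in one bucket are scanned, not the whole table.
    for k, v in buckets[bucket_of(key)]:
        if k == key:
            return v

assert lookup("banana") == 2
```

Computing those hashes for millions of rows is CPU work, which is why hashing shows up alongside serialization as a CPU cost.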

(De)serialization and hashing add to the CPU workload.

Project Tungsten comes to the rescue and helps improve CPU and memory efficiency.

The main idea behind Project Tungsten is to exploit specificity, using the application's semantics and schema, rather than generality; generality incurs a huge cost compared to specificity.

Let's look at the important enhancements introduced by Tungsten in detail:

  • Introduction of a new serialization format called UnsafeRow, more compact and faster than the usual methods (Kryo, Java serialization). The format is called "unsafe" because it operates directly on mutable raw memory.

Project Tungsten uses application semantics to manage memory explicitly and eliminate the need for the JVM object model. It uses the sun.misc.Unsafe API to represent objects. JVM objects are known for their space overhead and garbage-collection costs; Unsafe-based representations consume less space and reduce the CPU burden by avoiding garbage collection (off-heap).

  • Reduced memory footprint thanks to UnsafeRow.

Consider the string "abcd": it takes only 4 bytes in native UTF-8 encoding, whereas the same value occupies about 48 bytes as a Java String (including the object header, hash code, other overhead, and the chars). The UnsafeRow format occupies far fewer bytes than the equivalent Java objects.
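The JVM numbers are hard to demonstrate without Java tooling, but the same kind of overhead is easy to observe in Python, whose boxed strings also carry per-object bookkeeping (the exact sizes are interpreter-specific and differ from the JVM's ~48 bytes, but the principle is identical):

```python
import sys

s = "abcd"

# Raw UTF-8 encoding: exactly 4 bytes for four ASCII characters.
raw = s.encode("utf-8")
assert len(raw) == 4

# The boxed string object carries header and bookkeeping overhead
# on top of the character data itself.
assert sys.getsizeof(s) > len(raw)
```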

  • Support for on-heap and off-heap allocation: since the UnsafeRow format doesn't depend on the JVM object model, it allows Spark to allocate objects in off-heap memory.

Project Tungsten extends support for objects to be allocated in off-heap memory using its own off-heap binary memory management. The sun.misc.Unsafe API is used to build both on-heap and off-heap data structures.
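Off-heap memory use is opt-in; in recent Spark versions it is controlled by a pair of configuration properties (the 2g size below is just an example value):

```
spark.memory.offHeap.enabled   true
spark.memory.offHeap.size      2g
```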

  • Faster execution of sorting and hashing for aggregation, join, and shuffle operations — Spark can often perform SQL operations such as aggregation directly on the serialized form of the data.

Comparison and hashing for aggregation can be done on raw bytes, without deserializing or otherwise interpreting the values first.

  • Less time spent waiting on data fetches from memory, thanks to cache-friendly data layouts. This improvement also contributes to the faster sorting and hashing from the previous point.

With manual memory management, data structures are laid out to reduce space overhead and encourage sequential memory access, which makes scans cache-friendly and avoids bad memory-access patterns.

  • Shuffle optimized thanks to better UnsafeRow throughput.

Shuffle operations are better optimized with the UnsafeRow format, provided the required conditions are satisfied.

  • Whole-Stage Code Generation

Whole-Stage Java Code Generation improves the execution performance of a query by collapsing a query tree into a single optimized function that eliminates virtual function calls and leverages CPU registers for intermediate data.

Whole-Stage Code Generation directly generates bytecode that will be evaluated to produce the result.

What is bytecode?

Java bytecode is the compiled format of Java programs. Once a Java program has been compiled to bytecode, it can be transferred across a network and executed by any Java Virtual Machine (JVM). Bytecode is also platform independent: the JVM translates it into instructions the underlying hardware understands.
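Java tooling aside, the same idea exists in CPython, whose standard `dis` module shows the bytecode a function compiles to (exact opcode names vary across Python versions):

```python
import dis

def add_one(x):
    return x + 1

# The interpreter compiles source to bytecode; dis renders it readably.
instructions = [ins.opname for ins in dis.get_instructions(add_one)]

# The addition and the return each become a bytecode instruction
# (BINARY_ADD on older Pythons, BINARY_OP on 3.11+).
assert "BINARY_OP" in instructions or "BINARY_ADD" in instructions
assert "RETURN_VALUE" in instructions
```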

To produce a result for an expression, the expression logic must be evaluated. Generic, interpreted evaluation is very expensive on the JVM because of virtual function calls, object creation due to primitive boxing, memory consumption, and so on. With Tungsten codegen, Spark uses the application's internal semantics to generate custom bytecode whose performance almost matches handwritten code.

By generating optimized bytecode, Spark skips the interpreted evaluation of operators, eliminating virtual function calls and the traversal of the expression tree for every single row of data.
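What "collapsing a query tree into a single function" buys can be sketched in plain Python with a toy expression tree (the mini-DSL here is invented for illustration; Spark generates Java bytecode, not Python):

```python
# Interpreted evaluation: walk the expression tree node by node for every row.
def eval_tree(node, row):
    op = node[0]
    if op == "col":
        return row[node[1]]
    if op == "lit":
        return node[1]
    if op == "add":
        return eval_tree(node[1], row) + eval_tree(node[2], row)
    if op == "mul":
        return eval_tree(node[1], row) * eval_tree(node[2], row)

# The expression (a + 1) * b as a tree.
tree = ("mul", ("add", ("col", "a"), ("lit", 1)), ("col", "b"))

# "Code generation": collapse the whole tree into source for one function,
# compile it once, then run it per row with no tree walk and no dispatch.
def codegen(node):
    op = node[0]
    if op == "col":
        return f"row[{node[1]!r}]"
    if op == "lit":
        return repr(node[1])
    left, right = codegen(node[1]), codegen(node[2])
    return f"({left} {'+' if op == 'add' else '*'} {right})"

compiled = eval(f"lambda row: {codegen(tree)}")

row = {"a": 2, "b": 5}
assert eval_tree(tree, row) == compiled(row) == 15
```

The per-row cost of `compiled` is a single function call, while `eval_tree` pays for branching and recursion at every node — the same trade-off Tungsten exploits, only at the level of JVM bytecode.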

Conclusion:

Project Tungsten substantially increases the memory and CPU efficiency of Spark applications. As a result, Spark shines brighter with Tungsten.


Teepika R M

AWS Certified Big Data Specialty | Linux Certified Kubernetes Application Developer | Hortonworks Certified Spark Developer | Hortonworks Certified Hadoop Developer