Apache Spark Optimization — Quasiquotes

What is Java bytecode?

Java bytecode is the compiled format of Java programs. Once a Java program has been compiled to bytecode, it can be transferred across a network and executed by a Java Virtual Machine (JVM). Bytecode is also platform independent: the JVM on each machine translates it into instructions that the underlying hardware understands. Scala programs compile to the same bytecode format, which is why Spark can generate and run bytecode from Scala code.

How does Catalyst support bytecode generation?

Catalyst uses a special feature of Scala, quasiquotes, to make bytecode generation simpler. Quasiquotes allow abstract syntax trees (ASTs) to be constructed programmatically in the Scala language, and these ASTs can then be fed to the Scala compiler at runtime to generate bytecode. Catalyst transforms the tree that represents a SQL expression into an AST for Scala code that evaluates that expression.

Why should we evaluate expressions in SQL?

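Expressions occur most commonly in the output column list and the WHERE clause of SELECT statements. For example, the query below computes an age in its output column and filters rows by date of birth in its WHERE clause: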

SELECT 
(YEAR(death) - YEAR(birth)) - IF(RIGHT(death,5) < RIGHT(birth,5),1,0)
FROM president
WHERE
birth > '1900-1-1';

What are the drawbacks of interpreting expressions in the query?

Without Catalyst's code generation, expressions would have to be interpreted for each row of data. Each node of the expression tree is represented by an object with an evaluate method that defines how to compute the result of the expression for a given row. In interpreted execution, the evaluate method is called on the root, which in turn calls it on each of its children, until the result of the whole expression has been computed. This hurts performance in three ways: it makes many virtual function calls, it fills the evaluation code with if-else branches needed to handle different input data types, and it allocates extra objects for the generic return type.
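To make these costs concrete, here is a minimal sketch of an interpreted evaluator. It is not Spark's actual code: the Row alias and the Expression, Literal, Attribute and Add classes are hypothetical, simplified stand-ins chosen for illustration.

```scala
object InterpretedEval {
  // Hypothetical row type: column name -> boxed value
  type Row = Map[String, Any]

  // Every node exposes a virtual eval method with a generic (boxed) return type
  sealed trait Expression {
    def eval(row: Row): Any
  }

  case class Literal(value: Any) extends Expression {
    def eval(row: Row): Any = value
  }

  case class Attribute(name: String) extends Expression {
    def eval(row: Row): Any = row(name)
  }

  case class Add(left: Expression, right: Expression) extends Expression {
    // Two virtual calls per row, then a branch per supported data type
    def eval(row: Row): Any = (left.eval(row), right.eval(row)) match {
      case (a: Int, b: Int)       => a + b
      case (a: Long, b: Long)     => a + b
      case (a: Double, b: Double) => a + b
      case (a, b) =>
        throw new IllegalArgumentException(s"cannot add $a and $b")
    }
  }
}
```

Calling InterpretedEval.Add(InterpretedEval.Literal(1), InterpretedEval.Attribute("x")).eval(row) walks the whole tree again for every row, which is the overhead that generated code avoids.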

What is an Abstract Syntax Tree (AST)?

Quasiquotes generate ASTs, which simplifies the code generation process. The word "abstract" in AST means that the tree does not represent every detail of the written syntax, such as whitespace or parentheses, but only the structural, content-related details. It is a tree representation of the actual structure of the code.
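As a small illustration, the sketch below builds the AST for the expression 1 + 2 with a quasiquote and then inspects it. It assumes Scala 2 with the scala-reflect library on the classpath; the AstSketch object and the describe helper are names made up for this example.

```scala
import scala.reflect.runtime.universe._

object AstSketch {
  // Build the AST for the expression 1 + 2; nothing is evaluated here
  val tree: Tree = q"1 + 2"

  // Constructor-style rendering of the tree, useful for seeing its shape
  val raw: String = showRaw(tree)

  // Take the tree apart by pattern matching on its nodes:
  // a binary operator is a method call (Apply) on a selected member (Select)
  def describe(t: Tree): String = t match {
    case Apply(Select(lhs, op), List(rhs)) =>
      s"call ${op.decodedName} on $lhs with argument $rhs"
    case other =>
      s"leaf or unsupported node: $other"
  }
}
```

Printing AstSketch.raw shows the nested nodes that make up the expression, and AstSketch.describe(AstSketch.tree) shows that the operator and its operands are separate nodes that a program can inspect or rearrange.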

Sequence of Operation
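The function below is a simplified sketch of how an expression tree is turned into a Scala AST. Strings prefixed with q are quasiquotes, and $ splices a value or another AST into them, so Add(Literal(1), Attribute("x")) becomes the AST for the Scala expression 1 + row.get("x").

// Node, Literal, Attribute and Add are simplified expression-tree classes; AST is a Scala syntax tree.
def compile(node: Node): AST = node match {
  case Literal(value)   => q"$value"           // splice the constant into the code
  case Attribute(name)  => q"row.get($name)"   // read the column from the current row
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"  // compile both operands, then add them
}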
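The compile function above leaves out the surrounding types and the step that turns the AST into runnable code. Here is one self-contained way to fill those gaps, offered as a sketch rather than as Catalyst's implementation. It assumes Scala 2 with scala-reflect and scala-compiler on the classpath (quasiquotes are not available in Scala 3). The names SqlExpr, IntLit, Column, Sum and toFunction are hypothetical; they are renamed from the snippet above to avoid clashing with scala.reflect's own Literal, and the row is modelled as a plain Map[String, Int].

```scala
import scala.reflect.runtime.currentMirror
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

object QuasiquoteSketch {
  // Hypothetical, simplified expression tree (integers only)
  sealed trait SqlExpr
  case class IntLit(value: Int) extends SqlExpr
  case class Column(name: String) extends SqlExpr
  case class Sum(left: SqlExpr, right: SqlExpr) extends SqlExpr

  // Translate the expression tree into a Scala AST
  def compile(node: SqlExpr): Tree = node match {
    case IntLit(value)    => q"$value"      // constant spliced in as a literal
    case Column(name)     => q"row($name)"  // lookup in the row map
    case Sum(left, right) => q"${compile(left)} + ${compile(right)}"
  }

  // Wrap the AST in a function over a row and compile it at runtime
  def toFunction(node: SqlExpr): Map[String, Int] => Int = {
    val toolbox = currentMirror.mkToolBox()
    val fn = q"(row: Map[String, Int]) => ${compile(node)}"
    toolbox.eval(fn).asInstanceOf[Map[String, Int] => Int]
  }
}
```

With these definitions, QuasiquoteSketch.toFunction(Sum(IntLit(1), Column("x"))) returns an ordinary function that can be applied to each row. The tree is compiled once, so evaluating a row no longer walks the expression tree node by node.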

Conclusion:

Quasiquotes are type-checked at compile time to ensure that only appropriate ASTs or literals are substituted in. They produce a valid Scala AST directly, which avoids running the Scala parser at runtime. The Catalyst optimizer uses quasiquotes in this way to generate code that speeds up query execution.

Teepika R M

AWS Certified Big Data Specialty | Linux Certified Kubernetes Application Developer | Hortonworks Certified Spark Developer | Hortonworks Certified Hadoop Developer