Serialization in Apache Spark
Purpose of Serialization and Deserialization:
Generally we store data in in-memory data structures such as lists, arrays, and maps/dictionaries, but transporting data in those forms is inefficient. To transfer data across the network, we need to transform (serialize) it into a byte stream that carries enough information to be transformed back (deserialized) into its original form at the destination.
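To make the idea concrete, here is a minimal Scala sketch (independent of Spark) that serializes a small object to a byte stream with plain Java serialization and restores it; the Point class is a made-up example.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical example class; Scala case classes are Serializable by default
case class Point(x: Int, y: Int)

// Serialize: turn the object into a byte stream
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(Point(1, 2))
oos.close()
val bytes: Array[Byte] = bos.toByteArray

// Deserialize: rebuild the original object from the bytes at the "destination"
val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
val restored = ois.readObject().asInstanceOf[Point] // Point(1,2)
ois.close()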
When serialization/deserialization comes into the picture:
- Transferring data over the network
- Saving data to a file
- Storing data in a database
How does serialization impact the performance of a job?
Caching: To reduce memory usage, data can be stored in memory in serialized form. This helps with memory tuning and improves performance because it shrinks the size of the cached data (see the sketch after these two points).
Data shuffling: When data is shuffled, it has to be serialized and sent over the wire between nodes. Efficient serialization reduces this network bottleneck.
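As a sketch of the caching point above, an RDD can be persisted in serialized form using the MEMORY_ONLY_SER storage level; the data built here is just an illustrative assumption.
import org.apache.spark.storage.StorageLevel

// Assuming an existing SparkContext named sc
val pairs = sc.parallelize(1 to 1000000).map(i => (i, i.toString))

// Store each partition as one serialized byte array:
// more CPU to access, but far less memory than deserialized Java objects
pairs.persist(StorageLevel.MEMORY_ONLY_SER)
pairs.count() // materializes the cached, serialized partitions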
In distributed applications, where the workload is spread across nodes to accomplish a task, serialization is particularly important: the framework has to distribute code and data between the nodes, and it must serialize, deserialize, and send objects over the network quickly.
Apache Spark provides support for two serialization libraries:
Java serialization
This is the default serialization method used by Spark. It works with any class that implements either java.io.Serializable or java.io.Externalizable.
Java serialization is flexible, but it can be slow and produce bulky serialized objects. Note that a class itself is never serialized; only objects (instances) of the class are serialized.
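Since Java serialization is the default, no extra configuration is required; the sketch below simply spells that default out and shows a hypothetical user class that only needs to implement Serializable.
import org.apache.spark.SparkConf

// Hypothetical user class; Java serialization only needs java.io.Serializable
class Person(val name: String, val age: Int) extends Serializable

val conf = new SparkConf()
  .setAppName("java-serialization-example") // assumed app name
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer") // already the default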
Kryo serialization
Kryo is a fast and efficient binary object-graph serialization framework for Java. It can be used whenever objects need to be persisted to a file, to a database, or over the network. Compared to Java serialization, Kryo is usually faster and more compact, but the drawback is that it requires custom class registration.
Whenever performance tuning comes into the picture, you can switch to Kryo serialization since it is faster and more compact. Shuffle-intensive operations such as grouping and joining benefit the most, because the data has to move across the network. Kryo ships with 50+ default serializers for common classes. To register a class, pass it to the registerKryoClasses method. If a class is not registered, Kryo looks for a matching default serializer; if none matches, a global default serializer is used.
Leaving classes unregistered has adverse effects: it has negative security implications that make the application more vulnerable, and it increases the serialized size because the fully qualified class name is written the first time the class appears in each object graph. To ensure every class is registered, set the following property in the Spark configuration:
.set("spark.kryo.registrationRequired", "true")
To register your own custom classes with Kryo, use the registerKryoClasses method.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
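Putting the pieces together, a minimal sketch of a configuration that switches to Kryo and makes registration mandatory might look like the following; the master, app name, and val name are assumptions, and MyClass1/MyClass2 are the placeholder classes from the snippet above.
import org.apache.spark.SparkConf

val kryoConf = new SparkConf()
  .setMaster("local[*]")       // assumption: local mode, just for the sketch
  .setAppName("kryo-example")  // assumed app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // use Kryo instead of the default Java serializer
  .set("spark.kryo.registrationRequired", "true") // fail fast on unregistered classes

// Register application classes so Kryo writes a compact ID instead of the full class name
kryoConf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))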
Since Spark 2.0.0, the Kryo serializer is used internally when shuffling RDDs of simple types, arrays of simple types, or strings, in order to reduce network overhead.
Conclusion:
By supporting both the Java and Kryo serialization libraries, Spark aims to strike a balance between convenience and performance.