Join Strategies in Apache Spark (not the regular join types)!

Teepika R M
4 min read · Jun 22, 2022


Whether you are a software engineer or a data engineer, you will be familiar with join operations when it comes to working with data. Joining is a basic operation we perform to combine data based on business requirements. There are various types of joins, such as Inner Join, Outer Join, Semi Join, and Cross Join. In this post, we will see what Spark actually does internally to perform these joins. Understanding these internals gives users more control and enables them to choose the right join strategy when needed.

A join strategy is chosen based on the following factors:

  1. Data Size
  2. Join type
  3. Join hints that the user specifies

Data Size

If the data size is small and satisfies the conditions for broadcasting, Spark chooses broadcast and hash-based join strategies. Whenever possible, it tries to avoid strategies involving sort/shuffle operations.

Join Type

Joins are broadly classified into two types: Equi joins (=) and Non-Equi joins (<, >, ≥, ≤). If a join is of the non-equi type, then the choice can only be between the Broadcast Nested Loop Join and the Cartesian product join.

Join hints that the user specifies

Spark lets users have more control by specifying join hints for the tables being joined, e.g., /*+ BROADCAST(table) */. From Spark 3.0.0, four join hints are supported: BROADCAST, MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL.

The factors mentioned above influence the join strategy that Spark uses internally to combine the data.

Spark Join Strategies

Broadcast Hash Join

When one of the datasets is small, i.e., its size is less than or equal to the spark.sql.autoBroadcastJoinThreshold parameter, this join strategy is chosen. It is applicable only for equi join types. First, a hash table is created from the small relation and broadcast to all the executors. Then the hash table is probed against the partitions of the large relation on every executor. The broadcast threshold should not be set too high, since the hash table needs to fit into the memory of both the executors and the driver (the driver first caches the hash table and then broadcasts it to the executors).
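To make the build-and-probe idea concrete, here is a toy Python sketch of a broadcast hash join over lists of dict rows. This is an illustration only, not Spark's actual implementation; the function and field names are made up for the example:

```python
def broadcast_hash_join(small, large, key):
    # Build phase: create a hash table over the small ("broadcast")
    # relation, keyed by the join column.
    table = {}
    for row in small:
        table.setdefault(row[key], []).append(row)
    # Probe phase: scan the large relation (in Spark, each executor
    # scans its own partitions) and probe the hash table for matches.
    result = []
    for row in large:
        for match in table.get(row[key], []):
            result.append({**match, **row})
    return result
```

Note that only the small side is ever hashed and shipped around; the large relation stays where it is, which is why this strategy avoids a shuffle entirely.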


When it comes to broadcasting relations, we can easily run into Out Of Memory errors, because the broadcast relation needs to fit into the memory of both the driver and the executors. Also, the more executors there are, the more network traffic is generated, since the relation needs to be broadcast to all of them.

Shuffle Hash Join

It starts with a shuffle phase in which rows from both relations are shuffled, so that records with the same key from both relations end up on the same executor. The next step is the hash join itself: a hash table is built from the small relation on each executor, and a lookup is performed against the larger relation. This type of join is possible only when one of the relations is small enough to build a hash table from it. If you want to use Shuffle Hash Join to avoid the sorting operation, you have to override the default by setting spark.sql.join.preferSortMergeJoin=false.

Spark will choose the Shuffle Hash Join only if:
1. one side of the join is at least three times smaller than the other side, and
2. the average size of each partition of the smaller side is smaller than the autoBroadcastJoinThreshold.
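The shuffle-then-hash sequence described above can be sketched in plain Python (a toy single-machine model, not Spark code; partitions stand in for executors and all names are made up):

```python
def shuffle_hash_join(left, right, key, num_partitions=4):
    # Shuffle phase: rows whose keys hash to the same value land in
    # the same partition (standing in for "the same executor").
    def partition(rows):
        parts = [[] for _ in range(num_partitions)]
        for row in rows:
            parts[hash(row[key]) % num_partitions].append(row)
        return parts

    left_parts, right_parts = partition(left), partition(right)

    # Hash join phase: per partition, build a hash table from the
    # smaller side and probe it with the other side.
    result = []
    for lp, rp in zip(left_parts, right_parts):
        build, probe = (lp, rp) if len(lp) <= len(rp) else (rp, lp)
        table = {}
        for row in build:
            table.setdefault(row[key], []).append(row)
        for row in probe:
            for match in table.get(row[key], []):
                result.append({**match, **row})
    return result
```

No sorting happens anywhere in this flow, which is exactly the trade-off against Sort Merge Join: cheaper CPU, but each per-partition hash table must fit in memory.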


Sort Merge Join

This type of join also starts with a shuffle phase: records with the same key value are shuffled to the same executor. The shuffle is followed by a sort operation that sorts the partitions of both relations on the join key. Once sorted, the two relations are merged (by iterating over the records) to produce the result. Sort Merge Join is the default join strategy and is preferred over Shuffle Hash Join.
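The sort-and-merge part can be sketched as follows (a single-machine toy in plain Python with the shuffle step omitted; the names are made up for the example):

```python
def sort_merge_join(left, right, key):
    # Sort phase: both relations are sorted on the join key.
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    # Merge phase: advance two cursors; on equal keys, emit every
    # pairing of the current left row with the matching right-side run.
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            k = j
            while k < len(right) and right[k][key] == left[i][key]:
                result.append({**left[i], **right[k]})
                k += 1
            i += 1
    return result
```

Because both sides are consumed in key order, neither relation has to fit in memory as a hash table, which is why this strategy scales to two large relations.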


Broadcast Nested Loop Join

In this type of join, the smaller relation is broadcast to every executor node, followed by a nested loop join: every record from one relation is matched against every record from the other relation. It supports both equi and non-equi joins.
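A minimal sketch of the nested loop itself, with an arbitrary predicate to show why non-equi joins work here (plain Python, illustrative names only, not Spark's implementation):

```python
def broadcast_nested_loop_join(small, large, predicate):
    # The small relation is the "broadcast" side; every row of the
    # large relation is compared against every row of the small one,
    # so any predicate (not just equality) is supported.
    return [
        {**s, **l}
        for l in large
        for s in small
        if predicate(s, l)
    ]

# Non-equi example: assign each value to the range that contains it.
ranges = [{"lo": 0, "hi": 10, "bucket": "small"},
          {"lo": 10, "hi": 100, "bucket": "big"}]
values = [{"v": 5}, {"v": 50}]
matched = broadcast_nested_loop_join(
    ranges, values, lambda s, l: s["lo"] <= l["v"] < s["hi"])
```

The flexibility comes at a cost: the comparison count is the product of the two row counts, so this strategy is only viable when one side is small.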


Cartesian Join

To use a Cartesian join, we have to set spark.sql.crossJoin.enabled=true in our session. If no join keys are specified and the join type falls under inner join, the Cartesian join is used. It is basically the product of the two join relations. It hurts performance when the relations are large, so it should be avoided whenever possible.
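In plain Python, the product of two relations looks like this (a sketch of the concept; Spark's Cartesian join is of course distributed, but the output size is the same |left| × |right|):

```python
from itertools import product

def cartesian_join(left, right):
    # Emit every pairing of rows from the two relations.
    return [(l, r) for l, r in product(left, right)]
```

The quadratic output size is exactly why this strategy is the last resort.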

Conclusion

Don’t miss an opportunity to optimize performance by providing join hints when you understand the nature of the underlying relations. At the same time, developers need to be extra cautious when specifying join hints, since they instruct Spark to work in a way it otherwise would not.
