Materialized views in Hive

Teepika R M
Aug 8, 2022

Table vs View vs Materialized View

Data analytics involves complex querying of data, with joins between tables, aggregations & filters. Very often, the same set of tables is joined again and again to answer business queries. You can define views for those frequent complex joins & aggregations and query them instead of rewriting the definition every time. It also helps hide the complexity of the underlying schema behind the query results.

Conceptually, views & materialized views are the same, with one distinction: for a materialized view the actual data is physically saved to disk, whereas for a view only the definition gets saved & not the data behind it. Views rely on the base tables mentioned in the definition for retrieving results, whereas materialized views rely on the data saved separately on disk.
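
As a minimal sketch, assuming two hypothetical tables sales(store_id, amount) and stores(store_id, store_name) (the names and schemas are made up for illustration, and in practice Hive expects transactional base tables for the rewriting features discussed later), the two definitions look alike but behave very differently:

-- Plain view: only this definition is stored; the join & aggregation run
-- against the base tables every time the view is queried.
CREATE VIEW sales_by_store_v AS
SELECT st.store_name, SUM(s.amount) AS total_sales
FROM sales s JOIN stores st ON s.store_id = st.store_id
GROUP BY st.store_name;

-- Materialized view: the query result is computed once and persisted on
-- disk; reads are served from that stored data until the view is rebuilt.
CREATE MATERIALIZED VIEW sales_by_store_mv AS
SELECT st.store_name, SUM(s.amount) AS total_sales
FROM sales s JOIN stores st ON s.store_id = st.store_id
GROUP BY st.store_name;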

For faster retrieval of results, you might wonder why we can't simply create intermediate tables & save them physically, since that matches the concept of materialized views. But this comes with two challenges: it changes the access pattern, i.e. we have to explicitly redirect queries to those intermediate tables, and we have to make sure the intermediate table data stays fresh, on par with the source base tables.

Materialized views come to the rescue: they provide faster retrieval of results while addressing both of the above challenges.

The picture above depicts the difference between a view & a materialized view.

In Hive, when a materialized view is created, the result of its defining query is saved to disk, with automatic query rewriting enabled by default. With automatic rewriting enabled, the optimizer exploits the materialized view definitions & rewrites incoming queries to read from the relevant views for faster execution, instead of re-executing them from scratch against the base tables. Users can also fire queries directly on a materialized view instead of depending on the optimizer to do the rewriting. Hive will stop using a materialized view for rewriting if its data has not been updated for a long time & is no longer in sync with the source base tables.
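
For instance, rewriting can be toggled per view, and the view can also be queried directly (a sketch reusing the hypothetical sales_by_store_mv from above):

-- Automatic rewriting is on by default, but can be switched per view.
ALTER MATERIALIZED VIEW sales_by_store_mv DISABLE REWRITE;
ALTER MATERIALIZED VIEW sales_by_store_mv ENABLE REWRITE;

-- Querying the materialized view directly, without relying on the optimizer.
SELECT store_name, total_sales
FROM sales_by_store_mv
WHERE total_sales > 10000;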

The main objective of materialized views is to boost the performance of complex queries without requiring any special maintenance effort. Still, after the initial creation of a materialized view, modifications in the source tables are not automatically reflected in the view. The user has to explicitly rebuild it with the following command,

ALTER MATERIALIZED VIEW mv REBUILD;

When a rebuild is initiated, if new data was added to the source tables using INSERT statements, Hive triggers an incremental rebuild of the view, but in case of UPDATE / DELETE statements a full rebuild is required. The full rebuild is implemented as an INSERT OVERWRITE using the materialized view's definition.
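
As a rough illustration, assuming the hypothetical transactional sales table from above:

-- New rows appended with INSERT allow the next REBUILD to run incrementally.
INSERT INTO sales VALUES (1, 250.0);
ALTER MATERIALIZED VIEW sales_by_store_mv REBUILD;

-- After an UPDATE or DELETE on a base table, the same command falls back to
-- a full rebuild (internally an INSERT OVERWRITE of the view's definition).
UPDATE sales SET amount = 300.0 WHERE store_id = 1;
ALTER MATERIALIZED VIEW sales_by_store_mv REBUILD;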

For rewriting queries using materialized views, Hive requires them to be up to date, i.e. by default Hive doesn't use stale views for query rewriting. The freshness of non-transactional tables cannot be determined, so views defined on them are in general not used for rewriting at all. Sometimes you want to force Hive to use stale views, or possibly stale views like the ones defined on non-transactional tables, for query rewriting. In that case, you run the rebuild operation periodically & set the parameter hive.materializedview.rewriting.time.window=5min (or any window you wish), so that Hive keeps using a view for rewriting as long as its last rebuild falls within that window.
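
A sketch of that setup (the 5-minute window is just an example value; a real setup would run the rebuild from a scheduled job):

-- Treat a materialized view as fresh enough for rewriting for up to
-- 5 minutes after its last rebuild.
SET hive.materializedview.rewriting.time.window=5min;

-- Rebuild periodically so the view keeps falling inside that window.
ALTER MATERIALIZED VIEW sales_by_store_mv REBUILD;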

When a rebuild can run incrementally, it also preserves the data that already exists in the LLAP cache for that materialized view. Which brings up another question: what is LLAP?

LLAP (Live Long And Process)

Consider LLAP as a layer of caching & in-memory processing that works with the query engine. It is a YARN application that runs on the data nodes of the cluster. As the name suggests, it is a long-living process, but it doesn't eat up all the resources, since it dynamically scales up or down based on the queries' requirements. It is not tied to a single user or application, which makes it distinct from the Spark engine. This means LLAP caches data for reuse & that cache can be shared across users (unlike Spark) for querying. For example, if userA fires a query on tableA and userB fires a different query on the same tableA, both requests get served by the same LLAP cache. It boosts performance by minimizing data requests to HDFS. The query engine orchestrates which parts of a query get served by LLAP.
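
On clusters where LLAP is set up, queries are pointed at the LLAP daemons through settings along these lines (a sketch; the exact configuration depends on your distribution):

-- Run query fragments inside the LLAP daemons instead of plain YARN containers.
SET hive.execution.mode=llap;
-- Allow all work, not just map-side operators, to execute within LLAP.
SET hive.llap.execution.mode=all;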

Conclusion:

The post clarified the difference between tables, views & materialized views, how materialized views boost performance with little maintenance effort, and the concept of rebuilding the views, and concluded with an introduction to the optimization feature LLAP.

