Both Apache Kylin and ClickHouse are popular big data OLAP engines in the current market; Kylin was originally developed by eBay China R&D Center, open sourced in 2014 and contributed to the Apache Software Foundation, with sub-second query capabilities and ultra-high concurrent query capabilities , Adopted by many major manufacturers, including Meituan, Didi, Ctrip, Shell Search, Tencent, 58.com, etc.;
ClickHouse, the hottest in the OLAP field in the past two years, was developed by Russian search giant Yandex and open sourced in 2016. Typical users include famous companies such as Bytedance, Sina and Tencent.
What are the differences between these two OLAP engines, what are their advantages, and how to choose? This article will try to compare these two OLAP engines in terms of technical principles, storage structure, optimization methods, and advantageous scenarios, and provide some reference for your technology selection.
Technical principles, we mainly from architecture and ecological aspects to be compared.
1.1 Technical architecture
Kylin is based on Hadoop-based MOLAP (Multi-dimensional OLAP) technology. The core technology is OLAP Cube ; unlike traditional MOLAP technology, Kylin runs on Hadoop, a powerful and scalable platform, which can support massive amounts (TB to PB) Data; it imports the pre-computed (executed by MapReduce or Spark) multidimensional Cube into HBase, a low-latency distributed database, so as to achieve sub-second query response; the recent Kylin 4 began to use Spark + Parquet to Replace HBase to further simplify the architecture. Since a large number of aggregation calculations have been completed in the process of offline tasks (Cube construction), when executing SQL queries, it does not need to access the original data, but directly uses the index to combine the aggregation results and calculates again. The performance is higher than that of the original data. Hundred or even thousand times; due to low CPU usage, it can support a higher amount of concurrency, especially suitable for self-service analysis, fixed reports and other multi-user, interactive analysis scenarios.
ClickHouse is a distributed ROLAP (Relational OLAP) analysis engine based on the MPP architecture . Each node has equal responsibilities and is responsible for part of the data processing (shared nothing). It has developed a vectorized execution engine that uses log merge trees, sparse indexes, and CPU Features such as SIMD (Single Instruction Multiple Data) give full play to hardware advantages to achieve efficient calculations. Therefore, when ClickHouse faces a large data volume calculation scenario, it can usually reach the limit of CPU performance.
1. 2 Technology ecology
Kylin is written in Java, fully integrated into the Hadoop ecosystem, and uses HDFS for distributed storage. The computing engine can be MapReduce, Spark, and Flink; the storage engine can be HBase, Parquet (combined with Spark). Source data access supports Hive, Kafka, RDBMS, etc., multi-node coordination relies on Zookeeper; compatible with Hive metadata, Kylin only supports SELECT query, schema modification needs to be completed in Hive, and then synchronized to Kylin; modeling and other operations are passed Web UI is completed, task scheduling is performed through Rest API, and task progress can be viewed on Web UI.
ClickHouse is written in C++ and has a self-contained system with little dependence on third-party tools. Supports relatively complete DDL and DML. Most operations can be completed through command line combined with SQL; distributed clusters rely on Zookeper management, and single nodes do not need to rely on Zookeper. Most configurations need to be completed by modifying the configuration file.
Kylin uses Hadoop ecosystem HBase or Parquet as the storage structure, relies on HBase’s rowkey index or Parquet’s Row group sparse index for query speed, and uses HBase Region Server or Spark executor for distributed parallel computing. ClickHouse manages data storage by itself, and its storage features include: MergeTree as the main storage structure, data compression and block, sparse index, etc. The following is a detailed comparison of the two engines.
2.1 Kylin’s storage structure
Kylin calculates multi-dimensional Cube data through pre-aggregation, and dynamically selects the optimal Cuboid (similar to a materialized view) according to the query conditions when querying, which will greatly reduce the amount of CPU calculation and IO reading.
During the Cube construction process, Kylin performs certain encoding and compression of the dimension values, such as dictionary encoding, in an effort to minimize data storage; since Kylin’s storage engine and construction engine are pluggable, there are also storage structures for different storage engines. The difference.
In the case of using HBase as the storage engine, each dimension is encoded during pre-calculation to ensure that the length of the dimension value is fixed, and when generating hfile, the dimensions in the calculation result are spliced into rowkey, and the aggregated value is used as the value. The order of the dimensions determines the design of the rowkey and directly affects the efficiency of the query.
Parquet storage engine
When Parquet is used as the storage format, the dimension values and aggregated values are stored directly, without coding and rowkey splicing. Before saving into Parquet, the calculation engine will sort the calculation results according to the dimension. The higher the dimension field, the higher the filtering efficiency on it. In addition, the number of shards and the number of row groups of parquet files in the same partition will also affect the efficiency of the query.
2.2 Storage structure of ClickHouse
ClickHouse generally requires users to specify partition columns when creating a table structure. Adopt data compression and pure column storage technology, use Mergetree to store each column separately and compress blocks,
At the same time, data will always be written to disk in the form of fragments. When certain conditions are met, ClickHouse will periodically merge these data fragments through background threads.
When the amount of data continues to increase, ClickHouse will merge the data in the partition directory to improve the efficiency of data scanning.
At the same time, ClickHouse provides a sparse index for each data block. When processing query requests, sparse indexes can be used to reduce data scanning and accelerate.
Both Kylin and ClickHouse are big data processing systems. When the level of data continues to increase, the use of appropriate optimization methods can often double the result with half the effort, greatly reducing query response time, reducing storage space, and improving query performance. Because the computing systems and storage systems of the two are different, the optimization methods used are different. The next section will focus on the analysis of the optimization methods of Kylin and ClickHouse.
3.1 Kylin’s optimization method
Kylin’s core principle is pre-computation , as mentioned in the first section of technical principles: Kylin’s calculation engine uses Apache Spark and MapReduce; storage uses HBase and Parquet; SQL analysis and post-calculation uses Apache Calcite. Kylin’s core technology is to develop a series of optimization methods to help solve the problem of dimension explosion and excessive scanning data . These methods include: setting aggregation groups, setting joint dimensions, setting derivative dimensions, setting dimension table snapshots, setting Rowkey order , Set the shard by column, etc.
- Set aggregation group: Prune through aggregation group to reduce unnecessary pre-computed combinations;
- Set joint dimensions: put together dimensions that often appear in pairs to reduce unnecessary pre-calculation;
- Set derivative dimensions: set the dimensions that can be calculated by other dimensions (for example, year, month, and day can be calculated by date) as derivative dimensions to reduce unnecessary pre-calculation;
- Set the snapshot of the dimension table: put it into the memory and calculate it to reduce the storage space occupied;
- Dictionary encoding: reduce the storage space occupied;
- RowKey encoding, set shard by column: speed up query efficiency by reducing the number of rows scanned by data
3.2 ClickHouse optimization method
The most common optimization method of MPP architecture system is to sub-database and table. Similarly, the most common optimization method of ClickHouse includes setting partition and sharding . In addition, ClickHouse also includes some unique engines . In summary, these optimization methods include:
As the subsequent performance and concurrency requirements become higher and higher, the resource consumption of the machine is also increasing. In the ClickHouse official website document, it is recommended that the concurrency of ClickHouse not exceed 100. When the concurrency requirements are high, in order to reduce the resource consumption of ClickHouse, it can be optimized in conjunction with some special engines of ClickHouse.
The most commonly used special engines are SummingMergetree and Aggregate Mergetree. These two data structures are derived from Mergetree. The essence is to calculate the data that needs to be queried in advance through pre-calculation and save it in ClickHouse, so that you can further Reduce resource consumption.
From the principle of use, SummingMergetree and AggregateMergetree are similar to Kylin’s Cube. But when there are too many dimensions, it is unrealistic to manage many materialized views, and there are problems such as high management costs. Unlike ClickHouse, Kylin provides a series of simple and direct optimization methods to avoid the problem of dimensional explosion.
As you can see, both ClickHouse and Kylin provide some methods to reduce storage space and reduce the number of rows of scanned data during query. It is generally believed that proper optimization of ClickHouse and Kylin can meet business needs in large data scenarios. ClickHouse uses MPP to calculate, and Kylin uses pre-calculation. Due to the different technical routes adopted by the two, the corresponding advantage scenarios are also different.
Because Kylin adopts pre-computation technology, it is suitable for aggregation queries with fixed patterns, such as: join, group by, where condition patterns in SQL are relatively fixed, etc. The larger the amount of data, the more obvious the advantages of using Kylin; in particular, Kylin is going The advantages of scenarios such as count distinct, Top N, Percentile, etc. are particularly huge, which are widely used in scenarios such as Dashboard, various reports, large-screen displays, traffic statistics, and user behavior analysis . Meituan, Jiguang, Shell Finding House, etc. use Kylin to build their data service platform, providing millions to tens of millions of query services every day, and most of the queries can be completed within 2–3 seconds. There are few better alternatives for such high-concurrency scenarios.
ClickHouse has strong on-site computing power because of its MPP architecture . It is more suitable when the query request is more flexible, or there are detailed query requirements and the amount of concurrency is not large . Scenarios include: user tag filtering with a lot of columns and where conditions are randomly combined, complex ad hoc queries with little concurrency, etc. If the amount of data and access is large, a distributed ClickHouse cluster needs to be deployed. At this time, the challenge to operation and maintenance will be higher.
If some queries are very flexible but are not frequently checked, using the current calculation will save resources. Because of the small amount of queries, even if each query consumes a large amount of computing resources, it can be cost-effective overall. If some queries have a fixed pattern, a larger query volume is more suitable for Kylin, because the query volume is large and the calculation results are stored using large computing resources. The early calculation cost can be diluted in each query, so it is the most economical.
Technical principle : ClickHouse adopts MPP + Shared nothing architecture, which is more flexible in query, easy to install, deploy and operate. Because the data is stored locally, expansion and operation and maintenance are relatively troublesome; Kylin uses MOLAP pre-calculation, based on Hadoop, and separates calculation and storage ( Especially after using Parquet storage) and Shared storage architecture, it is more suitable for scenarios where the scene is relatively fixed but the data volume is large. Based on Hadoop, it is easy to integrate with the existing big data platform, and it is also easy to scale (especially upgrade from HBase to Spark). + Parquet).
Storage structure : ClickHouse stores detailed data. Features include MergeTree storage structure and sparse index. Aggregate tables can be created on the details to speed up performance; Kylin uses pre-aggregation and HBase or Parquet as storage. Materialized views are transparent to queries and aggregate queries. Very efficient but does not support detailed query.
Optimization methods : ClickHouse includes optimization methods such as partitioning and sharding and secondary indexes. Kylin uses optimization methods such as aggregation groups, joint dimensions, derivative dimensions, hierarchical dimensions, and rowkey sorting.
Advantage scenarios : ClickHouse is usually suitable for flexible queries of the order of hundreds of millions to billions (more orders are also supported, but the difficulty of cluster operation and maintenance will increase). Kylin is more suitable for relatively fixed query scenarios of billions to tens of billions or more.