Get started quickly with Apache Flink

At the beginning of 2020, Arun, a Hadoop luminary at Cloudera, announced on Twitter that Cloudera Data Platform had officially integrated Flink as its stream computing product, which means that Cloudera's global customers can now use Flink for stream data processing. So what are the outstanding capabilities of Apache Flink, widely regarded as the best alternative to Storm, that won over the big data unicorn Cloudera? And what improvements can the combination of Cloudera and Apache Flink bring to enterprises and developers?

1 What is Apache Flink?

Flink grew out of Stratosphere, a European big data research project led by the Technical University of Berlin. In its early days the project focused on batch computing, but in 2014 the core members of Stratosphere spun Flink out of it. That same year Flink was donated to Apache, where it later became a top-level big data project, and its main computing direction was repositioned as streaming, that is, using stream computing for all big data computation.

Specifically, Apache Flink is a computing framework for real-time data processing; it is not a data warehouse service. It performs stateful computations over both bounded and unbounded data streams, can be deployed in a variety of cluster environments, and computes quickly at any data scale.

[Figure: Flink architecture, from left to right: data input, Flink data processing, data output]

As shown in the figure above, the Flink framework can be roughly divided into three parts, from left to right: data input, Flink data processing, and data output.

Flink accepts events from a message queue as input, including real-time events. The upstream continuously produces data into the message queue, Flink continuously consumes and processes that data, and after processing, the results are written to downstream systems. This process runs continuously.
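To make this pipeline concrete, here is a minimal sketch assuming Flink 1.9 with the flink-connector-kafka dependency; the broker address, group id, and topic names are hypothetical placeholders:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class StreamingPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // hypothetical broker
        props.setProperty("group.id", "demo");                // hypothetical group

        // Continuously consume events from the upstream message queue ...
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        // ... process them (a trivial transformation stands in for real logic) ...
        DataStream<String> processed = events.map(String::toUpperCase);

        // ... and write the results to the downstream system.
        processed.addSink(
                new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), props));

        env.execute("Streaming pipeline sketch");
    }
}
```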

[Figure: Flink's layered API stack: SQL API, Table API, and DataStream API on top of the Flink runtime]

At the API level, Flink is well layered. Whether you write against the SQL API, the Table API, or the DataStream API, the program is ultimately translated into stream operators and executed by the Flink runtime; that is, it becomes a Flink application in which operators are chained together. The upper-level APIs simply offer different angles and levels of abstraction for expressing the application you want to build.
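As a sketch of that layering, the same stream can be handled through the DataStream API or queried with SQL, and both compile down to the same runtime operators. This assumes Flink 1.9 with a Table API planner on the classpath; the table and field names are illustrative:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class ApiLayersExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // DataStream API: operators wired up explicitly.
        DataStream<Integer> numbers = env.fromElements(1, 2, 3);

        // SQL API: a declarative query over the same data, translated by the
        // planner into the same kind of stream operators.
        tEnv.registerDataStream("numbers", numbers, "n");
        Table doubled = tEnv.sqlQuery("SELECT n * 2 AS doubled FROM numbers");
        tEnv.toAppendStream(doubled, Row.class).print();

        env.execute("API layers sketch");
    }
}
```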

A new generation of distributed streaming data processing framework

Flink is a distributed streaming data processing framework that combines high throughput, low latency, and stateful computation.

As is well known, the very mature Apache Spark framework achieves high throughput and high performance but cannot guarantee low latency in Spark Streaming; Apache Storm, conversely, supports low latency and high performance but cannot meet high-throughput requirements. All three goals of high throughput, low latency, and statefulness are important for a distributed streaming computing framework.

High throughput

[Figure: Flink's network model: Subtasks within a TaskManager sharing a Netty-based TCP connection to other TaskManagers]

As shown in the figure above, Flink's network model is more efficient than Storm's and other frameworks'. Each Flink TaskManager runs many Subtasks, and unlike other designs, the Subtasks share their TaskManager's network service: they all communicate with another TaskManager over a single TCP connection, handled by the Netty server inside the TaskManager.

It should be noted that by default, event data is not sent record by record. Instead, the RecordWriter obtains a buffer block from the Subtask's Buffer Pool and writes records into it; only once the block is full is Netty notified to send it to the other TaskManager. This keeps each TCP packet as full as possible and reduces the number of unnecessary network packets.

Low latency

[Figure: buffer flushing: the trade-off between throughput and latency]

At the level of the underlying technology, Flink introduces the concepts of the Buffer Pool and buffer blocks. Under heavy traffic, buffers fill up quickly, Netty is notified to send them promptly, and little delay is visible. Under low traffic, however, a record may arrive only every few seconds, which means a buffer may not fill for a long time. To ensure the downstream system still receives upstream messages promptly, a forced-flush mechanism is needed that triggers pushing data downstream.

Flink provides exactly such a mechanism: even when a buffer is not yet full, the Netty server can be notified ahead of time to send whatever is currently in the buffer block. This is controlled by the BufferTimeout parameter, which bounds Flink's maximum latency at low traffic.

BufferTimeout accepts -1, 0, or a positive number of milliseconds. The special values are -1 and 0. With -1, Flink ignores the flusher entirely and transmission is driven solely by the RecordWriter, meaning a buffer is sent only once it is full. Each transmission is then maximally efficient, but at low traffic this produces large, unpredictable latency.

When the parameter is set to 0, Flink notifies Netty to send after every single record is written, which is the theoretical minimum latency. If you are especially latency-sensitive and traffic is not very high, consider setting BufferTimeout to 0.

Under normal circumstances, BufferTimeout is set to a positive number of milliseconds. Flink then notifies Netty at that interval, and Netty sends whatever is in the buffers, whether or not more data has been written.

Together, these two parameters, the buffer size and the forced-flush interval, give you control along both dimensions of the latency/throughput trade-off, so you can tune toward low latency or high throughput and strike a balance between the two.
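A minimal sketch of the timeout knob in code, assuming a standard Flink setup; the values are illustrative, not recommendations:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferTimeoutExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setBufferTimeout(5);     // flush network buffers every 5 ms, full or not
        // env.setBufferTimeout(0);  // flush after every record: minimum latency
        // env.setBufferTimeout(-1); // no timed flush: a buffer is sent only when full
    }
}
```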

[Figure: Flink's three state backends: memory, file system, and RocksDB]

Since Flink is a real-time computing framework, its state is a core technical asset: state is written and read frequently and therefore needs a fast storage solution. Flink provides three state storage modes, namely the memory, file system, and RocksDB backends (a configuration sketch follows the list below).

  • Memory mode: Flink keeps state on the Java heap. Memory is the fastest to read and write, but its drawback is equally obvious: a single machine's memory is limited, so it is unsuitable for storing large amounts of state. The memory backend is generally used for local development and debugging, or for applications with very small state.
  • File mode: With the file system backend, in-flight state is held in the TaskManager's memory. On each checkpoint, the backend writes a snapshot of the state to the configured file system and stores only a small amount of metadata in the JobManager's memory, or in ZooKeeper for high availability. The file system backend suits jobs with large state, long windows, or large key-value state.
  • RocksDB: RocksDB is an embedded key-value database. With the RocksDB backend, Flink keeps the working state in RocksDB on local disk. On each checkpoint, the entire RocksDB database is snapshotted to the configured file system, again with only minimal metadata kept in the JobManager's memory or in ZooKeeper (high availability). RocksDB supports incremental checkpoints, backing up only modified data, which makes it well suited to scenarios with very large state.
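A minimal configuration sketch, assuming Flink 1.9 with the flink-statebackend-rocksdb dependency; the checkpoint path is a hypothetical placeholder:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB backend; the second argument enables incremental checkpoints,
        // so only modified data is backed up on each checkpoint.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        env.enableCheckpointing(60_000); // snapshot state every 60 seconds
        // ... define the job as usual ...
    }
}
```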

Three scenarios where real-time processing is no problem

Flink's application scenarios generally fall into three categories: streaming ETL, real-time data analytics, and event-driven applications.

Streaming ETL

A traditional ETL task typically starts on a schedule: it reads data, writes the results to some database or file system, and completes as a batch job driven by periodically invoked ETL scripts. With streaming ETL, there is no longer any need to run ETL on a schedule; processing starts as soon as data arrives, and if something unexpected happens, the task can resume from its last position through the checkpoint mechanism.

Real-time data analysis

Apache Flink supports both streaming and batch analysis applications.

Flink provides good support for both continuous streaming analytics and batch analytics. Specifically, it has a built-in, ANSI-compliant SQL interface that unifies the semantics of batch and stream queries: whether a query runs over a static data set of recorded events or over a live event stream, the same SQL query produces consistent results.

However, it is inevitable that a real-time analytics system works over an open or half-open data interval, so it cannot guarantee its results with 100% certainty. Developers can reduce the probability of incomplete or incorrect results through various means, but cannot eliminate such situations entirely.

Event-driven application

An event-driven application is a type of stateful application that ingests data from one or more event streams and reacts to incoming events by triggering computations, state updates, or other external actions.

[Figure: traditional transaction processing (left) vs. an event-driven application (right)]

The figure above contrasts a traditional transaction-processing application (left) with an event-driven application (right).

In a traditional transaction-processing application, clickstream events are written by the application into a transactional database (Transaction DB); the application also reads data back from the Transaction DB and processes it, and when a processing result reaches an alert threshold, an Action is triggered.

In an event-driven application, collected events flow continuously into a message queue, and the Flink application continuously ingests (consumes) them. The application maintains data (state) over a period of time and persists it to durable storage (persistent storage) so that the state survives if the application dies. Each piece of data is processed as it is received and can trigger an action afterward; processing results can also be written to an external message queue for other Flink applications to consume. The checkpoint mechanism guarantees consistency and protects against unexpected failures.
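As a sketch of this pattern, the following hypothetical KeyedProcessFunction (part of the DataStream API) keeps a per-key count in Flink state and emits an alert once a made-up threshold is crossed; the class name and threshold are illustrative:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ThresholdAlert extends KeyedProcessFunction<String, String, String> {
    private static final int THRESHOLD = 100; // hypothetical alert threshold
    private transient ValueState<Integer> count;

    @Override
    public void open(Configuration parameters) {
        // Per-key state, stored by the configured backend and covered by checkpoints.
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Types.INT));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out)
            throws Exception {
        // Update state for every incoming event ...
        int updated = (count.value() == null ? 0 : count.value()) + 1;
        count.update(updated);

        // ... and trigger an action when the threshold is reached.
        if (updated == THRESHOLD) {
            out.collect("ALERT: key " + ctx.getCurrentKey() + " reached " + THRESHOLD);
        }
    }
}
```

Applied with keyBy(...).process(new ThresholdAlert()), the emitted alerts can then be written to an external message queue for other applications to consume.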

Built on Flink, CSA delivers real-time stateful processing of IoT-scale data streams and complex events

Adding Flink to Cloudera DataFlow (CDF) is highly significant. Cloudera already offers several stream processing engines: Storm, Spark Structured Streaming, and Kafka Streams. Storm has gradually fallen out of favor in the market and the open source community, and users have been looking for a better option. Apache Flink is natively a stream processor (rather than a batch engine), processes large numbers of data streams at scale, supports fault tolerance and recovery natively, and offers advanced window semantics, making it the default choice as a general-purpose stream processing engine.

Cloudera Streaming Analytics ("CSA"), powered by Apache Flink, is a new product in the CDF platform that provides real-time stateful processing of IoT-scale data streams and complex events. As one of the key pillars of CDF, stream processing and analytics are essential for handling millions of data points and complex events from diverse data sources. CDF has supported multiple streaming engines over the years; the addition of Flink makes it a platform that can process large volumes of streaming data at scale.

Cloudera Streaming Analytics covers the core streaming functions of Apache Flink:

  • Support for Flink 1.9.1 on YARN
  • Support for installing Flink on Cloudera-managed clusters
  • Fully secured Flink clusters (with TLS and Kerberos enabled)
  • Data sources reading from Kafka or HDFS
  • Pipeline definition using the Java DataStream and ProcessFunction APIs
  • Exactly-once semantics
  • Event-time semantics (see the sketch after this list)
  • Data sinks writing to Kafka, HDFS, and HBase
  • Integration with Cloudera Schema Registry for schema management and serialization/deserialization of streaming events
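As a sketch of the event-time feature, a timestamp extractor of the kind below (available in Flink 1.9) assigns each record a timestamp taken from the data itself and emits watermarks that tolerate bounded out-of-orderness; the Event type and the delay are hypothetical:

```java
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeAssigner extends BoundedOutOfOrdernessTimestampExtractor<Event> {

    public EventTimeAssigner() {
        super(Time.seconds(5)); // tolerate events up to 5 seconds out of order
    }

    @Override
    public long extractTimestamp(Event event) {
        return event.timestampMillis; // event time comes from the record itself
    }
}

/** Hypothetical event carrying its own creation time. */
class Event {
    public long timestampMillis;
}
```

Attached via stream.assignTimestampsAndWatermarks(new EventTimeAssigner()), with event time enabled on the environment, windows then fire based on when events happened rather than when they arrived.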

How to use Cloudera CSA?

The next step is to use the standard development kits to build your first Flink project: first obtain the execution environment, load or read the data, then write the transformations, add the target output system (sink), and finally execute the application.
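A skeleton of those five steps, assuming a standard Flink project; the socket source and print sink are stand-ins:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FirstFlinkJob {
    public static void main(String[] args) throws Exception {
        // 1. Obtain the execution environment.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Load or read data (a socket source as a placeholder).
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // 3. Write transformations.
        DataStream<String> nonEmpty = lines.filter(line -> !line.isEmpty());

        // 4. Add the data output target system.
        nonEmpty.print();

        // 5. Execute the application.
        env.execute("First Flink job");
    }
}
```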

Apache Flink and CSA are exploring more possibilities

At present, Flink has become a mainstream stream computing engine, and the community's next important task is to unify stream and batch processing under a single data processing model. Version 1.9 shipped a technology preview of a new SQL planner (the Blink planner) to replace the old one; it supports more native SQL keywords and provides better guarantees for SQL standards compliance and for the correctness and efficiency of SQL parsing.

At the same time, as an open source participant in the Apache community, Cloudera will contribute more to Apache Flink itself, focusing on security-level integration, then integration with the Atlas component, along with a new HBase connector at the interface level.

In addition, although the current CSA supports Kerberos, it offers neither one-click automated Kerberos configuration nor visualization of the security framework, nor a unified security management framework such as Ranger for managing task permissions. Future versions of CSA will therefore add new and better enterprise management capabilities, including managing Flink programs for A/B testing and managing tasks and task JARs.

At the same time, Cloudera will invest more effort in open source Flink development and community building, hoping to advance the Flink community together with peers across the industry.

Written by Digital Nomad
