Digital Nomad

Big data frameworks play a key role in data processing: they make the integrated processing of large-scale data easy, and extracting intelligence and other useful reports becomes simple as well. The move from manual statistical analysis to today's distributed computing platforms is the keystone behind the rapid increase in data processing speed and the continuous evolution of the overall architecture. Nowadays, there are many big data frameworks available on the market; the most popular ones are Hadoop, Spark, and Storm. …



In 1993, Edgar F. Codd, the father of the relational database, proposed the concept of online analytical processing (OLAP). Essentially, it is the concept of a multidimensional database with multidimensional analysis capabilities, whose goal is to meet the specific query and reporting requirements of decision support and multidimensional environments. With the arrival of the Internet era, the surge in data volume brought new challenges to relational databases. The most obvious ones are as follows:

The cost of adding a data column is huge

Because a relational database defines a table's fields in advance, when the table already holds hundreds of millions of rows and the business scenario needs a new column, you are surprised to find that, under the rules of a relational database, all of those hundreds of millions of rows must be modified at the same time to add the new column (otherwise the database will report errors), which poses a great challenge to server performance in a production environment. …



The field of data processing is generally divided into online transaction processing (OLTP) and online analytical processing (OLAP). Take online shopping as an example: online transaction processing ensures that the same item is not purchased by multiple people, while online analytical processing counts how many people have purchased the item.

Kylin is a big data analysis engine built on the Hadoop platform: a tool that can return summarized data within seconds on PB-scale data sets (1 PB = 1000 TB). As an example of summarizing data, suppose I want to know the total score of each player in my game; computing that is data aggregation (a small sketch follows). This ability is amazing. …
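To make the idea of aggregation concrete, here is a minimal sketch of a "total score per player" query; the column names and data are hypothetical, and on Kylin the same question would be expressed as a SQL GROUP BY over a pre-built cube rather than in pandas.

```python
import pandas as pd

# Hypothetical game scores; the real data set could be PB-scale.
scores = pd.DataFrame({
    "player": ["alice", "bob", "alice", "bob"],
    "score":  [10, 7, 5, 3],
})

# "Total score of each player" is a group-by aggregation. Kylin's job is to
# answer this kind of query in seconds by pre-computing the aggregates.
total_per_player = scores.groupby("player")["score"].sum()
print(total_per_player)
```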



History and problems

Looking back on the last ten-plus years of evolution in distributed computing systems, we can more easily recognize the relative positions of Spark and Ray. In 2004, Google proposed MapReduce as a cluster programming framework, paired with the Google File System and other technologies as the underlying storage. Over the following decade and more, MapReduce became popular.

The reason for its success is that it gives programmers and data scientists a model that is easy to understand, richly expressive, and highly fault tolerant, and it makes it easy to implement a distributed system on commodity hardware.

Then in 2010, with the memory cloud concept proposed at Stanford, researchers realized that memory, which had seemed very expensive, was becoming cheap, and that many fault-tolerance operations that depended heavily on disk could actually be carried out in memory. In this context Spark came into being, giving birth to the RDD and a series of memory-based optimization techniques, and replacing the original disk-based frameworks such as Hadoop Hive in small- and medium-scale computing. But Hive has not been completely replaced so far: in very large-scale (PB-level) computing scenarios, relying on SSDs and its robustness, it is still the first choice of many companies. …



Dataflow model

The Dataflow model aims to establish an accurate and reliable solution for stream processing. Before the Dataflow model was proposed, stream processing was often regarded as an unreliable but low-latency processing method, which had to be paired with an accurate but high-latency batch processing framework such as MapReduce to obtain reliable results. This is the famous Lambda architecture.

This architecture brings a lot of trouble to applications; for example, introducing multiple sets of components increases system complexity and makes the system harder to maintain (a rough sketch of the idea follows). …
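Purely as an illustration of the Lambda architecture described above, here is a toy sketch in which a slow-but-accurate batch view is merged with a fast-but-approximate real-time view at query time; all names and numbers are hypothetical.

```python
# Batch layer: recomputed periodically over all historical data (accurate, slow).
batch_view = {"product_42": 1000}

# Speed layer: updated incrementally as new events arrive (fast, low latency).
realtime_view = {"product_42": 7}

def purchase_count(product_id: str) -> int:
    # Serving layer: merge the two views to answer a query.
    return batch_view.get(product_id, 0) + realtime_view.get(product_id, 0)

print(purchase_count("product_42"))  # -> 1007
```

Maintaining two code paths that must stay consistent is exactly the kind of operational burden the Dataflow model tries to remove.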



As we all know, Hadoop, the first-generation big data computing framework, was created to solve the problem of offline computing. Apache Spark achieves excellent results in offline batch processing but has many drawbacks for real-time stream processing. After Hadoop, Spark and Storm became rivals in stream processing.

Apache Spark stream processing

The Spark framework inherits from and builds on Hadoop MapReduce. In essence, it still adopts the idea of batch processing, but it optimizes the intermediate steps of the computation, thereby improving data processing efficiency and delivering better computing performance than native MapReduce.

Spark provides streaming with the help of the core Spark API. Its stream processing idea is to divide the incoming stream into small batch jobs according to a time interval before processing them (a minimal example follows). …
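As a minimal sketch of this micro-batching idea, the classic word-count example from the Spark Streaming (DStream) API splits a socket stream into 5-second batches; the host, port, and batch interval here are illustrative assumptions.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchWordCount")
ssc = StreamingContext(sc, 5)  # carve the stream into 5-second micro-batches

# Hypothetical text source on localhost:9999 (e.g. started with `nc -lk 9999`).
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # each 5-second batch is processed as a small batch job

ssc.start()
ssc.awaitTermination()
```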



The following describes each technology by layer. Of course, the layers are not strictly divided in a literal sense; for example, Hive provides both data processing and data storage functions, but it is classified under the data analysis layer here.

1. Data acquisition and transmission layer

  • Flume
    Flume is a distributed, reliable, and highly available system for data collection, aggregation and transmission. Commonly used in log collection systems, it supports customizing various data senders to collect data, simple pre-processing of data through custom interceptors, and transmission to various data receivers such as HDFS, HBase, and Kafka. …



Spark Core is the basic execution engine of the Spark platform; all other functions are built on top of this engine. It not only provides in-memory computing to improve speed but also a general execution model to support various applications. In addition, users can develop applications with the Java, Scala, and Python APIs. Spark Core is built on the unified RDD abstraction, which allows the various Spark components to be integrated freely, so different components can be combined in the same application to complete complex big data processing tasks.

What is RDD

The RDD (Resilient Distributed Dataset) was originally designed to solve the problem that existing computing frameworks handle two types of application scenarios inefficiently: iterative algorithms and interactive data mining. In both scenarios, keeping data in memory can improve performance by several orders of magnitude. Iterative algorithms, such as PageRank, K-means clustering, and logistic regression, often need to reuse intermediate results. The other scenario is interactive data mining, such as running multiple ad hoc queries on the same data set. In frameworks such as Hadoop, intermediate results are saved to external storage (such as HDFS), which adds extra data replication, disk I/O, and serialization work and increases the load on the application (see the caching sketch below). …
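As a minimal sketch of how an RDD lets an iterative job reuse data in memory, here is an illustrative PySpark snippet; the input path and the computation itself are hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDCachingSketch")

# Hypothetical input; cache() keeps the parsed records in memory so each
# iteration reuses them instead of re-reading and re-parsing from storage.
points = (sc.textFile("hdfs:///data/points.txt")
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())

for i in range(10):
    # Each pass runs over the cached RDD rather than recomputing it from HDFS.
    total = points.map(lambda p: sum(p)).reduce(lambda a, b: a + b)
    print(f"iteration {i}: total = {total}")
```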



What is Airflow?

Airflow is a platform for scheduling and monitoring workflows, with data pipelines written in Python. It is a task scheduling tool that manages the flow of tasks through DAGs (directed acyclic graphs): it does not need to know the specific content of the business data; you only set the dependency relationships between tasks to achieve task scheduling.

This platform can interact with data sources such as Hive, Presto, MySQL, HDFS, and Postgres, and it provides hooks that make it highly extensible (a minimal DAG is sketched below). …
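As a minimal sketch of defining such a DAG, here is an Airflow 2.x-style example with two illustrative tasks; the DAG id, schedule, and commands are assumptions for demonstration only.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily pipeline: Airflow only needs the task dependencies,
# not the content of the business data itself.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # load runs only after extract succeeds
```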



Since the concept of big data became popular, companies' demands for data intelligence have grown stronger and stronger, and many big data computing engines have emerged. The most famous and widely used are MapReduce, Storm, Spark, Spark Streaming, and Flink. They were produced against different era backgrounds, each a new solution to difficult problems that could not be solved at an earlier stage. So what are they? Let's take a look at these computing engines one by one today.

Offline and online computing

From the perspective of processing time, we can divide big data computing engines into offline computing and real-time computing. Offline computing generally has a T+1 delay, while real-time computing generally has a delay of seconds or milliseconds. From the perspective of how much data is processed at once, we can divide big data engines into two types: streaming computing and batch computing. Streaming computing processes one item at a time, and batch computing processes many items at a time (a toy contrast follows). MapReduce and Spark are offline, batch computing engines, while Storm, Spark Streaming, and Flink are engines that span online and offline computing as well as streaming and batch processing. …
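Purely as an illustration of that distinction, here is a toy contrast between batch and streaming over the same hypothetical events; real engines differ in far more than this, but the one-at-a-time versus all-at-once difference is the core idea.

```python
events = [3, 5, 2, 8]  # hypothetical purchase amounts

# Batch computing: gather the items first, then process them together.
batch_total = sum(events)

# Streaming computing: process each item as it arrives, updating the
# result incrementally instead of waiting for the whole data set.
stream_total = 0
for e in events:  # imagine each e arriving from a live stream
    stream_total += e

assert batch_total == stream_total == 18
```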
