AWS data lakes are booming, so what about the data warehouse?


In the future, mankind will face three major propositions: biology itself is an algorithm, and life is a process of continuously processing data; consciousness and intelligence will separate; and external systems armed with accumulated big data will come to know us better than we know ourselves.

These are the three revolutionary viewpoints put forward in "A Brief History of the Future". The book lets us see the disruptive changes sweeping the world: from computers to the Internet, to big data and artificial intelligence, change happens quietly, visible in its effects yet impossible to capture with the naked eye. Behind this drive for change is the rising value of data.

If you compare data to an "oil field," then tapping its full value means first "extracting and storing" the data, that is, collecting and storing it, and then "refining" it, that is, mining and analyzing it to create more value. Take today's e-commerce as an example: e-commerce companies collect user-related data, analyze it to learn user preferences, and then recommend related products to improve purchasing efficiency. They can also build predictive models for specific groups of users and adjust sales strategies at different stages in a timely manner, improving user satisfaction with products and thereby increasing sales.

Traditionally, enterprises relied on expensive, proprietary on-premises data warehouse solutions to store and analyze data. Because of strict schema requirements, the underlying data could not be diversified, which in turn constrained how the business could change. Meanwhile, with the explosion of the Internet and mobile Internet, data volumes have grown from terabytes to petabytes to exabytes, data types now span structured, semi-structured, and unstructured data, and users demand ever more in terms of locality and timeliness. These demands mean the traditional data warehouse solution needs to be updated.

Today, with highly elastic and scalable compute and storage in the cloud, data storage and analysis are much easier problems to solve, and cloud data solutions have become the general trend. On the one hand, distributed architectures and open source systems adapt to today's rapidly changing data; on the other hand, they can integrate newer technical services, such as machine learning for predictive analysis. Distributed storage, multiple file formats, multiple engines, and metadata services have gradually formed the foundation of the data lake.

1 The technological evolution of the AWS data lake

The concept of a data lake was first proposed in 2006. Its core idea is to define the data lake as a container for centralized data storage: data can easily enter the lake; structured, semi-structured, and unstructured data can all be stored; data volumes can scale rapidly; upper-layer data applications can change flexibly; and massive amounts of data can ultimately be stored, queried, and analyzed.

It is AWS (Amazon Web Services) that really extended the data lake concept. AWS began pushing the technological evolution of data lakes early on. In 2009 it launched Amazon Elastic MapReduce (EMR), which automatically provisions HDFS across clusters of EC2 instances; in 2012 it followed with Amazon Redshift, a cloud data warehouse service built on an MPP architecture; and AWS then gradually shifted the core of the data lake to Amazon S3.

As big data technology developed, computing power became the key, and separating compute from storage gradually brought elasticity and cost advantages. Since cloud services are inherently built around separated storage and compute, AWS's cloud advantages have become increasingly prominent. In the end, the AWS data lake combines big data and cloud computing into a classic architecture of central storage plus multiple engines and services. Below, we walk through AWS's analytics services to explain how AWS helps developers and enterprises build a data lake environment and use data efficiently.

Fast data query engine

On AWS, the Amazon S3 object storage service has become the first choice for building data lakes thanks to its high availability, high durability, scalability, and compatibility with many data formats. AWS also provides an interactive way to query data directly in S3: Amazon Athena.

Athena is an interactive query service that analyzes data in Amazon S3 using standard SQL. It is simple to use: point it at the data stored in S3, define the schema, and start querying. There is no need to run complex ETL jobs to prepare the data for analysis, so developers can easily analyze large data sets.
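To make this concrete, here is a minimal sketch of submitting an Athena query with the boto3 SDK; the database (ecommerce_lake), table (sales), and results bucket are hypothetical placeholders, not names from this article.

```python
import time
import boto3

# Minimal sketch: run a standard-SQL query against data in S3 via Athena.
# The database, table, and result bucket below are hypothetical placeholders.
athena = boto3.client("athena", region_name="us-east-1")

query_id = athena.start_query_execution(
    QueryString="SELECT user_id, COUNT(*) AS orders FROM sales GROUP BY user_id LIMIT 10",
    QueryExecutionContext={"Database": "ecommerce_lake"},               # Glue/Athena database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # where Athena writes results
)["QueryExecutionId"]

# Athena runs queries asynchronously: poll until a terminal state is reached.
state = "QUEUED"
while state in ("QUEUED", "RUNNING"):
    time.sleep(1)
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[:5])  # header row plus the first few data rows
```

Because Athena reads the data in place, nothing needs to be loaded or transformed before the query runs.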

How to handle metadata for data in many formats?

Because a data lake can store data in any format, there is no need to convert it into a predefined structure; but one of the main challenges of using a data lake is finding the data and understanding its structure and format. AWS Glue helps developers extract, transform, and load data, and move it reliably between different data stores. As a fully managed service, Glue also automatically crawls the massive data in the data lake, like a "crawler", and generates a data catalog, the permanent metadata store for all data assets. Once registered in the catalog, data can be searched, queried, and used by ETL jobs immediately.
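As an illustration under assumed names, the sketch below creates and starts a Glue crawler over an S3 prefix with boto3; the IAM role ARN, catalog database, and bucket path are placeholders.

```python
import boto3

# Minimal sketch: create a Glue crawler that scans an S3 prefix and
# populates the Glue Data Catalog. Role ARN, database, and path are placeholders.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="ecommerce-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",    # hypothetical IAM role
    DatabaseName="ecommerce_lake",                            # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/sales/"}]},
)

# Run the crawler; the tables it discovers become queryable from
# Athena, Redshift Spectrum, and Glue ETL jobs via the shared catalog.
glue.start_crawler(Name="ecommerce-raw-crawler")
```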

It is worth mentioning that Athena integrates with the AWS Glue Data Catalog out of the box, helping developers create a unified metadata repository across services, crawl data sources to discover schemas, populate the catalog with new and modified table and partition definitions, and maintain schema versioning.

How to quickly build a data lake?

Clearly, the data lake is an efficient and fast concept for data storage and analysis, but it also comes with considerable complexity. Setting up and managing a data lake involves a large number of time-consuming, complicated manual tasks: loading data from different sources, monitoring data flows, setting up partitions, turning on encryption and managing keys, defining transformation jobs and monitoring their execution, reorganizing data into columnar formats, and so on.

To address these problems, developers can use the AWS Lake Formation service, which simplifies the creation and management of data lakes, shortens build time, and makes it possible to stand up a secure data lake within a few days.

Lake Formation is built on the capabilities available in AWS Glue. Developers only need to define the data sources and the data access and security policies to apply; Lake Formation then automatically collects and catalogs data from databases and object storage and moves it into a new Amazon S3 data lake. Users can then pick different analytics and machine learning services to put those data sets to work.
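For a rough sense of what "defining sources and access policies" looks like in code, here is a hedged boto3 sketch that registers an S3 location with Lake Formation and grants a principal SELECT access to one cataloged table; every ARN, database, and table name is a placeholder.

```python
import boto3

# Sketch of two Lake Formation building blocks:
# 1) register an S3 location as lake storage, 2) grant table-level permissions.
# All ARNs, database, and table names are hypothetical placeholders.
lf = boto3.client("lakeformation", region_name="us-east-1")

# Register the S3 location so Lake Formation can manage access to it.
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake/raw/",
    UseServiceLinkedRole=True,
)

# Grant an analyst role SELECT on one cataloged table instead of raw bucket access.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "ecommerce_lake", "Name": "sales"}},
    Permissions=["SELECT"],
)
```

Centralizing these grants in Lake Formation is what lets downstream services such as Athena and Redshift share one security model over the same S3 data.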

Lake House New Model: Data Lake + Data Warehouse = Lake House

In the era of big data, open source technology systems have indeed allowed cloud products and open source components to gradually form complete big data solutions, data lakes being one example, but that does not mean data warehouses will be eliminated; the two remain connected. On the one hand, moving to the cloud continues to strengthen and modernize the core capabilities of the data warehouse. On the other hand, the data warehouse and the data lake are two design approaches to big data architecture whose functions complement each other, which means the two sides need to interact and share data.

To realize this interaction between lake and warehouse, at the re:Invent conference in 2019 AWS proposed that a new model for running data warehouse and data lake workloads together is taking shape: the "Lake House". AWS Lake House follows the "ELT" paradigm (extract, load, transform). When migrating from an on-premises data warehouse to Redshift, developers can keep their existing SQL workloads optimized for ELT instead of rewriting them from scratch for a new compute framework.
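As a rough illustration of the ELT pattern, the sketch below uses the Redshift Data API (the boto3 "redshift-data" client) to first load raw data from S3 and then transform it inside the warehouse with plain SQL; the cluster, database, user, IAM role, and table names are all assumptions, not details from the article.

```python
import time
import boto3

# ELT sketch with the Redshift Data API: Extract/Load raw files with COPY,
# then Transform them inside the warehouse with SQL.
# Cluster, database, user, IAM role, and table names are placeholders.
rsd = boto3.client("redshift-data", region_name="us-east-1")

def run_sql(sql: str) -> None:
    """Submit one statement and wait for it to finish (error handling omitted)."""
    stmt_id = rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )["Id"]
    while rsd.describe_statement(Id=stmt_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

# Load: copy raw CSV files from the lake into a staging table as-is.
run_sql("""
    COPY staging_sales
    FROM 's3://my-data-lake/raw/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV;
""")

# Transform: reshape the staged rows into a curated fact table using SQL.
run_sql("""
    INSERT INTO fact_sales (user_id, order_date, amount)
    SELECT user_id, CAST(order_ts AS DATE), amount
    FROM staging_sales
    WHERE amount > 0;
""")
```

The point of ELT here is that the transformation stays in SQL inside Redshift, so an existing warehouse workload can be carried over largely unchanged.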

Seamless interoperability between Amazon Redshift and the data lake

In the AWS Lake House model, Redshift is the preferred transformation engine for efficiently loading, transforming, and enriching data. Amazon Redshift Spectrum extends this by letting Redshift query data directly in the data lake; it is a feature of Amazon Redshift rather than a standalone service, so customers will not find "Spectrum" listed under its own name in the console. AWS chose SQL, the language developers already know, with the aim of helping more developers query data easily.
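To give a flavor of how Spectrum exposes lake data to SQL, here is a hedged sketch that defines an external schema over the Glue Data Catalog and then joins a lake table with a warehouse table; the catalog database, IAM role, cluster, and table names are placeholders, and polling between statements is omitted for brevity.

```python
import boto3

# Sketch: expose Glue Data Catalog tables to Redshift via Spectrum, then join
# lake data (S3) with warehouse data in one query. Names and ARNs are placeholders.
rsd = boto3.client("redshift-data", region_name="us-east-1")

common = dict(ClusterIdentifier="analytics-cluster", Database="dev", DbUser="awsuser")

# One-time setup: map the Glue database to an external schema in Redshift.
rsd.execute_statement(
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
        FROM DATA CATALOG DATABASE 'ecommerce_lake'
        IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
    """,
    **common,
)

# Query: join an S3-backed Spectrum table with a local Redshift table.
rsd.execute_statement(
    Sql="""
        SELECT w.customer_segment, SUM(l.amount) AS revenue
        FROM lake.sales AS l                      -- lives in S3, read by Spectrum
        JOIN dim_customers AS w USING (user_id)   -- lives in Redshift
        GROUP BY w.customer_segment;
    """,
    **common,
)
```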

Beyond that, the new Redshift also offers a data lake export feature that writes data back to the data lake, currently supporting the Apache Parquet, ORC, JSON, and CSV formats. Take Parquet, an efficient, open columnar storage format for analytics, as an example: compared with traditional text formats, unloading to Parquet is up to 2 times faster and uses up to 6 times less storage in S3.
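As a small, hedged sketch, the UNLOAD statement below (again submitted through the Redshift Data API) writes query results back to S3 as partitioned Parquet; the table, S3 prefix, cluster, and IAM role are hypothetical placeholders.

```python
import boto3

# Sketch of Redshift "data lake export": UNLOAD query results to S3 as Parquet.
# The table, S3 prefix, cluster, and IAM role are hypothetical placeholders.
rsd = boto3.client("redshift-data", region_name="us-east-1")

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        UNLOAD ('SELECT user_id, order_date, amount FROM fact_sales')
        TO 's3://my-data-lake/curated/fact_sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        FORMAT AS PARQUET
        PARTITION BY (order_date);
    """,
)
```

Once unloaded this way, the curated Parquet files can be picked up again by Athena, Spectrum, EMR, or other engines reading from the same lake.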

In addition, Redshift's RA3 instance type lets developers scale Redshift storage and compute independently, helping them manage the mix of data and workloads at a lower price. Redshift can also handle concurrent queries and maintain consistent performance by automatically adding transient capacity to absorb peak workloads.

When data moves smoothly between the data lake and Redshift, this flexibility lets developers choose the best trade-off between cost and performance when storing data. A large number of enterprises and institutions have already adopted AWS data lake and analytics cloud services. Among them, FOX Corporation, one of the giants of the global entertainment industry, has to extract, optimize, transform, and aggregate multi-source transactional events at large scale every day, with data volumes on the order of billions of events. Amazon Redshift supports querying real-time data across its data warehouse and data lake and has kept pace with petabyte-scale data growth, helping FOX Corporation increase its workload tenfold while keeping costs flat.

In summary, choosing AWS Lake House can help developers achieve the following goals:

  • Efficient, low-cost data storage
  • Independently scalable compute capable of massively parallel processing
  • Transformations in standard SQL
  • Concurrency scaling to flexibly execute SQL queries

It is clear that as product portfolios and architecture models continue to develop, data lakes and data warehouses will work together more and more often. The Lake House that AWS has built around Redshift Spectrum will continue to play a key role in the AWS data lake architecture. Looking ahead, AWS firmly believes that compared with traditional data warehouse and analytics solutions, cloud solutions such as the new Lake House model will unlock greater data value for users.

Written by

Digital Nomad
