What is a Data Lake?
A data lake is a central storage repository that holds a vast amount of raw, granular data in its native format. It is a single repository for structured, semi-structured, and unstructured data.
A data lake is used where storage size is not fixed, file types are not limited, and the emphasis is on storing data in flexible formats for future use. Data lake architecture is flat and uses metadata tags and identifiers to speed up data retrieval.
The term “data lake” was coined by the Chief Technology Officer of Pentaho, James Dixon, to contrast it with the more refined and processed data warehouse repository. The popularity of data lakes continues to grow, especially in organizations that prefer large, holistic data storage.
Data in a data lake is not filtered before storage, and accessing the data for analysis is ad hoc and varied. The data is not transformed until it is needed for analysis. However, data lakes need regular maintenance and some form of governance to ensure data usability and accessibility. If data lakes are not maintained well and become inaccessible, they are referred to as “data swamps.”
Data Lakes vs. Data Warehouse
Data lakes are often confused with data warehouses, so understanding data lakes starts with the fundamental distinctions between the two repositories.
Both are data repositories that serve the same broad purpose: storing organizational data to support decision making. They mainly differ in their architecture, which can be broken down into the following points.
The schema for a data lake is not defined before data is loaded into it, so data is stored in its native format, whether structured or unstructured, and is processed only when it is used. This approach is known as schema on read. A data warehouse schema, by contrast, is defined before any data is loaded, an approach known as schema on write.
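The difference between schema on read and schema on write can be illustrated with a minimal Python sketch. It is not tied to any particular data lake product: the record fields, the `read_with_schema` helper, and the `billing_schema` mapping are all hypothetical names chosen for illustration. Raw records are kept exactly as they arrived, and a schema is applied only when an analysis reads them.

```python
import json

# Raw events land in the lake exactly as produced; nothing is enforced on write.
# The field names below are made up for illustration.
RAW_LINES = [
    '{"user": "ana", "amount": "19.90", "ts": "2024-01-05"}',
    '{"user": "bo", "amount": 5, "channel": "web"}',
    '{"user": "cy"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields, cast types, default missing ones."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field, default)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Each analysis brings its own schema; the raw data is never rewritten.
billing_schema = {"user": (str, None), "amount": (float, 0.0)}
billing_rows = read_with_schema(RAW_LINES, billing_schema)
print(billing_rows)
```

A warehouse would instead reject or reshape the second and third records at load time, because they do not match a predefined table layout.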
Data lakes are flexible and adapt to changes in use and circumstances, while data warehouses take considerable time to define their schema, which cannot be changed quickly to meet new requirements. Data lake storage is easily expanded by scaling its servers.
Accessing data in a data lake requires some skill to understand its data relationships because the schema is undefined. In comparison, data in a data warehouse is easily accessible because of its structured, defined schema. Many users can work with warehouse data, while not all users in an organization have the skills to work with data in a lake.
Why Create a Data Lake?
Storing data in a data lake for later processing when the need arises is cost-effective and gives data analysts an unrefined view of the data. The other reasons for creating a data lake are as follows:
- The diverse structure of data in a data lake supports more robust and richer analysis for data analysts.
- There is no requirement to model data into an enterprise-wide schema with a data lake.
- Data lakes offer flexibility in data analysis, with the ability to work across structured and unstructured data, which data warehouses do not provide.
- Artificial intelligence and machine learning can be applied to the data to produce forecasts that benefit the business.
- Using data lakes can give an organization a competitive advantage.
Data Lake Architecture
A data lake architecture is flat to accommodate unstructured data and different data structures from multiple sources across the organization. All data lakes have two components, storage and compute, and they can both be located on-premises or based in the cloud. The data lake architecture can use a combination of cloud and on-premises locations.
It is difficult to predict the volume of data that a data lake will need to accommodate. For this reason, data lake architecture is designed to scale, to as much as an exabyte, which a conventional storage system cannot do. Data should be tagged with metadata as it is ingested into the data lake to ensure future accessibility.
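Tagging data with metadata at ingestion can be sketched as follows. This is a toy in-memory model, assuming a flat store keyed by an identifier and a separate catalog of tags; the `ingest` and `find_by_tag` functions and the sample sources are hypothetical, and a real lake would use an object store and a catalog service instead of dictionaries.

```python
import hashlib
from datetime import datetime, timezone


def ingest(lake, catalog, payload: bytes, source: str, tags: set):
    """Store raw bytes under a content-derived identifier and record metadata."""
    object_id = hashlib.sha256(payload).hexdigest()[:16]
    lake[object_id] = payload  # flat store: identifier -> raw bytes
    catalog[object_id] = {
        "source": source,
        "tags": set(tags),
        "size_bytes": len(payload),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    return object_id


def find_by_tag(catalog, tag):
    """Return the identifiers of all objects that carry the given tag."""
    return [oid for oid, meta in catalog.items() if tag in meta["tags"]]


lake, catalog = {}, {}
csv_id = ingest(lake, catalog, b"id,total\n1,9.5\n", "erp", {"sales", "csv"})
log_id = ingest(lake, catalog, b"GET /home 200", "webserver", {"logs"})
print(find_by_tag(catalog, "sales"))
```

Because the payload is never parsed, any file type can be stored, and retrieval relies entirely on the identifier and the metadata recorded alongside it.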
Below is a concept diagram for a data lake structure:
Data lake software such as Hadoop and Amazon Simple Storage Service (Amazon S3) varies in structure and strategy. This software organizes data in a data lake and makes it easier to access and use. The following features should be incorporated in a data lake architecture to prevent the development of a data swamp and ensure data lake functionality.
- Data profiling tools that give insight into the classification of data objects and support data quality control
- A data classification taxonomy that covers user scenarios, possible user groups, content, and data type
- File hierarchy with naming conventions
- A mechanism that tracks user access to the data lake and generates an alert at the point and time of access
- Data catalog search functionality
- Data security that encompasses data encryption, access control, authentication, and other data security tools to prevent unauthorized access
- Data lake usage training and awareness
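As one illustration of the file hierarchy and naming convention item above, the sketch below builds and validates object keys. The convention itself (a zone, source, and dataset followed by date partitions) and the names `ZONES`, `build_key`, and `follows_convention` are assumptions made for this example, not a standard that any data lake product requires.

```python
import re
from datetime import date

ZONES = ("raw", "curated")  # hypothetical zone names

# Hypothetical convention: zone/source/dataset/year=YYYY/month=MM/day=DD/filename
KEY_PATTERN = re.compile(
    r"^(raw|curated)/[a-z0-9_]+/[a-z0-9_]+/"
    r"year=\d{4}/month=\d{2}/day=\d{2}/[a-z0-9_.\-]+$"
)


def build_key(zone: str, source: str, dataset: str, day: date, filename: str) -> str:
    """Build an object key that follows the convention above."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (
        f"{zone}/{source}/{dataset}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/{filename}"
    )


def follows_convention(key: str) -> bool:
    """Check whether an existing key matches the naming convention."""
    return KEY_PATTERN.match(key) is not None


key = build_key("raw", "crm", "contacts", date(2024, 1, 5), "export_001.json")
print(key)
print(follows_convention("misc/Final Copy (2).csv"))
```

A check like `follows_convention` could run at ingestion time so that files with ad hoc names are flagged before they accumulate and make the lake harder to navigate.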
Hadoop Data Lakes Architecture
We use Hadoop data lake infrastructure as an example. Some data lake architecture providers use a Hadoop-based data management platform consisting of one or more Hadoop clusters. Hadoop uses a cluster of distributed servers for data storage. The Hadoop ecosystem comprises three core elements:
- Hadoop Distributed File System (HDFS) – The storage layer whose function is to store and replicate data across multiple servers.
- Yet Another Resource Negotiator (YARN) – The resource management layer that allocates cluster resources and schedules jobs
- MapReduce – The programming model that splits data into smaller subsets, processes them in parallel across servers, and combines the results
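The MapReduce model can be illustrated with a single-process word count in plain Python. This sketch does not use Hadoop or any of its APIs; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative, and on a real cluster each chunk would be mapped on a different server before the results are grouped and reduced.

```python
from collections import defaultdict


def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands in for a block of data held on a different server.
chunks = ["lake stores raw data", "warehouse stores modeled data", "raw data lake"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts["data"], word_counts["lake"])
```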
Supplementary tools used with Hadoop include Pig, Hive, Sqoop, and Kafka. The tools assist in the processes of ingestion, preparation, and extraction. Hadoop can be combined with cloud enterprise platforms to offer a cloud-based data lake infrastructure.
Hadoop is open source, which makes it less expensive to use. Several ETL tools are available for integration with Hadoop. It is easy to scale and provides faster computation because of its data locality, which has increased its popularity and familiarity among technology users.
Data Lake Key Concepts
Below are some key data lake concepts to broaden and deepen understanding of data lake architecture.
- Data ingestion – The process where data is gathered from multiple data sources and loaded into the data lake. The process supports all data structures, including unstructured data. It also supports batch and one-time ingestion.
- Security – Implementing security protocols for the data lake is an important aspect. It means managing data security along the entire data lake flow, from loading through search, storage, and access. Other facets of data security, such as data protection, authentication, accounting, and access control to prevent unauthorized access, are also paramount to data lakes.
- Data quality – Information in a data lake is used for decision making, which makes it important for the data to be of high quality. Poor quality data can lead to bad decisions, which can be catastrophic to the organization.
- Data governance – The process of administering and managing data integrity, availability, usability, and security within an organization.
- Data discovery – Discovering data is important before data preparation and analysis. It is the process of collecting data from multiple sources and consolidating it in the lake, making use of tagging techniques to detect patterns enabling better data understandability.
- Data exploration – Data exploration starts just before the data analytics stage. It assists in identifying the right dataset for the purpose of the analysis.
- Data storage – Data storage should support multiple data formats and be scalable, quick and easy to access, and cost-effective.
- Data auditing – Facilitates evaluation of risk and compliance and tracks any changes made to crucial data elements, including the identity of who made the changes, how data was changed, and when the changes took place.
- Data lineage – Concerned with the flow of data from its source or origin and its path as it moves within the data lake. Data lineage simplifies error correction in a data analytics process by making it possible to trace data from its source to its destination.
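The auditing and lineage concepts above can be sketched together in a few lines of Python. This is an illustrative in-memory model, not the interface of any governance tool: the `AuditEvent` and `LineageGraph` classes and the dataset names are hypothetical. Each recorded transformation stores who made the change, what was done, and when, along with the datasets it was derived from.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEvent:
    dataset: str
    user: str    # who made the change
    action: str  # how the data was changed
    at: str = field(  # when the change took place (UTC, ISO 8601)
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class LineageGraph:
    def __init__(self):
        self.parents = {}    # dataset -> datasets it was derived from
        self.audit_log = []  # append-only list of AuditEvent

    def record(self, output, inputs, user, action):
        """Register a derived dataset and log the change that produced it."""
        self.parents[output] = list(inputs)
        self.audit_log.append(AuditEvent(output, user, action))

    def upstream(self, dataset):
        """Return all ancestors of a dataset, nearest first."""
        seen, queue = [], list(self.parents.get(dataset, []))
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self.parents.get(current, []))
        return seen


graph = LineageGraph()
graph.record("clean_orders", ["raw_orders"], "ana", "deduplicate")
graph.record("revenue_report", ["clean_orders", "raw_fx_rates"], "bo", "join and aggregate")
print(graph.upstream("revenue_report"))
```

When a figure in `revenue_report` looks wrong, walking `upstream` narrows the search to the datasets that could have introduced the error, and the audit log shows who last changed each of them.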
Benefits of a Data Lake
- A data lake is an agile storage platform that can be easily configured for any given data model, structure, application, or query. Data lake agility enables multiple and advanced analytical methods to interpret the data.
- Being schema on read makes a data lake scalable and flexible.
- Data lakes support a range of queries, from those that require deep analysis by exploring information down to its source to those that require a simple report with summary data. All user types are catered for.
- Most data lake software applications are open source and can be installed on low-cost hardware.
- Schema development is deferred until an organization finds a business case for the data. Hence, no time and costs are wasted on schema development.
- Data lakes offer centralization of different data sources.
- They extract value from all data types and reduce the long-term cost of ownership.
- Cloud-based data lakes are easier and faster to implement, are cost-effective with a pay-as-you-use model, and are easier to scale up as the need arises. They also save on space and real estate costs.
Challenges and Criticism of Data Lakes
- Data lakes are at risk of losing relevance and becoming data swamps over time if they are not properly governed.
- It is difficult to ensure data security and access control as some data is dumped in the lake without proper oversight.
- There is no trail of previous analytics on the data to assist new users.
- Storage and processing costs may increase as more data is added into the lake.
- On-premises data lakes face challenges such as space constraints, hardware and data center setup, storage scalability, cost, and resource budgeting.
Popular Data Lake Technology Vendors
Popular data lake technology providers include the following:
- Amazon S3 – Offers virtually unlimited scalability
- Apache Hadoop – Open-source ecosystem
- Google Cloud Platform (GCP) – Google Cloud Storage
- Oracle Big Data Cloud
- Microsoft Azure Data Lake Storage and Azure Data Lake Analytics
- Snowflake – Processes structured and semi-structured datasets, notably JSON, XML, and Parquet