The term ‘data lake’ has been floating around for the past few years. While discussion has grown as more enterprises begin using data lake services, the definition of the concept and how it can bring value to records management (RM) are still a bit murky.
We’re here to clear up the muddy waters for you. In this article, we explore:
- the origins of the data lake,
- what a data lake is,
- how data lakes compare to data warehouses (they're similar, but don’t get them confused!),
- three common critiques of the concept, and
- how the right data lake solution can transform an organization's RM and bring new insights to it.
Let's dive in.
Related: Want to further explore the records management and overall business value of the data lake? Read our article on why your organization should implement a data lake.
Where did data lakes come from?
Got big data? As described in our blog article on AI, every two days we create as much information as we did from the dawn of civilization up until 2003. Much of this data is unstructured.
In the past two decades, organizations began to find that traditional data storage solutions were no longer sufficient to hold all of their information. A new solution that could hold large volumes of unstructured data was necessary, and thus data lakes were created.
The term itself was coined in 2010 by James Dixon, then-CTO of Pentaho. He used it as a contrast to data marts, which are smaller data repositories with limited storage (a subset of the data warehouse, which we'll explore below). Dixon described the term in the following way:
“If you think of a data mart as a store of bottled water—cleansed and packaged and structured for easy consumption—the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
What is a data lake?
Simply put, a data lake is a type of data repository with a flat architecture that provides capacity for storing large volumes of information. Just like the body of water, it is a reservoir into which all types of data can stream, whether text or image-based, processed or raw, from all different sources.
Data lake solutions are beneficial to records management because they provide organizations with a single source to back up and safeguard their information. Each data element in the lake is tagged with a set of metadata, so users can find whatever is stored and gain the most insight from it.
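To make the idea of a flat, tagged repository more concrete, here is a minimal Python sketch. It is a toy, in-memory illustration only, not how any particular data lake product works; the names `LakeObject`, `FlatLake`, `ingest` and `find`, and the sample tags such as `source` and `kind`, are all hypothetical and chosen just for this example. Requires Python 3.9 or later.

```python
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One item in the lake: raw bytes plus descriptive metadata tags."""
    key: str
    payload: bytes
    tags: dict[str, str] = field(default_factory=dict)


class FlatLake:
    """Hypothetical toy lake: one flat namespace, no folder hierarchy."""

    def __init__(self) -> None:
        self._objects: dict[str, LakeObject] = {}

    def ingest(self, key: str, payload: bytes, **tags: str) -> None:
        # The payload is stored as-is (raw or processed); only the tags describe it.
        self._objects[key] = LakeObject(key, payload, dict(tags))

    def find(self, **criteria: str) -> list[str]:
        # Return the sorted keys of every object whose tags match all criteria.
        return sorted(
            obj.key
            for obj in self._objects.values()
            if all(obj.tags.get(name) == value for name, value in criteria.items())
        )


lake = FlatLake()
lake.ingest("contract-001", b"%PDF...", source="sharepoint", kind="contract")
lake.ingest("scan-17", b"\x89PNG...", source="scanner", kind="invoice")
lake.ingest("email-42", b"Subject: renewal", source="exchange", kind="contract")

# Look up content by what it is, not by where it was filed.
contracts = lake.find(kind="contract")
```

Because nothing is filed into a hierarchy, the tags are the only way back to the content, which is why consistent metadata matters so much in a lake.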
Let’s compare this term to a more traditional data storage repository, the data warehouse.
How does the data lake compare to the data warehouse?
The data lake is often compared to the data warehouse. We’ve created the chart below to outline their differences but will first state: one was not created to replace the other.
This is not a 'data warehouses versus data lakes' discussion, because each serves its own purpose. Organizations should understand their own information management needs before deciding to incorporate either data repository, or both. For example, organizations can use data warehouses to hold smaller amounts of processed data that they refer back to when answering specific business questions, while data lakes are better suited to storing large amounts of unstructured data, processed or unprocessed, to gain clarity and business insights.
| Data Warehouses | Data Lakes |
| --- | --- |
| Centralized data storage | Decentralized data storage |
| Hierarchical structure | Flat architecture |
| Can retain processed data | Can retain all data types |
| Limited data storage | Holds large volumes of information |
| Established, standardized structure | More flexibility |
Items to consider: 3 common data lake critiques
While data lakes have garnered excitement for their storage capabilities, they have not come without their critiques. When considering this data repository, organizations should be aware of the concerns raised about this type of solution. We'll list the top three here, then address how to avoid these challenges and gain the most insight in the section after that.
1. The lake is expansive
The large storage capacity of data lakes is intriguing, but it can also instill fear of a vast source where masses of information will stream in, "sink to the bottom" and be forgotten. We'll discuss how to avoid this risk in the following section.
2. Data quality can vary
Also related to capacity, questions have been raised about the quality of information in a data repository that can hold so much unprocessed data. What if the waters holding business-critical information become murky with all the other information? Once again, we address how to avoid this in the next section.
3. Questions around security
While the governance and security of data warehouses are trusted, there are concerns around these factors for data lakes. Since the latter repository is newer, there is worry that backing up content into this source can leave organizations' information vulnerable.
Gaining the most value from data lakes for your information management
Those three common critiques of data lakes are important to consider: organizations need to be extremely mindful of where they are streaming and storing their information, or they can find themselves with a data swamp, a repository that is neither well protected nor providing them with additional value. Not the crystal blue waters one would hope for.
This reinforces just how critical it is to select a solution that will not only help organizations store and manage their information, but also provide clarity for improving business processes.
How can you ensure this? By storing information in a data lake that is intelligent. A data lake repository like Collabspace stands apart because it uses artificial intelligence to automatically stream and categorize content from systems without impacting use or operation.
Collabspace has basic data lake capabilities, enabling information retention and protection with compliant archive and backup into a ransomware-proof, Write Once Read Many (WORM) compliant storage system. But it is the AI compute (such as deeper search and analytics features) that ensures visibility and avoids staff interruption, unlocking new insights and data intelligence for improved records management and overall business processes.
Conclusion
The data lake is a storage repository that offers a lot of potential to organizations with its ability to hold large volumes of all types of data. This ability should be taken seriously to avoid a data swamp of old information that has been streamed in and forgotten. The best way to get the most value from this repository type is to select a solution where the stream flows both ways: an organization's data is streamed in to be centralized, backed up and secured in one central hub, while the application of AI allows for an outflow of visibility and business insights (we've got a free white paper about achieving data intelligence, if you're curious!).
Curious to learn more? We outline overall business values of the data lake in another article. For specifics around how Collabspace data lake capabilities can bring information clarity, visibility, and insights to your organization, contact us or access our free information pack below:
What do you think? Share your thoughts and questions in the comment section and subscribe to our blog for more content on data lakes, artificial intelligence, information management and more.