Data Lake is one of the newer buzzwords that is fast become lingua franca in the Big Data world. It was first described by Pentaho CTO James Dixon in his 2010 blog entry: "If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
In Big Data parlance, a data lake refers to data that is stored in its original form – much of it unstructured – until it is needed to be processed. But because it is so new and remains unchartered, it is fast becoming unchartered territory for which vendors, consultants and industry observers are laying claim defining data lake according to their bias.
It is in this scenario that one industry group – Gartner – warns that several vendors are marketing data lakes as an essential component to capitalize on Big Data opportunities, but there is little alignment between vendors about what comprises a data lake, or how to get value from it.
"In broad terms, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format," said Nick Heudecker, research director at Gartner. "The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organization."
However, while the marketing hype suggests audiences throughout an enterprise will leverage data lakes, this positioning assumes that all those audiences are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata.
"The need for increased agility and accessibility for data analysis is the primary driver for data lakes," said Andrew White, vice president and distinguished analyst at Gartner. "Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organization, the proposition of enterprise-wide data management has yet to be realized."
Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured. The data lake concept hopes to solve two problems, one old and one new. The old problem it tries to solve is information silos. Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and license reduction.
The new problem data lakes conceptually tackle pertains to Big Data initiatives. Big Data projects require a large amount of varied information. The information is so varied that it's not clear what it is when it is received, and constraining it in something as structured as a data warehouse or relational database management system (RDBMS) constrains future analysis.
"Addressing both of these issues with a data lake certainly benefits IT in the short term in that IT no longer has to spend time understanding how information is used — data is simply dumped into the data lake," said White. "However, getting value out of the data remains the responsibility of the business end user. Of course, technology could be applied or added to the lake to do this, but without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools or information silos all in one place."
Data lakes therefore carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.
Another risk is security and access control. Data can be placed into the data lake with no oversight of the contents. Many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security capabilities of central data lake technologies are still embryonic. These issues will not be addressed if left to non-IT personnel.
Finally, performance aspects should not be overlooked. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized and purpose-built infrastructure. For these reasons, Gartner recommends that organizations focus on semantic consistency and performance in upstream applications and data stores instead of information consolidation in a data lake.
"Data lakes typically begin as ungoverned data stores," said Heudecker. "Meeting the needs of wider audiences require curated repositories with governance, semantic consistency and access controls — elements already found in a data warehouse.
"The fundamental issue with the data lake is that it makes certain assumptions about the users of information," said Heudecker. "It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without 'a priori knowledge' and that they understand the incomplete nature of datasets, regardless of structure."
While these assumptions may be true for users working with data, such as data scientists, the majority of business users lack this level of sophistication or support from operational information governance routines. Developing or acquiring these skills or obtaining such support on an individual basis, is both time-consuming and expensive, or impossible.
"There is always value to be found in data but the question your organization has to address is this — do we allow or even encourage one-off, independent analysis of information in silos or a data lake, bringing said data together, or do we formalize to a degree that effort, and try to sustain the value-generating skills we develop?" said White. "If the option is the former, it is quite likely that a data lake will appeal. If the decision tends toward the latter, it is beneficial to move beyond a data lake concept quite quickly in order to develop a more robust logical data warehouse strategy."