Data Lake implementation

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

- James Dixon, founder and CTO of Pentaho

“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”

- Unknown source

The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization's data is first loaded into the Hadoop platform, and business analytics and data mining tools are then applied to the data where it resides, on Hadoop's cluster nodes of commodity computers.

Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
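To make "the schema is not defined until the data is queried" concrete, here is a minimal sketch of schema on read in plain Python. The sample records, the field names, and the helper read_with_schema are hypothetical and chosen only for illustration; a real lake would typically do this with a query engine over files rather than an in-memory list.

```python
import json

# Raw events land in the lake exactly as the sources emitted them.
# (Hypothetical sample records; note that the fields differ per record.)
RAW_LINES = [
    '{"user": "alice", "amount": "19.99", "ts": "2016-03-01"}',
    '{"user": "bob", "amount": 5, "country": "DE"}',
    '{"sensor": "t-17", "reading": 21.4}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    Keeps only the records that contain every requested field and casts
    each field to the requested type. Nothing was validated on write.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(name in record for name in schema):
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
    return rows


# Two consumers read the same raw data through two different schemas.
purchases = read_with_schema(RAW_LINES, {"user": str, "amount": float})
sensors = read_with_schema(RAW_LINES, {"sensor": str, "reading": float})
```

The same raw lines serve both consumers, and neither schema existed when the data was stored. In a schema-on-write system, the third record would have been rejected or reshaped at load time.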

Data Lake and Data Warehouse

The commonality between the two is that they are both data storage repositories.

| Aspect     | Data warehouse                  | Data lake                                         |
|------------|---------------------------------|---------------------------------------------------|
| Data       | Structured, processed           | Structured / semi-structured / unstructured / raw |
| Processing | Schema on write                 | Schema on read                                    |
| Storage    | Expensive for large volumes     | Designed for low-cost storage                     |
| Agility    | Less agile, fixed configuration | Highly agile, configure and reconfigure as needed |
| Security   | Mature                          | Maturing                                          |
| Users      | Business professionals          | Data scientists                                   |

Gaps in Data Lake Concept


Data lakes carry substantial risks. The most important is the inability to determine data quality, or to trace the lineage of findings by other analysts or users who previously found value in the same data in the lake. By definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. What is more, without metadata, every subsequent use of the data means analysts start from scratch.
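As an illustration of what "descriptive metadata and a mechanism to maintain it" could look like at its simplest, here is a sketch of a tiny in-memory catalog that records a description, source, owner, and lineage for each dataset. The class names, fields, and dataset paths are all hypothetical; production lakes usually rely on a dedicated catalog service instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    """Descriptive metadata for one dataset in the lake (hypothetical fields)."""
    path: str
    description: str
    source: str
    owner: str
    derived_from: List[str] = field(default_factory=list)


class Catalog:
    """Minimal in-memory metadata catalog keyed by dataset path."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        # Refuse datasets whose parents are unknown, so lineage never dangles.
        for parent in entry.derived_from:
            if parent not in self._entries:
                raise KeyError(f"unknown parent dataset: {parent}")
        self._entries[entry.path] = entry

    def lineage(self, path: str) -> List[str]:
        """Return every upstream dataset path, nearest ancestors first."""
        seen: List[str] = []
        queue = list(self._entries[path].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                queue.extend(self._entries[parent].derived_from)
        return seen


catalog = Catalog()
catalog.register(DatasetEntry("raw/clicks", "Raw click events", "web logs", "ingest team"))
catalog.register(DatasetEntry("curated/sessions", "Sessionized clicks", "batch job",
                              "analytics", derived_from=["raw/clicks"]))
catalog.register(DatasetEntry("reports/daily", "Daily traffic report", "batch job",
                              "analytics", derived_from=["curated/sessions"]))
upstream = catalog.lineage("reports/daily")
```

With entries like these, an analyst who finds reports/daily can see where it came from and who owns it, instead of starting from scratch.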

Another risk is security and access control. Data can be placed into the data lake with no oversight of its contents. Many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security capabilities of core data lake technologies are still immature. These issues will not be addressed if left to non-IT personnel.

Finally, performance should not be neglected. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized, purpose-built infrastructure.
