Data Lake implementation

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

-James Dixon, founder and CTO of Pentaho

“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”

-Unknown source

The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop’s cluster nodes of commodity computers.

Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.

Data Lake and Data warehouse

The commonality between the two is that they are both data storage repositories.

Data warehouse                  | Data lake
Structured, processed           | Raw, in native format (structured, semi-structured, unstructured)
Schema on write                 | Schema on read
Expensive for large volumes     | Designed for low-cost storage
Less agile, fixed configuration | Highly agile, configure and reconfigure as needed
Business professionals          | Data scientists
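
To make the "schema on write" versus "schema on read" distinction concrete, here is a minimal sketch in plain Python. It is an illustration only: the sample records, the field names (user, amount) and the helper functions are all hypothetical, and in-memory lists stand in for the warehouse and the lake.

```python
import json

# Raw events as they might arrive; nothing guarantees they are well formed.
RAW_LINES = [
    '{"user": "alice", "amount": "12.50", "ts": "2015-03-01"}',
    '{"user": "bob", "amount": 7, "country": "DE"}',
    'not even json',
]


def write_to_warehouse(line, table):
    """Schema on write: parse and convert before storing; bad input raises."""
    record = json.loads(line)
    row = {"user": str(record["user"]), "amount": float(record["amount"])}
    table.append(row)


def write_to_lake(line, lake):
    """Schema on read: store the raw line untouched."""
    lake.append(line)


def read_amounts_from_lake(lake):
    """Apply a schema only at query time; skip lines that do not fit it."""
    for line in lake:
        try:
            record = json.loads(line)
            yield str(record["user"]), float(record["amount"])
        except (ValueError, KeyError, TypeError):
            continue


lake, warehouse, rejected = [], [], []
for line in RAW_LINES:
    write_to_lake(line, lake)  # the lake accepts everything
    try:
        write_to_warehouse(line, warehouse)
    except (ValueError, KeyError, TypeError):
        rejected.append(line)  # the warehouse refuses what breaks its schema

print(len(lake), len(warehouse), len(rejected))
print(list(read_amounts_from_lake(lake)))
```

The lake keeps all three lines, including the malformed one, while the warehouse keeps only the two that fit its schema. The cost of that flexibility shows up in the read function, which has to cope with bad records every time the data is queried.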

Gaps in Data Lake Concept

Data lakes carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by other analysts or users who have previously found value in the same data in the lake. By definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. What’s more, without metadata, every subsequent use of the data means analysts start from scratch.
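
One way to picture the "descriptive metadata and a mechanism to maintain it" mentioned above is a small catalog that records a description and lineage for every dataset as it is ingested. The sketch below is an assumption-laden illustration, not a reference to any particular product: the Catalog class, the describe_dataset helper, and every field name are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def describe_dataset(name, payload, source, owner, description):
    """Build a metadata record at ingest time (all fields are hypothetical)."""
    return {
        "name": name,
        "source": source,
        "owner": owner,
        "description": description,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "derived_from": [],
    }


class Catalog:
    """In-memory catalog that refuses undocumented datasets and tracks lineage."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        if not entry["description"].strip():
            raise ValueError("refusing to register an undocumented dataset")
        self._entries[entry["name"]] = entry

    def record_lineage(self, child, parent):
        if child not in self._entries or parent not in self._entries:
            raise KeyError("both datasets must be registered first")
        self._entries[child]["derived_from"].append(parent)

    def lineage(self, name):
        """Return every ancestor of a dataset, nearest first."""
        seen = []
        queue = list(self._entries[name]["derived_from"])
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                queue.extend(self._entries[parent]["derived_from"])
        return seen


catalog = Catalog()
catalog.register(describe_dataset(
    "raw_clicks", b"raw bytes", "web logs", "analytics", "Unprocessed click events"))
catalog.register(describe_dataset(
    "sessions", b"derived bytes", "derived", "analytics", "Clicks grouped into sessions"))
catalog.record_lineage("sessions", "raw_clicks")
print(catalog.lineage("sessions"))
```

With even this much in place, an analyst who finds the sessions dataset can see where it came from and who owns it, instead of starting from scratch.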

Another risk is security and access control. Data can be placed into the data lake with no oversight of the contents. Many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security capabilities of core data lake technologies are still immature. These issues will not be addressed if left to non-IT personnel.

Finally, performance should not be neglected. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized and purpose-built infrastructure.

Hadoop! – What is Hadoop?

Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
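
The programming model Hadoop is best known for is MapReduce. The following is a single-process sketch of the idea in plain Python, counting words across documents. It does not use the Hadoop API, and the function names are hypothetical. On a real cluster the map and reduce steps would run in parallel on many machines, and the framework would handle the grouping step between them.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count([
    "the lake holds raw data",
    "the warehouse holds processed data",
])
print(counts)
```

Because each call to map_phase looks at only one document and each call to reduce_phase looks at only one key, the work can be split across machines without the pieces needing to coordinate with each other.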

The project includes these subprojects:
Hadoop Common: The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™: A high-performance coordination service for distributed applications.

Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics.
Scalable–
New nodes can be added as needed, without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
Cost effective–
Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
Flexible–
Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Fault tolerant–
When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

IBM and Hadoop

Eighty percent of the world’s data is unstructured, and most businesses don’t even attempt to use this data to their advantage. Imagine if you could afford to keep all the data generated by your business. Imagine if you had a way to analyze that data. IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. With built-in analytics, extensive integration capabilities, and the reliability, security and support that you require, IBM can help put your big data to work for you.