June 27, 2016 \ Ananth TM Data Lake implementation “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” -James Dixon, the founder and CTO of Pentaho “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.” -Unkown source The term data lake is regularly connected with Hadoop-oriented object storage. In such a situation, an entity’s data is initially stacked into the Hadoop platform, and afterward business analytics and data mining instruments are connected to the data where it lives on Hadoop’s bunch hubs of product PCs. Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. Data Lake and Data warehouse The commonality between the two is that they are both data storage repositories. Data warehouse Vs Data Lakes Structured, processed DATA Structured/semi-structured/unstructured/raw Schema on write PROCESSING Schema on read Expensive for large volumes STORAGE Designed for low cost storage Less agile, fixed configuration AGILITY Highly agile, configure and reconfigure as needed Mature SECURITY Maturing Business professionals USERS Data scientists Gaps in Data Lake Concept Data lakes convey significant dangers. The most critical is the inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any data, without oversight or administration. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a information swamp. What’s more- without metadata, every subsequent utilization of data means analysts start from the beginning. Another danger is security and access control. Data can be placed into the data lake with no oversight of the contents. Numerous data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security abilities of central data lake technologies are still embryonic. These issues will not be addressed if left to non-IT personnel. At last, performance aspects should not be neglected. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized and purpose-built infrastructure.