Is Data Lake and Big Data the same?

Posted 2020-04-07 22:46

I am trying to understand whether there is a real difference between a data lake and Big Data. If you look at the concepts, both describe a big repository that stores information until it becomes necessary. So when can we say that we are using Big Data rather than a data lake?

Thanks in advance

3 Answers
Rolldiameter
#2 · 2020-04-07 23:25

Big Data and Data Lake are two interrelated terms, but they have completely different meanings, which is the main reason people often confuse them. So let's take a brief look at the differences between the two.

BIG DATA As the name itself says, Big Data is simply data that is humongous in size. Data on the order of petabytes or more is considered Big Data. But size is not the only parameter that defines it: the sources generating the data, the different formats it comes in, and the speed at which it is generated all combine to define Big Data. In the simplest of words, Big Data is a huge amount of DATA. That's it.

DATA LAKE A data lake is a repository for Big Data. It stores data of all types, i.e. structured, unstructured, and semi-structured, generated from different sources, and it stores that data in its rawest form. A data lake is different from a data warehouse: data warehouses store data in a well-structured form. Data present in a data lake may or may not be utilized in the future, but the data in a data warehouse is meant for utilization, since all the irrelevant data has already been disposed of.

Big Data is huge data and data lake is the storehouse for it.

I hope this helps.

不美不萌又怎样
#3 · 2020-04-07 23:39

I can't say I've come across the term 'big repository' before, but to answer the original question: no, data lake and big data are not the same. In fairness, both terms are thrown around a lot and the definitions vary depending on who you ask, but I'll try to give it a shot:


Big Data

Is used to describe both the technology ecosystem around, and to some extent the industry that deals with, data that is in some way too big or too complex to be conveniently stored and/or processed by traditional means.

Sometimes this can be a matter of sheer data volume: once you get into the 100s of terabytes or petabytes, your good old fashioned RDBMS databases tend to throw in the towel, and we are forced to spread our data across many disks, not just one large one. And at those volumes we'll want to parallelize our workloads, leading to things like MPP databases, the Hadoop ecosystem, and DAG-based processing.
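The divide-and-aggregate idea behind all of those systems can be sketched in plain Python. This is only a toy, single-machine illustration (the chunking and worker counts are arbitrary choices for the example); MPP databases and Hadoop-style frameworks apply the same pattern across many machines and disks:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker aggregates only its own partition of the data.
    return sum(chunk)

def parallel_total(values, workers=4):
    # Split the data into one partition per worker, aggregate each
    # partition independently, then combine the partial results:
    # a miniature map/reduce. Real systems do this across machines,
    # not threads, but the shape of the computation is the same.
    size = max(1, len(values) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_total(list(range(1_000))))  # 499500
```

The key property is that no single worker ever needs to see the whole dataset, which is exactly what makes the approach scale past the capacity of one node.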

However, volume alone does not tell the whole story. A popular definition of Big Data is described by the so-called '4 Vs': Volume, Variety, Velocity, and Veracity. In a nutshell:

  • Volume - as mentioned above, refers to the difficulty caused by the size of the data

  • Variety - refers to the inherent complexity of dealing with disparate types of data; some of your data will be structured (think SQL data tables), while other data might be either semi-structured (XML documents) or unstructured (raw image files), and the technology to deal with this variety is nontrivial

  • Velocity - refers to the velocity with which new data may be generated; when collecting real time events like IoT data, or web traffic, or financial transactions, or database changes, or anything else that happens in real time, the 'velocity' of data flowing into (and in many cases, out of) your systems, can easily exceed the capabilities of traditional database technologies, necessitating some sort of scalable message bus (Kafka) and possibly a Complex Event Processing framework (such as Spark Streaming or Apache Flink)

  • Veracity - the final 'V', refers to the added complexity of dealing with data which often comes from sources outside of your control, and which may contain data which is invalid, erroneous, malicious, malformed, or all of the above. This adds a need for data validation, data quality checking, data normalization, and more.
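As a toy illustration of the veracity point above, a minimal validation pass might quarantine suspect records rather than silently dropping them. The field names and rules here are made up for the example; real pipelines typically use schema-validation tooling rather than hand-written checks:

```python
def validate(record):
    # Hypothetical rules for an event stream: every event needs a
    # positive integer user_id and a non-negative numeric amount.
    return (
        isinstance(record.get("user_id"), int) and record["user_id"] > 0
        and isinstance(record.get("amount"), (int, float)) and record["amount"] >= 0
    )

def split_by_veracity(records):
    # Separate clean records from suspect ones; the rejects are kept
    # (not discarded) so they can be inspected and repaired later.
    clean = [r for r in records if validate(r)]
    rejects = [r for r in records if not validate(r)]
    return clean, rejects

events = [
    {"user_id": 1, "amount": 9.99},
    {"user_id": -5, "amount": 3.50},   # invalid id
    {"user_id": 2, "amount": "free"},  # malformed amount
]
clean, rejects = split_by_veracity(events)
print(len(clean), len(rejects))  # 1 2
```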

In this definition, 'big data' is data which, due to the particular challenges associated with the 4 V's, is unfit for processing with traditional database technologies; while 'big data tools' are tools which are specifically designed to deal with those challenges.


Data Lake

In contrast, Data Lake is generally used as a term to describe a certain type of file or blob storage layer that allows storage of practically unlimited amounts of structured and unstructured data as needed in a big data architecture.

Some companies will use the term 'Data Lake' to mean not just the storage layer, but also all the associated tools, from ingestion, ETL, wrangling, machine learning, analytics, all the way to datawarehouse stacks and possibly even BI and visualization tools. As a big data architect however, I find that use of the term confusing and prefer to talk about the data lake and the tooling around it as separate components with separate capabilities and responsibilities. As such, the responsibility of the Data Lake is to be the central, high-durability store for any type of data that you might want to store at rest.
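That "central store for any type of data at rest" responsibility can be sketched with a local directory standing in for durable object storage. The source/date partition layout below is just one common convention, not a standard, and the paths are invented for the example:

```python
import json
import pathlib
from datetime import date

def land_raw(lake_root, source, payload_bytes, filename):
    # Write the payload exactly as received -- no parsing, no schema,
    # no cleansing -- under a source/date partition, mimicking object
    # store key layouts like s3://lake/raw/<source>/<yyyy-mm-dd>/<file>.
    target_dir = pathlib.Path(lake_root) / "raw" / source / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload_bytes)
    return target

# Structured and unstructured payloads land side by side, untouched.
root = "/tmp/demo_lake"
land_raw(root, "orders", json.dumps({"id": 1, "total": 42}).encode(), "orders-0001.json")
land_raw(root, "scans", b"\x89PNG...raw image bytes...", "scan-0001.png")
```

The point of the sketch is what the lake does *not* do: it never interprets the bytes it stores, which is precisely what keeps ingestion decoupled from the ETL, ML, and analytics tooling downstream.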

By most accounts, the term 'data lake' was coined by James Dixon, Founder and CTO of Pentaho, who describes it thus:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Amazon Web Services defines it on their page 'What Is A Data Lake':

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

From Wikipedia:

A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning.

And finally Gartner:

A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse).

On on-premises clusters, the data lake usually refers to the main storage on the cluster, in the distributed file system, usually HDFS, though other file systems exist, such as GFS used at Google or the MapR File system on MapR clusters.

In the cloud, data lakes are generally not stored on clusters, since it's just not cost effective to keep a cluster running at all times, but rather on durable cloud storage, such as Amazon S3, Azure ADLS, or Google Cloud Storage. Compute clusters can then be launched on demand and connect seamlessly to the cloud storage to run transformations, machine learning, analytical jobs, etc.


I hope that was helpful and I wish you the best,

该账号已被封号
#4 · 2020-04-07 23:44

Big Data is just a term to encapsulate the massive amounts of data that are now being generated. It doesn't refer to anything specific or to any specific amount of data.

Data Lake to me = Schema on Read. Data that is unstructured and dumped to object storage or similar without an associated schema.
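Schema-on-read can be sketched in a few lines: raw JSON lines are stored untouched, and a schema (here a hypothetical, hand-written one) is imposed only when a consumer reads the data:

```python
import json

# Raw lines land in the lake as-is; nothing validated them on write.
raw_lines = [
    '{"user": "ann", "clicks": "7"}',
    '{"user": "bob"}',
]

def read_with_schema(lines):
    # The schema is applied at read time: pick fields, coerce types,
    # fill defaults. A different consumer could apply a completely
    # different schema to the very same raw bytes.
    for line in lines:
        raw = json.loads(line)
        yield {"user": str(raw.get("user", "")), "clicks": int(raw.get("clicks", 0))}

print(list(read_with_schema(raw_lines)))
# [{'user': 'ann', 'clicks': 7}, {'user': 'bob', 'clicks': 0}]
```

Contrast this with schema-on-write (the data warehouse approach), where the second, incomplete record would have been rejected or fixed before it was ever stored.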
