A new trend in the Data Science and Data Engineering world is the term “Data Lakes”. According to Wikipedia:
A data lake is a system or repository of data stored in its natural/raw format. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
Data lakes can be thought of as organic, like a lake in nature, because they can store data of any structure (structured, semi-structured, and unstructured).
The main reasons to use a data lake are:
- Increase operational efficiency
- Make data available from departmental silos
- Lower transactional costs
- Offload capacity from databases and warehouses
- Store data without thinking about its structure
The main characteristics of data lakes are:
- They are data agnostic, meaning they are not limited to storing just one data type. This is a main difference from data warehouses, which expect structured data.
- They are “future proof”, which means that you may not have a specific question to answer right now, but when your manager asks you a question in the future, you will most likely be able to answer it because you have the data in a raw format.
- They have two kinds of processing: processing that happens before or while ingesting data, and processing that happens after the data has been stored, such as cleaning, aggregating, transforming, merging with other datasets, and so on.
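The two kinds of processing can be sketched in a few lines of Python. This is only an illustration under assumptions: the in-memory list `lake`, the record fields `sensor` and `value`, and both function names are hypothetical and not part of any real data lake product.

```python
import json
from collections import defaultdict


def ingest(raw_lines, store):
    """Processing during ingestion: tag each record, keep the payload raw."""
    for seq, line in enumerate(raw_lines):
        store.append({"seq": seq, "raw": line})


def process_stored(store):
    """Processing after storage: clean the raw records and aggregate them."""
    totals = defaultdict(float)
    for entry in store:
        try:
            record = json.loads(entry["raw"])
            totals[record["sensor"]] += float(record["value"])
        except (ValueError, KeyError, TypeError):
            # Cleaning step: skip records that are malformed or incomplete.
            continue
    return dict(totals)


lake = []  # hypothetical stand-in for the lake's storage layer
ingest(
    [
        '{"sensor": "a", "value": 1.5}',
        "not json at all",
        '{"sensor": "a", "value": 2}',
        '{"sensor": "b", "value": "4"}',
    ],
    lake,
)
summary = process_stored(lake)
print(summary)
```

Note that the malformed line is still stored. It is only dropped later, by the processing step that runs over the saved data.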
The four main components of data lakes are:
- Ingest and Store
- Catalog and Search
- Process and Serve
- Protect and Secure
A data warehouse is usually a database optimized to perform analytical queries that lead to insights. But because it usually operates as an analytical database, you need to create tables and define the table structure before adding your data to the data warehouse. When you create these tables, you have to set the table columns and data types, in other words, a data schema, which usually requires the data to be structured. When the schema has to be defined before you write the data, you have what we call a schema-on-write architecture. Although schema-on-write is great for data normalization, because it will reject data that doesn’t fit that specific format, it’s not ideal for flexibility, which is where data lakes really shine.
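As a rough illustration of schema-on-write, the sketch below checks every row against a schema before storing it and rejects anything that does not fit. It is plain Python, not a real warehouse engine, and the names `SCHEMA`, `SchemaError`, and `write_row` are assumptions made up for this example.

```python
# Hypothetical table schema: column name -> required Python type.
SCHEMA = {"order_id": int, "customer": str, "amount": float}


class SchemaError(ValueError):
    """Raised when a row does not match the table schema."""


def write_row(table, row):
    """Validate the row against SCHEMA first; store it only if it conforms."""
    if set(row) != set(SCHEMA):
        raise SchemaError(f"expected columns {sorted(SCHEMA)}, got {sorted(row)}")
    for column, expected_type in SCHEMA.items():
        if not isinstance(row[column], expected_type):
            raise SchemaError(f"column {column!r} must be {expected_type.__name__}")
    table.append(row)


orders = []
write_row(orders, {"order_id": 1, "customer": "Ana", "amount": 19.9})

try:
    # The amount is a string, so the write is rejected.
    write_row(orders, {"order_id": 2, "customer": "Bo", "amount": "cheap"})
except SchemaError as err:
    print("rejected:", err)

print(len(orders))
```

The rejected row never reaches the table, which is the trade-off described above: consistent data at the cost of flexibility.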
Data lakes are what we call schema-on-read, and that’s the first fundamental difference between data lakes and data warehouses. Data lakes can handle unstructured data and primarily operate in a schema-on-read fashion, which means that you don’t need to concern yourself with the data schema while ingesting data into your data lake. That lets you deal with the schema only when you read the data for some future processing, hence the name schema-on-read.
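A minimal sketch of schema-on-read, assuming the lake is just a Python list of raw strings: ingestion accepts anything unchanged, and a schema is applied only when a reader asks for a particular view of the data. The names `raw_store`, `ingest`, and `read_orders`, and the order fields, are hypothetical.

```python
import json

raw_store = []  # hypothetical lake storage: raw payloads, no schema enforced


def ingest(blob):
    """Store the payload exactly as received, whatever its structure."""
    raw_store.append(blob)


def read_orders(store):
    """Apply an 'order' schema at read time, skipping what does not fit."""
    for blob in store:
        try:
            record = json.loads(blob)
        except json.JSONDecodeError:
            continue  # not JSON (for example a log line); ignored by this reader
        if not isinstance(record, dict):
            continue
        try:
            yield {
                "order_id": int(record["order_id"]),
                "amount": float(record["amount"]),
            }
        except (KeyError, ValueError, TypeError):
            continue  # JSON, but not shaped like an order


ingest('{"order_id": "7", "amount": "12.50", "note": "gift"}')
ingest("2024-01-01 12:00:00 INFO service started")
ingest('{"user": "ana", "event": "login"}')

orders = list(read_orders(raw_store))
print(orders)
```

All three payloads are stored, and a different reader could later apply a different schema to the same raw data, for example one that extracts login events.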
Another difference between data lakes and data warehouses is that data warehouses mostly use SQL as the query language. That limits what you can do, although some engines support user-defined functions and other features that extend it a little bit.
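To give a feel for what a user-defined function adds to SQL, here is a small sketch using Python's built-in `sqlite3` module. SQLite is not a data warehouse and is used here only as a convenient stand-in. The table `users` and the function `domain_of` are made up for the example.

```python
import sqlite3


def domain_of(email):
    """Return the lower-cased domain part of an email address."""
    return email.split("@")[-1].lower() if email else None


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL queries can call it by name.
conn.create_function("domain_of", 1, domain_of)

conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("ana@Example.com",), ("bo@example.com",), ("cy@other.org",)],
)

rows = conn.execute(
    "SELECT domain_of(email) AS d, COUNT(*) FROM users GROUP BY d ORDER BY d"
).fetchall()
print(rows)
conn.close()
```

The query itself is still plain SQL. The user-defined function only fills in logic that the SQL dialect does not offer out of the box.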
Finally, another difference is that while data warehouses work with structured data only, data lakes work with unstructured and structured data natively.