Whether you use a data lake or a data warehouse will shape how your organization manages and applies its data. As predicted, data volumes continue to grow ever larger in 2020. The continuing use of big data will affect the way organizations acquire, manage and use business informatics and intelligence. Some big data trends involve new concepts, while others mix or merge different big data computing technologies. Finding storage for the volumes of data being generated every second is of utmost importance for enterprises involved with big data. That is why data professionals have to consider carefully which data repository systems to employ. This article aims to give non-technical users a basic understanding of the differences between data lakes and data warehouses.
What is a data warehouse?
A data warehouse stores data in an organized manner. The data is archived and ordered in a defined way. This means that a use for the data going into the warehouse has been identified, and that decisions have already been made about what data to include in the warehouse.
When building a data warehouse, a significant amount of effort goes into analyzing data sources and understanding business processes during the initial stages.
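As a rough illustration of what "ordered in a defined way" can mean in practice, the sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse. The table name `sales`, its columns and the sample rows are all hypothetical and chosen only for this example. The point is that the structure is agreed before any data is loaded, and data that does not fit is turned away.

```python
import sqlite3

# Hypothetical warehouse table: the structure is decided before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)

# Made-up incoming rows; the third one has no region.
incoming = [
    (1, "north", 120.0),
    (2, "south", 75.5),
    (3, None, 40.0),
]

loaded, rejected = [], []
for row in incoming:
    try:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)
        loaded.append(row)
    except sqlite3.IntegrityError:
        # The row does not fit the agreed structure, so it is not stored.
        rejected.append(row)

# A report that the table was designed to answer.
total_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
```

Because the structure is fixed up front, the report at the end is simple to write, but anything that was not planned for never makes it into the table.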
What is a data lake?
A data lake holds data in an unstructured way, in its rawest form. A data lake accepts and stores all data from all sources and supports all data types. There is no hierarchy or organization among the individual pieces of data. Finally, the schema, that is, the way the data is organized in a database, is applied only when the data is ready to be used.
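The idea of applying a schema only when the data is used can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular data lake product works: the raw records, the `read_with_schema` helper and the field names are all hypothetical.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (one JSON string each).
raw_lines = [
    '{"source": "web_log", "path": "/home", "status": 200}',
    '{"source": "sensor", "device": "t-17", "temp_c": 21.5}',
    '{"source": "web_log", "path": "/cart", "status": 500}',
]


def read_with_schema(lines, source, fields):
    """Apply a schema at read time: keep one source and pick only the named fields."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("source") == source:
            rows.append({name: record.get(name) for name in fields})
    return rows


# Two different "schemas" applied to the same untouched raw data.
web_rows = read_with_schema(raw_lines, "web_log", ["path", "status"])
sensor_rows = read_with_schema(raw_lines, "sensor", ["device", "temp_c"])
```

Nothing about the stored records had to be decided in advance. Each reader chooses the structure it needs at the moment it reads the data.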
The key differences between data lakes and data warehouses
The major differences between data lakes and data warehouses lie in the data each system supports, the users it serves, how readily it adapts to change, how quickly it yields insights, and security.
1. Data lakes support all kinds of data
Data warehouses generally consist of data extracted from transactional systems, made up of quantitative metrics and the attributes that describe them. The warehouse includes only the data that is needed for reporting or to answer specific business requirements.
Data lakes retain all data, whether it’s structured, semi-structured or unstructured (raw). It’s possible that some of the data in a data lake will never be used. The data is kept in its raw form and is transformed only when it’s ready for use. For example, non-traditional data sources such as web server logs, sensor data, social network activity, text and images are typically left out of a warehouse if they don’t answer a specific business case or requirement. New uses for these data types continue to be found, but consuming and storing this data in a warehouse can be expensive and difficult. That’s where a data lake can be useful.
2. Data lakes support all users
There are various types of ‘data users’ within an enterprise. The majority of users are ‘operational’. This kind of user needs data that is well structured and easy to use and understand, because it has to answer their specific business questions or requirements. The operational user gets their reports, sees their key performance metrics or slices the same set of data in a spreadsheet every day. The next type of user, roughly 10%, is the ‘analyst’, who does more analysis on the data. These users often rely on source systems, as well as data from outside the organization, with the main objective of creating reports. The data warehouse is ideal for operational and analyst users. Data warehouses are used by specific business users to report on the data and extract a particular meaning that was defined when the data warehouse was set up.
Finally, the last set of users, typically 5%, are those who require deeper analysis and research from the data. Data scientists are typically the ones who access the data in data lakes. They may create totally new data sources based on research. The data lake allows data scientists to mash up many different types of data and come up with entirely new questions to be answered.
Data lakes support all kinds of users. Data scientists can go to the lake and work with very large and varied data sets, while other users make use of more structured views of the data provided for their specific requirements.
3. Data lakes are more agile and adaptable
A lot of initial planning and data design decisions go into getting the right data warehouse structure because a lot of business processes are tied to the warehouse. As a result, data warehouses are more cumbersome to change. A good warehouse design can adapt to change, but because of the complexity of the data loading process and the work done to make analysis and reporting easy, any change will consume developer resources and time.
Data lakes are significantly more agile and adaptable to change. Because the data lake lacks structure, it’s relatively easy to make changes to models and queries. Data lakes are more flexible and can be configured and reconfigured as necessary based on the job.
4. Faster insights from the data
Many business questions can’t wait for the data warehouse team to adapt their system to answer them. The ever-increasing need for faster answers is what has given rise to the concept of self-service business intelligence. Data lakes can offer faster insights. The data lake allows any user to access the data before it has been transformed, cleansed and structured. This leaves users in the driver’s seat to explore, research, analyze and use the data as they see fit. This is also reflected in the growing number of data scientist and other data-related roles within enterprises.