Data Lakes Vs Data Warehouses: How to Choose

Andrew Rosa By: Andrew Rosa

In the world of data storage and analytics, many terms are thrown around and often confused for one another. Data Warehouse and Data Lake for example are very similar, and it is easy to mistake them. We hope that this post will help you to understand the similarities and differences so that you can choose the appropriate data storage to drive business insights as well as other analytics.

Understanding the Two Types of Data
Before we dive into data warehouses and data lakes, it’s important to understand two broad types of data: structured and unstructured data. Structured data is what you find in relational databases, and sometimes in excel spreadsheets. It has a schema, meaning that entries in one table are linked to entries in other tables through a planned-out model. Usually this data is contained in neat tables, organized in set rows and columns, or organized with nodes and edges in the case of a graph database. Unstructured data is an umbrella term for many things that may contain or be used as data. An image, sound clip, video file, pdf document, or word document are a few examples of unstructured data.

Data Warehouses
Data Warehouses are built to store structured data primarily for the purpose of analytics and occasionally for use in software applications. The data stored in a data warehouse will come from multiple databases, usually application databases. Application databases are populated with transactional data. When the data is extracted from these sources, it’s often transformed and aggregated to save space and then loaded into the data warehouse. This process is referred to as an extract, transform, load (ETL) process. The data transformation will take the transactional data and either normalize it or de-normalize it. Normalizing data is done to reduce the amount of memory the data takes in the data warehouse. Data is separated out into dimensions and facts. In the case of a normalized data warehouse, the dimension tables hold distinct categorical entities with no repetition creating single points of truth that can be utilized for future application. Fact tables contain identification columns that link back to dimension tables as well as measurements.

In a de-normalized data warehouse, you will still have dimension and fact tables, but entities might be repeated. This type structure takes a bit more memory for storage, but the advantage is that it’s better optimized for querying because fewer table joins are needed. Both normalized and de-normalized data warehouses are suited for driving analytics and analytic dashboards.

Data Lakes
Data Lakes can contain both structured and unstructured data. They can be thought of as a repository that can hold many kinds of objects like text documents, web server logs, JSON data, or images. You can, but don’t necessarily have to, ETL data to be stored in a data lake, or load and keep it in its raw transactional structure. In a data lake, you are storing many kinds of objects, and those objects are being indexed so that the most recent form can be used. Meta data is also being cataloged so that you can organize and manage which objects “live” in your data lake.

Data stored in data lakes doesn’t always have an upfront purpose; you might start collecting some form of data but not have a use for it. You would store it in a data lake because you think that it could be useful sometime in the future. Where data warehouses have processes that prepare the data coming from the source before storage, data lakes a have processes that prepare data after the storage but before arriving to the user of that data. Since so many kinds of data can be stored into a data lake, the “lake” can get deep holding very large amounts of data such as streaming data, images, video files, and so on.

How do you know which data storage method is appropriate for your organization?
First and foremost, it's important to know that you do not need to choose one over the other. The answer to figuring this out is to understand your business priorities.

If you have data that is structured and already used to drive reports but have to spend time grabbing the data from multiple sources, then you are ready to streamline your process and have a data warehouse set up for your one-stop-shop.

If you have a bunch of structured data with no definite use for it yet, but you also have unstructured data that you will want to leverage in the future, then go with a data lake.

If you have structured data that you do have an analytic use for, as well as a “bunch” of unstructured data, you can go with a hybrid approach.

In a hybrid approach, you might have both a data warehouse and a data lake with direct access but they are operating as two separate entities. Or you could have your data warehouse feed data into the data lake and only have direct access to the data lake for your analysts.

How Timmons Group Can Help
At Timmons Group, we have experience in setting up both data warehouses and data lakes. They can be set up on premise or you can leverage cloud infrastructure in order to benefit from cost savings and limited need to worry about system maintenance. When it comes to cloud infrastructure, AWS and Azure are two very popular services that we have had experience with. If you need help discovering the types of data you have and prioritizing its use, we can consult to help you determined the best solution and draft architecture as well.

About the Author

Andrew is an experienced data scientist and quantitative analyst at Timmons Group. His interests are in creating data driven strategies for businesses and organizations. He has a history of working with start-ups and corporations in technology, manufacturing, and consumer electronics industries.  Andrew is proficient in statistical modeling, time-series analysis, data manipulation, web scraping, and database management. Skilled in R, SQL, python, SAS, Tableau, PowerBI and SPSS. Andrew is a strong technology professional who holds a Master of Science focused in Data Science from Utica College, located in Utica, New York.

Andrew’s skills and experience include regression modeling, decision tree modeling, cluster analysis, A/B testing, hypothesis testing, time-series analysis, and ARIMA Modeling.