The ability to make better business decisions relies on better access to information, which companies are finding through the implementation of data lakes.
Data lakes are centralized repositories that can be used to store all of a company's structured and unstructured data at any scale. They allow businesses to rapidly harness more data from more sources, giving stakeholders swift access to the necessary information to reach their goals.
Data lakes are designed to store data from anywhere: emails, social media, spreadsheets, sensors, transactional databases, even directly from customers. The core idea behind them is to have a place to store everything until the company figures out what to do with it all.
Over time, any raw, unstructured data that was previously sitting dormant can begin to present new opportunities and add new value to the business when included in a data lake; however, like any attractive technology, there are a few things to be aware of before investing. Here's some advice for getting started.
Identifying the Important Data Sources
The first step in any data lake implementation is to identify which data sources are the most important to the business. This is achieved by defining clear business objectives that would be supported by greater access to information. For instance, if the business wants to improve stock control efficiency, a spreadsheet of employee salary information is not going to help; the historical data from the company's online store would be far more beneficial.
The core value of data lakes comes from the idea that no information is ever lost, but choosing which data sources to feed into the data lake is all about balancing costs with the return on investment. If achieving a business goal is difficult because specific data was left out at the beginning, the data lake might seem like a sunk cost. Equally, if too much data is stored in the data lake, then the costs of storage will rise, so it's vital to find that balance.
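This cost-versus-value balance can be made concrete with a simple scoring exercise. The sketch below is purely illustrative: the source names, value scores, and monthly storage costs are invented, and the scoring formula is just one reasonable way to weigh business value against ongoing cost.

```python
# Illustrative sketch: rank candidate data sources by estimated business
# value against monthly storage cost. All figures here are hypothetical.

candidate_sources = [
    # (name, estimated business value 1-10, est. monthly storage cost, USD)
    ("online_store_transactions", 9, 120),
    ("warehouse_sensor_logs", 7, 300),
    ("employee_salary_sheet", 1, 5),
    ("social_media_mentions", 5, 80),
]

def net_score(source, cost_weight=0.01):
    """Business value discounted by a fraction of its monthly cost."""
    name, value, monthly_cost = source
    return value - cost_weight * monthly_cost

# Highest-scoring sources first: these are the candidates to ingest early.
ranked = sorted(candidate_sources, key=net_score, reverse=True)

for name, value, cost in ranked:
    print(f"{name}: value={value}, cost=${cost}/mo, score={value - 0.01 * cost:.2f}")
```

In this toy ranking, the online store's transaction history comes out on top and the salary spreadsheet last, matching the stock-control example above: relevance to the stated business objective, not volume, should drive what enters the lake first.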
Moving to the Cloud
Data lakes are best handled in the Cloud, with infrastructure services like Google Cloud Storage, Azure Blob Storage, or Amazon S3 being the most popular choices.
Moving data from on-premises infrastructure to the Cloud is a challenge in itself, requiring skilled engineers to lead the process and design the data lake structure at a higher level. Things like data profiling, tagging, security, workflows, and policies must all be defined well in advance.
Without that preemptive approach, the data lake can quickly become too disorganized for even the most skilled data scientists to work with. By implementing a disciplined, automated process of cataloging the data that enters the data lake, information is less likely to get lost over time. At this point, any future changes or additions to the lake will feel much more natural and incremental.
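A disciplined cataloging step can be as simple as recording a metadata entry for every object as it lands in the lake. Below is a minimal, standard-library-only sketch; the field names, tag scheme, and JSON record format are assumptions for illustration, not the interface of any particular catalog product.

```python
import hashlib
import json
from datetime import datetime, timezone

def catalog_entry(path, source, tags, content):
    """Build a catalog record for a new object landing in the lake.

    `path`, `source`, and `tags` are whatever the ingestion pipeline
    knows about the object; the checksum supports later integrity checks.
    """
    return {
        "path": path,
        "source": source,
        "tags": sorted(tags),  # sorted so records are deterministic
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
    }

# Hypothetical ingestion of one raw CSV from the online store.
entry = catalog_entry(
    path="raw/store/orders/2024-05-01.csv",
    source="online_store",
    tags={"transactions", "pii:none"},
    content=b"order_id,total\n1,19.99\n",
)
print(json.dumps(entry, indent=2))
```

Appending each record to a searchable catalog as objects arrive is what keeps "store everything" from degrading into "find nothing"; tagging at ingestion time is far cheaper than reverse-engineering unlabeled files years later.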
Assembling a Team
At PSL, we find that business-minded people are usually the primary drivers of data lake adoption within a company. Data lakes are often regarded as being highly technical assets, but the reality is they cannot provide value without the knowledge that business professionals bring to the table. Data lakes need that business insight to determine what data to collect and how the data should be utilized.
Data engineers and data architects are the second main players in any data lake initiative and should be brought on board to manage the data lake itself. These experts understand the main technologies behind data lakes, along with Big Data trends, file systems, file formats, and the best tools to ensure that the business people get the right insights when they need them. Data scientists and analysts also play a significant role in extracting the value of that data.
Then there is the question of infrastructure security, which ensures that data is not lost or stolen. This requires operations specialists who can collaborate with the data engineers to determine which security technologies are best suited to the data lake infrastructure. Anybody working with the data must also be clear on what can and cannot be accessed and the regulations behind working with customer data. This varies depending on which country the data is stored in, where the company operates, and the laws governing data usage in those locations.
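Those access rules can be made explicit in code as well as in policy. The following is a deliberately simplified sketch of a role-to-zone permission check; the roles and zone names are invented for illustration, and any real deployment would lean on the cloud provider's IAM (e.g. AWS IAM or Azure RBAC) rather than application code.

```python
# Hypothetical role-based access sketch. Roles, zones, and the mapping
# below are assumptions for illustration only; production data lakes
# should enforce access through the cloud provider's IAM instead.

PERMISSIONS = {
    "data_engineer": {"raw", "staging", "curated"},
    "data_scientist": {"staging", "curated"},
    "business_analyst": {"curated"},
}

def can_access(role, zone):
    """Return True if the role may read objects in the given lake zone."""
    return zone in PERMISSIONS.get(role, set())

print(can_access("data_engineer", "raw"))      # engineers manage raw data
print(can_access("business_analyst", "raw"))   # analysts see curated only
```

Even a simple matrix like this forces the team to answer the key governance question up front: who may touch raw data, which often contains unfiltered customer information, versus the curated zones prepared for analysis.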
In the early stages of adoption, it is common to find pushback from within the company as stakeholders question why a data lake is necessary, especially when a data warehouse is already in place. This fear of change can be alleviated by highlighting the limitations of a standard data warehouse, and then showing how a data lake structure can better answer valuable business needs.
Data Lake Benefits to Remember
A data warehouse might be capable of storing a lot of data, but that data is usually aggregated in a highly limiting way. That's because the data in a data warehouse has already been processed and filtered for a specific purpose. The main reason for adopting a data lake setup is its flexibility: raw data can be reprocessed and reinterpreted as needs change, making it possible to quickly generate new, valuable insights for the business.
It is easier and cheaper to implement a data lake structure sooner rather than later, instead of waiting until you really need it, because the cost and time required to migrate and restructure existing data keep growing alongside the amount of data.
The level of clarity that data lakes offer allows a business to start thinking about new metrics, more advanced technologies like machine learning, and more questions that can be answered quickly by the agility that a data lake provides. Data lakes enable companies to experiment faster and find unique ways to deliver value that weren't possible with a data warehouse.
Ultimately, with the right team of engineers and data experts leading the technical side, and with business professionals guiding them through the company's core objectives, data lakes can quickly become an indispensable means of tapping into the true value of data. But, don't wait for the data lake before you start creating your own data-driven opportunities and benefits.
This article was informed by PSL Big Data Engineer, Luis Miguel Mejia.