How to Get Started with Data Lakes

The ability to make better business decisions relies on better access to information, which companies are finding through the implementation of data lakes.

Data lakes are centralized repositories that can be used to store all of a company's structured and unstructured data at any scale. They allow businesses to rapidly harness more data from more sources, giving stakeholders swift access to the necessary information to reach their goals.

Data lakes are designed to store data from anywhere: emails, social media, spreadsheets, sensors, transactional databases, even direct from customers. The core idea behind them is to have a place to store everything until the company figures out what to do with it all.

Over time, raw, unstructured data that previously sat dormant can present new opportunities and add value to the business once it is included in a data lake. However, as with any attractive technology, there are a few things to be aware of before investing. Here's some advice for getting started.

Identifying the Important Data Sources 

The first step in any data lake implementation is to identify which data sources are the most important to the business. This is achieved by defining clear business objectives that would be supported by greater access to information. For instance, if the business wants to improve stock control efficiency, a spreadsheet of employee salary information is not going to help; the historical data from the company's online store would be far more beneficial.

The core value of data lakes comes from the idea that no information is ever lost, but choosing which data sources to feed into the data lake is all about balancing costs against the return on investment. If achieving a business goal is difficult because specific data was left out at the beginning, the data lake might seem like a sunk cost. Equally, if too much data is stored in the data lake, the costs of storage will rise, so it's vital to find that balance.

Moving to the Cloud 

Data lakes are best handled in the Cloud, with infrastructure services like Google Cloud Storage, Azure Blob Storage, or Amazon S3 being the most popular choices.

Moving data from on-premises infrastructure to the Cloud is a challenge in itself, requiring skilled engineers to lead the process and design the data lake's high-level structure. Things like data profiling, tagging, security, workflows, and policies must all be defined well in advance.

Without that preemptive approach, the data lake can quickly become too disorganized for even the most skilled data scientists to work with. By implementing a disciplined, automated process of cataloging the data that enters the data lake, information is less likely to get lost over time. With that process in place, any future changes or additions to the lake will feel much more natural and incremental.
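
To make the idea of automated cataloging more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular cloud service: the `Catalog` and `CatalogEntry` names, the metadata fields, and the rule that untagged data is rejected are all hypothetical choices made for this example. A real data lake would more likely rely on a managed catalog service and persist the entries somewhere durable.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata recorded for one object as it enters the lake (hypothetical schema)."""
    path: str          # where the object is stored in the lake
    source: str        # originating system, e.g. "online_store"
    tags: list[str]    # business tags used to find the data later
    size_bytes: int
    sha256: str        # content hash, useful for spotting duplicates
    ingested_at: str   # UTC timestamp in ISO 8601 format


@dataclass
class Catalog:
    """In-memory catalog keyed by object path (a stand-in for a real catalog store)."""
    entries: dict[str, CatalogEntry] = field(default_factory=dict)

    def register(self, path: str, source: str, payload: bytes, tags: list[str]) -> CatalogEntry:
        # Assumed policy for this sketch: nothing enters the lake without at least one tag.
        if not tags:
            raise ValueError("refusing to ingest untagged data: " + path)
        entry = CatalogEntry(
            path=path,
            source=source,
            tags=sorted(set(tags)),  # de-duplicate and normalize tag order
            size_bytes=len(payload),
            sha256=hashlib.sha256(payload).hexdigest(),
            ingested_at=datetime.now(timezone.utc).isoformat(),
        )
        self.entries[path] = entry
        return entry

    def find_by_tag(self, tag: str) -> list[CatalogEntry]:
        # Return every cataloged object that carries the given tag.
        return [e for e in self.entries.values() if tag in e.tags]


catalog = Catalog()
catalog.register(
    "raw/online_store/orders.csv",       # hypothetical path
    "online_store",
    b"order_id,sku,qty\n1,A-100,2\n",    # stand-in for the real file contents
    ["sales", "inventory"],
)
inventory_data = catalog.find_by_tag("inventory")
```

Rejecting untagged data at the point of entry is one way to enforce the discipline described above: every object in the lake can later be found by source or tag, rather than relying on someone remembering where it was put.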

Assembling a Team