For those of you just getting started with building your own data lake, there can be a dizzying array of technologies and processes at play. For our team at NITS Solutions, understanding all of the nuances has certainly involved a steep learning curve. Each build depends on the problem it is trying to solve, but luckily, many solutions can be distilled down to just a few key points. If you’ve already read Part I: Is a Data Lake Right for Me? and believe that a data lake solution is a good fit for you, keep reading to get a handle on the different technical aspects you will need to master.
File Storage
With data lakes, gone are the days when you could just load your data into a database and forget about it. Instead, you will have to plan where you will store your files and what kind of folder structure you will implement. Make sure to spend extra time thinking about what kinds of future projects you may have and how those files will fit into your existing structure. If your in-house resources are minimal, a great cloud-based option is to use Amazon Web Services (AWS) and load all your files to S3. It has a wealth of SDKs and clients that will help you quickly upload and download files.
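As a quick sketch, uploading files to S3 can be as simple as a couple of AWS CLI commands; the bucket name and paths here are hypothetical examples:

```shell
# Recursively upload a local data directory to S3 (bucket/path are hypothetical)
aws s3 cp ./data/sales/ s3://my-data-lake-raw/sales/2019/ --recursive

# Verify that the files landed where you expect
aws s3 ls s3://my-data-lake-raw/sales/2019/
```

The same operations are available programmatically through the AWS SDKs if you prefer to script uploads as part of a pipeline.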
Hadoop Distributed File System (HDFS)
Regardless of where you store your files, they will usually have to make their way to HDFS. Note that the “D” stands for “distributed”, meaning that the file system is spread across multiple nodes (i.e. separate machines). Files are split into blocks that are evenly distributed across this network of nodes, which lets your data processing programs take advantage of parallelism. Keep in mind that files in HDFS are not meant to be edited; they are written once, read, and eventually deleted. If you plan on having a dedicated Spark cluster (more on that later) rather than temporarily spinning one up, then you can use HDFS as your primary file storage solution.
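For reference, moving files in and out of HDFS typically looks like the following from the command line (the paths are hypothetical examples). Notice that there is a put, a get, and a remove, but no edit:

```shell
# Create a directory in HDFS and copy a local file into it
hdfs dfs -mkdir -p /data/raw/sales
hdfs dfs -put ./sales_2019.csv /data/raw/sales/

# Inspect what is stored
hdfs dfs -ls /data/raw/sales

# Files are write-once: to "change" one, you delete and re-upload it
hdfs dfs -rm /data/raw/sales/sales_2019.csv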
Stored File Types
An important property of data lake file types is that they must be splittable into smaller pieces so they can be spread across your cluster. This includes popular text-based file formats like CSV, TXT, and JSON, but also a few you may not have heard of. Avro files are row-based and popular when you wish to apply a schema to your data and will make use of a significant portion of each row you read. Parquet files also apply a schema but are column-based, meaning that they are useful when you have a relatively large number of columns but will only be interested in reading a few of them at a time.
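To see why the row-based versus column-based distinction matters, here is a conceptual sketch in plain Python of the two layouts. Real Avro and Parquet files add schemas, binary encoding, and compression on top of this idea; this is only an illustration of the access patterns:

```python
# Conceptual sketch: row-based (Avro-like) vs column-based (Parquet-like) layouts.

# Row-based layout: each record is stored whole. Summing the "sales" field
# means touching every field of every row.
rows = [
    {"id": 1, "region": "east", "sales": 100.0},
    {"id": 2, "region": "west", "sales": 250.0},
    {"id": 3, "region": "east", "sales": 75.0},
]
total_row_based = sum(r["sales"] for r in rows)

# Column-based layout: each column is stored contiguously. Summing "sales"
# reads only that column's values and skips "id" and "region" entirely.
columns = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales": [100.0, 250.0, 75.0],
}
total_column_based = sum(columns["sales"])

print(total_row_based, total_column_based)  # both 425.0
```

With three rows the difference is invisible, but with billions of rows and hundreds of columns, skipping the unread columns is a major savings in I/O.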
Apache Spark Cluster
Apache Spark is the engine that unleashes the full potential of your data lake. It is a network of nodes that takes advantage of the distributed nature of HDFS to do computations against your data in parallel. There is typically one node called the master node whose job it is to give instructions to multiple worker nodes. The worker nodes have multiple executors (i.e. processes) that execute calculations against partitioned data. Remember how I said that your file types must be splittable, and that HDFS is designed to send your data to multiple different nodes? That’s what partitioning is all about. Partitioned data is the product of splitting apart huge data files (>1GB) and evenly distributing that data to processes across the worker nodes.
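Here is a toy, pure-Python sketch of the partition-then-process idea. Spark runs the per-partition step in parallel on executors across the worker nodes; this sketch runs it sequentially just to show the data flow:

```python
# Toy illustration of partitioning: split a dataset into chunks, compute a
# partial result per chunk, then combine the partial results.

def partition(data, num_partitions):
    """Split data into roughly equal chunks, one per executor."""
    size = (len(data) + num_partitions - 1) // num_partitions
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(chunk):
    """The work each executor would do on its slice of the data."""
    return sum(chunk)

data = list(range(1, 101))                        # stand-in for a huge file
parts = partition(data, num_partitions=4)         # 4 chunks of 25 values
partial_results = [process_partition(p) for p in parts]
print(sum(partial_results))  # 5050, same answer as sum(data)
```

The key point is that `process_partition` never needs the whole dataset, so each executor can work on its own slice independently.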
You have two main options for getting a Spark cluster up and running. The first is to download all of the software and install it yourself. This is doable on any operating system, even on a single laptop, though I do not recommend it. The whole reason for moving to a data lake is that your data is becoming very large, and a single machine typically has neither the memory nor the parallelism necessary to make it worth your while. If you still want to run your cluster in-house, you will have to set up every node in your network, which is not ideal for the inexperienced user. A much easier solution is to run your Spark cluster in the cloud using out-of-the-box solutions from Cloudera or AWS EMR. These let you quickly spin up a cluster, freeing up your time to focus on writing programs rather than fixing networking and config errors.
Now we have to provide our Spark cluster with the instructions that its master node can distribute to the worker nodes. Typically, this is done by writing a program in either Java or Scala, compiling it into a jar file, and including that jar file as part of a spark-submit job. You can also write a Python program using PySpark, though official support may not always match that for Java/Scala. When running a spark-submit job, you define not only your program but also the number of executors, their memory size, additional jars to include, and a host of other options.
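A typical spark-submit invocation might look like the following; the class name, jar files, and paths are hypothetical examples:

```shell
# Submit a compiled Spark program to the cluster (names/paths are hypothetical)
spark-submit \
  --master yarn \
  --class com.example.SalesReport \
  --num-executors 10 \
  --executor-memory 4G \
  --executor-cores 2 \
  --jars extra-lib.jar \
  sales-report.jar /data/raw/sales /data/output/sales_report
```

The flags control how much of the cluster your job claims; the trailing arguments are passed straight through to your program.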
Putting It All Together
The above steps form the skeleton of a basic data lake architecture, the particulars of which are up to you. Just remember these core steps while planning your build:
1. Store splittable file types in an accessible location
2. Code out a Spark program
3. Start up a Spark cluster
4. Transfer your stored files to the HDFS on your cluster
5. Run a spark-submit job on your cluster to execute your program
6. Move any results to your final file storage location
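On an EMR-style cluster, steps 1 and 4 through 6 might look like the following (steps 2 and 3, writing the program and starting the cluster, happen beforehand; all bucket names, paths, and jars are hypothetical):

```shell
# 1. Pull stored files down from S3 (splittable formats like CSV/Parquet)
aws s3 cp s3://my-data-lake-raw/sales/ ./staging/ --recursive

# 4. Load them into HDFS on the cluster
hdfs dfs -put ./staging /data/raw/sales

# 5. Run the Spark program via spark-submit
spark-submit --master yarn --class com.example.SalesReport \
  sales-report.jar /data/raw/sales /data/output

# 6. Retrieve the results and move them to final storage
hdfs dfs -get /data/output ./results
aws s3 cp ./results s3://my-data-lake-results/sales/ --recursive
```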
That’s it! Master these six steps and you will be able to start building your own data lake architecture.
There are times, though, when you don’t need a bunch of number crunching and just want blazing-fast access to the data in your database. Rather than go through the trouble of setting up EMR clusters and writing Spark jobs, there is a much simpler method. If that sounds like something your organization might need, make sure to tune in next week for Part III: An Athena Case Study.