Data engineering in today’s clouds

Data is all around us. Some of it is easy to see; some of it stays invisible until we have a reason to gather it into a usable form. Some sources, like weather stations, stream data in constantly. Other data is generated through activities like web browsing or uploading photographs to a photo-sharing website. There is no shortage of data, and the amount available will only continue to increase.

This data is used for all kinds of purposes. It is used to keep us safe, it is used to show us ads and sell us things, and it serves countless other ends. It can also be used for bad things, but that is a separate blog post, or more likely an entire book.

What is data engineering?

For this never-ending data to be applied, it must be made usable. It must be gathered, collected, and organized. It must be processed in some way that takes the raw data and turns it into what we need. This process must be efficient, scalable, and resilient. This is data engineering.

I think data engineering skills are among the most important skills someone can have today. Someone who does data engineering should be familiar and comfortable with all aspects of it, but even someone who is only on the periphery of data engineering should know the concepts. Data engineering is just that important to nearly all businesses today.

For example, an engineer implementing a REST API that receives and stores data should have some concept of the downstream processes. That knowledge helps the engineer collect and store the data efficiently. Granted, the engineer and the data engineer need to work together, but fundamental knowledge of best practices, along with a common vocabulary, can turn a good system into an efficient system.
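As a sketch of what that downstream awareness might look like, here is a hypothetical ingest helper. The field names (`device_id`, `value`) and the partitioning choice are invented for illustration; the point is that the ingest layer normalizes types and attaches the metadata that downstream jobs will need.

```python
from datetime import datetime, timezone

def prepare_record(payload: dict) -> dict:
    """Validate an incoming API payload and shape it for downstream processing.

    All field names here are hypothetical. The idea being illustrated is
    that ingestion should normalize and annotate data before storage.
    """
    if "device_id" not in payload or "value" not in payload:
        raise ValueError("payload missing required fields")

    now = datetime.now(timezone.utc)
    return {
        "device_id": str(payload["device_id"]),
        "value": float(payload["value"]),            # normalize types up front
        "ingested_at": now.isoformat(),              # lets downstream jobs window by time
        "partition_date": now.strftime("%Y-%m-%d"),  # a common partition key for storage
    }
```

A downstream batch job can then group by `partition_date` without re-parsing timestamps or guessing at field types.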

Designing and implementing sound processes for ingesting and processing data is a crucial skill that enables so many applications of the data. But learning those skills can be difficult. Where should someone start?

Where should someone start?

Where to start can depend on where your data pipeline will run. If it will run in the cloud, which cloud? AWS, Google Cloud, a combination, or another? Knowing this can help identify tools and a learning path.

For Google Cloud, you can access Google Cloud’s Data Engineer learning path. The path is geared toward passing the Professional Data Engineer exam, but you can ignore that (unless certification is your goal) and focus on the content. Learn the Google Cloud services Dataproc, Dataflow, and BigQuery first; they are the most important. I have personally found Pub/Sub, especially its ability to add BigQuery as a subscription to a topic, to be a very important service for building loosely coupled data pipelines. The Google Cloud Innovators subscription may also be worth a look if you want a certification voucher, Google Cloud credit, and access to more learning resources.
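The loose coupling that topics and subscriptions buy you can be illustrated without any cloud SDK at all. The toy class below is purely an in-memory model, not the Pub/Sub API: the `table_rows` list stands in for a BigQuery subscription writing each message to a table, and the publisher never knows who consumes its messages.

```python
class Topic:
    """Toy in-memory model of a pub/sub topic with fan-out to subscriptions.

    This only illustrates the loose-coupling idea. In Google Cloud, the
    equivalent would be a Pub/Sub topic with, for example, a BigQuery
    subscription that writes each message straight into a table.
    """
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, message: dict):
        # Every subscription receives its own copy of the message;
        # the publisher knows nothing about the consumers.
        for handler in self.subscribers:
            handler(dict(message))

# A "table" sink standing in for a BigQuery subscription,
# plus a second independent consumer.
table_rows = []
audit_log = []

events = Topic()
events.subscribe(table_rows.append)
events.subscribe(lambda m: audit_log.append(m["event"]))

events.publish({"event": "signup", "user": "alice"})
```

Adding a third consumer later requires no change to the publisher, which is exactly why this pattern keeps pipelines loosely coupled.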

For AWS, I have found the learning path to be less straightforward. AWS Skill Builder offers a wide variety of content, but it can be overwhelming to find what you need. Note also that not all of the content is free; some of it requires a monthly AWS Skill Builder subscription. I suggest starting with the Data Analytics Learning Plan. In it you will learn about data streaming with Kinesis, Hadoop technologies with Elastic MapReduce (EMR), and data manipulation with Glue. Athena and QuickSight are also covered.

There is, of course, open source and third-party software that may be useful. Apache Airflow is a workflow management platform for data engineering. Apache NiFi lets you automate data flows using directed graphs. Some clouds offer hosted versions of these open source applications; Amazon Managed Workflows for Apache Airflow is one such service. Before learning the intricacies of the managed service, however, I recommend learning the open source Apache Airflow project first.
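The directed-graph idea behind Airflow and NiFi is simple enough to sketch with the standard library. The tasks below are hypothetical stand-ins, and this is not Airflow's API, but the structure mirrors how an Airflow DAG declares that transform runs after extract, and load after transform.

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical task functions for a tiny extract -> transform -> load flow.
results = {}

def extract():
    results["raw"] = [1, 2, 3]

def transform():
    results["clean"] = [x * 10 for x in results["raw"]]

def load():
    results["loaded"] = len(results["clean"])

tasks = {"extract": extract, "transform": transform, "load": load}

# Each entry reads "task: its upstream dependencies" -- the same
# information an Airflow DAG encodes with >> between operators.
dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

# Run tasks in an order that respects every dependency.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()
```

A real orchestrator adds scheduling, retries, and parallelism on top, but the core contract is the same: tasks run only after their upstream dependencies succeed.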

Learn the Concepts First, Services Second

When learning about data engineering, try to focus on the underlying concepts. These concepts apply across any cloud provider. Learn them first; then you can learn the details of a particular service, such as Amazon Kinesis, Amazon SQS, or Google Cloud Pub/Sub. The services differ in how they operate, but the underlying problems they solve overlap. Knowing those base concepts will let you move between clouds with less friction.
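One way to make "concepts first, services second" concrete in code is to program against the concept rather than the service. The interface below captures what Kinesis, SQS, and Pub/Sub share (producers send messages, consumers receive them); the in-memory class is a stand-in, and a real project would add a concrete class per provider wrapping that provider's SDK behind the same interface.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional

class MessageQueue(ABC):
    """The concept shared by Kinesis, SQS, and Pub/Sub: producers enqueue
    messages and consumers receive them. Cloud-specific details live in
    concrete subclasses, not in the pipeline code."""

    @abstractmethod
    def send(self, body: str) -> None: ...

    @abstractmethod
    def receive(self) -> Optional[str]: ...

class InMemoryQueue(MessageQueue):
    """Stand-in implementation for local development and tests. Swapping in
    an SQS- or Pub/Sub-backed subclass would not change the calling code."""

    def __init__(self):
        self._messages = deque()

    def send(self, body: str) -> None:
        self._messages.append(body)

    def receive(self) -> Optional[str]:
        return self._messages.popleft() if self._messages else None
```

Pipeline code written against `MessageQueue` carries over between clouds; only the subclass changes.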

Practice is Crucial

No matter what guided learning you choose, if any, the most important way to learn is to practice. You can only get so far watching videos and reading text. If a course offers hands-on labs, do them! More importantly, think of your own example problems to solve and get building. You can implement a professional data pipeline in a cloud for just a few dollars a month. If you use infrastructure as code, you can easily stand the pipeline up to work on it and tear it down when you are done, keeping the bill as small as possible.
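If you need a starter problem, a per-key daily aggregation is a classic one. The data and schema below are invented, but the shape mirrors a real batch pipeline: extract records, transform them, and load a result. Once it works locally, a good exercise is to rebuild it with Glue and Athena, or Dataflow and BigQuery.

```python
import csv
import io
from collections import defaultdict

# Invented sample data: raw sensor readings as CSV.
RAW = """station,date,temp_c
A,2024-06-01,20.0
A,2024-06-01,22.0
B,2024-06-01,15.0
"""

def daily_averages(raw_csv: str) -> dict:
    """Compute the average temperature per (station, date) pair."""
    totals = defaultdict(lambda: [0.0, 0])            # running (sum, count)
    for row in csv.DictReader(io.StringIO(raw_csv)):  # extract
        key = (row["station"], row["date"])
        totals[key][0] += float(row["temp_c"])        # transform
        totals[key][1] += 1
    return {k: t / n for k, (t, n) in totals.items()}  # load (here, a dict)
```

Extending this with a second source file, late-arriving rows, or bad records is where the real learning starts.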

But you have to get your hands dirty and gain real hands-on experience. A certification offers very little value if you have only studied the content enough to pass the exam. Once again, practice makes perfect.
