Apache Airflow is open-source workflow management software used to design and orchestrate data engineering pipelines. Not surprisingly, as the use of data pipelines has exploded, so has interest in (and uptake of) Airflow.
Enter Astronomer.io, a company that contributes to Apache Airflow development (several of its full-time employees are also top Airflow committers). Astronomer offers both an enterprise edition of Airflow and a fully managed Airflow service. Its relationship to Airflow somewhat resembles that of Confluent Inc. to Kafka -- although, as we shall see, Astronomer sees itself as doing something different.
Recently, I spoke with Roger Magoulas, senior director of data strategy at Astronomer.io, about what Apache Airflow is, how it works, what it does well, where it needs help, and how Astronomer extends and complements the software.
In his day-to-day work, Roger effectively tests and consumes Astronomer's own dog food: He uses Astronomer's Airflow managed service to perform his data work. We spoke about some of the most common Apache Airflow use cases or patterns, the suitability of both SQL and Python for data work, and the availability of new open-source technologies that make it possible to scale Pandas to perform parallel and concurrent processing workloads on large data sets.
The following Q&A is a composite of three interviews. It has been edited for length and clarity. (Full disclosure: Roger is both a friend and former colleague of mine at O’Reilly Media.)
If I don’t use Apache Airflow, what am I using to engineer data? What do I have to do?
Roger Magoulas: You could just write code and put it in cron. For example, you could write code to extract your data, code to put [this data] some place and code to transform it. In addition, you could write code to validate that these [operations] exited [successfully], and probably other bits of engineering logic to support all this code. So, you would want to write logic that not only checked if [an operation] exited [successfully], but checked if there were any other dependent operations, verified that these exited, and said, ‘OK, you can go onto the next step.’ This is complicated, but it was something people did.
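The DIY glue Magoulas describes -- run a step, check that it exited successfully, and only then allow dependent steps to proceed -- might look something like this minimal sketch. The step names and commands here are hypothetical placeholders, not anyone's actual pipeline:

```python
import subprocess
import sys

# Hypothetical pipeline steps: each is a shell command, run in order.
# A step runs only if every step before it exited successfully.
STEPS = [
    ("extract", "echo extracting"),
    ("load", "echo loading"),
    ("transform", "echo transforming"),
]

def run_pipeline(steps):
    completed = []
    for name, cmd in steps:
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode != 0:  # a failed step halts the whole run
            print(f"step {name!r} failed; aborting", file=sys.stderr)
            return completed
        completed.append(name)
    return completed

if __name__ == "__main__":
    print(run_pipeline(STEPS))
```

A script like this would then be scheduled from a crontab entry -- which is exactly the dependency-checking, failure-handling boilerplate Airflow takes off your hands.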
From my perspective of building data analytic platforms and organizations the last bunch of years, I’ve found the most common tools for building and managing data flows are DIY programs -- typically a mix of Python or another language and SQL -- that run on cron or a homegrown scheduler. These systems typically extract data from a source, move/load it to a target platform, and transform that data into structures suitable for analytics and model building. [It’s] a lot to take care of, and often a mess.
For example, when I was at Sybase in the late ‘90s, I had my data engineering team write a Perl program that would run 12 or so steps, with checkpoints that noted progress. If a step failed, the program stopped and could be restarted without repeating the steps already successfully completed. Other organizations use ETL tools, like Informatica and Talend, or data integration tools, like Matillion and Fivetran. Of course, there are many other examples of these types of tools.
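The checkpoint-and-restart pattern Magoulas's Sybase team built in Perl can be sketched in a few lines of Python: record each completed step in a file, and on restart skip anything already done. The step names are hypothetical; this is the pattern, not that program:

```python
import json
from pathlib import Path

def run_with_checkpoints(steps, checkpoint_file):
    """Run named steps in order, recording progress so a restart
    skips steps that already completed successfully."""
    path = Path(checkpoint_file)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    for name, func in steps:
        if name in done:
            continue  # already completed on a previous run
        func()  # an exception here stops the run, but the checkpoint survives
        done.add(name)
        path.write_text(json.dumps(sorted(done)))
    return done
```

If step 7 of 12 fails, rerunning the program picks up at step 7 rather than repeating the first six -- the behavior Airflow provides per-task out of the box.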
The ELT steps can seem simple enough on the surface, but with a lot of moving parts, an increasing number of sources and increasing ways to use the data, a lot can go wrong.
Data engineers need to contend with complex scheduling requirements, dependencies between tasks, figuring out what can run in parallel and what needs to run in series, what makes for a successful task run, how to checkpoint tasks and handle failures and restarts, how to check data quality, and how and whom to alert on failures -- all the stuff Airflow was designed to handle.
The cloud only makes that process more complicated, with cloud buckets used to stage data from sources before loading that data into cloud-based distributed data management systems like Snowflake, Google Cloud Platform or Databricks. And here’s what I think is important: For many organizations, making the leap from exploratory data analysis [EDA] to formalizing what’s found into data pipelines has become increasingly valuable. Anything that makes that work simpler lets the data team focus on generating insights and increasing the utility of their data platform.
Where does Astronomer.io come into the picture? Some people have described Astronomer’s relationship to Airflow as being like Confluent’s relationship to Kafka. Is that the basic idea?
Magoulas: Airflow came out of data engineering, developed by folks who were software engineers at heart. While they covered the needs of data engineers well, they didn’t necessarily have the best feel for how people in the analytic data space work. And that’s what creates such a unique opportunity for Astronomer: working with data analytics and data science organizations to help them transition from where they’re doing the prep work of exploratory data analysis as basically a bunch of one-off [tasks] -- first acquiring, then preparing, cleansing, organizing and provisioning the data they need to do their work -- to a process where that work has been formalized into something reusable with Airflow.
Let’s put a pin in that question for just a moment. If I understand you correctly, you’re saying that straight-from-Git Apache Airflow is less user friendly, or less automated, than it could be?
Magoulas: I don’t know if I’d put it quite that way. I see it as a question of extending Airflow to suit the requirements of many different kinds of user experiences -- for example, the workflow of a data scientist or analyst. If I’m a data scientist, I have to do all of this [EDA] outside of [Airflow], and then I have to take my work and make it compatible with Airflow. Ideally, I would just flip a switch and have that happen. I see this as one opportunity to improve on what Airflow offers today, and what Astronomer is working to provide.
Could you say a little bit more about this? For example, describe what’s involved in making this exploratory work “compatible” with Airflow. What kinds of things could you do to simplify this?
Magoulas: So, to use Airflow, you need to set up the connection strings to all the platforms you need to interact with. This is true of any data pipeline program, really. I just finished a cost model that used data from a few Snowflake schemas. After doing some EDA, I was able to create a DAG [directed acyclic graph] consisting of tasks [i.e., atomic units of work specified by the DAG] that stepwise builds up the components of the model.
The DAG associates the SQL with each task and specifies the dependencies between the tasks -- the logical heart of the cost model. However, the DAG also needs all the plumbing, the connection configurations, database resources, schedule parameters, etc., that I had to manually create. And this is what can be increasingly simplified and automated -- and is a big focus at Astronomer.
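Stripped of the plumbing Magoulas mentions, the logical heart of such a DAG is just tasks plus dependencies, from which a valid execution order follows. Python's standard library can express that directly -- to be clear, the task names below are made up for illustration, and this is a conceptual sketch, not the Airflow API:

```python
from graphlib import TopologicalSorter

# Hypothetical cost-model tasks mapped to their predecessors: each task
# may run only after every task it depends on has finished.
dag = {
    "extract_usage": [],
    "extract_rates": [],
    "build_cost_components": ["extract_usage", "extract_rates"],
    "assemble_cost_model": ["build_cost_components"],
}

# A topological sort yields an order that respects every dependency;
# tasks with no mutual dependency (the two extracts) could run in parallel.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

In Airflow, the same structure is declared with operators and dependency arrows, and the scheduler handles ordering, parallelism and retries on top of it.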
Was your use case typical for the project you just completed? Could you say more about what you see as the most common Apache Airflow use cases or patterns today?
Magoulas: The most common circumstance is that you have data in a source system that you want to move to your data platform, so you would use Airflow to extract the source data into a GCP or S3 bucket and, from there, load the data to your data warehouse or data lake -- or a cloud service like Snowflake, BigQuery or Databricks -- for further transformation.
Once it’s there, you’re probably just using SQL and Pandas to operate on the data and make it useful. This is another area that Astronomer has a keen focus on, with work on decorators and notebook-like interfaces that make integrating Python and SQL easier, while also making the connections and other plumbing simpler to manage and reuse.
This is the reason we’re putting tools in Airflow to make it easier to make the SQL flow right into Python. We’re trying to make sure we have something to offer people so they don’t feel like they need to bring in a third-party tool to do their SQL (although they can), because they should be able to do it all in Airflow.
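One way to picture the decorator idea -- and this is a toy sketch against SQLite, not Astronomer's actual API -- is a decorator that takes the SQL string a function returns, runs it, and hands the rows back to Python:

```python
import sqlite3

# A throwaway in-memory table standing in for a real warehouse schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("acme", 120.0), ("acme", 80.0), ("bolt", 50.0)])

def sql_task(func):
    """Run the SQL string the wrapped function returns; yield its rows."""
    def wrapper(*args, **kwargs):
        return conn.execute(func(*args, **kwargs)).fetchall()
    return wrapper

@sql_task
def totals_by_customer():
    return ("SELECT customer, SUM(amount) FROM orders "
            "GROUP BY customer ORDER BY customer")

print(totals_by_customer())  # rows arrive as ordinary Python tuples
```

The appeal is that the SQL lives inside an ordinary Python function, so the same workflow code can chain SQL steps and Python steps without leaving one environment for the other.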
You mentioned SQL and Python/Pandas. Is there a sense in which one is “better” than the other? Or is it rather that suitability depends on the use case?
Magoulas: I’m more SQL savvy than a lot of data people, so I’m a little biased. I think if you came into this in the last 10 to 15 years, you started doing SQL, but over time, you found yourself doing more Python, too. If you started more recently, you may have more of a focus on Python and its associated data analytic modules like Pandas and scikit-learn, learning enough SQL to get by.
It’s like you said: Suitability really depends on the use case and the person. One obvious example is that if you’re working in Python and want to look at data, you may start by using a “dictionary,” which is also called an associative array. But dictionaries make some common data analytic tasks complicated and wordy -- like sorting based on value or finding min/max values. These tasks are just simpler in SQL. This is where Pandas and its DataFrames come in. Loading data into a Pandas DataFrame makes both simple and more complex analytic tasks, including pivots and simple stats, easy and fast.
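The contrast Magoulas draws is easy to see side by side. With a plain dictionary, sorting by value or finding a max takes key functions; in a DataFrame it's a one-liner. The numbers here are invented for illustration, and the snippet assumes Pandas is installed:

```python
import pandas as pd

# Monthly spend keyed by customer -- the kind of dict EDA often starts with.
spend = {"acme": 120.0, "bolt": 50.0, "core": 200.0}

# Plain Python: value-based operations need explicit key functions.
top = max(spend, key=spend.get)
ranked = sorted(spend.items(), key=lambda kv: kv[1], reverse=True)

# The same data in a DataFrame makes ranking a single call.
df = pd.DataFrame({"customer": list(spend), "amount": list(spend.values())})
ranked_df = df.sort_values("amount", ascending=False)
```

Neither version is hard, but as the tasks grow toward pivots and grouped statistics, the DataFrame side stays terse while the dictionary side gets wordier.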
I think you’re served well by knowing both. Some problems lend themselves to SQL, and some lend themselves to Python. So, in this cost model I just designed, almost all our customers had just one [digital payments] account, but there were, I think, three customers that had two or even three [digital payments] accounts. This means I had to account for these extra accounts in my cost model.
In a pivot table centered on customers, each customer gets a subtotal line, even though for all but the three customers with multiple accounts, the subtotal is the same as the line item. The result was a lot of unneeded noise in the report. What I really needed was a kind of smart subtotaling logic that says, ‘If there’s more than one thing, aggregate them and show a subtotal, [and] other[wise] don’t’ – logic I can’t easily do in SQL but can easily implement in Python.
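That "smart subtotal" logic might be sketched in Pandas as follows -- the customers, accounts and costs below are made up, and this is one possible implementation of the idea, not the model Magoulas built:

```python
import pandas as pd

# Hypothetical per-account costs; most customers have a single account.
costs = pd.DataFrame({
    "customer": ["acme", "bolt", "bolt", "core"],
    "account":  ["a1", "b1", "b2", "c1"],
    "cost":     [100.0, 40.0, 60.0, 25.0],
})

def smart_subtotals(df):
    """Emit a subtotal row only for customers with more than one account."""
    rows = []
    for customer, grp in df.groupby("customer", sort=True):
        rows.extend(grp.to_dict("records"))
        if len(grp) > 1:  # skip the redundant subtotal for single-account customers
            rows.append({"customer": customer, "account": "subtotal",
                         "cost": grp["cost"].sum()})
    return pd.DataFrame(rows)

report = smart_subtotals(costs)
```

Only "bolt," with two accounts, gets a subtotal line; the single-account customers keep their one row each, which removes exactly the noise a conventional pivot would add.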
I’m hearing a lot more about Pandas, in part because of the folks over at Databricks, who’ve done some work to parallelize Pandas. Can we speak of parallel Pandas as an alternative to MPP SQL?
Magoulas: Everyone we talk to, the work they’re doing internally is mostly SQL, whether people really want to use it or not. It’s still the way things happen. The one place that is different is where you can put things into a DataFrame and do Pandas stuff to it. As you point out, the problem with Pandas was that it didn’t really scale. So, if you had really big data sets, there was really no way you could use Pandas with them [i.e., on a single node] without taking a sample and doing it that way.
You mentioned Databricks, but there are other tools, like Ray and Dask, that allow you to build a cluster behind Pandas. As long as you can configure it, you can get that kind of scale and handle bigger data sets for analytic and machine learning tasks.
Basically, what you can do is you can get a pretty knowledgeable system administrator to set up a cluster for you and tell you a little bit about how to connect to that cluster. You can then add one line of code on top of your Python program so you can run Pandas, and it’s running against that cluster.
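In tools like Modin, that "one line of code" is literally an import swap: the library exposes a Pandas-compatible API whose operations fan out to a Ray or Dask cluster behind the scenes. The sketch below adds a fallback so it also runs without Modin installed; configuring the cluster itself is the administrator's job and is out of scope here:

```python
# The one-line change: import Modin's pandas instead of pandas itself.
try:
    import modin.pandas as pd  # cluster-backed, Pandas-compatible DataFrames
except ImportError:
    import pandas as pd  # single-node fallback if Modin isn't installed

# Downstream code is unchanged either way.
df = pd.DataFrame({"amount": [10.0, 20.0, 30.0]})
total = df["amount"].sum()
```

Everything after the import is ordinary DataFrame code, which is the point: the analyst's program doesn't change, only where the work runs.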
The nice thing about a DataFrame is that it’s all in really fast C, so it’s much, much faster than pure Python, which is an interpreted language. The not-so-nice thing about a DataFrame is that it’s all in-memory, so really big data sets may not fit, which is why you may need a large cluster to provide enough scale to accomplish particular tasks. And, if the data is too big for that, parallel, distributed SQL may be your best option. Luckily, Python/Pandas provides easy tools for moving between DataFrames and tables.
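The easy movement between DataFrames and tables that Magoulas mentions is a pair of Pandas calls, `to_sql` and `read_sql`. Here it's shown against an in-memory SQLite database with invented data, though the same calls work against warehouse connections:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# DataFrame -> table: push a DataFrame into the database...
df = pd.DataFrame({"customer": ["acme", "bolt"], "amount": [120.0, 50.0]})
df.to_sql("orders", conn, index=False)

# ...and table -> DataFrame: pull a SQL result straight back out.
big_spenders = pd.read_sql("SELECT * FROM orders WHERE amount > 100", conn)
```

So when a data set outgrows memory, the heavy filtering and aggregation can happen in distributed SQL, with only the reduced result landing in a DataFrame.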
So, to answer your question [about parallel Pandas as an alternative to MPP SQL], I would go back to the earlier question about suitability. Some things are easier in SQL, some in Python. It depends.
You’ve described a few scenarios where you see Astronomer complementing Apache Airflow. Some of what you’ve described reminds me of the Confluent-Kafka model. Is that the general idea?
Magoulas: Kind of. We definitely are trying to make Airflow better, and we’ve got people dedicated to working [as committers] on Apache Airflow development. [Note: As of December 2021, three of the top five Apache Airflow committers were Astronomer employees.] We helped drive the version 2.2 release that came out in September, as well as the version 2.3 [release] that is coming out [in the spring of 2022].
So, we offer value-add when you buy the Astronomer version [of Apache Airflow], and we’re committed to supporting open-source development and the open-source community, as well. And, like Confluent, Astronomer offers a managed service that customers can subscribe to. I suspect we differ in that Astronomer is more focused on making the Airflow experience more useful and more available to a less data engineering-savvy audience. Of course, Astronomer continues to offer plenty of improvements for the hardcore data engineering crowd, but this focus on making Airflow simpler and more intuitive to use opens up the tool for increased use by data analysts and data scientists.
If your staff has the technical chops, open-source Airflow is a great option. The simpler DAG-development tools and managed service cloud platform that Astronomer offers bring Airflow functionality to a wider swath of data organizations. Many organizations do not have the engineering resources they need to quickly get started with Airflow -- not without help, I mean. They’re going to be interested in a managed service, something that lets them outsource a lot of the plumbing they would otherwise have to build.