AWS Data Pipeline Terraform Support

History

A short time ago I was working on building out an ETL flow to populate a data warehouse. The needs were fairly simple, and while the ideal solution seemed to be as managed an offering as possible, there was a strong inclination towards SQL on the team and dbt was already in use, so I was primarily interested in a solution that could help coordinate running the associated tasks.

AWS Data Pipeline was selected since it offered a fairly straightforward way to choreograph our existing logic while allowing for iterative replacement of any custom code with provided functionality (with an eye towards something like Glue). For our specific needs Data Pipeline didn’t provide notable additional functionality, but the primary need was a reliable job runner with relatively low initial configuration overhead, and it was able to quickly address that. Due to infrastructural considerations, more generalized alternatives raised some additional initial concerns that seemed better deferred (such as coordination with suitable compute resources). Data Pipeline does not appear to be one of the AWS offerings that receives the most attention, but overall it seemed a solid next step.

I’m a stalwart advocate for infrastructure as code (IaC) generally, and in cases where chunks of work can be offloaded across different systems it seems particularly essential for coordinating and evolving those moving pieces. Terraform was adopted for this purpose, but unfortunately there was no readily available support for Data Pipeline in the AWS Terraform provider (2021). I’d poked around inside of Terraform in the past, so adding such support seemed doable, and the configuration for the pipeline we needed was quickly up and running.

Shortly thereafter this effort was abandoned. An alternative perspective, which had been dormant throughout the assorted discussions and approvals, held that (contrary to the project history and what seems to be prevailing wisdom) things should just work and that a reliable job runner was therefore unnecessary. The pieces that had been split out were jammed back together such that one process would execute them in sequence, the work of trying to make sure that process never failed was resumed, and I accelerated my departure plans.

2021. https://github.com/hashicorp/terraform-provider-aws/pull/9404.