In this post I talk about Halfpipe’s pre-canned data pipelines solve common data migration scenarios. I cover the simple commands that generate and run the pipelines so any engineer can access them. I also describe how Halfpipe works under the hood.
Halfpipe takes care of the E and L in ETL/ELT and I wonder if it could give your engineering teams a leg-up or help democratise your regular data integration processes. If you’d like to take the tool for a spin for free, take a look at my GitHub page to get started.
I’d love to hear what you’re using to solve this piece of the data integration problem and I’d welcome your comments and feedback.
This article continues on from my previous post, where I described the hidden gotchas around building and running incremental data pipelines. For those who have joined me here for the first time, here’s a quick recap:
After using Pentaho (open-source, PDI) for 5+ years in production, when it came to migrating data from Oracle to Snowflake, I found myself building config-driven pipelines that were far too complicated to be easily maintained. And all they needed to do was generate some CSV files incrementally, copy them to an S3 bucket and tell Snowflake to load the files.
The main concerns were how tricky it would be for a product team of microservices engineers to support, the inherent security issues around passwords and general software bloat. The ~1 GB deployments and RAM requirements put people off!
So I got started writing Halfpipe as a data-ops tool that would be fast, easy to use and simple to deploy. The idea was to create something that contained all of the smart data integration patterns I’d learned while using Pentaho into something that anyone could fire up from the command line. I wanted to get the basics right before adding a UI so I came up with this simple command format to get going, where
hp is the name of the tool:
hp <action> <sub-command> <source.table> <target.table>
The vision is for the source and target above to be any RDBMS, cloud storage bucket (S3 and others), stream (Kinesis, Kafka, etc), HTTP API or file system etc. Halfpipe should figure out the best, most efficient path to take data from the source to target so developers don’t have to.
So it’s pretty simple to learn. Here’s a summary of the current capabilities…
What can Halfpipe do?
These headings are the actions that can be inserted into the command above, while the bullets describe the sub-commands that are available:
- copy table metadata between source and target (execute DDL)
- copy table snapshots from source to target (fetch all records; optional append to target)
- copy table deltas (incremental changes since you last looked) from source to target
- synchronise data across two tables in full/batch, by doing a fast row-by-row comparison
- synchronise data in two tables by streaming change-events (use Continuous Query Notifications in the case of an Oracle source)
- run a pipeline defined in a JSON/YAML file. Save one by using the above actions, or edit one manually
- run a microservice, to kick off a pipeline or monitor one remotely
- execute arbitrary SQL and print results. Make it grep-able to drive more
- add/remove logical connections to a source or target (saved and encrypted with AES-256)
- manage default CLI flag values to simplify the command usage
Sources & Targets
The types of source and target available are as follows, and they can be used interchangeably:
- SQL Server
Okay, that’s not much connectivity! But there’s a lot that can be added and there’s plenty on the roadmap already.
If you would be interested to see a specific type added, let me know what you’d find most useful in the comments.
How is it built?
Halfpipe is written in Go and is compiled against Oracle’s client drivers (the standard OCI library) so it supports bulk (array bind) operations for improved performance over the standard go-sql interface.
It would be simple enough to add support for ODBC or connectivity to any of the databases or event-streaming solutions that the Golang community supports natively, like Kafka or Kinesis. There are tons of options and the community support constantly astounds me!
I thought about doing all this in Python to create something that new components could easily be added to - dynamically - but there are existing projects out there like Singer that do the job and are nicely designed.
I find Python can be tricky to deploy repeatably outside of Docker and maintaining package compatibility presents an ongoing challenge. If you regularly try to fix security issues by updating modules then you will almost certainly have broken dependencies at some point. And this is an annoying time sink!
The benefit of Golang is that it produces small and fast, cross-platform binaries in seconds. We can get back some of the flexibility of a dynamic language like Python by adding plugins as shared libraries, assuming we’re happy to live with the inherent versioning issues that come with the approach.
How does it work?
Internally, Halfpipe has a set of reusable components that can be wired up in JSON/YAML so it’s quite configurable. They’re designed to keep mem usage low with minimal buffering and to execute in parallel. You’ll see there are no sorters or reducers, for example. Remember the aim was to take care of E and L in ETL/ELT so we can let the sources/targets handle the transformations.
Here’s the list of components so far. They’re badly named, but if you’ve used a data integration tool like Pentaho, Talend or Informatica, you should be able to understand the general concept of each of them.
There are three main types, where all components output records in a way that another can read as input:
- input components that produce records from a data source
- output components that persist records
- components that manipulate and transfer records to another connected component
There’s also one special component to handle metadata injection.
- get records by executing any SQL query
- get records from Oracle continuous query notifications
- output records to a database table (generate DML, or synchronise records after a merge-diff)
- output records to Oracle database (support bulk DML operations for better Oracle performance)
- output to CSV files (split files on number of rows or size)
- output JSON records to AWS SQS
- output JSON records to AWS Kinesis
- merge-diff (compare sorted records and flag them as new, changed or deleted)
- filter fields (allow records if they: contain the max value, are the last row in steam, or match a JSON logic pattern)
- map fields (add constants, concat fields, regexp replace, use JSON logic)
- read/write manifest files
- copy files to an S3 bucket
- list s3 bucket contents
- generate date or number ranges (use with metadata injection to create loops)
- Snowflake loader (COPY INTO … from S3 files)
- Snowflake merge (MERGE INTO … from S3 files)
- Snowflake sync (automatic insert, update, delete into … from S3 using a temp table)
- metadata injection
With these, it’s possible to construct complex data pipelines that take data from A to B easily and make some decisions along the way.
Metadata injection, for example, makes it possible to run loops to extract-load portions of data. While the “manifest file” component helps us make pipelines bullet-proof.
On their own, these things are nothing new compared to the traditional ETL tools. They still require us to know the art of the possible and how to connect the components together to produce useful logic.
But we can wrap up expert knowledge by pre-canning pipelines making them easily available to everyone.
Halfpipe automatically runs pre-canned data pipelines for common migration scenarios, determined by the source and target supplied on the command line, so engineers don’t need to spend their time developing them.
It’s a light-weight binary (~15 MB) that’s easy to deploy and quick to get started with. It’s designed to be stateless and operate in a zero-maintenance way. And with re-usable, built-in components that execute in parallel, it keeps memory usage to a minimum.
It essentially allows data integration patterns to be democratised.
If you’d like to simplify your extract-load pipelines or are curious as to whether this tool could help your developers manage data across their environments, feel free to take it for a spin or get in touch.
To get started, check out my GitHub page where you’ll find instructions, sample commands and links to short videos.
If you’d just like to see Halfpipe in action, head over to my website for a series of short videos. There’s a full list of features growing here.
Alternatively, if you’d like to discuss your use-cases, I am more than happy to chat any time and to see how I can help.
I’d like to spend some time looking at how much it costs to run a data platform with Halfpipe in the cloud, vs existing SaaS offerings.
How well does it perform and what’s the minimum compute/memory footprint (cost) we can get away with?
Halfpipe is small so could fit in AWS Lambda. Okay, Lambda would present a new set of problems, but it would be interesting to explore the possibilities. There’s also cheap infrastructure like AWS EC2 nano hosts, or managed services like CodePipeline / CodeBuild that could be used.
I’d like to make it easy for everyone to go from zero to a running a data platform that’s cost-effective and “just works”.
Thanks for reading and see you in the next one,