In this article I provide a step by step guide to recreating a config-driven data pipeline able to extract Oracle tables and views into an S3 bucket, ready to be loaded into Snowflake.
Before we get to it, here’s a quick video (2 mins) to give you an overview of the moving parts.
When I originally put this together back in 2018, I wrapped everything up — Java, Pentaho, Oracle drivers and the pipeline code — into a single Docker image, to make it easy to deploy and get started.
At the time, I called it all a “Super Simple Data Extractor” so you’ll see that name mentioned some more.
I developed this on my Mac, so it won’t necessarily play well on other platforms.
That said, it shouldn’t be too far off working on Linux, but Windows will be more of a challenge unless you’re setup with WSL and a bash/zsh shell.
Perhaps you’d like to improve on the code to resolve the shortcomings that I called out in my previous article. If you haven’t read it yet, I’ll recap in the discussion below.
So let’s get to it…
The Pentaho Code & Links
All instructions and details of how to get started are held in the following Google Slides.
Missing The Point
Now — it seems we have a pipeline that falls short of the title of this article. We’re missing the piece that loads the data into Snowflake!
But bear with me…
This pipeline was the first version of a similar capability used on my journey from Oracle to Snowflake and I wanted to publish it as an exercise, believing that it may contain something useful for others.
Perhaps there are some techniques that you could repurpose, or if you just want to dump a load of Oracle data straight out to S3, you have a Docker image and a config file that’ll do it.
The Docker image is super large (over 1GB) but once it’s downloaded, you’ll be good to go.
It takes about a minute to fire up the container so it’s a bit slow from a cold start, but it’s not too bad.
We could improve on these issues by using a Docker “builder” pattern to reduce image size, and compile PDI from source to decrease the start-up time. There’s loads of bloat that can be removed!
Now let’s recap on the shortcomings called out in the slides above and in my previous article, so we know what we’re up against if we’re going to improve this:
- CSV files only go to S3, not Snowflake
- Missing manifest files in S3 prevent us from parallelising extracts reliably
- Daily incremental extracts are not granular enough (they need to be configurable and based on more data types than just a DATE)
- Chunk size is hardcoded for the whole pipeline (that’s the number of days worth of records extracted per SQL query)
- Too many moving parts makes it brittle and hard to maintain (20+ jobs/transforms with metadata injection, and more)
- Encrypted credentials are insecure (symmetric key encryption means anyone with Pentaho can decrypt)
- TLS certificates are self-signed (can we integrate with free Letsencrypt.org?)
A Better Solution
These issues and the overall complexity of the solution led me to start work on a simpler alternative to make it easier for developers to adopt. The plan was to make it simple enough for any engineer to pick up, not just DataOps or Pentaho experts.
Over the course of the next few articles, I’m going to share how we can drastically simplify the whole set-up down to a ~15MB binary, written in super-fast Go.
All of the data integration patterns that are familiar to Pentaho developers have been wrapped up into a command-line tool.
And, internally, it has the familiar component-based, streaming architecture that Pentaho uses too.
In my next article I talk about what the ideal pipeline to go from Oracle to Snowflake needs to look like if we want to fix the issues above.
In the meantime, I hope you can get some use out of the Pentaho code. Feel free to take a copy and make changes.
If you have any questions, just shoot me a message any time, as I’d be really interested to chat about this stuff.
Thanks for reading and see you in the next one.