In this article I talk about how I started to migrate data from Oracle to Snowflake for a retail company based in West London back in 2018 – using Pentaho Data Integration (PDI).
I cover why we started the journey and describe some of the shortcomings of PDI (at least, the Community Edition) that I discovered along the way.
Keep reading to learn more about the serious level of complexity required to extract records in a reliable and config-driven way, plus some security-related gotchas.
I also talk about why I wouldn’t recommend Pentaho to solve this migration problem and offer a simpler alternative in the conclusion that’s far easier to use, deploy and maintain.
My previous article sets the scene around the data integration platforms I've worked on.
My next article will have a step by step guide to recreating the PDI pipeline so you can see in detail what’s needed to get going.
Thanks for reading!
Back in 2018, I was given the opportunity to start engineering a data integration platform again, while working for a retail company in West London, UK.
This time, the task was to migrate data out of an on-prem Oracle database into Snowflake, the cloud warehouse service.
The aim was to give the business renewed access to their data in modern analytics and dashboarding services, starting with the SaaS product Looker.
We wanted to democratise the data and gain the efficiencies of having more people innovating around it, instead of relying on a centralised reporting team.
So I thought, “I know, this’ll be easy, I’ll just use Pentaho!”
After all, I had used it in anger for 5 years in production…
So we set out to quickly build a pipeline that would:
Extract full tables/views or produce incremental extracts using a range of dates or sequence numbers
Be as simple as possible to include new tables/views to extract
Be config driven
Be containerised so we run the same code in both dev and production
Be self-healing in the event of a failure (zero maintenance required)
But it turns out there are problems with PDI (Community Edition) that I hadn’t appreciated while working in my past life (previous article), which had centred purely on Oracle databases, the odd SQL Server instance and a few HTTP APIs here and there.
❌ For a start, it doesn’t integrate well with complex HTTP API endpoints. Sure, it has some built-in components to make HTTP calls, but the smarts just aren’t there to make integrating whole pipelines with things like secrets management apps very easy, or services that require cookies to be carried between jobs or steps in a transformation.
❌ Access to S3 buckets is limited in that the built-in components are just capable of reading and writing files. Anything like copying, moving or making smart decisions around the objects in a bucket has to be built manually.
Sure, you can shell out to AWS CLI and do anything you want, but it’s more scripting and complexity, all of which you’re trying to avoid by using an ETL tool in the first place.
It feels like that age-old analogy: forcing a square peg into a round hole.
It will probably fit eventually, but then you’ll be stuck with it, wishing you’d written the whole thing more cleanly in pure code, say Python.
At least with Python you can write some unit tests and be confident it will continue to work after future maintenance.
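To make that concrete, here’s a minimal sketch of the kind of S3 housekeeping logic that PDI’s built-in components can’t express, written as a pure Python function so it’s trivially unit-testable. The function name, the `.done` marker convention and the overall workflow are my own illustration, not anything from the actual pipeline:

```python
def keys_to_archive(keys, processed_suffix=".done"):
    """Return the S3 keys that have a matching '.done' marker file and can
    therefore be moved to an archive prefix. A pure function over a list of
    key names, so unit tests need no AWS mocks at all."""
    done = {k[: -len(processed_suffix)]
            for k in keys if k.endswith(processed_suffix)}
    return sorted(k for k in keys
                  if not k.endswith(processed_suffix) and k in done)
```

The actual copy/delete calls would then be a thin layer around this decision logic, keeping the part that can go wrong fully covered by tests.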
❌ There’s also a painful hurdle around how to get PDI to dynamically set environment variables for pipelines to use.
We needed environment variables to be set so we could make a single pipeline that was config-driven. We wanted a single set of jobs and transforms that would use metadata injection to extract tables/views based on a list in a CSV file.
Because all steps in a PDI transformation run in parallel, it’s not possible for a single transform to set an environment variable for its own steps to use.
The normal Pentaho way of working is to export them for a subsequent job or transform to pick up. You have to wire up variable names across multiple hard-to-find UI dialogues. It’s super clunky and brittle when you compare it to a command line app or shell script.
From a complexity point of view, environment variables are a pain in Pentaho. But that’s just the beginning…
❌ Other, more fundamental issues come up when you realise you need to supply properly encrypted values to PDI: decrypting them at run time and exporting them as environment variables is way harder than it should be.
In my case, I needed keys and secrets configured for the AWS CLI to talk to my S3 buckets. Without this, we were going nowhere.
The problem is that there are a number of ways to do this. And, as a novice, there’s lots of opportunity to get it wrong, which will introduce security risks around critical credentials.
You see, to pass environment variables into PDI so they work consistently across Spoon (the UI on your workstation), Kettle (the ETL engine) and Carte (the web service), you have to hack the start-up scripts so that values can be supplied to the Java -D flag on the command line.
When running in production in AWS we can use an EC2 instance profile to remove the issue, and when running locally we can use utilities like aws-vault to set our environment variables for us.
We want the approach to be as consistent as possible across the two environments to avoid storing up complexity and maintenance headaches for later.
Hindsight is a great thing but, at the time, the most consistent method I knew of was to use the standard kettle.properties file to store secrets. We can use it to hold raw or obfuscated environment variables, where the latter must be manually decrypted.
There’s a script called encr.sh bundled with the Pentaho distribution that lets us obfuscate values easily. You can use the obfuscated values in most text boxes throughout the UI and have them automatically decrypted on the fly at run time.
❌ The problem is that Pentaho uses a symmetric key to achieve this so anyone with a copy of PDI can reverse engineer the original values easily. So it’s no good to save this stuff in source control, which makes it harder for the solution to be portable within a DataOps team and consistent across development and live environments.
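To illustrate why this matters, here’s a toy example of the general weakness (this is NOT Pentaho’s actual algorithm, just a deliberately simple stand-in): any obfuscation scheme whose key ships inside the tool itself can be reversed by anyone who downloads a copy of that tool.

```python
# Toy illustration only: a shared key baked into the software means
# "obfuscated" values are trivially reversible by anyone with the tool.
SHARED_KEY = b"kettle"  # hypothetical key, shipped with every copy

def obfuscate(plaintext: str) -> bytes:
    data = plaintext.encode()
    return bytes(c ^ SHARED_KEY[i % len(SHARED_KEY)]
                 for i, c in enumerate(data))

def deobfuscate(blob: bytes) -> str:
    # The exact same key and operation undoes it -- no secret required
    # beyond what every user of the tool already has.
    return bytes(c ^ SHARED_KEY[i % len(SHARED_KEY)]
                 for i, c in enumerate(blob)).decode()
```

This is why obfuscated values in kettle.properties can’t safely live in source control: the "key" is effectively public.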
A better way to handle these secrets would be to use a service like AWS Secrets Manager or Hashicorp Vault, but at the time PDI didn’t integrate with those services natively.
Another alternative would have been to store values encrypted manually with a public key in source control and decrypt them using AWS Key Management Service.
Sure it would be possible to build some transformations to make API calls to these services, but again that seems to defeat the purpose of using the pre-canned features of an ETL tool in the first place. Ideally we should avoid this complexity as it will cost us in terms of maintenance in future.
Wouldn’t it be great if we could just use environment variables easily like a 12 Factor App does in the microservices space?
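For contrast, here’s a minimal sketch of 12-factor-style configuration in Python, where all secrets arrive via the environment (injected by an EC2 instance profile, aws-vault, or an orchestrator) and the app just reads them. The variable names and the `ExtractConfig` class are illustrative, not from the real pipeline:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractConfig:
    """12-factor style: secrets and config come from the environment,
    which the hosting platform populates. No key files, no start-up
    script hacking."""
    oracle_dsn: str
    s3_bucket: str

    @classmethod
    def from_env(cls) -> "ExtractConfig":
        # KeyError here is deliberate: fail fast if config is missing.
        return cls(
            oracle_dsn=os.environ["ORACLE_DSN"],
            s3_bucket=os.environ["S3_BUCKET"],
        )
```

The same code then runs identically on a laptop and in production; only the mechanism that populates the environment differs.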
Let me know your thoughts around this and how you’ve solved accessing secrets in PDI reliably. I’d be really interested to know if I’ve missed anything obvious.
Where Pentaho Shines (100s of Millions of Records)
Nevertheless, I knew Pentaho very well, so all of this wasn’t a deal-breaker for me. We wanted to get going quickly and Pentaho would let us do that in a snap!
Pentaho shines when you use its core features around manipulating streams of records.
If you’re just dealing with hundreds of millions of rows in an RDBMS and not true big data, Pentaho will take care of it easily, say, in a nightly batch using 4GB RAM and a couple of CPU threads.
You begin to wonder why anyone would bother munging data with Python when Pentaho has the simple stuff covered with its built-in components.
In reality, there are plenty of cases where you need Python running in something that scales out; but in my situation, PDI fit the bill:
✅ I wanted to leverage the performance gains of multi-threaded, parallel extracts from Oracle, which are configurable in a few clicks and gave us ~60–100k rows per second.
✅ The ability to read CSV config files that the business could edit easily using their favourite tool, Excel.
✅ And CSV file generation, from any input stream of records, that easily applies smart naming conventions and splits files based on their size or the number of rows in them.
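That last point is easy to sketch in code. Here’s a simplified, testable version of the splitting logic, capping each output file at a fixed number of rows and applying a predictable naming convention. It returns the file contents as strings for clarity; the file name pattern is my own illustration:

```python
import csv
import io

def split_to_csv(rows, header, max_rows, name_fmt="extract_{part:04d}.csv"):
    """Split a stream of record rows into CSV 'files' of at most max_rows
    data rows each, named with a zero-padded part number. Returns a dict
    of {file_name: csv_text} instead of writing to disk, for testability."""
    files, part = {}, 0
    buf = writer = None
    for i, row in enumerate(rows):
        if i % max_rows == 0:  # start a new file, repeating the header
            part += 1
            buf = io.StringIO()
            writer = csv.writer(buf)
            writer.writerow(header)
            files[name_fmt.format(part=part)] = buf
        writer.writerow(row)
    return {name: b.getvalue() for name, b in files.items()}
```

Splitting by byte size rather than row count is a small variation on the same loop: track `buf.tell()` instead of the row index.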
I needed results in hours from scratch, not weeks down the line.
With Pentaho, if you can keep all your logic in visible components, without burying complexity in scripts, then you’re winning. When things are visual in the IDE, in as few transformations as possible, anyone in theory can pick up the code and immediately understand what it’s trying to do. So it’s super easy to support and costs you less in the long run.
If, on the other hand, logic has to be moved out to scripts, then you’re probably going beyond the capabilities of Pentaho and you’ll benefit from moving the whole lot into pure code, where you can at least have some unit tests, to make the reliability of your pipelines predictable over time.
Anyway, enough of my ramblings about PDI for now…
About The Data
One of the issues that made the Oracle warehouse hard for our business-focused customers to use was its table and column naming conventions. They were obscure and not exactly self-documenting.
The issue grew out of the historical 30-byte limit Oracle imposed on table and column names. I see this limit has since been lifted in Oracle Database 12.2 (identifiers can now be up to 128 bytes), but our source systems originated in the early 2000s.
If we wanted to democratise the data we needed to improve on this.
So as a principle, we didn’t want to lift and shift everything from Oracle; we wanted to take the opportunity to simplify our new schema in Snowflake.
We could quickly improve the table and column names, even if we didn’t have time to refactor the complex data model entirely. The CSV file also provided a way to remap column names and you’ll see that in a later blog.
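As an illustration of that remapping idea, here’s a minimal sketch driven by the same kind of two-column CSV a business user could maintain in Excel. The column headings (`source_name`, `target_name`) are assumptions for the example, not the real config format:

```python
import csv
import io

def load_column_map(csv_text: str) -> dict:
    """Parse a simple source->target column mapping file."""
    return {row["source_name"]: row["target_name"]
            for row in csv.DictReader(io.StringIO(csv_text))}

def remap_columns(record: dict, column_map: dict) -> dict:
    """Rename a record's keys per the mapping; columns without an
    entry keep their original (Oracle) names."""
    return {column_map.get(k, k): v for k, v in record.items()}
```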
Within minutes of firing up the Pentaho IDE, we started extracting data from the Oracle data warehouse and generating CSV files.
In a small team of two again, we chose a manual process to get fast results in the short-term.
We copied files up to S3 using the AWS CLI so we could start building out the business' first data lake and ingest the data into Snowflake and Looker from there.
My partner in crime was an excellent Data Architect who was passionate about running a product-based team as a mini start-up within the wider organisation. He was great at customer discovery and drumming up interest across the business.
He started demoing “the art of the possible” as we called it back then, using Looker.
The business seemed genuinely excited by the speed of progress as we were exposing the data they knew well from old tabular Business Objects reports in fresh new ways!
Version 1 of the Pentaho Pipeline
Meanwhile, I was evolving the PDI feed into a config-driven pipeline. I wanted it to be easy for anyone to add their migration requirements to a list of tables/views and have them copied from any Oracle source to Snowflake in one hit.
The config was just a CSV file edited in Excel and checked into a GitHub repo, but this would drive everything from table names to the sequencing and degree of parallelism.
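To give a feel for how a CSV like that can drive sequencing and parallelism, here’s a sketch that groups tables into batches by priority: everything in a batch can extract in parallel, and batches run in ascending priority order. The column names are illustrative, not the real config schema:

```python
import csv
import io
from itertools import groupby

def extraction_batches(csv_text: str):
    """Turn a config CSV into ordered batches of table names.
    Each batch shares a priority and can run in parallel; batches
    execute in ascending priority order."""
    rows = sorted(csv.DictReader(io.StringIO(csv_text)),
                  key=lambda r: int(r["priority"]))
    return [[r["table_name"] for r in grp]
            for _, grp in groupby(rows, key=lambda r: int(r["priority"]))]
```

Adding a new table to the migration is then literally one new row in the spreadsheet.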
The pipeline concentrated on extracting records into CSV files split based on size and copying them up to an S3 bucket.
I’m afraid the last step, loading the data files into Snowflake, wasn’t there yet, but it would be easy enough to add using the Snowflake JDBC driver.
I’ll see if I can get around to that in another post soon. Let me know your thoughts in the comments.
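For anyone wanting to fill that gap themselves, the load step boils down to issuing a Snowflake COPY INTO statement against an S3-backed external stage. Here’s a sketch that just builds the SQL; the stage name, prefix layout and file-format options are assumptions, and you’d execute the result via your driver of choice:

```python
def copy_into_sql(table: str, stage: str, prefix: str) -> str:
    """Build a Snowflake COPY INTO statement for CSV files landed under
    an external stage. Stage and prefix names here are placeholders."""
    return (
        f"COPY INTO {table}\n"
        f"FROM @{stage}/{prefix}\n"
        f"FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)\n"
        f"ON_ERROR = 'ABORT_STATEMENT'"
    )
```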
After a month of development and bending Pentaho to my will, I built a repeatable pipeline that would take our data into S3 easily. It was designed to be stateless and carry on from where it left off in the event of failure, just by reading the contents of our S3 bucket.
This meant the naming convention used for the CSV files needed to include date-time and/or sequence numbers that reflected the make-up of the source tables/views.
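The stateless resume trick is simple once the naming convention carries the state. Here’s a sketch using sequence numbers embedded in the key names (the `_seqNNN.csv` pattern is illustrative); the pipeline lists the bucket, finds the high-water mark, and extracts from there:

```python
import re

def next_watermark(existing_keys, pattern=r"_seq(\d+)\.csv$"):
    """Derive where to resume from purely by inspecting what's already
    in the bucket: the highest sequence number embedded in the existing
    file names. An empty bucket means start from zero."""
    seqs = [int(m.group(1)) for k in existing_keys
            if (m := re.search(pattern, k))]
    return max(seqs, default=0)
```

Because no state lives anywhere but the bucket itself, a crashed run leaves nothing to clean up; the next run simply recomputes the watermark.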
All of the pitfalls mentioned earlier around variables and security were fiddly to overcome, but it worked. It was a good enough version-one.
All I had to do to get new tables/views from our Oracle databases was add them to the config file and give them a priority to be extracted with.
Pentaho Cluster In Production
Of course, there was infrastructure to take care of as well, as we moved from running the Pentaho feeds on our laptops in Spoon to needing our batches to run hourly or overnight in production.
We built everything in AWS from the ground up using Terraform (v0.11 at the time) with Concourse for CICD.
That covered VPC creation and some annoying peering to a VPC in another account, which had access to our on-prem network where the databases were running.
I built out a multi-node Pentaho Carte cluster running on EC2. Pipelines could be kicked off by hitting an HTTP endpoint to run a core job that figured out which node in the cluster was least loaded. The actual extracts themselves would then be launched on the least loaded nodes dynamically.
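The launcher’s scheduling decision reduces to a one-liner once each node’s load is known. In the real pipeline the running-transformation counts came from each Carte node’s status endpoint; here’s the core selection as a pure, testable function (names illustrative):

```python
def least_loaded(node_loads: dict) -> str:
    """Pick the Carte node with the fewest running transformations.
    node_loads maps node name -> count of running jobs/transforms,
    as reported by each node's status API."""
    if not node_loads:
        raise ValueError("no Carte nodes registered")
    return min(node_loads, key=lambda node: node_loads[node])
```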
Logs were visible in the Carte web server too. This was good enough and we could bolt on monitoring and log streaming later.
It’s probably worth a dedicated blog to talk about the infrastructure some more, and then the Airflow platform that superseded it, but I’ll come back to that.
Let me know in the comments below what you’d like to hear more about.
Pros & Cons
The cluster was great:
✅ We could scale out the cluster easily in Terraform by changing a few lines of code and letting CICD take care of it.
✅ Pipelines would self load balance. They’d figure out what the least-loaded node in the cluster was and post themselves to run there. Just add a node to the cluster via Terraform and the “launcher” job would figure it all out.
✅ Each node in the cluster would regularly fetch pipeline code from GitHub so any changes we needed to make would be rolled out seamlessly.
But it had serious limitations that needed to be addressed if we were going to keep scaling as a team…
❌ Fixed RAM limits: It was easy to add tables to the config file driving the extracts, and to set the priority each table would be extracted with. The problem was that the degree of parallelism needed careful thought to avoid crashing the Pentaho (Carte) node through lack of RAM!
❌ Feedback loops missing: As an extension to the previous issue, we were running on EC2 instances with a fixed amount of RAM so Pentaho could only extract so many tables in parallel within its RAM limits. I don’t think this was the fault of PDI, but rather of the pipeline’s lack of feedback from its hosting environment.
❌ Lack of monitoring: It was harder than it should be to understand if we were crashing Carte due to a lack of memory or due to some other bug because we hadn’t implemented any detailed mem stats for the EC2 nodes. Unfortunately, it required some manual smarts to onboard new tables after all.
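The "manual smarts" amounted to a back-of-envelope calculation for every onboarding: how many extracts can run at once before the JVM runs out of heap? A sketch of that check, with purely illustrative figures rather than anything we measured:

```python
def max_parallel_extracts(node_ram_mb: int, jvm_overhead_mb: int,
                          per_extract_mb: int) -> int:
    """Rough cap on concurrent extracts per Carte node: usable RAM
    divided by the estimated footprint of one extract. All figures
    are illustrative estimates, not measurements."""
    usable = node_ram_mb - jvm_overhead_mb
    return max(0, usable // per_extract_mb)
```

This is exactly the feedback loop a modern orchestrator gives you for free via resource requests and limits, and exactly what we were doing by hand.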
When Carte crashed we only had to restart the EC2 host and it would all come back to life automatically. It was easier and saved time that way.
It would have been great to containerise this all and run it on Kubernetes with some health checks to take care of the restarts, but again, I’ll leave that for another day.
After we moved away from Pentaho, we built an Airflow platform on Kubernetes. And I have a new opinion about how this stuff should be done correctly.
Let me know your thoughts in the comments or get in touch directly if you’d like to hear more.
In the introduction, I mentioned I’d talk about why you shouldn’t use Pentaho to migrate data to Snowflake. And my reason for saying this is really centred around long-term team performance and security.
If you’ve jumped here and want to get an idea of why I say this, scroll up to see the items with ❌ against them.
While your teams will get a speed increase from the re-usable components in PDI, they will burn a lot of time bending the tool to work with environment variables and integrating with modern HTTP services and secrets management apps.
I haven’t touched on monitoring, logging, alerting or scheduling, which are other essential elements of a good data integration platform, but they’re going to take up a lot of engineering time to get right too.
Modern SaaS platforms are starting to take care of this peripheral stuff from the get-go, so it’s worth considering the whole picture if you’re picking a third-party platform to run with. That’s assuming your business is happy to get on board with SaaS in general.
If, on the other hand, you’re planning to build your own data platform in-house, you’ll probably end up coding (think, reinventing) data integration patterns directly instead.
Your teams will spend time essentially rewriting a library of functionality to mimic features that come pre-canned with existing ETL tools. And once you have your library, you still need to wire up the data integration patterns themselves.
Either way, you know you have some development work to do, unless you pick a platform that does everything in one place.
After I started using Pentaho to migrate to Snowflake, I immediately thought it’d be far simpler if there were a tool that wrapped up both the ETL functionality and the data integration patterns you’d otherwise have to write in-house.
That leads me on to the topic of another blog I’m going to post about a simple tool I wrote called Halfpipe that helps every engineer move records between RDBMS, Snowflake and S3. It reduces all of the moving parts to just one command.
But before I get to that, I need to take you through the old Pentaho pipeline and what an ideal one looks like.
In my next article I’ll get down to the logic in the Pentaho pipeline and give links to the code on GitLab. It will add some more weight to the known issues and gotchas that I mentioned above.
We sure need to talk about the gotchas.
See you in the next one!