Bulletproof Data Integration: Tracking Changes - Log Mining vs Triggers

Published on 16 Jun, 2020. Last updated 18 Jan, 2021

Introduction

In this article I talk about the best ways to propagate changes to relational data from one database to another.

This follows on from my previous post about the best incremental data pipeline, which relies on knowing the last modified date-time of records in the source, so I’ll cover this in more detail.

Then there’s the situation where we need to track deleted records. This isn’t always necessary in data warehouses, as rows are often just accumulated, but what’s the best way to handle them in case we need to?

Should we just use a log mining solution that may have a higher set-up cost and ongoing licence fees, or should we opt for a more granular, agile rollout of triggers that capture the keys of deleted records in an audit trail, or the last modified date-time of changes?

Let’s find out.

Options

At a high level, these are the options available to track changes:

  1. Use transaction log mining
  2. Add light-weight triggers to log deleted keys or maintain a “last modified” date
  3. Soft-delete records instead, if the application is in your control

Log Mining

You might be thinking it would be far easier if we just used a log mining solution, but the decision to go down that road shouldn’t be taken lightly.

There are obviously some situations where you clearly HAVE to start mining, like when you’re running a 3rd-party product whose schema you can’t touch for licence reasons.

In other situations, there’s more to it than that. Consider these variables:

  • The ongoing cost of product licences and support agreements, commonly charged per processor
  • The initial set-up cost of product installation and source database configuration spanning multiple teams
  • Pipeline development and engineering cost
  • Product incompatibility with database operations like recreating or truncating tables and partition maintenance
  • Reliability β€” how often does it need to be poked to keep it running?

Oracle will tell you that the best tool out there in this category is GoldenGate. It’s okay, they own it so they’re going to say that πŸ˜„ After all, it’s packed full of features that are well integrated and supported by the database, but it’s expensive!

The last time I looked, licences were $17k USD per CPU core and $3.8k per year for support in 2019. There are cheaper 3rd-party alternatives like Dell Shareplex, HVR and Attunity if cost is the driving factor for you.

Licence price aside, all of these products have a high initial set-up cost and will add a large amount of architectural complexity to your tech stack.

If you’re aiming to be maintenance-free, the added complexity unfortunately won’t help to simplify your migration problems.

Before adopting them, it would be sensible to talk with your DBAs to understand how much time the set-up and ongoing maintenance will consume, and if there are any regular database operations that aren’t compatible with the log mining.

You don’t want to end up in a situation where one team is continually breaking the mining required by another team.

Now let’s discuss option 2 β€” the triggers.

Database Triggers

In contrast, deploying triggers can take a more incremental approach. The technique of using them is often referred to as “synchronous capture” and there are many ways to implement it.

In more complex set-ups, triggers can be written to record changes found in one database into another. But this is frowned upon, and rightly so, because of the coupling it causes between systems.

In their simplest set-up, triggers can be used to record changes in the same schema as the tables that are being audited. With this, there’s minimum coupling and complexity.

The light-weight triggers can be rolled out on a per-table basis by either the application developers or the very same data integration team handling the migration.

There seems to be a belief in the industry that triggers should ALWAYS be avoided. Perhaps the thinking was born in a time when full-blown business applications where built inside the database. A time when spaghetti PL/SQL or T-SQL code would cause cascading effects that were hard to visualise and debug.

Ultimately with code like this, application development and database performance grew slower over time, leaving a bad impression of database tech with the business.

However, small triggers that are localised to a single database object and designed to do one thing only β€” say, capture a last-modified date or append the key of a deleted record to an audit trail β€” are the perfect application.

There will be a natural performance concern, but we can quantify it easily. Let’s do that now!

Performance Testing Triggers

I’m going to test two types of trigger and provide stats:

  1. Track the last modified date-time for records that are inserted or updated, as this is needed to drive the pipeline in my previous article.
  2. Capture the keys of deleted records in a separate table as an audit trail.

I have a 13" MacBook Pro from 2018 that I use for all my development work. It’s a quad-core i5 @ 2.3 GHz. I use the Oracle “Developer Day” VM without any special treatment in VirtualBox and it runs with 2GB RAM and 1 CPU thread. It’s running in NOARCHIVELOG mode and the total SGA size is only 800 MB.

Your set-ups will probably have more horse-power, but may be under more load.

My test data is based on the contents of Oracle’s ALL_OBJECTS (20+ columns), duplicated to get 1m rows in a nicely compacted table. The table size is only 168 MB on disk, without compression and it’s all 8 KB blocks.

I added a DATE field with a trigger that sets the value to SYSDATE before a row is inserted or updated. I also added a trigger that fires before a row is deleted to save the primary key into another table.

So the set-up is capturing both the last modified date-time for any changes and tracking any deleted records too.

The code can be found in a GitLab repo here if you’d like to take a look. There’s much room for improvement, but it seems to be a fair start.

Testing INSERTs / UPDATEs

The first test SQL statement is an UPDATE to all rows. It sets the value of a column to a short string. It captures the timing for ten executions and calculates the average. It commits each transaction as it goes.

Results

  • For 1m rows it takes 7.9 sec with the trigger disabled vs 13.5 sec with it enabled. The triggers have added 5.6 sec overhead to the batch. ~5 microseconds per row.
  • For 10m rows (1662 MB on disk this time) it takes 92.8 sec with the trigger disabled vs 137.7 sec with it enabled. That’s quite a large 44.9 sec overhead! Or ~4 microseconds per row.

It’s a pretty simple test with just one user and one thread, but with a finger in the air, these are comparable timings.

You can see that for large batch operations there’s going to be a noticeable impact. But for smaller single-row operations the impact is super low.

You could say that if you’re running regular updates to 100s millions of rows, you probably won’t want this trigger.

On the other hand, if you’re just dealing with an OLTP system running your business then it’s going to be okay.

If the database application is in your full control, you’ve probably already implemented triggers like this that can be repurposed, so you’re good to go.

Testing DELETEs

In the next test, the SQL statement is a DELETE of all rows in the sample table. I capture the timing for ten executions and calculate the average. There’s a rollback after each DELETE statement. The first execution will grow the audit trail table so that’s expected to be slower than subsequent iterations.

Results

  • For 1m rows it takes 42.5 sec with the trigger disabled vs 125.3 sec with it enabled. The triggers have added 82.8 sec overhead for the batch.
  • For 10m rows (1662 MB on disk this time) it takes …

<test in progress - I’ll publish these stats ASAP>

Often, the performance impact can be reduced in the case where many child rows are deleted for one parent record. In that situation, triggers only need to capture the deleted parent record to minimise performance impact. Then data integration tools can focus on fetching changes for the parent table while replacing all child records to keep things in sync.

More Advantages Of The Triggered Approach

Another great advantage of using the triggers is that it means your mechanism for collecting data changes is via the front door of the database, so to speak. You can just use a regular SQL client to fetch the data in a way that the database is designed for. No sneaking around or using-the-back-door required.

You’ll be safely abstracted from database maintenance operations performed by DBAs and you’ll never (dare I say it) be subject to future compatibility issues arising when Oracle inevitably changes the internals of its redo logs or the PL/SQL packages that 3rd-party products rely on.

Conclusion

If you’re in a situation where you’re considering database log-mining tools to replicate data or track deleted records then this article highlights arguments both for and against them.

There are certainly some cases where you need to use log mining, but the performance stats show that light-weight triggers are not the evil that they’re made out to be.

Adopting log mining opens the door to a whole load of new maintenance scenarios that require niche expertise to engineer.

It’s not the recurring licence fees that hurt so much as the ongoing development and support cost across multiple teams.

If you prefer low-cost, agile solutions and you’re able to make changes to your source database then you should avoid the complexity and ongoing maintenance cost of log-mining tools and prefer triggers instead.

Using triggers keeps your tech stack simple and allows engineers to use bulletproof data integration patterns that use plain SQL to access data in a maintenance-free way.

Thanks for reading and see you in the next one,

– Richard