Serverless Data Pipelines: How AWS Lambda Reduces Data Platform Costs (1 Billion Records a Day for $16 a Month)

Published on 19 Oct, 2020. Last updated 26 May, 2022

Introduction

I’d like to talk about how using serverless can reduce the cost of data pipelines and in what circumstances it stacks up.

The Scenario

Let’s say you have some relational data that you want to get into the cloud, but you don’t have the luxury of a few spare servers kicking around on-prem to run the integration.

So you know you need to pay for new compute, and you’ve considered SaaS products like Alooma, Fivetran or Stitch, but simply don’t like the idea of paying per record.

You’ve ruled out alternative services like Matillion, which let you pay by the hour, and have decided to build your own platform instead.

You would prefer to spin up your own compute and pay the going rate direct to AWS, Azure or GCP etc.

What’s Being Covered?

So this article talks about the results of an experiment I ran to understand if and when using AWS Lambda can save on EC2 compute costs.

Since Lambda offers high availability out of the box, and you only pay while it runs, there would be no need to spend time and effort on clustering EC2 instances and shutting them down when they’re not running pipelines.

I’m going to show the monthly cost of using Lambda to extract data from 100 tables, each with 10 million records changing per day, at common execution intervals.

Many organisations will have more nuanced data sets to deal with, but hopefully this volume is large enough to give you a feel for the cost you might incur running Lambda in your own environments.

The aim is to use Halfpipe to extract data from a demo SQL Server instance into gzipped CSV files in an S3 bucket, as a starting point for other cloud services like Snowflake. The data source could equally be Oracle or Postgres.

Given the right circumstances, can Lambda reduce the cost and effort required to stand up data pipelines?

Here’s the overview…

So let’s get straight to the punchline - a graph of Lambda costs vs EC2.

Cost Savings

Check out the graph below to see the monthly cost of various Lambda sizes running at common execution intervals, to process 10 million records per table per day, for 100x tables.

The cost of some EC2 instances is also shown for comparison.

Click on the legend to enable/disable different series.

To zoom, you can click and drag a new area. If you need to reset the zoom level, double-click (or double-tap if on mobile).

If you’re viewing this on a smartphone, you may have more luck with the graphs in landscape mode as they’re quite dense.

Many thanks to Plotly.com for doing the rendering!

The original calculations can be found in this Google Sheet.

First Impressions

Lambda

Here, the assumption is that the Lambdas are extracting deltas, so as the execution frequency rises, the record volume decreases per interval.

The cost stays quite flat until the execution frequency rises above once per minute.

For as little as $16 per month with AWS Lambda, you can continuously migrate 100 tables each containing 10 million records changing per day to an S3 bucket. This makes it super easy to get into cloud warehouses like Snowflake, all without server infrastructure to manage, patch or upgrade.

If you have more data to process, I’m going to cover that below with more graphs to help visualise the limits.

To schedule the Lambdas we can use CloudWatch events or create a more elaborate architecture with Kinesis streams for an event-based approach. The latter will be the subject of another blog where I discuss simple options for event-sourcing and how Halfpipe can help with that.
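If you’re curious what the CloudWatch events route looks like, here’s a minimal boto3 sketch that wires a scheduled rule up to a function. The function and rule names are placeholders for illustration, not anything Halfpipe creates for you.

```python
import boto3

# Hypothetical names for illustration only.
FUNCTION_NAME = "halfpipe-extract-orders"
RULE_NAME = "halfpipe-extract-orders-every-minute"

events = boto3.client("events")
lam = boto3.client("lambda")

# 1. Create (or update) a scheduled rule - here, once per minute.
rule = events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression="rate(1 minute)",
    State="ENABLED",
)

# 2. Allow CloudWatch Events to invoke the function.
lam.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId=f"{RULE_NAME}-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# 3. Point the rule at the Lambda.
fn_arn = lam.get_function(FunctionName=FUNCTION_NAME)["Configuration"]["FunctionArn"]
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "1", "Arn": fn_arn}],
)
```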

EC2

An alternative always-on, server-based approach is available for a little over $31 per month, in the form of one t3a.medium offering 2 vCPUs and 4 GB of RAM.

While this is plenty of RAM to run Halfpipe, it may not offer much room for the more resource-hungry traditional integration tools. On the subject of memory usage, there’s more detail on how much Halfpipe consumes below.

Now, while Lambda provides high availability and less to manage by default, the EC2 approach offers neither out of the box.

If you want pipelines to be highly available on EC2 then you’ll need to add more servers and consider how to schedule jobs within a cluster. Perhaps Kubernetes is another option at this point, but that’s a different level of complexity, cost and fun altogether.

To reduce the EC2 cost, an autoscaling group can be used to automatically stop servers when they’re not running any schedules. It would be slightly more complex to set up, but not the end of the world.
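For example, if your schedules run in a predictable window, a couple of scheduled scaling actions can park the group at zero instances outside of it. This is only a sketch with made-up names and times to show the shape of it:

```python
import boto3

autoscaling = boto3.client("autoscaling")

ASG_NAME = "pipeline-workers"  # hypothetical auto scaling group name

# Scale up just before the nightly pipeline window (times in UTC)...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName=ASG_NAME,
    ScheduledActionName="start-pipeline-window",
    Recurrence="0 1 * * *",  # 01:00 every day
    MinSize=1, MaxSize=1, DesiredCapacity=1,
)

# ...and back down to zero once the schedules have finished.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName=ASG_NAME,
    ScheduledActionName="stop-pipeline-window",
    Recurrence="0 3 * * *",  # 03:00 every day
    MinSize=0, MaxSize=0, DesiredCapacity=0,
)
```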

So the Lambda approach keeps things clean, cheap and simple. The barrier to entry is super low!

What About The Cost Spikes?

The graph series spike as the pipeline frequency hits 1-second intervals because the number of executions dramatically increases the Lambda “request” charges.

At the time of writing, AWS bills $0.20 per million requests. That means a pipeline executing every second clocks up 2,628,000 requests in one month - and in this scenario there are 100 of them.

Below that frequency, it’s quite cost effective even at 10 second intervals for the 100x tables.
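To see where the spike comes from, here’s the request-charge arithmetic on its own (duration charges excluded), assuming the $0.20-per-million rate and roughly 730 hours in a month:

```python
# Request charges only, assuming $0.20 per million requests and one Lambda per table.
PRICE_PER_REQUEST = 0.20 / 1_000_000
SECONDS_PER_MONTH = 730 * 60 * 60  # ~2,628,000
TABLES = 100

for interval_seconds in (1, 10, 60, 3600):
    requests = SECONDS_PER_MONTH / interval_seconds * TABLES
    print(f"every {interval_seconds:>4}s -> {requests:>12,.0f} requests "
          f"= ${requests * PRICE_PER_REQUEST:.2f}/month")

# every    1s ->  262,800,000 requests = $52.56/month
# every   10s ->   26,280,000 requests = $5.26/month
# every   60s ->    4,380,000 requests = $0.88/month
# every 3600s ->       73,000 requests = $0.01/month
```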

Per-Record Thinking

When we start to use services like Lambda, and even Snowflake for that matter, we enter a new world where the impact of our technical behaviour can be linked to cost more directly.

For example, in the good old days of running servers on-prem, we paid up front for the compute and the database licence for the year, and then just threw as much as possible at the resources to realise the most value.

Back then, it was important to ask questions like, “do you really need your data to be that up-to-date?”, but not nearly as much as it is now. In this new world of per-second billing those kinds of considerations take on a new level of importance.

For example, we can see that if we run the largest Lambda size (3008 MB) and shorten the execution interval from one minute to 10 seconds, the cost jumps from $22 to $134 per month. Ouch, that’s a big difference just for the sake of “50 seconds”!

On the other hand, it’s only a $4 jump for the smaller Lambda at 192 MB, so we clearly need to be more mindful of data volumes and transfer speeds with this approach.

Let’s discuss performance some more…

Performance

How Much Data Can Lambda Handle?

It’d be great to know how much data can be processed by Halfpipe running in Lambda. Is there a cost-effective upper limit?

Well, the Lambda size (configured RAM in MB) directly affects pipeline performance and therefore the volume of data that can be processed within a given execution.

The complexity of the source data will also impact the extract speed.

There’s also the speed of the source system to consider - how well can it serve up the data? - and network performance etc.

The closer Lambda usage gets to being always-on, the more likely EC2 is to be cheaper. It’d be nice to find out if there’s a sweet spot though.

To understand the throughput that you can expect to achieve in your environments, unfortunately there’s no getting away from having to measure it with your data.

Max Power

The maximum performance I found during my basic tests, without any parallelisation, using the fastest Lambda size of 3008 MB was 91,700 records per second.

As you’ll discover below, that was extracting from a simple SQL Server instance created in Docker without any customisation, running on a t3a.small instance.

This appears to be CPU and network-bound, as the amount of RAM required to extract the sample data was only ~140 MB (more on that below).

The figure above would allow over 80 million records to be processed within the maximum run duration of 15 minutes. That’s not too bad!

Below we’ll look at cost by volume of records, so keep an eye out for the next set of graphs.

If you choose to run the Lambdas more frequently than every 15 minutes, you won’t be able to extract as much per run. For example, running at 1-minute intervals only allows ~5.5 million records to be processed per execution. Still, that’s not too shabby!
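Those ceilings fall straight out of the measured throughput, if you want to sanity-check them:

```python
# Record ceilings implied by the measured ~91,700 records/second (3008 MB Lambda).
RECORDS_PER_SECOND = 91_700

for label, run_seconds in (("15-minute max duration", 15 * 60),
                           ("1-minute interval", 60),
                           ("10-second interval", 10)):
    print(f"{label}: ~{RECORDS_PER_SECOND * run_seconds:,} records per execution")

# 15-minute max duration: ~82,530,000 records per execution
# 1-minute interval: ~5,502,000 records per execution
# 10-second interval: ~917,000 records per execution
```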

Use Cases?

So there’s clearly a niche use case here. If your data volumes, per table, are below these logical performance thresholds then you could easily run an ELT platform this way.

When data sets are larger than that ~80 million record ceiling, you have to think a bit harder, especially around initial extracts:

  1. If the total record volume is over ~80 million then you could run Lambdas constantly until they have churned through all the data incrementally. Then dial back the execution frequency to just capture ongoing deltas after the initial extract/load is complete;
  2. Or just run your initial extracts using EC2 and continue to capture deltas using Lambda afterwards.

An upper limit of 80 million records per table per execution isn’t bad. Your mileage will obviously vary for more complex datasets, different source systems and network set-ups.

As your Lambda run times approach their execution frequencies, you may just be better off paying for the full-time cost of running EC2 instances.

I personally prefer to use one solution that works consistently regardless of the situation to keep things simple, but that’s a matter of preference.

So let’s have a look at what we could achieve if we fixed the execution frequency and Lambda size. It’s time for some more graphs.

How To Estimate The Cost Of Lambda & Halfpipe

If you come armed with the volume of records changing in your datasets each day, you can use the following graphs to estimate the cost of running Halfpipe in Lambda on a monthly basis.

The graphs show how monthly cost increases as the record volume rises. They visualise just two configured Lambda sizes at various execution intervals.

Both graphs have the same shape; it’s the cost that differs due to the Lambda size.

The graph series top out at slightly different levels depending on the Lambda speed.

The thing to remember here is that the record volume you choose as an entry point will be the amount extracted at each Lambda execution.

In my experience, as pipeline execution frequency increases, the volume of records changing in data sets tends to shrink.

Bear in mind these charts represent a single Lambda doing all the work, so there will be a slight cost difference compared to running multiple Lambdas processing the same total volume.

To get one Lambda to work for multiple tables such that the total volume matches your chosen entry point, we’ll need to create a custom pipe file (it’s JSON/YAML) and supply that to Halfpipe with the pipe command. Take a look at the next article for some more context.

I’m not going to cover that here, but I’d be happy to talk more about it. Please let me know in the comments if you’d like to know more.

The Source Data

The source data was hosted on a t3a.small EC2 instance.

That’s 2 vCPUs and 2 GB of RAM.

I created a SQL Server instance running the Docker image mcr.microsoft.com/mssql/server:2019-latest without any manual configuration whatsoever.

For the dataset, I wanted it to be easy for everyone to access, so I created a table using 1 million records from SYS.ALL_OBJECTS. Nothing fancy.
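If you want to recreate a similar setup, something along these lines would do, using the Docker SDK for Python and pyodbc. The password, table name and cross-join trick to bulk sys.all_objects up to a million rows are illustrative choices rather than an exact record of what I ran:

```python
import docker
import pyodbc

SA_PASSWORD = "Str0ng!Passw0rd"  # placeholder - SQL Server enforces complexity rules

# Start a throwaway SQL Server 2019 container with no further configuration.
docker.from_env().containers.run(
    "mcr.microsoft.com/mssql/server:2019-latest",
    environment={"ACCEPT_EULA": "Y", "SA_PASSWORD": SA_PASSWORD},
    ports={"1433/tcp": 1433},
    detach=True,
    name="halfpipe-demo-sqlserver",
)

# Give the instance time to come up, then build a ~1 million row table.
# A fresh instance only has a few thousand objects, so a cross join is one
# way to multiply sys.all_objects up to the target volume.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost,1433;"
    f"UID=sa;PWD={SA_PASSWORD}",
    autocommit=True,
)
conn.execute("""
    SELECT TOP (1000000) a.*
    INTO dbo.demo_objects
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
""")
```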

It occupied just 128 MiB on disk and had rows like the one below.

This is just a single example record; not all rows were identical to this. Hopefully you get the idea of the data types:

name, object_id, principal_id, schema_id, parent_object_id, type, type_desc, create_date, modify_date, is_ms_shipped, is_published, is_schema_published
sp_getsubscriptiondtspackagename,-80567752,,4,0,P ,SQL_STORED_PROCEDURE,2020-08-15 11:22:05.220,2020-08-15 11:22:05.907,true,false,false

I executed the Halfpipe Lambda for each size shown on the first graph above and took the average duration of three runs to calculate the speed in records per second.

The figure was used to extrapolate the duration required to handle 10,000,000 rows per table.

And after applying some standard AWS Lambda pricing math and multiplying out by 100 tables, I arrived at the costs.
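If you want to reproduce that math, it boils down to GB-seconds plus request charges. Here’s a simplified sketch using the on-demand rates at the time of writing ($0.0000166667 per GB-second and $0.20 per million requests, free tier ignored); the throughput figure in the example is illustrative rather than one of the measured values:

```python
GB_SECOND_PRICE = 0.0000166667
REQUEST_PRICE = 0.20 / 1_000_000

def monthly_cost(memory_mb, records_per_second, daily_records_per_table,
                 executions_per_day, tables=100, days=30.42):
    """Estimate the monthly Lambda bill for extracting daily deltas from `tables` tables."""
    # Runtime needed per execution to get through that execution's share of the
    # daily delta (billed-duration rounding ignored for simplicity).
    records_per_execution = daily_records_per_table / executions_per_day
    seconds_per_execution = records_per_execution / records_per_second

    executions_per_month = executions_per_day * days * tables
    gb_seconds = seconds_per_execution * (memory_mb / 1024) * executions_per_month

    return gb_seconds * GB_SECOND_PRICE + executions_per_month * REQUEST_PRICE

# e.g. 100 tables x 10M records/day, extracted hourly by a 192 MB Lambda,
# assuming ~20,000 records/second at that size:
print(f"${monthly_cost(192, 20_000, 10_000_000, 24):.2f} per month")  # roughly $4.77
```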

It would certainly be better to produce an average from more than three executions, but you can see in the calculations that the durations seem to be quite consistent between runs.

All the original calculations can be found in this Google Sheet.

How To Run Halfpipe In Lambda

It’s a case of creating a Lambda with the Halfpipe binary and setting a handful of environment variables to configure the extract.

I decided to move details from this section into a dedicated post, to keep this one focussed on cost.

Conclusion

Serverless offers a great opportunity to create a data integration platform for very little cost if the circumstances are right.

Lambda is normally thought of in the same context as event-driven architectures, but it can still be used for batch processing - up to a point. This article helps us understand what that “point” is.

The three graphs above make it possible to estimate the cost of using Lambda to run data pipelines, based on the volume of records that are changing in your sources.

Essentially, if the rate of change to the source data is below the number of records that can be processed by Lambda within a chosen execution frequency then it all stacks up. That is, Lambda can simplify infrastructure management costs and provide high availability.

There’s an upper limit to what can be processed in a single Lambda execution: up to ~80 million records for the largest Lambda (3008 MB), processing a simple dataset.

Here are two example costs taken from the graphs above if you haven’t already seen them:

  1. If you have 100x tables each with 10 million records changing per day then a platform built with Lambda would cost around $16 per month, assuming only deltas are extracted
  2. If you have 25 million records changing per day in a single table, you would be looking at paying around $0.40 per month if you extract a full snapshot daily. That increases to $9.75 per month at hourly intervals

It’s a balancing act as the smaller Lambda sizes cost less, but run slower. So if you have less data they are the cheaper option, assuming you don’t mind your pipelines running slower.

If you find yourself running micro-batches in Lambda at intervals as short as every 10 seconds, it’s going to cost a lot, but by that point you’d probably want a proper streaming architecture instead, or just to opt for always-on EC2 instances.

That said, it can still be done, and it’s easy to get started.

While true event-based architectures are harder to engineer, Halfpipe and Lambda offer a lean alternative.

What Next?

I’d like to come back to GCP and Azure functions-as-a-service later, so let me know in the comments if that’s of interest.

It’d be interesting to see how easily we could scale out the Lambdas to increase performance when the source data is physically partitioned, or even parallelise extracts using a logical partition ID derived at run-time to achieve similar gains.

There’s also the option of Terraforming the Lambda creation to make deployments easier.

If you’re interested to take Halfpipe for a spin, I’d love to help you get set up.

Drop me a message any time or add some comments below.

Many thanks for reading and see you in the next one.

-Richard