Lambda functions are awesome, but they give you only a single dial for allocating resources: memorySize. The simplicity is refreshing, as functions are complex enough, but AWS really shouldn’t have called it memorySize if it also controls CPU…
At the end of 2018, AWS introduced custom runtimes for Lambda functions, which gave customers a way to run applications written in languages not on the holy list of ‘Official AWS Lambda Runtimes’. That list already includes a plethora of languages: 3 versions of Python, 2 versions of Node, Ruby, Java, Go and .NET Core (that’s a lot of language support).
Security-wise, it’s better to use an official AWS Lambda runtime than to roll your own. After all, why take ownership of something AWS is already doing for you, and for free!
But as plentiful as the official runtime list is, there are always edge cases where you’d want to roll your own custom runtime to support applications written in languages AWS doesn’t provide.
Maybe you absolutely have to use a Haskell component, or you need to migrate a C++ implementation to Lambda. In these cases, a custom runtime lets you leverage the power of serverless functions even when their runtimes are not officially supported.
Bash Custom Runtime
Which brings us to the topic of today’s post, the bash custom runtime.
For Klayers, I needed a way to update a GitHub repo with a new JSON file every week. That can be done in Python, but no Python package came close to the familiarity of git pull, git add and
So rather than try to monkey around with a Python wrapper of git, I decided to use git directly, from a shell script, running in a lambda, on the bash runtime.
So I pulled in the runtime from a GitHub repo I found, and used it to write a lambda function. Simple, right? Well, not entirely: running regular shell scripts is easy, but there are some quirks you’ll have to learn when you run them in a lambda function…
Not so fast there cowboy…
Firstly, the familiar home directory in ~/ is off-limits in a lambda function, and I mean off-limits. There is absolutely no way (that I know of) for you to add files to this directory. That wouldn’t be a big issue, except this is where git looks for ssh keys and the
Next, because lambda functions are ephemeral, you’ll need a way to inject your SSH key into the function, so that it can communicate with GitHub on your behalf.
Finally, because you’ve chosen to use the bash runtime, you’re limited to the awscli utility, which, while fully functional, doesn’t come with the conveniences of a library like boto3 for Python. It’s a lot easier to loop over and parse JSON in Python than it is in bash. Fortunately, jq makes that less painful, and jq is included in the custom runtime :).
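One possible way around the off-limits home directory, sketched in Python rather than bash purely for illustration: point HOME at /tmp, the writable scratch space in a Lambda, and tell git which key to use through the GIT_SSH_COMMAND environment variable. The function name git_env and the key and known_hosts paths are hypothetical, and this is not necessarily how the Klayers function handles it.

```python
import os
import shlex


def git_env(key_path, home="/tmp"):
    """Build an environment for git that avoids the read-only home directory.

    key_path and home are illustrative values, not taken from the post.
    """
    env = dict(os.environ)
    # git and ssh look under $HOME for config and keys, so point it at /tmp.
    env["HOME"] = home
    known_hosts = f"{home}/known_hosts"
    # GIT_SSH_COMMAND is interpreted by the shell, so quote each argument.
    env["GIT_SSH_COMMAND"] = " ".join([
        "ssh",
        "-i", shlex.quote(key_path),
        "-o", "IdentitiesOnly=yes",
        "-o", shlex.quote(f"UserKnownHostsFile={known_hosts}"),
    ])
    return env
```

The returned dictionary could then be passed as the env argument of subprocess.run when calling git, after the key has been written to the chosen path with restrictive permissions.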
Enough talking let’s build this
One of the great things about Lambda functions is that you can’t SSH into them.
This sounds like a drawback, but it’s actually a great security benefit: you can’t hack what you can’t access. Although it’s rare to see SSH used as an entry path for attackers these days, it’s not uncommon to see organizations lose SSH keys every once in a while. So cutting off SSH access does limit the attack surface of the lambda, and the fact that the lambda doesn’t live on a 24/7 server reduces it even further.
Your support engineers might still want to log onto a **server**, but in today’s serverless paradigm, this is unnecessary. After all, logs no longer exist in /var/log, they’re on CloudWatch, and there is no need to change passwords or purge files because the lambdas recycle themselves after a while anyway. Leave those lambda functions alone, will ya!
As a developer, you might want to see what is **in** the lambda function itself, like what binaries are available (and their versions), or what libraries and environment variables are set. For this, it’s far more effective to just log onto a lambci docker container. Amazon works very closely with lambci to ensure their containers match what’s available in a Lambda environment. Just run any of the following:
docker run -ti lambci/lambda:build-python3.7 bash
docker run -ti lambci/lambda:build-python3.6 bash
Lambci provides a corresponding docker container for all AWS runtimes. They even provide a build image for each runtime that comes prepackaged with tools like zip. This is the best way to explore a lambda function in interactive mode, and to build lambda layers on.
But sometimes you’ll find yourself wanting to explore the actual lambda function you ran, like checking if the binary in the lambda layer was packaged correctly, or just seeing if a file was correctly downloaded into /tmp. Local deploy has its limits, and that’s what this post is for.
This is a continuation of the Klayers series, where I deep dive into the architecture of Klayers. At its core, Klayers is a collection of AWS Lambda Layers for Python3, built on the idea that keeping Python packages in layers is more efficient than packaging them with application code.
Visit the GitHub repo here, where you’ll find 50+ lambda layers for public consumption across most AWS regions (including HK and Oman). This post is about how I automated the building of layers inside lambda functions, specifically layers composed of Python packages (e.g. requests, beautifulsoup4, etc.).
Python Packages for Dummies
As a primer, let’s take a look at Python packages in general. Python uses the Python Package Index (or PyPI), which is similar to Maven Central for Java or NPM for Node. It’s simply a repository that hosts the Python packages your application installs.
To do the installing, there is a program called pip. pip isn’t limited to packages from PyPI, and you can use it to install packages from other sources as well, but it and PyPI are the dynamic duo of Python packages.
The problem is that while Python is an interpreted language, there are some components of it that are OS-specific. When you pip install on Windows, you get a different package installation than when you pip install on Ubuntu or OSX. pip detects your OS and installs the files specific to it, and sometimes those files need to be compiled for your OS as well.
Which means, if you wanted to put a Python package into a Lambda Layer, it would need the Amazon Linux version of that package (Ubuntu might be close enough, CentOS is even better), because Lambda functions run on Amazon Linux. And because not many folks run that distribution on their own machines, the general recommendation for creating these lambda layers has always been to use Docker.
But to me, this seemed sub-optimal. After all, we preach the ‘serverless first’ mantra, yet when it comes to building lambda layers, we default to a docker container on a serverful laptop… there must be a serverless way.
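As a rough sketch of what such a build step involves, the snippet below assembles a pip command that installs a package into a python/ folder, which is the folder layout Lambda layers use for Python. The function name, the /tmp/layer build path and the decision to only build (not run) the command are assumptions for illustration, not the actual Klayers build code.

```python
import sys
from pathlib import PurePosixPath


def pip_install_command(package, build_dir="/tmp/layer"):
    """Return a pip command that installs `package` under <build_dir>/python.

    build_dir is a hypothetical scratch location; Lambda layers for Python
    expect their packages inside a top-level `python` folder.
    """
    target = PurePosixPath(build_dir) / "python"
    return [
        sys.executable, "-m", "pip", "install",
        "--target", str(target),
        package,
    ]
```

Running the returned list with subprocess.run on the same OS as the Lambda runtime would produce files compiled for that OS, which is the whole point of building inside a Lambda rather than on a laptop.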
I’ve been bitten by the serverless bug lately, and just completed my latest hobby project this week. It’s a fully serverless pipeline that builds python packages as Lambda layers — and it uses Lambda functions to do so. As a…
Just this week, my team was on the cusp of demo-ing a product they’ve been working on for the last 2 months, only for a build process to fail, just hours before the demo to some very high ranking people….
I’ve been really digging into Lambda Layers lately, and once you begin using layers you’ll wonder how you got by without them.
Layers allow you to package just about anything into lambda, but in a modular way. Elements of your code that don’t change much can be packaged into layers, while your actual lambda deployment holds just the code that’s changing.
It’s akin to the docker cache, where you keep the unchanging elements higher up in your Dockerfile, separate from the code that always changes. The difference, though, is that the docker cache speeds up builds, while layers speed up lambda deployments.
But layers aren’t magic, and they’re still limited by the AWS size limit: your entire function (including all its layers) needs to be no larger than 250MB (unzipped).
Which is tough for something like spaCy, because its default installation size on Amazon Linux is ~400MB (or 492MB based on my quick installation on lambci for python3.7). So, in order to get spaCy working on a lambda, certain tweaks are going to be necessary.
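A quick way to check an installation against that limit is to add up the file sizes on disk. The sketch below does this with the standard library only; the function names are hypothetical, and it treats 1MB as 1024 × 1024 bytes, which is an assumption about how the limit is counted.

```python
import os

LAMBDA_UNZIPPED_LIMIT_MB = 250  # the unzipped size limit mentioned above


def dir_size_mb(path):
    """Sum the sizes of all regular files under `path`, in MB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so a linked file is not counted twice.
            if not os.path.islink(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / (1024 * 1024)


def fits_in_limit(path, limit_mb=LAMBDA_UNZIPPED_LIMIT_MB):
    """True if the directory is no larger than the given limit."""
    return dir_size_mb(path) <= limit_mb
```

Pointing fits_in_limit at the folder a package was installed into gives a fast yes/no before going to the trouble of publishing a layer.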
Some have tried working around this problem by installing spaCy onto the lambda container on cold-start, i.e. pulling the data into lambda only when you have access to the 512MB in /tmp. Cool solution, but it almost completely fills up /tmp, and makes a cold-start even slower.
A more optimal solution would be to reduce the size of the spaCy installation and have it fit into a layer! Fortunately I found a GitHub issue after some googling that enables us to do exactly this.
It involves removing unnecessary language files, which spaCy lazy-loads. If you’re only interested in one language, you can simply remove the unnecessary language files in the
After manually removing all non-English (en) language files, I managed to reduce the size of the spaCy package to 110MB, which fits very nicely into a lambda layer. In the end my lang directory only had the following files:
As a bonus, I also included the English en_core_web_sm-2.1.0 model, to make the lambda layer fully usable on its own.
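The manual pruning described above could also be scripted. The sketch below removes every language sub-folder of a given lang directory except the ones you want to keep, and leaves top-level files and dunder folders alone. The function name and that keep-the-files rule are assumptions about the directory layout, so check them against your installed spaCy version before deleting anything.

```python
import shutil
from pathlib import Path


def prune_languages(lang_dir, keep=("en",)):
    """Delete language sub-folders not listed in `keep`; return removed names.

    Files directly inside lang_dir and folders such as __pycache__ are left
    untouched (an assumption about which entries are shared between languages).
    """
    removed = []
    for entry in sorted(Path(lang_dir).iterdir()):
        is_language_dir = entry.is_dir() and not entry.name.startswith("__")
        if is_language_dir and entry.name not in keep:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed
```

Running it against a copy of the installation first, then importing the package and loading the model, is a cheap way to confirm nothing essential was removed.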
Finally I published it as a publicly available layer, for anyone to consume. One of the amazing things about layers, is that once a layer is made, it can be shared across AWS for anyone to consume.
Recently I found myself working with an S3 bucket of 13,000 csv files that I needed to query. Initially, I was excited, because I now had an excuse to play with AWS Athena or S3 Select, two serverless tools I’d been meaning to dive into.
But that excitement was short-lived!
For some (as yet unexplained) reason, AWS Athena is not available in us-west-1. Which, seemingly, is the only region in the US where Athena is not available!
And…. guess where my bucket was? That’s right, the one region without AWS Athena.
Now I thought there’d be a simple way to copy objects from one bucket to another. After all, copy-and-paste is basic computer functionality, and we have keyboard shortcuts to do this exact thing. But as it turns out, once you have thousands of objects in a bucket, it becomes a slow, painful and downright impossible task to get done sanely.
For one, S3 objects aren’t indexed, so AWS doesn’t have a directory of all the objects in your bucket. You can generate one from the console, but it’s a snapshot of your current inventory rather than a real-time index, and it’s very slow: measured-in-days slow! An alternative is to use the
But there’s a problem with list_bucket as well: it’s sequential (one page at a time), and is limited to ‘just’ 1000 items per request. A full listing of a million objects would require 1000 sequential API calls just to list out the keys in your bucket. Fortunately, I had just 13,000 csv files, so this part was fast, but that’s not the biggest problem!
Once you’ve listed out your bucket, you’re then faced with the monumentally slow task of actually copying the files. The S3 API has no bulk-copy method, and while you can use copy_object for a file of arbitrary size, it only works on one file at a time.
Hence copying 1 million files would require 1 million API calls. These could run in parallel, but it would have been nicer to batch them up like the
So to recap, copying 1 million objects requires 1,001,000 API requests, which can be painfully slow unless you’ve got some proper tooling.
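The arithmetic behind that figure is easy to reproduce. The sketch below uses the numbers from the text (1000 keys per list request, one copy request per object); the function name is made up for illustration.

```python
LIST_PAGE_SIZE = 1000  # keys returned per list request, as described above


def requests_needed(object_count, page_size=LIST_PAGE_SIZE):
    """Return (list_calls, copy_calls, total) for copying a whole bucket."""
    # Ceiling division: a partially filled last page still costs one request.
    list_calls = -(-object_count // page_size)
    # There is no bulk copy, so every object needs its own copy request.
    copy_calls = object_count
    return list_calls, copy_calls, list_calls + copy_calls


print(requests_needed(1_000_000))  # (1000, 1000000, 1001000)
print(requests_needed(13_000))     # (13, 13000, 13013)
```

The list calls are a rounding error next to the copy calls, which is why parallelising the copies matters far more than speeding up the listing.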
AWS recommends using S3DistCp, but I didn’t want to spin up an EMR cluster ‘just’ to handle this relatively simple cut-n-paste problem. Instead I did the terribly impractical thing and built a serverless solution to copy files from one bucket to another, which looks something like this:
The Serverless framework (SF) is a fantastic tool for testing and deploying lambda functions, but its reliance on CloudFormation makes it clumsy for infrastructure like DynamoDB, S3 or SQS queues.
For example, if your serverless.yml file had 5 lambdas, you’d be able to sls deploy all day long. But add just one S3 bucket, and you’d first have to sls remove before you could deploy again. This different behavior in the framework once you introduce ‘infra’ is clumsy. Sometimes I use deploy to add functions without wanting to remove existing resources.
Terraform, though, keeps the state of your infrastructure, and can apply only the changes. It also has powerful commands like taint, which can re-deploy a single piece of infrastructure, for instance to wipe a DynamoDB table clean.
In this post, I’ll show how I got Terraform and Serverless to work together in deploying an application, using both frameworks’ strengths to complement each other.
**From here on, I’ll refer to the Serverless Framework as SF, to avoid confusing it with the actual term serverless.
Terraform and Serverless sitting on a tree
First some principles:
- Use SF for Lambda & API Gateway
- Use Terraform for everything else.
- Use a tfvars file for Terraform variables
- Use JSON for the tfvars file
- Terraform deploys first followed by SF
- Terraform will not depend on any output from SF
- SF may depend on output from terraform
- Use SSM Parameter Store to capture Terraform outputs
- Import inputs into Serverless from SSM Parameter Store
- Use workspaces in Terraform to manage different environments.
- Use stages in Serverless to manage different environments.
In the end the deployment will look like this:
First a definition.
A lambda function is a service provided by AWS that runs code for you without introducing the complexity of provisioning servers or managing operating systems. It belongs to a category of architectures called serverless architectures.
There’s a whole slew of folks trying to define what serverless is, but my favorite definition is this.
“Serverless means No Server Ops” (Joe Emison)
They’re the final frontier of compute, where the idea is that developers just write code, while allowing AWS (or Google/MSFT) to take care of everything else. This includes H/W management, OS Patching, even application level maintenance like Webserver upgrades are not your problem anymore with serverless.
Nothing runs on fairy-dust though. Serverless still has servers, but in this world those servers, their operating systems, and the underlying runtime (e.g. Python, Node, JVM) are fully managed services that you pay for per use.
As a developer you write some code into a function. Upload that function to AWS — and now you can invoke this function over and over again without worrying about servers, operating systems or run-time.
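For a sense of how little code that can be, here is a minimal sketch of a Python function in the shape Lambda expects: a handler that receives an event and a context object. The name field in the event and the greeting it returns are made up for illustration.

```python
def handler(event, context):
    """Entry point that Lambda calls on every invocation.

    `event` carries the invocation payload; `context` holds runtime
    metadata and is unused in this sketch.
    """
    # "name" is a hypothetical field in the incoming payload.
    name = event.get("name", "world")
    return {"message": f"Hello, {name}!"}
```

Everything outside this function (the server it runs on, the operating system, the Python interpreter) is the part AWS manages for you.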
But how does AWS achieve this?
Before we can understand how to secure a serverless function, we need to at least have a fair understanding of how Serverless functions (like AWS Lambda) work.
So how does a lambda function work?