All posts filed under “serverless”

Lambda functions in a VPC

In my honest (and truly humble) opinion, VPCs don’t make much sense in a serverless architecture. It’s not that they don’t add value, it’s that the value they add isn’t worth the complexity you incur.

After all, you can’t log into a Lambda function, and no inbound connections are allowed. It isn’t a persistent environment either: some functions may time out after just 2-3 seconds. Sure, network-level security is still a worthy pursuit, but for serverless, tightly managing IAM roles and checking your software supply chain for vulnerabilities would be better value for your money.

But if you’ve got a fleet of EC2s already deployed in a VPC, and your Lambda function needs to access them, then you have no choice but to deploy that function in a VPC as well. Or, if your org requires full network logging of all your workloads, then you’ll also need a VPC (and its flow logs) to comply with such requests.

Don’t get me wrong, there is value in having your functions in a VPC, just probably not as much as you think.

Put that aside though, let’s dive into the wonderful world of Lambda functions and VPCs.

Working Example

First, imagine we deploy a simple VPC with 4 subnets.

  1. A Public Subnet with a NAT Gateway inside it.
  2. A Private Subnet which routes all traffic through that NAT Gateway
  3. A Private Subnet without internet (only local routing)
  4. A Private Subnet without internet but with an SSM VPCe inside it

Let’s label these subnets (1), (2), (3) and (4) for simplicity.

Now we write some Lambda functions, and deploy one to each subnet. The functions have an attached security group that allows all outgoing connections, and similarly each subnet has very liberal NACLs that allow incoming and outgoing connections.

Then we create a gateway S3 VPC-endpoint (VPCe), and route subnet (4) to it.

Finally, we enable private DNS on the entire VPC. And then outside the subnets we create a bucket and a Systems Manager Parameter Store parameter (AWS really needs better terms for these things).
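One way to see what each subnet can actually reach is to deploy the same small probe function into all four and compare the results. The sketch below is only an assumption about how such a probe could look, not the code used for this post: the event keys BUCKET_NAME and PARAM_NAME are hypothetical, and the short timeouts are there so that an unreachable service fails fast instead of running the function into its own timeout.

```python
import time


def probe(checks):
    """Run each zero-argument check and record whether it worked and how long it took."""
    results = {}
    for name, check in checks.items():
        started = time.monotonic()
        try:
            check()
            outcome = "ok"
        except Exception as exc:  # timeouts, DNS failures, access errors...
            outcome = "failed: " + type(exc).__name__
        results[name] = {
            "outcome": outcome,
            "seconds": round(time.monotonic() - started, 2),
        }
    return results


def handler(event, context):
    # boto3 is imported here so probe() can be tried locally without it.
    import boto3
    from botocore.config import Config

    # Short timeouts and no retries: a blocked route should fail quickly.
    cfg = Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 0})
    s3 = boto3.client("s3", config=cfg)
    ssm = boto3.client("ssm", config=cfg)

    return probe({
        # BUCKET_NAME and PARAM_NAME are hypothetical keys in the test event.
        "s3": lambda: s3.list_objects_v2(Bucket=event["BUCKET_NAME"], MaxKeys=1),
        "ssm": lambda: ssm.get_parameter(Name=event["PARAM_NAME"]),
    })
```

Invoking the same handler from subnets (1) to (4) gives a small table of "ok" and "failed" entries per service, which is easier to reason about than staring at route tables.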

The final network looks like this:

Amazon KMS: Intro

Amazon KMS is one of the most integrated AWS services, but probably also the least understood. Most developers know about it, and what it can do, but never really fully realize the potential of the service. So here’s a rundown…

Keith @ PyconSG 2019

Had a blast at PyConSG 2019, really cool to be in the presence of so many pythonistas. Would definitely recommend, especially since python is one of the more broadly used languages (AI, Blockchain, RPA, etc). My talk was on AWS…

Cloud Run — is it the ultimate Fat lambda?

Everyone knows that I’m a Lambda fanboy, and to be fair Lambda deserves all the praise it gets, it is **the** gold-standard for serverless functions. But yesterday, I gave Google Cloud Run a spin, and boy(!) is Lambda going to get a run for its money.

Which is surprising given Google has traditionally lagged in this area — isn’t it quaint that we use words like ‘traditional’ in the serverless world!

But I digress.

The Lambda equivalent in the Google world is Google Cloud Functions, which is (generously speaking) what Lambda was 2 years ago: pretty boring. The only advantage I saw it having over Lambda was the ability to build Python packages natively from the requirements.txt file. But that incurred a build during deploy, which in turn had a limit.

And while it did allow for a larger package size (double what AWS Lambda offers), it was far more complex to understand. Just looking at its limits and pricing models can make you dizzy.

In short, Google Cloud Functions lacked the simplicity of Lambda, with little benefit for incurring all that additional complexity.

But Cloud Run is something else. It’s still more complex than lambda, but here the trade-off seems worth it. So let’s take a peek at Google’s new serverless Golden Boy!

Containers vs. Functions

In Lambda the atomic unit of compute is the function, which for an interpreted language like Python is just plaintext code uploaded to AWS. But in Cloud Run the atomic unit is the container — and that can be a container for just the one function, or the container for the entire app itself — with all the routing logic embedded within it.
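To make "routing logic embedded within it" concrete, here is a minimal sketch of an app-shaped container entrypoint, using only the Python standard library. The route names and handlers are made up for illustration. The one Cloud Run detail it relies on is that the container is told which port to listen on through the PORT environment variable.

```python
import os
from wsgiref.simple_server import make_server


def hello(environ):
    return "200 OK", "hello from the app\n"


def health(environ):
    return "200 OK", "ok\n"


# The routing table lives inside the container, not in front of it.
ROUTES = {"/": hello, "/health": health}


def app(environ, start_response):
    """A tiny WSGI app: look up the request path and dispatch to a handler."""
    route = ROUTES.get(environ.get("PATH_INFO", "/"))
    status, body = route(environ) if route else ("404 Not Found", "not found\n")
    start_response(status, [("Content-Type", "text/plain")])
    return [body.encode("utf-8")]


def serve():
    # Listen on the port the platform hands us, defaulting to 8080 locally.
    port = int(os.environ.get("PORT", "8080"))
    make_server("0.0.0.0", port, app).serve_forever()
```

Calling serve() as the container's start command gives you one deployable unit that answers several paths, where the function model would usually give each path its own function.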

Now why would you need apps for the serverless world?! you ask indignantly. Aren’t these all supposed to be function based?

Well actually lots of people have legacy code written at the application level, and re-writing an entire application takes a long time, and very rarely succeeds on the first try.

Multiprocessing in Lambda Functions

Lambda functions are awesome, but they only provide a single dimension to allocate resources – memorySize. The simplicity is refreshing, as lambda functions are complex enough — but AWS really shouldn’t have called it memorySize if it controls CPU as well.

Then again this is the company that gave us Systems Manager Session Manager, so the naming could have been worse (much worse!).

Anyway… I digress.

The memorySize of your lambda function allocates both memory and CPU in proportion, i.e. twice as much memory gives you twice as much CPU.

The smallest lambda can start with a minimum of 128MB of memory, which you can increment in steps of 64MB, all the way to 3008MB (just shy of 3GB).

So far, nothing special.

But at 1792MB, something wonderful happens: you get one full vCPU. This is Gospel truth in lambda-land, because AWS documentation says so. In short, a 1792MB lambda function gets 1 vCPU, and a 128MB lambda function gets ~7% of that (since 128MB is roughly 7% of 1792MB).

Using maths, we realize that at 3008MB, our lambda function is allocated 167% of vCPU.
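The maths is just a ratio against the 1792MB mark. A quick sketch, assuming CPU really does scale linearly with memorySize as described above (the function name is mine, not an AWS API):

```python
FULL_VCPU_MB = 1792  # memorySize at which you get one full vCPU


def vcpu_share(memory_mb):
    """Fraction of a vCPU, assuming CPU scales linearly with memorySize."""
    return memory_mb / FULL_VCPU_MB


for mb in (128, 1792, 3008):
    print(f"{mb}MB -> {vcpu_share(mb):.1%} of a vCPU")
# 128MB -> 7.1% of a vCPU
# 1792MB -> 100.0% of a vCPU
# 3008MB -> 167.9% of a vCPU
```

The 167% figure is that last number with the decimals dropped.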

But what does that 167% vCPU mean?!

I can rationalize anything up to 100%, after all getting 50% vCPU simply means you get the CPU for 50% of the time, and that makes sense up to 100%, but after that things get a bit wonky.

After all, what does having 120% vCPU mean: do you get 1 full core plus 20% of another? Or do you get 60% of two cores?

Updating a GitHub repo from a Lambda Function using Bash!

At the end of 2018, AWS introduced custom runtimes for Lambda functions, which gave customers a way to run applications written in languages not in the holy list of the ‘Official AWS Lambda Runtimes’. That list includes a plethora of languages: 3 versions of Python, 2 versions of Node, Ruby, Java, Go and .NET Core (that’s a lot of language support).

Security-wise, it’s better to use an Official AWS Lambda runtime than it is to roll your own. After all, why take ownership of something AWS is already doing for you, and for free!

But as plentiful as the official runtime list is, there are always edge cases where you’d want to roll your own custom runtime to support applications written in languages AWS doesn’t provide.

Maybe you absolutely have to use a Haskell component, or you need to migrate a C++ implementation to Lambda. In these cases, a custom runtime allows you to leverage the power of serverless functions even when their runtimes are not officially supported.
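Under the hood, a custom runtime is just a program that loops against the Lambda Runtime API: fetch the next invocation, run your code, post the result back. The sketch below shows that loop in Python purely for illustration (a real custom runtime would be written in whatever language you are bringing along). The handle function is a hypothetical stand-in for your own code, and error reporting is left out to keep it short.

```python
import json
import os
import urllib.request

API_VERSION = "2018-06-01"


def next_url(api):
    return f"http://{api}/{API_VERSION}/runtime/invocation/next"


def response_url(api, request_id):
    return f"http://{api}/{API_VERSION}/runtime/invocation/{request_id}/response"


def handle(event):
    # Hypothetical handler: echo the event back.
    return {"echo": event}


def run_forever():
    api = os.environ["AWS_LAMBDA_RUNTIME_API"]  # host:port, set by Lambda
    while True:
        # Long-poll until Lambda has an invocation for us.
        with urllib.request.urlopen(next_url(api)) as resp:
            request_id = resp.headers["Lambda-Runtime-Aws-Request-Id"]
            event = json.loads(resp.read())
        body = json.dumps(handle(event)).encode("utf-8")
        # Post the result back against the same request id.
        req = urllib.request.Request(
            response_url(api, request_id), data=body, method="POST"
        )
        urllib.request.urlopen(req).close()
```

In a custom runtime this loop lives in an executable called bootstrap, which Lambda starts for you.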

Bash Custom Runtime

Which brings us to the topic of today’s post, the bash custom runtime.

For Klayers, I needed a way to update a GitHub repo with a new json file every week. That can be done in Python, but no Python package came close to the familiarity of git pull, git add and git commit.

So rather than try to monkey around a python-wrapper of git, I decided to use git directly — from a shell script — running in a lambda — on the bash runtime.

So I pulled in the runtime from a GitHub repo I found, and used it to write a lambda function. Simple, right? Well, not entirely: running regular shell scripts is easy, but there are some quirks you’ll have to learn when you run them in a lambda function…

Not so fast there cowboy…

Firstly, the familiar home directory in ~/ is off-limits in a lambda function, and I mean off-limits. There is absolutely no way (that I know of) for you to add files into this directory. Wouldn’t be a big issue, except this is where git looks for ssh keys and the known_hosts file.

Next, because lambda functions are ephemeral, you’ll need a way to inject your SSH key into the function, so that it can communicate with GitHub on your behalf.

Finally, because you’ve chosen to use the bash runtime, you’re limited to the awscli utility, which while fully functional doesn’t come with the same conveniences as boto3 for Python. It’s a lot easier to loop and parse json in Python than it is in bash. Fortunately, jq makes that less painful, and jq is included in the custom runtime :).
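For the first quirk, one common workaround (not necessarily the one this post ends up using) is git's GIT_SSH_COMMAND environment variable, which lets you point ssh at a key and a known_hosts file somewhere writable such as /tmp. It is sketched in Python here, but the same variable can be exported from a bash script. The paths are assumptions for illustration.

```python
import os
import shlex


def git_ssh_env(key_path, known_hosts_path, base_env=None):
    """Return an environment that makes git use an SSH key and known_hosts outside ~/."""
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_SSH_COMMAND"] = " ".join([
        "ssh",
        "-i", shlex.quote(key_path),
        "-o", "IdentitiesOnly=yes",
        "-o", shlex.quote(f"UserKnownHostsFile={known_hosts_path}"),
    ])
    return env


# Hypothetical usage, with the key already written to /tmp and chmod 600:
#   env = git_ssh_env("/tmp/id_rsa", "/tmp/known_hosts")
#   subprocess.run(["git", "clone", repo_url, "/tmp/repo"], env=env, check=True)
```

ssh refuses private keys with loose permissions, so whatever writes the key into /tmp needs to restrict it to the owner first.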

Enough talking, let’s build this

Interactive Shell on a Lambda Function

One of the great things about Lambda functions is that you can’t SSH into them.

This sounds like a drawback, but actually it’s a great security benefit: you can’t hack what you can’t access. Although it’s rare to see SSH used as an entry path for attackers these days, it’s not uncommon to see organizations lose SSH keys every once in a while. So cutting down SSH access does limit the attack surface of the lambda, and the fact that the lambda doesn’t exist on a 24/7 server helps reduce that even further.

Your support engineers might still want to log onto a **server**, but in today’s serverless paradigm, this is unnecessary. After all, logs no longer exist in /var/log, they’re on CloudWatch, and there is no need to change passwords or purge files because the lambdas recycle themselves after a while anyway. Leave those lambda functions alone, will ya!

As a developer, you might want to see what is **in** the lambda function itself, like what binaries are available (and their versions), or what libraries and environment variables are set. For this, it’s far more effective to just log onto a lambci docker container. Amazon work very closely with lambci to ensure their container matches what’s available in a Lambda environment. Just run any of the following:

  • docker run -ti lambci/lambda:build-python3.7 bash
  • docker run -ti lambci/lambda:build-python3.6 bash

Lambci provide a corresponding docker container for all AWS runtimes, they even provide a build image for each runtime, that comes prepackaged with tools like bash, gcc-c++, git and zip. This is the best way to explore a lambda function in interactive mode, and build lambda layers on.

But sometimes you’ll find yourself wanting to explore the actual lambda function you ran, like checking if the binary in the lambda layer was packaged correctly, or just seeing if a file was correctly downloaded into /tmp. Local deploy has its limits, and that’s what this post is for.
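Short of a full interactive shell, the crudest way to answer those two questions is a throwaway handler that lists what is actually on disk. This is only a minimal sketch and not the interactive approach this post is about. It assumes layers are unpacked under /opt and that /tmp is your scratch space.

```python
import os


def list_tree(root, max_entries=200):
    """Return up to max_entries file paths under root, relative to root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)[:max_entries]


def handler(event, context):
    # /opt is where layers are unpacked; /tmp is the writable scratch space.
    return {root: list_tree(root) for root in ("/opt", "/tmp")}
```

A missing directory simply comes back as an empty list, so the handler is safe to run on a function with no layers attached.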

Klayers Part 1: Building Lambda Layers with Lambda Functions

This is a continuation in the Klayers series, where I deep dive into the architecture of Klayers. At its core, Klayers is a collection of AWS Lambda Layers for Python3, with the idea that putting Python packages in layers is more efficient than packaging them with application code.

Visit the GitHub repo here, where you’ll find 50+ lambda layers for public consumption across most AWS regions (including HK and Bahrain). This post is about how I automated the building of layers inside lambda functions, specifically layers composed of Python packages (e.g. requests, beautifulsoup4, etc).

Python Packages for Dummies

As a primer, let’s take a look at Python packages in general. Python utilizes the Python Package Index (or PyPI), which is similar to Maven for Java or NPM for Node. It’s simply a repository that hosts the Python packages you install for your application.

In order to help with this, there is a program called pip that helps with the installation of Python packages. While pip isn’t limited to packages from PyPI (you can use it to install packages from other sources as well), it and PyPI are the dynamic duo of Python packages.

The problem is that while Python is an interpreted language, there are some components of it that are OS specific. When you pip install into Windows, you get a different package installation than when you pip install into Ubuntu or OSX. pip detects your OS and installs specific files for your specific platform, and sometimes those files need to be compiled for your OS as well.
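You can see this OS-specific split in the names of the wheel files pip downloads: the last three dash-separated fields are the Python tag, the ABI tag and the platform tag, and a platform of "any" means pure Python. A small sketch follows. The two filenames in the usage note are only illustrative examples of each kind.

```python
def wheel_tags(filename):
    """Split a wheel filename into its python, ABI and platform tags."""
    stem = filename[:-len(".whl")]
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return {"python": python_tag, "abi": abi_tag, "platform": platform_tag}


def is_os_specific(filename):
    """A platform tag of 'any' means the wheel is pure Python."""
    return wheel_tags(filename)["platform"] != "any"
```

So requests-2.22.0-py2.py3-none-any.whl installs the same files everywhere, while numpy-1.17.0-cp37-cp37m-manylinux1_x86_64.whl is built for CPython 3.7 on 64-bit Linux and nothing else.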

Which means, if you wanted to put a Python Package into a Lambda Layer, it would need the Amazon Linux version of that Python Package (Ubuntu might be close enough, CentOS is even better), because Lambda functions run on Amazon Linux. And because not many folks run Linux as their main OS, the general recommendation for creating these lambda layers has always been to use Docker.

It’s very easy to use a docker container based on lambci/lambda:build-python3.7, to build python packages for lambda. In fact I even have a script that does that here.

But to me, this seemed sub-optimal. After all, we preach the ‘serverless first’ mantra, yet when it comes to building lambda layers — we default to a docker container on a serverful laptop …. there must be a serverless way.

Copy Millions of S3 Objects in minutes

Recently I found myself working with an S3 bucket of 13,000 csv files that I needed to query. Initially, I was excited, because I now had an excuse to play with AWS Athena or S3 Select, two serverless tools I’d been meaning to dive into.

But that excitement — was short-lived!

For some (as yet unexplained) reason, AWS Athena is not available in us-west-1. Which, seemingly, is the only region in the US where Athena is not available!

And…. guess where my bucket was? That’s right, the one region without AWS Athena.

Now I thought there’d be a simple way to copy objects from one bucket to another. After all, copy-and-paste is basic computer functionality, we have keyboard shortcuts to do this exact thing. But as it turns out, once you have thousands of objects in a bucket, it becomes a slow, painful and downright impossible task to get done sanely.

For one, S3 objects aren’t indexed, so AWS doesn’t have a directory of all the objects in your bucket. You can request one from the console, but it’s a snapshot of your current inventory rather than a real-time updated index, and it’s very slow (measured-in-days slow!). An alternative is to use the list_bucket method.

But there’s a problem with list_bucket as well: it’s sequential (one page at a time), and is limited to ‘just’ 1000 items per request. A full listing of a million objects would require 1000 sequential API calls just to list out the keys in your bucket. Fortunately, I had just 13,000 csv files, so this part was fast, but that’s not the biggest problem!
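Here is roughly what that listing looks like in boto3, where the call is named list_objects_v2 and a paginator follows the continuation tokens for you. It is a minimal sketch: the bucket name is whatever you pass in, and the helper names are mine.

```python
import math

PAGE_SIZE = 1000  # maximum keys S3 returns per list request


def list_calls_needed(object_count, page_size=PAGE_SIZE):
    """How many sequential list requests a full listing takes."""
    return math.ceil(object_count / page_size)


def iter_keys(s3_client, bucket):
    """Yield every key in the bucket, one page of up to 1000 keys at a time."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # An empty bucket (or an empty page) has no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

list_calls_needed(1_000_000) comes out at 1000 and list_calls_needed(13_000) at 13, and each page can only be requested once the previous one has returned its continuation token, which is why this part can't simply be fanned out.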

Once you’ve listed out your bucket, you’re then faced with the monumentally slow task of actually copying the files. The S3 API has no bulk-copy method, and while you can use copy_object on a file of almost any size (a single call handles objects up to 5GB), it only works on one file at a time.

Hence copying 1 million files would require 1 million API calls, which could be made in parallel, but it would have been nicer to batch them up like the delete_keys method.

So to recap, copying 1 million objects, requires 1,001,000 API requests, which can be painfully slow, unless you’ve got some proper tooling.
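Since every copy is its own API call, the only lever left is how many of them you run at once. The sketch below does that from a single machine with a thread pool. It is a deliberately simple illustration, not the serverless design this post goes on to describe, and the worker count of 32 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_one(s3_client, src_bucket, dst_bucket, key):
    """Server-side copy of a single object, keeping the same key."""
    s3_client.copy_object(
        Bucket=dst_bucket,
        Key=key,
        CopySource={"Bucket": src_bucket, "Key": key},
    )
    return key


def copy_all(s3_client, src_bucket, dst_bucket, keys, workers=32):
    """Copy every key using a pool of threads; returns the keys copied, in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda k: copy_one(s3_client, src_bucket, dst_bucket, k), keys)
        )
```

The data itself never passes through your machine (S3 copies it internally), so the threads spend their time waiting on the network rather than moving bytes.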

AWS recommend using S3DistCp, but I didn’t want to spin up an EMR cluster ‘just’ to handle this relatively simple cut-n-paste problem. Instead I did the terribly impractical thing and built a serverless solution to copy files from one bucket to another, which looks something like this: