All posts filed under “Misc”

Just a collection of stuff (mostly from my old blog)


spaCy in a Lambda

I’ve been really digging into Lambda Layers lately, and once you begin using layers you’ll wonder how you got by without them.

Layers allow you to package just about anything into a Lambda function, but in a modular way. Elements of your code that don’t change much can be packaged into layers, keeping your actual Lambda deployment to just the code that’s changing.

It’s akin to the Docker cache, where you keep the unchanging elements higher up in your Dockerfile, separate from the code that always changes. The difference, though, is that the Docker cache speeds up builds, while layers speed up Lambda deployments.

But layers aren’t magic, and they’re still bound by the AWS size limit: your entire function (including all its layers) needs to be no larger than 250MB (unzipped).

That’s tough for something like spaCy, because its default installation size on Amazon Linux is ~400MB (492MB based on my quick installation on lambci for python3.7). So, in order to get spaCy working on a Lambda, certain tweaks are necessary.

Some have tried working around this problem by installing spaCy onto the Lambda container on cold start, i.e. pulling the data down only once you have access to the 512MB in /tmp. It’s a cool solution, but it almost completely fills up /tmp and makes a cold start even slower.

A better solution is to reduce the size of the spaCy installation so it fits into a layer. Fortunately, after some googling, I found a GitHub issue that lets us do exactly this.

It involves removing unnecessary language files, which spaCy lazy loads. If you’re only interested in one language, you can simply remove the other language files from the site-packages/spacy/lang directory.

After manually removing all non-English (en) language files, I managed to reduce the size of the spaCy package to 110MB, which fits very nicely into a Lambda layer. In the end, my lang directory contained only the English files.
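The pruning itself is just deleting directories. Here’s a minimal sketch of the idea, assuming spaCy was pip-installed into a local python/ folder destined for a layer; the paths and keep-list are assumptions, not the exact script I used:

```python
# prune_spacy_lang.py - sketch: keep only the English language data in spacy/lang
# Assumes spaCy was installed into ./python/lib/python3.7/site-packages
# (the directory layout a Lambda layer expects); adjust paths for your own build.
import shutil
from pathlib import Path

LANG_DIR = Path("python/lib/python3.7/site-packages/spacy/lang")
KEEP = {"en", "__init__.py", "__pycache__"}  # assumed keep-list

for entry in LANG_DIR.iterdir():
    if entry.name in KEEP:
        continue
    if entry.is_dir():
        shutil.rmtree(entry)   # drop a whole language package, e.g. de/, fr/
    else:
        entry.unlink()         # drop stray single-file language modules

print("Remaining:", sorted(p.name for p in LANG_DIR.iterdir()))
```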

As a bonus, I also included the English en_core_web_sm-2.1.0 model, to make the Lambda layer fully usable on its own.

Finally, I published it as a publicly available layer. One of the amazing things about layers is that once a layer is made, it can be shared across AWS accounts for anyone to consume.
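Consuming a shared layer is just a matter of referencing its ARN from your function. For illustration, here’s a hedged boto3 sketch of attaching a layer to an existing function; the function name and layer ARN are placeholders, not the ARN of the published layer:

```python
# attach_layer.py - sketch: attach a (shared) layer to an existing Lambda function
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="my-nlp-function",  # hypothetical function name
    Layers=[
        # placeholder ARN, not the actual published spaCy layer
        "arn:aws:lambda:us-east-1:123456789012:layer:spacy-en:1",
    ],
)
```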


Copy Millions of S3 Objects in minutes

Recently I found myself working with an S3 bucket of 13,000 CSV files that I needed to query. Initially, I was excited, because now I had an excuse to play with AWS Athena or S3 Select, two serverless tools I’d been meaning to dive into.

But that excitement was short-lived!

For some (as yet unexplained) reason, AWS Athena is not available in us-west-1, which is seemingly the only US region Athena isn’t available in!

And… guess where my bucket was? That’s right: the one region without AWS Athena.

Now, I thought there’d be a simple way to copy objects from one bucket to another. After all, copy-and-paste is basic computer functionality; we have keyboard shortcuts for this exact thing. But as it turns out, once you have thousands of objects in a bucket, it becomes a slow, painful task to get done sanely.

For one, S3 objects aren’t indexed, so AWS doesn’t keep a ready-made directory of all the objects in your bucket. You can generate an inventory from the console, but it’s a snapshot of your bucket rather than a real-time index, and it’s very slow: measured-in-days slow! The alternative is to use the list_bucket method.

But there’s a problem with list_bucket as well: it’s sequential (one request at a time) and limited to ‘just’ 1,000 items per request. A full listing of a million objects would require 1,000 sequential API calls just to list out the keys in your bucket. Fortunately, I had just 13,000 CSV files, so this part was fast, but that’s not the biggest problem!
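In boto3 that listing call is list_objects_v2, which returns at most 1,000 keys per request; a paginator handles the sequential paging for you. A minimal sketch, with the bucket name assumed:

```python
# list_keys.py - sketch: list every key in a bucket, 1,000 keys per API call
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

keys = []
for page in paginator.paginate(Bucket="my-source-bucket"):  # hypothetical bucket
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

print(f"Found {len(keys)} objects")
```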

Once you’ve listed out your bucket, you’re then faced with the monumentally slow task of actually copying the files. The S3 API has no bulk-copy method, and while copy_object handles files of (almost) any size, it only works on one file at a time.

Hence copying 1 million files would require 1 million API calls. These could be made in parallel, but it would have been nicer to batch them up, the way the delete_keys method batches deletes.

So to recap, copying 1 million objects requires 1,001,000 API requests (1,000 to list the keys plus 1,000,000 to copy them), which is painfully slow unless you’ve got some proper tooling.
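Those copy calls can at least be issued concurrently. Here’s a hedged sketch of copying a list of keys with a thread pool; the bucket names, worker count and key list are assumptions:

```python
# parallel_copy.py - sketch: issue copy_object calls concurrently with threads
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
SRC, DST = "my-source-bucket", "my-destination-bucket"  # hypothetical buckets

def copy_one(key):
    # Server-side copy: the bytes never leave S3, but it's still one API call per object
    s3.copy_object(Bucket=DST, Key=key, CopySource={"Bucket": SRC, "Key": key})
    return key

keys = ["file-0001.csv", "file-0002.csv"]  # in practice, the listing from the paginator
with ThreadPoolExecutor(max_workers=50) as pool:
    for done in pool.map(copy_one, keys):
        print("copied", done)
```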

AWS recommends using S3DistCp, but I didn’t want to spin up an EMR cluster ‘just’ to handle this relatively simple cut-and-paste problem. Instead, I did the terribly impractical thing and built a serverless solution to copy files from one bucket to another, which looks something like this:
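The original diagram isn’t reproduced here, but the gist is a fan-out: slice the key listing into batches and hand each batch to a Lambda worker that does the copy_object calls. A rough sketch of that idea (the worker function name, batch size and payload shape are assumptions, not the exact design):

```python
# fan_out_copy.py - sketch: invoke one Lambda worker per batch of keys
import json
import boto3

lambda_client = boto3.client("lambda")
BATCH_SIZE = 1000  # assumed batch size

def fan_out(keys, src, dst):
    for i in range(0, len(keys), BATCH_SIZE):
        payload = {"keys": keys[i:i + BATCH_SIZE], "src": src, "dst": dst}
        # "Event" = asynchronous invocation, so all batches run in parallel
        lambda_client.invoke(
            FunctionName="s3-copy-worker",  # hypothetical worker function
            InvocationType="Event",
            Payload=json.dumps(payload),
        )

# The worker itself would loop over event["keys"], calling s3.copy_object(...) for each.
```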


Using Terraform and Serverless Framework


The Serverless Framework (SF) is a fantastic tool for testing and deploying Lambda functions, but its reliance on CloudFormation makes it clumsy for infrastructure like DynamoDB tables, S3 buckets or SQS queues.

For example, if your serverless.yml file had 5 Lambdas, you’d be able to sls deploy all day long. But add just one S3 bucket, and you’d first have to sls remove before you could deploy again. This change in the framework’s behavior once you introduce ‘infra’ is clumsy; sometimes I want to deploy new functions without removing existing resources.

Terraform, though, keeps the state of your infrastructure and can apply only the changes. It also has powerful commands like taint, which can re-deploy a single piece of infrastructure, for instance to wipe a DynamoDB table clean.

In this post, I’ll show how I got Terraform and Serverless to work together in deploying an application, using both frameworks’ strengths to complement each other.

**From here on, I’ll refer to the Serverless Framework tool as SF, to avoid confusing it with the general term serverless.

Terraform and Serverless sitting in a tree

First some principles:

  • Use SF for Lambda & API Gateway
  • Use Terraform for everything else.
  • Use a tfvars file for Terraform variables
  • Use JSON for the tfvars file
  • Terraform deploys first followed by SF
  • Terraform will not depend on any output from SF
  • SF may depend on output from Terraform
  • Use SSM Parameter Store to capture Terraform outputs
  • Import inputs into Serverless from SSM Parameter Store (see the sketch after this list)
  • Use workspaces in Terraform to manage different environments.
  • Use stages in Serverless to manage different environments.
  • stage.name == workspace.name
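The handoff between the two tools is just SSM Parameter Store: Terraform writes its outputs as parameters (typically via aws_ssm_parameter resources), and SF reads them back at deploy time with its ${ssm:...} variable syntax. As an illustration only (not the actual project code), here’s a boto3 sketch of reading one such value; the parameter name is made up:

```python
# read_tf_output.py - sketch: read a value Terraform wrote into SSM Parameter Store
# The parameter name is a hypothetical example of the Terraform -> SF handoff.
import boto3

ssm = boto3.client("ssm")

response = ssm.get_parameter(Name="/myapp/dev/dynamodb_table_name")
table_name = response["Parameter"]["Value"]
print("Terraform-provisioned table:", table_name)
```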

In the end the deployment will look like this:


Securing Lambda Functions

First a definition.

A Lambda function is a service provided by AWS that runs code for you without introducing the complexity of provisioning servers or managing operating systems. It belongs to a category of architectures called serverless architectures.

There’s a whole slew of folks trying to define what serverless is, but my favorite definition is this:

Serverless means No Server Ops

Joe Emison

They’re the final frontier of compute, where the idea is that developers just write code, while allowing AWS (or Google/MSFT) to take care of everything else. Hardware management, OS patching, even application-level maintenance like web-server upgrades are no longer your problem with serverless.

Nothing runs on fairy dust though; serverless still has servers, but in this world those servers, their operating systems, and the underlying runtime (e.g. Python, Node, the JVM) are fully managed services that you pay for per use.

As a developer, you write some code into a function and upload that function to AWS. Now you can invoke this function over and over again without worrying about servers, operating systems or runtimes.
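For context, the “function” really is just a function. A minimal (hypothetical) Python handler looks something like this:

```python
# handler.py - a minimal Lambda function: AWS calls handler() on every invocation
def handler(event, context):
    # `event` carries the invocation payload; `context` carries runtime metadata
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"hello {name}"}
```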

But how does AWS achieve this?

Before we can understand how to secure a serverless function, we need to at least have a fair understanding of how Serverless functions (like AWS Lambda) work.

So how does a lambda function work?


Android TV boxes

Android TV boxes are computers that stream content from the internet onto your TV. The difference between them and your smartphone is that they have an HDMI connector for your TV, and they usually come pre-loaded with software to illegally stream content.

While the boxes themselves are general-purpose computers running Android (the most popular OS today), the real focus of any regulation should be on the software on the device and the internet-based streaming services that support them.

Which seems to be the case…

Today, TheStar reports that the MCMC will begin blocking these unauthorized streaming services, rendering the boxes that connect to them useless.

But if the MCMC uses its usual method of DNS filtering to implement the block, it’ll be trivial for most folks to circumvent; the boxes run Android after all. The government will very quickly find itself in a cat-and-mouse game trying to block them.


2018 in Review

I started the year building out govScan.info, a site that audits .gov.my websites for TLS implementation. Overall I curated a list of ~5000 Malaysian government domains through various OSINT and enumeration techniques and now use that list to scan them…


Introducing potassium-40

Over the past few weeks, I’ve been toying with Lambda functions and thinking about using them for more than just APIs. I think people miss the most interesting aspect of serverless functions, namely their massively parallel capability, which can do a lot more than just run APIs or respond to events.

There are two ways AWS lets you run Lambdas: either by triggering them from some event (e.g. a new file in an S3 bucket) or by invoking them directly from code. Invoking is a game-changer, because you can write code that offloads processing to a Lambda function directly from your code. Lambda is a giant machine with huge potential.

What could you do with a 1000-core, 3TB machine, connected to an unlimited amount of bandwidth and a large number of IP addresses?

Here’s my answer. It’s called potassium-40 (I’ll explain the name later).

So what is potassium-40?

Potassium-40 is an application-level scanner that’s built for speed. It uses parallel Lambda functions to do HTTP scans across many domains.

Currently it does just one thing: grab the robots.txt from all domains in the Cisco Umbrella 1 Million, and store the data in a text file for download. (I only grab legitimate robots.txt files, and won’t store 404 HTML pages, etc.)

This isn’t a port scanner like nmap or masscan. It’s not just checking the status of a port; it actually creates a TCP connection to the domain and performs all the required handshakes in order to get the robots.txt file.

Scanning for the existence of a port requires just one SYN packet to be sent from your machine, and even a typical banner grab takes 3-5 round trips. An HTTP connection is far more expensive in terms of resources and requires state to be stored; it’s even more expensive when TLS and redirects are involved!
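To make that concrete, here’s a hedged sketch of what a worker could look like: a Lambda handler that makes full HTTP(S) requests for robots.txt across a batch of domains. It uses the requests library (packaged with the function or via a layer) and a hypothetical payload field, and isn’t potassium-40’s actual code:

```python
# robots_worker.py - sketch: fetch robots.txt for a batch of domains inside one Lambda
import requests

def handler(event, context):
    results = {}
    for domain in event.get("domains", []):  # hypothetical payload field
        try:
            resp = requests.get(f"https://{domain}/robots.txt", timeout=5, allow_redirects=True)
            # Only keep real robots.txt responses, not 404 pages dressed up as HTML
            if resp.status_code == 200 and "text/plain" in resp.headers.get("Content-Type", ""):
                results[domain] = resp.text
        except requests.RequestException:
            pass  # unreachable domains are simply skipped
    return results
```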

Which is where Lambdas come in. They’re effectively parallel computers that can execute code for you, and AWS gives you a large amount of free resources per month! So you can not only run 1,000 parallel processes, you can do so for free!

A scan of 1,000,000 websites will typically take less than 5 minutes.

But how do we scan 1 million URLs in under 5 minutes? Well, here’s how.


GitHub webhooks with Serverless


Just because you have a webhook, doesn’t mean you need a webserver.

With serverless AWS Lambdas you’ve got a free (as in beer) and always-on ability to receive webhook callbacks without the need for pesky servers. In this post, I’ll set up a serverless solution to accept incoming POSTs from a GitHub webhook.
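As a teaser of where this is headed, here’s a rough sketch (not the post’s final code) of a Lambda handler behind API Gateway that accepts a GitHub webhook POST and verifies GitHub’s HMAC signature; the secret is assumed to come from an environment variable:

```python
# github_webhook.py - sketch: Lambda (behind API Gateway) receiving a GitHub webhook
import hashlib
import hmac
import json
import os

def handler(event, context):
    secret = os.environ["GITHUB_WEBHOOK_SECRET"].encode()  # assumed env var
    body = event.get("body") or ""

    # GitHub signs the raw body with HMAC-SHA1 and sends it in the X-Hub-Signature header
    # (header-name casing can vary depending on the API Gateway integration)
    signature = event.get("headers", {}).get("X-Hub-Signature", "")
    expected = "sha1=" + hmac.new(secret, body.encode(), hashlib.sha1).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return {"statusCode": 401, "body": "invalid signature"}

    payload = json.loads(body)
    print("Received event for repo:", payload.get("repository", {}).get("full_name"))
    return {"statusCode": 200, "body": "ok"}
```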