Spacy in a Lambda

I’ve been really digging into Lambda Layers lately, and once you begin using layers you’ll wonder how you got by without them.

Layers allow you to package just about anything into Lambda, but in a modular way. The parts of your code that don't change much can be packaged into layers, which leaves your actual Lambda deployment for just the code that is changing.

It's akin to the Docker cache, where you keep the unchanging elements higher up in your Dockerfile, separate from the code that always changes. The difference is that the Docker cache speeds up builds, while layers speed up Lambda deployments.

But layers aren't magic, and they're still bound by the AWS size limit: your entire function (including all its layers) needs to be no larger than 250MB (unzipped).
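To see whether a package directory would fit, you can add up the size of its files before zipping. The snippet below is a minimal sketch rather than an official check: the function names are made up for illustration, and it assumes the 250MB limit is counted as 250 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Assumption: the 250MB unzipped limit, counted in binary megabytes.
LAMBDA_UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root (recursively)."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def fits_in_lambda(root, limit=LAMBDA_UNZIPPED_LIMIT_BYTES):
    """Return (size_in_bytes, True if the directory is within the limit)."""
    size = directory_size_bytes(root)
    return size, size <= limit
```

Calling `fits_in_lambda("python/")` on an unpacked layer directory gives a quick yes or no before you upload anything. Remember that the limit covers the function and every layer attached to it, not each layer on its own.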

That's tough for something like spaCy, because its default installation size on Amazon Linux is ~400MB (or 492MB based on my quick installation on lambci for python3.7). So, in order to get spaCy working on a Lambda, certain tweaks are going to be necessary.

Some have tried working around this problem by installing spaCy onto the Lambda container on cold start, i.e. pulling the data in only at runtime, when you have access to the 512MB in /tmp. It's a cool solution, but it almost completely fills /tmp and makes a cold start even slower.

A better solution would be to reduce the size of the spaCy installation and have it fit into a layer! Fortunately, after some googling, I found a GitHub issue that enables us to do exactly this.

It involves removing unnecessary language files, which spaCy lazy-loads. If you're only interested in one language, you can simply remove the unnecessary language files in the site-packages/spacy/lang directory.
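The manual clean-up can also be scripted. This is only a sketch under a few assumptions: that each language lives in its own sub-directory of spacy/lang (en, de, fr and so on), that the top-level modules in that directory are shared and should stay, and that `prune_spacy_languages` is a name invented for this example.

```python
import shutil
from pathlib import Path


def prune_spacy_languages(lang_dir, keep=("en",)):
    """Delete every language sub-directory of lang_dir that is not in keep.

    Top-level files (such as __init__.py) and dunder directories are left
    untouched, since the remaining languages may still import them.
    Returns the names of the directories that were removed.
    """
    removed = []
    for entry in sorted(Path(lang_dir).iterdir()):
        if not entry.is_dir():
            continue  # keep shared top-level modules
        if entry.name in keep or entry.name.startswith("__"):
            continue  # keep wanted languages and __pycache__
        shutil.rmtree(entry)
        removed.append(entry.name)
    return removed
```

For example, `prune_spacy_languages("python/lib/python3.7/site-packages/spacy/lang")` would keep only English. Try it on a copy of the installation first, because the deletion is not reversible.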

After manually removing all non-English (en) language files, I managed to reduce the size of the spaCy package to 110MB, which fits very nicely into a Lambda layer. In the end, my lang directory only had the following files:

As a bonus, I also included the English en_core_web_sm-2.1.0 model, to make the Lambda layer fully usable on its own.

Finally, I published it as a publicly available layer. One of the amazing things about layers is that once a layer is made, it can be shared across AWS for anyone to consume.

How to use the layer

The simplest way to get spaCy working on your Lambda function is to just import the publicly available layer I created. To do this, create a python3.7 function in an AWS region of your choice.

Then select Layers, and add the spaCy layer by its ARN (details of the ARN for all regions can be found here). For spaCy, the layer ARN takes the form below; replace <region> with your actual AWS region (e.g. us-east-2):

arn:aws:lambda:<region>:113088814899:layer:Klayers-python37-spacy:1
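If you prefer to script this step, the ARN can be filled in from the region name. The sketch below makes some assumptions: the helper names are invented, version 1 is taken from the example ARN and may differ per region, and `attach_layer` relies on boto3 being installed with credentials configured.

```python
def spacy_layer_arn(region, version=1):
    """Fill a region (and layer version) into the ARN format shown above."""
    if not region or "<" in region:
        raise ValueError("region must be a real AWS region name, e.g. us-east-2")
    return (
        f"arn:aws:lambda:{region}:113088814899:"
        f"layer:Klayers-python37-spacy:{version}"
    )


def attach_layer(function_name, region):
    """Attach the layer to an existing function (hypothetical helper).

    Note: Layers replaces the function's whole layer list, so include any
    other layers the function already uses.
    """
    import boto3  # imported lazily so the ARN helper works without boto3

    client = boto3.client("lambda", region_name=region)
    return client.update_function_configuration(
        FunctionName=function_name,
        Layers=[spacy_layer_arn(region)],
    )
```

`spacy_layer_arn("us-east-2")` yields the example ARN with the region filled in.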

Finally, copy the code below as an example (just to make sure things work):
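A minimal handler along these lines could look like the sketch below. It assumes the model can be loaded by the name en_core_web_sm; depending on how the layer is laid out, you may need to pass a filesystem path under /opt (where Lambda extracts layers) instead. The SPACY_MODEL environment variable and the `extract_entities` helper are hypothetical names used only for this example.

```python
import json
import os

# Assumption: a model name or path that spacy.load() can resolve.
MODEL = os.environ.get("SPACY_MODEL", "en_core_web_sm")

_NLP = None  # cached between invocations while the container stays warm


def get_nlp():
    """Load the spaCy model once and reuse it on warm invocations."""
    global _NLP
    if _NLP is None:
        import spacy  # provided by the layer

        _NLP = spacy.load(MODEL)
    return _NLP


def extract_entities(nlp, text):
    """Run the pipeline and return the named entities as plain dicts."""
    doc = nlp(text)
    return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]


def lambda_handler(event, context):
    text = event.get("text", "Apple is looking at buying U.K. startup for $1 billion")
    return {
        "statusCode": 200,
        "body": json.dumps(extract_entities(get_nlp(), text)),
    }
```

Loading the model lazily and caching it means only the first (cold) invocation pays the load time.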

Before you test, raise the Lambda timeout to 10 seconds, and allocate 512MB of memory (128MB wasn't enough for spaCy). From there you can run a dummy test, and get the following results:

If all goes well, you'll have spaCy fully working within a Lambda function, with very little effort. To support other languages, you'll have to package the layer yourself, removing only the language files you don't need from the site-packages directory.

Remember that you can't simply package layers from any box: the packages have to be pip-installed on an Amazon Linux installation. I recommend using the lambci docker images to do this.
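As a rough sketch of that build step, the helper below assembles a docker command that runs pip inside a lambci build image. It assumes the lambci/lambda:build-python3.7 image tag, that the image's working directory is /var/task, and the python/lib/python3.7/site-packages layout for Python layers. The function names are invented for this example.

```python
import subprocess
from pathlib import Path


def build_layer_command(workdir, package="spacy", python_version="3.7"):
    """Return the docker argv that pip-installs package into a layer folder."""
    target = f"python/lib/python{python_version}/site-packages"
    return [
        "docker", "run", "--rm",
        # Mount the working directory where the build image expects code.
        "-v", f"{Path(workdir).resolve()}:/var/task",
        f"lambci/lambda:build-python{python_version}",
        "pip", "install", package, "-t", target,
    ]


def build_layer(workdir, **kwargs):
    """Run the build; requires docker to be installed and running."""
    subprocess.run(build_layer_command(workdir, **kwargs), check=True)
```

After `build_layer(".")` finishes, the python/ folder in the working directory can be trimmed and zipped into a layer.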

Conclusion

Lambda layers are an awesome way to store Python packages, and since everyone uses Python packages in Lambda, it makes sense to share them via layers rather than having everyone package them repeatedly.

There’s a bunch more layers for python in my repo here. If you’ve used the layers or find the project helpful, all I ask is that you consider starring the repo 🙂

18 comments

  • Hello Keith,

    Thank you for such wonderful post. This was highly helpful and thank you for sharing sweetly curated ARN list.

  • Why does vector return shape (96,) ? Running Spacy small outside the layer shows shape (384,) … but yet again, Glove is (300, )

  • Hi Keith,
    This is very nice and useful. I was able to use your layer easily! Thanks for sharing!
    I wonder if there was a way to use larger Spacy models in layers or otherwise in AWS Lambda (such as the “en_core_web_lg” model, which is 788 MB)

    • that’s quite difficult — at present there is no way to do this. Layers have to come under 250MB unzipped, and there’s only 512MB available in /tmp.

  • It seems that you’ve removed support for Spacy? Any chance to get it back ? If not could you explain what was the issue ?

      • The latest version of Klayers now builds spaCy automatically every week. This is version 2.2, and does **not** include the pre-compiled language files. You'll have to bring those to the party yourself.

    • Sorry, I removed it from the layer. You’ll have to pull that in yourself, either via layer or in your function code :(.

  • Hello, I have been trying to upload a custom spaCy package to my Lambda layer, as none of the Python 3.8/3.9 spaCy ARNs are working. I am not able to get the spaCy model loaded even after all attempts. What's the issue?

    arn:aws:lambda:us-east-1:198171447625:layer:39pythondemospacyv1:1

    def lambda_handler(event, context):
        # TODO implement
        # Load the spaCy model
        nlp = spacy.load('/mylayer/layer/en_core_web_sm-2.1.0/en_core_web_sm/en_core_web_sm-2.1.0')