Recently I found myself with a bucket of 13,000 CSV files that I needed to query. I was thoroughly excited, as I now had an excuse to play with AWS Athena or S3 Select, two serverless offerings I’d been wanting to dive into.
But that excitement was short-lived!
For some (as yet unexplained) reason, AWS Athena is not available in us-west-1. It’s available in us-west-2, us-east-1 and us-east-2, but not us-west-1. Seriously Amazon?!
And... guess where my bucket was? That’s right, the one region without AWS Athena.
Now I thought there’d be a simple way to copy objects from one bucket to another; after all, copy-and-paste is basic computer functionality. But as it turns out, once you have thousands of objects in a bucket, it becomes a slow, painful and downright impossible task to get done sanely.
For one, S3 objects aren’t indexed, so AWS doesn’t have a directory of all the objects in your bucket. You can request an inventory from the console, but it’s a snapshot of your bucket rather than a real-time index, and it’s very slow to produce (days!). An alternative is to use list_bucket.
But there’s a problem with list_bucket as well: it’s sequential (one page at a time) and limited to ‘just’ 1000 items per request. A full listing of a million objects would require 1000 sequential API calls just to list out the keys in your bucket.
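To make the paging concrete, here's a minimal sketch of that listing loop. It assumes boto3, where the listing call is `list_objects_v2`; the helper name `list_all_keys` is mine, purely for illustration. Each page can only be requested once the previous one has returned its continuation token, which is why the calls can't be fanned out in parallel.

```python
def list_all_keys(client, bucket, page_size=1000):
    """Collect every key in `bucket`, one page at a time.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    S3 returns at most 1000 keys per request, so a bucket with a million
    objects needs roughly 1000 sequential calls.
    """
    keys = []
    token = None
    while True:
        kwargs = {"Bucket": bucket, "MaxKeys": page_size}
        if token:
            # Resume where the previous page stopped.
            kwargs["ContinuationToken"] = token
        page = client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        token = page["NextContinuationToken"]
```

Calling it would look something like `list_all_keys(boto3.client("s3"), "my-source-bucket")`, where the bucket name is a placeholder.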
Once you’ve listed out your bucket by making these sequential API calls, you arrive at the most painful part of the process: actually copying the files. The S3 API has no bulk-copy method. You can use copy_object for a file of up to 5 GB (larger objects need a multipart copy), but it only works on one file at a time.
Hence copying 1 million files would require 1 million API calls, which could run in parallel, but it would have been nicer to batch them up like the
So to recap, copying 1 million objects requires 1,001,000 API requests.
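Here's a rough sketch of where that number comes from, plus what the one-call-per-object copy looks like when run in parallel. It assumes boto3's `copy_object` and a plain thread pool on a single machine; `requests_needed` and `copy_keys` are names I made up for illustration, and the worker count is an arbitrary choice.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def requests_needed(num_objects, page_size=1000):
    """List requests (one per page) plus one copy request per object."""
    return math.ceil(num_objects / page_size) + num_objects


def copy_keys(client, src_bucket, dst_bucket, keys, workers=32):
    """Copy each key from src_bucket to dst_bucket, one API call per object.

    `client` is assumed to be a boto3 S3 client. The calls are independent,
    so they can run concurrently, but there are still len(keys) of them.
    """
    def copy_one(key):
        client.copy_object(
            Bucket=dst_bucket,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        return key

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_one, keys))
```

For a million objects, `requests_needed(1_000_000)` is 1000 list calls plus 1,000,000 copy calls. The thread pool shortens the wall-clock time but doesn't reduce the request count.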
AWS recommends using S3DistCp, but I didn’t want to spin up an EMR cluster ‘just’ to handle this relatively simple cut-n-paste problem. Instead I did the terribly impractical thing and built a serverless solution to copy files from one bucket to another, which looks something like this: