While designing sayakenahack, the biggest problem I faced was trying to write millions of rows efficiently into DynamoDB. I slowly worked my way up from 100 rows/second to around the 1500 rows/second range, and here’s how I got there.
Work with Batch Write Item
The first mistake I made was a data modelling error. Sayakenahack was supposed to take a single field (IC Number) and return all the phone numbers tied to it in the breach. So I initially modeled the phone numbers as an array within an item (what you'd call a row in regular DB speak).
Strictly speaking this is fine: DynamoDB has an update command that allows you to update/insert an existing item. The problem is that you can't batch an update command; each update call can only update/insert one item at a time.
Running a script that updated one row in DynamoDB at a time was painfully slow: around 100 items/second on my machine. Even when I copied that script to an EC2 instance in the same datacenter as the DynamoDB table, I got no more than 150 items/second.
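For reference, here's a minimal sketch of what that one-row-at-a-time loop looked like, assuming boto3; the table name (breach) and attribute names (ic_number, phone_numbers) are hypothetical, not the actual sayakenahack schema.

```python
import boto3

# Hypothetical table/attribute names, for illustration only.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("breach")

def insert_row(ic_number, phone_number):
    # One UpdateItem call per row: append the phone number to a list
    # attribute on the single item keyed by this IC number.
    table.update_item(
        Key={"ic_number": ic_number},
        UpdateExpression=(
            "SET phone_numbers = "
            "list_append(if_not_exists(phone_numbers, :empty), :new)"
        ),
        ExpressionAttributeValues={":empty": [], ":new": [phone_number]},
    )
```

Every row costs a full round trip to DynamoDB, which is why throughput tops out in the low hundreds of items/second.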
At that rate, a 10 million row file would take nearly 18 hours to insert. That wasn’t very efficient.
So I destroyed the old paradigm and rebuilt.
Instead of phone numbers being arrays within an item, each phone number became an item of its own. I kept IC Number as the partition key (which isn't what Amazon recommends), which allowed me to query for an IC Number and get back an array of items.
This allowed me to use DynamoDB's batch_write_item functionality, which writes up to 25 items per request (up to a maximum of 16MB per batch). Since my items weren't anywhere near 16MB, I would theoretically get a 25-fold increase in speed.
In practice though, I got 'just' a 10-fold increase, allowing me to write 1,000 items/second instead of 100. This meant I could push through a 10 million row file in under 3 hours.
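Here's a minimal sketch of the batched version, again assuming boto3 and hypothetical table/key names; boto3's batch_writer buffers put requests and flushes them as BatchWriteItem calls of up to 25 items, resending unprocessed items for you.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("breach")  # hypothetical table name

def insert_rows(rows):
    # rows is an iterable of (ic_number, phone_number) tuples.
    # batch_writer() groups puts into BatchWriteItem calls of up to
    # 25 items and retries any unprocessed items automatically.
    with table.batch_writer() as batch:
        for ic_number, phone_number in rows:
            batch.put_item(
                Item={
                    "ic_number": ic_number,        # partition key
                    "phone_number": phone_number,  # sort key
                }
            )
```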
First rule of thumb when trying to write lots of rows into DynamoDB: make sure the data is modeled so that you can batch insert; anything else is painfully slow.