Part 1: An intro to Data Breaches Let’s start with some basics. What is a Data Breach? According to Verizon, a data breach is when you’ve confirmed that data has been lost to an attacker, while a data incident is merely…
Consider this a bonus piece from my long thoughts about data breaches. You might the older post before reading this. So let’s dive in. The telco breach was a giant hairball of issues, and one of the strands in the…
While designing sayakenahack, the biggest problem I faced was trying to write millions of rows efficiently into DynamoDB. I slowly worked my way up from 100 rows/second to around the 1500 rows/second range, and here’s how I got there.
Work with Batch Write Item
First mistake I did was a data modelling error. Sayakenahack was supposed to take a single field (IC Number) and return the results of all phone numbers in the breach. So I initially modeled the phone numbers as an array within an item (what you’d called a row in regular DB speak).
Strictly speaking this is fine, DynamoDB has an update command that allows you to update/insert an existing item. Problem is that you can’t batch an update command, each update command can only update/insert one item at a time.
Running a script that updated one row in DynamoDB (at a time) was painfully slow. Around 100 items/second on my machine, even if I copied that script to an EC2 instance in the same datacenter as the DynamoDB, I got no more than 150 items/second.
At that rate, a 10 million row file would take nearly 18 hours to insert. That wasn’t very efficient.
So I destroyed the old paradigm, and re-built.
Instead of phone numbers being arrays within an item, phone numbers were the item itself. I kept IC Number as the partition key (which isn’t what Amazon recommend), which allowed me to query for an IC Number and get an array of items.
This allowed me to use DynamoDB’s batch_write_item functionality, which does up to 25 request at once (up to a maximum of 16MB). Since my items weren’t anywhere 16MB, I would theoretically get a 25 fold increase in speed.
In practice though, I got ‘just’ a 10 fold increase, allowing me to write 1000 items/second, instead of 100. This meant I could push through a 10 million row file in under 3 hours.
First rule of thumb when trying to write lots of rows into DynamoDB — make sure the data is modeled so that you can batch insert, anything else is painfully slow.
Posting this here first, my thoughts to follow. Random thoughts below are draft :). Random thoughts on the matter We still need a single identifier in Malaysia (IC Number), this is administrative necessity. LHDN needs to check your bank accounts,…
I keep this blog to help me think, and over the past week, the only thing I’ve been thinking about, was sayakenahack.
I’ve declined a dozen interviews, partly because I was afraid to talk about it, and partly because my thoughts weren’t in the right place. I needed time to re-group, re-think, and ponder.
This blog post is the outcome of that ‘reflective’ period.
The PR folks tell me to strike while the iron is hot, but you know — biar lambat asal selamat.
Why I started sayakenahack?
I’m one part geek and one part engineer. I see a problem and my mind races to build a solution. Building sayakenahack, while difficult, and sometimes frustrating, was super-duper fun. I don’t regret it for a moment, regardless of the sleepless nights it has caused me.
But that’s not the only reason.
I also built it to give Malaysians a chance to check whether they’ve been breached. I believe this is your right, and no one should withhold it from you. I also know that most Malaysians have no chance of ever checking the breach data themselves because they lack the necessary skills.
I know this, because 400,000 users have visited my post on “How to change your Unifi Password“.
If they need my help to change a Wifi password, they’ve got no chance of finding the hacker forums, downloading the data, fixing the corrupted zip, and then searching for their details in file that is 10 million rows long — and no, Excel won’t fit 10mln rows.
So for at least 400,000 Malaysians, most of whom would have had their data leaked, there would have been zero chance of them ever finding out. ZERO!
The ‘normal’ world is highly tech-illiterate (I’ve even talked about it on BFM). Sayakenahack was my attempt to make this accessible to common folks. To deny them this right of checking their data is just wrong.
But why tell them at all if there’s nothing they can do about it? You can’t put the genie back in the lamp.
I know the picture is a bit hard to read, but I wanted to make sure I had a detailed enough picture to understand the ‘innards’ of sayakenahack. Sometimes when you’re building stuff on the fly, and bottom-up, it’s good…
OK, this is my last post on sayakenahack.com, and I’ve got a script scheduled to run at Sunday midnight to tear down the database. So if you wanna check, you better do it now, cause in 3 days time, it’ll…
Why does sayakenahack have dummy data? If I enter “123456” and “112233445566” I still get results.
I was struggling with answering this question, as some folks have used it to ‘prove’ that I was a phisher. We’ll get to that later, for now I hope to answer why these ‘fake’ IC numbers exist in the sayakenahack.
Firstly, I couldn’t find a good enough way to validate IC numbers as I was inserting them into the database. Most of you think that IC numbers follow a pre-define pattern :
- 6-digit birthday (yymmdd format)
- 2-digit state code
- 4-digit personal identifier, where the last digit is odd for men, and even for women.
But, there are still folks with old IC numbers, and the army have their own format. Not to mention that the IC Number field can be populated by passport numbers (for foreigners) and Company registration IDs. So instead of cracking my head on how to validate IC numbers, I decided to pass them all in.
The only ‘transformation’ I do is to strip them of all non-AlphaNumeric characters and uppercasing any letters in the result. This would standardize the IC numbers in the database, regardless of source file format.
Had I done some validation, I might have removed these dummy entries — but fortunately I didn’t.
Upon further analyzing the data, I went back to the original source files and notice something strange, the account numbers belonged to some strange names. And then it made sense — this was Test data.
Test data in a Production Environment to be exact.
And when the Database for the telco was dumped, the telco’s didn’t remove these test accounts from their system. So what we have is a bunch of dummy accounts, with dummy IC numbers.
On the 19th of October, Lowyat.net reported that a user was selling the personal data of MILLIONS of Malaysians on their forum. Shortly after, the article was taken down on the request of the MCMC, only to put up again, a…
I haven’t blogged in a while because I’m busy studying (yes, studying) for my OSCP certification. But what happened over the week, was just to mind-blowingly stupid to ignore. Here’s what happened…. A Taiwanese company released a game titled Fight…