comment 0

Writing Millions of rows into DynamoDB

While designing sayakenahack, the biggest problem I faced was trying to write millions of rows efficiently into DynamoDB. I slowly worked my way up from 100 rows/second to around the 1500 rows/second range, and here’s how I got there.

Work with Batch Write Item

First mistake I did was a data modelling error. Sayakenahack was supposed to take a single field (IC Number) and return the results of all phone numbers in the breach. So I initially modeled the phone numbers as an array within an item (what you’d called a row in regular DB speak).

Strictly speaking this is fine, DynamoDB has an update command that allows you to update/insert an existing item. Problem is that you can’t batch an update command, each update command can only update/insert one item at a time.

Running a script that updated one row in DynamoDB (at a time) was painfully slow. Around 100 items/second on my machine, even if I copied that script to an EC2 instance in the same datacenter as the DynamoDB, I got no more than 150 items/second.

At that rate, a 10 million row file would take nearly 18 hours to insert. That wasn’t very efficient.

So I destroyed the old paradigm, and re-built.

Instead of phone numbers being arrays within an item, phone numbers were the item itself. I kept IC Number as the partition key (which isn’t what Amazon recommend), which allowed me to query for an IC Number and get an array of items.

This allowed me to use DynamoDB’s batch_write_item functionality, which does up to 25 request at once (up to a maximum of 16MB). Since my items weren’t anywhere 16MB,  I would theoretically get a 25 fold increase in speed.

In practice though, I got ‘just’ a 10 fold increase, allowing me to write 1000 items/second, instead of 100. This meant I could push through a 10 million row file in under 3 hours.

First rule of thumb when trying to write lots of rows into DynamoDB — make sure the data is modeled so that you can batch insert, anything else is painfully slow.

comments 6

Sayakenahack: Epilogue

I keep this blog to help me think, and over the past week, the only thing I’ve been thinking about, was sayakenahack.

I’ve declined a dozen interviews, partly because I was afraid to talk about it, and partly because my thoughts weren’t in the right place. I needed time to re-group, re-think, and ponder.

This blog post is the outcome of that ‘reflective’ period.

The PR folks tell me to strike while the iron is hot, but you know — biar lambat asal selamat.

Why I started sayakenahack?

I’m one part geek and one part engineer. I see a problem and my mind races to build a solution. Building sayakenahack, while difficult, and sometimes frustrating, was super-duper fun. I don’t regret it for a moment, regardless of the sleepless nights it has caused me.

But that’s not the only reason.

I also built it to give Malaysians a chance to check whether they’ve been breached. I believe this is your right, and no one should withhold it from you. I also know that most Malaysians have no chance of ever checking the breach data themselves because they lack the necessary skills.

I know this, because 400,000 users have visited my post on “How to change your Unifi Password“.

400,000!!!

If they need my help to change a Wifi password, they’ve got no chance of finding the hacker forums, downloading the data, fixing the corrupted zip, and then searching for their details in file that is 10 million rows long — and no, Excel won’t fit 10mln rows.

So for at least 400,000 Malaysians, most of whom would have had their data leaked, there would have been zero chance of them ever finding out. ZERO!

The ‘normal’ world is highly tech-illiterate (I’ve even talked about it on BFM).  Sayakenahack was my attempt to make this accessible to common folks. To deny them this right of checking their data is just wrong.

But why tell them at all if there’s nothing they can do about it? You can’t put the genie back in the lamp.

comments 12

Why does SayaKenaHack have dummy data?

Why does sayakenahack have dummy data? If I enter “123456” and “112233445566” I still get results.

I was struggling with answering this question, as some folks have used it to ‘prove’ that I was a phisher. We’ll get to that later, for now I hope to answer why these ‘fake’ IC numbers exist in the sayakenahack.

Firstly, I couldn’t find a good enough way to validate IC numbers as I was inserting them into the database. Most of you think that IC numbers follow a pre-define pattern :

  • 6-digit birthday (yymmdd format)
  • 2-digit state code
  • 4-digit personal identifier, where the last digit is odd for men, and even for women.

But, there are still folks with old IC numbers, and the army have their own format. Not to mention that the IC Number field  can be populated by passport numbers (for foreigners) and Company registration IDs. So instead of cracking my head on how to validate IC numbers, I decided to pass them all in.

The only ‘transformation’ I do is to strip them of all non-AlphaNumeric characters and uppercasing any letters in the result. This would standardize the IC numbers in the database, regardless of source file format.

Had I done some validation, I might have removed these dummy entries — but fortunately I didn’t.

Upon further analyzing the data, I went back to the original source files and notice something strange, the account numbers belonged to some strange names. And then it made sense — this was Test data.

Test data in a Production Environment to be exact.

And when the Database for the telco was dumped, the telco’s didn’t remove these test accounts from their system. So what we have is a bunch of dummy accounts, with dummy IC numbers.

comments 106

SayaKenaHack.com

On the 19th of October, Lowyat.net reported that a user was selling the personal data of MILLIONS of Malaysians on their forum. Shortly after, the article was taken down on the request of the MCMC, only to put up again, a…

comment 1

#PotongSteam

I haven’t blogged in a while because I’m busy studying (yes, studying) for my OSCP certification. But what happened over the week, was just to mind-blowingly stupid to ignore. Here’s what happened…. A Taiwanese company released a game titled Fight…

comment 0

JJPTR wasn’t hacked

The fact that this RM2 company manage to raise RM500 million should be news enough, but claims that it lost all it’s money to ‘hackers’ is too hilarious for me to ignore. If you haven’t heard, a get-rich-quick scheme called…

comment 0

Everything wrong with TalkingPoint’s “Cybersecurity” episode

Channel News Asia posted last week that hackers could steal your info by just knowing your phone number.

Woah!! Must be some uber NSA stuff right–but no, it was a couple of guys with Metasploit and they required a LOT more than ‘just’ the phone number.

The post was an add-on to a current affairs show called Talking Point, that aired an episode last week about cybersecurity, which (like most mainstream media reporting) had more than a few errors I’d like to address.

Problem 1: Cost of cybercrime — but no context

The show starts off, by highlighting that Cybercrime cost Singaporeans S$1.25bln, which might be true, but lacks context, or rather had the context removed.

Because the very report that estimated the cost, also mentioned that society was willing to tolerate malicious activity that cost less than 2% of GDP, like Narcotics (0.9%) and even pilferage (1.5%). S$1.25bln is less than 0.3% of Singapore’s GDP, and is long way off the 2% threshold. Giving out big numbers without context gives readers the wrong impression.

So allow me to provide context on just how big that S$1.25bln is.

In 2010, Singapore’s retail sector lost S$222 mln to shrinkage, a term used to describe the losses attributed to employee theft, shoplifting, administrative error, and others. Had we split the cost of cybercrime across different industry based on their percentage of overall GDP, the total losses for cybercrime on the retail sector in 2015 would be $225 mln–almost identical to what the sector lost to shrinkage….7 years ago!

Cybercrime is a problem, but not one that is wildly out of proportion to the other issues society is facing.