comment 0

Another Day, Another breach

220,000 is a lot of people. It’s the population of a small town like Taiping, and roughly twice the capacity of Bukit Jalil Stadium.

Yet today, a data breach of this size, barely registers in the news-cycle. After all, the previous data breach was 200 times bigger, and occurred just 3 months ago. How could we take seriously something that occurs so frequently, and on a scale very few comprehend.

Individually, each breach is not particularly damaging, it’s a thin thread of data about victims, but they do add up. Criminals use multiple breaches, and stitch together a fabric of the victims identity, eventually being able to forge credit card applications in their name, or to perform typical scams.

But if you’re thinking of avoiding being in a breach, that’s an impossible task. The only Malaysians that weren’t part of the telco breach, were those without mobile phones. In the organ donor leak, the victims were kind-hearted souls who were innocent bystanders in the war between attackers and defenders on the internet.

The only specific advice that would work, would be to not subscribe to mobile phone accounts and don’t pledge your organs. That is not useful advice.

I wanted this post to be about encouraging people to stop worrying about data breaches, and move on with their lives. To accept that the price of living in a hyper-connected world, is that you’ll be data breach victim every now and then — I wanted to demonstrate this by actually going out and pledging my organs to show that we shouldn’t be afraid.

But when I went to the Malaysian organ donation website (demarorgan.gov.my), I was greeted by all too common “Connection is Not Secure” warning. Which just made my head spin!

comment 0

Writing Millions of rows into DynamoDB

While designing sayakenahack, the biggest problem I faced was trying to write millions of rows efficiently into DynamoDB. I slowly worked my way up from 100 rows/second to around the 1500 rows/second range, and here’s how I got there.

Work with Batch Write Item

First mistake I did was a data modelling error. Sayakenahack was supposed to take a single field (IC Number) and return the results of all phone numbers in the breach. So I initially modeled the phone numbers as an array within an item (what you’d called a row in regular DB speak).

Strictly speaking this is fine, DynamoDB has an update command that allows you to update/insert an existing item. Problem is that you can’t batch an update command, each update command can only update/insert one item at a time.

Running a script that updated one row in DynamoDB (at a time) was painfully slow. Around 100 items/second on my machine, even if I copied that script to an EC2 instance in the same datacenter as the DynamoDB, I got no more than 150 items/second.

At that rate, a 10 million row file would take nearly 18 hours to insert. That wasn’t very efficient.

So I destroyed the old paradigm, and re-built.

Instead of phone numbers being arrays within an item, phone numbers were the item itself. I kept IC Number as the partition key (which isn’t what Amazon recommend), which allowed me to query for an IC Number and get an array of items.

This allowed me to use DynamoDB’s batch_write_item functionality, which does up to 25 request at once (up to a maximum of 16MB). Since my items weren’t anywhere 16MB,  I would theoretically get a 25 fold increase in speed.

In practice though, I got ‘just’ a 10 fold increase, allowing me to write 1000 items/second, instead of 100. This meant I could push through a 10 million row file in under 3 hours.

First rule of thumb when trying to write lots of rows into DynamoDB — make sure the data is modeled so that you can batch insert, anything else is painfully slow.

comments 6

Sayakenahack: Epilogue

I keep this blog to help me think, and over the past week, the only thing I’ve been thinking about, was sayakenahack.

I’ve declined a dozen interviews, partly because I was afraid to talk about it, and partly because my thoughts weren’t in the right place. I needed time to re-group, re-think, and ponder.

This blog post is the outcome of that ‘reflective’ period.

The PR folks tell me to strike while the iron is hot, but you know — biar lambat asal selamat.

Why I started sayakenahack?

I’m one part geek and one part engineer. I see a problem and my mind races to build a solution. Building sayakenahack, while difficult, and sometimes frustrating, was super-duper fun. I don’t regret it for a moment, regardless of the sleepless nights it has caused me.

But that’s not the only reason.

I also built it to give Malaysians a chance to check whether they’ve been breached. I believe this is your right, and no one should withhold it from you. I also know that most Malaysians have no chance of ever checking the breach data themselves because they lack the necessary skills.

I know this, because 400,000 users have visited my post on “How to change your Unifi Password“.

400,000!!!

If they need my help to change a Wifi password, they’ve got no chance of finding the hacker forums, downloading the data, fixing the corrupted zip, and then searching for their details in file that is 10 million rows long — and no, Excel won’t fit 10mln rows.

So for at least 400,000 Malaysians, most of whom would have had their data leaked, there would have been zero chance of them ever finding out. ZERO!

The ‘normal’ world is highly tech-illiterate (I’ve even talked about it on BFM).  Sayakenahack was my attempt to make this accessible to common folks. To deny them this right of checking their data is just wrong.

But why tell them at all if there’s nothing they can do about it? You can’t put the genie back in the lamp.

comments 12

Why does SayaKenaHack have dummy data?

Why does sayakenahack have dummy data? If I enter “123456” and “112233445566” I still get results.

I was struggling with answering this question, as some folks have used it to ‘prove’ that I was a phisher. We’ll get to that later, for now I hope to answer why these ‘fake’ IC numbers exist in the sayakenahack.

Firstly, I couldn’t find a good enough way to validate IC numbers as I was inserting them into the database. Most of you think that IC numbers follow a pre-define pattern :

  • 6-digit birthday (yymmdd format)
  • 2-digit state code
  • 4-digit personal identifier, where the last digit is odd for men, and even for women.

But, there are still folks with old IC numbers, and the army have their own format. Not to mention that the IC Number field  can be populated by passport numbers (for foreigners) and Company registration IDs. So instead of cracking my head on how to validate IC numbers, I decided to pass them all in.

The only ‘transformation’ I do is to strip them of all non-AlphaNumeric characters and uppercasing any letters in the result. This would standardize the IC numbers in the database, regardless of source file format.

Had I done some validation, I might have removed these dummy entries — but fortunately I didn’t.

Upon further analyzing the data, I went back to the original source files and notice something strange, the account numbers belonged to some strange names. And then it made sense — this was Test data.

Test data in a Production Environment to be exact.

And when the Database for the telco was dumped, the telco’s didn’t remove these test accounts from their system. So what we have is a bunch of dummy accounts, with dummy IC numbers.

comments 112

SayaKenaHack.com

On the 19th of October, Lowyat.net reported that a user was selling the personal data of MILLIONS of Malaysians on their forum. Shortly after, the article was taken down on the request of the MCMC, only to put up again, a…