Sayakenahack: Epilogue

I keep this blog to help me think, and over the past week, the only thing I’ve been thinking about was sayakenahack.

I’ve declined a dozen interviews, partly because I was afraid to talk about it, and partly because my thoughts weren’t in the right place. I needed time to re-group, re-think, and ponder.

This blog post is the outcome of that ‘reflective’ period.

The PR folks tell me to strike while the iron is hot, but you know — biar lambat asal selamat.

Why did I start sayakenahack?

I’m one part geek and one part engineer. I see a problem and my mind races to build a solution. Building sayakenahack, while difficult, and sometimes frustrating, was super-duper fun. I don’t regret it for a moment, regardless of the sleepless nights it has caused me.

But that’s not the only reason.

I also built it to give Malaysians a chance to check whether they’ve been breached. I believe this is your right, and no one should withhold it from you. I also know that most Malaysians have no chance of ever checking the breach data themselves because they lack the necessary skills.

I know this, because 400,000 users have visited my post on “How to change your Unifi Password”.

400,000!!!

If they need my help to change a Wifi password, they’ve got no chance of finding the hacker forums, downloading the data, fixing the corrupted zip, and then searching for their details in a file that is 10 million rows long (and no, Excel won’t fit 10 million rows).

So for at least 400,000 Malaysians, most of whom would have had their data leaked, there would have been zero chance of them ever finding out. ZERO!

The ‘normal’ world is highly tech-illiterate (I’ve even talked about it on BFM). Sayakenahack was my attempt to make this accessible to common folks. To deny them this right of checking their data is just wrong.

But why tell them at all if there’s nothing they can do about it? You can’t put the genie back in the lamp.

Sayakenahack architecture

I know the picture is a bit hard to read, but I wanted to make sure I had a detailed enough picture to understand the ‘innards’ of sayakenahack. Sometimes when you’re building stuff on the fly, and bottom-up, it’s good to take a step back, and have a top-down view.

I’ll be expanding this post over time; I just wanted to get my thoughts down quickly before I moved on.

Intro to serverless

Serverless is a new-ish buzzword. It’s about building full-blown applications without servers (not even virtuals).

No EC2 instances — at all!

Some folks thought that just because I was on AWS, I was calling it serverless. Not so. Lots of people use AWS for EC2 instances, which are virtual servers. This blog is hosted on a virtual server, but sayakenahack ran without any servers (virtual or otherwise). Except that one spot instance at the bottom right of the picture, we’ll get to that later.

That doesn’t mean I ran it on Elven magic and sparkly fairy-dust though. At some point, there were servers involved. But I never managed or ran those servers, they were abstracted from me by AWS services such as API Gateway (which is awesome by the way) and Lambda.

The beauty of serverless is:

  • I don’t have the headache of managing a full-blown OS stack (which requires more skills than I have)
  • It can scale till kingdom come. Amazon has designed their serverless offerings to scale, and they do it beautifully.
  • It’s cheaper. MUCH cheaper.

With API Gateway for example, I could focus purely on building the resources and methods, without worrying about Apache configurations.

With Lambda I can write Python code natively (almost magically), without building out an EC2 instance.

And with DynamoDB, I get a database that can do anywhere from 10 to 10,000 writes per second without worrying about clustering, mirroring, etc. And even at 37 mln rows, the DB still qualifies for the free-tier.

That’s awesome (if I do say so myself).

Now of course there is a drawback. Without full granular control of Apache/Nginx, certain edge cases you need might not be possible, and DynamoDB is somewhat limited in its capability (although I’m not sure if that’s DynamoDB’s fault, or because all NoSQL databases are like that).

But overall, for sayakenahack, serverless was the way to go. It might not work for you, but it worked beautifully for me.

Next up, we’ll look at the Holy Trinity of Serverless (API Gateway, Lambda, DynamoDB).

–stay tuned.

 

Sayakenahack.com answering the questions

OK, this is my last post on sayakenahack.com, and I’ve got a script scheduled to run at Sunday midnight to tear down the database. So if you wanna check, you better do it now, cause in 3 days time, it’ll be gone.

*poof*

But here are my thoughts on this whole debacle — and it’s going to get emotional, so don’t say I didn’t warn you.

So let’s start with the basics.

The right to know

I believe that if your data was leaked online, you have a right to know.

You might choose to “not know”, but that is a right you can choose to exercise. No one should be allowed to withhold that information from you.

I believe that you have a right to know about it, in a timely manner. Authorities can’t sit on the data for weeks without letting you know on any pretense.

I believe that the correct authority to tell you about leaks is the MCMC. But till today they have made no attempt to create such a service, nor even communicated a plan to implement one. There is no evidence to suggest they have (or had) any intention to do anything about it.

If I can code sayakenahack within 4 weeks (in my spare time, while holding a 9-to-5 job, being a father and husband) there is no logical reason why the MCMC or the telcos couldn’t do something better in a shorter time-frame.

Even if you can’t do anything about it

I believe the right to know about a breach should exist even if you can’t do anything about it.

If you have terminal un-treatable cancer, does that mean a doctor shouldn’t tell you about it? If you’re on a plane that’s about to crash, should the pilot remain silent?

You have a right to know about the leak. Regardless of whether you can do anything about it.

Only hackers and geeks should see the data

This data is freely available for anyone to download. The only people with the skills to find it though, are people we generally refer to as ‘geeks’ or ‘hackers’.

To ban sayakenahack is to say geeks and hackers can access the data — but not the average joe. It’s emphasizing that normal people don’t deserve that knowledge while geeks and hackers do.

This is elitism, and it’s wrong.

When Lowyat published the initial report, they knew of its importance, but chose to remove the article when the MCMC came. They continue to side with the MCMC, saying that the “sheer amount of information made available on the site could subject it to abuse.”

They fail to mention that the ‘sheer amount of information’ is already made available, just not to common folks, but to geeks and hackers. Effectively Lowyat is saying that it’s OK for geeks and hackers to have this data, but god-forbid the average joe get a hold of it.

God-forbid the actual data subject who is actually impacted be notified, the great Gods of Lowyat think that’s too much.

Oh, and btw, when Lowyat published the article, the site took roughly 200+ concurrent users at one time. When The Star published the article in the morning, the site did 2000+ concurrent users (10 times more!).

Most Malaysians are technically illiterate and don’t visit Lowyat. They shouldn’t have less information because of it.

Manipulating vs. Masking

Lowyat’s editor then goes on to tell The Malaysian Insight that “It’s blocked because it’s not right to manipulate the stolen data”.

Oh, give me a fucking break!

The word ‘manipulate’ is a dishonest choice. I mask the data, not manipulate it. No IT professional would ever confuse manipulation with masking. Manipulation carries a negative connotation that implies I’m changing the data in some way. Masking though is the intentional removal of data, to protect its confidentiality.

I’m masking. I’m not manipulating.

I went out of my way to ensure that enough data was left so that users could still identify their numbers, yet not enough for somebody else to guess.

If you buy anything with a credit card, your masked credit card number will be on the printed receipt. Generally the first 6 and last 4 digits are in the clear, while the middle 6 digits of a 16-digit number are masked (“replaced with asterisks”). This ensures that there’s enough information left on the receipt to trace a transaction, but not enough for fraudsters to get a card number. That’s a PCI-DSS rule, the gold-standard security framework for credit card processing, and I’ve spent 10 years deploying PCI systems. Masking is an acceptable practice for this sort of thing.

So give me a break, with your ‘manipulating’. I love Lowyat to bits, but today they failed me big time.

How can we be sure the site is secure?

How can we be sure my site is secure?

Well there’s no such thing as an un-hackable website, and that includes things like Maybank2u.com, or CIMB-Clicks. Do we tell the banks to shut down their sites just because they might be hacked?

No. We weigh the benefits and risk, and make calculated decisions as to what to do. Just because something is hackable doesn’t mean we take it offline.

We try our best to reduce the risk, until it reaches an acceptable level.

I would argue that the benefits of sayakenahack far outweigh the risk of it getting hacked. Hence I continue to believe this was the right thing to do.

What if your site gets hacked?

So there are two parts here: how do I prevent a hack, and how do I mitigate the impact of a breach (if it occurs).

First off, the data is masked at source, i.e. the database only has the masked data, making it less valuable to fraudsters. This is in the “minimize impact” bucket.

Below is the full representation of the data in the DB (for ic number 12345):

The data is masked at source, not in transit. So even I have no way to retrieve the full phone number from the DB. That’s why I can’t provide folks their full phone numbers — it just doesn’t work that way.

True, maybe this data is still valuable — but how would you extract it?

The DynamoDB table is capped at 20 reads/second (10 RCUs in AWS parlance). Reading the data at full capacity would take you 3 weeks (provided you knew all the ICs). If you were guessing ICs, my rough estimate is 3 years to dump the DB.
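For the curious, the 3-week figure falls out of simple arithmetic. Here is a quick sketch, assuming the roughly 37 million rows mentioned elsewhere in these posts and eventually consistent reads (where one RCU covers two reads per second for items up to 4 KB):

```python
# Back-of-envelope estimate of how long a full dump would take at the read cap.
ROWS = 37_000_000      # approximate row count quoted in these posts
RCUS = 10              # provisioned read capacity units
READS_PER_RCU = 2      # eventually consistent reads per second per RCU (items <= 4 KB)
SECONDS_PER_DAY = 86_400

reads_per_second = RCUS * READS_PER_RCU
seconds = ROWS / reads_per_second
days = seconds / SECONDS_PER_DAY
weeks = days / 7

print(f"{reads_per_second} reads/s -> {days:.1f} days ({weeks:.1f} weeks)")
```

That assumes every single read hits a real IC number and nothing else is using the table's capacity, so in practice it would be slower.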

So maybe you have the skills to query the API, throttle it to avoid attention, and patiently write it to a DB for 3 weeks. If you got those skills — you probably found the files online already.

More AWS technical site

*skip this if you’re uninterested

Now onto the “prevent the hack” bucket.

My AWS account is protected with a super long password, and 2FA protected with the code on just one iPhone. I work on the AWS console exclusively via my Ubuntu virtual machine, that I spun up for this purpose, and I intend to destroy that VM image on Sunday as well.

Let’s get more details.

The entire architecture of sayakenahack is serverless. The HTML is served from an S3 bucket on Amazon via CloudFront, which has TLS enabled. The JavaScript on the page calls an API that is hosted on API Gateway (which is also served via CloudFront and has TLS enabled). It took me a few iterations, but the API and website exist on the same domain (no same-origin policy violations, but I still keep it CORS enabled).

Which means that there are no servers to hack on this thing.

No RFI,  No LFI.

No un-patched Apache version, or some shitty PHP bug that a server-ed site would be vulnerable to.

No SQLi (technically it’s a NoSQL database), or CMS vulnerability. No Windows server to patch, no FTP, SMTP vuln, and all the other crap that gets servers in trouble.

It doesn’t mean that this is unhackable, it just means that I reduce my attack surface tremendously by getting rid of any servers.

The API calls a Lambda function that reads from DynamoDB with an IAM role that allows only read access to the DB, ensuring the integrity of the data.
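A read-only policy of this kind might look roughly like the sketch below. The table ARN is a placeholder and the exact action list is my assumption, not a copy of the real sayakenahack policy; the point is simply that no write actions are granted.

```python
import json

# Placeholder ARN: region, account ID and table name are hypothetical.
TABLE_ARN = "arn:aws:dynamodb:REGION:ACCOUNT_ID:table/TABLE_NAME"


def read_only_policy(table_arn: str) -> dict:
    """Build an IAM policy document that allows reads, and nothing else, on one table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Read actions only; no PutItem, UpdateItem or DeleteItem.
                "Action": ["dynamodb:GetItem", "dynamodb:Query"],
                "Resource": table_arn,
            }
        ],
    }


policy = read_only_policy(TABLE_ARN)
print(json.dumps(policy, indent=2))
```

With a policy like this attached to the function, even a bug in the Lambda code could not modify or delete rows, because IAM denies anything that is not explicitly allowed.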

A separate IAM identity that allows full DynamoDB access, the one I used to insert rows into the DB, is now keyless because I de-activated its key.

Finally the HTML site is a ‘minimalist’ design and I wrote it to be easily readable, so any programmer can vet the code to detect bugs (or malicious stuff, like malvertising or bitcoin mining). The full source code is available on GitHub for anyone to vet; in fact, some already have!

Oh and I don’t log any API requests (they are cached though) — but that’s not the same as logging.

That’s just the tip of the iceberg of what I have on AWS. This thing is a labour of love.

So what

Let’s compare all of the above — with this!!

That’s the election commission website, that publishes your full name, and voting location based on a simple IC entry. The site is marked as insecure by Google Chrome because it doesn’t even have TLS.

TLS!!!! In fucking 2017, your website doesn’t have TLS??

A simple thing, that a free LetsEncrypt certificate would solve in less than 5 minutes.

What that means, is that when you search for your voting information on the website, the data is transferred in clear across the internet for anyone in the middle to see. It also means that your browser is not authenticating the site, and anyone can create a fake SPR website and make it look identical.

If you’re logged onto the SPR website from a kopitiam WiFi, I can see the data you’re sending (and receiving) just by logging on the same WiFi.

Trust on the internet

Fundamentally, when you log onto the SPR website, you’re trusting all the infrastructure between you and SPR, kopitiam Wifi included.

Do you trust the SPR? How about their vendor? How about the company that supplied them the servers?

How about the guy managing their database? Or the company that hosts their datacenter?

Their SysAdmin? Their Web Admin? All of their guys who wrote their code? Trust all of them?

Oh, and if you’re logging on to the site from home on Unifi — you’re probably trusting your stock-standard Dlink DIR-615 router, that’s hackable from the open internet.

The internet is built on a whole load of trust, and maybe you don’t trust me, but have you ever considered the number of people you unsuspectingly trust just by visiting a simple website?

Just saying, maybe sayakenahack isn’t a problem when the Election Commission’s website is marked as insecure. Why doesn’t Lowyat complain about the ‘sheer amount of data’ on the Election Commission’s website?

This is sayakenahack.

That 1 cookie is for Google Analytics, it’s the only bit of ‘data capturing’ I do, and it’s industry standard. I collect data about who visits the site, and see what their load/lag times are, to ensure the site is operational and working well. But that doesn’t capture query strings, so no IC numbers are tracked.

More importantly, the site is TLS protected. All data between you and servers is encrypted, meaning you don’t have to trust the internet providers, or WiFi connection.

So, I go to great lengths protecting the site, and definitely put in more effort than the SPR.

As a bonus, here’s the PTPTN website, that allows you to check your balance. To be fair, they at least have a TLS equivalent, but they don’t re-direct you to it. So it’s still possible to access their non-encrypted site. See that bit on the browser that says “Not Secure”.

 

Sayakenahack forces re-direction. There is no un-encrypted version for either the website or API.

 

Yes, but these are government websites

Yes, the government (both state and federal) is exempt from the PDPA.

But the damage inflicted on victims by a breach is identical whether the data comes from the government or private companies. I’m not sure why that exemption is there.

Secondly, there exists an exemption clause in the PDPA, specifically section 45(2)(f)(ii), which states there is an exemption IF:

(f) processed only for journalistic, literary or artistic purposes shall be exempted from the General Principle, Notice and Choice Principle, Disclosure Principle, Retention Principle, Data Integrity Principle and Access Principle and other related provisions of this Act, provided that—

(ii) the data user reasonably believes that, taking into account the special importance of public interest in freedom of expression, the publication would be in the public interest;

I don’t know how you define public interest, but the fact that the site got 100,000 visits today (even with the ban kicking in at 12pm) signals to me that there is public interest.

I know Public Interest doesn’t literally mean things that interest the public, but you can’t argue that this is something people should be aware of. And don’t give me bullshit about hackers querying this instead of ‘real-users’. Hackers would use the API, and bypass the Google Analytics, the 100,000 is purely from the Google Analytics data.

Now, the lawyer types tell me that ‘journalistic’ may not apply to bloggers, but this might be a good test case. Let’s see.

Also, I don’t know of any journalist in Malaysia that could trawl through hacker forums getting the data, and then stand up a site that can support querying a 37mln row database, while serving 2000 concurrent users. Do you?

There is one more exemption I might play here — but that’s a long shot, and I’m playing my cards close to my chest on that one.

Conclusion

To be honest, I’m afraid.

Afraid that next time I land in Malaysia, I end up in handcuffs at the back of a police car.

But sometimes, you gotta do what’s right, and not just what’s ‘legally permissible’.

Post-note

To all the reporters who’ve contacted me. I’m cancelling all interviews for now. This is my last post on the matter — at least for the next week.

See you on the other side.

Why does SayaKenaHack have dummy data?

Why does sayakenahack have dummy data? If I enter “123456” and “112233445566” I still get results.

I was struggling with answering this question, as some folks have used it to ‘prove’ that I was a phisher. We’ll get to that later, for now I hope to answer why these ‘fake’ IC numbers exist in the sayakenahack.

Firstly, I couldn’t find a good enough way to validate IC numbers as I was inserting them into the database. Most of you think that IC numbers follow a pre-defined pattern:

  • 6-digit birthday (yymmdd format)
  • 2-digit state code
  • 4-digit personal identifier, where the last digit is odd for men, and even for women.

But there are still folks with old IC numbers, and the army have their own format. Not to mention that the IC Number field can be populated by passport numbers (for foreigners) and company registration IDs. So instead of cracking my head on how to validate IC numbers, I decided to pass them all in.

The only ‘transformation’ I do is to strip them of all non-alphanumeric characters and uppercase any letters in the result. This standardizes the IC numbers in the database, regardless of source file format.
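In Python, that transformation can be sketched in a couple of lines. The function name is made up for illustration, and this is not necessarily the exact code used to load the data:

```python
import re


def normalize_ic(raw: str) -> str:
    """Drop everything that is not an ASCII letter or digit, then uppercase the rest."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()


# Dashes, spaces and lowercase letters all collapse to one canonical form.
print(normalize_ic("850101-14-5678"))
print(normalize_ic(" a 1234567 "))
```

The same function would need to run on the search input too, so that a user typing dashes or lowercase letters still matches the stored value.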

Had I done some validation, I might have removed these dummy entries — but fortunately I didn’t.
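To illustrate, here is a rough sketch of what a naive check against the 12-digit pattern above could look like. It is hypothetical (no such check ran on the site), it only sanity-checks the birthday digits and ignores the state code, and it would also throw out old-format ICs, army numbers and passport numbers:

```python
import re


def looks_like_modern_ic(ic: str) -> bool:
    """Naive check: 12 digits, with a plausible month and day in the yymmdd prefix."""
    if not re.fullmatch(r"\d{12}", ic):
        return False
    month = int(ic[2:4])
    day = int(ic[4:6])
    return 1 <= month <= 12 and 1 <= day <= 31


# The two dummy values from the question, plus one made-up well-formed number.
for candidate in ("123456", "112233445566", "850101145678"):
    print(candidate, looks_like_modern_ic(candidate))
```

Both dummy values fail: “123456” is too short, and “112233445566” would imply a birthday in month 22.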

Upon further analyzing the data, I went back to the original source files and noticed something strange: the account numbers belonged to some strange names. And then it made sense: this was Test data.

Test data in a Production Environment to be exact.

And when the database for the telco was dumped, the telcos hadn’t removed these test accounts from their system. So what we have is a bunch of dummy accounts, with dummy IC numbers.

SayaKenaHack.com

On the 19th of October, Lowyat.net reported that a user was selling the personal data of MILLIONS of Malaysians on their forum. Shortly after, the article was taken down on the request of the MCMC, only to be put up again a couple of days later.

Lowyat later reported that a total of 46.2 Million phone numbers were exposed, and the data included IC numbers, Addresses, IMSI, IMEI and SIM numbers as well. In short, a lot of data from a lot of people.

So Malaysia joined the ranks of the Philippines, Turkey and South Africa in having data on their entire population leaked on the internet. [Spoiler alert: This is not a good thing]

Where can I check?

You can head over to a site I created: sayakenahack.com to check if you’re part of the breach. So far I’ve loaded data from Maxis, Digi, Celcom and UMobile onto the site. I’ll be adding the smaller telcos later this week (stay tuned).

Medical council, etc…I’m still debating whether I should put that in. Maybe some doctors don’t want to be identified as doctors, so that data stays out for now.

Waah… That means you downloaded illegal data?

Technically yes, the data might be illegal. But any geek can find it online, it’s a google search away.

I’m just making the data available to the ‘normals’, people who don’t look around in hacker forums.

Plus all data is masked, so only the first 4 and last 2 digits of the phone number are visible. Which is almost as good as the masking of credit card numbers on your printed receipts.
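As a rough illustration of that rule, here is a small masking sketch. The function name, the asterisk as the mask character, and the handling of very short numbers are my assumptions for the example, not the exact code behind the site:

```python
def mask_phone(number: str, head: int = 4, tail: int = 2, fill: str = "*") -> str:
    """Keep the first `head` and last `tail` digits, and mask everything in between."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) <= head + tail:
        # Too short to leave anything hidden, so hide the whole thing.
        return fill * len(digits)
    hidden = len(digits) - head - tail
    return digits[:head] + fill * hidden + digits[-tail:]


print(mask_phone("012-345 6789"))
```

Because the masking happens before anything is written to the database, the hidden digits are simply never stored.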

I also don’t publish any names or addresses. If you’re unhappy with this, you should be unhappy with the Election Commission website that publishes your name in FULL on their website upon entering just an IC number. Similar to PTPTN etc.

Did you pay for the Data?

No. Contrary to what’s being reported the data is available for FREE online. Even the ‘hacker’ who was selling it on Lowyat was basically a re-seller.

I did not pay for the data, I would never validate the business case of reselling stolen data.

If I search for my IC, will you log my data?

No.

In technical terms, I’ve switched off logging for my API Gateway, CloudFront & Lambda.

If I wanted your data, I wouldn’t need you to search for it. I already have it.

OMG I’m breached !!! What can I do?

Unfortunately, there’s little you can do.

Your IC number is a permanent fixture of your life, and can’t be changed. This is bad design, but it’s the design we have at the moment.

If you lose your Phone Number, Credit Card details or E-mail address, you’d still have some form of mitigating the damage. But if someone gets your IC number, you can’t go to the NRD and get them to issue you a new one.

To be fair, IC numbers (in their modern form) are at least 25 years old, so I’m not blaming anyone, but the reality is that we should either stop using IC numbers so extensively, or find some way to make them mutable. Not an easy task, but until that happens the damage of this leak will continue… in perpetuity.

Now onto the good news!

The leak is from 2014, so the chances of you having the same phone is minuscule. I know of only one person whose phone is older than 3 years old, everybody else has changed their phone. So IMEI numbers (which are tied to your phones) from 2014 are pretty useless.

IMSI and SIM are almost the same as well. Over the past 3 years, I’m almost certain a large percentage of the victims (50-80%) would have had their SIM cards swapped, primarily from buying a new phone that required a micro or nano SIM, from porting telcos, or from just losing their phones.

What’s not so good is the fact that most people still keep their Name, Address and Phone Number. So those are the top 3 (4 if you count IC Numbers) data elements in the breach, and unfortunately they’re almost all there.

Where did the data come from?

Well……

The breach includes not just Telco data but Jobstreet and various other sources as well. Let’s just focus on Telco because that’s the big one.

There are only 2 possibilities on where the telco data came from:

  • Someone hacked into individual telcos and took it; or
  • Someone hacked a central source with all the data

Now, consider that all telcos are in this breach, including Altel, PLDT, Redtone, etc. Which self-respecting hacker, with the skills to hack Maxis, Digi and Celcom, is going to waste time on Altel? Really?!

Consider also, that if you downloaded the data, (which I obviously have), it’s clear as day where the leak came from. It’s so clear, Stevie Wonder can see where the data was leaked from.

I’m hoping over the next few days somebody somewhere will make an announcement.

In the mean-time stay safe Malaysia.

End notes and Special Thanks

Thanks to Bin Hong for alerting me that I had a few logs on the GitHub repository. I’ve torn down the old repo and created a new one.

Thanks to Ang YC for letting me know I gave too much info to folks.

Thanks to **rax***n for sharing the data on the *ahem* site.

Thanks to Ridhwan Daud for correcting my API spelling. (it’s case sensitive).

All data available on sayakenahack.com is available somewhere on the web. I’m just making sure that it’s not just geeks/hackers who have this data, but the average citizen can also be informed if they’re part of the leak.

I’m especially proud of the architecture underlying sayakenahack. It’s completely serverless, and I’ll make a post about it soon. But learning DynamoDB and about a gazillion AWS services to deploy this was both fun and tiring.

For now, you can build your own version of sayakenahack with the data, by using the api at:

https://sayakenahack.com/api/v1/pwn?icNum=12345

I’ve changed the API many times. I promise this version will be stable for the next 3 months.

The API is CORS enabled, so you can call it with JavaScript in your browser. There’s only one endpoint for now; I’ll be documenting the API and will publish the documentation soon.
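If you would rather script it than use a browser, a minimal Python sketch of a client could look like this. The endpoint and the case-sensitive icNum parameter come from the URL above; the helper names are made up, and treating the response as JSON is an assumption, since the response format is not documented here:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://sayakenahack.com/api/v1/pwn"


def build_url(ic_num: str) -> str:
    """Build the query URL; the parameter name icNum is case sensitive."""
    return f"{API_BASE}?{urlencode({'icNum': ic_num})}"


def lookup(ic_num: str, timeout: float = 10.0):
    """Call the API and decode the body, assuming it returns JSON."""
    with urlopen(build_url(ic_num), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the URL is built here; call lookup("12345") to actually hit the API.
print(build_url("12345"))
```

Please throttle your requests if you loop over this, since the database behind it has a low read cap.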

I spent a good 40+ hours building all of this, and the code is mostly available on my Git repository. A couple of elements aren’t there (the Lambda function to query DynamoDB), but I’ll upload that when time permits.