
GovTLS Audit: Architecture

Last month, I embarked on a new project called GovTLS Audit, a simple(ish) program that would scan 1000+ government websites to check their TLS implementation. The code would go through a list of hostnames and scan each host for TLS implementation details like redirection properties, certificate details, and HTTP headers, even stitching together Shodan results into a single comprehensive data record. That record would then be inserted into DynamoDB and exposed via a REST endpoint.

Initially, I ran the scans manually on Sunday nights, uploaded the output files to S3 buckets, and ran the scripts to insert them into the DB.

But 2 weeks ago I decided to automate the process, and the architecture of this simple project is now complete(ish!). Nothing is ever complete, but this is a good checkpoint for me to begin documenting the architecture of GovTLS Audit (sometimes called siteaudit), and to share it.

What is GovTLS Audit

First, let's talk about what GovTLS Audit is — it's a Python script that scans a list of sites on the internet and stores the results in 3 different files: a CSV file (for human consumption), a JSONL file (for insertion into DynamoDB) and a JSON file (for other programmatic access).
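
As a rough illustration of the three output formats (the field names below are placeholders, not the real schema):

```python
import csv
import json

# Illustrative record; the real scanner collects many more fields
records = [{
    "hostname": "example.gov.my",
    "https": True,
    "cert_valid": True,
    "server_header": "Apache",
}]

# CSV for human consumption
with open("scan.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

# JSONL: one JSON object per line, convenient for loading into DynamoDB
with open("scan.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Plain JSON for other programmatic access
with open("scan.json", "w") as f:
    json.dump(records, f, indent=2)
```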

A different script then reads the JSONL file, loads each row into the database (DynamoDB), and uploads the 3 files as a single zip to an S3 bucket.
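
A minimal sketch of that loader, assuming boto3 and placeholder table/bucket names (not the actual deployed code):

```python
import json
import zipfile
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

table = dynamodb.Table("siteaudit")   # placeholder table name
bucket = "siteaudit-scans"            # placeholder bucket name

# Load each JSONL row into DynamoDB using the batch writer
with table.batch_writer() as batch, open("scan.jsonl") as f:
    for line in f:
        batch.put_item(Item=json.loads(line))

# Zip the three output files and upload the archive to S3
with zipfile.ZipFile("scan.zip", "w", zipfile.ZIP_DEFLATED) as z:
    for name in ("scan.csv", "scan.jsonl", "scan.json"):
        z.write(name)

s3.upload_file("scan.zip", bucket, "scans/scan.zip")
```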

On the 'server-side' there are 3 Lambda functions, all connected to an API Gateway endpoint resource (a minimal sketch of one of these handlers follows the list):

  • One that queries the latest details for a site [/siteDetails]
  • One that queries the historical summaries for a site [/siteHistory]
  • One that lists all scans (zip files) in the S3 bucket [/listScans]
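
As an illustration, the /listScans handler could be as small as the sketch below; the bucket name comes from a hypothetical environment variable, and this is not the actual deployed code:

```python
import json
import os
import boto3

s3 = boto3.client("s3")
BUCKET = os.environ.get("SCAN_BUCKET", "siteaudit-scans")  # placeholder bucket name


def handler(event, context):
    """Return the keys of all scan zip files in the S3 bucket."""
    response = s3.list_objects_v2(Bucket=BUCKET)
    keys = [obj["Key"] for obj in response.get("Contents", [])
            if obj["Key"].endswith(".zip")]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"scans": keys}),
    }
```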

Finally, there's a separate S3 bucket to serve the 'website', but that's just a simple HTML file with some JavaScript to list all the scan files available for download.


Read this before GE14

Let’s start this post the same way I start my day — by looking at Facebook.

Facebook made $40 billion in revenue in 2017, solely from advertising to pure schmucks like you. The mantra among the more technically literate is that Facebook doesn't have users; it has products that it sells to advertisers. It just so happens that all its products are homo-sapien, smartphone-toting urbanites (just like you!).

The platform's meteoric rise from nobody to top dog is a dream story in Silicon Valley, but underneath the veneer of wholesome innovation lies a darker secret, one that could be responsible for the polarization of entire communities, including our own. And it's all because of their most valuable employee.

No, not Mark Zuckerberg, but the real genius behind the blue and white site. The one responsible for the billions in ad revenue Facebook generates yearly, and unsurprisingly, she's female.

Anna Lytica and Machine Learning

There are probably thousands of posts your Facebook friends make every day, but she decides which 3 fit onto your smartphone screen first, and the next 3, and so forth. From the millions of videos shared every hour, she painstakingly picks the few you'll see in your timeline. She decides which ads to show you, and which advertisers to sell you to. Underneath the hood of the giant ad behemoth, she lies working all day, every day.

She isn't a person; 'she' is an algorithm, a complex program that does billions of calculations a second, and for this post we'll give her the name… Anna Lytica.

Facebook doesn't talk about her much; she is, after all, a trade secret (sort of). But what she does and how she does it might be as much a mystery to us as it is to Mr. Zuckerberg. Machine learning algorithms are complex things: we know how to build them and train them, but how they actually work is sometimes beyond our understanding.

Google can train AlphaGo to play a game, but how it makes decisions is unknown to Google and even to itself — it just IS a Go player. And it is really sad when we watch these AI algorithms make amazing discoveries but are unable to explain their rationale to us mere humans. It's the reason why Watson, IBM's big AI algorithm, hasn't taken off in healthcare: there's no point recommending a treatment for cancer if the algorithm can't explain why it chose that treatment in the first place.

This is hard to grasp, but AI isn't just a 'very powerful' program; AI is something else entirely. We don't even use traditional words like write or build to refer to the process of creating them (like we do regular programs); instead, we use the word train.

We train an algorithm to play Go, to drive, or to treat cancer. We do this the same way we breed dogs: we pick specimens with the traits we want, and breed them till we end up with something that matches our desires. How a dog works, and what a dog thinks, is irrelevant. If we want them big, we simply breed the biggest specimens; the process is focused entirely on outcome.

Similarly, how the algorithm behaves is driven by what it was trained to do. How it works is irrelevant; all that matters is the outcome. Can it play Go? Can it drive? Can it answer Jeopardy? If you want to understand an algorithm, you need to know what it was trained to do.

Anna Lytica was trained to keep you browsing Facebook; after all, the company's other endeavors, like internet.org and Instant Articles, were built with the same intention. And while good ol' Mark stated that he's tweaking Anna to reduce the time people spend on Facebook, this is something new, an exception to the years Facebook tweaked her to keep you on their site.

After all, the average monthly user spends 27 minutes per day in the app, and if you go by daily users, they spend about 41 minutes per day on Facebook. If that's the end result of tweaking Anna to ensure we spend less time on Facebook — God help us all!

And while it's difficult to understand how Anna works, it's very easy to guess how she'll behave. If the end result of Anna's training is to keep you browsing Facebook, then human psychology reveals a simple trait all humans share: confirmation bias.


Gov.My TLS audit: Version 2.0

Last week I launched a draft of the Gov.my Audit, and this week we have version 2.0.

Here’s what changed:

  1. Added More Sites. We now scan a total of 1324 government websites, up from just 1180.
  2. Added Shodan Results. Results now include both the open ports and the time of the Shodan scan (scary shit!)
  3. Added Site Title. Results now include the HTML title to give a better description of the site (hopefully!).
  4. Added Form Fields. If the page on the root directory has an input form, the names of the fields will appear in the results. This allows for a quick glance at which sites have forms, and (roughly!) what the forms ask for (search vs. IC Numbers).
  5. Added Domain in the CSV. The CSV is sorted by hostname, to allow for grouping by domain names (e.g. view all sites from selangor.gov.my or perlis.gov.my).
  6. Added an API. Now you can query the API and get more info on the site, including the cert info and HTTP headers (see the sketch after this list).
  7. Released the serverless.yml files for you to build the API yourself as well 🙂
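
As a rough idea of how you might call the API from Python (the base URL and query parameter below are placeholders, not the real endpoint):

```python
import requests

# Placeholder base URL; substitute the actual API Gateway endpoint
BASE_URL = "https://api.example.com"

# Latest scan details for a single site
details = requests.get(f"{BASE_URL}/siteDetails", params={"site": "example.gov.my"})
print(details.json())

# Historical summaries for the same site
history = requests.get(f"{BASE_URL}/siteHistory", params={"site": "example.gov.my"})
print(history.json())

# All scan zip files available for download
print(requests.get(f"{BASE_URL}/listScans").json())
```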

All in all, it’s a pretty bad-ass project (if I do say so myself). So let’s take all that one at a time.


I scanned 1000 government sites, what I found will NOT shock you

Previously, I moaned about dermaorgan.gov.my, a site that was probably hacked but was still running without basic TLS. It is unacceptable that in 2018 we have government-run websites that ask for personal information, running without TLS.

So I decided to check just how many .gov.my sites actually implement TLS, and how many would start being labeled 'not secure' by Google in July. That's right, Google will start naming and shaming sites without TLS, so I wanted to give .gov.my sites a heads-up!

Why check for TLS?

TLS doesn't guarantee a site is secure (nothing does!), but a site without TLS signals a lack of care from the administrator. The absence of TLS is an indicator of just how lightly the security of the servers has been taken.

Simply put, TLS is necessary but not sufficient for security — and since it's the easiest thing to detect without running intrusive network scans, it seems like the best place to start.

How I checked for TLS

But first I needed a list of .gov.my sites.

To do that, I wrote a web crawler that started with a few .gov.my links and stored the results. It then repeated the process for the links, the links of the links… and so forth. After 3 iterations, I ended up with 20,000 links from 3,000+ individual hostnames (a word I wrongly use in place of FQDN, but since the code uses hostnames, I'm sticking with it for now — please forgive me, networking nerds).
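
For the curious, a minimal sketch of that kind of depth-limited crawl, using requests and BeautifulSoup; the seed URL is a placeholder and this isn't the actual crawler code:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

# Placeholder seed; the real crawler started from a few .gov.my links
seeds = ["https://www.malaysia.gov.my"]
frontier = list(seeds)
seen = set(seeds)

for _ in range(3):                                  # three iterations of link-following
    next_frontier = []
    for url in frontier:
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                next_frontier.append(link)
    frontier = next_frontier

hostnames = {urlparse(u).hostname for u in seen}
print(f"{len(seen)} links across {len(hostnames)} hostnames")
```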

I then manually filtered the hostnames to those on a .gov.my or .mil.my domain, and scanned each of them for a few things (a rough sketch of these checks follows the list):

  • Does it have an HTTPS website (if it doesn't redirect)
  • Does it redirect users from HTTP to HTTPS
  • Does the HTTPS site have a valid certificate
    • Does it match the hostname
    • Is the certificate fresh (not expired)
    • Can the certificate be validated — this requires all intermediary certs to be present
  • What is the IP of the site
  • What is the ASN of the IP
  • What are the Server & X-Powered-By headers returned by the host
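
Here's a minimal sketch of how a few of these checks could be implemented with Python's ssl module and the requests library; it's illustrative rather than the actual scanner code, and the example hostname is just a placeholder:

```python
import socket
import ssl

import requests


def check_tls(hostname):
    """A rough sketch of a few of the checks listed above."""
    result = {"hostname": hostname}

    # Does the plain-HTTP site redirect users to HTTPS?
    try:
        r = requests.get(f"http://{hostname}", timeout=10, allow_redirects=True)
        result["redirects_to_https"] = r.url.startswith("https://")
    except requests.RequestException:
        result["redirects_to_https"] = None

    # Can we complete a TLS handshake with a certificate that validates
    # (full chain present, not expired) and matches the hostname?
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((hostname, 443), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
                cert = tls.getpeercert()
        result["valid_cert"] = True
        result["cert_expires"] = cert.get("notAfter")
    except (ssl.SSLError, OSError):
        result["valid_cert"] = False

    # Server and X-Powered-By headers returned by the host
    # (verify=False so we still see headers from sites with broken certs)
    try:
        headers = requests.head(f"https://{hostname}", timeout=10, verify=False).headers
        result["server"] = headers.get("Server")
        result["x_powered_by"] = headers.get("X-Powered-By")
    except requests.RequestException:
        pass

    return result


print(check_tls("www.malaysia.gov.my"))   # placeholder example host
```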

Obviously, as I was coding this, my mind got distracted and I actually collected quite a bit more data, but those fields are in the CSV for you to Excel the shit out of! The repository contains both a JSON and a JSONL file with more data.

Now onto the results


Another Day, Another Breach

220,000 is a lot of people. It’s the population of a small town like Taiping, and roughly twice the capacity of Bukit Jalil Stadium.

Yet today, a data breach of this size barely registers in the news cycle. After all, the previous data breach was 200 times bigger and occurred just 3 months ago. How could we take seriously something that occurs so frequently, and on a scale very few comprehend?

Individually, each breach is not particularly damaging; it's a thin thread of data about the victims. But they do add up: criminals use multiple breaches and stitch together a fabric of the victim's identity, eventually becoming able to forge credit card applications in their name, or to run typical scams.

But if you're thinking of avoiding being in a breach, that's an impossible task. The only Malaysians who weren't part of the telco breach were those without mobile phones. In the organ donor leak, the victims were kind-hearted souls who were innocent bystanders in the war between attackers and defenders on the internet.

The only specific advice that would work would be to not subscribe to mobile phone accounts and not pledge your organs. That is not useful advice.

I wanted this post to be about encouraging people to stop worrying about data breaches and move on with their lives. To accept that the price of living in a hyper-connected world is that you'll be a data breach victim every now and then — I wanted to demonstrate this by actually going out and pledging my organs, to show that we shouldn't be afraid.

But when I went to the Malaysian organ donation website (dermaorgan.gov.my), I was greeted by the all too common "Connection is Not Secure" warning. Which just made my head spin!


Writing Millions of Rows into DynamoDB

While designing sayakenahack, the biggest problem I faced was trying to write millions of rows efficiently into DynamoDB. I slowly worked my way up from 100 rows/second to around the 1500 rows/second range, and here’s how I got there.

Work with Batch Write Item

The first mistake I made was a data modelling error. Sayakenahack was supposed to take a single field (IC Number) and return all the phone numbers in the breach for that IC. So I initially modeled the phone numbers as an array within an item (what you'd call a row in regular DB speak).

Strictly speaking, this is fine; DynamoDB has an update command that allows you to update/insert an existing item. The problem is that you can't batch an update command: each update command can only update/insert one item at a time.
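
To make the bottleneck concrete, here's a sketch of that kind of per-row update loop with boto3 (the table and attribute names are hypothetical, not the actual sayakenahack schema):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("sayakenahack")   # placeholder table name


def insert_slow(rows):
    """One update_item call per row -- the approach that topped out around 100 rows/second."""
    for row in rows:
        table.update_item(
            Key={"ic_number": row["ic_number"]},
            # Append this row's phone number to a list attribute on the item
            UpdateExpression=(
                "SET phone_numbers = "
                "list_append(if_not_exists(phone_numbers, :empty), :new)"
            ),
            ExpressionAttributeValues={":new": [row["phone_number"]], ":empty": []},
        )
```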

Running a script that updated one row in DynamoDB at a time was painfully slow: around 100 items/second on my machine. Even when I copied that script to an EC2 instance in the same datacenter as the DynamoDB table, I got no more than 150 items/second.

At that rate, a 10 million row file would take nearly 18 hours to insert. That wasn’t very efficient.

So I destroyed the old paradigm, and re-built.

Instead of phone numbers being arrays within an item, each phone number became an item of its own. I kept IC Number as the partition key (which isn't what Amazon recommends), which allowed me to query for an IC Number and get back an array of items.
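
With that model, a single query on the partition key returns every phone-number item for an IC. A minimal boto3 sketch, again with placeholder table and attribute names:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("sayakenahack")   # placeholder table name

# One item per phone number, partitioned by IC number, so a single
# query returns every phone number associated with that IC
response = table.query(KeyConditionExpression=Key("ic_number").eq("800101-01-1234"))
phone_numbers = [item["phone_number"] for item in response["Items"]]
```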

This allowed me to use DynamoDB's batch_write_item functionality, which handles up to 25 requests at once (up to a maximum of 16MB). Since my items weren't anywhere near 16MB, I would theoretically get a 25-fold increase in speed.

In practice though, I got 'just' a 10-fold increase, allowing me to write 1,000 items/second instead of 100. This meant I could push through a 10 million row file in under 3 hours.
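
For reference, the batched load looks roughly like this with boto3: batch_writer wraps batch_write_item, buffering puts into batches of 25 and resending any unprocessed items (table and attribute names are placeholders):

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("sayakenahack")   # placeholder table name


def insert_fast(jsonl_path):
    """Batch the writes: each line of the JSONL file becomes one put_item,
    and batch_writer flushes them as batch_write_item calls of up to 25 items."""
    with table.batch_writer() as batch, open(jsonl_path) as f:
        for line in f:
            batch.put_item(Item=json.loads(line))
```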

First rule of thumb when trying to write lots of rows into DynamoDB: make sure the data is modeled so that you can batch insert; anything else is painfully slow.