Gov TLS Audit has a website!

Gov TLS Audit finally has a website to complement the API.

I used the services of a guy from fiverr to code the site, it isn’t the best design in the world, but it’s good enough for now. The site allows you to query a site and view the historical details of a particular .gov.my website. The full list of .gov hostnames can be found here.

It also links to the full daily scan outputs (in csv, json and jsonl formats) if you wish to download and do more analysis. Please note, the csv output has some errors that I’ve not had time to fix, best you use the jsonl or json file, which don’t have errors, but also have much more details.

It’s hosted on https://gov-tls-audit.sayakenahack.com/, because I’m not made out of money and can’t afford another domain just for this side-project. Strangely this is blocked by most locals ISPs, which is bad, I’m going to write a letter to ISPs, since the block request from PDP was very specific and didn’t include subdomains of sayakenahack.

Hope you like it, leave comments below for feedback, or email me keith [at] keithrozario.com

First I deleted my most popular tweet — then I deleted 2000 more.

Two weeks ago, I rage-tweeted something regarding Malaysian politics that got a lot more viral than I liked (I’ve censored out the profanity for various reasons, most notably, there are teenagers who read this blog). It was a pointless collection of 200 characters, that somehow resonated with people enough to be shared across social media. Obviously, since it was me, the tweet was filled with a small collection of profanities, and laced with just the right amount of emotive content 🙂

But then things started getting bad.

Soon after I tweeted, I received messages from folks I hadn’t met in decades, showing me screenshots of their whatsapp group that had my tweet — my wife’s chinese speaking colleagues were showing it to her at work — I checked, and nearly 2,000 people retweeted it, which isn’t typical for me, and frankly speaking pretty scary.

As much as I’d like to have my content shared, the tweet in question is nothing but couple of crude words pieced together in a ‘rage-tweet’. And I understand that it emotionally resonates with folks who are angry, but if this the level of discourse we’re having on  Malaysian social media, we should be alarmed. Completely pointless rants being viralled is not how we ubah, it is the absolute opposite of how we ubah!

Research on the virality of articles from the New York Times showed that ‘angry’ content was more viral than any other, beating out awe, surprise and even practical value. The angrier the content, the more likely it would be shared. A rage-tweet is more likely to go viral than something like fuel-saving tips, even though the latter clearly is more valuable to readers.

At this point, I’d rant about how the media has a responsibility to look beyond clicks and ads, and to think about the impact of their content on society, but since I owned the tweet, I simply deleted it. Of course, I can’t stop the screen-shots being shared across whatsapp, but we do what we can.

Deleting your tweets

That got me thinking, twitter is a cesspool of angry farts screaming at each other, and that has some value.

But while, what I tweet today, may be relevant and acceptable today, it may not be 2-3 years from now. Kinda like how Apu from the Simpsons was acceptable and non-offensive in the 90’s.

I’m ashamed to say it, but I once thought that Michael Learns to Rock was a great rock band, in context, thats acceptable for a 12 year old 2 decades ago, before even Napster or Limewire. Of course, as a adult in 2018, I’m thoroughly aware that AC/DC are the greatest rock band ever, and Michael Learns to Rock, well they’re not exactly Denmark’s best export.

And that’s the problem, twitter removes context  — it’s very easy to take a 140 character tweet from 5 years ago out of context. Nobody cares about context on a platform that limits users to 140 characters (or 280 characters since end 2017). Maybe you quoted an article from TheMalaysianInsider, which, guess what, no longer exist. Context is rather fluid on twitter, and it changes rapidly over weeks, let alone the years from your first tweet.

For example,  this tweet from Bersatu’s Wan Saiful:

Gee, I wonder who he was talking about, a simple internet search will give you the answer, but that’s not the point.

Wan Saiful changed his opinion,  and he’s explained why, people should be allowed to change their mind.Freedom to change your opinion not just perfectly fine, it’s a per-requisite for progress.If we allow our tweet history to be a ball-and-chain that ties us to our old idealogy, how could we ever progress? Everybody changes their mind — and that’s OK.

The point is twitter should not be a historical archive — it should be current. A great place to have an informed discussion of current affairs, but not a place to keep old, out-dated and out of context material floating around.

Hence, I decided to delete all my tweets that were older than 90 days old, and here’s how. Continue reading

Gov TLS Audit : Architecture

Last Month, I embarked on a new project called GovTLS Audit, a simple(ish) program that would scan 1000+ government websites to check for their TLS implementation. The code would go through a list of hostnames, and scan each host for TLS implementation details like redirection properties, certificate details, http headers, even stiching together Shodan results into a single comprehensive data record. That record would inserted into a DynamoDB, and exposed via a rest endpoint.

Initially I ran the scans manually Sunday night, and then uploaded the output files to S3 Buckets, and ran the scripts to insert them into the DB.

But 2 weeks ago, I decided to Automate the Process, and the architecture of this simple project is complete(ish!). Nothing is ever complete, but this is a good checkpoint, for me to begin documenting the architecture of GovTLS Audit (sometimes called siteaudit), and for me to share.

What is GovTLS Audit

First let’s talk about what GovTLS Audit is — it’s a Python Script that scans a list of sites on the internet, and stores the results in 3 different files, a CSV file (for human consumption), a JSONL file (for insertion into DynamoDB) and a JSON file (for other programmatic access).

A different script then reads in the JSONL file and loads each row into database (DynamoDB), and then uploads the 3 files as one zip to an S3 bucket.

On the ‘server-side’ there are 3 lambda functions, all connected to an API Gateway Endpoint Resource.

  • One that Queries the latest details for a site [/siteDetails]
  • One that Queries the historical summaries for the site [/siteHistory]
  • One that List all scan (zip files) in the S3 Bucket [/listScans]

Finally there’s a separate S3 bucket to serve the ‘website’, but that’s just a simple html file with some javascript to list all scan files available for download. In the End, it looks something like this (click to enlarge):


Continue reading

Read this before GE14

Let’s start this post the same way I start my day — by looking at Facebook.

Facebook made $40 Billion dollars in revenue in 2017, solely from advertising to pure schmucks like you. The mantra among the more technically literate is that facebook doesn’t have users it has products that it sells to advertisers, it just so happens that all its products are homo-sapien smart-phone totting urbanites (just like you!)

The platforms meteoric rise from nobody to top-dog, is a dream-story in Silicon Valley, but underneath the veneer of wholesome innovation lies a darker secret, one that could be responsible for the polarization of entire communities, including our own. And it’s all because of their most valuable employee.

No, not Mark Zuckerberg, but the real genius behind the blue and white site. The one responsible for billions of ad revenue facebook generates yearly, and unsurprisingly she’s female.

Anna Lytica and Machine Learning

There’s probably thousands of post your facebook friends make everyday, but she decides which 3 to fit onto your smartphone screen first, and the next 3 and so forth. From the millions of videos shared every hour, she painstakingly picks the few you’d see in your timeline, she decides which ads to show you, and which advertisers to sell you too, underneath the hood in the giant ad behemoth, she lies working all day, everyday.

She isn’t a person, ‘she’ is an algorithm, a complex program that does billions of calculations a second, and for this post we’ll give her the name… Anna Lytica.

Facebook doesn’t talk about her much, she is after all a trade secret (sort of), but what she does and how she does it, might be as much a mystery to us, as it is to Mr. Zuckerberg. Machine Learning algorithms are complex things, we know how to build them, and train them, but how they actually work is sometimes beyond our understanding.

Google can train Alpha-Go to play a game, but how it makes decisions is unknown to Google and even itself — it just IS a Go player.And it is really sad, when we watch these AI algorithms make amazing discoveries, but are unable to explain their rationale to us mere humans. It’s the reason why Watson, IBMs big AI algorithm, hasn’t taken off in healthcare, there’s no point recommending a treatment for cancer, if the algorithm can’t explain why it chose the treatment in the first place.

This is hard to grasp, but AI isn’t just a ‘very powerful’ program, AI is something else entirely. We don’t even use traditional words like write or build to refer to the process of creating them (like we do regular programs), instead we use the word train.

We train an algorithm to play Go, to drive, or to treat cancer. We do this the same way we breed dogs, we pick specimens with the traits we want, and breed them till we end up with a something that matches our desires. How a dog works, and what a dog thinks is irrelevant. If we want them big, we simply breed the biggest specimens, the process is focused entirely on outcome.

Similarly, how the algorithm behaves is driven by what it was trained to do. How it works is irrelevant, all that matters is outcome. Can it play Go, can it drive, can it answer jeopardy? If you want to understand an algorithm you need to know what it was trained to do.

Anna Lytica, was trained to keep you browsing Facebook, after all the companies other endeavors like internet.org, and instant articles were built with the same intention. And while good ol’ Mark stated that he’s tweaking Anna to reduce the time people spend on Facebook, this is something new, an exception to the years Facebook tweaked her to keep you on their site.

After all the average monthly user spends 27 minutes per day in the app, and if you go by daily users, they spend about 41 minutes per day on Facebook. If that’s the end-result of tweaking Anna to ensure we spend less time on Facebook — God help us all!

And while it’s difficult to understand how Anna works, its very easy to guess how she’ll behave. If the end result of Anna’s training is to keep you browsing Facebook, then human psychology reveals a simple trait all humans share — confirmation bias. Continue reading