UniSuper and Google Cloud Platform

I know a lot of enterprise cloud customers have been watching the recent incident with Google Cloud (GCP) and UniSuper. For those of you who haven’t seen it: UniSuper is an Australian pension fund firm which had their services hosted on Google Cloud. For some weird reason, their private cloud project was completely deleted. Google’s postmortem of the project is here: https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident . Fascinating reading – in particular what surprises me is that GCP takes full blame for the incident. There must be some very interesting calls occurring with Google and their other enterprise customers.

There’s some fascinating morsels to consider in Google’s postmortem of the incident. Consider this passage:

Data backups that were stored in Google Cloud Storage in the same region were not impacted by the deletion, and, along with third party backup software, were instrumental in aiding the rapid restoration.

https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident

Fortunately for UniSuper, the data in Google Cloud Storage didn’t seem to be affected and they were able to restore from there. But it looks like UniSuper also had a another set of data stored with another cloud. The following is from UniSuper’s explanation of the event at: https://www.unisuper.com.au/contact-us/outage-update .

UniSuper had backups in place with an additional service provider. These backups have minimised data loss, and significantly improved the ability of UniSuper and Google Cloud to complete the restoration.

https://www.unisuper.com.au/contact-us/outage-update

Having a full set of backups with another service provider has to be terrifically expensive. I’d be curious to see a discussion of who the additional service provider is and a discussion of the costs. I also wonder if the backup cloud is live-synced with the GCP servers or if there’s a daily/weekly sync of the data to help reduce costs.

The GCP statement seems to say that the restoration was completed with just the data from Google Cloud Storage, while the UniSuper statement is a bit more ambiguous – you could read the statement as either (1) the offsite data was used to complete the restoration or (2) the offsite data was useful but not vital to the restoration effort.

Interestingly, a HN comment indicates that the Australian financial regulator requires this multi-cloud strategy: https://www.infoq.com/news/2024/05/google-cloud-unisuper-outage/ .

I did a quick dive to figure out where these requirements are coming from, and from the best that I could tell, these requirements come from the APRA’s Prudential Standard CPS 230 – Operational Risk Management document. Here’s some interesting lines from there:

  1. An APRA-regulated entity must, to the extent practicable, prevent disruption to
    critical operations, adapt processes and systems to continue to operate within
    tolerance levels in the event of a disruption and return to normal operations
    promptly once a disruption is over.
  2. An APRA-regulated entity must not rely on a service provider unless it can ensure that in doing so it can continue to meet its prudential obligations in full and effectively manage the associated risks.
Australian Prudential Regulation Authority (APRA) – Prudential Standard CPS 230 Operational Risk Management

I think the “rely on a service provider” is the most interesting text here. I wonder if – by keeping a set of data on another cloud provider – UniSuper can justify to the APRA that it’s not relying on any single cloud provider but instead has diversified its risks.

I couldn’t find any discussion about the maximum amount of downtime allowed, so I’m not sure where the “4 week” tolerance from the HN comment came from. Most likely that is from industry norms. But I did find some text about tolerance levels of disruptive events:

  1. 38. For each critical operation, an APRA-regulated entity must establish tolerance levels for:
    (a) the maximum period of time the entity would tolerate a disruption to the
    operation
Australian Prudential Regulation Authority (APRA) – Prudential Standard CPS 230 Operational Risk Management

It’s definitely interesting to see how requirements for enterprise cloud customers grow from their regulators and other interested parties. There’s often some justification underlying every decision (such as duplicating data across clouds) no matter how strange it seems at first.

APRA History On The Cloud

While digging into this subject, I found it quite interesting to trace how the APRA changed its tune about cloud computing over the years. As recently as 2010, the APRA felt the need to, “emphasise the need for proper risk and governance processes for all outsourcing and offshoring arrangements.” Here’s an interesting excerpt from their 2010 letter sent to all APRA-overseen financial companies:

Although the use of cloud computing is not yet widespread in the financial services industry, several APRA-regulated institutions are considering, or already utilising, selected cloud computing based services. Examples of such services include mail (and instant messaging), scheduling (calendar), collaboration (including workflow) applications and CRM solutions. While these applications may seem innocuous, the reality is that they may form an integral part of an institution’s core business processes, including both approval and decision-making, and can be material and critical to the ongoing operations of the institution.
APRA has noted that its regulated institutions do not always recognise the significance of cloud computing initiatives and fail to acknowledge the outsourcing and/or offshoring elements in them. As a consequence, the initiatives are not being subjected to the usual rigour of existing outsourcing and risk management frameworks, and the board and senior management are not fully informed and engaged.

https://www.apra.gov.au/sites/default/files/Letter-on-outsourcing-and-offshoring-ADI-GI-LI-FINAL.pdf

While the letter itself seems rather innocuous, it seems to have had a bit of a chilling effect on Australian banks: this article comments that, “no customers in the finance or government sector were willing to speak on the record for fear of drawing undue attention by regulators“.

An APRA document published on July 6, 2015 seems to be even more critical of the cloud. Here’s a very interesting quote from page 6:

In light of weaknesses in arrangements observed by APRA, it is not readily evident that risk management and mitigation techniques for public cloud arrangements have reached a level of maturity commensurate with usages having an extreme impact if disrupted. Extreme impacts can be financial and/or reputational, potentially threatening the ongoing ability of the APRA-regulated entity to meet its obligations.

https://www.apra.gov.au/sites/default/files/information-paper-outsourcing-involving-shared-computing-services_0.pdf

Then just three years later, the APRA seems to be much more friendly to cloud computing. A ComputerWorld article entitled “Banking regulator warms to cloud computing” published on September 24, 2018 quotes the APRA chair as acknowledging, “advancements in the safety and security in using the cloud, as well as the increased appetite for doing so, especially among new and aspiring entities that want to take a cloud-first approach to data storage and management.

It’s curious to see the evolution of how organizations consider the cloud. I think UniSuper/GCP’s quick restoration of their cloud projects will result in a much more friendly environment toward the cloud.

Facebook Outbound Email

I’m in the middle of testing an email server on Compute Engine, and I noticed something unusual: apparently Facebook’s outbound email servers insist on using extended SMTP to send email.

With extended SMTP, an email server sends email by opening up a connection and sending the EHLO command. The proper response is either 250 (to indicate success and that extended SMTP support is available) or 550 (the responding server did not understand the command, which is another way of saying that the responding server does not support ESMTP). In case of 550 errors, the usual practice is to fall back to the original SMTP command set and to send a HELO request.

But Facebook’s outbound mail servers seem to only want to connect with ESMTP servers: FB mail servers send a quit command instead of falling back to sending a HELO command.

Another interesting oddity from watching mail logs: Google’s Gmail servers seem to be the only mail servers properly implementing the BDAT command (binary data). I never see any other mail servers attempt to use it.

Removing EXIF Data With ImageMagick

Recently I needed to find a way to mass-remove EXIF data from multiple images. EXIF – if you didn’t know – is essentially metadata included within images. This metadata can include the name of the camera, the GPS coordinates where the picture was taken, comments, etc.

The easiest way is to remove EXIF data is to download ImageMagick and use the mogrify command line tool:

mogrify -strip /example/directory/*

ImageMagick can be run within a Compute Engine VM to make it available to a web application. Here’s the command to install ImageMagick:

sudo aptitude install imagemagick

Communicating With Sockets On Compute Engine

Writing custom applications on Compute Engine requires the use of sockets to communicate with clients. Here’s a code example demonstrating how to read and write to sockets in Java.

Assume that client_socket is the socket communicating with the client:

/**
 * Contains the socket we're communicating with.
 */
Socket client_socket;

Create a socket by using a ServerSocket (represented by server_socket ) to listen to and accept incoming connections:

Socket client_socket = server_socket.accept();

Then extract a PrintWriter and a BufferedReader from the socket. The PrintWriter out object sends data to the client, while the BufferedReader in object lets the application read in data sent by the client:

/**
 * Handles sending communications to the client.
 */
PrintWriter out;
/**
 * Handles receiving communications from the client.
 */
BufferedReader in;
try {
    //We can send information to the client by writing to out.
    out = new PrintWriter(client_socket.getOutputStream(), true);
    //We receive information from the client by reading in.
    in = new BufferedReader(new InputStreamReader(client_socket.getInputStream()));
    /**
     * Start talking to the other server.
     */
    //Read in data sent to us.
    String in_line = in.readLine();
    //Send back information to the client.
    out.println(send_info);
}//end try
catch (IOException e) {
    //A general problem was encountered while handling client communications.
}

A line of text sent by the client can be read in by calling in.readLine() demonstrated by the in_line string. To send a line of text to the client, call out.println(send_info) where send_info represents a string. If an error occurs during communication, an IOException will be thrown and caught by the above catchstatement.

Remember to add the following imports:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.BufferedReader;

Releasing A Static IP In Compute Engine

Compute Engine allows static IPs to be reserved, even if no VM is attached. However, IPs that are unattached to VMs or load balancers are charged at $0.01 per hour. To avoid this charge, it’s a good idea to release IPs that are no longer in use.

To release a static IP, first go to the Compute Engine section of the Google Cloud Console. Select the Networks option:

A list of the IPs allocated to the project will be displayed:

Select the IP to remove and click the link labeled Release IP address:

You’ll see a confirmation box next. Press Yes.

The console will need a few moments to release the IP:

Measuring Elapsed Time: System.nanoTime()

When I need to measure elapsed time (for example, timing how long an API call takes) I prefer to use System.nanoTime() instead of using the usual method of (new Date()).getTime().

The key idea to remember is that nanoTime() doesn’t return a timestamp – it represents a measurement of time from some arbitrary point (so for example nanoTime() could return a negative figure). This helps when writing self-documenting code: a variable storing a nanoTime() value can only be used for elapsed time comparisons, not timestamp recording.

Here’s an example of measuring elapsed time with System.nanoTime(). First, record the start time:

/**
 * Records the start time 
 * using System.nanoTime()
 */
Long start_time;
//Record the start time.
start_time = System.nanoTime();

Insert some time-consuming operation here, or another long-running call. Place this code where you want to end the timer:

//Calculate how long we've been running in nanoseconds.
Long diff_time = System.nanoTime() - start_time;

The variable diff_time stores the number of nanoseconds that has elapsed during the timer. Now suppose you wanted to throw an exception if the timer took more than 30 seconds (30 billion nanoseconds); here’s the example code:

//We set the maximum time in nanoseconds, multiplied by milliseconds, 
//multiplied by seconds.
Long MAXIMUM_TIME_OPEN = new Long(1000000L * 1000 * 30);
//If the runtime of this operation is longer than the time we've set, 
//throw an Exception.
if (diff_time > MAXIMUM_TIME_OPEN) {
    throw new IOException("Timeout has been exceeded.");
}

To keep the code easy to understand, we’re showing how the maximum time amount is computed: there are 1 million (1,000,000) nanoseconds in a millisecond, multiplied by 1 thousand milliseconds in a second (1,000), multiplied by 30 seconds.

Google Compute Engine

Google’s Compute Engine just released into General Availability, and I’ve been testing it out the last couple of days.

The one thing that blows me away is how reliably fast even the low-end instances are. I provisioned and set up a f1-micro instance – it runs great and quite consistently. That’s in sharp contrast to Amazon’s micro instance which is limited to “bursty” processing; there are spikes where processing goes quickly, then the CPU gets throttled and the instance grinds to a near-halt.

I’m considering building a mail app on top of GCE – so far, everything looks great. GCE even allows inbound SMTP connections (although unfortunately no outbound SMTP connections).