Is The Service Account Now Required With Google Cloud Build?

I was updating some Cloud Build triggers and I’m not sure what changed: the service account field now appears to be mandatory when configuring a new build trigger, and I don’t recall ever having to set that field before.

Also, this is the first time I’ve ever seen the error below:

Your build failed to run: generic::invalid_argument: if 'build.service_account' is specified, the build must either (a) specify 'build.logs_bucket', (b) use the REGIONAL_USER_OWNED_BUCKET build.options.default_logs_bucket_behavior option, or (c) use either CLOUD_LOGGING_ONLY / NONE logging options

Google Cloud Build

And the fix is simply to configure Cloud Logging in the cloudbuild.yaml file in my repository:

steps:
- name: "gcr.io/cloud-builders/gcloud"
  args: ["app", "deploy", "--version", "1alpha"]
timeout: "1600s"
options:
  logging: CLOUD_LOGGING_ONLY

UniSuper and Google Cloud Platform

I know a lot of enterprise cloud customers have been watching the recent incident between Google Cloud (GCP) and UniSuper. For those of you who haven’t seen it: UniSuper is an Australian pension fund that had its services hosted on Google Cloud, and for some reason its private cloud project was completely deleted. Google’s postmortem of the incident is here: https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident . Fascinating reading – in particular, what surprises me is that GCP takes full blame for the incident. There must be some very interesting calls occurring between Google and its other enterprise customers.

There are some fascinating morsels in Google’s postmortem of the incident. Consider this passage:

Data backups that were stored in Google Cloud Storage in the same region were not impacted by the deletion, and, along with third party backup software, were instrumental in aiding the rapid restoration.

https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident

Fortunately for UniSuper, the data in Google Cloud Storage didn’t seem to be affected, and they were able to restore from there. But it looks like UniSuper also had another set of data stored with another cloud provider. The following is from UniSuper’s explanation of the event at https://www.unisuper.com.au/contact-us/outage-update .

UniSuper had backups in place with an additional service provider. These backups have minimised data loss, and significantly improved the ability of UniSuper and Google Cloud to complete the restoration.

https://www.unisuper.com.au/contact-us/outage-update

Having a full set of backups with another service provider has to be terrifically expensive. I’d be curious to know who the additional service provider is and what the arrangement costs. I also wonder if the backup cloud is live-synced with the GCP servers, or if there’s a daily/weekly sync of the data to help reduce costs.

The GCP statement seems to say that the restoration was completed with just the data from Google Cloud Storage, while the UniSuper statement is a bit more ambiguous – you could read it as saying either (1) the offsite data was used to complete the restoration, or (2) the offsite data was useful but not vital to the restoration effort.

Interestingly, an HN comment indicates that the Australian financial regulator requires this multi-cloud strategy: https://www.infoq.com/news/2024/05/google-cloud-unisuper-outage/ .

I did a quick dive to figure out where these requirements come from, and as best I can tell, they come from the APRA’s Prudential Standard CPS 230 – Operational Risk Management document. Here are some interesting lines from it:

  1. An APRA-regulated entity must, to the extent practicable, prevent disruption to critical operations, adapt processes and systems to continue to operate within tolerance levels in the event of a disruption and return to normal operations promptly once a disruption is over.
  2. An APRA-regulated entity must not rely on a service provider unless it can ensure that in doing so it can continue to meet its prudential obligations in full and effectively manage the associated risks.
Australian Prudential Regulation Authority (APRA) – Prudential Standard CPS 230 Operational Risk Management

I think “rely on a service provider” is the most interesting text here. I wonder if, by keeping a set of data on another cloud provider, UniSuper can justify to the APRA that it’s not relying on any single cloud provider but has instead diversified its risks.

I couldn’t find any discussion of the maximum amount of downtime allowed, so I’m not sure where the “4 week” tolerance from the HN comment came from; most likely it reflects industry norms. But I did find some text about tolerance levels for disruptive events:

  38. For each critical operation, an APRA-regulated entity must establish tolerance levels for:
    (a) the maximum period of time the entity would tolerate a disruption to the operation
Australian Prudential Regulation Authority (APRA) – Prudential Standard CPS 230 Operational Risk Management

It’s definitely interesting to see how requirements for enterprise cloud customers grow from their regulators and other interested parties. There’s often some justification underlying every decision (such as duplicating data across clouds) no matter how strange it seems at first.

APRA History On The Cloud

While digging into this subject, I found it quite interesting to trace how the APRA has changed its tune about cloud computing over the years. As recently as 2010, the APRA felt the need to “emphasise the need for proper risk and governance processes for all outsourcing and offshoring arrangements.” Here’s an interesting excerpt from its 2010 letter sent to all APRA-overseen financial companies:

Although the use of cloud computing is not yet widespread in the financial services industry, several APRA-regulated institutions are considering, or already utilising, selected cloud computing based services. Examples of such services include mail (and instant messaging), scheduling (calendar), collaboration (including workflow) applications and CRM solutions. While these applications may seem innocuous, the reality is that they may form an integral part of an institution’s core business processes, including both approval and decision-making, and can be material and critical to the ongoing operations of the institution.
APRA has noted that its regulated institutions do not always recognise the significance of cloud computing initiatives and fail to acknowledge the outsourcing and/or offshoring elements in them. As a consequence, the initiatives are not being subjected to the usual rigour of existing outsourcing and risk management frameworks, and the board and senior management are not fully informed and engaged.

https://www.apra.gov.au/sites/default/files/Letter-on-outsourcing-and-offshoring-ADI-GI-LI-FINAL.pdf

While the letter itself seems rather innocuous, it appears to have had a bit of a chilling effect on Australian banks: this article comments that “no customers in the finance or government sector were willing to speak on the record for fear of drawing undue attention by regulators.”

An APRA document published on July 6, 2015, seems even more critical of the cloud. Here’s a very interesting quote from page 6:

In light of weaknesses in arrangements observed by APRA, it is not readily evident that risk management and mitigation techniques for public cloud arrangements have reached a level of maturity commensurate with usages having an extreme impact if disrupted. Extreme impacts can be financial and/or reputational, potentially threatening the ongoing ability of the APRA-regulated entity to meet its obligations.

https://www.apra.gov.au/sites/default/files/information-paper-outsourcing-involving-shared-computing-services_0.pdf

Then, just three years later, the APRA seemed much friendlier to cloud computing. A ComputerWorld article entitled “Banking regulator warms to cloud computing”, published on September 24, 2018, quotes the APRA chair as acknowledging “advancements in the safety and security in using the cloud, as well as the increased appetite for doing so, especially among new and aspiring entities that want to take a cloud-first approach to data storage and management.”

It’s curious to see how organizations’ views of the cloud have evolved. I think UniSuper and GCP’s quick restoration of the deleted cloud projects will result in an even friendlier environment for the cloud.

Correcting A SQLite Code Example

I’ve been experimenting with filtering and manipulating a large amount of data within a Google Cloud Function. I decided to use an in-memory SQLite database to help manage all the data, so I googled for some code samples. This page came up with some helpful Python examples.

Unfortunately, when I tried to run the sample code, Cloud Functions threw an error. The sample code uses Python 2-style print as a statement instead of as a function call – i.e., the print calls are missing the parentheses needed to make them valid function calls. Here’s a sample screenshot:

I’ve placed red arrows next to the erroneous print statements. If you paste this code into Google Cloud Functions, it won’t work because print needs to be a function call (with parentheses) instead of a statement (without parentheses). Credit: https://www.tutorialspoint.com/sqlite/sqlite_python.htm

Below is a fixed version of the code in the linked page. You can paste it directly into the Google Cloud Functions editor and it’ll work: it sets up an in-memory database, creates a table, adds data, then queries data out of it.

import sqlite3



def hello_world(request):
    """Responds to any HTTP request.
    Args:
        request (flask.Request): HTTP request object.
    Returns:
        The response text or any set of values that can be turned into a
        Response object using
        `make_response <http://flask.pocoo.org/docs/1.0/api/#flask.Flask.make_response>`.
    """
    # Create an in-memory SQLite database and a sample table
    conn = sqlite3.connect(":memory:")
    conn.execute('''CREATE TABLE COMPANY
         (ID INT PRIMARY KEY     NOT NULL,
         NAME           TEXT    NOT NULL,
         AGE            INT     NOT NULL,
         ADDRESS        CHAR(50),
         SALARY         REAL);''')

    # Insert a few sample rows and commit the changes
    conn.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
        VALUES (1, 'Paul', 32, 'California', 20000.00 )")
    conn.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
        VALUES (2, 'Allen', 25, 'Texas', 15000.00 )")
    conn.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
        VALUES (3, 'Teddy', 23, 'Norway', 20000.00 )")
    conn.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
        VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00 )")
    conn.commit()
    print("Records created successfully")

    # Query the rows back out and print them
    cursor = conn.execute("SELECT id, name, address, salary from COMPANY")
    for row in cursor:
        print("ID = ", row[0])
        print("NAME = ", row[1])
        print("ADDRESS = ", row[2])
        print("SALARY = ", row[3], "\n")
    conn.close()
    request_json = request.get_json()
    if request.args and 'message' in request.args:
        return request.args.get('message')
    elif request_json and 'message' in request_json:
        return request_json['message']
    else:
        return f'Hello World!'

Use this code as a starting point to build your own cloud functions and work with data.

I’m pleasantly surprised at how fast SQLite runs within a cloud function – I was worried that the function would run out of memory quickly, but I’ve been manipulating thousands of rows comfortably within a 512MB RAM function.
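
If you’re loading thousands of rows like that, sqlite3’s executemany keeps the insert step fast. Here’s a minimal sketch (the table layout and the generated sample rows are just illustrative, not from the code above):

import sqlite3

# Sketch only: bulk-load generated sample rows into an in-memory database.
rows = [(i, f"user{i}", 20 + (i % 50), "Somewhere", 1000.0 + i) for i in range(1, 5001)]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE COMPANY
     (ID INT PRIMARY KEY NOT NULL,
      NAME TEXT NOT NULL,
      AGE INT NOT NULL,
      ADDRESS CHAR(50),
      SALARY REAL)""")

# executemany inserts the whole batch in one call, which is typically faster
# and tidier than looping over individual execute() calls.
conn.executemany(
    "INSERT INTO COMPANY (ID, NAME, AGE, ADDRESS, SALARY) VALUES (?, ?, ?, ?, ?)",
    rows,
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM COMPANY").fetchone()[0]
print(f"Loaded {count} rows")
conn.close()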

Cloud Build Failure: You do not have permission to act as (Service Account)

I have a demo App Engine application on GitHub, mapped through Google Cloud Build to automatically redeploy upon any change to the master branch. I’d left this app untouched for about a month or so, until now, when I made some minor updates and pushed them to the GitHub repository.

Unfortunately, it seems that Cloud Build has changed some permissions, because errors suddenly appeared and my updates failed to deploy. Here’s a screenshot of my Cloud Build page with the errors:

My Cloud Build page. All my builds through October of this year succeeded, but my November builds started failing.

Apparently this failure was due to a permissions problem:

ERROR: (gcloud.app.deploy) PERMISSION_DENIED: You do not have permission to act as '[email protected]'
- '@type': type.googleapis.com/google.rpc.ResourceInfo
  description: You do not have permission to act as this service account.
  resourceName: [email protected]
  resourceType: serviceAccount
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/gcloud" failed: step exited with non-zero status: 1
Error detail in context.

When I tried to force the build by manually running the trigger, I got this error:

Failed to trigger build: generic::permission_denied: service account [email protected] has insufficient permission to execute the build on project project-name.

In short, you need to add the Cloud Build Service Agent role, which allows Cloud Build to use service accounts to authenticate to other Google services. In the IAM section of the Cloud Console, find the Cloud Build service account:

The Cloud Build service account is in the black box.

And then add the Cloud Build Service Agent role to the Cloud Build service account:

After I added that role, my Cloud Build deployments worked again.

A Funny Health Graph

Health graphs always amuse me. I occasionally have to scratch my head and wonder what exactly is being measured.

Take the graph below. At first, the service has zero users and a zero error rate. But once it gets to one user (July 19), the error rate ramps right up to 100% and stays there even when no one is using the service (the July 20 – 21 part of the graph).

How can there be an error rate if it’s not being used?

Just another day in amusing metrics.

Initial Thoughts On Comparing Tweets

Note: some of this code is old; it could be rewritten to benefit more from SQL’s windowing and other advanced functions.

I’ve been poking around some old code of mine recently – since it’s going unused, maybe someone on the Internet will benefit from it. A few years ago, I was toying with some code to detect trending tweets on Twitter. Consider a SQL table set up like the one below, where user is the name of the Twitter user posting the tweet, id is the tweet ID, created_at is the creation date of the tweet, and retweet_count/favorite_count are the number of retweets and favorites the tweet had.

CREATE TABLE twitter (
    category VARCHAR(30),
    user VARCHAR(30),
    id VARCHAR(90), 
    created_at DATETIME,
    retweet_count INT,
    favorite_count INT,
    add_date DATETIME
);

The trend finder worked like this: it would take snapshots of each tweet’s retweet and favorite counts (add_date is the datetime of the snapshot) and then compare the snapshots to detect how fast the tweet was gaining retweets/favorites. The comparison was done in SQL. I named the gain rate of retweets/favorites the slope of the activity, and calculated multiple slopes: the increase in retweets/favorites within the most recent 90 minutes, and the overall increase in retweets/favorites since the first snapshot. Here’s the simplified SQL (the rest of the SQL has been cut because this is already too long, and it isn’t relevant to the point I’m making):


insert into tweet_slope 
(

select twitter.created_at, twitter.category, twitter.id, twitter.retweet_count as max_retweet, 
 twitter.favorite_count as max_favorite, twitter.add_date, new_table.max_date, 
 first_table.first_date, first_table.first_retweet, first_table.first_favorite, 
 prior_table.prior_date, prior_table.prior_retweet, prior_table.prior_favorite,
                                                                                                    	
(twitter.retweet_count - first_table.first_retweet) as full_retweet_diff,
TIMESTAMPDIFF(MINUTE, first_table.first_date, new_table.max_date) as full_retweet_timespan,
((twitter.retweet_count - first_table.first_retweet) / 
TIMESTAMPDIFF(MINUTE, first_table.first_date, new_table.max_date)) as full_retweet_slope,
(twitter.retweet_count - prior_table.prior_retweet) as recent_retweet_diff,
TIMESTAMPDIFF(MINUTE, prior_table.prior_date, new_table.max_date) as recent_retweet_timespan,
((twitter.retweet_count - prior_table.prior_retweet) / 
TIMESTAMPDIFF(MINUTE, prior_table.prior_date, new_table.max_date)) as recent_retweet_slope,
(twitter.favorite_count - first_table.first_favorite) as full_favorite_diff,
TIMESTAMPDIFF(MINUTE, first_table.first_date, new_table.max_date) as full_favorite_timespan,
((twitter.favorite_count - first_table.first_favorite) / 
TIMESTAMPDIFF(MINUTE, first_table.first_date, new_table.max_date)) as full_favorite_slope,
(twitter.favorite_count - prior_table.prior_favorite) as recent_favorite_diff,
TIMESTAMPDIFF(MINUTE, prior_table.prior_date, new_table.max_date) as recent_favorite_timespan,
((twitter.favorite_count - prior_table.prior_favorite) / 
TIMESTAMPDIFF(MINUTE, prior_table.prior_date, new_table.max_date)) as recent_favorite_slope
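 -- new_table holds the most recent snapshot date for each tweet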
 from (select id, max(add_date) as max_date from twitter group by id) as new_table
 inner join twitter on twitter.id=new_table.id and twitter.add_date=new_table.max_date
 INNER JOIN 
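 -- first_table holds the earliest snapshot (the baseline retweet/favorite counts) for each tweet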
(select twitter.id, twitter.add_date as first_date, twitter.retweet_count as 
 first_retweet, twitter.favorite_count as first_favorite from twitter inner 
 join (select id, min(add_date) as min_date from twitter group by id) as old_dates 
 on twitter.id=old_dates.id and twitter.add_date=old_dates.min_date) as first_table
 
 on twitter.id=first_table.id
 INNER JOIN 
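 -- prior_table holds the most recent snapshot that is at least 90 minutes old, used for the recent slope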
 (select twitter.id, twitter.add_date as prior_date, twitter.retweet_count as 
 prior_retweet, twitter.favorite_count as prior_favorite from twitter INNER JOIN 
 (select id, max(add_date) as prior_set_date from twitter where add_date <  
 DATE_SUB(NOW(), INTERVAL 90 MINUTE) group by id) as prior_set 
 ON twitter.add_date=prior_set.prior_set_date and twitter.id=prior_set.id) as  prior_table 
 ON twitter.id=prior_table.id
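 -- only score tweets created within the last day, with a fresh snapshot (last 45 minutes), more than one snapshot, and the requested category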
 where twitter.created_at > DATE_SUB(NOW(), INTERVAL 1 DAY) AND 
 new_table.max_date > DATE_SUB(NOW(), INTERVAL 45 MINUTE) AND 
 new_table.max_date <> first_table.first_date
and twitter.category = :category
)
;

Now, this code worked, and it worked great! A fast DB machine made all the difference in crunching through all the SQL calculations. I tested it by sampling tweets from sets of related accounts. One set I tested was a group of Disney/Marvel/Star Wars-related Twitter accounts (for instance, the accounts of the actors behind Luke Skywalker and Iron Man, and the Disney Parks account), and the top-ranking tweets were almost all from Mark Hamill’s account; on the day I was testing, Mark’s tweets had thousands of favorites/retweets due to being politics-related.

I said at the top of this section that this code is old, and that’s true – it has since been replaced. While this code worked, it was too simple: in practice, it would never surface trending tweets from less popular Twitter accounts.

The real problem is that tweet popularity follows a bimodal distribution: most tweets putter along with, at best, a handful of retweets/favorites; some might even get a few hundred. But then there’s a small minority of tweets, posted from extremely popular accounts, that get thousands and thousands of favorites and retweets regardless of the individual tweet’s merit – the tweets are popular because the poster is popular. This crowds out the trending tweets from the other part of the bimodal distribution.

I rebuilt the application to take into account the number of followers a Twitter user has (as a proxy for the account’s relative popularity) and to compare a tweet to the account’s tweet history (how popular is a given tweet compared to the other tweets in the same account?). I’ll share the code soon.
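
As a rough illustration of that idea (this is just a sketch of the approach, not the actual replacement code, and the weights and helper names are made up), a trending score might blend a per-follower rate with a comparison against the account’s own recent history:

from statistics import mean, stdev

# Sketch only: score a tweet's recent retweet/favorite gain relative to the
# account's follower count and the account's own recent history.
def trending_score(recent_gain, follower_count, account_recent_gains):
    # Normalize by followers so huge accounts don't automatically dominate.
    per_follower = recent_gain / max(follower_count, 1)

    # Compare against the account's own history: how unusual is this tweet?
    if len(account_recent_gains) >= 2:
        mu, sigma = mean(account_recent_gains), stdev(account_recent_gains)
        z_score = (recent_gain - mu) / sigma if sigma > 0 else 0.0
    else:
        z_score = 0.0

    # Blend the two signals; the weights and scaling are arbitrary placeholders.
    return 500 * per_follower + 0.5 * z_score

# Example: a small account whose latest tweet far outperforms its usual tweets
print(trending_score(recent_gain=80, follower_count=2000,
                     account_recent_gains=[2, 5, 3, 4, 6]))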

AppSheet

I’m pretty bullish on no-code/low-code automation services such as IFTTT, Zapier, Amazon Honeycode, and the many startups in this sector. Software is eating the world, and it’s not hard to see the piles of cash that will be shoveled toward the first company that can drastically simplify a lot of current IT operations.

That’s one of the reasons I’m pleased to see Google introducing more automation within AppSheet and Apigee: https://techcrunch.com/2020/09/08/google-cloud-launches-its-business-application-platform-with-apigee-and-appsheet/ . I’ve been toying around with AppSheet to help automate some work, and it is a surprisingly fun and agile platform to work on.

I really like the deep integration with Google Sheets: AppSheet makes it easy to quickly deploy a proof-of-concept app or a low-traffic production application and store its data within a Google Sheet. Better yet, AppSheet can analyze your data and set up an app template automatically. When I first started using AppSheet, it automatically imported a spreadsheet called “Post Mapping” stored in my Google Drive: a listing of bookmarks and dates I had been saving.

A view of AppSheet’s introductory page.

When I clicked on the Post Mapping file, AppSheet automatically created table and calendar views of my data:

One thing I’m surprised not to see is integration with Firebase. Firebase is – IMO – the easiest-to-use backend for mobile and web apps, and yet AppSheet doesn’t integrate with it. At the very least, I’d like to see AppSheet borrow Firebase’s authentication options.

In short, AppSheet is definitely something to keep your eyes on.

Firestore Add Document – Python

A simple example of adding a document to the Firestore:

    import logging

    from google.cloud import firestore

    # Add a document to Firestore; .add() generates the document ID for us
    firestore_client = firestore.Client()
    doc_ref = firestore_client.collection(u'CollectionsNameHere').add({
        u'propertyone': u'thisisavalue',
        u'propertytwo': 2,
    })

    # .add() returns an (update_time, DocumentReference) tuple; log the generated ID
    logging.warning(doc_ref[1].id)

.add returns a tuple of the time the document was written to Firestore and a document reference to the written document. .add lets Firestore create a document ID for the written document; to figure out what that ID is, the last line accesses the tuple and pulls out the document reference’s ID.

An item that bit me: the official documentation for Firestore states that .add and .doc().set() are equivalent. That’s true, but the return values of the two calls are different.

Documentation from Google showing .add and .doc().set() are the same operation.

.add returns a tuple of the document creation time and a reference to the added document. However, .set returns a WriteResult, not a tuple. The WriteResult contains the time the document was updated/set in its update_time property. Make sure to handle the correct return type for whichever call you use.
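
Here’s a small sketch showing the difference between the two return values (the collection name and document ID are just placeholders):

    import logging

    from google.cloud import firestore

    firestore_client = firestore.Client()

    # .add() lets Firestore generate the document ID and returns (update_time, DocumentReference)
    update_time, new_doc = firestore_client.collection(u'CollectionsNameHere').add({
        u'propertyone': u'thisisavalue',
    })
    logging.warning('Added document %s at %s', new_doc.id, update_time)

    # .document().set() writes to an ID you choose and returns a WriteResult instead
    write_result = firestore_client.collection(u'CollectionsNameHere').document(u'my-known-id').set({
        u'propertyone': u'thisisavalue',
    })
    logging.warning('Set document at %s', write_result.update_time)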

Set Documentation

Set documentation - Firestore
WriteResult documentation - Firestore

Add Documentation

Add documentation - Firestore