A Short Post On The Google Spider’s Ability To Read JavaScript-Loaded Content

There was a terrific post on Vercel’s blog, “How Google handles JavaScript throughout the indexing process.” The short version of the post is yes, the Google indexer/spider can read content loaded through JavaScript. That should come as no surprise to most professional SEO-ers out there, but it’s always worth a reminder. I did want to highlight two important items in the recommendations section.

Critical SEO elements: Use server-side rendering or static generation for critical SEO tags and important content to ensure they’re present in the initial HTML response.

https://vercel.com/blog/how-google-handles-javascript-throughout-the-indexing-process

Content updates: For content that needs to be quickly re-indexed, ensure changes are reflected in the server-rendered HTML, not just in client-side JavaScript. Consider strategies… to balance content freshness with SEO and performance.

https://vercel.com/blog/how-google-handles-javascript-throughout-the-indexing-process

Yes, the Google spider can read JavaScript. At the same time, there are tradeoffs to consider: if your scripts are slow to load the content, the spider “sees” your page as slow and possibly annoying to the user.

My thoughts are: if you’re scroll-loading/infinite scrolling content (i.e. the user scrolls down the page and more posts/images/whatever keep appearing), don’t assume that the Google spider is going to keep scrolling down forever. Your site should have an alternative way to link to those posts/images, such as through a tags or categories section, or through a calendar widget like the one on learngoogle.com’s home page. Have a sitemap so Google’s spider doesn’t have to rely on JavaScript to see all of your pages. Make sure that any scripts loading content run fast and do not heavily change the layout of the page.
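To make the sitemap suggestion concrete, here’s a minimal sketch of generating one with Python’s standard library. The function name and the example.com URLs are hypothetical and only for illustration; a real site would pull the URL list from its own database or CMS, and most blogging platforms have a plugin that does this for you.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical post URLs that would otherwise only be reachable
# by scrolling an infinite feed.
example_posts = [
    "https://example.com/posts/1",
    "https://example.com/posts/2",
]
sitemap_xml = build_sitemap(example_posts)
print(sitemap_xml)
```

The point is simply that every post gets a plain, crawlable URL listed in one place, regardless of how the page itself loads content.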

A Reply To: “Google Defaults To Not Indexing” Or: Google As Miss Manners

I saw this blog post on Hacker News, and it was so notable that I’ve been thinking about it for the past week. I disagree with its major points for technical reasons, but I agree that you should approach SEO as if it were true.

But first, I want to make a distinction here. When Google hits a website and looks at its content for possible inclusion into its search index, we call that “spidering”. That’s not a word plucked out of nowhere – we call web crawlers searching for content “spiders” and there’s a long technical history behind that.

In my experience, Google spiders basically everything, even places you might wish Google didn’t find, such as admin pages. And frankly this makes sense: spidering your website doesn’t only give Google information about your website, it also gives Google information about how it should rank other web pages. For example, Google gets information about the sites you link out to, which contributes to PageRank calculations of how other web pages should be ranked. A second example is that by spidering all the web pages, Google can find scraped/duplicate content and possibly consider the offending domain (not necessarily your domain!) for SEO penalties.

So if there is an incentive to spider everything, you can see where I disagree with the blog post:

Credit: https://www.vincentschmalbach.com/google-now-defaults-to-not-indexing-your-content/

I think it’s very unreasonable to say “Google is no longer trying to index the entire web.” There are huge incentives for Google to spider and at least know about the entire web, even if it doesn’t actually show all the web pages it knows about in its search results.

First off, most people don’t go past the first page of search results anymore. For a majority of searches, Google’s AI summary or the first few results (regardless of whether they’re ads or not) will supply the answer. 60% of searches don’t even result in a click to an outside web page. So even if Google knows about additional websites that might match the search, is it worth the computing power to resolve the rankings much beyond the 20th search result slot?

There’s a human analog here: people do not want to hear additional details. They want you to get to the point as fast as possible. Here’s a Miss Manners article on “Is there any polite way to encourage someone who is recounting an anecdote to you to come to the point a little faster?” I find it reasonable to assume Google search is simply getting to the point and not showing sites whose information, however relevant, is already available on competing web pages that are ranked higher.

So in short, I disagree with this blog article on a technical basis. I don’t think it’s quite so easy to say that because a web page is not showing up in a Google search, Google automatically didn’t see it, doesn’t care about it, or doesn’t have it in its index.

On the other hand, I think the blog’s deeper point is true. We’ve reached the point in the Internet where there are lots of good competing information sources. If you want to launch a competitor, you need to have a value proposition and a niche: a place that you can get started. For example, suppose you have a Pizza Hut, Papa Johns, (insert your favorite pizza place here) in your town. Your townspeople are generally happy with the pizza available, and there’s no obvious need for another pizza place. If you want to launch a new pizza restaurant, you can’t just say, “We sell pizza.” You have to have a value proposition different than Pizza Hut/Papa Johns/etc: maybe the pizza at your restaurant is meatier/cheesier/better crust/whatever better than the competitors.

The same goes for content: if you want to launch a new website, you need to have a value proposition different than what your competitors are offering if you want a space in Google search rankings. You need to develop a following as an expert in some niche in order to compete with better, more well-funded competitors, especially if you’re a smaller blog.

Is The Service Account Now Required With Google Cloud Build?

I was updating some Cloud Build triggers and I’m not sure what changed – I think that the service account field when configuring a new build trigger is now mandatory because I don’t recall ever having to set that field before.

Also, this is the first time I’ve ever seen the below error:

Your build failed to run: generic::invalid_argument: if ‘build.service_account’ is specified, the build must either (a) specify ‘build.logs_bucket’, (b) use the REGIONAL_USER_OWNED_BUCKET build.options.default_logs_bucket_behavior option, or (c) use either CLOUD_LOGGING_ONLY / NONE logging options

Google Cloud Build

And the fix is obviously just to configure cloud logging in the cloudbuild.yaml file in my repository:

steps:
- name: "gcr.io/cloud-builders/gcloud"
  args: ["app", "deploy", "--version", "1alpha"]
timeout: "1600s"
options:
  logging: CLOUD_LOGGING_ONLY

GoDaddy Email Forwarding Shuts Off – Migrating Catchall Email Addresses

I have a domain that I used for email almost 20 years ago. I don’t use it for (important!) email anymore, nor really for any other purpose, but I do occasionally check the email account attached to the domain every week or so in case something important gets sent through. Most of the time it’s nothing more than a few hundred messages from various listservs I’ve been on for a long time.

So you can imagine my surprise when I logged onto the email account and saw zero new messages, which is very unusual since those listservs have a lot of daily traffic. After some Googling, I found that GoDaddy’s catchall email forwarders apparently no longer work. Here’s an example post from Reddit on the situation: https://www.reddit.com/r/godaddy/comments/1d94771/email_catchall_help/ .

What really annoys me is that the email forwarding is silently broken – there’s no rejection email or anything. I tried to send some emails to my email-forwarded domain and none of them went through, nor were they rejected. Again, here’s a Reddit post documenting this: https://www.reddit.com/r/DontGoDaddy/comments/1d1pvim/comment/l7d2ua5/ .

I tried searching my email archives to see if there was any warning email catchalls would be turned off – I didn’t see anything, and this Reddit post confirms that nobody else received a warning either: https://www.reddit.com/r/ProtonMail/comments/1d2bup6/comment/l6k9amp/ .

I’m pretty disappointed in how email forwarding from GoDaddy was shut down. Email forwarding has been basically free with domain registration for a long time at any decent registrar.

Anyway, to fix this, I’ve been adding a bunch of domains to my Google Apps account as alias domains, then using Gmail Routing to map the catchall address to my main account, as in the picture below.

Once again, Google to the rescue, but I am seriously annoyed at having to work around GoDaddy issues. The fact that they gave zero warning of this change is concerning to say the least.

Google Search For . (Period)

For today, I wanted to record a quick observation I had while Googling. It’s also a reminder that choosing the correct search terms can drastically change what Google returns to you.

If I Google for the period symbol (.), I get back results for the phrase “full stop punctuation.” I know this because the words “full stop punctuation” are bolded in the returned Google page. Here’s a screenshot in case that changes:

Note that the links aren’t terribly interesting – I don’t see any links to punctuation or style guides, just pages with the words “full stop punctuation.”

Now interestingly, if I search for the words “period punctuation”, I get back a small context box explaining to me what a period is used for in writing, as well as a list of punctuation and writing guides:

The results for a Google search for “period punctuation.”

As you can see, a minor change in search terms dramatically changes what you get, even if both terms mean largely the same thing.

UniSuper and Google Cloud Platform

I know a lot of enterprise cloud customers have been watching the recent incident with Google Cloud (GCP) and UniSuper. For those of you who haven’t seen it: UniSuper is an Australian pension fund firm that had its services hosted on Google Cloud. For some weird reason, its private cloud project was completely deleted. Google’s postmortem of the incident is here: https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident . Fascinating reading – in particular, what surprises me is that GCP takes full blame for the incident. There must be some very interesting calls occurring between Google and its other enterprise customers.

There are some fascinating morsels to consider in Google’s postmortem of the incident. Consider this passage:

Data backups that were stored in Google Cloud Storage in the same region were not impacted by the deletion, and, along with third party backup software, were instrumental in aiding the rapid restoration.

https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident

Fortunately for UniSuper, the data in Google Cloud Storage didn’t seem to be affected and they were able to restore from there. But it looks like UniSuper also had another set of data stored with another cloud. The following is from UniSuper’s explanation of the event at: https://www.unisuper.com.au/contact-us/outage-update .

UniSuper had backups in place with an additional service provider. These backups have minimised data loss, and significantly improved the ability of UniSuper and Google Cloud to complete the restoration.

https://www.unisuper.com.au/contact-us/outage-update

Having a full set of backups with another service provider has to be terrifically expensive. I’d be curious to see a discussion of who the additional service provider is and a discussion of the costs. I also wonder if the backup cloud is live-synced with the GCP servers or if there’s a daily/weekly sync of the data to help reduce costs.

The GCP statement seems to say that the restoration was completed with just the data from Google Cloud Storage, while the UniSuper statement is a bit more ambiguous – you could read the statement as either (1) the offsite data was used to complete the restoration or (2) the offsite data was useful but not vital to the restoration effort.

Interestingly, an HN comment indicates that the Australian financial regulator requires this multi-cloud strategy: https://www.infoq.com/news/2024/05/google-cloud-unisuper-outage/ .

I did a quick dive to figure out where these requirements are coming from, and as best I can tell, they come from the APRA’s Prudential Standard CPS 230 – Operational Risk Management document. Here are some interesting lines from there:

  1. An APRA-regulated entity must, to the extent practicable, prevent disruption to critical operations, adapt processes and systems to continue to operate within tolerance levels in the event of a disruption and return to normal operations promptly once a disruption is over.
  2. An APRA-regulated entity must not rely on a service provider unless it can ensure that in doing so it can continue to meet its prudential obligations in full and effectively manage the associated risks.
Australian Prudential Regulation Authority (APRA) – Prudential Standard CPS 230 Operational Risk Management

I think the “rely on a service provider” is the most interesting text here. I wonder if – by keeping a set of data on another cloud provider – UniSuper can justify to the APRA that it’s not relying on any single cloud provider but instead has diversified its risks.

I couldn’t find any discussion about the maximum amount of downtime allowed, so I’m not sure where the “4 week” tolerance from the HN comment came from. Most likely that is from industry norms. But I did find some text about tolerance levels of disruptive events:

  38. For each critical operation, an APRA-regulated entity must establish tolerance levels for:
    (a) the maximum period of time the entity would tolerate a disruption to the operation
Australian Prudential Regulation Authority (APRA) – Prudential Standard CPS 230 Operational Risk Management

It’s definitely interesting to see how requirements for enterprise cloud customers grow from their regulators and other interested parties. There’s often some justification underlying every decision (such as duplicating data across clouds) no matter how strange it seems at first.

APRA History On The Cloud

While digging into this subject, I found it quite interesting to trace how the APRA changed its tune about cloud computing over the years. As recently as 2010, the APRA felt the need to, “emphasise the need for proper risk and governance processes for all outsourcing and offshoring arrangements.” Here’s an interesting excerpt from their 2010 letter sent to all APRA-overseen financial companies:

Although the use of cloud computing is not yet widespread in the financial services industry, several APRA-regulated institutions are considering, or already utilising, selected cloud computing based services. Examples of such services include mail (and instant messaging), scheduling (calendar), collaboration (including workflow) applications and CRM solutions. While these applications may seem innocuous, the reality is that they may form an integral part of an institution’s core business processes, including both approval and decision-making, and can be material and critical to the ongoing operations of the institution.
APRA has noted that its regulated institutions do not always recognise the significance of cloud computing initiatives and fail to acknowledge the outsourcing and/or offshoring elements in them. As a consequence, the initiatives are not being subjected to the usual rigour of existing outsourcing and risk management frameworks, and the board and senior management are not fully informed and engaged.

https://www.apra.gov.au/sites/default/files/Letter-on-outsourcing-and-offshoring-ADI-GI-LI-FINAL.pdf

While the letter itself seems rather innocuous, it seems to have had a bit of a chilling effect on Australian banks: this article comments that, “no customers in the finance or government sector were willing to speak on the record for fear of drawing undue attention by regulators”.

An APRA document published on July 6, 2015 seems to be even more critical of the cloud. Here’s a very interesting quote from page 6:

In light of weaknesses in arrangements observed by APRA, it is not readily evident that risk management and mitigation techniques for public cloud arrangements have reached a level of maturity commensurate with usages having an extreme impact if disrupted. Extreme impacts can be financial and/or reputational, potentially threatening the ongoing ability of the APRA-regulated entity to meet its obligations.

https://www.apra.gov.au/sites/default/files/information-paper-outsourcing-involving-shared-computing-services_0.pdf

Then just three years later, the APRA seems to be much more friendly to cloud computing. A ComputerWorld article entitled “Banking regulator warms to cloud computing” published on September 24, 2018 quotes the APRA chair as acknowledging, “advancements in the safety and security in using the cloud, as well as the increased appetite for doing so, especially among new and aspiring entities that want to take a cloud-first approach to data storage and management.”

It’s curious to see the evolution of how organizations consider the cloud. I think UniSuper/GCP’s quick restoration of their cloud projects will result in a much more friendly environment toward the cloud.

How To Waste AdWords Budget: Postie Plugin Edition

Some time ago I was looking for ways to send in posts to my WordPress blog via email, and I found a reference to a WordPress plugin called “Postie.” So I popped that into Google search and what did I get?

The correct answer to this search would be the Postie WordPress plugin hosted here. But apparently there is another company named Postie which manages enterprise mail (hosted at postie.com) which is a completely separate entity to the WordPress plugin (hosted at postieplugin.com). As you can see from the screenshot, my search resulted in an ad for the enterprise company.

But I have no interest in enterprise mail. That ad is effectively wasted. Worse yet, the CTR (clickthrough rate, the number of times the ad is clicked on divided by the number of times the ad is shown) of the ad goes down through no fault of the ad itself. But you can see why the ad was shown – the ad’s creator placed ads on the word “postie” and didn’t realize there might be other organizations with the same name.
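Here’s a quick sketch of that CTR arithmetic. The click and impression counts are made up purely for illustration; the point is only that impressions on mismatched searches grow the denominator without adding clicks.

```python
def click_through_rate(clicks, impressions):
    """CTR = clicks / impressions; returns 0.0 if the ad was never shown."""
    if impressions == 0:
        return 0.0
    return clicks / impressions


# Hypothetical: 20 clicks from 1,000 impressions on relevant searches.
relevant_only = click_through_rate(20, 1000)

# Same 20 clicks, plus 500 impressions on searches for the WordPress
# plugin that nobody clicks.
with_mismatched = click_through_rate(20, 1500)

print(f"{relevant_only:.2%} vs {with_mismatched:.2%}")
```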

This is a good example of where negative keywords are useful. In short, negative keywords identify searches for which an ad should NOT be shown. In this case, Postie (the enterprise company) should have used negative keywords to exclude the word “plugin” so they’re not confused with Postie Plugin (the WordPress plugin).
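As a rough illustration of the idea, here’s a toy sketch of negative keyword filtering. This is a deliberately simplified word-matching model and an assumption on my part, not how Google Ads actually matches queries; the function name and keyword sets are hypothetical.

```python
def should_show_ad(search_query, keywords, negative_keywords):
    """Show the ad only if the query hits a keyword and no negative keyword."""
    words = set(search_query.lower().split())
    if words & negative_keywords:
        # Any negative keyword in the query suppresses the ad.
        return False
    return bool(words & keywords)


# Hypothetical campaign for the enterprise mail company.
keywords = {"postie"}
negative_keywords = {"plugin"}

print(should_show_ad("postie plugin", keywords, negative_keywords))
print(should_show_ad("postie enterprise mail", keywords, negative_keywords))
```

With “plugin” as a negative keyword, the first query is filtered out while the second still qualifies for the ad.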

Google SEO Update On March 2024: Up 314%

If you’re interested in search optimization, you’ll know about Google’s new search update that was released in March 2024. Per Google, the search update is intended to weed out low-effort sites, sites with a ton of AI content, affiliate review sites, and so forth. A good outline of what went on in this update is here.

In short, a lot of chaos occurred. Major publications are reporting pretty severe drops in traffic; smaller sites are reporting traffic drops of greater than 90%. Here’s a fun quote:

BBC News, for example, was among the sites that saw the biggest percentage drops, with its site losing 37% of its search visibility having fallen from 24.7 to 15.4 points in a little over six weeks. Its relative decline was second only to Canada-based entertainment site, Screenrant which saw its visibility fall by 40% from 27.6 to 16.7.

https://pressgazette.co.uk/media-audience-and-business-data/first-google-core-update-of-2024-brings-bad-news-for-most-news-publishers/

There’s a lot of doom and gloom about this update, but I’m really liking it. I’m seeing a lot of very interesting stuff float up on my Google searches that normally would be buried. In particular I’m seeing fewer “top 10 XYZ” type webpages and more links to opinion websites such as Reddit and other forums.

And then there’s this: one of my websites is reporting 314% more clicks from Google search.

I run a small blog (not this one) which is basically a tumblelog-style fan blog for a specific consumer-goods company. It really doesn’t do much except repost funny pictures and interesting articles. The blog typically gets about 100 clicks a month from Google search – which never ceases to amaze me, especially since the site itself is so simple.

With that in mind, I was shocked to suddenly see a burst of emails over the past month congratulating me over a sudden rise in traffic:

A sample of the emails:

What on earth is going on? A quick view of my search console shows the truth:

I’m not making any larger point here; it’s just interesting to see how fast things can change during a search core update.

Task Queue Fun: DeadlineExceeded

I always love pointing out fun errors – where I define “fun error” as an error that is intermittent/happens rarely in the context of regular operation in the application. Those errors are always the most fun to fix.

Today I was poking through some of my toy applications running on Google Cloud when I saw this:

And the text only:

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
._end_unary_response_blocking ( /layers/google.python.pip/pip/lib/python3.7/site-packages/grpc/_channel.py:1003 )	-	Jan 23, 2024	22 hours ago	-

DeadlineExceeded: 504 Deadline Exceeded
.error_remapped_callable ( /layers/google.python.pip/pip/lib/python3.7/site-packages/google/api_core/grpc_helpers.py:81 )	-	Jan 23, 2024	22 hours ago

Hmm – so an error occurred 22 hours ago that had previously occurred almost 4 months ago (Jan 23, 2024). Doesn’t sound very important. But just for the laughs, let’s dig in.

Of the two errors, I know that the first one (_InactiveRpcError) is most likely due to a connection being unable to complete. That happens all the time in the cloud (servers get rebooted, VMs get shuffled off to another machine, etc.), so it’s not a serious problem. The DeadlineExceeded one concerns me, though, because I know this application connects to a bunch of different APIs and does a ton of time-consuming operations, and I want to make sure that everything is back to normal.

So here’s the view of the error page:

So I know that the error occurs somewhere while communicating with Google services, since it pops up in the google api core library. Let’s hop on over to logging and find the stack trace – I’ve redacted a line that isn’t relevant to this post:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.7/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.7/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.7/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.7/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  [REDACTED]
  File "/srv/main.py", line 331, in launch_task
    task_creation_results = client.create_task(parent=queue_prop, task=task)
  File "/layers/google.python.pip/pip/lib/python3.7/site-packages/google/cloud/tasks_v2/services/cloud_tasks/client.py", line 2203, in create_task
    metadata=metadata,
  File "/layers/google.python.pip/pip/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.7/site-packages/google/api_core/timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 81, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded

If you missed the culprit in the above text, let me help you out: the call to the Google Task Queue service on line 331 of my application ended up exceeding Google’s deadline, and threw up the exception from Google’s end. Perhaps there was a transient infrastructure issue, perhaps task queue was under maintenance, perhaps it was just bad luck.

File "/srv/main.py", line 331, in launch_task
    task_creation_results = client.create_task(parent=queue_prop, task=task)

There’s really nothing to be done here, other than maybe catching the exception and trying again. I will point out that the task queue service is surprisingly resilient: out of tens/hundreds of thousands of task queue calls over the past 5 months that this application has performed, only 2 tasks (one in January 2024, one yesterday) have failed to enqueue. More importantly, my code is functioning as intended and I can mark this issue as Resolved or at least Muted.

Now honestly, this is a “my bad” sort of situation. If this were a real production app, there should be something catching the exception. But since this is a toy application, I am absolutely fine tolerating the random and thankfully very rare failures in task queue.
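For what it’s worth, here’s a minimal sketch of what “catching the exception and trying again” could look like. The helper name, attempt count, and delays are all assumptions for illustration, not code from my application. To keep the example runnable without the Google libraries, it uses a stand-in exception and a fake flaky function; in the real app, the callable would wrap client.create_task(parent=queue_prop, task=task) and the exception to retry on would be google.api_core.exceptions.DeadlineExceeded from the traceback above.

```python
import time


def call_with_retries(func, retry_on, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); if it raises `retry_on`, wait and try again.

    The wait doubles after each failed attempt. If the final attempt
    also fails, the exception is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


class FakeDeadlineExceeded(Exception):
    """Stand-in for google.api_core.exceptions.DeadlineExceeded."""


calls = {"count": 0}


def flaky_create_task():
    """Hypothetical task-creation call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeDeadlineExceeded("504 Deadline Exceeded")
    return "task-created"


# Pass a no-op sleep so the demo finishes instantly.
result = call_with_retries(flaky_create_task, FakeDeadlineExceeded, sleep=lambda s: None)
print(result, calls["count"])
```

One caveat: a deadline error doesn’t prove the request never arrived, so a blind retry could enqueue the same task twice. Whether that matters depends on whether the task handler is safe to run more than once.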