Initial Thoughts On Comparing Tweets

Note: some of this code is old; it could be rewritten to benefit more from SQL’s windowing and other advanced functions.

I’ve been poking around some old code of mine recently – since it’s going unused, maybe someone on the Internet would benefit from it. A few years ago, I was toying with some code to detect Twitter trending tweets. Consider a SQL table set up like the below, where user is the name of Twitter user posting the tweet, ID is the tweet ID, created_at is the creation date of the tweet, and retweet_count/favorite_count are the number of retweets and favorites the tweet had.

CREATE TABLE twitter (
    category VARCHAR(30),
    user VARCHAR(30),
    id VARCHAR(90), 
    created_at DATETIME,
    retweet_count INT,
    favorite_count INT,
    add_date DATETIME
);

The trend finder worked like this: it would take snapshots of each Tweet’s retweet and favorite count (add_date is the datetime of the snapshot) and then compare the snapshots to detect how fast the tweet was gaining retweets/favorites. The comparison was done in SQL. I named the gain rate of retweets/favorites the slope of the activity, and calculated multiple slopes: the increase in retweets/favorites within the most recent 90 minutes, and the overall increase of retweets/favorites. Here’s the simplified SQL (other SQL has been cut off because this is already too long, and not relevant to the point I’m making):


insert into tweet_slope 
(

select twitter.created_at, twitter.category, twitter.id, twitter.retweet_count as max_retweet, 
 twitter.favorite_count as max_favorite, twitter.add_date, new_table.max_date, 
 first_table.first_date, first_table.first_retweet, first_table.first_favorite, 
 prior_table.prior_date, prior_table.prior_retweet, prior_table.prior_favorite,
                                                                                                    	
(twitter.retweet_count - first_table.first_retweet) as full_retweet_diff,
TIMESTAMPDIFF(MINUTE, first_table.first_date, new_table.max_date) as full_retweet_timespan,
((twitter.retweet_count - first_table.first_retweet) / 
TIMESTAMPDIFF(MINUTE, first_table.first_date, new_table.max_date)) as full_retweet_slope,
(twitter.retweet_count - prior_table.prior_retweet) as recent_retweet_diff,
TIMESTAMPDIFF(MINUTE, prior_table.prior_date, new_table.max_date) as recent_retweet_timespan,
((twitter.retweet_count - prior_table.prior_retweet) / 
TIMESTAMPDIFF(MINUTE, prior_table.prior_date, new_table.max_date)) as recent_retweet_slope,
(twitter.favorite_count - first_table.first_favorite) as full_favorite_diff,
TIMESTAMPDIFF(MINUTE, first_table.first_date, new_table.max_date) as full_favorite_timespan,
((twitter.favorite_count - first_table.first_favorite) / 
TIMESTAMPDIFF(MINUTE, first_table.first_date, new_table.max_date)) as full_favorite_slope,
(twitter.favorite_count - prior_table.prior_favorite) as recent_favorite_diff,
TIMESTAMPDIFF(MINUTE, prior_table.prior_date, new_table.max_date) as recent_favorite_timespan,
((twitter.favorite_count - prior_table.prior_favorite) / 
TIMESTAMPDIFF(MINUTE, prior_table.prior_date, new_table.max_date)) as recent_favorite_slope
 from (select id, max(add_date) as max_date from twitter group by id) as new_table
 inner join twitter on twitter.id=new_table.id and twitter.add_date=new_table.max_date
 INNER JOIN 
(select twitter.id, twitter.add_date as first_date, twitter.retweet_count as 
 first_retweet, twitter.favorite_count as first_favorite from twitter inner 
 join (select id, min(add_date) as min_date from twitter group by id) as old_dates 
 on twitter.id=old_dates.id and twitter.add_date=old_dates.min_date) as first_table
 
 on twitter.id=first_table.id
 INNER JOIN 
 (select twitter.id, twitter.add_date as prior_date, twitter.retweet_count as 
 prior_retweet, twitter.favorite_count as prior_favorite from twitter INNER JOIN 
 (select id, max(add_date) as prior_set_date from twitter where add_date <  
 DATE_SUB(NOW(), INTERVAL 90 MINUTE) group by id) as prior_set 
 ON twitter.add_date=prior_set.prior_set_date and twitter.id=prior_set.id) as  prior_table 
 ON twitter.id=prior_table.id
 where twitter.created_at > DATE_SUB(NOW(), INTERVAL 1 DAY) AND 
 new_table.max_date > DATE_SUB(NOW(), INTERVAL 45 MINUTE) AND 
 new_table.max_date <> first_table.first_date
and twitter.category = :category
)
;

Now this code worked, and it worked great! A fast DB machine made all the difference to crunch out all the SQL calculations. I tested it by sampling tweets from sets of related accounts. One set I tested were all Disney/Marvel/Star Wars-related Twitter accounts (for instance, the accounts of actors for Luke Skywalker, Iron Man, Disney Parks) and the top-ranking tweets were almost all from Mark Hamill’s account; on the day I was testing, Mark’s tweets had thousands of favorites/retweets due to being politics-related.

I said this code was old at the beginning of the post, and that’s true – it has since been replaced. While this code worked, it was too simple: in practice, it would never surface some trending tweets from less-popular Twitter accounts.

The real problem is that the popularity of tweets is a bimodal distribution: most tweets end up puttering along with, at best, a handful of retweets/favorites; some might even get a few hundred. But then there’s a small minority of tweets, posted from extremely popular accounts, that get thousands and thousands of favorites and retweets regardless of the individual tweet’s merit – the tweets themselves are popular because the poster is popular. This crowds out the trending tweets from the other section of the bimodal graph.

I rebuilt the application to take into account the number of followers a Twitter user has (as a proxy for the account’s relative popularity) and to compare a tweet to the account’s tweet history (how popular is a given tweet compared to the other tweets in the same account?). I’ll share the code soon.

AppSheet

I’m pretty bullish on all no-code/low-code automation services such as IFTTT, Zapier, AWS HoneyCode, and all the startups in this sector. Software is eating the world, and it’s not hard to see the piles of cash being shoveled to the first company who can drastically simplify a lot of current IT operations.

That’s one of the reasons I’m pleased to see Google introducing more automation within AppSheet and Apigee: https://techcrunch.com/2020/09/08/google-cloud-launches-its-business-application-platform-with-apigee-and-appsheet/ . I’ve been toying around with AppSheet to help automate some work, and it is a surprisingly fun and agile platform to work on.

I really like the deep integration with Google Sheets: AppSheet makes it easy to quickly deploy a proof of concept app or a low-traffic production application and store data within a Google Sheet. Better yet, AppSheet can analyze your data and set up an app template automatically. When I first started AppSheet, it automatically imported a spreadsheet “Post Mapping” stored in my Google Drive: it was a listing of bookmarks and dates I had been saving.

A view of AppSheet’s introductory page.

When I clicked on the Post Mapping file, AppSheet automatically created table and calendar views of my data:

One thing I’m surprised not to see is integration with Firebase. Firebase is – IMO – the easiest to use backend for mobile and web apps, and yet AppSheet doesn’t integrate it. At the very least I’d like to see AppSheet borrow Firebase’s authentication options.

In short, AppSheet is definitely something to keep your eyes on.

Fun With Google Trends: AP CS Edition

I was amused at this recent Slashdot posting: Google Searches For ‘Java’ Spiked During Friday’s Online AP CS Exam. Slashdot includes this interesting chart, showing traffic trends for Java searches peaking around the time of the AP CS test:

Credit: Slashdot

Apparently the first AP CS question was about java.util.ArrayList, and the Google Trends chart for ArrayList shows the same bump on Friday (this chart shows the search interest for ArrayList over the last 3 months):

Credit: Google Trends

It’s always fun to see how Google’s search volume changes depending on external events. Try looking at Google Trends whenever a major event occurs and watch the keywords people use.

Tinkering With WP-CLI, Google Cloud, and BlueHost

I’ve been spending the last few days wrapping the WP-CLI application – a command line program to automate the administration of WordPress installations – inside a Java app so I can automate some WordPress work. One of the major bottlenecks was fixing up the correct SSH string to connect to the various WordPress providers.

Initially I was having a bit of difficulty because I misread the wp-cli documentation and I thought the –user argument was the SSH username. When I got that fixed, it turns out that some WordPress hosting services, such as BlueHost, require you to contact their support to activate SSH. I had to connect my application to multiple WordPress hosting services, but in this post I’ll use BlueHost as an example since they’re fairly representative of the work I had to go through.

In BlueHost’s case, having support activate SSH support was a surprisingly painless process – it only took a quick 5 minute text chat where I verified my email address. To build out the proper SSH command, I also needed to look at details provided by cPanel:

Click to expand.

All the information you need to build the SSH string is in the General Information section in the above screenshot. Your SSH string should look like this:

php wp-cli.phar plugin list --ssh=<CURRENT_USER>@<SHARED_IP_ADDRESS>:2222/<HOME_DIRECTORY>/public_html --debug

The BlueHost account I’m using as an example is a shared WordPress account, so it listed a shared IP. Make sure to double check the port number (2222 in the above code sample) – the usual SSH port is 22, but BlueHost uses 2222 for shared accounts. Note that I’ve listed an additional folder under the user home directory; the home directory path only takes you to the user’s home directory, but wp-cli needs the path to the WordPress installation, which is usually under another folder (in this case /public_html/ ).

Why You Need To Buy Google Ads – GrubHub Edition

An incredibly thought-provoking article showed up today, written by a current restaurant owner – but one that has experience in technology circles. I strongly recommend reading it: https://www.saddlebackbbq.com/how-google-doordash-grubhub-conspire-screw-local-restaurants .

Obviously the top-line theme is how GrubHub and DoorDash take a large proportion of the monies from online orders. But there is a lesser theme that I want to emphasize which is easy to lose in the outrage. Quoted from the article:

DoorDash pays Google an advertising fee to steal customers that are searching for our restaurant name “Saddleback BBQ” and they are redirecting them to their own page. From there a customer can purchase from any BBQ restaurant in Lansing…

From https://www.saddlebackbbq.com/how-google-doordash-grubhub-conspire-screw-local-restaurants

Here’s the takeaway: While SEO is important, it’s not the end of the line. You could be the top result for your keywords, for your own business name (which was the case for this BBQ restaurant), and competitors will still take business away from you by running ads on your own name. This shows the power of a Google ad, especially an ad within Google’s Knowledge Panel (the side panel that provides contextual information).

That’s why it’s so critical to buy Google Ads on even searches you’re the top result in: to get rid of competitor ads that are trying to redirect business that should be yours. It might sound like a lot of extra money to spend, but as long as you provide a interesting landing page for the user the additional cost should not be too much.

One of the recurring themes I like to touch on with this site is how important ads revenue is to Google. As Google optimizes their search to show more local and “hyperlocal” content, more space will open up to show ads. Businesses will have a choice: buy up those ad slots, or their competitors will.

That’s why you need to be buying Google ads.

Lottery Error Message & Consultant Thoughts On Food

I love collecting samples of error messages, especially branded error messages. Today’s sample comes from the Illinois Lottery website – or to be more exact, their iOS app which frames a web page:

A screenshot of the Illinois Lottery iOS app, showing an error page. I quite like the image of an unplugged power cord to depict an error.
A screenshot of the Illinois Lottery iOS app, showing an error page. I quite like the image of an unplugged power cord to depict an error.

As I’ve discussed before, I’ve spent quite a few years as a consultant – while working as a consultant, I loved staying at DoubleTree hotels: their prices were fair, with good amenities targeted towards business travelers. That’s why I’m pleased to see that DoubleTree recently published their recipe for cookies – they were always a tasty treat when checking in.

Speaking of recipes, I’m also pleased to note that Chipotle published their recipe for guacamole. I enjoy eating at Chipotle when I travel: it’s a good, healthy helping of food in an easily portable burrito.

Chipotle tweet showing their recipe for guacamole.

Acquisition Thoughts

Lately we’ve seen some aggressive moves by Microsoft to pick up developer mind-share. The purchase of GitHub was the opening salvo, but Microsoft made a number of smaller moves recently as well: private repositories available for free on GitHub (previously you needed a student or paid account to have private repos), the acquisition of NPM (Javascript package manager and registrar), and GitHub Actions: a way to automate developer workflows – similar to a developer focused version of IFTTT/Zapier. Part of the new GitHub Actions makes it easier and faster to deploy code on Microsoft Azure.

Recently GitHub announced a huge drop of new features, the most important of which is Codespaces (an online IDE) and Discussions (a place to host community discussions). It’s clear to everyone that Microsoft is playing the long game in its war against Amazon Web Services: Microsoft is buying up developer mindshare, making it easier and faster to discuss, manage, and deploy applications on Microsoft services.

This leaves Google’s cloud platform in a difficult bind: how to compete against all these offerings? AWS is by far the market leader in the cloud game, with Microsoft in a strong 2nd place position and having strong Enterprise-size deployment credentials, plus increasing ownership of the development process. We’re beginning to see Google’s counter moves: recently they announced the release of a code editor within Cloud Shell. Here’s a screenshot as of today:

Screenshot of Google Cloud Shell’s code editor. Admittedly it’s a bit boring – it’s not as fully featured as AWS’s Cloud9 online IDE.

The bottom line is, Google Cloud needs to start opening up its checkbook if they want to compete with the options Microsoft and Amazon are developing/purchasing. Here’s a couple of companies I think Google should seriously consider purchasing:

  1. GitLab – GitLab is the perfect counter to Microsoft’s purchase of GitHub; GitLab includes the social coding aspect of GitHub, plus automated integration tools to compete with GitHub Actions
  2. An online IDE – There are a number of online IDEs available. I’ve been trying out a number of them and personally, I quite like GitPod.
  3. AI Services – Google needs something to differentiate its cloud platform compared to Microsoft Azure and AWS. An interesting play would be to lead the burgeoning AI industry. There are a number of players in this space, but a good starting acquisition would be Diffbot – it supplies APIs for structured extraction of text from a webpage and understanding context.

Some other acquisitions that I think make sense, especially in light of the current economic troubles driving down valuations:

  1. A note service to enhance GSuite. Google Keep exists, but it’s not as advanced as Evernote.

    With a $1 billion valuation, Evernote may be too expensive to purchase, but there are a number of smaller competitors that still provide a great note-taking experience. A good example would be Notion.
  2. AirTable. I’m surprised that nobody has acquired AirTable yet – it’s a marvelous new take on how spreadsheets and databases can be visualized.
  3. Automation tooling. Services such as IFTTT, Zapier, Integromat, Automate.io are the “glue” that can connect disparate services together. For instance, you can configure a new WordPress post to be saved to Google Drive/Dropbox – or any of hundreds of different web services.

    A purchase of IFTTT or similar service immediately buys integration into many different web services, plus allows a deeper integration into Google products. Imagine making it easier to share your favorite YouTube clip to any social media you have.

For New Engineers, A Reminder…

I find that many new software engineers are nervous when they start a new job – there’s always so much to learn: every software company does things a little differently or possibly radically differently than the “best” practices taught at university. It’s good to take a step back and remember that everybody messes up once in a while.

I was quite amused today when browsing HN and saw this article: I shipped a word processor that formatted the hard drive every 1024 saves. It links to a Twitter thread reproduced below: https://twitter.com/GeePawHill/status/1256342997643526151 (Click image below to expand).

A twitter thread where a programmer discusses a bug wherein a word processor overwrote the hard drive's master boot record with garbage, making the computer unable to boot.

Everyone makes mistakes – the important thing is to be able to learn from them. If you’re a new intern or software engineer, my advice to you is to ask lots of questions: in my experience, people don’t mind asking questions as long as the questions show that you’re growing and learning.

Googlebot Cannot Scroll; Infinite Scroll Doesn’t Help SEO

I saw a fascinating article in today’s Search Engine Journal: Google’s Martin Splitt Explains Why Infinite Scroll Causes SEO Problems. Read it for some background information, but the bottom line is that Googlebot (the Google web crawler/indexer) does not scroll web pages, which means that any content exposed via infinite scroll is not indexed.

Infinite scroll can also cause other problems: some screenshot browser addons and services have difficulties rendering infinite scroll web pages. It’s easy to love infinite scroll as a user since it gives the illusion of infinite content, but it can be a nightmare for automated services.

The fix: Make sure that all content on a web site can be accessed without using the infinite scroll function. Also, send Google a sitemap so it knows where all the valid URLs are: https://support.google.com/webmasters/answer/183668?hl=en .

A Google Credit Card?

TechCrunch is covering an upcoming Google product, a debit card under the umbrella of Google Pay. `

Branded credit cards bring in a surprising amount of money – the Apple credit card is projected to earn Apple $1 billion annually with very little risk. I think it’s notable that this is a debit card and not a credit card – a debit card charges purchases against a checking account (in other words, money is there before you make your purchase) while credit card purchases are essentially a short term loan. By making it a debit card, Google reduces their risk by requiring that consumers already have money in their checking accounts to charge against. Additionally, Google and its partner bank earn interest on that checking account’s balance.

This is another way of diversifying Google’s income stream away from Ads; I wouldn’t be surprised if we see the Google credit link to benefits on other Google properties – for example, free or discounted YouTube premium, discounted Google Home devices, etc.