Richard Towers

My first weeknote

This year I’m going to try writing personal weeknotes.

Each week I’ll look back at the things I did (mostly work things), and write a summary. I hope this will be an opportunity for me to think about how I spend my time, and a record for me to look back over. I’m keen to improve my writing style too, and there’s nothing like regular practice.

There are a few folk at work who have been writing weeknotes for years. I think it’s amazing to be able to maintain that habit, and their notes must be a great asset for them. Thanks in particular to Steve, Michael and Tobi for the inspiration.

So, Week 1, 2021. How did it go?

It was a very tough start to the year.

Outside work

The coronavirus pandemic in the UK is reaching terrifying levels, with signs that things will get much worse before they get better.

In the USA there were unbelievable scenes. President Donald Trump’s supporters stormed Congress in an attempt to prevent it from certifying the election results.

If reality had been a film, I would have pointed out some gaping plot holes. The idea that the security of the Capitol building could be overwhelmed by a disorganised, lightly armed mob, simply by pushing their way in? Ludicrous. And yet, here we are.

“Big Tech” has been in the news, with Twitter permanently banning the president of the USA from their platform, and Amazon Web Services planning to shut down the hosting for Parler (a less moderated social network, favoured by extremists). This has brought a sharp focus to the ethical implications of working in technology, and perhaps to the power of workers and unions to influence the behaviour of these huge companies.

On Wednesday I attended my grandmother’s funeral. Despite all the restrictions in place, it was good to be together with (although socially distant from) my family. We remembered a wonderful person, and a life filled with optimism and love. I read some verses from Philippians, and managed not to stumble over the words.

At work

Scaling and observability

We spent the end of last year worrying about scale. Regular coronavirus press conferences drove floods of traffic to GOV.UK, with people trying to work out what the rules would be in their area.

This week saw possibly the biggest spike in traffic GOV.UK has ever handled. At the peak, our content delivery network was serving 88,000 requests per second.

For the most part, everything worked perfectly. We served 17,000,000 pages, of which only 21 were errors.

Behind the scenes though, I felt our observability could have been better. Although we have lots of monitoring (metrics, logs, alerts, analytics, smoke tests etc.), there never seems to be a single place for people to look during these interesting times.

Matt shared on Twitter some of the graphs we were watching. Initially, we were looking at analytics. That’s flawed because it’s opt-in, but it shows the trends in real time. At this scale though, our analytics provider decided that discretion was the better part of valour and disabled real-time monitoring.

So we switched to looking at CDN metrics. These are great, because they’re the truest measure of the scale of traffic we’re serving to users. But we don’t pull real-time CDN metrics into our main dashboards yet, so they’re yet another system people have to sign in to before they can see what’s going on.

CDN metrics also don’t tell us which pages are seeing the most traffic. For an authoritative view on that, we use CDN logs. These are in another system again.

Because this is all a bit messy, I spent a fair bit of the week thinking about how we can improve our monitoring. I upgraded our Prometheus instance, and fixed a longstanding issue which meant we couldn’t use Prometheus from Grafana.

I had a play with a Prometheus exporter for Fastly’s real-time analytics API, which would allow us to see live metrics from our CDN on our primary dashboards. I wonder whether a Fastly datasource for Grafana might be a more elegant solution though.
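
For the curious, here’s a rough sketch of the shape of that kind of exporter in Python. It’s not the exporter I was playing with, and the rt.fastly.com endpoint, the Fastly-Key header and the field names are from memory rather than checked against the docs, so treat them as assumptions. The idea is just to poll the real-time API once a second and expose the aggregated counts on a /metrics endpoint for Prometheus to scrape.

```python
# A minimal sketch, not the real exporter: poll Fastly's real-time analytics
# API and expose the aggregated per-second counts as Prometheus gauges.
# The rt.fastly.com URL, the Fastly-Key header and the "requests" /
# "status_5xx" field names are assumptions from memory; check Fastly's docs.
import os
import time

import requests
from prometheus_client import Gauge, start_http_server

FASTLY_KEY = os.environ["FASTLY_API_TOKEN"]    # hypothetical env var names
SERVICE_ID = os.environ["FASTLY_SERVICE_ID"]

requests_per_second = Gauge("fastly_requests_per_second", "Requests served per second")
errors_per_second = Gauge("fastly_errors_per_second", "5xx responses served per second")


def poll(timestamp=0):
    # timestamp=0 asks for the most recent data; the response tells us the
    # timestamp to use for the next request, so we don't re-read a second.
    response = requests.get(
        f"https://rt.fastly.com/v1/channel/{SERVICE_ID}/ts/{timestamp}",
        headers={"Fastly-Key": FASTLY_KEY},
        timeout=10,
    )
    response.raise_for_status()
    body = response.json()
    for datapoint in body.get("Data", []):
        aggregated = datapoint.get("aggregated", {})
        requests_per_second.set(aggregated.get("requests", 0))
        errors_per_second.set(aggregated.get("status_5xx", 0))
    return body.get("Timestamp", timestamp)


if __name__ == "__main__":
    start_http_server(9118)  # serves /metrics for Prometheus to scrape
    next_timestamp = 0
    while True:
        next_timestamp = poll(next_timestamp)
        time.sleep(1)
```

Part of the appeal of a proper Fastly datasource for Grafana is that it would cut out this middle layer entirely and let Grafana query Fastly directly.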

Later in the week I attended an incident review about a different traffic spike that hadn’t been handled quite so well. There are still some uncomfortable unknowns about how and when our systems fail when under unprecedented load. We plan to do some systematic load testing to understand where the bottlenecks are.

Thinking about strategy

As a little side project, I’ve been taking GOV.UK’s flagship content management system (imaginatively called “content-publisher”) and hacking it so I can use it to publish my personal blog instead of content on GOV.UK. It’s not quite ready to be used for real yet, but most of the hard work is done.

I demoed this to my opposite numbers in Product and Delivery, and it provoked an excellent discussion.

Firstly, having the application running as a monolith (without GOV.UK’s many other microservices) raises a good question about how complicated our infrastructure really needs to be. Yes, a blog is not as complicated as a national government website, but how much harder can it really be? If there’s a complexity / functionality trade-off, are we getting it right at the moment?

Secondly, we discussed the number of bespoke page designs we’ve had to produce over the last couple of years for Brexit and coronavirus. Because none of our content management systems quite fits the bill for this kind of content, producing these pages has cost us a lot of engineering time. Content Publisher is nearly there though, so perhaps it’s worth revisiting this user need.

I’m going to demo the same thing in our next GOV.UK Tech Fortnightly meeting. Many of my colleagues are a lot more familiar with GOV.UK than I am, so I hope this will lead to another good discussion.

Replatforming

The GOV.UK Replatforming team are working out how to host GOV.UK’s services on a managed container platform, namely AWS ECS Fargate (other hyperscale cloud providers are available).

I’ve been helping them refactor their infrastructure-as-code to try to keep it as simple as possible. I finished my refactoring of our content-store deployment, and did some pair-programming on the same refactoring for the “publisher” application.

The way Terraform and the AWS ECS API interact around task definitions makes it really difficult to do what we want to do in a nice way. This is bothering me, but we’re yet to come up with a better solution than the one we’re implementing at the moment. We had some conversations about other tools, like Pulumi, CUE, and Dhall, but I think adding a new piece of configuration management technology to our stack is probably a nuclear option.
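
To show roughly where the friction comes from, here’s the ECS side of the dance sketched with boto3 (hypothetical names throughout; this isn’t our real configuration). Task definitions are immutable, so every deployment registers a brand new revision and repoints the service at it, which means Terraform sees revisions registered outside it as drift from its own state.

```python
# A rough illustration with boto3 (hypothetical names, not our real config).
# Task definitions are immutable: registering "the same" definition again
# still creates revision n+1, and the service points at one exact revision
# ARN, so whichever tool registered last owns the truth, and Terraform sees
# revisions created by a deployment pipeline as drift.
import boto3

ecs = boto3.client("ecs")

# Every call to register_task_definition creates a new revision...
new_revision_arn = ecs.register_task_definition(
    family="content-store",  # hypothetical family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/content-store:new-tag",  # hypothetical image
            "essential": True,
        }
    ],
)["taskDefinition"]["taskDefinitionArn"]

# ...and the service then has to be repointed at that exact revision.
ecs.update_service(
    cluster="govuk",  # hypothetical cluster name
    service="content-store",
    taskDefinition=new_revision_arn,
)
```

Most of the workarounds I’ve seen boil down to either telling Terraform to ignore changes to the task definition, or making Terraform the only thing that ever registers revisions, and neither feels particularly nice.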

Other things

We’re getting the ball rolling again on a developer recruitment campaign we kicked off last year. We need to schedule lots of phone interviews, and start lining up face-to-face interviews. Although I made some progress, most of this still needs to get sorted out next week.

End of weeknote

… well, that was a lot longer than I expected it would be. The next ones will have to be shorter, or I’m never going to sustain this habit.

If you can believe it, I actually felt a little unproductive last week, but looking back I think I got a lot done, especially as circumstances weren’t ideal.

Next week I hope to get our recruitment moving again, provoke some interesting thoughts and conversations, and make some progress with my observability work and the infrastructure-as-code refactoring.