Richard Towers

Oops, I did it again

I think I got myself stuck between the boat and the dock this week.

I spent most of my time writing code, shepherding out releases, and helping out with an incident.

If you sent me an email, I probably haven’t read it yet. Sorry about that.

Outside work

Things were bad last week, and they’re still bad now. These things don’t change overnight.

That said, this week it felt like things were changing direction. Getting better, instead of worse.

The coronavirus pandemic in the UK is still horrifying, but there’s a steady trickle of good news about vaccines. Some people I know got the vaccine this week, which reassures me that it’s really happening. That there really might be some light at the end of the tunnel.

South Ayrshire Golf Club owner and current US President Donald Trump has been Impeached for a historic second time. Next week, on my birthday, his successor is due to be inaugurated. Good luck Joe and Kamala, it’s not going to be easy.

Our Christmas tree is still up, even though we’re more than halfway through January now. Perhaps I’m still not quite ready for the year to start in earnest. Perhaps it’s just nice to look at.

Claire and I are about to open a bottle of Folklore wine, and watch the Taylor Swift documentary of the same name. I was never a big fan of Taylor’s before the lockdown, but Folklore is a nice album. I’m converted. You certainly can’t argue with her levels of lockdown productivity.

At work

New colleagues, new friends

This week at GDS felt like an early spring. An era of new birth.

Lots of high profile appointments were announced. Tom Read joins as GDS’ new CEO. Ross Ferguson returns to GOV.UK as a new Deputy Director, joining Rachel Tsang who’s promotion was announced the week before. Meanwhile, Paul Willmott and Joanna Davinson take on the new Central Digital and Data Office.

Like everyone I’ve spoken to, I’m really excited and positive about working with all of these amazing people.

Rumour has it that we might have a new Head of Technology and Architecture for GOV.UK announced soon too, who’ll be my new boss. Very exciting!

Synergy

A few people actually read my last weeknote, which is cool. Thanks everyone!

In a couple of cases, this led to “synergy”.

Obviously I have no idea what the word synergy means. But talking about my work let some people spot that I was thinking about the same things as them. This led to some great conversations.

Using GOV.UK’s publishing tools to publish my blog

I mentioned last week that I’d been playing around with some of our open source software, to see if it might be useful outside GOV.UK’s niche. I’m still not actually using it to publish my actual blog yet, but it’s nearly there.

This week I demoed this to the other software developers and technical architects on GOV.UK at our fortnightly catch up. I’m going to do another demo to the rest of GDS’ technology community next week.

As expected, people had some interesting thoughts about it.

I was surprised that people didn’t mount a more spirited defence of GOV.UK’s microservices architecture. I was particularly rude about the publishing-api service, and I had hoped that someone would leap to its defence. Perhaps they’d enlighten me on all the value it brings us. Instead, several people joined me in sticking the boot in. Poor publishing-api.

Things are never as simple as they seem in a 10 minute demo, but I think it got people thinking about the trade-offs in our architecture.

I want technical architecture to feel like something the team owns, not something that’s imposed on us by history. If we don’t feel like it’s working for us, we should change it.

Easier said than done.

Oops

On Tuesday night, we had an incident.

We now know that one of our virtual machines had a Linux Kernel “Oops”, which left it in a state where it couldn’t handle network traffic properly.

Our engineers got paged, and took the instance out of its autoscaling group. That stopped it from receiving traffic, the alarms quietened down, and the engineers went back to bed.

Unfortunately, one service that runs on this machine is an annoying edge case. Taking it out of the autoscaling group without terminating it causes a cascading failure in another part of the system.

This secondary failure went undetected until the following morning, when people got to work and started trying to publish content.

I think I’ve seen this film before

And I didn’t like the ending

We’d seen the same thing happen before so it didn’t take us too long to work out what was wrong and fix it. But I can’t help kicking myself - last time this happened we took an incident action to fix the underlying issue. That ticket is a couple of weeks from the top of our backlog. If we’d got there sooner, we might have avoided the secondary incident.

At the moment this service is a bit like a patient with a penicillin allergy. Medicine that would make any other service better makes this one worse. This can make dealing with incidents uncomfortably similar to an episode of House M.D.

Observability

I mentioned in my last weeknote that I’d been tweaking our monitoring systems to try to improve our observability. This week I got to have a play with the shiny new toys they bring.

My favorite discovery so far is this latency heatmap, which shows the distribution of response durations over time in our frontend services:

Screenshot of a Grafana dashboard showing colourful, purple blue and green latency heatmaps of GOV.UK's frontend microservices

Time, wondrous time

Gave me the blues and then purple-pink skies

Grafana is a wonderful piece of software. Unfortunately, the Grafana we use in our infrastructure needs a major version upgrade, so I’ve only been able to get this dashboard working locally. Upgrading Grafana felt like one yak shave too far for this week.

Galaxy Brain Refactoring

I’ve been helping the GOV.UK replatforming team refactor their infrastructure-as-code to try to keep it as simple as possible.

There was one particularly tricky bit of this that was bothering me a lot.

This week I think I came up with an acceptable workaround, which I got very excited about.

I had an absolutely galaxy brain idea last night for the terraform app deployments. It’s going to change our lives.

Don’t want to oversell it, but it’s the best idea that’s ever been thought up.

It’s important to stay grounded and modest when you’re a lead developer.

When I tried to implement it as a proof of concept I found that there were other refactorings that needed to happen before I could express the idea cleanly.

So again, I ended up shaving yaks.

Still, I think I landed some really nice improvements that should make our lives a little bit easier going forwards. Sometimes removing layers of abstraction is better than adding them.

I should be able to get a proof of concept up next week. If the team agrees with the direction, then I’ll hand over some of the implementation work to them.

Other things

I volunteered some people to conduct phone interviews. We’ve got enough panels to get our recruitment moving again now. Before too long, we should be able to welcome some new developers to the team! Very exciting. Keep an eye on our careers page if you’re interested in roles we might open in the future :)

I was extremely bad at emails this week, as I mentioned in the introduction. I let myself get totally absorbed with bits of tech work, and almost completely lost touch with the rest of my job. Not great, but perhaps I can get away with it as a one off. I’ll do better next week.

End of weeknote

Another epicly long weeknote.

At least this week I felt busy. Hopefully next week will be a bit quieter - I’m taking a couple of days off, so that should help.

If you got this far, thanks for reading. You clearly ooze stamina.

Cheers!

Photograph of a couple of glasses of "Folklore" wine, along with the bottle