A few notes on a positive engineering experience
2024-05-11

A while ago I led an engineering team at a company where I think the software department did an unusually good job overall, and my own team delivered some good work as well, so I wanted to note down the practices that I think contributed to that.

Things we did well in engineering

  • Autonomy to implement things in ways suitable for the team and product
  • Continuous integration/continuous delivery, feature flags, automatic time-gated launches, sandbox environment
  • Mostly reliable deployment pipeline with built-in PR checks
  • Security built in, mostly at the platform level plus things like security scanning
  • Monitoring (alerts, canary)
  • Observability - in our case it was 90% Sentry, but also logs and an admin UI that let us diagnose issues
  • Infrastructure as code (with the caveat that it still adds a lot of complexity; the cloud doesn’t save you time or money, etc.)
  • A really helpful and responsive platform team
  • Rescoping feature work to account for deadlines and delays
  • Generally running a tight ship: it’s far too easy to drown in FIXMEs, TODOs, deferred maintenance and deprecated dependencies, so it’s necessary to stay on top of them.

Things that worked well in my team

This is a mix of technical and process stuff:

  • For the most part, zero tolerance for bugs - fix them immediately; we could still do better on this one.
  • Very low levels of tech debt thanks to continuous refactoring, with maintenance and upgrade work done as part of sprints without having to be negotiated against feature work. In my experience this is the only way to avoid ending up with a pile of tech debt and deferred maintenance
  • Fairly extensive test suite including UI tests, which allows many changes to be released without extensive manual testing
  • A reliable test suite thanks to jumping on flaky tests and fixing them
  • Avoiding mocks - it’s best to exercise as much of the real code path as possible
  • Very lightweight processes optimised for minimal handovers and parallel streams of work matched to the team’s size. No story points, no long write-ups in cards, no backlog refinements, minimal ceremony around sprints (sometimes I do away with sprints altogether and go full Kanban, but I think sprint boundaries can be good psychologically, as a way of marking that we achieved something)
  • Tactical estimation when needed to figure out whether we’ll meet deadlines
  • Continuous improvement based on regular retrospectives (this works if concrete actions are assigned to people right there in the retro - otherwise it just turns into documenting wishlists, which is why I prefer not to keep a record of the retros)
  • Automation of coding rules and guidelines where possible (e.g. enforcing naming conventions within string event names, enforcing pure modules when marked as such, limiting the use of functions like System.get_env to specific modules - see the first sketch after this list)
  • Individual developers on the team taking ownership of implementing features starting from clarifying requirements all the way through to wording in the UI (in collaboration with the PM, the designer etc.)
  • Production incident post-mortems with actionable conclusions
  • Banishing TODOs from the code: at one point I noticed they kept spreading, so I started allowing only occasional time-limited TODOs backed by a linter check (e.g. for fixes that couldn’t be made until a particular dependency had a new release - see the second sketch after this list). All the rest either became a card in the backlog or was simply deleted.
  • Assorted little things that still may not be common everywhere: automatically enforced code formatting and style, automated DB migrations, generating API clients from OpenAPI specs.
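To make the rule automation above a bit more concrete, here is a minimal sketch of the System.get_env restriction written as a plain ExUnit test that scans the source tree. The paths and module names are hypothetical, and this is a reconstruction of the idea rather than the exact check we ran:

```elixir
defmodule CodingRulesTest do
  use ExUnit.Case, async: true

  # Only these files may read environment variables directly
  # (hypothetical paths - adjust to your project layout).
  @allowed ["lib/my_app/config.ex"]

  test "System.get_env is confined to allowed modules" do
    offenders =
      Path.wildcard("lib/**/*.ex")
      |> Kernel.--(@allowed)
      |> Enum.filter(fn path ->
        path |> File.read!() |> String.contains?("System.get_env")
      end)

    assert offenders == [],
           "System.get_env used outside allowed modules: #{inspect(offenders)}"
  end
end
```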
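The time-limited TODOs worked along similar lines: every TODO has to carry a date, and the check starts failing once that date has passed. Again a sketch of the idea, not our exact implementation:

```elixir
defmodule TodoExpiryTest do
  use ExUnit.Case, async: true

  # Matches comments like: # TODO(2024-06-01): remove once foo 2.0 ships
  @dated_todo ~r/TODO\((\d{4}-\d{2}-\d{2})\)/

  test "every TODO carries a date and none has expired" do
    today = Date.utc_today()

    for path <- Path.wildcard("lib/**/*.ex"),
        line <- path |> File.read!() |> String.split("\n"),
        String.contains?(line, "TODO") do
      case Regex.run(@dated_todo, line) do
        [_, date] ->
          {:ok, deadline} = Date.from_iso8601(date)

          assert Date.compare(today, deadline) != :gt,
                 "Expired TODO in #{path}: #{String.trim(line)}"

        nil ->
          flunk("Undated TODO in #{path}: #{String.trim(line)}")
      end
    end
  end
end
```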

Now, this might look like engineers being given free rein to tinker all day, but my team also acquired a reputation for delivering what we promised and for being good at estimation.

Things that I wanted to use or try but didn’t get to

I think some or all of these would have helped us do even better:

  • Property-based testing (sketched below)
  • Function contracts
  • Even more automated DB migrations via Terraform (I later investigated this and learned that it isn’t really feasible in practice: too much can be happening inside a DB migration, and the tooling for this kind of thing, such as atlas-go, is limited)
  • A structured approach to displaying redacted data for debugging (sketched below)
  • Explicitly modelling data states with FSMs (sketched below)
  • Monitoring more metrics instead of simply logging stuff and trying to extract metrics from the logs later (sketched below).
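To illustrate the property-based testing point: with the stream_data library, a property test in Elixir looks roughly like this. The round-trip property below is over a standard-library function rather than anything from our codebase:

```elixir
defmodule EncodingPropertyTest do
  use ExUnit.Case, async: true
  use ExUnitProperties

  property "Base.encode64/decode64 round-trips any binary" do
    # stream_data generates many random binaries, shrinking to a
    # minimal counterexample if the assertion ever fails.
    check all input <- binary() do
      assert input |> Base.encode64() |> Base.decode64!() == input
    end
  end
end
```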
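For the redaction point, the derivable Inspect protocol is the Elixir building block I would likely have started from; a minimal sketch with a hypothetical struct:

```elixir
defmodule User do
  # Fields left out of :only are rendered as `...` by inspect/2,
  # so they don’t leak into logs, consoles or error reports.
  @derive {Inspect, only: [:id, :name]}
  defstruct [:id, :name, :email, :password_hash]
end

# iex> inspect(%User{id: 1, name: "Ada", email: "ada@example.com"})
# "#User<id: 1, name: \"Ada\", ...>"
```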
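Explicit state modelling is easier to show than to describe. A rough sketch for a hypothetical order lifecycle, where every legal transition is listed and anything else is rejected:

```elixir
defmodule Order.State do
  # The single source of truth for which state changes are legal.
  @transitions %{
    draft: [:submitted, :cancelled],
    submitted: [:paid, :cancelled],
    paid: [:shipped],
    shipped: [],
    cancelled: []
  }

  def transition(%{state: from} = order, to) do
    if to in Map.fetch!(@transitions, from) do
      {:ok, %{order | state: to}}
    else
      # e.g. transition(%{state: :shipped}, :cancelled)
      # => {:error, {:invalid_transition, :shipped, :cancelled}}
      {:error, {:invalid_transition, from, to}}
    end
  end
end
```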
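And on metrics: the idea is to emit measurements at the call site (here with the telemetry library, to which a reporter such as TelemetryMetricsStatsd could subscribe) instead of reconstructing them from log lines later. The event name and measurements below are made up:

```elixir
defmodule Checkout do
  def complete(order) do
    {duration_us, result} = :timer.tc(fn -> do_complete(order) end)

    # Hypothetical event; a metrics reporter attached to
    # [:my_app, :checkout, :completed] ships the measurement onwards.
    :telemetry.execute(
      [:my_app, :checkout, :completed],
      %{duration: duration_us},
      %{order_id: order.id}
    )

    result
  end

  defp do_complete(order), do: {:ok, order}
end
```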