Using Codex to finish Algo

I started Algo about three years ago because I wanted to have a utility library like Ramda.js in Elixir. I had written maybe a quarter of it, but then life happened and I never found the time to finish it. I finally came back to it because I thought it would be a good experiment to try to complete it with Codex. This post summarises my experience.

Codex generated reasonably idiomatic Elixir code. I didn't need to write elaborate prompts, and Codex used Elixir tooling without any instructions. Compared with my experiments last year, when I generated Elixir code using older models, I got better results this time.

Codex was especially useful for documentation and refactoring. It was also able to do research on libraries I suggested and propose candidate functions for inclusion in Algo.

It did quite well with property-based testing. I asked it to propose tests using StreamData, and it came up with many of the properties I would have wanted: round-tripping path operations, preserving expected keys, validating output dimensions and so on.
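To give a flavour of what such properties look like, here is a minimal sketch written as a standalone script. It is not taken from Algo's test suite: `PathSketch.put_path/3` is a hypothetical stand-in built on Elixir's own `put_in/3` and `Access.key/2`, and the second property uses `Enum.zip/2` from the standard library purely to illustrate a check on output dimensions. The StreamData version requirement is also an assumption.

```elixir
Mix.install([{:stream_data, "~> 1.0"}])
ExUnit.start()

defmodule PathSketch do
  # Hypothetical stand-in for a path setter; this is not Algo's API.
  # Access.key/2 with a default creates missing intermediate maps on the way down.
  def put_path(map, path, value) do
    put_in(map, Enum.map(path, &Access.key(&1, %{})), value)
  end
end

defmodule PathSketchTest do
  use ExUnit.Case, async: true
  use ExUnitProperties

  # Round trip: whatever is written at a path can be read back from that path.
  property "a value written at a path can be read back from it" do
    check all path <- list_of(atom(:alphanumeric), min_length: 1, max_length: 4),
              value <- integer() do
      assert %{} |> PathSketch.put_path(path, value) |> get_in(path) == value
    end
  end

  # Output dimensions: the result is exactly as long as the shorter input.
  property "zipping two lists yields as many pairs as the shorter list" do
    check all left <- list_of(integer()),
              right <- list_of(integer()) do
      assert length(Enum.zip(left, right)) == min(length(left), length(right))
    end
  end
end
```

Saved as, say, `path_sketch.exs`, this should run with `elixir path_sketch.exs`. Neither property names specific inputs; each states a relationship that has to hold for any generated path, value or list.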

Some things still didn't work well. Codex did not particularly care about duplication. It sometimes created helpers that were almost the same as existing helpers. In tests, it would define local helper functions even when suitable helpers already existed in another module. This may be solvable with project instructions, but by default it took a very blinkered approach - not what I would expect from a reasonable human programmer.

It also extracted too many trivial helpers which I then had to inline. That is partly down to personal preference; I am probably more hostile to tiny helpers than many people. But there seems to be a broader pattern of Codex focusing on a given task very narrowly.

In StreamData generators, it overused bind and I had to get it to refactor towards simpler generator composition.
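The difference is easiest to see side by side. The sketch below is illustrative only: the `GeneratorSketch` module and its generators are hypothetical and do not come from Algo. It assumes the overuse looks like nested `bind/2` calls for values that do not depend on each other, which `fixed_map/1` (or `tuple/1` and `map/2`) can express more directly.

```elixir
Mix.install([{:stream_data, "~> 1.0"}])

defmodule GeneratorSketch do
  import StreamData

  # Bind-heavy version: nested binds even though x and y are independent.
  def point_with_bind do
    bind(integer(), fn x ->
      bind(integer(), fn y ->
        constant(%{x: x, y: y})
      end)
    end)
  end

  # Simpler composition: independent fields need no bind at all.
  def point, do: fixed_map(%{x: integer(), y: integer()})

  # bind/2 still earns its place when one generator depends on another's value,
  # here an index that must be valid for the generated list.
  def list_and_index do
    bind(list_of(integer(), min_length: 1), fn list ->
      map(integer(0..(length(list) - 1)), fn index -> {list, index} end)
    end)
  end
end

# Print a few sample points; the values are random, so the output varies.
GeneratorSketch.point() |> Enum.take(3) |> IO.inspect()
```

Both `point_with_bind/0` and `point/0` produce maps of the same shape, but the second says so in one line. I would keep `bind/2` for cases like `list_and_index/0`, where the second generator cannot be built without the first value.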

Codex was also biased towards preserving existing code - again, quite unlike what I'd expect from a human collaborator. I had to explicitly ask it to do a review pass and suggest improvements.

In a sense, this is good behaviour when I think of it as a tool - it tends to do what I asked and no more. However, in combination with all the natural-looking chatter it outputs, it continues to create a sense of cognitive dissonance for me. I am half expecting it to exhibit some judgement: notice duplication, question interfaces etc. It's hard to shake that off.

All up, it took about 40 short Codex sessions over a couple of weeks to finish the package. That includes implementation, tests, documentation, refactoring and some API review. This seems pretty good, and I would not have finished it this quickly without Codex.

I was able to focus on the high-level concerns: what kind of functions to include, what the API should look like, what the tests should cover. I didn't have to overcome blank page syndrome and I could make progress when I was tired late in the evening. Of course, that's a blessing in the short run that could well turn into a curse in the long run - time will tell.

Lastly, another important finding for me is that I was the bottleneck. Because I was reviewing all the code and refactoring it to my liking, I was still using Codex in short bursts and didn't follow the latest fashion of leaving the agent to one-shot the whole package or launch a fleet of subagents to generate things in parallel. I think I would have had more code to review and correct if I'd left the agent in control. Particularly because Algo is a reusable package and not a throwaway script, I think working in short, reviewed bursts makes sense.