Writing readable SQL

2017-02-19

•

As I don’t use ORMs anymore, I end up writing a lot of fairly complicated 100+ line SQL queries. I found a few ways to keep them readable.

1. Use lowercase

In most languages newer than COBOL, most of the code is in lowercase, so there is no reason it shouldn’t work for SQL. I’ve been writing it in lowercase for a few years and find it to be quite readable, with or without syntax highlighting.

My queries look like this:

select geometries.id
    , geometries.type
    , geometries.geometry
    , geometries.properties
from geometries
join projects using (project_id)
join customers using (customer_id)
where project_id = $1 
    and GeometryType(geometry) = 'POLYGON'
order by geometries.properties->>'name'

2. Write inline comments

Just like in any other code, there will be bits of queries which are non-obvious, and they can be easily be commented on:

where user_id = $1 and project_id = $2
    and events.created_at >= to_timestamp($3) and events.created_at < to_timestamp($4)
-- Sort by event ID to deal with entry-exit pairs in the same epoch
order by area_id, created_at, event_id

Without that comment above order, I’d likely be puzzled by the non-obvious sort by event_id if I returned to this query a few months after writing it.

3. Format carefully

I apply a few rules:

Start each major block of a query on a new line
Use leading commas in lists - this prevents annoying errors due to forgetting a comma, and makes adding items to the list easier
Break up conditions into multiple lines such that logical operators are at the start of the line
Use whitespace liberally to make it clear which parts of a query belong together (e.g. when writing a subquery in a where).

For example:

select geometries.id
    , geometries.type
    , geometries.geometry
    , geometries.properties
from geometries
where project_id in (select project_id 
                     from projects 
                     where customer_id = $1)
    and GeometryType(geometry) = 'POLYGON'
    and (
        updated_at >= clock_timestamp() - interval '1 month'
        or deleted_at >= clock_timestamp() - interval '1 month'
    )
order by geometries.properties->>'name'

I do like to preserve vertical space (if I can see the whole query without scrolling, it’s much easier), so I’m likely to write multiple columns per line in a select list, but still use leading commas when a single line is not enough:

select geometries.id, geometries.type, geometries.geometry, geometries.properties
    , extract(epoch from created_at at time zone (select timezone from projects where id = $1))
    , updated_at, deleted_at

4. Use CTEs

I use CTEs a lot in Postgres as a substitute for a lot of subqueries, and I find that I work towards the final result in a different way, more like I would in code - by treating the query as a series of data transformations. So, for example, if I want to produce a result for a set of users on a range of days, I write a CTE to get the users, then another one to generate the range of dates, then another one to combine them to get all possible “user-days”, then another to retrieve the data for each and process it, and so on:

with roles as (
    select id
    from user_roles
    where is_active and (properties->>'is_person' or properties->>'does_signon')
) 
, relevant_users as (
    select * 
    from users 
    where role_id in (select id from roles)
)
, days as (
    select generate_series($1::date, $2::date, '1 day') as day_start
)
, user_days as (
    select day_start, day + interval '1 day' as day_end, user_id
    from relevant_users
    cross join days
)
select * 
from events
join user_days on events.user_id = user_days.user_id 
                  and created_at >= day_start and created_at < day_end
where type in ('entry', 'exit')

This results in very readable queries - way better than having subqueries all over the place. There are performance caveats to using CTEs so sometimes I have to structure the query differently, but it works well for a lot of them, as long as the intermediate result sets are small.

Comments or questions? I’m @alexkorban on Twitter.

Is the time dimension of your data giving you headaches?

“PostgreSQL: Time and Temporal Data” is a short book that will help you get it under control. Master Postgres time tools, get a handle on time zones, learn how to represent and query complex temporal data.

Pop in your email to receive a sample chapter.

You will also get notifications of new posts along with other mailing list only freebies.

|> Created with Elmstatic