As I don’t use ORMs anymore, I end up writing a lot of fairly complicated 100+ line SQL queries. I found a few ways to keep them readable.
In most languages newer than COBOL, code is written mostly in lowercase, so there's no reason that shouldn't work for SQL too. I've been writing it in lowercase for a few years and find it quite readable, with or without syntax highlighting.
My queries look like this:
select geometries.id
     , geometries.type
     , geometries.geometry
     , geometries.properties
from geometries
join projects using (project_id)
join customers using (customer_id)
where project_id = $1
  and GeometryType(geometry) = 'POLYGON'
order by geometries.properties->>'name'
Just like in any other code, there will be bits of queries which are non-obvious, and they can easily be commented:
where user_id = $1 and project_id = $2
  and events.created_at >= to_timestamp($3) and events.created_at < to_timestamp($4)
-- Sort by event ID to deal with entry-exit pairs in the same epoch
order by area_id, created_at, event_id
Without that comment above order by, I'd likely be puzzled by the non-obvious sort by event_id if I returned to this query a few months after writing it.
I apply a few rules: I start continuation lines in the where clause with and/or rather than ending with them, I indent subqueries to the opening parenthesis of the enclosing expression, and I spread grouped conditions over multiple lines (with and ( and the closing ) on lines of their own). For example:
select geometries.id
     , geometries.type
     , geometries.geometry
     , geometries.properties
from geometries
where project_id in (select project_id
                     from projects
                     where customer_id = $1)
  and GeometryType(geometry) = 'POLYGON'
  and (
    updated_at >= clock_timestamp() - interval '1 month'
    or deleted_at >= clock_timestamp() - interval '1 month'
  )
order by geometries.properties->>'name'
I do like to preserve vertical space (if I can see the whole query without scrolling, it's much easier), so I'm likely to write multiple columns per line in a select list, but still use leading commas when a single line is not enough:
select geometries.id, geometries.type, geometries.geometry, geometries.properties
     , extract(epoch from created_at at time zone (select timezone from projects where id = $1))
     , updated_at, deleted_at
I use CTEs a lot in Postgres as a substitute for subqueries, and I find that I work towards the final result in a different way, more like I would in code: by treating the query as a series of data transformations. For example, if I want to produce a result for a set of users over a range of days, I write a CTE to get the users, then another to generate the range of dates, then another to combine them to get all possible "user-days", then another to retrieve the data for each and process it, and so on:
with roles as (
  select id
  from user_roles
  where is_active
    and ((properties->>'is_person')::boolean or (properties->>'does_signon')::boolean)
)
, relevant_users as (
  select *
  from users
  where role_id in (select id from roles)
)
, days as (
  select generate_series($1::date, $2::date, '1 day') as day_start
)
, user_days as (
  select day_start, day_start + interval '1 day' as day_end, user_id
  from relevant_users
  cross join days
)
select *
from events
join user_days on events.user_id = user_days.user_id
  and created_at >= day_start and created_at < day_end
where type in ('entry', 'exit')
This results in very readable queries, way better than having subqueries all over the place. There are performance caveats to using CTEs (notably, before Postgres 12 each CTE is an optimization fence: it's materialized as written, and the planner can't push predicates into it), so sometimes I have to structure the query differently, but it works well for a lot of queries, as long as the intermediate result sets are small.
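In Postgres 12 and later you can control this behaviour per CTE with the materialized / not materialized keywords: not materialized lets the planner inline the CTE and push outer predicates into it, while materialized forces it to be computed once. A small sketch (the table and column names are made up for illustration):

with recent_projects as not materialized (
  select project_id, customer_id
  from projects
  where updated_at >= now() - interval '1 month'
)
select *
from recent_projects
where customer_id = $1
-- With "not materialized", the planner can combine customer_id = $1
-- with the CTE's own filter and use an index on projects(customer_id).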
Comments or questions? I’m @alexkorban on Twitter.