Conveying Respect and Kindness Through Readable Code

Time to Read:

7–11 minutes

Time is perhaps the most valuable resource. It is the one thing no power on this earth can give us back. Believing that, it seems desirable to spend as little time as possible performing painful activities – right? In this post, I will walk through a handful of measures that you can take to ensure the time spent by others reading your code is as painless as possible. As a disclaimer, for SQL, there is not a style guide or standard equivalent to that available in other programming languages like python (PEP8) that is both well-known and widely adopted. That said, I do believe the measures in the forms of rules and rough guidelines that I propose are sensible and easy to implement.

Line Length
The number of characters used on each line can heavily impact the amount of time and effort required to read a SQL statement. Avoid forcing the readers of your code to scroll in both directions when traversing each line by setting soft limits on the number of characters used per line. Scrolling right to read the contents of a line and subsequently having to scroll back in order to continue with the next is a painful exercise that wastes the reader’s time and will likely make that person less enthusiastic about reviewing your code in the future.

For the remainder of this discussion, consider the SELECT statement below that is both poorly written and largely void of formatting. Some of you will say it is contrived having never been faced with poor code while others will say, “I have seen worse!” With each subsequent topic, I will continue to transform this code by addressing the elements, present or absent, reducing its readability.

WITH cte AS (
SELECT c.customer_id, c.cust_last_name || ', ' || c.cust_first_name AS customer_name, SUM(o.order_total) AS order_total, DENSE_RANK() OVER (ORDER BY SUM(o.order_total) DESC) AS order_total_sequence
FROM oe.customers c INNER JOIN oe.orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '01-JAN-08' AND o.order_date < '01-JAN-09'
GROUP BY c.customer_id, c.cust_last_name || ', ' || c.cust_first_name
)
SELECT cte.customer_id, cte.customer_name, cte.order_total, cte.order_total_sequence
FROM cte
WHERE cte.order_total_sequence <= 5
ORDER BY 3 DESC;

Separating Clauses with Indentation and New Lines
The use of indentation and new lines in SQL code are perhaps the easiest ways to increase code readability and also address the scrolling problem resulting from excessive line length. While there are no hard-and-fast rules on where to indent and add new lines for SQL, I find it beneficial add a new line after the SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY clauses and to indent the contents or specifications of those clauses. A new line followed by indentation signifies to the reader, “this is the start of the clause’s specification.”

For example, the SELECT clause’s specification is a comma-separated list of the required pre-existing and/or derived attributes. Placing each attribute on a new line makes it easier to read those instructions and should feel natural to the writer (i.e., you) and reader given the concept of a list in everyday writing. The reader simply needs to scan the vertically aligned attributes from top to bottom. This also applies to the predicates in the FROM, WHERE, and HAVING clauses, and for the enumerated attributes found in GROUP BY and ORDER BY clauses.

The code below is identical to that found above; however, I have separated clause specifications with indentation and incorporated new lines for readability. Notice, even in the small amount of space allocated to the code block, scrolling in only one direction is required. Additionally, it is much easier to understand each clause’s specification given that the contents are indented, on separate lines, and vertically aligned.

WITH cte AS (
    SELECT
        c.customer_id,
        (
			c.cust_last_name
			|| ', '
			|| c.cust_first_name
		) AS customer_name,
        SUM(o.order_total) AS order_total,
        DENSE_RANK() OVER (
			ORDER BY SUM(o.order_total) DESC
		) AS order_total_sequence
    FROM
        oe.customers c
        INNER JOIN
        oe.orders o
            ON c.customer_id = o.customer_id
    WHERE
        o.order_date >= '01-JAN-08'
        AND o.order_date < '01-JAN-09'
    GROUP BY
        c.customer_id,
        (
			c.cust_last_name
			|| ', '
			|| c.cust_first_name
		) 
)
SELECT
    cte.customer_id,
    cte.customer_name,
    cte.order_total,
    cte.order_total_sequence
FROM
    cte
WHERE
    cte.order_total_sequence <= 5
ORDER BY
    3 DESC;

Table and Column Aliases
All relational database management systems maintain documentation to some extent on the use of table and column aliases. You will find that they are meant to increase readability; however, based on the code reviews I have performed in a professional capacity and while perusing posts on popular social media platforms like LinkedIn or video tutorials on YouTube, often the opposite occurs.

Table names are shortened to the point where the the reader must maintain a mapping of single-letter characters or acronyms and the contents of a table (e.g., a for sales force accounts, o for sales force opportunities, sfosnpsht for sales force opportunity snapshot, etc.). For simple queries containing one or few tables, sure, the additional cognitive effort is minimal, but this quickly gets out of control for many real-world queries merging data from a large amount of tables. This is especially true for code that extends beyond a single frame, resulting in the reader needing to scroll back to the initial reference of each table to remember which table the single character maps to. That is both time-consuming and painful.

Column aliases should be used to more effectively communicate its contents. This is often necessary when there is a table that is used by several different teams for reporting purposes. One team refers to the companies they do business with as “clients” while the other refers to those companies as “customers.” Another common use case is derived attributes. The concatenation of customer first and last name in the code above is used to return each customer’s full name. Aliasing for the sake of clarity based on context and/or the result of deriving something is aligned with the goal of maintaining or increasing readability. What is not? Arbitrary removal of characters for the sake of simply making the column name shorter. For example, consider an original column name of “shipping_address” and an alias of “shp_adr.” The removal of characters forces the reader to mentally fill in the gaps based on context and experience.

In the code below, I have completely removed the superfluous table aliases as the original table names, customers and orders, are clear and accurately capture their contents.

WITH cte AS (
    SELECT
        oe.customers.customer_id,
        (
			oe.customers.cust_last_name
			|| ', '
			|| oe.customers.cust_first_name
		) AS customer_name,
        SUM(oe.orders.order_total) AS order_total,
        DENSE_RANK() OVER (
			ORDER BY SUM(oe.orders.order_total) DESC
		) AS order_total_sequence
    FROM
        oe.customers
        INNER JOIN
        oe.orders
            ON oe.customers.customer_id = oe.orders.customer_id
    WHERE
        oe.orders.order_date >= '01-JAN-08'
        AND oe.orders.order_date < '01-JAN-09'
    GROUP BY
        oe.customers.customer_id,
        (
			oe.customers.cust_last_name
			|| ', '
			|| oe.customers.cust_first_name
		)
)
SELECT
    cte.customer_id,
    cte.customer_name,
    cte.order_total,
    cte.order_total_sequence
FROM
    cte
WHERE
    cte.order_total_sequence <= 5
ORDER BY
    3 DESC;

Common Table Expressions
Common table expressions (CTE) allow you to structure SQL code in a way that yields a transparent problem-solving strategy. They allow you to break complex problems down into manageable chunks or subproblems. A common mistake made by SQL users is to underestimate or be unaware of the benefits of providing meaningful names to CTEs. If you perform a quick search for SQL content on any popular social media platform, then you will, without a doubt, come across code containing CTEs named “cte,” “cte_1,” “cte_2,” and other less than useful names.

Avoid squandering the opportunity to inform the reader of your code on the intent or purpose of the CTE. In the code above, the purpose of the code defining “cte” is to compute the amount spent by each customer in year 2008. By providing the CTE with a meaningful name, we can prepare the reader for what is to come and even trigger intuition based on their past experiences in writing or reading code that sought to do the same.

Suppose the CTE is renamed to “customer_spending.” When reading the definition of the CTE, the question goes from, “What is this CTEs purpose?” to “How is this CTE defining and computing customer spend?” Providing meaningful names becomes increasingly important to readability when there are multiple CTEs or they are chained together where the output of one is used as input to another.

Ordinals in GROUP BY and ORDER BY Clauses
If you have read my blog post on the use of ordinals in the GROUP BY and ORDER BY clauses within production code, then you know I find it to be antipattern. For the purposes of this discussion, imagine the code above is lengthier, extending to several scroll frames, and also contains ordinals in GROUP BY clauses. For each GROUP BY and ORDER BY clause, you must be fortunate enough for the columns referenced in those clauses to be within the same frame or to have subconsciously noted the position of each column in the corresponding SELECT clause. Given that both those things are unlikely to occur, as the reader you must scroll back up to determine the attribute being referred to by the ordinal value (e.g., 1, 2, 3, etc.). After placing that information in your working memory, you must then scroll back down to where you left off and continue reading. That does not sound like a fun exercise, does it? Let us not impose that on the readers of our code.

The updated code containing the renamed common table expression and the removal of the ordinal in the ORDER BY clause can be found below.

WITH customer_spending AS (
    SELECT
        oe.customers.customer_id,
        (
			oe.customers.cust_last_name
			|| ', '
			|| oe.customers.cust_first_name
		) AS customer_name,
        SUM(oe.orders.order_total) AS order_total,
        DENSE_RANK() OVER (
			ORDER BY SUM(oe.orders.order_total) DESC
		) AS order_total_sequence
    FROM
        oe.customers
        INNER JOIN
        oe.orders
            ON oe.customers.customer_id = oe.orders.customer_id
    WHERE
        oe.orders.order_date >= '01-JAN-08'
        AND oe.orders.order_date < '01-JAN-09'
    GROUP BY
        oe.customers.customer_id,
        (
			oe.customers.cust_last_name
			|| ', '
			|| oe.customers.cust_first_name
		)
)
SELECT
    customer_spending.customer_id,
    customer_spending.customer_name,
    customer_spending.order_total,
    customer_spending.order_total_sequence
FROM
    customer_spending
WHERE
    customer_spending.order_total_sequence <= 5
ORDER BY
    customer_spending.order_total DESC;

Key Takeaways
To recap, the use of indentation and new lines allows the reader to scan your code in one direction and understand where clauses begin and end. Providing tables with succinct but complete names allow the reader to avoid constructing a mapping for original table names and specified aliases. Similarly, columns with meaningful and context relevant names convey the contents and purpose of each. When using common table expressions, ensure the names provided offer a glimpse into the code defining each. The keystrokes you save by using ordinals are not worth the additional time and effort required of the reader of your code whether that be a peer or your future self. Finally, remember, code is often written once (at least initially) and read many times.

Thank you for visiting SQL Short Reads.






SQL Antipatterns: Using Ordinals in GROUP BY and ORDER BY Clauses

Time to Read:

5–7 minutes

The SQL features available to users across relational database management system products continue to evolve over time. In the past 30 years, features have been developed allowing us to concatenate values existing across multiple rows into a single row using LISTAGG, implement highly customizable ranking systems through window functions, and traverse hierarchal data and perform data scaffolding through recursive common table expressions, among other useful features. The evolving SQL standard along with the continuous improvements made by vendors in the form of proprietary features (e.g., functions) continue to deliver for SQL users; however, these new additions are not without their misses.

MySQL permits users to specify GROUP BY and ORDER BY clauses consisting of expressions representing the ordinal positions of attributes within the corresponding SELECT clause. For example, the query below computes the total sales for each product category within each reporting month and sorts the result set by reporting month and product category. The expressions 1 and 2 within the GROUP BY and ORDER BY clauses state, “GROUP/ORDER the result set by the first and second attributes specified in the SELECT clause.”

SELECT
	CAST(
		DATE_FORMAT(report_date, '%Y-%m-01')
        AS DATE
	) AS reporting_month,
    north_america_sale.product_category,
    SUM(total_sale) AS total_sale
FROM
	north_america_sale
GROUP BY
	1, 2
ORDER BY
	1, 2;
reporting_monthproduct_categorytotal_sale
2023-01-01Hardware10532049
2023-01-01Office Supplies9527596
2023-01-01Software10032955
2023-02-01Hardware7713650
2023-02-01Office Supplies8363496
2023-02-01Software8072274
2023-03-01Hardware10107952
2023-03-01Office Supplies9308036
2023-03-01Software9566038
2023-04-01Hardware9981157
2023-04-01Office Supplies9101319
2023-04-01Software8094593
2023-05-01Hardware8224588
2023-05-01Office Supplies9854939
2023-05-01Software8719227
2023-06-01Hardware8547870
2023-06-01Office Supplies8569213
2023-06-01Software9424211
2023-07-01Hardware8956408
2023-07-01Office Supplies8742601
2023-07-01Software8850498
2023-08-01Hardware9718295
2023-08-01Office Supplies8566646
2023-08-01Software9848671
2023-09-01Hardware9198295
2023-09-01Office Supplies8742664
2023-09-01Software9532093
2023-10-01Hardware9138984
2023-10-01Office Supplies8658959
2023-10-01Software8430418
2023-11-01Hardware9559386
2023-11-01Office Supplies11124204
2023-11-01Software9945806
2023-12-01Hardware9680977
2023-12-01Office Supplies11923226
2023-12-01Software11490082

Had this post a different title and opening, you would likely be rejoicing at the thought of being able to type fewer characters to achieve the same result. Instead, you are skeptical of this perceived benefit and rightly so! Let us explore are few reasons as to why it is an overall bad idea to incorporate this feature into production code.

Explicit is Better than Implicit
If you are also a python user, then you will likely be familiar with the aphorism from the Zen of Python stating explicit is better than implicit. Not all programming best practices should be applied universally across languages; however, this one is particularly relevant to SQL in general and this specific discussion.

Given the example code above, imagine you are not the author of the code, or you are reviewing it again after not setting eyes on it for a long time. You traverse the query line by line and arrive at the GROUP BY clause. Unfortunately, your top-down approach to reading the query is interrupted due to having to retrace your steps back to the SELECT clause to clarify the expressions specified in the GROUP BY clause as indicated by the ordinal values. After clarifying which expressions are specified in the GROUP BY clause, back down to the GROUP BY clause you go. Hopefully, you remember the code you previously read lying between the SELECT clause and the GROUP BY clause. If not, then I guess you will need to read those lines again, too. With any luck, you can complete that mental exercise and can move onto the ORDER BY clause. The fun continues. Do it again. Imagine if the example query above was more complicated and lengthier, requiring you to scroll up and down. Pure joy.

When the ordinal position of expressions in the SELECT clause is relied upon in the GROUP BY or ORDER BY clauses, the code’s readability is decreased. Do yourself and the consumers of your code a favor by explicitly specifying the expressions. Readability counts.

Jenga as a Query
Data pipelines and the processes and scripts that support them are bound to change over time and for good reason. Your data customers are constantly trying to explore their data in different ways in hopes of acquiring a better understanding of it and the processes that generate it. As a result, they may request attributes to be added or removed from tables and views to take them one step closer to meeting that objective. When the ordinal position of expressions in the SELECT clause is relied upon in the GROUP BY or ORDER BY clauses, the code is reduced to an inauspicious game of jenga. It takes but one addition or removal of an attribute to cause the code to fail entirely, or worse, the code is processed successfully and produces a result set that does not meet the customer’s requirements.

For example, suppose your customers have decided they no longer need monthly sales to be broken out by product category. To account for this change, you update the code above by removing the product category attribute from the SELECT clause. As a result of not remembering to also account for this change in the GROUP BY clause, the query comes tumbling down. It fails because the second column within the SELECT clause is now the total sales associated with the reporting month instead of the product category. Additionally, the result set is also inadvertently sorted by the total sales values. This produces no visual net-change; however, the overhead associated with the now unnecessary sort remains.

-- Code fails due to invalid GROUP BY.
SELECT
	CAST(
		DATE_FORMAT(report_date, '%Y-%m-01')
        AS DATE
	) AS reporting_month,
    SUM(total_sale) AS total_sale
FROM
	north_america_sale
GROUP BY
	1, 2
ORDER BY
	1, 2;

If caution is not taken when adding or rearranging attributes within the SELECT clause of queries utilizing ordinals in the ORDER BY clause, then the query can silently fail in terms of meeting requirements. For example, suppose the original query near the top of this page is changed ever so slightly to that shown below.

SELECT
	north_america_sale.product_category,
	CAST(
		DATE_FORMAT(report_date, '%Y-%m-01')
        AS DATE
	) AS reporting_month,
    SUM(total_sale) AS total_sale
FROM
	north_america_sale
GROUP BY
	1, 2
ORDER BY
	1, 2;
product_categoryreporting_monthtotal_sale
Hardware2023-01-0110532049
Hardware2023-02-017713650
Hardware2023-03-0110107952
Hardware2023-04-019981157
Hardware2023-05-018224588
Hardware2023-06-018547870
Hardware2023-07-018956408
Hardware2023-08-019718295
Hardware2023-09-019198295
Hardware2023-10-019138984
Hardware2023-11-019559386
Hardware2023-12-019680977
Office Supplies2023-01-019527596
Office Supplies2023-02-018363496
Office Supplies2023-03-019308036
Office Supplies2023-04-019101319
Office Supplies2023-05-019854939
Office Supplies2023-06-018569213
Office Supplies2023-07-018742601
Office Supplies2023-08-018566646
Office Supplies2023-09-018742664
Office Supplies2023-10-018658959
Office Supplies2023-11-0111124204
Office Supplies2023-12-0111923226
Software2023-01-0110032955
Software2023-02-018072274
Software2023-03-019566038
Software2023-04-018094593
Software2023-05-018719227
Software2023-06-019424211
Software2023-07-018850498
Software2023-08-019848671
Software2023-09-019532093
Software2023-10-018430418
Software2023-11-019945806
Software2023-12-0111490082

Notice the re-ordering of attributes in the SELECT clause. The query produces the same data; however, the result set is not sorted in the same way because the ORDER BY clause remained the same. The original sorting specification allows consumers to easily compare product category sales within each month while the new sorting specification allows consumers to more easily compare a single product category’s performance across the entire year. Ease of use is an important factor for reporting, so customers may deem the sorting to be a key requirement.

Appropriate Use Cases
Taking advantage of ability to use ordinals in the GROUP BY and ORDER BY clause is perfectly acceptable in ad-hoc queries. Exploring a database schema and its corresponding tables? Validating the integrity of the data within a table? Save those keystrokes if you so desire!

Did this post make you laugh because you would never think of using ordinals in your GROUP BY or ORDER BY clauses, cry because you are currently suffering from you or a co-worker having taken advantage of these features, both, or neither? Let me know if the comments section.

Thank you for visiting SQL Short Reads.