SQL is a powerful tool — but it can be easy to get tripped up. Whether you’re just starting out or have years of experience, certain mistakes creep in more often than we’d like to admit. In this week’s edition of SELECT * FROM South.Shore, we’re covering five common SQL traps we see all the time, and what to do instead. These aren’t gotchas meant to make anyone feel bad (we still fall into some of these traps ourselves)! Rather, think of this as a few practical reminders to help you write cleaner, clearer, and more accurate queries.
1. Jamming Complex Logic Into a Join (When It Should Be a CTE)
The trap: It happens. You start on what you’re hoping will be a quick query, and you don’t want to bother with setting up multiple CTEs. What begins as something simple, though, soon morphs into a bit of a monster. Before you know it, you’re staring at a tangled ball of yarn that no one other than you will be able to quickly parse: multiple aggregations, filters, and CASE statements crammed into a single inline subquery, making the code nearly impossible to read or debug.
What to do instead: Pull that logic into a clear, named CTE above your main query. It improves readability, simplifies debugging, and makes your logic reusable.
Messy example:
SELECT
    c.customer_id,
    c.customer_name,
    o.order_date,
    spend.total_spend
FROM core.customers c
LEFT JOIN core.orders o ON c.customer_id = o.customer_id
LEFT JOIN (
    SELECT
        o.customer_id,
        SUM(oi.price * oi.quantity) AS total_spend
    FROM core.orders o
    JOIN core.order_items oi ON o.order_id = oi.order_id
    WHERE o.order_status = 'completed'
        AND o.order_date >= DATE_TRUNC('year', CURRENT_DATE)
    GROUP BY o.customer_id
) spend ON c.customer_id = spend.customer_id
This works, but try debugging it when the logic gets more complicated, or when someone else has to make changes later.
Cleaner approach:
WITH customer_spend AS (
    SELECT
        o.customer_id,
        SUM(oi.price * oi.quantity) AS total_spend
    FROM core.orders o
    JOIN core.order_items oi ON o.order_id = oi.order_id
    WHERE o.order_status = 'completed'
        AND o.order_date >= DATE_TRUNC('year', CURRENT_DATE)
    GROUP BY o.customer_id
)
SELECT
    c.customer_id,
    c.customer_name,
    o.order_date,
    cs.total_spend
FROM core.customers c
LEFT JOIN core.orders o ON c.customer_id = o.customer_id
LEFT JOIN customer_spend cs ON c.customer_id = cs.customer_id
Takeaway: CTEs improve readability, help you modularize logic, and make downstream testing much easier. So if you’re writing a long subquery inside a join, it’s probably time to promote it to a named CTE!
2. Misusing Aggregations (Especially in Joins)
The trap: A classic SQL mistake: aggregating one table, then joining it to another at a different level of granularity, then summing again. For example, say you’re analyzing some customer-level trends, and in the process you need to join order_items to orders. Without thinking too deeply about it, you execute this join on order_id, then group by customer_id. That seems fine, until your totals come out double what they should be.
What went wrong: In almost all cases, a table like order_items will contain multiple rows per order_id. If you join it directly to orders, and then aggregate at a higher level (like customer_id), you’re now summing repeated data — essentially counting each order multiple times.
What to do instead: First, aggregate order_items to the order_id level, then join that back to orders, which is also at the order_id level. Now, when you roll up to customer_id, you’re working with clean, de-duplicated data.
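Using the same hypothetical core.orders and core.order_items tables from the first example, a sketch of the aggregate-first pattern might look like:

```sql
-- Step 1: collapse order_items to one row per order_id
WITH order_totals AS (
    SELECT
        order_id,
        SUM(price * quantity) AS order_total
    FROM core.order_items
    GROUP BY order_id
)
-- Step 2: join at matching grain (one row per order on both sides),
-- then roll up to customer_id without double-counting
SELECT
    o.customer_id,
    SUM(ot.order_total) AS total_spend
FROM core.orders o
JOIN order_totals ot ON o.order_id = ot.order_id
GROUP BY o.customer_id
```

Because both sides of the join are at the order_id grain, nothing fans out, and the final customer-level sum counts each order exactly once.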
3. Over-Reliance on SELECT *
The trap: It’s fast, and it’s easy. So why not use SELECT * everywhere?
Because eventually, it bites you. We’ve seen dashboards break because someone added a column to a table — and SELECT * pulled it in without anyone noticing. Or a model bloats to 150+ columns, half of which no one actually uses, and now everything downstream is slower and harder to debug.
What went wrong: Using SELECT * means you’re no longer being intentional about what your query returns. It can cause:
- Schema drift issues when source tables change
- Pulling unnecessary data, leading to bloated query times
- Unreadable code that’s harder to maintain
- Downstream BI tools choking on overly wide datasets
It’s like asking for the entire menu when all you wanted was a sandwich.
What to do instead: Explicitly select only the columns you need. This makes your code more resilient, your models easier to maintain, and your queries easier to optimize. Being explicit helps your future self — and anyone else who reads your code — understand what this model is for. Think of it as defensive querying.
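Sticking with the hypothetical core.customers table, that can be as simple as:

```sql
-- Name exactly the columns downstream consumers rely on
SELECT
    customer_id,
    customer_name
FROM core.customers
```

Now a new column added to core.customers changes nothing downstream until you deliberately opt in to it.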
4. Improper Date Filtering
The trap: You need a quick date filter, so you write something like:
WHERE order_date BETWEEN '2025-06-01' AND '2025-06-30'
Looks harmless, right? Until you realize your report excludes orders from the very last day of the month because the timestamps don’t match. Or worse — you reuse this pattern across models and suddenly your fiscal calendar comparisons are a mess.
What went wrong: Hardcoding dates or using BETWEEN without understanding how timestamps work can introduce subtle bugs. You end up with off-by-one errors, partial period comparisons, or inconsistent results across different reports.
Even if your order_date column is a date (not a timestamp), BETWEEN is still a bit of a trap: it’s inclusive on both ends, which isn’t always what you want.
What to do instead: Use standardized, dynamic filters based on your reporting logic. If you want full months, truncate to the month. If you’re comparing month-to-date performance, define those boundaries explicitly. Here’s a better pattern:
-- Last full month
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
    AND order_date < DATE_TRUNC('month', CURRENT_DATE)

-- Month to date
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)
    AND order_date < CURRENT_DATE
This structure avoids fuzzy logic and ensures consistent time comparisons, even as the current date shifts. Bonus: if you use these patterns across your models, your reporting layers stay in sync — no more manually updating filters.
5. Misusing DISTINCT
The trap: You’re debugging a join that’s suddenly returning way more rows than expected. Rather than stepping through what went wrong, you slap SELECT DISTINCT on top of your query — and boom, fewer rows.
But is it actually fixed? Not really.
What went wrong: Using DISTINCT as a band-aid hides the problem instead of solving it. SQL is a row-set language — if your query is returning more rows than expected, there’s usually a grain mismatch or a bad join condition.
In that case, DISTINCT doesn’t correct anything; it just filters out duplicate rows after the join has already bloated your result set. That may paper over the issue, but:
- You’re still pulling data at the wrong level of detail
- You might silently drop meaningful rows
- Worse, you now have a query that looks like it’s working, but is doing something completely unintended
What to do instead: Think carefully about your intended grain. Are you trying to return one row per customer? One row per order? One row per product per day?
Once you know the grain:
- Join only to tables at compatible grains
- Aggregate before joining if needed (see the prior section)
- Avoid DISTINCT unless you’re certain it’s semantically correct
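Before reaching for DISTINCT, a quick grain check can tell you whether a table really is one row per key. Assuming the hypothetical core.order_items table from earlier, a sketch:

```sql
-- If this returns any rows, order_items is NOT one row per order_id,
-- so joining it straight to orders will fan out your result set
SELECT
    order_id,
    COUNT(*) AS rows_per_order
FROM core.order_items
GROUP BY order_id
HAVING COUNT(*) > 1
```

Run the same check on any table before joining it, and you’ll know up front whether you need to aggregate first.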
Final Thought
SQL is an incredibly useful tool when analyzing and structuring data. Like anything else, the path to the optimal outcome is rarely linear and it’s totally normal to uncover mistakes as you go. Our hope is that by keeping some of these tips in mind, however, you will avoid potential headaches that can come back to bite you down the road!
Thanks for reading! Want more? Check out our blog and our YouTube channel for deeper dives and walkthroughs. We’ll be back each week with more content — subscribe to stay in the loop.
#AI #FutureOfData #Analytics #SouthShoreAnalytics