SQLrom

Both CTEs and subqueries solve the problem of breaking complex queries into manageable pieces. But when should you choose one over the other? This guide explores the practical differences, performance implications, and best use cases for each approach.

The Basics: CTEs vs Subqueries

Simple Subquery Example

Subqueries are nested queries that can appear in SELECT, FROM, WHERE, or HAVING clauses. They're evaluated and their results used by the outer query.

-- Basic subquery in WHERE clause
SELECT 
  user_id,
  username,
  signup_date,
  total_orders
FROM users
WHERE user_id IN (
  SELECT customer_id 
  FROM orders 
  WHERE order_date >= '2024-01-01'
    AND total_amount > 100
);

Equivalent CTE Example

CTEs (Common Table Expressions) create named temporary result sets that exist only for the duration of the query. They're defined at the beginning with the WITH clause.

-- Same logic using CTE
WITH high_value_customers AS (
  SELECT DISTINCT customer_id 
  FROM orders 
  WHERE order_date >= '2024-01-01'
    AND total_amount > 100
)
SELECT 
  u.user_id,
  u.username,
  u.signup_date,
  u.total_orders
FROM users u
JOIN high_value_customers hvc ON u.user_id = hvc.customer_id;

Readability: Where CTEs Shine

Complex Nested Subqueries

As queries become more complex, nested subqueries can become difficult to read and maintain. Consider this complex analysis:

-- Complex nested subqueries (hard to read)
SELECT 
  product_name,
  category_name,
  avg_rating,
  price,
  (price - (SELECT AVG(price) FROM products WHERE category_id = p.category_id)) as price_vs_category_avg
FROM products p
JOIN categories c ON p.category_id = c.category_id
WHERE p.product_id IN (
  SELECT product_id 
  FROM reviews 
  WHERE rating >= 4
  GROUP BY product_id
  HAVING COUNT(*) >= (
    SELECT AVG(review_count) * 0.8
    FROM (
      SELECT product_id, COUNT(*) as review_count
      FROM reviews
      GROUP BY product_id
    ) product_reviews
  )
)
AND p.price > (
  SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY price)
  FROM products 
  WHERE category_id = p.category_id
);

The Same Logic with CTEs

CTEs break the same logic into named, reusable components that are much easier to understand and maintain:

-- Same logic using CTEs (much clearer)
WITH product_review_counts AS (
  SELECT 
    product_id, 
    COUNT(*) as review_count,
    AVG(rating) as avg_rating
  FROM reviews
  GROUP BY product_id
),
review_threshold AS (
  SELECT AVG(review_count) * 0.8 as min_reviews
  FROM product_review_counts
),
category_price_stats AS (
  SELECT 
    category_id,
    AVG(price) as avg_price,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY price) as price_75th
  FROM products
  GROUP BY category_id
),
well_reviewed_products AS (
  SELECT product_id
  FROM product_review_counts prc
  CROSS JOIN review_threshold rt
  WHERE prc.avg_rating >= 4
    AND prc.review_count >= rt.min_reviews
)
SELECT 
  p.product_name,
  c.category_name,
  prc.avg_rating,
  p.price,
  (p.price - cps.avg_price) as price_vs_category_avg
FROM products p
JOIN categories c ON p.category_id = c.category_id
JOIN well_reviewed_products wrp ON p.product_id = wrp.product_id
JOIN product_review_counts prc ON p.product_id = prc.product_id
JOIN category_price_stats cps ON p.category_id = cps.category_id
WHERE p.price > cps.price_75th;

Readability advantages of CTEs:

Top-down logical flow that matches human thinking
Named intermediate results that document intent
Easier to debug individual components
Simpler to modify and extend logic

Recursive Queries: CTE's Unique Capability

Hierarchical Data Processing

Recursive CTEs can process hierarchical data like organizational charts, category trees, or network structures. This capability doesn't exist with standard subqueries.

-- Recursive CTE for hierarchical data
WITH RECURSIVE org_hierarchy AS (
  -- Anchor: Top-level managers
  SELECT 
    employee_id,
    manager_id,
    employee_name,
    title,
    1 as level,
    CAST(employee_name AS VARCHAR(1000)) as hierarchy_path
  FROM employees
  WHERE manager_id IS NULL
  
  UNION ALL
  
  -- Recursive: Direct reports
  SELECT 
    e.employee_id,
    e.manager_id,
    e.employee_name,
    e.title,
    oh.level + 1,
    CAST(oh.hierarchy_path || ' > ' || e.employee_name AS VARCHAR(1000))
  FROM employees e
  JOIN org_hierarchy oh ON e.manager_id = oh.employee_id
  WHERE oh.level < 10  -- Prevent infinite recursion
)
SELECT 
  employee_id,
  employee_name,
  title,
  level,
  hierarchy_path,
  -- Count total subordinates
  (SELECT COUNT(*) 
   FROM org_hierarchy sub 
   WHERE sub.hierarchy_path LIKE org_hierarchy.hierarchy_path || '%'
     AND sub.employee_id != org_hierarchy.employee_id
  ) as total_subordinates
FROM org_hierarchy
ORDER BY level, employee_name;

Recursive CTE use cases:

Organizational hierarchy analysis
Product category trees
Social network connection analysis
Geographic region hierarchies
Bill of materials (BOM) explosions

Performance Considerations

Multiple References and Evaluation

When the same logic needs to be used multiple times, CTEs can be more efficient because they're evaluated once and reused, while subqueries might be re-evaluated multiple times.

-- Performance comparison: Multiple references
-- ❌ Subquery approach (inefficient - calculated multiple times)
SELECT 
  current_month.user_id,
  current_month.revenue as current_revenue,
  previous_month.revenue as previous_revenue,
  (current_month.revenue - previous_month.revenue) as revenue_change,
  CASE 
    WHEN previous_month.revenue > 0 
    THEN ROUND((current_month.revenue - previous_month.revenue) / previous_month.revenue * 100, 2)
    ELSE NULL
  END as revenue_change_percent
FROM (
  -- This subquery is evaluated multiple times
  SELECT 
    user_id,
    SUM(order_amount) as revenue
  FROM orders
  WHERE DATE_TRUNC('month', order_date) = DATE_TRUNC('month', CURRENT_DATE)
  GROUP BY user_id
) current_month
LEFT JOIN (
  -- This subquery is also evaluated multiple times
  SELECT 
    user_id,
    SUM(order_amount) as revenue
  FROM orders
  WHERE DATE_TRUNC('month', order_date) = DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
  GROUP BY user_id
) previous_month ON current_month.user_id = previous_month.user_id;

-- ✅ CTE approach (efficient - calculated once)
WITH monthly_revenue AS (
  SELECT 
    user_id,
    DATE_TRUNC('month', order_date) as month,
    SUM(order_amount) as revenue
  FROM orders
  WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
  GROUP BY user_id, DATE_TRUNC('month', order_date)
),
current_month AS (
  SELECT user_id, revenue
  FROM monthly_revenue
  WHERE month = DATE_TRUNC('month', CURRENT_DATE)
),
previous_month AS (
  SELECT user_id, revenue
  FROM monthly_revenue
  WHERE month = DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
)
SELECT 
  c.user_id,
  c.revenue as current_revenue,
  p.revenue as previous_revenue,
  (c.revenue - COALESCE(p.revenue, 0)) as revenue_change,
  CASE 
    WHEN p.revenue > 0 
    THEN ROUND((c.revenue - p.revenue) / p.revenue * 100, 2)
    ELSE NULL
  END as revenue_change_percent
FROM current_month c
LEFT JOIN previous_month p ON c.user_id = p.user_id;

Performance factors to consider:

Multiple references: CTEs are evaluated once and reused
Query optimization: Modern optimizers can optimize both effectively
Materialization: CTEs might be materialized, affecting memory usage
Execution order: CTEs execute sequentially, subqueries as needed

Correlated vs Non-Correlated Patterns

Correlated Subquery Limitations

Correlated subqueries execute once for each row in the outer query, which can be inefficient. CTEs often provide a better alternative using JOINs.

-- Correlated subquery example
SELECT 
  p.product_id,
  p.product_name,
  p.price,
  -- Correlated subquery executed for each row
  (SELECT AVG(price) 
   FROM products p2 
   WHERE p2.category_id = p.category_id
  ) as category_avg_price,
  -- Another correlated subquery
  (SELECT COUNT(*) 
   FROM reviews r 
   WHERE r.product_id = p.product_id
  ) as review_count
FROM products p
WHERE p.price > (
  SELECT AVG(price) * 1.2
  FROM products p3
  WHERE p3.category_id = p.category_id
);

CTE Alternative with JOINs

The same logic using CTEs with JOINs is typically more efficient and easier to optimize:

-- CTE alternative to correlated subqueries
WITH category_stats AS (
  SELECT 
    category_id,
    AVG(price) as avg_price,
    AVG(price) * 1.2 as premium_threshold
  FROM products
  GROUP BY category_id
),
product_review_counts AS (
  SELECT 
    product_id,
    COUNT(*) as review_count
  FROM reviews
  GROUP BY product_id
)
SELECT 
  p.product_id,
  p.product_name,
  p.price,
  cs.avg_price as category_avg_price,
  COALESCE(prc.review_count, 0) as review_count
FROM products p
JOIN category_stats cs ON p.category_id = cs.category_id
LEFT JOIN product_review_counts prc ON p.product_id = prc.product_id
WHERE p.price > cs.premium_threshold;

CTEs with Window Functions

Complex Analytics Pipelines

CTEs work excellently with window functions for multi-step analytical processes, where each step builds on the previous results.

-- CTEs with window functions for complex analytics
WITH user_engagement_metrics AS (
  SELECT 
    user_id,
    DATE(login_date) as login_date,
    session_duration_minutes,
    pages_viewed,
    -- Window functions for user-level metrics
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) as session_number,
    LAG(DATE(login_date)) OVER (PARTITION BY user_id ORDER BY login_date) as previous_login,
    AVG(session_duration_minutes) OVER (
      PARTITION BY user_id 
      ORDER BY login_date 
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as avg_7day_duration
  FROM user_sessions
  WHERE login_date >= CURRENT_DATE - INTERVAL '90 days'
),
engagement_scores AS (
  SELECT 
    user_id,
    login_date,
    session_duration_minutes,
    pages_viewed,
    session_number,
    CASE 
      WHEN previous_login IS NULL THEN 0
      ELSE DATE(login_date) - DATE(previous_login)
    END as days_since_last_login,
    avg_7day_duration,
    -- Engagement score calculation
    CASE 
      WHEN session_duration_minutes >= 30 AND pages_viewed >= 10 THEN 'High'
      WHEN session_duration_minutes >= 15 AND pages_viewed >= 5 THEN 'Medium'
      ELSE 'Low'
    END as engagement_level
  FROM user_engagement_metrics
),
user_segments AS (
  SELECT 
    user_id,
    COUNT(*) as total_sessions,
    AVG(session_duration_minutes) as avg_session_duration,
    AVG(days_since_last_login) as avg_days_between_sessions,
    SUM(CASE WHEN engagement_level = 'High' THEN 1 ELSE 0 END) as high_engagement_sessions,
    CASE 
      WHEN AVG(days_since_last_login) <= 3 THEN 'Daily User'
      WHEN AVG(days_since_last_login) <= 7 THEN 'Weekly User'
      WHEN AVG(days_since_last_login) <= 14 THEN 'Bi-weekly User'
      ELSE 'Monthly+ User'
    END as usage_frequency
  FROM engagement_scores
  GROUP BY user_id
)
SELECT 
  usage_frequency,
  COUNT(*) as user_count,
  AVG(total_sessions) as avg_sessions_per_user,
  AVG(avg_session_duration) as avg_duration_per_user,
  AVG(high_engagement_sessions) as avg_high_engagement_sessions,
  ROUND(AVG(high_engagement_sessions) / AVG(total_sessions) * 100, 2) as high_engagement_rate
FROM user_segments
GROUP BY usage_frequency
ORDER BY 
  CASE usage_frequency
    WHEN 'Daily User' THEN 1
    WHEN 'Weekly User' THEN 2
    WHEN 'Bi-weekly User' THEN 3
    ELSE 4
  END;

Materialized CTEs for Performance

Forcing Materialization (PostgreSQL 12+)

In PostgreSQL 12+, you can force CTE materialization to improve performance when the CTE is used multiple times or contains expensive operations.

-- Materialized CTEs for performance (PostgreSQL 12+)
WITH expensive_aggregation AS MATERIALIZED (
  SELECT 
    DATE_TRUNC('day', order_date) as order_day,
    product_category,
    COUNT(*) as order_count,
    SUM(order_amount) as total_revenue,
    COUNT(DISTINCT customer_id) as unique_customers
  FROM orders o
  JOIN products p ON o.product_id = p.product_id
  WHERE order_date >= CURRENT_DATE - INTERVAL '365 days'
  GROUP BY DATE_TRUNC('day', order_date), product_category
),
daily_trends AS (
  SELECT 
    order_day,
    product_category,
    order_count,
    total_revenue,
    unique_customers,
    -- Compare to previous day
    LAG(order_count) OVER (
      PARTITION BY product_category 
      ORDER BY order_day
    ) as prev_day_orders,
    LAG(total_revenue) OVER (
      PARTITION BY product_category 
      ORDER BY order_day
    ) as prev_day_revenue
  FROM expensive_aggregation
)
SELECT 
  order_day,
  product_category,
  order_count,
  total_revenue,
  unique_customers,
  ROUND(
    (order_count - COALESCE(prev_day_orders, 0)) / 
    COALESCE(prev_day_orders, 1) * 100, 2
  ) as order_count_change_percent,
  ROUND(
    (total_revenue - COALESCE(prev_day_revenue, 0)) / 
    COALESCE(prev_day_revenue, 1) * 100, 2
  ) as revenue_change_percent
FROM daily_trends
WHERE order_day >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY product_category, order_day;

When to Choose CTEs vs Subqueries

Choose CTEs When:

Complex logic: Multiple steps that build on each other
Multiple references: Same intermediate result used multiple times
Recursive queries: Hierarchical or tree-structured data
Team collaboration: Code needs to be readable and maintainable
Debugging: Need to test intermediate results separately
Window function pipelines: Multi-step analytical processes

Choose Subqueries When:

Simple filtering: Basic EXISTS or IN operations
Single use: Logic is used only once in the query
Small, focused logic: Simple calculations or lookups
Performance critical: Simple subquery might be faster than CTE setup
Scalar values: Single value calculations in SELECT or WHERE

Best Practices

CTE Best Practices

Meaningful names: Use descriptive names that explain the purpose
Logical order: Arrange CTEs in the order they're conceptually processed
Comment complex logic: Explain the business logic behind each CTE
Limit complexity: Break extremely complex CTEs into smaller ones
Consider materialization: Use MATERIALIZED hint when beneficial

Subquery Best Practices

Avoid deep nesting: More than 2-3 levels becomes hard to read
Prefer EXISTS over IN: For large datasets, EXISTS often performs better
Use correlated subqueries sparingly: Often can be rewritten as JOINs
Consider indexes: Ensure subquery filtering columns are indexed

Performance Testing and Optimization

Measuring Performance

Always test both approaches with your actual data and query patterns:

Execution plans: Use EXPLAIN to understand query execution
Timing tests: Measure actual execution time with realistic data volumes
Memory usage: Monitor memory consumption, especially with CTEs
Concurrent load: Test performance under realistic concurrent usage

Common Pitfalls

CTE Pitfalls

Over-materialization: CTEs might materialize when optimization would be better
Memory consumption: Large CTEs can consume significant memory
Recursive depth limits: Be aware of recursion limits in your database

Subquery Pitfalls

Correlated subquery performance: Can be extremely slow on large datasets
NULL handling in IN clauses: Unexpected behavior with NULL values
Readability issues: Complex nested queries become unmaintainable

Conclusion

The choice between CTEs and subqueries isn't always clear-cut. CTEs excel at improving readability, handling complex multi-step logic, and processing hierarchical data. Subqueries work well for simple filtering and single-use logic.

In modern SQL development, CTEs are generally preferred for complex analytics due to their readability and maintainability benefits. However, simple subqueries still have their place for straightforward operations.

The key is understanding your specific use case: prioritize readability and maintainability for complex business logic, and consider performance implications for high-volume or frequently executed queries. When in doubt, test both approaches with your actual data and choose based on measured performance and code clarity.