Snowflake Data Validation

Posted on August 17, 2025 by Bernie Pruss

Data validation is a critical component of any data pipeline, ensuring that transformations produce accurate and consistent results. While many data engineers rely on basic row counts and null checks, SQL set operations provide a more sophisticated approach to data validation that can catch subtle data quality issues.

In this tutorial, we’ll explore how to leverage Snowflake’s SQL set operations—EXCEPT, INTERSECT, and UNION—to build comprehensive data validation checks that go beyond simple aggregations.

Understanding SQL Set Operations

SQL set operations allow you to compare datasets at the row level, making them ideal for validation scenarios where you need to identify differences between expected and actual results.

The Three Key Operations

  1. EXCEPT: Returns rows in the first dataset that don’t exist in the second (Snowflake also accepts MINUS as a synonym)
  2. INTERSECT: Returns rows that exist in both datasets
  3. UNION: Combines rows from both datasets and removes duplicates (UNION ALL keeps them)
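
A quick way to see the difference is to run all three against two tiny inline datasets (the values below are made up purely to show the semantics):

-- Two inline datasets: a = {1, 2, 3}, b = {2, 3, 4}
WITH a AS (SELECT column1 as id FROM (VALUES (1), (2), (3))),
     b AS (SELECT column1 as id FROM (VALUES (2), (3), (4)))
SELECT 'EXCEPT' as op, id
FROM (SELECT id FROM a EXCEPT SELECT id FROM b)     -- returns 1
UNION ALL
SELECT 'INTERSECT' as op, id
FROM (SELECT id FROM a INTERSECT SELECT id FROM b)  -- returns 2, 3
UNION ALL
SELECT 'UNION' as op, id
FROM (SELECT id FROM a UNION SELECT id FROM b)      -- returns 1, 2, 3, 4
ORDER BY op, id;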

Common Data Validation Scenarios

Scenario 1: Validating Data Transformations

When transforming data, you often need to ensure that no records are lost or duplicated unexpectedly. Here’s how to validate a transformation that should preserve all records. Note that EXCEPT compares distinct rows, so it flags missing or extra rows but not exact duplicates; the count check after the query covers that case:

-- Validate that transformation preserves all records
WITH source_data AS (
  SELECT customer_id, order_date, amount
  FROM raw_orders
  WHERE order_date >= '2024-01-01'
),
transformed_data AS (
  SELECT customer_id, order_date, amount
  FROM processed_orders
  WHERE order_date >= '2024-01-01'
),
missing_records AS (
  SELECT 'MISSING_FROM_TARGET' as validation_type, *
  FROM (
    SELECT * FROM source_data
    EXCEPT
    SELECT * FROM transformed_data
  )
),
extra_records AS (
  SELECT 'EXTRA_IN_TARGET' as validation_type, *
  FROM (
    SELECT * FROM transformed_data
    EXCEPT
    SELECT * FROM source_data
  )
)
SELECT * FROM missing_records
UNION ALL
SELECT * FROM extra_records;
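
Because EXCEPT compares distinct rows, pair it with a duplicate check to catch records that were copied rather than dropped (same tables as above):

-- Rows that appear more than once in the target after the transformation
SELECT customer_id, order_date, amount, COUNT(*) as copies
FROM processed_orders
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2, 3
HAVING COUNT(*) > 1;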

Scenario 2: Identifying Data Drift

Data drift occurs when the characteristics of your data change over time. Set operations can help identify when new patterns emerge:

-- Compare the current product mix with a historical baseline
WITH current_patterns AS (
  SELECT 
    product_category,
    COUNT(*) as record_count,
    AVG(price) as avg_price
  FROM current_sales
  WHERE created_at >= CURRENT_DATE - 7
  GROUP BY 1
),
baseline_patterns AS (
  SELECT 
    product_category,
    COUNT(*) as record_count,
    AVG(price) as avg_price
  FROM historical_sales
  WHERE created_at BETWEEN CURRENT_DATE - 30 AND CURRENT_DATE - 8
  GROUP BY 1
),
new_categories AS (
  -- Compare on product_category only: raw counts and averages from
  -- windows of different lengths will never match exactly
  SELECT product_category FROM current_patterns
  EXCEPT
  SELECT product_category FROM baseline_patterns
)
SELECT 'NEW_PATTERN' as drift_type, c.*
FROM current_patterns c
JOIN new_categories n ON n.product_category = c.product_category
WHERE c.record_count > 10; -- Filter out noise
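
The reverse check, categories that were present in the baseline but have disappeared from the current window, is the mirror image:

-- Categories seen in the baseline window but absent from the last 7 days
SELECT 'MISSING_PATTERN' as drift_type, product_category
FROM (
  SELECT product_category FROM historical_sales
  WHERE created_at BETWEEN CURRENT_DATE - 30 AND CURRENT_DATE - 8
  EXCEPT
  SELECT product_category FROM current_sales
  WHERE created_at >= CURRENT_DATE - 7
);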

Scenario 3: Cross-System Data Reconciliation

When data flows between systems, validation ensures consistency across platforms:

-- Reconcile customer data between CRM and data warehouse
WITH crm_customers AS (
  SELECT 
    customer_id,
    email,
    UPPER(TRIM(first_name)) as first_name,
    UPPER(TRIM(last_name)) as last_name,
    status
  FROM crm_system.customers
  WHERE is_active = TRUE
),
warehouse_customers AS (
  SELECT 
    customer_id,
    email,
    UPPER(TRIM(first_name)) as first_name,
    UPPER(TRIM(last_name)) as last_name,
    status
  FROM warehouse.dim_customers
  WHERE is_active = TRUE
),
reconciliation_report AS (
  -- Records in CRM but not in warehouse
  SELECT 'CRM_ONLY' as source_system, *
  FROM (
    SELECT * FROM crm_customers
    EXCEPT
    SELECT * FROM warehouse_customers
  )

  UNION ALL

  -- Records in warehouse but not in CRM
  SELECT 'WAREHOUSE_ONLY' as source_system, *
  FROM (
    SELECT * FROM warehouse_customers
    EXCEPT
    SELECT * FROM crm_customers
  )
)
SELECT 
  source_system,
  COUNT(*) as discrepancy_count,
  ARRAY_AGG(customer_id) as affected_customers
FROM reconciliation_report
GROUP BY source_system;
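
INTERSECT provides the complementary view: how many customers agree exactly after normalization. A quick sketch against the same tables:

-- Customers whose normalized attributes match exactly in both systems
WITH crm_customers AS (
  SELECT 
    customer_id,
    email,
    UPPER(TRIM(first_name)) as first_name,
    UPPER(TRIM(last_name)) as last_name,
    status
  FROM crm_system.customers
  WHERE is_active = TRUE
),
warehouse_customers AS (
  SELECT 
    customer_id,
    email,
    UPPER(TRIM(first_name)) as first_name,
    UPPER(TRIM(last_name)) as last_name,
    status
  FROM warehouse.dim_customers
  WHERE is_active = TRUE
)
SELECT COUNT(*) as matching_customers
FROM (
  SELECT * FROM crm_customers
  INTERSECT
  SELECT * FROM warehouse_customers
);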

Advanced Validation Patterns

Fuzzy Matching with Set Operations

Sometimes exact matches aren’t sufficient. Normalizing values before comparing them lets set operations tolerate superficial formatting differences; a genuinely fuzzy variant follows below:

-- Validate address data after normalization
WITH normalized_addresses AS (
  SELECT 
    customer_id,
    REGEXP_REPLACE(UPPER(address), '[^A-Z0-9\\s]', '') as clean_address,
    city,
    state,
    zip_code
  FROM customer_addresses
),
validation_base AS (
  SELECT 
    customer_id,
    clean_address,
    city,
    state,
    LEFT(zip_code, 5) as zip5 -- Normalize to 5-digit ZIP
  FROM normalized_addresses
),
target_base AS (
  -- Apply the same normalization to the target so the comparison is like-for-like
  SELECT 
    customer_id,
    REGEXP_REPLACE(UPPER(address), '[^A-Z0-9\\s]', '') as clean_address,
    city,
    state,
    LEFT(zip_code, 5) as zip5
  FROM target_addresses
)
-- Find addresses that appear in the source but not the target after normalization
SELECT 'NORMALIZATION_DISCREPANCY' as issue_type, *
FROM (
  SELECT * FROM validation_base
  EXCEPT
  SELECT * FROM target_base
);
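
For genuinely fuzzy comparisons, Snowflake’s EDITDISTANCE function can relax the match after EXCEPT has narrowed the candidates. A sketch against the same hypothetical tables; the threshold of 3 is arbitrary:

-- Source addresses with no near-match (edit distance <= 3) for the same customer in the target
WITH source_norm AS (
  SELECT 
    customer_id,
    REGEXP_REPLACE(UPPER(address), '[^A-Z0-9\\s]', '') as clean_address
  FROM customer_addresses
),
target_norm AS (
  SELECT 
    customer_id,
    REGEXP_REPLACE(UPPER(address), '[^A-Z0-9\\s]', '') as clean_address
  FROM target_addresses
),
unmatched AS (
  -- Exact mismatches first, via EXCEPT
  SELECT * FROM source_norm
  EXCEPT
  SELECT * FROM target_norm
)
SELECT u.*
FROM unmatched u
WHERE NOT EXISTS (
  SELECT 1
  FROM target_norm t
  WHERE t.customer_id = u.customer_id
    AND EDITDISTANCE(t.clean_address, u.clean_address) <= 3
);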

Performance Optimization Tips

  1. Use appropriate clustering keys: Cluster both tables on the columns used in the comparison so micro-partition pruning can skip data (see the sketch after the example below)
  2. Leverage materialized views: For frequently-run validations, materialize the filtered inputs; note that Snowflake materialized views can’t contain set operators themselves
  3. Scope comparisons by date: Filter both sides to the same date range so only the relevant micro-partitions are scanned
  4. Don’t look for indexes: Snowflake has no traditional indexes; good clustering covers most cases, and the search optimization service helps with highly selective lookups

-- Example of date-scoped validation for better performance
WITH source_scope AS (
  SELECT event_date, event_id, user_id
  FROM source_events
  WHERE event_date >= CURRENT_DATE - 1
),
target_scope AS (
  SELECT event_date, event_id, user_id
  FROM target_events
  WHERE event_date >= CURRENT_DATE - 1
),
mismatches AS (
  SELECT * FROM source_scope
  EXCEPT
  SELECT * FROM target_scope
)
SELECT 
  DATE_TRUNC('day', event_date) as partition_date,
  COUNT(*) as discrepancy_count
FROM mismatches
GROUP BY 1;
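
A sketch of tips 1 and 2, using the hypothetical orders tables from Scenario 1:

-- Tip 1: cluster both sides on the columns used in the comparison
ALTER TABLE raw_orders CLUSTER BY (order_date, customer_id);
ALTER TABLE processed_orders CLUSTER BY (order_date, customer_id);

-- Tip 2: materialize the filtered input once and run the EXCEPT against the view
-- (Snowflake materialized views can't contain set operators or CURRENT_DATE, hence the fixed date)
CREATE MATERIALIZED VIEW processed_orders_2024 AS
SELECT customer_id, order_date, amount
FROM processed_orders
WHERE order_date >= '2024-01-01';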

Building a Validation Framework

Create a reusable validation routine using a Snowflake Scripting stored procedure:

CREATE OR REPLACE PROCEDURE validate_table_consistency(
  source_table STRING,
  target_table STRING,
  key_columns ARRAY,
  comparison_date DATE
)
RETURNS TABLE(validation_type STRING, discrepancy_count NUMBER)
LANGUAGE SQL
AS
$$
DECLARE
  key_cols STRING;
  validation_sql STRING;
  res RESULTSET;
BEGIN
  -- Build key column list
  key_cols := ARRAY_TO_STRING(key_columns, ', ');
  
  -- Dynamic SQL for validation (assumes both tables have a date_column)
  validation_sql := 'WITH source_data AS (SELECT ' || key_cols || 
                   ' FROM ' || source_table || 
                   ' WHERE date_column = ''' || comparison_date || '''), ' ||
                   'target_data AS (SELECT ' || key_cols || 
                   ' FROM ' || target_table || 
                   ' WHERE date_column = ''' || comparison_date || ''') ' ||
                   'SELECT ''MISSING_FROM_TARGET'' as validation_type, COUNT(*) as discrepancy_count ' ||
                   'FROM (SELECT * FROM source_data EXCEPT SELECT * FROM target_data) ' ||
                   'UNION ALL ' ||
                   'SELECT ''EXTRA_IN_TARGET'' as validation_type, COUNT(*) as discrepancy_count ' ||
                   'FROM (SELECT * FROM target_data EXCEPT SELECT * FROM source_data)';
  
  -- Execute validation and return the result set
  res := (EXECUTE IMMEDIATE :validation_sql);
  RETURN TABLE(res);
END;
$$;
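
Calling it might look like this (table names are hypothetical, and the procedure assumes both tables expose a date_column). Because the SQL is assembled by string concatenation, pass only trusted table names:

CALL validate_table_consistency(
  'raw_orders',
  'processed_orders',
  ARRAY_CONSTRUCT('customer_id', 'order_date', 'amount'),
  '2024-06-01'::DATE
);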

Conclusion

SQL set operations provide a powerful toolkit for data validation that goes far beyond basic counts and null checks. By implementing these patterns in your Snowflake data pipelines, you can:

  • Catch data quality issues early in the pipeline
  • Ensure consistency across system boundaries
  • Monitor for data drift and anomalies
  • Build confidence in your data transformations

The key to successful validation is making it systematic and automated. Consider integrating these validation patterns into your dbt tests, Airflow DAGs, or other orchestration tools to ensure data quality remains a first-class concern in your data architecture.
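
As one concrete example, the Scenario 1 check drops straight into a dbt singular test: a SQL file under tests/ that fails whenever it returns rows (model names here are hypothetical):

-- tests/assert_processed_orders_match_raw.sql
SELECT customer_id, order_date, amount
FROM {{ ref('raw_orders') }}
WHERE order_date >= '2024-01-01'
EXCEPT
SELECT customer_id, order_date, amount
FROM {{ ref('processed_orders') }}
WHERE order_date >= '2024-01-01'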

Remember: the best validation is the validation that runs automatically and fails loudly when issues are detected. Set operations give you the precision tools needed to build that level of confidence in your data systems.