Snowflake Data Validation
Posted on August 17, 2025 by Bernie Pruss
Data validation is a critical component of any data pipeline, ensuring that transformations produce accurate and consistent results. While many data engineers rely on basic row counts and null checks, SQL set operations provide a more sophisticated approach to data validation that can catch subtle data quality issues.
In this tutorial, we’ll explore how to leverage Snowflake’s SQL set operations—EXCEPT, INTERSECT, and UNION—to build comprehensive data validation checks that go beyond simple aggregations.
Understanding SQL Set Operations
SQL set operations allow you to compare datasets at the row level, making them ideal for validation scenarios where you need to identify differences between expected and actual results.
The Three Key Operations
- EXCEPT: Returns the distinct rows in the first dataset that don’t appear in the second
- INTERSECT: Returns the distinct rows that appear in both datasets
- UNION: Combines rows from both datasets, removing duplicates (UNION ALL keeps them)
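To make the semantics concrete, here is a minimal illustration using inline VALUES. Note that EXCEPT and INTERSECT deduplicate their results, and that Snowflake also accepts MINUS as a synonym for EXCEPT:
-- EXCEPT: rows on the left that are absent on the right
SELECT column1 FROM VALUES (1), (2), (3)
EXCEPT
SELECT column1 FROM VALUES (2), (3), (4);   -- returns 1

-- INTERSECT: rows present on both sides
SELECT column1 FROM VALUES (1), (2), (3)
INTERSECT
SELECT column1 FROM VALUES (2), (3), (4);   -- returns 2, 3

-- UNION: all rows, deduplicated (UNION ALL would keep duplicates)
SELECT column1 FROM VALUES (1), (2)
UNION
SELECT column1 FROM VALUES (2), (3);        -- returns 1, 2, 3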
Common Data Validation Scenarios
Scenario 1: Validating Data Transformations
When transforming data, you often need to ensure that no records are lost or added unexpectedly. Here’s how to validate a transformation that should preserve all records. One caveat: because EXCEPT deduplicates, it won’t catch extra copies of a row that also exists in the source, so pair it with a row-count check if duplication is a concern:
-- Validate that the transformation preserves all records
WITH source_data AS (
    SELECT customer_id, order_date, amount
    FROM raw_orders
    WHERE order_date >= '2024-01-01'
),
transformed_data AS (
    SELECT customer_id, order_date, amount
    FROM processed_orders
    WHERE order_date >= '2024-01-01'
),
missing_records AS (
    -- Rows present in the source but absent from the target
    SELECT * FROM source_data
    EXCEPT
    SELECT * FROM transformed_data
),
extra_records AS (
    -- Rows present in the target with no source counterpart
    SELECT * FROM transformed_data
    EXCEPT
    SELECT * FROM source_data
)
SELECT 'MISSING_FROM_TARGET' as validation_type, * FROM missing_records
UNION ALL
SELECT 'EXTRA_IN_TARGET' as validation_type, * FROM extra_records;
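A clean run returns zero rows. For automated checks it’s often handier to collapse one direction of the comparison into a single number an orchestrator can assert on; this sketch reuses the columns from the example above and counts only rows missing from the target (run the mirror-image EXCEPT for extras):
-- Assert-style wrapper: a non-zero count means the validation failed
SELECT COUNT(*) as missing_from_target
FROM (
    SELECT customer_id, order_date, amount FROM raw_orders WHERE order_date >= '2024-01-01'
    EXCEPT
    SELECT customer_id, order_date, amount FROM processed_orders WHERE order_date >= '2024-01-01'
);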
Scenario 2: Identifying Data Drift
Data drift occurs when the characteristics of your data change over time. Set operations can surface patterns, such as product categories, that appear in the current window but were absent from the baseline. One caveat: EXCEPT compares rows by exact equality, so don’t include columns that can never match across the two windows (like a date key) and avoid comparing raw floating-point aggregates:
-- Compare current data patterns with a historical baseline
WITH current_patterns AS (
    SELECT
        product_category,
        COUNT(*) as record_count,
        ROUND(AVG(price), 2) as avg_price
    FROM current_sales
    WHERE created_at >= CURRENT_DATE - 7
    GROUP BY 1
),
baseline_patterns AS (
    SELECT
        product_category,
        COUNT(*) as record_count,
        ROUND(AVG(price), 2) as avg_price
    FROM historical_sales
    WHERE created_at BETWEEN CURRENT_DATE - 30 AND CURRENT_DATE - 8
    GROUP BY 1
),
new_categories AS (
    -- Compare on the stable key only; counts and averages will rarely match exactly
    SELECT product_category FROM current_patterns
    EXCEPT
    SELECT product_category FROM baseline_patterns
)
SELECT 'NEW_PATTERN' as drift_type, c.*
FROM current_patterns c
JOIN new_categories USING (product_category)
WHERE c.record_count > 10; -- Filter out low-volume noise
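The symmetric check, baseline EXCEPT current, flags patterns that have disappeared, which is often the more alarming direction of drift. Reusing the CTEs above:
-- Categories present in the baseline that vanished from the current window
SELECT product_category FROM baseline_patterns
EXCEPT
SELECT product_category FROM current_patterns;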
Scenario 3: Cross-System Data Reconciliation
When data flows between systems, validation ensures consistency across platforms:
-- Reconcile customer data between CRM and data warehouse
WITH crm_customers AS (
    SELECT
        customer_id,
        email,
        UPPER(TRIM(first_name)) as first_name,
        UPPER(TRIM(last_name)) as last_name,
        status
    FROM crm_system.customers
    WHERE is_active = TRUE
),
warehouse_customers AS (
    SELECT
        customer_id,
        email,
        UPPER(TRIM(first_name)) as first_name,
        UPPER(TRIM(last_name)) as last_name,
        status
    FROM warehouse.dim_customers
    WHERE is_active = TRUE
),
reconciliation_report AS (
    -- Parentheses matter here: UNION and EXCEPT otherwise evaluate left to right
    -- Records in CRM but not in the warehouse
    (
        SELECT 'CRM_ONLY' as source_system, * FROM crm_customers
        EXCEPT
        SELECT 'CRM_ONLY' as source_system, * FROM warehouse_customers
    )
    UNION ALL
    -- Records in the warehouse but not in CRM
    (
        SELECT 'WAREHOUSE_ONLY' as source_system, * FROM warehouse_customers
        EXCEPT
        SELECT 'WAREHOUSE_ONLY' as source_system, * FROM crm_customers
    )
)
SELECT
    source_system,
    COUNT(*) as discrepancy_count,
    ARRAY_AGG(customer_id) as affected_customers
FROM reconciliation_report
GROUP BY source_system;
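INTERSECT gives the complementary view: the rows on which both systems fully agree, which is useful for reporting a match rate alongside the discrepancies. Reusing the two CTEs above:
-- Count of customer rows that are identical in both systems
SELECT COUNT(*) as matching_customers
FROM (
    SELECT * FROM crm_customers
    INTERSECT
    SELECT * FROM warehouse_customers
);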
Advanced Validation Patterns
Fuzzy Matching with Set Operations
Exact matches on raw values often fail for trivial formatting reasons. Normalizing both sides the same way before the set operation surfaces real discrepancies instead of punctuation noise:
-- Validate address data after normalizing both sides identically
WITH source_normalized AS (
    SELECT
        customer_id,
        REGEXP_REPLACE(UPPER(address), '[^A-Z0-9\\s]', '') as clean_address,
        UPPER(TRIM(city)) as city,
        UPPER(TRIM(state)) as state,
        LEFT(zip_code, 5) as zip5 -- Normalize to 5-digit ZIP
    FROM customer_addresses
),
target_normalized AS (
    -- The target must pass through the same normalization, or every row will differ
    SELECT
        customer_id,
        REGEXP_REPLACE(UPPER(address), '[^A-Z0-9\\s]', '') as clean_address,
        UPPER(TRIM(city)) as city,
        UPPER(TRIM(state)) as state,
        LEFT(zip_code, 5) as zip5
    FROM target_addresses
)
-- Find addresses that appear in source but not target after normalization
SELECT 'NORMALIZATION_DISCREPANCY' as issue_type, *
FROM (
    SELECT * FROM source_normalized
    EXCEPT
    SELECT * FROM target_normalized
);
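To go genuinely fuzzy, you can score the leftover rows against the target with Snowflake’s EDITDISTANCE function. This sketch assumes the EXCEPT result above has been captured in a source_only CTE or temporary table; the threshold of 5 is an arbitrary illustration:
-- A small edit distance suggests a formatting issue rather than a truly missing record
SELECT
    s.customer_id,
    s.clean_address as source_address,
    t.clean_address as target_address,
    EDITDISTANCE(s.clean_address, t.clean_address) as distance
FROM source_only s
JOIN target_normalized t USING (customer_id)
WHERE EDITDISTANCE(s.clean_address, t.clean_address) <= 5;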
Performance Optimization Tips
- Use appropriate clustering keys: Ensure tables are clustered on the columns used in set operations
- Leverage materialized views: For frequently run validations, consider materializing the normalized inputs
- Scope comparisons by date: Filter both sides to the same window so micro-partition pruning limits the data scanned, as in the example below
- Don’t reach for indexes: Snowflake has no traditional indexes; well-chosen clustering keys play that role for comparison columns
-- Example of scoping a validation to recent data for better performance
-- Identical filters on both sides let Snowflake prune micro-partitions
WITH source_scope AS (
    SELECT event_date, event_id, user_id
    FROM source_events
    WHERE event_date >= CURRENT_DATE - 1
),
target_scope AS (
    SELECT event_date, event_id, user_id
    FROM target_events
    WHERE event_date >= CURRENT_DATE - 1
),
discrepancies AS (
    SELECT * FROM source_scope
    EXCEPT
    SELECT * FROM target_scope
)
SELECT
    DATE_TRUNC('day', event_date) as partition_date,
    COUNT(*) as discrepancy_count
FROM discrepancies
GROUP BY 1;
Building a Validation Framework
Create reusable validation checks using Snowflake Scripting stored procedures:
CREATE OR REPLACE PROCEDURE validate_table_consistency(
    source_table STRING,
    target_table STRING,
    key_columns ARRAY,
    comparison_date DATE
)
RETURNS TABLE(validation_type STRING, discrepancy_count NUMBER)
LANGUAGE SQL
AS
$$
DECLARE
    key_cols STRING;
    validation_sql STRING;
    res RESULTSET;
BEGIN
    -- Build the key column list from the array argument
    key_cols := ARRAY_TO_STRING(key_columns, ', ');
    -- Dynamic SQL for the validation; assumes both tables expose a date_column
    validation_sql := 'WITH source_data AS (SELECT ' || key_cols ||
                      ' FROM ' || source_table ||
                      ' WHERE date_column = ''' || comparison_date || '''), ' ||
                      'target_data AS (SELECT ' || key_cols ||
                      ' FROM ' || target_table ||
                      ' WHERE date_column = ''' || comparison_date || ''') ' ||
                      'SELECT ''MISSING_FROM_TARGET'' as validation_type, COUNT(*) ' ||
                      'FROM (SELECT * FROM source_data EXCEPT SELECT * FROM target_data) ' ||
                      'UNION ALL ' ||
                      'SELECT ''EXTRA_IN_TARGET'' as validation_type, COUNT(*) ' ||
                      'FROM (SELECT * FROM target_data EXCEPT SELECT * FROM source_data)';
    -- Execute the dynamic SQL and return its result set
    res := (EXECUTE IMMEDIATE :validation_sql);
    RETURN TABLE(res);
END;
$$;
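A hypothetical invocation might look like this; the table names are placeholders, and both tables must expose the date_column the dynamic SQL expects:
-- Compare yesterday's slice of two tables on three key columns
CALL validate_table_consistency(
    'analytics.raw_orders',
    'analytics.processed_orders',
    ARRAY_CONSTRUCT('customer_id', 'order_date', 'amount'),
    CURRENT_DATE - 1
);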
Conclusion
SQL set operations provide a powerful toolkit for data validation that goes far beyond basic counts and null checks. By implementing these patterns in your Snowflake data pipelines, you can:
- Catch data quality issues early in the pipeline
- Ensure consistency across system boundaries
- Monitor for data drift and anomalies
- Build confidence in your data transformations
The key to successful validation is making it systematic and automated. Consider integrating these validation patterns into your dbt tests, Airflow DAGs, or other orchestration tools to ensure data quality remains a first-class concern in your data architecture.
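For example, a dbt singular test is just a SQL file under tests/ that fails when it returns any rows, so the EXCEPT pattern drops in directly. This sketch assumes both tables are dbt models (a raw table would use source() instead of ref()):
-- tests/assert_orders_preserved.sql: dbt fails this test if any rows come back
SELECT customer_id, order_date, amount FROM {{ ref('raw_orders') }}
EXCEPT
SELECT customer_id, order_date, amount FROM {{ ref('processed_orders') }}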
Remember: the best validation is the validation that runs automatically and fails loudly when issues are detected. Set operations give you the precision tools needed to build that level of confidence in your data systems.