Hub entities form the foundation of any Data Vault 2.0 architecture, serving as the central points of integration for business keys across your enterprise data warehouse.
Yet despite their structural simplicity, comprehensive hub testing requires validation techniques that go far beyond basic row count comparisons.
In this guide, we'll explore advanced hub validation approaches that ensure data quality, maintain referential integrity, and protect the reliability of your entire Data Vault implementation.
Whether you're a Data Vault architect designing testing frameworks or a data engineer implementing quality checks, these techniques will help you catch the hub defects that cascade farthest and fastest through the model.
SQL dialect note: The examples below mix dialects for readability. SPLIT_PART() is available in PostgreSQL and Snowflake, the !~ regex operator is PostgreSQL-specific, and ARRAY_AGG(x LIMIT n) follows BigQuery syntax; adjust these constructs for your target platform (SQL Server, etc.).
Before diving into advanced techniques, let's establish what makes hub testing unique within the Data Vault paradigm.
Hubs represent core business concepts — customers, products, accounts, or any entity with a persistent business identity. Unlike traditional dimensional models, Data Vault hubs are intentionally minimal, typically containing only:
- A hash key (the hub's primary key, derived from the business key)
- The business key itself
- A load timestamp
- A record source attribute
This minimalist design doesn't mean testing is simple. In fact, the hub's role as the authoritative source for business key management makes hub validation critical to overall data warehouse integrity.
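For reference, the hub_customer table used throughout the examples below follows this minimal shape. This is a sketch, not a prescribed schema; column types and the hash key length will vary with your platform and hashing algorithm:

```sql
-- Minimal hub structure (sketch; adjust types for your platform)
CREATE TABLE hub_customer (
    customer_hub_key  CHAR(64)      NOT NULL,  -- SHA-256 hash of the business key, hex-encoded
    customer_id       VARCHAR(100)  NOT NULL,  -- business key from the source system
    load_timestamp    TIMESTAMP     NOT NULL,  -- when the record entered the vault
    record_source     VARCHAR(50)   NOT NULL,  -- originating source system code
    CONSTRAINT pk_hub_customer PRIMARY KEY (customer_hub_key)
);
```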
The Three Pillars of Hub Testing:
- Business key integrity: the hub remains the single source of truth for business identifiers
- Technical correctness: hash keys are deterministic, collision-free, and consistently generated
- Metadata accuracy: load timestamps and record sources support auditability and lineage
Each pillar requires specific testing approaches that we'll explore in depth.
Business key validation represents the most critical aspect of hub testing. Since hubs serve as the single source of truth for business identifiers, any compromise in business key quality cascades throughout your entire Data Vault architecture.
In enterprise environments, the same business entity often originates from multiple source systems, each potentially using different identifier formats or conventions. Advanced hub testing must validate that your hub correctly handles these variations.
-- Test: Multi-source business key reconciliation
-- Identifies hub records that cannot be traced back to their source systems
WITH source_comparison AS (
SELECT
h.customer_hub_key,
h.customer_id AS hub_business_key,
h.record_source,
s1.customer_number AS crm_key,
s2.account_id AS erp_key,
s3.user_id AS web_key
FROM hub_customer h
LEFT JOIN staging_crm s1
ON h.customer_id = s1.customer_number
AND h.record_source = 'CRM'
LEFT JOIN staging_erp s2
ON h.customer_id = s2.account_id
AND h.record_source = 'ERP'
LEFT JOIN staging_web s3
ON h.customer_id = s3.user_id
AND h.record_source = 'WEB'
),
validation_results AS (
SELECT
record_source,
COUNT(*) AS total_records,
COUNT(CASE
WHEN record_source = 'CRM' AND crm_key IS NULL THEN 1
WHEN record_source = 'ERP' AND erp_key IS NULL THEN 1
WHEN record_source = 'WEB' AND web_key IS NULL THEN 1
END) AS orphaned_keys,
COUNT(CASE
WHEN hub_business_key IS NULL OR
TRIM(hub_business_key) = '' THEN 1
END) AS null_or_empty_keys
FROM source_comparison
GROUP BY record_source
)
SELECT
record_source,
total_records,
orphaned_keys,
null_or_empty_keys,
ROUND(100.0 * orphaned_keys / NULLIF(total_records, 0), 2) AS orphan_percentage
FROM validation_results
WHERE orphaned_keys > 0 OR null_or_empty_keys > 0;
-- Expected result: 0 rows
-- Impact if found: ETL logic errors or source data corruption; orphaned hub records
-- will cause broken link and satellite relationships downstream
This validation pattern identifies business keys that exist in the hub but cannot be traced back to their source systems — a critical data quality issue that often indicates ETL logic errors or source data corruption.
Many business entities require composite keys — multiple attributes that together form a unique identifier. Testing composite business keys requires verifying that all components are present, correctly ordered, consistently delimited, and formatted to spec.
-- Test: Composite business key component validation
-- Assumes composite key format: "COUNTRY|REGION|CUSTOMER_NUM"
WITH key_analysis AS (
SELECT
customer_hub_key,
customer_id,
SPLIT_PART(customer_id, '|', 1) AS country_code,
SPLIT_PART(customer_id, '|', 2) AS region_code,
SPLIT_PART(customer_id, '|', 3) AS customer_number,
LENGTH(customer_id) - LENGTH(REPLACE(customer_id, '|', '')) AS delimiter_count,
load_timestamp,
record_source
FROM hub_customer
WHERE customer_id LIKE '%|%' -- Composite keys contain delimiters
)
SELECT
'Missing Components' AS test_category,
COUNT(*) AS failure_count,
ARRAY_AGG(customer_hub_key LIMIT 10) AS sample_failures
FROM key_analysis
WHERE country_code IS NULL
OR region_code IS NULL
OR customer_number IS NULL
OR TRIM(country_code) = ''
OR TRIM(region_code) = ''
OR TRIM(customer_number) = ''
UNION ALL
SELECT
'Incorrect Delimiter Count' AS test_category,
COUNT(*) AS failure_count,
ARRAY_AGG(customer_hub_key LIMIT 10) AS sample_failures
FROM key_analysis
WHERE delimiter_count != 2 -- Expected exactly 2 delimiters for a 3-part key
UNION ALL
SELECT
'Invalid Component Format' AS test_category,
COUNT(*) AS failure_count,
ARRAY_AGG(customer_hub_key LIMIT 10) AS sample_failures
FROM key_analysis
WHERE country_code !~ '^[A-Z]{2}$' -- Country code: exactly 2 uppercase letters
OR region_code !~ '^[A-Z]{2,3}$' -- Region code: 2-3 uppercase letters
OR customer_number !~ '^[0-9]{8}$'; -- Customer number: exactly 8 digits
-- Expected result: 0 rows across all three test categories
-- Impact if found: Malformed composite keys generate incorrect hash values,
-- creating phantom hub records and broken downstream relationships
This pattern validates not just presence but also format compliance for each component, catching subtle data quality issues that simple uniqueness checks would miss.
The hub hash key serves as the primary key for hub entities and must be deterministic, collision-resistant, and consistently generated. Advanced hub testing must validate hash key integrity across multiple dimensions.
One of the most insidious defects in Data Vault implementations occurs when hash key generation produces different values for identical business keys — often due to inconsistent handling of whitespace, case sensitivity, or null values.
-- Test: Hash key determinism and reproducibility
-- Regenerates the hash key from source components and compares to stored value
WITH hash_regeneration AS (
SELECT
customer_hub_key AS original_hash_key,
customer_id,
record_source,
-- Regenerate hash using same algorithm and normalization as ETL
-- Using SHA2 (SHA-256) per DV 2.0 best practices
-- Substitute SHA1() or MD5() if your ETL uses a different algorithm
SHA2(
UPPER(TRIM(COALESCE(customer_id, ''))) || '|' ||
UPPER(TRIM(COALESCE(record_source, '')))
) AS regenerated_hash_key,
load_timestamp
FROM hub_customer
),
hash_comparison AS (
SELECT
original_hash_key,
regenerated_hash_key,
customer_id,
record_source,
CASE
WHEN original_hash_key = regenerated_hash_key THEN 'MATCH'
ELSE 'MISMATCH'
END AS hash_status
FROM hash_regeneration
)
SELECT
hash_status,
COUNT(*) AS record_count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage,
ARRAY_AGG(
customer_id || ' (Source: ' || record_source || ')'
LIMIT 5
) FILTER (WHERE hash_status = 'MISMATCH') AS sample_mismatches
FROM hash_comparison
GROUP BY hash_status;
-- Expected result: 100% MATCH, 0 MISMATCH
-- Impact if found: Determinism failures indicate inconsistent normalization (whitespace,
-- case, NULL handling) in the ETL — downstream links and satellites
-- referencing the mismatched hash keys will fail to load correctly
Any mismatches indicate problems with hash key generation logic that will cause downstream link and satellite load failures.
While cryptographic hash functions like SHA-256 have extremely low collision probabilities, testing for collisions confirms that different business keys never map to the same hash value.
-- Test: Hash key collision detection
-- Identifies cases where different business keys share the same hash key
WITH collision_analysis AS (
SELECT
customer_hub_key,
customer_id,
record_source,
COUNT(*) OVER (PARTITION BY customer_hub_key) AS key_usage_count,
STRING_AGG(customer_id, ' | ')
OVER (PARTITION BY customer_hub_key) AS all_business_keys,
STRING_AGG(record_source, ' | ')
OVER (PARTITION BY customer_hub_key) AS all_sources
FROM hub_customer
)
SELECT DISTINCT
customer_hub_key AS colliding_hash_key,
key_usage_count AS collision_count,
all_business_keys AS conflicting_keys,
all_sources AS conflicting_sources
FROM collision_analysis
WHERE key_usage_count > 1
ORDER BY key_usage_count DESC;
-- Expected result: 0 rows
-- Impact if found: Critical defect — different entities sharing a hash key will
-- corrupt all downstream satellites and links attached to that hub record
Any results from this query represent critical defects requiring immediate investigation and remediation.
Data Vault 2.0 hub testing extends beyond business keys and hash values to include comprehensive metadata validation. Metadata attributes enable auditability, traceability, and temporal analysis — core Data Vault principles.
Load timestamps should follow logical sequences and align with source system extraction times. Anomalies in timestamp patterns often indicate ETL job failures, timezone handling errors, or data replay issues.
-- Test: Load timestamp sequence and anomaly detection
WITH timestamp_analysis AS (
SELECT
customer_hub_key,
customer_id,
load_timestamp,
record_source,
LAG(load_timestamp) OVER (
PARTITION BY record_source
ORDER BY load_timestamp
) AS previous_load_timestamp,
MIN(load_timestamp) OVER (PARTITION BY record_source) AS first_load,
MAX(load_timestamp) OVER (PARTITION BY record_source) AS last_load
FROM hub_customer
),
anomaly_detection AS (
SELECT
customer_hub_key,
customer_id,
record_source,
load_timestamp,
CASE
WHEN load_timestamp > CURRENT_TIMESTAMP
THEN 'FUTURE_TIMESTAMP'
WHEN load_timestamp < previous_load_timestamp
THEN 'OUT_OF_SEQUENCE'
WHEN EXTRACT(EPOCH FROM (load_timestamp - previous_load_timestamp)) > 86400 * 30
THEN 'LARGE_GAP' -- Gap > 30 days; adjust threshold to match your load frequency
ELSE 'NORMAL'
END AS timestamp_status
FROM timestamp_analysis
)
SELECT
timestamp_status,
record_source,
COUNT(*) AS occurrence_count,
MIN(load_timestamp) AS earliest_occurrence,
MAX(load_timestamp) AS latest_occurrence,
ARRAY_AGG(customer_id LIMIT 5) AS sample_records
FROM anomaly_detection
WHERE timestamp_status != 'NORMAL'
GROUP BY timestamp_status, record_source
ORDER BY occurrence_count DESC;
-- Expected result: 0 rows
-- Impact if found: OUT_OF_SEQUENCE or FUTURE_TIMESTAMP records indicate timezone issues
-- or ETL replay problems; LARGE_GAP may indicate a missed extraction window
Record source attributes enable data lineage tracking and source system identification. Validating record source consistency ensures that every hub record can be traced to its origin.
-- Test: Record source validation and lineage verification
WITH source_validation AS (
SELECT
h.customer_hub_key,
h.customer_id,
h.record_source,
h.load_timestamp,
CASE
WHEN h.record_source NOT IN (
SELECT DISTINCT source_system_code
FROM metadata.approved_source_systems
) THEN 'INVALID_SOURCE'
WHEN h.record_source IS NULL OR TRIM(h.record_source) = ''
THEN 'NULL_OR_EMPTY'
WHEN h.record_source != UPPER(h.record_source)
THEN 'CASE_INCONSISTENCY'
WHEN h.record_source LIKE '% %'
THEN 'CONTAINS_WHITESPACE'
ELSE 'VALID'
END AS validation_status
FROM hub_customer h
)
SELECT
validation_status,
record_source,
COUNT(*) AS record_count,
MIN(load_timestamp) AS first_occurrence,
MAX(load_timestamp) AS last_occurrence,
ARRAY_AGG(DISTINCT customer_id LIMIT 10) AS sample_business_keys
FROM source_validation
WHERE validation_status != 'VALID'
GROUP BY validation_status, record_source
ORDER BY record_count DESC;
-- Expected result: 0 rows
-- Impact if found: Invalid or inconsistent record sources break data lineage
-- and audit trail requirements; CASE_INCONSISTENCY will cause
-- hash key mismatches if record_source is included in the hash input
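The lineage test above assumes a metadata.approved_source_systems reference table. A minimal version of that table might look like the following sketch; the column set beyond source_system_code is illustrative, and the seeded codes simply match the CRM/ERP/WEB sources used in earlier examples:

```sql
-- Illustrative reference table of approved source systems (sketch)
CREATE TABLE metadata.approved_source_systems (
    source_system_code  VARCHAR(50)  NOT NULL PRIMARY KEY,  -- e.g. 'CRM', 'ERP', 'WEB'
    source_system_name  VARCHAR(200),
    onboarded_date      DATE
);

INSERT INTO metadata.approved_source_systems
    (source_system_code, source_system_name, onboarded_date)
VALUES
    ('CRM', 'Customer Relationship Management', DATE '2020-01-01'),
    ('ERP', 'Enterprise Resource Planning',     DATE '2020-01-01'),
    ('WEB', 'Web Portal',                       DATE '2021-06-01');
```

Keeping this list in a governed metadata schema, rather than hard-coding it into each test, means new source systems are approved once and every lineage check picks them up automatically.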
As Data Vault implementations grow, business key uniqueness checks must remain both efficient and consistently enforced. The following query provides a duplicate detection pattern that performs well against large hub tables.
-- Test: Business key uniqueness validation
-- Detects duplicate business key + record source combinations in the hub
WITH duplicate_analysis AS (
SELECT
customer_id,
record_source,
COUNT(*) AS occurrence_count,
ARRAY_AGG(customer_hub_key) AS duplicate_hash_keys,
MIN(load_timestamp) AS first_load,
MAX(load_timestamp) AS last_load
FROM hub_customer
GROUP BY customer_id, record_source
HAVING COUNT(*) > 1
)
SELECT
customer_id,
record_source,
occurrence_count,
duplicate_hash_keys,
first_load,
last_load,
EXTRACT(EPOCH FROM (last_load - first_load)) / 3600 AS hours_between_duplicates
FROM duplicate_analysis
ORDER BY occurrence_count DESC, last_load DESC;
-- Expected result: 0 rows
-- Impact if found: Duplicate business keys in the hub indicate a broken uniqueness
-- constraint in the ETL load logic; each duplicate will generate
-- separate satellite and link records, fragmenting the entity history
Advanced hub validation delivers maximum value when integrated into continuous testing frameworks rather than executed as one-time audits.
The following query provides a comprehensive hub health snapshot suitable for scheduling after each load cycle and persisting to a monitoring table.
-- Test: Automated hub quality metrics dashboard
-- Run after each ETL load; persist results to a monitoring table for trending
WITH quality_metrics AS (
SELECT
CURRENT_TIMESTAMP AS measurement_timestamp,
'hub_customer' AS hub_name,
COUNT(*) AS total_records,
COUNT(DISTINCT customer_id || '|' || record_source) AS unique_business_keys,
COUNT(*) - COUNT(DISTINCT customer_id || '|' || record_source) AS duplicate_count,
COUNT(CASE WHEN customer_id IS NULL THEN 1 END) AS null_key_count,
COUNT(DISTINCT record_source) AS source_system_count,
MIN(load_timestamp) AS earliest_load,
MAX(load_timestamp) AS latest_load
FROM hub_customer
)
SELECT
measurement_timestamp,
hub_name,
total_records,
unique_business_keys,
duplicate_count,
null_key_count,
source_system_count,
earliest_load,
latest_load,
ROUND(100.0 * duplicate_count / NULLIF(total_records, 0), 4) AS duplicate_percentage,
ROUND(100.0 * null_key_count / NULLIF(total_records, 0), 4) AS null_key_percentage,
CASE
WHEN ROUND(100.0 * duplicate_count / NULLIF(total_records, 0), 4) > 0.01 THEN 'CRITICAL'
WHEN ROUND(100.0 * null_key_count / NULLIF(total_records, 0), 4) > 0.01 THEN 'CRITICAL'
WHEN latest_load < CURRENT_TIMESTAMP - INTERVAL '2 hours' THEN 'WARNING'
ELSE 'HEALTHY'
END AS health_status
FROM quality_metrics;
-- Expected result: health_status = 'HEALTHY' after every load
-- Persist this output to a monitoring table to track quality trends over time
-- Adjust the 2-hour staleness threshold to match your load frequency SLA
Persisted over successive load cycles, these snapshots can feed dashboards and alerting for proactive quality management rather than reactive firefighting.
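To persist each snapshot, wrap the metrics query in an INSERT against a simple monitoring table. The sketch below assumes a monitoring schema and a hub_quality_metrics table; both names are illustrative:

```sql
-- Sketch: persist each quality snapshot for trend analysis
CREATE TABLE monitoring.hub_quality_metrics (
    measurement_timestamp  TIMESTAMP,
    hub_name               VARCHAR(100),
    total_records          BIGINT,
    duplicate_percentage   NUMERIC(7,4),
    null_key_percentage    NUMERIC(7,4),
    health_status          VARCHAR(10)
);

-- After each load, run the metrics query above as the source of an INSERT:
-- INSERT INTO monitoring.hub_quality_metrics
-- SELECT measurement_timestamp, hub_name, total_records,
--        duplicate_percentage, null_key_percentage, health_status
-- FROM (<metrics query>) AS snapshot;
```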
Hub entities may be structurally simple, but comprehensive hub validation requires sophisticated approaches that address business key integrity, technical correctness, metadata accuracy, and continuous monitoring at scale.
The SQL patterns in this guide form the foundation of a robust hub testing framework that protects your Data Vault implementation from the failure modes that cascade farthest through the architecture.
Implementing these tests manually across multiple hubs is achievable for smaller vaults, but maintaining them as your hub count grows requires a more systematic approach.
When testing maintenance becomes a bottleneck, metadata-driven testing platforms purpose-built for Data Vault architectures can automatically generate and execute these patterns across all hub structures — ensuring comprehensive coverage without proportional manual effort.
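To illustrate the metadata-driven idea on a small scale, a generator query can emit the uniqueness test for every registered hub. This sketch assumes a hypothetical metadata.hub_registry table with hub_name and business_key_column columns; each output row is a ready-to-run test statement:

```sql
-- Sketch: generate a uniqueness test per registered hub
-- Assumes metadata.hub_registry(hub_name, business_key_column)
SELECT
    'SELECT ''' || hub_name || ''' AS hub_name, '
        || business_key_column || ', record_source, COUNT(*) AS occurrence_count '
        || 'FROM ' || hub_name || ' '
        || 'GROUP BY ' || business_key_column || ', record_source '
        || 'HAVING COUNT(*) > 1;' AS generated_test_sql
FROM metadata.hub_registry;
```

Adding a hub then only requires one registry row; the orchestration layer executes each generated_test_sql statement and flags any that return rows.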
For ready-to-use test templates, automation guidance, and a complete implementation framework covering hubs, links, and satellites, download our free Data Vault Test Plan and Automation Guide.