Data Vault 2.0 is designed to be trustworthy: insert-only, hash-keyed, temporally correct, and fully auditable.

What breaks trust in practice isn't the architecture; it's the absence of the tests that enforce it.

Most Data Vault implementations run with some testing in place. A uniqueness check here, a row count there. But there's a significant difference between having tests and having the right tests at the right layers.

Without systematic coverage, failures propagate silently: a hub loads a duplicate hash key, and every downstream link and satellite fans out. A satellite timeline overlaps by a millisecond, and every point-in-time query for that entity starts returning the wrong version of history — with no error, no alert, just quietly wrong answers.

The 10 tests below are drawn directly from a comprehensive 78-test Data Vault 2.0 test catalog.

They span every major architectural layer, they're the tests that show up most consistently in implementations with quality gaps, and they're the ones where the cost of not testing is highest. Each one includes the SQL template so you can implement it today.


The raw vault is the immutable foundation of the architecture. Staging tests catch problems before bad data enters the insert-only vault. Hub, Link, and Satellite tests protect structural integrity once data is loaded. Problems caught at staging are cheap. Problems loaded into the raw vault require correction loads to fix.


Test 1: Staging Row Count

Layer: Staging | Severity: Critical | Quality Dimension: Completeness

The simplest question in data engineering is also the most consequential: did all the records arrive?

This test compares the row count from the source system against the count in the staging table for the current batch. Any discrepancy — whether data loss or unexpected duplication during extraction — surfaces here, before a single record touches the vault.

-- Placeholders in {braces} stand in for your own schema, table, and column names.

-- === TEST DATASET ===
-- Row count in staging for the current batch
SELECT COUNT(*) AS row_count
FROM   {staging_schema}.{staging_table}
WHERE  batch_id = '{current_batch_id}';

-- === CONTROL DATASET ===
-- Row count in the source system
-- For incremental loads, filter the source query to match extraction criteria
SELECT COUNT(*) AS row_count
FROM   {source_schema}.{source_table};

-- Expected: test row_count = control row_count
-- Failure: Data was lost or duplicated during extraction

When this test fails, halt downstream vault loads immediately.

Anything loaded from incomplete or duplicated staging is incorrect by definition, and the vault's insert-only architecture means there is no straightforward rollback.
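To make that halt concrete, here is a minimal sketch of the row-count gate wired up as a boolean pipeline check. It uses an in-memory SQLite database, and the table names (`src_customer`, `stg_customer`) and batch ID are illustrative stand-ins for your source and staging systems, not names from the test catalog:

```python
import sqlite3

# In-memory stand-ins for the source system and the staging table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customer (customer_id INTEGER);
    CREATE TABLE stg_customer (customer_id INTEGER, batch_id TEXT);
    INSERT INTO src_customer VALUES (1), (2), (3);
    INSERT INTO stg_customer VALUES (1, 'B001'), (2, 'B001'), (3, 'B001');
""")

def staging_row_count_gate(conn, batch_id):
    """Compare staging count for the batch against the source count.
    A False result means downstream vault loads must halt."""
    control = conn.execute("SELECT COUNT(*) FROM src_customer").fetchone()[0]
    test = conn.execute(
        "SELECT COUNT(*) FROM stg_customer WHERE batch_id = ?", (batch_id,)
    ).fetchone()[0]
    return test == control

print(staging_row_count_gate(conn, "B001"))  # True: counts match
conn.execute("DELETE FROM stg_customer WHERE customer_id = 3")  # simulate loss
print(staging_row_count_gate(conn, "B001"))  # False: halt downstream loads
```

In a production orchestrator the gate would raise or fail the task rather than return a boolean, so no vault load can start past a mismatch.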


Test 2: No Duplicate Hash Keys in Hub

Layer: Hub | Severity: Critical | Quality Dimension: Uniqueness

The defining property of a hub is that each business key appears exactly once and produces exactly one hash key. This test returns any hash key that appears more than once. A passing test returns zero rows.

-- Expected: zero rows (no duplicate hash keys)
SELECT   {hub_hash_key},
         COUNT(*) AS occurrence_count
FROM     {vault_schema}.{hub_table}
GROUP BY {hub_hash_key}
HAVING   COUNT(*) > 1
ORDER BY COUNT(*) DESC;

A duplicate hub hash key is not a localized problem.

Every link that references that hub now has an ambiguous join target. Every satellite attached to that hub has two competing parent records. The defect cascades through the entire model downstream from this single hub.


Test 3: Hub Hash Key Recomputation

Layer: Hub | Severity: Critical | Quality Dimension: Consistency

Hash keys must be deterministic: given the same business key, the same hash must always be produced. This test recomputes the hash key from the stored business key using the same algorithm and compares it to the stored value. Any row returned means the stored hash no longer matches what the algorithm would produce from that key today.

-- Expected: zero rows (all stored hash keys match recomputed values)
SELECT {hub_hash_key},
       {business_key},
       {hash_function}(
           UPPER(TRIM(CAST({business_key} AS VARCHAR)))
       ) AS recomputed_hash
FROM   {vault_schema}.{hub_table}
WHERE  {business_key} IS NOT NULL
  AND  CAST({business_key} AS VARCHAR) <> ''
  AND  {hub_hash_key} <> {hash_function}(
           UPPER(TRIM(CAST({business_key} AS VARCHAR)))
       );

This test catches algorithm changes, collation differences, encoding drift, and whitespace normalization inconsistencies — all of which produce stored hash keys that no longer match what links and satellites expect to find when they join back to the hub.

Implementation note: The recomputation logic in this test must use the exact same hash function, input transformations (UPPER, TRIM, COALESCE), concatenation separator (for composite keys), and NULL handling as the load process. Any divergence produces false positives on every row.
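To see how little divergence it takes, here is a short Python sketch. It assumes an MD5-based hash key with UPPER/TRIM normalization purely for illustration; your implementation may use a different function, and the point is only that load and test must agree:

```python
import hashlib

def hash_key(business_key: str, normalize: bool = True) -> str:
    """Illustrative hub hash key: MD5 over the (optionally normalized)
    business key. The real function is whatever your load process uses."""
    value = business_key.upper().strip() if normalize else business_key
    return hashlib.md5(value.encode("utf-8")).hexdigest().upper()

# Same business key, same normalization: deterministic, hashes agree.
print(hash_key(" acme-001 ") == hash_key("ACME-001"))  # True

# Same business key, one side skips normalization: every row mismatches.
print(hash_key(" acme-001 ", normalize=True)
      == hash_key(" acme-001 ", normalize=False))      # False
```

This is exactly the class of drift (case, whitespace, encoding) that Test 3 surfaces as a mismatch on every affected row.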


Test 4: All Link Hash Keys Resolve to Hubs

Layer: Link | Severity: Critical | Quality Dimension: Referential Integrity

Links record relationships between business concepts by storing the hash keys of the hubs they connect. A link's value depends entirely on both hub references being valid. This test LEFT JOINs each hub hash key column in the link back to its parent hub, returning any link rows where the hub record is missing.

-- Expected: zero rows (no orphaned foreign key references)
SELECT
    lnk.{link_hash_key},
    lnk.{hub1_hash_key},
    lnk.{hub2_hash_key},
    lnk.load_datetime,
    lnk.record_source,
    CASE WHEN h1.{hub1_hash_key} IS NULL
         THEN 'Missing in {hub1_table}'
         ELSE NULL
    END AS hub1_status,
    CASE WHEN h2.{hub2_hash_key} IS NULL
         THEN 'Missing in {hub2_table}'
         ELSE NULL
    END AS hub2_status
FROM   {vault_schema}.{link_table} AS lnk
LEFT JOIN {vault_schema}.{hub1_table} AS h1
    ON lnk.{hub1_hash_key} = h1.{hub1_hash_key}
LEFT JOIN {vault_schema}.{hub2_table} AS h2
    ON lnk.{hub2_hash_key} = h2.{hub2_hash_key}
WHERE  h1.{hub1_hash_key} IS NULL
   OR  h2.{hub2_hash_key} IS NULL
ORDER BY lnk.load_datetime DESC;

The most common cause is load ordering: hubs and links are loaded concurrently rather than sequentially, so the link records a relationship before the hub has loaded its end of it. The relationship appears to exist; one end of it points to nothing.


Satellite Temporal Integrity Suite

Satellites carry the most complexity in the raw vault because they track the full history of changes. Each satellite must maintain a clean, non-overlapping, gap-free timeline for every hash key.

The four tests below form the temporal integrity suite. They catch distinct failure modes — missing any one leaves a blind spot in your history tracking.


Test 5: Satellite Hash Diff Recomputation

Layer: Satellite | Severity: Critical | Quality Dimension: Consistency

Hash diffs drive change detection. When a staging record's hash diff matches the current satellite record's hash diff, the record is treated as unchanged — no new satellite row is inserted. If the hash diff algorithm or column ordering has drifted between load cycles, the satellite either misses real changes (producing gaps in history) or records false changes (producing phantom history).

-- Expected: zero rows (all stored hash_diff values match recomputed values)
-- Repeat the COALESCE line for every descriptive column, in load order
SELECT {hash_diff}                   AS stored_hash_diff,
       {hash_function}(
           CONCAT_WS('||',
               COALESCE(CAST({descriptive_column_1} AS VARCHAR), '^^'),
               COALESCE(CAST({descriptive_column_2} AS VARCHAR), '^^')
           )
       )                             AS recomputed_hash_diff,
       hash_key,
       effective_from,
       load_datetime,
       record_source
FROM   {vault_schema}.{satellite_table}
WHERE  {hash_diff} <> {hash_function}(
           CONCAT_WS('||',
               COALESCE(CAST({descriptive_column_1} AS VARCHAR), '^^'),
               COALESCE(CAST({descriptive_column_2} AS VARCHAR), '^^')
           )
       );

Because hash diffs repeat deterministically on every load, a drift in the algorithm does not produce a one-time problem. It produces the wrong result on every subsequent load until it is caught and corrected.
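The separator-and-sentinel convention can be illustrated in a few lines of Python. The MD5 hash, the '||' separator, and the '^^' NULL sentinel mirror the template above; your load process may use different choices, and whatever they are, the test must copy them exactly:

```python
import hashlib

SEP, NULL_SENTINEL = "||", "^^"

def hash_diff(*columns) -> str:
    """Illustrative hash diff: join descriptive columns with a separator,
    substituting a sentinel for NULL (None), then hash the payload."""
    payload = SEP.join(NULL_SENTINEL if c is None else str(c) for c in columns)
    return hashlib.md5(payload.encode("utf-8")).hexdigest().upper()

# Deterministic: the same values always produce the same diff.
print(hash_diff("Alice", None) == hash_diff("Alice", None))  # True
# Column order is part of the diff: swapping columns changes it.
print(hash_diff("Alice", None) == hash_diff(None, "Alice"))  # False
# The sentinel keeps NULL distinct from empty string, so a real
# NULL-to-empty change is detected instead of masked.
print(hash_diff("Alice", None) == hash_diff("Alice", ""))    # False
```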


Test 6: No Overlapping Satellite Timelines

Layer: Satellite | Severity: Critical | Quality Dimension: Historical Integrity

For each hash key, no two satellite records should have overlapping effective date ranges. An overlap means that for some point in time, two different versions of truth coexist for the same entity. Point-in-time queries against an overlapping timeline return non-deterministic results.

-- Expected: zero rows (no overlapping date ranges per hash_key)
SELECT a.{hash_key},
       a.{effective_from} AS a_effective_from,
       a.{effective_to}   AS a_effective_to,
       b.{effective_from} AS b_effective_from,
       b.{effective_to}   AS b_effective_to
FROM   {vault_schema}.{satellite_table} a
JOIN   {vault_schema}.{satellite_table} b
    ON a.{hash_key} = b.{hash_key}
   AND a.{effective_from} < b.{effective_from}
WHERE  a.{effective_from} < COALESCE(b.{effective_to}, '9999-12-31'::TIMESTAMP)
   AND b.{effective_from} < COALESCE(a.{effective_to}, '9999-12-31'::TIMESTAMP);

Overlapping timelines typically result from failed close-out logic during a load: a new version is inserted without correctly closing the prior version. Every PIT table built from a satellite with overlapping timelines is unreliable.
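The overlap predicate reads more naturally as "each range starts before the other ends". A minimal Python sketch of that check, using the same open-ended max-date sentinel as the SQL (dates are illustrative):

```python
from datetime import datetime

MAX_DATE = datetime(9999, 12, 31)  # sentinel for an open-ended current record

def overlaps(a, b):
    """True when two (effective_from, effective_to) ranges overlap.
    None as effective_to means the record is still current."""
    a_to = a[1] or MAX_DATE
    b_to = b[1] or MAX_DATE
    return a[0] < b_to and b[0] < a_to

clean = [(datetime(2024, 1, 1), datetime(2024, 3, 1)),
         (datetime(2024, 3, 1), None)]
print(overlaps(*clean))  # False: back-to-back ranges do not overlap

bad = [(datetime(2024, 1, 1), datetime(2024, 3, 1)),
       (datetime(2024, 2, 15), None)]
print(overlaps(*bad))    # True: second version opened before the first closed
```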


Test 7: Exactly One Current Record per Hash Key

Layer: Satellite | Severity: Critical | Quality Dimension: Historical Integrity

Every hash key in a satellite must have exactly one current record: one row where effective_to is NULL or holds the max-date sentinel value.

Zero current records means an entity has no active version. More than one current record means the close-out step failed when a new version was loaded.

-- Expected: zero rows (every hash_key has exactly one current record)
-- Counting current rows with CASE (rather than filtering in WHERE) keeps
-- hash keys that have zero current records in the result set
SELECT   {hash_key},
         SUM(CASE WHEN {effective_to} IS NULL
                    OR {effective_to} = '9999-12-31'::TIMESTAMP
                  THEN 1 ELSE 0 END) AS open_ended_record_count
FROM     {vault_schema}.{satellite_table}
GROUP BY {hash_key}
HAVING   SUM(CASE WHEN {effective_to} IS NULL
                    OR {effective_to} = '9999-12-31'::TIMESTAMP
                  THEN 1 ELSE 0 END) <> 1
ORDER BY open_ended_record_count DESC;

This test is the complement to Test 6. Where Test 6 finds overlapping timelines caused by a close-out that never happened, this test finds cases where an entity has no current record at all — meaning it would be invisible to any query that filters on current state.
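A toy example of both failure modes, with illustrative hash keys and dates: counting open-ended rows per key surfaces the duplicated current record and the missing one in a single pass.

```python
from collections import Counter

# Rows for one satellite: (hash_key, effective_to).
# None marks the open, current record.
rows = [
    ("HK1", "2024-03-01"), ("HK1", None),   # healthy: exactly one current row
    ("HK2", None),         ("HK2", None),   # close-out failed: two current rows
    ("HK3", "2024-05-01"),                  # no current row at all
]

open_counts = Counter(hk for hk, eff_to in rows if eff_to is None)
all_keys = sorted({hk for hk, _ in rows})
violations = {hk: open_counts[hk] for hk in all_keys if open_counts[hk] != 1}
print(violations)  # {'HK2': 2, 'HK3': 0}
```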


Test 8: No Timeline Gaps Between Consecutive Records

Layer: Satellite | Severity: High | Quality Dimension: Historical Integrity

For each hash key, the effective_to of each record should equal the effective_from of the next record (within a configurable tolerance).

A gap is a period where no version of truth is defined for that entity. Point-in-time queries for a date within the gap return NULL — not an error, just a silently missing answer.

-- Expected: zero rows (no gaps between consecutive satellite records)
WITH satellite_timeline AS (
    SELECT {hash_key},
           {effective_from},
           {effective_to},
           LEAD({effective_from}) OVER (
               PARTITION BY {hash_key}
               ORDER BY     {effective_from}
           ) AS next_effective_from
    FROM   {vault_schema}.{satellite_table}
)
SELECT {hash_key},
       {effective_from}             AS current_record_from,
       {effective_to}               AS current_record_to,
       next_effective_from          AS next_record_from,
       EXTRACT(EPOCH FROM (
           next_effective_from - COALESCE({effective_to}, '9999-12-31'::TIMESTAMP)
       )) AS gap_seconds
FROM   satellite_timeline
WHERE  next_effective_from IS NOT NULL
   AND ABS(
           EXTRACT(EPOCH FROM (
               next_effective_from - COALESCE({effective_to}, '9999-12-31'::TIMESTAMP)
           ))
       ) > {tolerance_seconds}
ORDER BY {hash_key}, {effective_from};

Set {tolerance_seconds} to 0 if your timeline model requires exact contiguity (each record's effective_to equals the next record's effective_from). Adjust it if your convention allows a one-second or one-millisecond offset.
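If you want to prototype the gap check outside the database, the same logic fits in a few lines of Python. The single-key timeline and dates below are illustrative:

```python
from datetime import datetime

# One hash key's timeline as (effective_from, effective_to) pairs,
# sorted by effective_from; the last record is open-ended (current).
timeline = [
    (datetime(2024, 1, 1), datetime(2024, 3, 1)),
    (datetime(2024, 3, 1), datetime(2024, 6, 1)),   # contiguous: no gap
    (datetime(2024, 6, 2), None),                   # starts a day late: gap
]

def find_gaps(records, tolerance_seconds=0):
    """Return (effective_to, next_effective_from, gap_seconds) for every
    pair of consecutive records whose boundary exceeds the tolerance."""
    records = sorted(records, key=lambda r: r[0])
    gaps = []
    for (_, to), (next_frm, _) in zip(records, records[1:]):
        gap = abs((next_frm - to).total_seconds())
        if gap > tolerance_seconds:
            gaps.append((to, next_frm, gap))
    return gaps

print(find_gaps(timeline))  # one gap of 86400 seconds (June 1 to June 2)
```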


Business Vault: PIT and Bridge Integrity

PIT tables and Bridge tables are performance structures built on the raw vault. They pre-compute temporal lookups and relationship traversals so the information mart can use simple equi-joins instead of expensive temporal range queries. Their accuracy depends on the raw vault being correct — but they introduce their own failure modes that require dedicated tests.


Test 9: All PIT Hash Key References Resolve

Layer: PIT (Business Vault) | Severity: Critical | Quality Dimension: Referential Integrity

A PIT table stores, for each hub entity and snapshot date, a pointer to the satellite record that was active at that moment. If any PIT row references a hub hash key or satellite hash key that doesn't exist, that entity will return NULL in every downstream query that uses the PIT for satellite access.

-- Expected: zero rows (no orphaned references in the PIT table)
SELECT pit.{hub_hash_key},
       CASE WHEN hub.{hub_hash_key} IS NULL
            THEN 'ORPHAN_HUB_HK'
            ELSE NULL
       END AS hub_integrity_issue,
       CASE WHEN sat1.{hash_key} IS NULL
            THEN 'ORPHAN_SAT1_HK'
            ELSE NULL
       END AS sat1_integrity_issue
FROM   {vault_schema}.{pit_table} AS pit
LEFT JOIN {vault_schema}.{hub_table} AS hub
    ON pit.{hub_hash_key} = hub.{hub_hash_key}
LEFT JOIN {vault_schema}.{satellite_table_1} AS sat1
    ON pit.{sat1_hash_key} = sat1.{hash_key}
WHERE  hub.{hub_hash_key} IS NULL
   OR  sat1.{hash_key} IS NULL;

Extend the LEFT JOIN block for each additional satellite referenced by the PIT. A PIT table with orphaned references is not just incomplete — it is actively misleading, because queries against it will silently exclude the affected entities rather than raising an error.


Test 10: All Bridge Hash Keys Resolve

Layer: Bridge (Business Vault) | Severity: Critical | Quality Dimension: Referential Integrity

A Bridge table materializes hub-to-hub paths across links, giving the information mart a denormalized join surface. If any bridge row references a hub or link hash key that doesn't exist in its parent table, that path is silently excluded from every downstream fact query that uses the bridge. The bridge appears to work — it just quietly omits data.

-- Expected: zero rows (no orphaned hash key references in the bridge)
SELECT brg.{hub1_hash_key},
       brg.{link_hash_key},
       brg.{hub2_hash_key},
       CASE WHEN hub1.{hub1_hash_key} IS NULL
            THEN 'ORPHAN_HUB1_HK'
            ELSE NULL
       END AS hub1_integrity_issue,
       CASE WHEN lnk.{link_hash_key} IS NULL
            THEN 'ORPHAN_LINK_HK'
            ELSE NULL
       END AS link_integrity_issue,
       CASE WHEN hub2.{hub2_hash_key} IS NULL
            THEN 'ORPHAN_HUB2_HK'
            ELSE NULL
       END AS hub2_integrity_issue
FROM   {vault_schema}.{bridge_table} AS brg
LEFT JOIN {vault_schema}.{hub1_table} AS hub1
    ON brg.{hub1_hash_key} = hub1.{hub1_hash_key}
LEFT JOIN {vault_schema}.{link_table} AS lnk
    ON brg.{link_hash_key} = lnk.{link_hash_key}
LEFT JOIN {vault_schema}.{hub2_table} AS hub2
    ON brg.{hub2_hash_key} = hub2.{hub2_hash_key}
WHERE  hub1.{hub1_hash_key} IS NULL
   OR  lnk.{link_hash_key} IS NULL
   OR  hub2.{hub2_hash_key} IS NULL;

Key Takeaways

  • Test at every layer, not just the output. Staging tests catch problems before they enter the insert-only raw vault. Hub tests catch structural defects before they cascade into Links and Satellites. Problems found early are cheap. Problems found in production are not.

  • Hub hash integrity is load-order sensitive. Links must load after hubs complete. Concurrent loading causes referential integrity failures that appear intermittent but are architecturally predictable. Test 4 will surface them on every affected load.

  • The satellite temporal integrity suite is four tests, not one. Overlapping timelines (Test 6), timeline gaps (Test 8), multiple current records (Test 7), and hash diff recomputation (Test 5) each catch a distinct failure mode. Missing any one leaves a specific class of data corruption undetected.

  • Hash diff logic must match exactly between load and test.  Column order, NULL handling (COALESCE sentinel value), and concatenation separator must be identical between the load process and the recomputation test. Any divergence produces false positives on every row.

  • PIT and Bridge failures are silent. Unlike upstream failures that may produce observable errors, referential integrity violations in PIT and Bridge structures show up as missing data in reports — with no error message, no failed job, and no immediate indication of where to look.

  • Six of these ten tests are production-blocking at Critical severity. A failing Critical test should halt downstream loads until the root cause is found and corrected. Running past a Critical failure means loading known-bad data into an insert-only system.
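One way to enforce that last point is a severity gate between the test run and the next load step. The sketch below is illustrative: the test names, severities, and result format are assumptions for the example, and in practice each result would come from checking that one of the SQL tests above returned zero rows.

```python
CRITICAL = "Critical"

def run_gate(test_results):
    """test_results: list of (test_name, severity, passed) tuples.
    Raises on any failing Critical test so downstream loads never run;
    returns the remaining (non-blocking) failures for logging."""
    failures = [(name, sev) for name, sev, ok in test_results if not ok]
    critical = [name for name, sev in failures if sev == CRITICAL]
    if critical:
        raise RuntimeError(f"Halting load: Critical test(s) failed: {critical}")
    return failures

results = [
    ("staging_row_count", "Critical", True),
    ("satellite_timeline_gaps", "High", False),
]
print(run_gate(results))  # the High failure is reported, not blocking
```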


78 Tests. Every Layer. Ready to Implement.

These 10 tests are drawn from a complete Data Vault 2.0 Test Plan covering 78 tests across every architectural layer — transient staging, raw vault (hubs, links, satellites), business vault (PITs, bridges, business satellites), and presentation layer (facts and dimensions).

Each test in the full plan includes a SQL template, a quality dimension classification, a severity rating, scan type guidance (delta vs. full), and notes on when to run it in your load cycle.

Download the Data Vault 2.0 Test Plan to get the complete catalog — all 78 tests, organized layer by layer, with SQL templates you can adapt to your environment and implement incrementally, starting with the Critical-severity tests at the layers where you carry the most risk.

If you're looking to automate tests like these across every Hub, Link, and Satellite in your vault without writing and maintaining custom SQL for each object, Validatar's metadata-driven testing platform generates and executes them automatically based on your vault's structure — running the full suite after every load, at any scale.

Sam leads business development at Validatar, bringing decades of experience in data management, data quality, and sales.