Skip to main content

Surrogate Key

A surrogate key is a system-generated identifier with no business meaning, used as the primary key of a database table. It is also called a synthetic key or meaningless key. Examples: auto-incrementing integers (BIGSERIAL), UUIDs.

Surrogate keys contrast with natural keys — identifiers that come from the business domain and carry meaning (e.g., customer_email, product_sku, tax_id).


Surrogate Key vs Natural Key

Surrogate KeyNatural Key
SourceSystem-generated (SERIAL, UUID)Business domain value
StabilityImmutable — never changesCan change (email updates, sku reuse)
MeaningNo business meaningEncodes business data
PrivacyNo data exposed in URLs/APIsExposes business data
UniquenessGuaranteed by systemRequires validation and deduplication
FK join performanceBIGINT is fast; UUID is slowerDepends on data type and uniqueness
SCD supportMultiple rows per entity — surrogate key distinguishes themNatural key cannot support SCD history rows

Why Surrogate Keys Are Essential in Data Warehouses

In dimensional modeling (Ralph Kimball methodology), dimension tables always use surrogate keys as their primary key. The natural key (source system business identifier) is retained as a separate column but is not the PK.

Three reasons Kimball mandates surrogate keys for dimensions:

  1. Slowly Changing Dimensions (SCD Type 2): When a customer's address changes, SCD Type 2 inserts a new row with a new surrogate key. The fact table FK points to the specific version of the customer as of the transaction date. If the natural key (customer_id) were the PK, there could only be one row per customer — no history.

  2. Source system independence: Source system keys change. Companies merge, primary keys get reused after deletions, or you're loading from multiple source systems with overlapping key spaces. Surrogate keys shield the warehouse from all of this.

  3. NULL handling: Source systems sometimes send NULL or unknown values for foreign keys. A surrogate key approach handles this with a special "unknown" dimension row (surrogate key = -1 or 0) — impossible with natural keys.


Surrogate Key Implementation

PostgreSQL (integer):

CREATE TABLE dim_customer (
customer_key BIGSERIAL PRIMARY KEY, -- surrogate key
customer_id BIGINT NOT NULL, -- natural key from source
email TEXT NOT NULL,
name TEXT NOT NULL,
valid_from DATE NOT NULL,
valid_to DATE,
is_current BOOLEAN NOT NULL DEFAULT TRUE
);

PostgreSQL (UUID):

CREATE TABLE dim_product (
product_key UUID DEFAULT gen_random_uuid() PRIMARY KEY, -- surrogate key
product_id TEXT NOT NULL, -- natural key (e.g., SKU from ERP)
name TEXT NOT NULL,
category TEXT NOT NULL
);

When to use integer vs UUID:

  • BIGSERIAL (integer): faster joins, smaller storage, simpler debugging. Best for internal warehouse tables.
  • UUID: better for distributed systems, external API exposure, merging data across environments.

TalkingSchema: Automatic Surrogate Key Generation

When you ask TalkingSchema to design a star schema or data warehouse model, surrogate keys are generated automatically for every dimension table:

Example prompt:

"Create a star schema for our e-commerce analytics. Include dimensions for customers, products, and dates, with SCD Type 2 for customers."

TalkingSchema generates dim_customer.customer_key BIGSERIAL PRIMARY KEY alongside customer_id BIGINT (natural key), valid_from DATE, valid_to DATE, and is_current BOOLEAN — the full SCD Type 2 pattern.


Frequently Asked Questions

Do transactional (OLTP) tables need surrogate keys?

In OLTP schemas, using a surrogate integer or UUID primary key is also recommended, but for different reasons: privacy (avoid exposing business data in URLs), stability (natural keys like email change), and consistency. The standard practice for OLTP is: surrogate PK + unique constraint on the natural key.

What is the "unknown" surrogate key?

A convention from Kimball dimensional modeling: reserve surrogate key value -1 (or 0) in every dimension table to represent "unknown" or "not applicable." When a fact row has a NULL or unresolvable FK, it points to the -1 row instead of NULL. This allows aggregations to work correctly without NULL handling in every query.

Does TalkingSchema use BIGSERIAL or UUID for surrogate keys?

TalkingSchema defaults to BIGSERIAL for dimension table surrogate keys (integer, auto-increment) following standard Kimball practice. You can explicitly request UUID surrogate keys with a prompt: "Use UUID primary keys for all dimension tables."