Technical Glossary

Definitions of the technical terminology used throughout our website and the DarkMath Process.

Semantic Gravity

Definition

Semantic Gravity is the force that attracts related data points in high-dimensional vector space. In DarkMath's architecture, data points with strong contextual, behavioral, or semantic affinities naturally cluster together, creating "gravity wells" around true identities.

Extended Explanation

The concept borrows from physics: just as massive objects attract smaller ones through gravitational force, high-confidence identity records attract fragmentary data through semantic affinity. When a new data fragment enters the system (a transaction record, a browsing session, a device fingerprint), Semantic Gravity pulls it toward the existing customer profile with the highest vector similarity.

This process is dynamic and self-correcting. Each additional data point strengthens the gravitational pull of its associated Golden Record. Uncertainty (entropy) collapses as evidence accumulates. The more data ingested, the more precisely the system identifies true boundaries between distinct identities—even when those identities share names, addresses, or other surface-level attributes.

Example: A transaction for a "luxury SUV" and browsing history for "wealth management services" might orbit the same "High-Net-Worth Individual" profile, even if the names on the records differ slightly. The semantic signals create gravitational attraction that string matching cannot detect.
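The attraction step above can be sketched in a few lines. This is a minimal illustration, not DarkMath's implementation: the profile names, the 4-dimensional vectors, and the `attract` function are all hypothetical stand-ins (real embeddings have hundreds of dimensions).

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Hypothetical profile embeddings (illustrative coordinates only).
profiles = {
    "high_net_worth": [0.9, 0.8, 0.1, 0.0],
    "budget_shopper": [0.1, 0.0, 0.9, 0.8],
}

def attract(fragment, profiles):
    """Pull a new fragment toward the profile with the highest vector similarity."""
    return max(profiles, key=lambda name: cosine(fragment, profiles[name]))

# A "luxury SUV" transaction embedded near the wealth-related axes:
fragment = [0.8, 0.7, 0.2, 0.1]
print(attract(fragment, profiles))  # -> high_net_worth
```

In a production system the comparison would run against millions of profiles via an ANN index rather than a Python `max` over a dictionary.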

Related Terms: Golden Record, Vector Embedding, Entity Resolution, Latent Space

Golden Record

Definition

A Golden Record is the single, authoritative, unified profile for a resolved identity. It consolidates all touchpoints, interactions, and attributes from fragmented source records into one comprehensive view with confidence scores and full audit trail.

Extended Explanation

Before identity resolution, a single customer might appear as five separate records: one from email signup, one from mobile app, one from loyalty program, one from call center, and one from in-store purchase. Each fragment has partial information. The Golden Record unifies these fragments into a single view that includes every consolidated attribute, the complete interaction history, per-field confidence scores, and an audit trail back to the source records.

The term "Golden" refers to this record's status as the trusted source of truth—the definitive answer to "who is this customer?" when source records conflict. DarkMath constructs Golden Records through Semantic Gravity, attracting related fragments into unified identity clusters.
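A toy consolidation step can make this concrete. The fragments, field names, and source-precedence rule below are all hypothetical; real merge logic would use per-field survivorship rules and confidence scores.

```python
# Hypothetical source fragments for one resolved identity.
fragments = [
    {"source": "email_signup", "name": "J. Smith",   "email": "js@example.com"},
    {"source": "loyalty",      "name": "Jane Smith", "phone": "555-0100"},
    {"source": "call_center",  "name": "Jane Smith", "address": "1 Main St"},
]

def build_golden_record(fragments):
    """Merge fragments into one record; later sources win on conflicting fields."""
    golden = {"sources": []}
    for frag in fragments:
        golden["sources"].append(frag["source"])  # keep an audit trail of inputs
        for field, value in frag.items():
            if field != "source":
                golden[field] = value  # later (more trusted) fragments overwrite
    return golden

record = build_golden_record(fragments)
print(record["name"], record["email"], record["phone"])
```

Note how the unified record carries fields no single fragment had on its own: the email came from signup, the phone from loyalty, the address from the call center.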

Related Terms: Semantic Gravity, Entity Resolution, Customer 360, Record Linkage

Vector Embedding

Definition

A vector embedding is a dense numerical representation of data (text, images, user behavior) that captures semantic meaning in high-dimensional space. In this space, distance between vectors corresponds to semantic similarity—related concepts cluster together.

Extended Explanation

Traditional databases represent data as strings and numbers in fixed schemas. Vector embeddings transform this structured data into continuous mathematical representations, typically arrays of 100-1000+ floating-point numbers. These vectors are generated by neural networks trained to position semantically similar items close together.

Example: In a well-trained embedding space, the vectors for "New York" and "Manhattan" would be closer together than "New York" and "Tokyo," even though all three are city names. The embedding captures that Manhattan is part of New York, a relationship invisible to string matching.

DarkMath uses embeddings to represent customer data points. A transaction record becomes a vector. A browsing session becomes a vector. A device fingerprint becomes a vector. Identity resolution then becomes a geometric problem: which vectors cluster together? Semantic Gravity attracts related vectors into Golden Record clusters.
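The "geometric problem" framing can be demonstrated with toy coordinates. The 3-dimensional vectors below are purely illustrative stand-ins for learned embeddings; only the relative distances matter.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Toy vectors standing in for learned embeddings (coordinates are invented;
# a real model would emit hundreds of dimensions).
embeddings = {
    "New York":  [0.90, 0.80, 0.10],
    "Manhattan": [0.85, 0.75, 0.15],
    "Tokyo":     [0.10, 0.90, 0.80],
}

d_related = dist(embeddings["New York"], embeddings["Manhattan"])
d_unrelated = dist(embeddings["New York"], embeddings["Tokyo"])
print(d_related < d_unrelated)  # the related pair sits closer in the space
```

String matching would score "New York"/"Manhattan" as entirely dissimilar; in embedding space their relationship is just a short distance.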

Related Terms: Semantic Gravity, Latent Space, HNSW Indexing, Approximate Nearest Neighbor

HNSW Indexing

Definition

Hierarchical Navigable Small World (HNSW) is a graph-based algorithm for approximate nearest neighbor search in high-dimensional vector spaces. It enables efficient similarity search at web scale, finding the most similar vectors among billions in milliseconds.

Extended Explanation

Finding the k nearest neighbors to a query vector is computationally expensive. Brute-force comparison against 100 million vectors is impractical for real-time applications. HNSW solves this by constructing a multi-layered navigational graph:

- Upper layers are sparse, holding long-range links between a small sample of vectors.
- Lower layers grow progressively denser, down to a base layer containing every vector.
- A search enters at the top, greedily hops toward the query, then descends layer by layer, refining its candidate neighbors at each level.

DarkMath's vector database infrastructure uses HNSW indexing to achieve approximately 2.5k queries per second (QPS) per node. Combined with horizontal sharding across a distributed cluster, the system handles web-scale datasets—hundreds of millions to billions of vectors—with single-digit millisecond latency.
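The core navigation primitive is greedy graph search, sketched below on a single hand-built layer. This is a teaching toy, not HNSW itself: real HNSW stacks several such layers and uses a beam (ef) of candidates rather than a single hop, and the graph and coordinates here are invented.

```python
from math import dist

# Toy single-layer proximity graph (illustrative coordinates and edges).
vectors = {
    "a": [0.0, 0.0], "b": [1.0, 0.0], "c": [2.0, 0.0],
    "d": [2.0, 1.0], "e": [3.0, 1.0],
}
neighbors = {  # adjacency list of the navigable graph
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
    "d": ["c", "e"], "e": ["d"],
}

def greedy_search(query, entry):
    """Hop to whichever neighbor is closer to the query until none improves."""
    current = entry
    while True:
        best = min(neighbors[current] + [current],
                   key=lambda n: dist(vectors[n], query))
        if best == current:
            return current
        current = best

print(greedy_search([2.9, 0.9], "a"))  # -> e
```

Each hop only examines a handful of neighbors, which is why the full hierarchy reaches an answer in logarithmic rather than linear time.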

Related Terms: Vector Embedding, Approximate Nearest Neighbor, LANNS, Sharding

Entity Resolution

Definition

Entity resolution is the process of determining whether multiple data records refer to the same real-world entity (person, household, business, or object). Also called record linkage, data matching, or deduplication, it's fundamental to creating unified customer views.

Extended Explanation

Three approaches exist:

- Deterministic (rule-based): records match only when hard keys (exact email, ID, or name values) agree. Fast and precise, but brittle against typos and formatting variation.
- Probabilistic: field agreements and disagreements are weighted statistically, as in the Fellegi-Sunter model, to score how likely two records are to refer to the same entity.
- Semantic (vector-based): records are embedded as vectors and matched by proximity in latent space; this is DarkMath's approach, via Semantic Gravity.
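The difference between the rule-based and fuzzy styles shows up even in a tiny example. The records and the 0.8 threshold below are illustrative, and `SequenceMatcher` stands in for the much richer similarity functions a real matcher would use.

```python
from difflib import SequenceMatcher

rec_a = {"name": "Jon Smith",  "email": "jon.smith@example.com"}
rec_b = {"name": "John Smith", "email": "jon.smith@example.com"}
rec_c = {"name": "Mary Jones", "email": "mary@example.com"}

def deterministic_match(x, y):
    """Rule-based: same entity only if a hard key (here, email) agrees exactly."""
    return x["email"] == y["email"]

def probabilistic_score(x, y):
    """Fuzzy: a similarity score in [0, 1] that tolerates typos and variants."""
    return SequenceMatcher(None, x["name"], y["name"]).ratio()

print(deterministic_match(rec_a, rec_b))        # exact email key agrees
print(probabilistic_score(rec_a, rec_b) > 0.8)  # "Jon" vs "John" still scores high
print(probabilistic_score(rec_a, rec_c) > 0.8)  # unrelated names score low
```

A deterministic rule on name alone would have split rec_a and rec_b into two identities; the fuzzy score keeps them together.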
Related Terms: Golden Record, Record Linkage, Fellegi-Sunter Model, Semantic Gravity

Fellegi-Sunter Model

Definition

The Fellegi-Sunter model is the mathematical foundation for probabilistic record linkage, published in 1969. It calculates match weights for record pairs based on the likelihood ratios of field agreements and disagreements, enabling statistically principled identity resolution.

Extended Explanation

The model classifies record pairs into three categories: matches (M), non-matches (U), and possible matches requiring manual review. For each field comparison (name, address, date of birth), the model calculates:

- the m-probability: the probability that the field agrees given the records are a true match;
- the u-probability: the probability that the field agrees given the records are not a match (i.e., agreement by chance);
- a match weight, the log likelihood ratio (log2(m/u) on agreement, log2((1-m)/(1-u)) on disagreement), summed across fields into an overall match score.
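The weight computation is short enough to show directly. The m- and u-values below are invented for illustration; in practice they are estimated from data, for example via expectation maximization.

```python
from math import log2

# Illustrative per-field parameters (would be estimated from data):
#   m = P(field agrees | records truly match)
#   u = P(field agrees | records do not match)
params = {
    "name":          {"m": 0.95, "u": 0.010},
    "date_of_birth": {"m": 0.98, "u": 0.003},
    "city":          {"m": 0.90, "u": 0.200},
}

def match_weight(agreements):
    """Sum per-field log-likelihood-ratio weights for an observed comparison."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = params[field]["m"], params[field]["u"]
        if agrees:
            total += log2(m / u)               # agreement: evidence for a match
        else:
            total += log2((1 - m) / (1 - u))   # disagreement: evidence against
    return total

strong = match_weight({"name": True, "date_of_birth": True, "city": True})
weak = match_weight({"name": False, "date_of_birth": False, "city": True})
print(strong > 0 > weak)
```

Pairs above an upper threshold are declared matches, below a lower threshold non-matches, and the band in between goes to manual review.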
Related Terms: Entity Resolution, Splink, Probabilistic Matching, Expectation Maximization

Shannon Entropy

Definition

Shannon Entropy is an information theory metric quantifying uncertainty in a probability distribution. In identity resolution, it measures how uncertain we are about the correct grouping of records; higher entropy means more ambiguity about which records belong together.

Extended Explanation

Mathematically, Shannon Entropy H(X) = -Σ P(x) log P(x), where P(x) is the probability of each possible outcome. Maximum entropy occurs when all outcomes are equally likely (complete uncertainty). Zero entropy means one outcome is certain.
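The formula translates to a one-liner, shown here over a hypothetical distribution across four candidate groupings of a record:

```python
from math import log2

def shannon_entropy(probs):
    """H(X) = -sum(p * log2(p)); zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Four equally likely groupings: maximum uncertainty for four outcomes (2 bits).
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # -> 2.0
# Evidence concentrates probability on one grouping: entropy collapses toward 0.
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))
```

Each matching decision reshapes this distribution; the drop in H is the amount of uncertainty the decision removed.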

DarkMath's uncertainty reduction framework treats entity resolution as an entropy minimization problem. The system maintains a probability distribution over all possible partitions of records. Each matching decision reduces entropy—collapsing uncertainty about which records belong to which identities.

When using LLMs to resolve ambiguous cases, DarkMath selects questions that maximize expected entropy reduction per token cost. This "Most Valuable Questions" (MVQ) algorithm ensures expensive API calls are spent on the decisions that most efficiently collapse uncertainty.

Related Terms: Uncertainty Reduction Framework, BoostER, Active Learning, Information Theory

Hamiltonian Neural Networks

Definition

Hamiltonian Neural Networks (HNNs) are physics-informed neural networks that learn the energy function (Hamiltonian) governing a dynamical system. Unlike standard networks, HNNs enforce energy conservation, enabling accurate long-term predictions without drift.

Extended Explanation

In classical mechanics, the Hamiltonian H(q,p) represents total system energy as a function of position (q) and momentum (p). Hamilton's equations describe how the system evolves: dq/dt = ∂H/∂p and dp/dt = -∂H/∂q. Standard neural networks that directly predict future states can violate energy conservation over time—accumulated errors cause predictions to spiral away from physical reality.

HNNs instead parameterize the Hamiltonian itself. They output a scalar "energy" value; the system dynamics emerge from automatic differentiation of this energy function. This architecture enforces symplectic structure—a geometric property guaranteeing that phase space volume (and thus energy) is conserved.
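The value of respecting symplectic structure can be seen with a known analytic Hamiltonian rather than a learned one (an HNN would replace the formula for H with a neural network). The sketch below integrates a harmonic oscillator, H(q, p) = (q² + p²)/2, two ways; step size and step count are arbitrary choices.

```python
# dq/dt = dH/dp = p and dp/dt = -dH/dq = -q for H(q, p) = (q^2 + p^2) / 2.

def energy(q, p):
    return 0.5 * (q * q + p * p)

def explicit_euler(q, p, h):
    """Naive step: ignores symplectic structure, so energy drifts upward."""
    return q + h * p, p - h * q

def symplectic_euler(q, p, h):
    """Symplectic step: update momentum first, then position with the new momentum."""
    p_new = p - h * q
    return q + h * p_new, p_new

h, steps = 0.1, 1000
qe, pe = 1.0, 0.0  # explicit-Euler trajectory
qs, ps = 1.0, 0.0  # symplectic-Euler trajectory
for _ in range(steps):
    qe, pe = explicit_euler(qe, pe, h)
    qs, ps = symplectic_euler(qs, ps, h)

print(energy(qe, pe))  # drifts far above the initial energy of 0.5
print(energy(qs, ps))  # stays close to 0.5
```

The naive integrator's energy grows without bound, which is exactly the "spiraling away from physical reality" the section describes; the symplectic one stays on a bounded orbit indefinitely.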

DarkWatch applies HNNs to fraud detection. User behavior is modeled as a dynamical system with conserved "energy." Normal behavior follows Hamiltonian dynamics; fraud violates energy conservation. This allows detection of anomalies that don't match known fraud patterns—anything that violates behavioral physics is flagged.

Related Terms: DarkWatch, Symplectic Structure, Anomaly Detection, Physics-Informed ML

Latent Space

Definition

Latent space is the high-dimensional vector space where embeddings reside. "Latent" means hidden: these dimensions don't correspond to explicit features but capture abstract patterns learned by neural networks. Proximity in latent space indicates semantic similarity.

Extended Explanation

When a neural network encodes data into vectors, it creates a latent space where each dimension captures some learned pattern. Unlike explicit features ("age", "income"), latent dimensions are emergent—Dimension 47 might correlate with "luxury purchase intent" without being explicitly programmed to.

DarkMirror performs Latent Space Arithmetic for audience expansion. The centroid of your best customers' vectors represents their average position in latent space—their collective "psychographic DNA." Finding other vectors near this centroid identifies lookalike audiences who share behavioral patterns, even if their demographics differ.
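The centroid-and-rank step is simple geometry, sketched here with invented 2-dimensional seed and candidate vectors (real embeddings would have hundreds of dimensions, and ranking would use an ANN index, not a full sort):

```python
from math import dist

# Hypothetical embeddings of your best customers (the seed audience).
seeds = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7]]

# Centroid = component-wise mean: the seed audience's average position.
centroid = [sum(v[i] for v in seeds) / len(seeds) for i in range(len(seeds[0]))]

# Candidate prospects, ranked by distance to the centroid (closer = more alike).
candidates = {"p1": [0.85, 0.8], "p2": [0.1, 0.2], "p3": [0.7, 0.9]}
lookalikes = sorted(candidates, key=lambda c: dist(candidates[c], centroid))
print(lookalikes)  # nearest candidates first
```

Candidates near the centroid share the seeds' behavioral position in the space even if no explicit demographic field matches.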

Related Terms: Vector Embedding, DarkMirror, Centroid, Lookalike Modeling

Approximate Nearest Neighbor (ANN)

Definition

Approximate Nearest Neighbor search finds vectors most similar to a query vector without exhaustively comparing every candidate. By accepting slight approximation, ANN algorithms achieve orders-of-magnitude speedup over exact search, essential for real-time applications at scale.

Extended Explanation

Exact nearest neighbor search requires O(n) comparisons—checking every vector in the database. For 100 million vectors, this is too slow for real-time use. ANN algorithms trade perfect accuracy for speed:

- Tree-based methods (e.g., KD-trees, Annoy) recursively partition the space.
- Hashing methods (locality-sensitive hashing) bucket similar vectors together.
- Quantization methods (e.g., product quantization) compress vectors for fast approximate distance computation.
- Graph-based methods (e.g., HNSW) navigate a proximity graph toward the query.

DarkMath uses HNSW, currently the state-of-the-art for high-dimensional ANN. Combined with a two-level partitioning strategy (sharding + segmentation), the system maintains high recall (typically >95%) with millisecond latency on web-scale datasets.
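Recall is the standard way to quantify the accuracy side of this trade-off. The sketch below compares exact search against a deliberately crude "approximate" search that only probes a random subset; real ANN indexes choose candidates far more cleverly, so this is a measurement illustration, not an ANN algorithm.

```python
import random
from math import dist

random.seed(0)
points = [[random.random(), random.random()] for _ in range(1000)]
query = [0.5, 0.5]
k = 10

def knn_exact(q, pts, k):
    """Exact search: O(n) distance computations over every vector."""
    return sorted(range(len(pts)), key=lambda i: dist(pts[i], q))[:k]

def knn_approx(q, pts, k, probe=200):
    """Toy approximation: only examine a random subset of candidates."""
    subset = random.sample(range(len(pts)), probe)
    return sorted(subset, key=lambda i: dist(pts[i], q))[:k]

exact = set(knn_exact(query, points, k))
approx = set(knn_approx(query, points, k))
recall = len(exact & approx) / k  # fraction of true neighbors recovered
print(f"recall@{k} = {recall:.2f}")
```

Production systems tune index parameters (for HNSW, the ef search width) until measured recall clears a target such as the >95% figure above.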

Related Terms: HNSW Indexing, Vector Embedding, LANNS, Sharding

Additional Glossary Terms