The Incompressible Lie: Kolmogorov Complexity and Narrative Compressibility as Deception Metrics
Truth has structure. A genuine account of events, told and retold across different contexts and time periods, contains massive internal redundancy because every version is generated from the same underlying reality. This redundancy makes truthful narratives compressible — a compression algorithm finds repeated patterns, predictable sequences, and structural regularity that it can exploit to produce a compact representation. Fabrication lacks this structure. Each contradiction introduces irreducible information; each improvised exception adds complexity that cannot be predicted from prior statements. Kolmogorov complexity — the theoretical measure of a string's intrinsic information content — provides the mathematical foundation for exploiting this asymmetry.
The Mathematical Foundation
Kolmogorov complexity is the length of the shortest program that can reproduce a given output on a universal Turing machine. A string with simple structure — "ABABAB..." repeated a thousand times — has low Kolmogorov complexity because a short program generates it. A random string has maximal Kolmogorov complexity because no program shorter than the string itself can reproduce it. The concept captures an absolute, observer-independent measure of the structural regularity in data.
A foundational result in theoretical computer science is that Kolmogorov complexity is uncomputable — there is no algorithm that can determine the shortest possible program for an arbitrary string. However, practical approximation is straightforward: any specific compression algorithm provides an upper bound. If gzip compresses a file to 30% of its original size, the Kolmogorov complexity is at most 30% of the file length, plus a small additive constant for the decompression program. Better compressors give tighter bounds. For comparative analysis — asking whether corpus A is more complex than corpus B — the absolute value matters less than the relative ranking, and standard compressors produce reliable relative rankings.
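As a minimal sketch of the upper-bound idea, any off-the-shelf compressor will do; here Python's zlib stands in for gzip, and the two inputs are toy examples chosen to sit at the extremes:

```python
import os
import zlib

# A highly regular string: "AB" repeated 1,000 times (2,000 bytes).
regular = b"AB" * 1000
# Effectively incompressible data of the same length, for contrast.
random_data = os.urandom(2000)

# Each compressed length is an upper bound on the Kolmogorov
# complexity of the input (up to the decompressor's constant size).
for label, data in [("regular", regular), ("random", random_data)]:
    compressed = zlib.compress(data, 9)
    print(f"{label}: {len(compressed)}/{len(data)} bytes "
          f"({len(compressed) / len(data):.0%})")
```

The regular string collapses to a few dozen bytes; the random data compresses to roughly its original size, because zlib falls back to storing it nearly verbatim.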
Application to Communication Analysis
Honest narratives have low Kolmogorov complexity relative to their length. They are compressible because they are consistent. The same core facts, expressed from different angles and at different times, produce enormous redundancy that compression algorithms exploit. A truthful account told ten different ways compresses to approximately one account plus minor surface variations — the deep structure is the same across all versions, and the compressor captures this.
Deceptive narratives have high Kolmogorov complexity relative to their length. Each contradiction introduces information that cannot be predicted from prior statements. Each exception to the general pattern adds irreducible complexity. A fabricated account told ten different ways compresses poorly because the variations are not noise around a signal — they are independent inventions that share surface features but lack structural coherence. The compressor finds less to exploit because less genuine structure exists.
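The asymmetry can be illustrated with toy corpora — the sentences below are invented stand-ins, not real communications. Ten retellings generated from one underlying account compress far better than ten surface-similar versions whose details are independently improvised:

```python
import random
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size over original size: lower means more structure."""
    data = text.encode()
    return len(zlib.compress(data, 9)) / len(data)

# Toy stand-in for a truthful account retold ten times: the same
# underlying facts with only trivial surface variation.
truthful = " ".join(
    f"Version {i}: the invoice was sent on 3 March and paid on 10 March."
    for i in range(10)
)

# Toy stand-in for ten improvised inventions: the same sentence
# template, but the details are drawn independently each time.
random.seed(42)
months = ["January", "March", "June", "August", "October"]
fabricated = " ".join(
    f"Version {i}: the invoice was sent on {random.randint(1, 28)} "
    f"{random.choice(months)} and paid after {random.randint(2, 90)} days."
    for i in range(10)
)

print(compression_ratio(truthful))    # low: the retellings are redundant
print(compression_ratio(fabricated))  # higher: the details never repeat
```

The gap survives even though both corpora share a template — the compressor exploits the template either way, but only the truthful corpus lets it exploit the details as well.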
The Normalised Compression Distance
The normalised compression distance (NCD) between two texts provides a metric for comparing structural similarity that requires no semantic understanding, no embedding model, and no training data. The NCD between a party's statements at time T₁ and their statements at time T₂ measures the structural dissimilarity of the narratives across those periods. Consistent communicators produce low NCD between temporal segments — their narratives at different times have the same structure. Inconsistent communicators produce high NCD — the structure changes across periods because the underlying generator has changed.
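A minimal implementation, using zlib as the compressor C. The standard definition is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)); the example sentences below are invented:

```python
import zlib

def C(data: bytes) -> int:
    """Compressed length: a computable upper bound on complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Near 0 for structurally similar texts; near 1 for unrelated ones."""
    bx, by = x.encode(), y.encode()
    cx, cy = C(bx), C(by)
    return (C(bx + by) - min(cx, cy)) / max(cx, cy)

same = "The shipment left the depot on Monday and arrived on Wednesday."
restated = "On Monday the shipment left the depot; it arrived on Wednesday."
unrelated = "Quarterly turbine maintenance logs show vibration anomalies."

print(ncd(same, restated))   # lower: shared structure
print(ncd(same, unrelated))  # higher: little shared structure
```

Note that with real compressors the value can drift slightly above 1; what matters for this analysis is the ordering, not the absolute figure.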
Applied systematically across all pairs of temporal segments, NCD produces a full dissimilarity matrix that reveals the evolution of narrative structure over time. Clusters of similar segments indicate stable narrative periods. Sudden jumps in NCD between adjacent segments pinpoint structural shifts — moments when the narrative was fundamentally reorganised. These shifts correlate with identifiable events: the receipt of legal advice, the escalation of a dispute, the introduction of a new false claim.
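A sketch of the matrix construction, with zlib again serving as the compressor and three invented quarterly segments standing in for a real corpus — the first two share one narrative structure, the third marks a reorganisation:

```python
import zlib

def ncd(x: str, y: str) -> float:
    # NCD as defined in this section, with zlib as the compressor.
    bx, by = x.encode(), y.encode()
    c = lambda d: len(zlib.compress(d, 9))
    cx, cy = c(bx), c(by)
    return (c(bx + by) - min(cx, cy)) / max(cx, cy)

# Hypothetical temporal segments of one party's communications.
segments = [
    "The invoice was issued in March and settled within thirty days. " * 20,
    "The invoice was issued in March and settled within thirty days, "
    "as usual. " * 20,
    "Delivery failures forced renegotiation of every outstanding order. " * 20,
]

# Full pairwise dissimilarity matrix.
matrix = [[ncd(a, b) for b in segments] for a in segments]
for row in matrix:
    print(" ".join(f"{v:.2f}" for v in row))

# Jumps between adjacent segments flag structural reorganisation.
jumps = [ncd(segments[i], segments[i + 1]) for i in range(len(segments) - 1)]
print("adjacent-segment NCD:", [f"{j:.2f}" for j in jumps])
```

The first adjacent pair produces a low NCD and the second a markedly higher one — the kind of jump that, in a real corpus, would prompt a closer look at what happened between those periods.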
Segment-Level Analysis
Beyond corpus-wide compression, segment-level analysis reveals local complexity patterns. Divide a party's communications into temporal or thematic segments and compute the compression ratio for each independently. Truthful communicators show uniform compression ratios across segments — their structural consistency is maintained everywhere. Fabricators show variable compression ratios: low complexity in areas where they are telling the truth, high complexity in areas where they are improvising.
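A sketch of the per-segment version, with invented quarterly labels and toy text; a random-character segment stands in for an improvised, high-complexity stretch, and the spread of the ratios is the variability signature at issue:

```python
import random
import statistics
import zlib

def ratio(text: str) -> float:
    """Compressed size over original size: lower means more structure."""
    data = text.encode()
    return len(zlib.compress(data, 9)) / len(data)

random.seed(7)
# Hypothetical temporal segments; Q3 stands in for improvised material.
segments = {
    "2023-Q1": "The contract was signed in January as agreed. " * 30,
    "2023-Q2": "Deliveries followed the agreed schedule exactly. " * 30,
    "2023-Q3": "".join(random.choice("abcdefghijklmnopqrstuvwxyz ")
                       for _ in range(1200)),
    "2023-Q4": "Payment terms were honoured as invoiced. " * 30,
}

ratios = {label: ratio(text) for label, text in segments.items()}
for label in sorted(ratios):
    print(f"{label}: {ratios[label]:.2f}")
# A large spread flags locally improvised, high-complexity segments.
print(f"spread: {statistics.pstdev(ratios.values()):.2f}")
```

In a real analysis the segments would be drawn from dated communications rather than constructed, but the computation is the same: uniform ratios suggest a stable generator, an outlier suggests local improvisation.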
This variability signature is surprisingly robust because it exploits a fundamental constraint on fabrication. Producing a genuinely low-complexity (compressible) narrative requires actual consistency — the claims must cohere structurally, not just superficially. A fabricator can make individual statements sound plausible, but maintaining structural compressibility across a corpus of hundreds of communications requires either telling the truth or maintaining a fabrication of extraordinary internal consistency. The former is easy; the latter is computationally demanding for humans, who lack the working memory to track all the structural implications of their prior statements.
Adversarial Resistance
Compression-based analysis has a unique advantage over semantic methods: it is extremely difficult to defeat through careful wording. Semantic analysis can potentially be gamed by a communicator who understands embedding spaces and crafts statements to appear consistent in those spaces. Compression analysis operates on the raw text at a structural level that is much harder to control. To defeat compression analysis, a fabricator would need to produce a narrative that is genuinely structurally consistent — which is precisely the challenge they face and the constraint they cannot meet.