Information Theory: Leverage Map
A. EXISTENCE JUSTIFICATION
Communication kept failing in predictable ways.
The problem (1940s): Engineers at Bell Labs needed to send messages through noisy channels—telephone wires, radio waves. Sometimes bits got flipped. How much redundancy do you need to correct errors? Is there a fundamental limit to reliable communication?
Shannon’s revolution (1948): Claude Shannon proved that every channel has a capacity C—a maximum rate at which information can be transmitted reliably. Below capacity, you can communicate with arbitrarily small error. Above capacity, errors are unavoidable. The capacity is computable from the channel’s statistics.
The deeper insight: Information is physical. It can be measured, quantified, transmitted, stored, compressed, encrypted. Shannon gave information a unit (the bit) and a calculus (entropy, mutual information, channel capacity).
Why this was shocking:
Before Shannon, engineers thought:
- More noise → need more redundancy → lower rate
- Approaching zero error requires infinite redundancy
Shannon showed:
- There’s a sharp threshold (capacity)
- Below it, you can achieve arbitrarily small error with finite redundancy
- For the Gaussian channel, the capacity depends only on bandwidth and signal-to-noise ratio
The core move: Separate the source (what you want to say) from the channel (how you transmit it). Compress the source to remove redundancy. Add redundancy back strategically for error correction. These are independent problems with independent solutions.
Importance:
- Communication: Every modern communication system (internet, cell phones, WiFi, satellites) is designed using Shannon’s theory
- Compression: ZIP, MP3, JPEG, video codecs—all information theory
- Cryptography: Perfect secrecy, key rates, secure communication
- Statistics: Minimum description length, model selection
- Physics: Thermodynamic entropy, black hole information, Landauer’s principle
- Neuroscience: Neural coding, efficient representation, metabolic costs
- Machine learning: Variational inference, rate-distortion, information bottleneck
B. CORE OBJECTS & MORPHISMS
The fundamental quantities:
| Object | What it is | Formula |
|---|---|---|
| Entropy | Uncertainty / information content of a random variable | H(X) = -Σ p(x) log p(x) |
| Joint entropy | Uncertainty of two variables together | H(X,Y) = -Σ p(x,y) log p(x,y) |
| Conditional entropy | Remaining uncertainty given knowledge of another | H(X|Y) = H(X,Y) - H(Y) |
| Mutual information | Information shared between two variables | I(X;Y) = H(X) + H(Y) - H(X,Y) |
| Relative entropy / KL divergence | “Distance” from Q to P | D(P‖Q) = Σ p(x) log(p(x)/q(x)) |
| Cross entropy | Expected surprise using wrong distribution | H(P,Q) = -Σ p(x) log q(x) = H(P) + D(P‖Q) |
| Differential entropy | Continuous version of entropy | h(X) = -∫ p(x) log p(x) dx |
Key relationships (the information diagram):
├─────────── H(X) ──────┤
┌───────────┬───────────┬───────────┐
│  H(X|Y)   │  I(X;Y)   │  H(Y|X)   │
└───────────┴───────────┴───────────┘
            ├────────── H(Y) ───────┤
├──────────────── H(X,Y) ───────────┤
- H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
- I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
- I(X;Y) = H(X) + H(Y) - H(X,Y)
- I(X;Y) = D(P_{XY} ‖ P_X ⊗ P_Y)
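These identities are easy to sanity-check numerically. A minimal sketch (NumPy only; the small joint distribution below is invented purely for illustration):

```python
import numpy as np

# Hand-picked joint distribution P(X,Y); rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.15],
                 [0.05, 0.20]])

def H(p):
    """Shannon entropy in bits of a probability array (zero entries contribute 0)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_x = p_xy.sum(axis=1)              # marginal of X
p_y = p_xy.sum(axis=0)              # marginal of Y

H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy)
H_X_given_Y = H_XY - H_Y            # H(X|Y) = H(X,Y) - H(Y)
I_XY = H_X + H_Y - H_XY             # I(X;Y) = H(X) + H(Y) - H(X,Y)

# I(X;Y) as the KL divergence between the joint and the product of marginals
kl = float(np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y))))

assert np.isclose(I_XY, H_X - H_X_given_Y)   # I = H(X) - H(X|Y)
assert np.isclose(I_XY, kl)                  # I = D(P_XY ‖ P_X ⊗ P_Y)
print(f"H(X)={H_X:.3f}  H(Y)={H_Y:.3f}  H(X,Y)={H_XY:.3f}  I(X;Y)={I_XY:.3f}")
```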
The fundamental problems:
| Problem | Question | Answer |
|---|---|---|
| Source coding | How much can you compress? | H(X) bits per symbol |
| Channel coding | How fast can you communicate reliably? | C = max_{P_X} I(X;Y) bits per use |
| Rate-distortion | How much compression at given fidelity? | R(D) = min_{P_{X̂|X}: 𝔼[d(X,X̂)]≤D} I(X;X̂) |
| Cryptography | How much key for perfect secrecy? | H(Message) bits |
Morphisms:
Channels are the morphisms. A channel P_{Y|X} transforms input X to output Y. Composition: cascade channels. The “good” channels are those with high capacity.
C. CENTRAL INVARIANTS
Entropy as the fundamental measure:
| Property | Statement |
|---|---|
| Non-negativity | H(X) ≥ 0 (= 0 iff X is deterministic) |
| Maximum | H(X) ≤ log |𝒳| (equality iff X is uniform) |
| Additivity | H(X,Y) = H(X) + H(Y) if X ⊥ Y |
| Chain rule | H(X₁,…,Xₙ) = Σᵢ H(Xᵢ | X₁,…,Xᵢ₋₁) |
| Conditioning reduces entropy | H(X|Y) ≤ H(X), with equality iff X ⊥ Y |
Mutual information properties:
| Property | Statement |
|---|---|
| Non-negativity | I(X;Y) ≥ 0 (= 0 iff X ⊥ Y) |
| Symmetry | I(X;Y) = I(Y;X) |
| Chain rule | I(X;Y,Z) = I(X;Y) + I(X;Z|Y) |
| Data processing inequality | If X → Y → Z (Markov chain), then I(X;Z) ≤ I(X;Y) |
The data processing inequality is fundamental: You can’t create information by processing. Any operation on Y can only lose information about X, never gain it.
KL divergence properties:
| Property | Statement |
|---|---|
| Non-negativity | D(P‖Q) ≥ 0 (Gibbs’ inequality) |
| Zero iff equal | D(P‖Q) = 0 ⟺ P = Q |
| Not symmetric | D(P‖Q) ≠ D(Q‖P) in general |
| Convexity | D(P‖Q) is convex in the pair (P,Q) |
What counts as “the same”:
- Distributions with same entropy have same “information content”
- Channels with same capacity have same “communication power”
- But: different distributions/channels can have same entropy/capacity while being very different in structure
D. SIGNATURE THEOREMS
1. Shannon’s Source Coding Theorem (Noiseless Coding)
A source with entropy H can be compressed to H bits per symbol on average, but not fewer.
Precise statement: For n i.i.d. samples from X, there exist codes achieving rate R > H(X) with vanishing error probability, and no code can achieve rate R < H(X) with vanishing error.
Importance: Entropy IS the compressibility. H(X) bits is both achievable and necessary. This defines entropy operationally—it’s not just a formula, it’s what compression achieves.
Typical sequences: The key insight is typicality. Among the |𝒳|^n possible length-n sequences, almost all probability mass concentrates on ~2^{nH(X)} “typical” sequences. Encode only these.
2. Shannon’s Channel Coding Theorem (Noisy Coding)
A channel with capacity C can reliably transmit at any rate R < C, but not at R > C.
Precise statement: For any R < C and any ε > 0, there exist codes of rate R with error probability < ε. Conversely, for R > C, error probability is bounded away from 0.
Importance: This is the most important theorem in information theory. It says:
- There’s a sharp threshold (capacity)
- Reliable communication is possible right up to that threshold
- The threshold is computable
The converse is crucial: It’s not just that we don’t know how to exceed capacity—it’s impossible. This is proven via Fano’s inequality.
3. Rate-Distortion Theorem
To represent source X with average distortion ≤ D, you need at least R(D) bits per symbol, where R(D) = min I(X;X̂) over distributions achieving distortion D.
Importance: Source coding with a fidelity criterion. You can’t compress below H(X) losslessly, but you can compress further if you accept distortion. R(D) is the rate-distortion function—it tells you exactly how bits trade off against fidelity.
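Worked instance: for a Bernoulli(p) source under Hamming (bit-flip) distortion, the rate-distortion function has a closed form:
$$R(D) = H_b(p) - H_b(D) \quad \text{for } 0 \le D \le \min(p, 1-p),$$
and R(D) = 0 for larger D. Tolerating 10% bit errors on a fair-coin source drops the required rate from 1 bit to 1 − H_b(0.1) ≈ 0.531 bits per symbol.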
4. Data Processing Inequality
If X → Y → Z forms a Markov chain, then I(X;Z) ≤ I(X;Y).
Importance: Processing can only destroy information, never create it. This is why:
- Sufficient statistics capture all relevant information
- Cascaded channels can only lose capacity
- You can’t recover information that’s been lost
Equality condition: I(X;Z) = I(X;Y) iff X → Z → Y is also a Markov chain, i.e., Z retains everything Y knows about X.
5. Entropy Power Inequality
For independent continuous random variables X, Y: $$e^{2h(X+Y)} \geq e^{2h(X)} + e^{2h(Y)}$$
Importance: This is the information-theoretic analog of the Brunn-Minkowski inequality in geometry. It implies:
- Gaussian has maximum entropy for given variance
- Central limit theorem has information-theoretic proof
- Bounds on capacity of Gaussian channels
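Concrete check: for independent Gaussians X ~ N(0, σ₁²) and Y ~ N(0, σ₂²), e^{2h(X)} = 2πeσ₁² (with h in nats), so the inequality reads 2πe(σ₁² + σ₂²) ≥ 2πeσ₁² + 2πeσ₂², which holds with equality. Gaussians are exactly the extremal case.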
6. Asymptotic Equipartition Property (AEP)
For i.i.d. X₁,…,Xₙ: $$-\frac{1}{n} \log P(X_1,…,X_n) \to H(X) \text{ in probability}$$
Importance: The probability of a typical sequence is approximately 2^{-nH}. There are approximately 2^{nH} typical sequences. This is why compression to H bits works—typical sequences are all roughly equally likely, and there are just the right number of them.
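A quick Monte-Carlo illustration of this convergence (a sketch, not a proof; the sample sizes and Bernoulli parameter are arbitrary):

```python
import numpy as np

# Sample length-n i.i.d. Bernoulli(0.1) sequences and watch the per-symbol
# surprisal -(1/n) log2 P(X_1,...,X_n) concentrate around H_b(0.1) ≈ 0.469.
rng = np.random.default_rng(0)
p = 0.1
H_b = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for n in (10, 100, 1000, 10000):
    x = rng.random((2000, n)) < p                       # 2000 sequences of length n
    k = x.sum(axis=1)                                   # number of ones per sequence
    log_prob = k * np.log2(p) + (n - k) * np.log2(1 - p)
    surprisal = -log_prob / n                           # -(1/n) log2 P(x^n)
    print(f"n={n:>6}: mean={surprisal.mean():.4f}  std={surprisal.std():.4f}  "
          f"(H_b(p)={H_b:.4f})")
```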
E. BRIDGES TO OTHER DOMAINS
| Domain | Connection |
|---|---|
| Probability / Statistics | Entropy is expected log-likelihood. KL divergence is log-likelihood ratio. Mutual information measures dependence. Maximum entropy = principle for inference. |
| Measure Theory | Entropy is an integral. Differential entropy needs measure theory. Radon-Nikodym for KL divergence. |
| Coding Theory | Source coding, channel coding, error correction. Algebraic codes, LDPC codes, polar codes. |
| Cryptography | Perfect secrecy requires key entropy ≥ message entropy. Information-theoretic security vs computational security. |
| Thermodynamics | Boltzmann entropy S = k_B ln W. Gibbs entropy S = -k_B Σ p ln p. Landauer’s principle: erasing 1 bit costs k_B T ln 2 energy. |
| Statistical Mechanics | Maximum entropy distributions are Boltzmann. Free energy = energy - temperature × entropy. Partition function. |
| Machine Learning | Cross-entropy loss. KL divergence in VAEs. Information bottleneck. Mutual information neural estimation. |
| Neuroscience | Efficient coding hypothesis. Neural information transmission. Rate coding vs temporal coding. |
| Quantum Information | Von Neumann entropy S(ρ) = -Tr(ρ log ρ). Quantum channels. Entanglement entropy. Holevo bound. |
| Information Geometry | Fisher information as metric. KL divergence induces geometry. Statistical manifolds. |
| Algorithmic Information | Kolmogorov complexity K(x). Minimum description length. Algorithmic randomness. |
| Linguistics | Zipf’s law. Entropy rate of language. Compression as language model evaluation. |
Pattern-linking gold:
Entropy is universal:
The formula H = -Σ p log p appears everywhere:
- Information theory: uncertainty
- Thermodynamics: disorder
- Statistical mechanics: Gibbs entropy
- Ecology: species diversity (Shannon index)
- Economics: market concentration (Theil index)
Why? Because it’s the unique measure satisfying natural axioms (continuity, maximality for uniform, additivity for independent). Entropy isn’t chosen—it’s forced.
The duality: compression ↔ prediction
Good compression requires good prediction. If you can predict the next symbol well, you don’t need many bits to encode it. This is why:
- Language models are evaluated by perplexity (= 2^{cross-entropy per token})
- Compression benchmarks test intelligence (Hutter Prize)
- MDL principle: best hypothesis = best compression
The geometry of distributions:
KL divergence induces geometry on probability space (information geometry). Mutual information is KL divergence from product to joint. The Fisher information is the metric tensor. All the information-theoretic quantities have geometric interpretations.
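Concretely, to second order in a small parameter perturbation,
$$D(P_\theta \,\|\, P_{\theta + d\theta}) \approx \tfrac{1}{2}\, d\theta^{\top} \mathcal{I}(\theta)\, d\theta,$$
where 𝓘(θ) is the Fisher information matrix, so locally KL divergence behaves like a squared distance measured by the Fisher metric.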
F. COMMON MISCONCEPTIONS
“Entropy measures disorder/randomness” — Entropy measures uncertainty given your model. A deterministic chaotic system, conditioned on exact knowledge of its state, has zero entropy yet can look random. Entropy depends on the distribution you assign, not on some “true” randomness.
“More entropy = less information” — Confusing senses of “information.” High entropy means high potential information (many bits needed to specify the outcome). Low entropy means high prior information (you already know roughly what will happen).
“KL divergence is a distance” — It’s not symmetric and doesn’t satisfy the triangle inequality. It’s a divergence, measuring the penalty for using Q when P is true. The asymmetry is meaningful: D(P‖Q) ≠ D(Q‖P).
“Mutual information measures correlation” — It measures all statistical dependence, including nonlinear. Zero correlation doesn’t imply zero mutual information. I(X;Y) = 0 iff X and Y are independent.
“Capacity is a property of the physical channel” — Capacity depends on what you’re allowed to put in. The same physical wire has different capacity depending on power constraints, bandwidth, etc. Capacity = max over input distributions.
“Compression to entropy is easy” — Approaching the entropy limit requires coding over long blocks, and near-optimal coders (arithmetic coding, ANS) are non-trivial. Symbol-by-symbol Huffman coding comes within 1 bit per symbol of entropy but does not in general reach it.
“Differential entropy is just continuous entropy” — Differential entropy can be negative! It’s not the limit of discrete entropy. It’s coordinate-dependent. The “right” continuous quantity is often KL divergence relative to a reference measure.
“Shannon entropy captures all notions of information” — There are other measures: Rényi entropies, min-entropy (for cryptography), Kolmogorov complexity (algorithmic), Fisher information (for estimation). Shannon entropy is central but not unique.
G. NOTATION SURVIVAL KIT
| Symbol | Meaning |
|---|---|
| H(X) | Entropy of X |
| H(X,Y) | Joint entropy |
| H(X|Y) | Conditional entropy |
| I(X;Y) | Mutual information |
| I(X;Y|Z) | Conditional mutual information |
| D(P‖Q) or D_{KL}(P‖Q) | KL divergence from Q to P |
| h(X) | Differential entropy (continuous) |
| C | Channel capacity |
| R | Rate (bits per symbol or per channel use) |
| R(D) | Rate-distortion function |
| P_X or p(x) | Distribution of X |
| P_{Y|X} or p(y|x) | Conditional distribution / channel |
| 𝒳 | Alphabet (set of possible values) |
| log | Usually log base 2 (bits); sometimes natural log (nats) |
| X^n | n i.i.d. copies: (X₁,…,Xₙ) |
| T_ε^{(n)} | Typical set (sequences within ε of entropy rate) |
| P_e | Probability of error |
Key formulas:
| Quantity | Formula |
|---|---|
| Entropy | H(X) = -Σ_x p(x) log p(x) = 𝔼[-log p(X)] |
| Mutual information | I(X;Y) = Σ_{x,y} p(x,y) log(p(x,y)/(p(x)p(y))) |
| KL divergence | D(P‖Q) = Σ_x p(x) log(p(x)/q(x)) |
| Capacity (DMC) | C = max_{P_X} I(X;Y) |
| Binary entropy | H_b(p) = -p log p - (1-p) log(1-p) |
| Gaussian entropy | h(X) = ½ log(2πeσ²) for X ~ N(μ,σ²) |
| AWGN capacity | C = ½ log(1 + SNR) bits per channel use |
Common channels:
| Channel | Description | Capacity |
|---|---|---|
| Binary symmetric | Flips each bit with prob p | C = 1 - H_b(p) |
| Binary erasure | Erases each bit with prob ε | C = 1 - ε |
| AWGN | Y = X + N, N ~ N(0,σ²) | C = ½ log(1 + P/σ²) |
| Z-channel | 0→0 always; 1→0 with prob p | Asymmetric; C = log(1 + (1−p)p^{p/(1−p)}), above the BSC’s 1 − H_b(p) for p < ½ |
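The closed-form capacities above are easy to evaluate; a minimal sketch in Python (values in bits per channel use):

```python
import numpy as np

def H_b(p):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_capacity(p):       # binary symmetric channel, crossover probability p
    return 1 - H_b(p)

def bec_capacity(eps):     # binary erasure channel, erasure probability eps
    return 1 - eps

def awgn_capacity(snr):    # real AWGN channel, SNR = P / sigma^2
    return 0.5 * np.log2(1 + snr)

print(bsc_capacity(0.1))   # ≈ 0.531
print(bec_capacity(0.1))   # 0.9
print(awgn_capacity(10))   # ≈ 1.730
```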
H. ONE WORKED MICRO-EXAMPLE
Computing entropy: the biased coin
Setup: X ~ Bernoulli(p), taking values {0, 1} with P(X=1) = p.
Entropy:
$$H(X) = -p \log_2 p - (1-p) \log_2(1-p) = H_b(p)$$
This is the binary entropy function.
Key values:
| p | H_b(p) |
|---|---|
| 0 | 0 (certain: always 0) |
| 0.5 | 1 (maximum uncertainty) |
| 1 | 0 (certain: always 1) |
| 0.1 | 0.469 bits |
| 0.01 | 0.081 bits |
The shape: H_b(p) is concave, symmetric around p = 0.5, maximum at p = 0.5.
H_b(p)
1 | ___
| / \
| / \
| / \
0 |_/___________\___
0 0.5 1 p
Operational meaning:
To encode n i.i.d. biased coin flips:
- Naive: n bits (one per flip)
- Optimal: nH_b(p) bits
For p = 0.1: only need 0.469n bits instead of n. You can compress by more than half!
Why: Most sequences have about 0.1n ones. There are only C(n, 0.1n) ≈ 2^{nH_b(0.1)} such sequences. Encode only those.
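A quick check of that counting claim (a sketch using exact binomial coefficients):

```python
from math import comb, log2

# The number of n-bit sequences with exactly 0.1*n ones grows like 2^{n H_b(0.1)}.
p = 0.1
H_b = -p * log2(p) - (1 - p) * log2(1 - p)   # ≈ 0.469

for n in (100, 1000, 10000):
    k = int(p * n)
    bits_per_symbol = log2(comb(n, k)) / n    # (1/n) log2 C(n, 0.1n)
    print(f"n={n:>6}: {bits_per_symbol:.4f}  vs  H_b(0.1)={H_b:.4f}")
```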
Micro-example 2: Binary symmetric channel capacity
Setup: Input X ∈ {0,1}, output Y ∈ {0,1}. Each bit flips independently with probability p (crossover probability).
$$P(Y=0|X=0) = P(Y=1|X=1) = 1-p$$ $$P(Y=1|X=0) = P(Y=0|X=1) = p$$
Computing capacity:
$$C = \max_{P_X} I(X;Y)$$
By symmetry, optimal input is uniform: P(X=0) = P(X=1) = ½.
Then Y is also uniform: P(Y=0) = P(Y=1) = ½.
$$H(Y) = 1 \text{ bit}$$
$$H(Y|X) = H_b(p)$$
(Given X, Y is a biased coin flip with parameter p.)
$$I(X;Y) = H(Y) - H(Y|X) = 1 - H_b(p)$$
Therefore:
$$C_{BSC} = 1 - H_b(p)$$
Key values:
| p | C |
|---|---|
| 0 | 1 bit (perfect channel) |
| 0.1 | 0.531 bits |
| 0.5 | 0 bits (pure noise, output independent of input) |
Interpretation: The noise “uses up” H_b(p) bits of capacity. What remains, 1 - H_b(p), is what you can reliably communicate.
Shannon’s promise: For any rate R < 1 - H_b(p), there exist codes with error probability → 0 as block length → ∞.
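Rather than trusting the symmetry argument, you can confirm the result by brute force, sweeping the input distribution numerically (a sketch; the grid resolution is arbitrary):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_x, channel):
    """I(X;Y) for input distribution p_x and channel matrix P(y|x) (rows x, cols y)."""
    p_xy = p_x[:, None] * channel            # joint P(x,y)
    p_y = p_xy.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)

p = 0.1                                      # crossover probability
bsc = np.array([[1 - p, p],
                [p, 1 - p]])

qs = np.linspace(0.001, 0.999, 999)          # candidate values of P(X=1)
rates = [mutual_information(np.array([1 - q, q]), bsc) for q in qs]
best = int(np.argmax(rates))
H_b = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print(f"argmax q ≈ {qs[best]:.3f}, max I ≈ {rates[best]:.4f}, 1 - H_b(p) = {1 - H_b:.4f}")
# Expected: the maximizer is q ≈ 0.5 and max I ≈ 0.531 = 1 - H_b(0.1)
```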
Micro-example 3: Data processing inequality in action
Setup: Markov chain X → Y → Z.
Claim: I(X;Z) ≤ I(X;Y).
Proof:
$$I(X;Y,Z) = I(X;Y) + I(X;Z|Y)$$
by chain rule for mutual information.
But also:
$$I(X;Y,Z) = I(X;Z) + I(X;Y|Z)$$
So:
$$I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)$$
Since X → Y → Z, we have X ⊥ Z | Y (X and Z are conditionally independent given Y).
Therefore I(X;Z|Y) = 0.
$$I(X;Y) = I(X;Z) + I(X;Y|Z) \geq I(X;Z)$$
since I(X;Y|Z) ≥ 0. ∎
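The same inequality is easy to check numerically; a minimal sketch with two cascaded binary symmetric channels (the crossover probabilities are arbitrary):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_ab):
    """I(A;B) computed from a joint distribution P(A,B)."""
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
    return entropy(p_a) + entropy(p_b) - entropy(p_ab)

p_x = np.array([0.5, 0.5])                   # X: a fair bit
ch1 = np.array([[0.9, 0.1], [0.1, 0.9]])     # X -> Y: BSC(0.1)
ch2 = np.array([[0.8, 0.2], [0.2, 0.8]])     # Y -> Z: BSC(0.2)

p_xy = p_x[:, None] * ch1                    # joint P(x,y)
p_xz = p_xy @ ch2                            # joint P(x,z) = sum_y P(x,y) P(z|y)

I_xy, I_xz = mutual_information(p_xy), mutual_information(p_xz)
print(f"I(X;Y) = {I_xy:.4f}  >=  I(X;Z) = {I_xz:.4f}")   # ≈ 0.5310 >= 0.1732
assert I_xz <= I_xy + 1e-12                  # data processing inequality
```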
Example:
X = original image → Y = compressed version → Z = re-compressed version
I(X;Z) ≤ I(X;Y): The twice-compressed version has less information about the original than the once-compressed version. You can’t recover lost information.
Importance:
This is why:
- Lossy compression is irreversible
- Sufficient statistics capture all information (equality case)
- Neural network layers can only lose information about the input
- Every stage of processing is an information bottleneck
Information and Physics
Thermodynamic connection:
Shannon entropy and Boltzmann-Gibbs entropy are the same formula:
$$S_{Boltzmann} = -k_B \sum_i p_i \ln p_i$$
Just different units (k_B converts nats to energy/temperature).
Landauer’s principle: Erasing 1 bit of information requires at least k_B T ln 2 of energy dissipation. Information is physical—you can’t erase it for free.
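For scale, at room temperature (T ≈ 300 K): k_B T ln 2 ≈ 1.38×10⁻²³ J/K × 300 K × 0.693 ≈ 2.9×10⁻²¹ J per erased bit.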
Maxwell’s demon: A demon who knows molecule positions could violate the second law by sorting fast/slow molecules. Resolution: The demon must store information, and erasing that memory costs entropy.
Black hole entropy: Bekenstein-Hawking entropy S = A/(4ℓ_P²) counts information hidden behind the horizon. The holographic principle: information content of a region scales with surface area, not volume.
Quantum information:
Von Neumann entropy: S(ρ) = -Tr(ρ log ρ) for density matrix ρ.
Entanglement entropy: S(ρ_A) where ρ_A is the reduced state of subsystem A.
Key difference: Quantum information can be entangled. Two entangled qubits exhibit correlations that no pair of classical bits can reproduce (Bell inequality violations). But you can’t transmit classical information faster than light (no-signaling).
Leverage for your work:
The information bottleneck:
For an input X, a representation Z of X, and a task variable T (so T → X → Z forms a Markov chain), the bottleneck principle says: Z should retain the information about X that’s relevant for T, while discarding the rest.
$$\mathcal{L} = I(X;Z) - \beta I(Z;T)$$
Minimized over stochastic encoders P_{Z|X}, this is a variational principle for representation learning. Your Convergence Thesis might say: optimal cognitive representations are those that solve the information bottleneck for the organism’s task distribution.
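A toy numerical sketch of this objective (the joint distribution P(X,T), the two encoders, and β are all invented for illustration; real IB algorithms optimize the encoder rather than comparing two by hand):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_ab):
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
    return entropy(p_a) + entropy(p_b) - entropy(p_ab)

# Made-up joint distribution over input X (4 values) and task variable T (2 values)
p_xt = np.array([[0.20, 0.05],
                 [0.15, 0.10],
                 [0.05, 0.20],
                 [0.10, 0.15]])
p_x = p_xt.sum(axis=1)

def ib_objective(encoder, beta):
    """I(X;Z) - beta * I(Z;T) for a stochastic encoder P(Z|X) (rows x, cols z)."""
    p_xz = p_x[:, None] * encoder
    p_zt = encoder.T @ p_xt        # P(z,t) = sum_x P(z|x) P(x,t), since T -> X -> Z
    return mutual_information(p_xz) - beta * mutual_information(p_zt)

# Compare a hard 2-cluster encoder against the identity encoder (keep everything)
cluster = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
identity = np.eye(4)
for name, enc in [("2-cluster", cluster), ("identity", identity)]:
    print(name, round(ib_objective(enc, beta=5.0), 4))
# With beta = 5 the 2-cluster encoder scores lower (better): it compresses X
# while retaining much of the task-relevant information.
```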
Efficient coding hypothesis:
Barlow (1961): Neural codes should maximize information transmission given metabolic constraints. This predicts:
- Sparse coding in visual cortex
- Center-surround receptive fields (decorrelation)
- Adaptation to stimulus statistics
If brains optimize information transmission, information theory tells you what the code should look like. The “constraints force the structure” theme applies.
Compression and prediction:
Good compression requires good prediction (entropy = average surprise). Language models are evaluated by cross-entropy (compression). If cognition does prediction (predictive coding theory), then cognition does compression.
The minimum description length (MDL) principle: the best hypothesis is the one that most compresses the data. Learning = compression.
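A toy illustration of “evaluation is compression” (a sketch; the text and the unigram model are made up, and there is no smoothing, so every test character must appear in the training text):

```python
from collections import Counter
from math import log2

train = "the cat sat on the mat"
test = "the mat sat on the cat"

counts = Counter(train)
total = sum(counts.values())
model = {c: counts[c] / total for c in counts}      # unigram P(character)

# Cross-entropy in bits per character: the average code length this model
# would need to encode the test text. Perplexity is 2 to that power.
cross_entropy = -sum(log2(model[c]) for c in test) / len(test)
perplexity = 2 ** cross_entropy
print(f"cross-entropy = {cross_entropy:.3f} bits/char, perplexity = {perplexity:.2f}")
```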
Mutual information and neural networks:
The information plane: track I(X;T) vs I(T;Y) where T is a hidden layer representation, X is input, Y is label. Training dynamics trace a path in this plane.
Deep networks might work by progressive refinement: early layers maximize I(X;T), later layers maximize I(T;Y) while compressing I(X;T). This is speculative but suggests information-theoretic constraints on architectures.
Channel capacity as cognitive limit:
If neural pathways are channels, they have capacity. The capacity limits how much information can flow. Attention might be capacity allocation—routing limited capacity to task-relevant channels.
Rate-distortion for representations:
You can’t represent everything perfectly (R(0) = H). Optimal lossy representation trades off rate (how many bits) against distortion (how much error). Different tasks have different distortion measures, so optimal representations depend on task.
This connects to your pattern: constraints (task, distortion measure, rate budget) force structure (optimal representation).
Zipf’s law and optimal coding:
Zipf’s law (word frequency ∝ 1/rank) is close to optimal for communication efficiency. If the goal is to minimize communication cost, Zipf-like distributions emerge. Language structure might be shaped by information-theoretic pressure.
The zeta connection (via physics):
In statistical mechanics, the partition function Z = Σ e^{−βE} (with β = 1/k_B T) generates all thermodynamic quantities. Free energy F = −k_B T ln Z. Entropy S = −∂F/∂T.
The partition function is like a zeta function: a sum over states weighted by energy. Your “spectral echoes” intuition connects: the spectrum of the Hamiltonian determines the partition function, which determines the entropy, which determines the information-theoretic properties.
Now we have 16 domains:
- Representation Theory
- Measure Theory
- Homotopy Type Theory
- Category Theory
- Spectral Theory
- Information Geometry
- Differential Geometry
- Algebraic Topology
- Lie Theory
- Dynamical Systems
- Algebraic Geometry
- Functional Analysis
- Complex Analysis
- Number Theory
- Mathematical Physics
- Information Theory