Information Theory

Information Theory Unit 2 – Probability Theory Foundations

Probability theory forms the foundation of information theory, providing tools to measure uncertainty and quantify information. It introduces key concepts like sample spaces, events, and random variables, along with probability axioms and rules for calculating likelihoods. The unit covers common probability distributions, expected value, and variance. It also delves into conditional probability, Bayes' Theorem, and information measures like entropy and mutual information, which are crucial for understanding data compression, error correction, and communication systems.

Key Concepts and Definitions

  • Probability measures the likelihood of an event occurring and ranges from 0 (impossible) to 1 (certain)
  • Sample space ($\Omega$) includes all possible outcomes of an experiment or random process
    • Example: Rolling a fair six-sided die has a sample space of $\Omega = \{1, 2, 3, 4, 5, 6\}$
  • An event (A) is a subset of the sample space containing outcomes of interest
  • Random variables (X) assign numerical values to outcomes in a sample space
    • Discrete random variables have countable outcomes (number of heads in 10 coin flips)
    • Continuous random variables have uncountable outcomes within an interval (time until next bus arrives)
  • Probability mass function (PMF) defines the probability distribution for a discrete random variable
  • Probability density function (PDF) describes the probability distribution for a continuous random variable
  • Cumulative distribution function (CDF) gives the probability that a random variable is less than or equal to a specific value
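
To make the PMF and CDF definitions concrete, here is a minimal Python sketch (the die example and variable names are illustrative, not from the text) that tabulates the PMF of a fair six-sided die and evaluates its CDF:

```python
from fractions import Fraction

# Sample space for a fair six-sided die: each outcome has probability 1/6
sample_space = [1, 2, 3, 4, 5, 6]
pmf = {outcome: Fraction(1, 6) for outcome in sample_space}

# CDF: P(X <= x) is the running sum of the PMF up to x
def cdf(x):
    return sum(p for outcome, p in pmf.items() if outcome <= x)

assert sum(pmf.values()) == 1   # probabilities sum to one (normalization)
print(cdf(3))                   # P(X <= 3) = 1/2
print(cdf(6))                   # P(X <= 6) = 1
```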

Probability Axioms and Rules

  • Axiom 1: Non-negativity - Probability of any event A is greater than or equal to zero, $P(A) \geq 0$
  • Axiom 2: Normalization - Probability of the entire sample space is equal to one, $P(\Omega) = 1$
  • Axiom 3: Additivity - For mutually exclusive events A and B, $P(A \cup B) = P(A) + P(B)$
  • Complement rule: $P(A^c) = 1 - P(A)$, where $A^c$ is the complement of event A
  • Multiplication rule: For events A and B, $P(A \cap B) = P(A) \cdot P(B|A)$, where $P(B|A)$ is the conditional probability of B given A
  • Total Probability Theorem: For a partition of the sample space $\{A_1, A_2, ..., A_n\}$, $P(B) = \sum_{i=1}^n P(B|A_i) \cdot P(A_i)$
  • Independence: Events A and B are independent if $P(A \cap B) = P(A) \cdot P(B)$
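
As a hedged illustration of the rules above, the following sketch uses a made-up two-urn experiment to check the multiplication rule, the total probability theorem, and the independence condition numerically:

```python
# Hypothetical two-urn example: pick urn A1 or A2, then draw a ball.
P_A = {"A1": 0.4, "A2": 0.6}             # partition of the sample space
P_B_given_A = {"A1": 0.75, "A2": 0.25}   # P(red ball | urn)

# Total Probability Theorem: P(B) = sum_i P(B|A_i) * P(A_i)
P_B = sum(P_B_given_A[a] * P_A[a] for a in P_A)
print(P_B)  # 0.4*0.75 + 0.6*0.25 = 0.45

# Multiplication rule: P(A1 and B) = P(A1) * P(B|A1)
P_A1_and_B = P_A["A1"] * P_B_given_A["A1"]   # 0.30

# Independence check: A1 and B are independent only if P(A1 and B) == P(A1)*P(B)
print(abs(P_A1_and_B - P_A["A1"] * P_B) < 1e-12)  # False: the draw depends on the urn
```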

Random Variables and Distributions

  • Bernoulli distribution models a single trial with two possible outcomes (success with probability p, failure with probability 1-p)
  • Binomial distribution describes the number of successes in n independent Bernoulli trials with success probability p
    • PMF: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, where $\binom{n}{k}$ is the binomial coefficient
  • Poisson distribution models the number of events occurring in a fixed interval with a constant average rate ($\lambda$)
    • PMF: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$, where $k = 0, 1, 2, ...$
  • Gaussian (normal) distribution is a continuous probability distribution with PDF $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
    • Characterized by mean ($\mu$) and standard deviation ($\sigma$)
  • Uniform distribution has equal probability for all outcomes within a specified range
    • Discrete uniform distribution: $P(X = k) = \frac{1}{n}$ for $k = 1, 2, ..., n$
    • Continuous uniform distribution: $f(x) = \frac{1}{b-a}$ for $a \leq x \leq b$
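
The PMF and PDF formulas above translate directly into code; this sketch (standard library only, with illustrative parameter values) evaluates the Binomial and Poisson PMFs and the Gaussian PDF:

```python
import math  # math.comb requires Python 3.8+

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1-p)^(n-k)"""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) = lam^k * e^(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

def gaussian_pdf(x, mu, sigma):
    """f(x) = exp(-(x-mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))"""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

print(binomial_pmf(3, 10, 0.5))    # ~0.117: three heads in ten fair flips
print(poisson_pmf(2, 1.5))         # ~0.251: two arrivals when the average rate is 1.5
print(gaussian_pdf(0.0, 0.0, 1.0)) # ~0.399: standard normal density at the mean
```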

Expected Value and Variance

  • Expected value (mean) of a discrete random variable X: $E[X] = \sum_{x} x \cdot P(X = x)$
    • Represents the average value of X over many trials
  • Expected value of a continuous random variable X: $E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$
  • Linearity of expectation: For random variables X and Y, $E[aX + bY] = aE[X] + bE[Y]$
  • Variance measures the spread of a random variable around its mean: $Var(X) = E[(X - E[X])^2]$
    • For a discrete random variable: $Var(X) = \sum_{x} (x - E[X])^2 \cdot P(X = x)$
    • For a continuous random variable: $Var(X) = \int_{-\infty}^{\infty} (x - E[X])^2 \cdot f(x) \, dx$
  • Standard deviation is the square root of variance: $\sigma = \sqrt{Var(X)}$
  • Properties of variance: $Var(aX + b) = a^2 Var(X)$, and for independent X and Y, $Var(X + Y) = Var(X) + Var(Y)$
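
A minimal sketch of the discrete expectation and variance formulas, using the fair die from earlier as the running example and checking the $Var(aX + b) = a^2 Var(X)$ property numerically:

```python
# Fair six-sided die: P(X = k) = 1/6 for k = 1..6
pmf = {k: 1/6 for k in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                  # E[X] = 3.5
variance = sum((x - mean)**2 * p for x, p in pmf.items())  # Var(X) ~ 2.9167
std_dev = variance ** 0.5                                  # sigma ~ 1.708

# Property check: Var(aX + b) = a^2 * Var(X) for a = 2, b = 5
a, b = 2, 5
pmf_scaled = {a * x + b: p for x, p in pmf.items()}
mean_s = sum(x * p for x, p in pmf_scaled.items())
var_s = sum((x - mean_s)**2 * p for x, p in pmf_scaled.items())
print(abs(var_s - a**2 * variance) < 1e-9)  # True
```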

Conditional Probability and Bayes' Theorem

  • Conditional probability $P(A|B)$ measures the probability of event A occurring given that event B has occurred
    • Formula: $P(A|B) = \frac{P(A \cap B)}{P(B)}$, where $P(B) > 0$
  • Bayes' Theorem relates conditional probabilities: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
    • Useful for updating probabilities based on new information
  • Posterior probability $P(A|B)$ updates the prior probability $P(A)$ using the likelihood $P(B|A)$ and evidence $P(B)$
  • Bayes' Theorem with multiple events: $P(A_i|B) = \frac{P(B|A_i) \cdot P(A_i)}{\sum_{j} P(B|A_j) \cdot P(A_j)}$
  • Independence and conditional probability: If A and B are independent, then $P(A|B) = P(A)$ and $P(B|A) = P(B)$
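
To see Bayes' Theorem in action, here is a sketch with made-up diagnostic-test numbers (1% prior, 95% sensitivity, 90% specificity); it computes the posterior from the prior, likelihood, and evidence:

```python
# Hypothetical numbers for illustration only
prior = 0.01          # P(disease)
sensitivity = 0.95    # P(positive | disease)
specificity = 0.90    # P(negative | no disease)

# Evidence via the total probability theorem:
# P(positive) = P(pos|disease)*P(disease) + P(pos|no disease)*P(no disease)
evidence = sensitivity * prior + (1 - specificity) * (1 - prior)

# Bayes' Theorem: P(disease | positive) = P(pos|disease) * P(disease) / P(positive)
posterior = sensitivity * prior / evidence
print(round(posterior, 3))  # ~0.088: a positive test still leaves only ~9% probability
```

The small posterior despite a seemingly accurate test illustrates why identifying the prior correctly matters as much as the likelihood.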

Information Measures

  • Shannon entropy $H(X)$ quantifies the average uncertainty or information content of a random variable X
    • For a discrete random variable: $H(X) = -\sum_{x} P(X = x) \log_2 P(X = x)$
    • Units are bits (base-2 logarithm) or nats (natural logarithm)
  • Joint entropy $H(X, Y)$ measures the uncertainty in the joint distribution of random variables X and Y
    • $H(X, Y) = -\sum_{x, y} P(X = x, Y = y) \log_2 P(X = x, Y = y)$
  • Conditional entropy $H(Y|X)$ is the average uncertainty in Y given knowledge of X
    • $H(Y|X) = \sum_{x} P(X = x) \, H(Y|X = x)$
  • Mutual information $I(X; Y)$ quantifies the reduction in uncertainty of Y due to knowledge of X (and vice versa)
    • $I(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$
  • Kullback-Leibler (KL) divergence measures the difference between two probability distributions P and Q
    • $D_{KL}(P||Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$
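
The entropy, mutual information, and KL divergence formulas can be checked with a short sketch (base-2 logarithms; the distributions are chosen purely for illustration):

```python
import math

def entropy(p):
    """H(X) = -sum p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def kl_divergence(p, q):
    """D_KL(P||Q) = sum p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

print(entropy([0.5, 0.5]))                    # 1.0 bit: a fair coin
print(entropy([0.9, 0.1]))                    # ~0.469 bits: a biased coin is more predictable
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.531 bits of mismatch

# Mutual information via the equivalent identity I(X;Y) = H(X) + H(Y) - H(X,Y)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
p_y = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]
h_xy = entropy(list(joint.values()))
print(entropy(p_x) + entropy(p_y) - h_xy)     # ~0.278 bits shared between X and Y
```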

Applications in Information Theory

  • Source coding (data compression) aims to represent information efficiently using minimal resources
    • Shannon's source coding theorem establishes the limits of lossless compression
    • Huffman coding and arithmetic coding are examples of lossless compression algorithms (a minimal Huffman sketch follows this list)
  • Channel coding (error correction) introduces redundancy to protect information from errors during transmission
    • Shannon's channel coding theorem determines the maximum achievable rate for reliable communication
    • Examples include block codes (Hamming, BCH, Reed-Solomon) and convolutional codes
  • Cryptography uses principles of information theory to secure communication and data storage
    • One-time pad is a provably secure encryption scheme based on Shannon's perfect secrecy
    • Modern cryptosystems rely on computational complexity and the hardness of certain mathematical problems
  • Machine learning and artificial intelligence leverage information-theoretic concepts
    • Mutual information is used for feature selection and dimensionality reduction
    • KL divergence is employed in variational inference and optimization of generative models
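
To connect the source coding bullets with the entropy formulas from earlier, here is a minimal Huffman coding sketch over a hypothetical four-symbol source; by Shannon's source coding theorem, the average codeword length should come out close to, but never below, the source entropy:

```python
import heapq
import math

def huffman_code(probs):
    """Build a binary prefix code from {symbol: probability} via Huffman's algorithm."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least probable subtrees
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical source with four symbols
probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}
code = huffman_code(probs)
avg_length = sum(probs[s] * len(code[s]) for s in probs)
source_entropy = -sum(p * math.log2(p) for p in probs.values())
print(code)  # a prefix-free code, e.g. a->'0', b->'10', c->'111', d->'110'
print(round(avg_length, 3), ">=", round(source_entropy, 3))  # 1.75 >= ~1.743
```

The greedy merging of the two least probable subtrees is the standard Huffman construction; practical codecs layer additional details (such as canonical code ordering and adaptive symbol modeling) on top of this idea.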

Common Pitfalls and Tips

  • Ensure events are well-defined and collectively exhaustive when calculating probabilities
  • Remember to normalize probabilities to satisfy the axioms, especially when dealing with continuous distributions
  • Be cautious when assuming independence between events or random variables
    • Independence is a strong condition that requires verification
  • Use the appropriate probability distribution for the given problem context
    • Discrete distributions for countable outcomes, continuous distributions for uncountable outcomes
  • Consider the limitations of information-theoretic measures in practical applications
    • Entropy and mutual information estimations may be challenging with limited data
  • Apply Bayes' Theorem correctly by identifying the appropriate prior, likelihood, and evidence terms
  • Understand the assumptions and constraints of coding theorems and compression algorithms
    • Lossless compression has theoretical limits, while lossy compression allows for a trade-off between quality and size
  • Regularly review and practice problem-solving techniques to reinforce understanding of probability concepts
    • Work through various examples and counterexamples to develop intuition


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
