Information Theory Unit 2 – Probability Theory Foundations
Probability theory forms the foundation of information theory, providing tools to measure uncertainty and quantify information. It introduces key concepts like sample spaces, events, and random variables, along with probability axioms and rules for calculating likelihoods.
The unit covers various probability distributions, expected values, and variance. It also delves into conditional probability, Bayes' Theorem, and information measures like entropy and mutual information, which are crucial for understanding data compression, error correction, and communication systems.
Key Concepts and Definitions
Probability measures the likelihood of an event occurring and ranges from 0 (impossible) to 1 (certain)
Sample space ($\Omega$) includes all possible outcomes of an experiment or random process
Example: Rolling a fair six-sided die has a sample space of $\Omega = \{1, 2, 3, 4, 5, 6\}$
An event (A) is a subset of the sample space containing outcomes of interest
Random variables (X) assign numerical values to outcomes in a sample space
Discrete random variables have countable outcomes (number of heads in 10 coin flips)
Continuous random variables have uncountable outcomes within an interval (time until next bus arrives)
Probability mass function (PMF) defines the probability distribution for a discrete random variable
Probability density function (PDF) describes the probability distribution for a continuous random variable
Cumulative distribution function (CDF) gives the probability that a random variable is less than or equal to a specific value
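A minimal Python sketch of these definitions, using the fair-die example above (the helper names are illustrative, not standard library APIs):

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
omega = {1, 2, 3, 4, 5, 6}

# PMF of the discrete random variable X = "face shown": uniform over omega
pmf = {outcome: Fraction(1, 6) for outcome in omega}

def prob(event):
    """P(event): sum the PMF over the outcomes the event contains."""
    return sum(pmf[outcome] for outcome in event if outcome in omega)

def cdf(x):
    """CDF F(x) = P(X <= x)."""
    return sum(p for outcome, p in pmf.items() if outcome <= x)

print(prob({2, 4, 6}))  # P(roll is even) = 1/2
print(cdf(4))           # P(X <= 4) = 2/3
```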
Probability Axioms and Rules
Axiom 1: Non-negativity - Probability of any event A is greater than or equal to zero, $P(A) \geq 0$
Axiom 2: Normalization - Probability of the entire sample space is equal to one, $P(\Omega) = 1$
Axiom 3: Additivity - For mutually exclusive events A and B, $P(A \cup B) = P(A) + P(B)$
Complement rule: $P(A^c) = 1 - P(A)$, where $A^c$ is the complement of event A
Multiplication rule: For events A and B, $P(A \cap B) = P(A) \cdot P(B|A)$, where $P(B|A)$ is the conditional probability of B given A
Total Probability Theorem: For a partition of the sample space $\{A_1, A_2, \ldots, A_n\}$, $P(B) = \sum_{i=1}^n P(B|A_i) \cdot P(A_i)$
Independence: Events A and B are independent if $P(A \cap B) = P(A) \cdot P(B)$
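The rules above can be verified by brute-force enumeration over a small sample space. A sketch for two independent rolls of a fair die (the choice of events is an arbitrary illustration):

```python
from fractions import Fraction
from itertools import product

# Sample space for two rolls of a fair die; each of the 36 outcomes has probability 1/36
omega = list(product(range(1, 7), repeat=2))
p_outcome = Fraction(1, len(omega))

def prob(event):
    """P(event), where an event is a predicate selecting outcomes of omega."""
    return sum(p_outcome for w in omega if event(w))

A = lambda w: w[0] % 2 == 0   # event A: the first roll is even
B = lambda w: sum(w) == 7     # event B: the two rolls sum to 7

# Complement rule: P(A^c) = 1 - P(A)
assert prob(lambda w: not A(w)) == 1 - prob(A)

# Total Probability Theorem, partitioning on the value of the first roll
p_first = {k: prob(lambda w, k=k: w[0] == k) for k in range(1, 7)}
p_B_given = {k: prob(lambda w, k=k: B(w) and w[0] == k) / p_first[k] for k in range(1, 7)}
assert prob(B) == sum(p_B_given[k] * p_first[k] for k in range(1, 7))

# Independence: P(A ∩ B) = P(A) · P(B) holds here, so A and B are independent
assert prob(lambda w: A(w) and B(w)) == prob(A) * prob(B)
```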
Random Variables and Distributions
Bernoulli distribution models a single trial with two possible outcomes (success with probability p, failure with probability 1-p)
Binomial distribution describes the number of successes in n independent Bernoulli trials with success probability p
PMF: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, where $\binom{n}{k}$ is the binomial coefficient
Poisson distribution models the number of events occurring in a fixed interval with a constant average rate ($\lambda$)
PMF: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$, where $k = 0, 1, 2, \ldots$
Gaussian (normal) distribution is a continuous probability distribution with PDF $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Characterized by mean ($\mu$) and standard deviation ($\sigma$)
Uniform distribution has equal probability for all outcomes within a specified range
Discrete uniform distribution: $P(X = k) = \frac{1}{n}$ for $k = 1, 2, \ldots, n$
Continuous uniform distribution: $f(x) = \frac{1}{b-a}$ for $a \leq x \leq b$
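The PMFs and PDF above translate directly into code. A standard-library-only sketch (libraries such as SciPy also provide these distributions ready-made):

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def gaussian_pdf(x, mu, sigma):
    """Density f(x) for X ~ Normal(mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

print(binomial_pmf(3, 10, 0.5))    # P(3 heads in 10 fair coin flips) ≈ 0.117
print(poisson_pmf(0, 2.0))         # P(no events) at average rate λ = 2 ≈ 0.135
print(gaussian_pdf(0.0, 0.0, 1.0)) # standard normal density at 0 ≈ 0.399
```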
Expected Value and Variance
Expected value (mean) of a discrete random variable X: $E[X] = \sum_{x} x \cdot P(X = x)$
Represents the average value of X over many trials
Expected value of a continuous random variable X: $E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$
Linearity of expectation: For random variables X and Y, $E[aX + bY] = aE[X] + bE[Y]$
Variance measures the spread of a random variable around its mean: $Var(X) = E[(X - E[X])^2]$
For a discrete random variable: $Var(X) = \sum_{x} (x - E[X])^2 \cdot P(X = x)$
For a continuous random variable: $Var(X) = \int_{-\infty}^{\infty} (x - E[X])^2 \cdot f(x)\,dx$
Standard deviation is the square root of variance: $\sigma = \sqrt{Var(X)}$
Properties of variance: $Var(aX + b) = a^2 Var(X)$, and for independent X and Y, $Var(X + Y) = Var(X) + Var(Y)$
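A short numerical check of these formulas on a discrete distribution, here a fair six-sided die (the function names are illustrative):

```python
from fractions import Fraction

# PMF of a fair six-sided die
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def expectation(pmf):
    """E[X] = sum over x of x · P(X = x)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E[X])^2]."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

print(expectation(pmf))  # 7/2
print(variance(pmf))     # 35/12

# Var(aX + b) = a^2 Var(X): transform the die, e.g. Y = 3X + 1
pmf_y = {3 * x + 1: p for x, p in pmf.items()}
assert variance(pmf_y) == 9 * variance(pmf)
```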
Conditional Probability and Bayes' Theorem
Conditional probability $P(A|B)$ measures the probability of event A occurring given that event B has occurred
Formula: $P(A|B) = \frac{P(A \cap B)}{P(B)}$, where $P(B) > 0$
Bayes' Theorem relates conditional probabilities: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
Useful for updating probabilities based on new information
Posterior probability $P(A|B)$ updates the prior probability $P(A)$ using the likelihood $P(B|A)$ and evidence $P(B)$
Bayes' Theorem with multiple events: $P(A_i|B) = \frac{P(B|A_i) \cdot P(A_i)}{\sum_{j} P(B|A_j) \cdot P(A_j)}$
Independence and conditional probability: If A and B are independent, then $P(A|B) = P(A)$ and $P(B|A) = P(B)$
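A worked Bayes' Theorem computation; the prior, sensitivity, and false-positive rate below are made-up numbers chosen only for illustration:

```python
# Hypothetical diagnostic-test numbers (assumed for illustration, not real data)
p_disease = 0.01               # prior P(D)
p_pos_given_disease = 0.95     # likelihood P(+|D), the test's sensitivity
p_pos_given_healthy = 0.05     # false-positive rate P(+|D^c)

# Evidence P(+) via the Total Probability Theorem
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(D|+) via Bayes' Theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # ≈ 0.161: even after a positive test, P(D|+) stays modest
```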
Entropy and Information Measures
Shannon entropy $H(X)$ quantifies the average uncertainty or information content of a random variable X
For a discrete random variable: $H(X) = -\sum_{x} P(X = x) \log_2 P(X = x)$
Units are bits (base-2 logarithm) or nats (natural logarithm)
Joint entropy $H(X, Y)$ measures the uncertainty in the joint distribution of random variables X and Y
$H(X, Y) = -\sum_{x, y} P(X = x, Y = y) \log_2 P(X = x, Y = y)$
Conditional entropy $H(Y|X)$ is the average uncertainty in Y given knowledge of X
$H(Y|X) = \sum_{x} P(X = x) H(Y|X = x)$
Mutual information $I(X; Y)$ quantifies the reduction in uncertainty of Y due to knowledge of X (and vice versa)
$I(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$
Kullback-Leibler (KL) divergence measures the difference between two probability distributions P and Q
$D_{KL}(P \| Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$
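All of these measures can be computed directly from a joint PMF. A sketch with a small made-up joint distribution over two binary variables (the numbers are arbitrary):

```python
from math import log2

# Assumed joint PMF P(X = x, Y = y) for two binary variables
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def entropy(pmf):
    """H = -sum p · log2(p), in bits; terms with p = 0 contribute 0."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# Marginal distributions of X and Y
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(joint)
h_y_given_x = h_xy - h_x         # chain rule: H(Y|X) = H(X,Y) - H(X)
mutual_info = h_y - h_y_given_x  # I(X;Y) = H(Y) - H(Y|X)
print(round(mutual_info, 4))     # ≈ 0.278 bits

# KL divergence D(P||Q) between the marginal of X and a uniform Q
q = {0: 0.5, 1: 0.5}
d_kl = sum(p * log2(p / q[x]) for x, p in p_x.items() if p > 0)
print(d_kl)  # 0.0 here, because the marginal of X happens to be uniform itself
```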
Applications in Information Theory
Source coding (data compression) aims to represent information efficiently using minimal resources
Shannon's source coding theorem establishes the limits of lossless compression
Huffman coding and arithmetic coding are examples of lossless compression algorithms (see the sketch at the end of this section)
Channel coding (error correction) introduces redundancy to protect information from errors during transmission
Shannon's channel coding theorem determines the maximum achievable rate for reliable communication
Examples include block codes (Hamming, BCH, Reed-Solomon) and convolutional codes
Cryptography uses principles of information theory to secure communication and data storage
One-time pad is a provably secure encryption scheme based on Shannon's perfect secrecy
Modern cryptosystems rely on computational complexity and the hardness of certain mathematical problems
Machine learning and artificial intelligence leverage information-theoretic concepts
Mutual information is used for feature selection and dimensionality reduction
KL divergence is employed in variational inference and optimization of generative models
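To make the source coding ideas above concrete, here is a minimal Huffman coder for a known symbol distribution (the distribution is an arbitrary example and the tie-breaking is an implementation choice of this sketch); its average code length falls between $H(X)$ and $H(X) + 1$ bits per symbol, consistent with Shannon's source coding theorem:

```python
import heapq
from math import log2

# Assumed source distribution (chosen only for illustration)
probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}

def huffman_codes(probs):
    """Build a binary Huffman code; returns {symbol: codeword}."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes(probs)
entropy = -sum(p * log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(codes[s]) for s in probs)
print(codes)                                # e.g. {'a': '0', 'b': '10', 'c': '111', 'd': '110'}
print(round(entropy, 3), round(avg_len, 2)) # H(X) ≈ 1.743 bits; average length 1.75 bits/symbol
```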
Common Pitfalls and Tips
Ensure events are well-defined, and that any partition of the sample space is mutually exclusive and collectively exhaustive, when calculating probabilities
Remember to normalize probabilities to satisfy the axioms, especially when dealing with continuous distributions
Be cautious when assuming independence between events or random variables
Independence is a strong condition that requires verification
Use the appropriate probability distribution for the given problem context
Discrete distributions for countable outcomes, continuous distributions for uncountable outcomes
Consider the limitations of information-theoretic measures in practical applications
Estimates of entropy and mutual information can be unreliable with limited data (see the sketch at the end of this section)
Apply Bayes' Theorem correctly by identifying the appropriate prior, likelihood, and evidence terms
Understand the assumptions and constraints of coding theorems and compression algorithms
Lossless compression has theoretical limits, while lossy compression allows for a trade-off between quality and size
Regularly review and practice problem-solving techniques to reinforce understanding of probability concepts
Work through various examples and counterexamples to develop intuition
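To illustrate the estimation pitfall noted above: the plug-in entropy estimate computed from a small sample is biased low. A quick simulation sketch (the 8-symbol uniform source and the sample sizes are arbitrary choices):

```python
import random
from math import log2

random.seed(0)

def plugin_entropy(sample):
    """Entropy (bits) of the empirical distribution of a sample."""
    counts = {}
    for s in sample:
        counts[s] = counts.get(s, 0) + 1
    n = len(sample)
    return -sum(c / n * log2(c / n) for c in counts.values())

# True source: uniform over 8 symbols, so H(X) = 3 bits exactly
symbols = list(range(8))
for n in (10, 100, 10_000):
    estimates = [plugin_entropy(random.choices(symbols, k=n)) for _ in range(200)]
    print(n, round(sum(estimates) / len(estimates), 3))
# Typical result: small samples average noticeably below 3 bits, and the
# estimate only approaches 3 as the sample size grows.
```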