🎲 Data Science Statistics Unit 1 – Intro to Probability & Stats for Data Science
Probability and statistics form the backbone of data science, providing essential tools for analyzing and interpreting data. This unit covers key concepts like probability, types of data, descriptive statistics, and inferential methods, laying the groundwork for more advanced techniques.
From basic probability rules to hypothesis testing and data visualization, these foundational skills enable data scientists to extract meaningful insights from complex datasets. Understanding these concepts is crucial for making informed decisions, building predictive models, and communicating findings effectively in various data science applications.
Probability quantifies the likelihood of an event occurring and ranges from 0 to 1
0 indicates an impossible event, while 1 represents a certain event
Statistics involves collecting, analyzing, and interpreting data to make informed decisions
Population refers to the entire group of individuals or objects under study
Sample is a subset of the population used to draw inferences about the whole
Variable is a characteristic or attribute that can take on different values (age, height, income)
Categorical variables have distinct categories or groups (gender, race, marital status)
Continuous variables can take on any value within a range (weight, temperature, time)
Distribution describes how data is spread out or dispersed across different values
Hypothesis is a statement or claim about a population parameter that can be tested using sample data
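To make a few of these terms concrete, here is a minimal sketch (assuming NumPy is available) that builds a hypothetical population of ages, draws a random sample from it, and computes a probability, which always lies between 0 and 1:

```python
# Minimal sketch: a "population" of ages and a random sample drawn from it (NumPy assumed available).
import numpy as np

rng = np.random.default_rng(seed=42)
population = rng.integers(low=18, high=90, size=100_000)  # hypothetical population of ages
sample = rng.choice(population, size=500, replace=False)  # a sample used to infer about the whole

print("Population mean:", population.mean())
print("Sample mean:    ", sample.mean())

# A probability is always between 0 and 1, e.g. the probability that an age exceeds 65:
p_over_65 = (population > 65).mean()
print("P(age > 65) =", round(p_over_65, 3))
```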
Probability Basics
Probability is expressed as a number between 0 and 1, often as a decimal or fraction
The sum of probabilities for all possible outcomes in a sample space equals 1
Independent events have no influence on each other's occurrence (flipping a coin multiple times)
Dependent events affect the probability of subsequent events (drawing cards without replacement)
Conditional probability measures the likelihood of an event occurring given that another event has already occurred, denoted as P(A∣B)
Bayes' theorem relates conditional probabilities and can be used to update probabilities based on new information
Expected value is the average outcome of an experiment if repeated many times, calculated by multiplying each possible outcome by its probability and summing the results
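The sketch below works through Bayes' theorem and expected value in plain Python; the diagnostic-test scenario and its numbers (1% prevalence, 95% sensitivity, 90% specificity) are purely illustrative, not taken from any real test.

```python
# Minimal sketch of Bayes' theorem and expected value (illustrative numbers only).
p_condition = 0.01                 # P(C): prevalence of the condition
p_pos_given_condition = 0.95       # P(+ | C): sensitivity
p_pos_given_no_condition = 0.10    # P(+ | not C): 1 - specificity

# Total probability of a positive test
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_no_condition * (1 - p_condition))

# Bayes' theorem: P(C | +) = P(+ | C) * P(C) / P(+)
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos
print(f"P(condition | positive test) = {p_condition_given_pos:.3f}")

# Expected value of a fair six-sided die: sum of each outcome times its probability
expected_value = sum(x * (1 / 6) for x in range(1, 7))
print(f"Expected value of one die roll = {expected_value:.2f}")  # 3.5
```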
Types of Data and Distributions
Nominal data consists of categories with no inherent order (colors, breeds of dogs)
Ordinal data has categories with a natural order but no consistent scale (rankings, survey responses)
Interval data has ordered categories with consistent intervals but no true zero (temperature in Celsius)
Ratio data possesses all properties of interval data plus a true zero (height, weight, income)
Normal distribution is a symmetric, bell-shaped curve characterized by its mean and standard deviation
Approximately 68% of data falls within one standard deviation of the mean, 95% within two, and 99.7% within three
Binomial distribution models the number of successes in a fixed number of independent trials with two possible outcomes (coin flips, pass/fail exams)
Poisson distribution describes the probability of a given number of events occurring in a fixed interval of time or space (number of customers arriving per hour)
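A minimal simulation of these three distributions, assuming NumPy is available (the means, rates, and sample sizes are arbitrary choices for illustration):

```python
# Minimal sketch of the normal, binomial, and Poisson distributions using NumPy.
import numpy as np

rng = np.random.default_rng(0)

# Normal: roughly 68% of draws fall within one standard deviation of the mean
normal_draws = rng.normal(loc=100, scale=15, size=100_000)
within_1_sd = np.mean(np.abs(normal_draws - 100) <= 15)
print(f"Share within 1 SD: {within_1_sd:.3f}")            # ~0.683

# Binomial: number of heads in 10 fair coin flips, repeated many times
heads = rng.binomial(n=10, p=0.5, size=100_000)
print(f"Average heads in 10 flips: {heads.mean():.2f}")    # ~5

# Poisson: customers arriving per hour with an average rate of 4
arrivals = rng.poisson(lam=4, size=100_000)
print(f"Average arrivals per hour: {arrivals.mean():.2f}") # ~4
```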
Descriptive Statistics
Measures of central tendency summarize data with a single value representing the center or typical value
Mean is the arithmetic average of a dataset, calculated by summing all values and dividing by the number of observations
Median is the middle value when data is ordered from lowest to highest, resistant to outliers
Mode is the most frequently occurring value in a dataset
Measures of dispersion quantify the spread or variability of data
Range is the difference between the maximum and minimum values
Variance measures the average squared deviation from the mean, denoted σ² for a population and s² for a sample
Standard deviation is the square root of variance, expressed in the same units as the original data
Skewness describes the asymmetry of a distribution, with positive skew having a longer right tail and negative skew having a longer left tail
Kurtosis measures the heaviness of the tails relative to a normal distribution, with higher values indicating more extreme outliers
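The following sketch computes the measures above on a small made-up dataset; it assumes NumPy and a recent SciPy (1.9 or newer for the keepdims argument to stats.mode):

```python
# Minimal sketch of descriptive statistics with NumPy and SciPy.
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 4, 5, 5, 5, 6, 7, 30])  # small illustrative dataset with one outlier

print("Mean:           ", np.mean(data))
print("Median:         ", np.median(data))             # resistant to the outlier
print("Mode:           ", stats.mode(data, keepdims=False).mode)
print("Range:          ", data.max() - data.min())
print("Sample variance:", np.var(data, ddof=1))         # ddof=1 gives the sample variance s²
print("Sample std dev: ", np.std(data, ddof=1))
print("Skewness:       ", stats.skew(data))             # positive: long right tail from the outlier
print("Excess kurtosis:", stats.kurtosis(data))         # heavier tails than a normal distribution
```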
Inferential Statistics
Inferential statistics uses sample data to make generalizations or predictions about a larger population
Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole
Simple random sampling gives each member of the population an equal chance of being selected
Stratified sampling divides the population into homogeneous subgroups before sampling to ensure representativeness
Sampling distribution is the distribution of a sample statistic over many samples of the same size
Central Limit Theorem states that the sampling distribution of the mean approaches a normal distribution as sample size increases, regardless of the shape of the population distribution
Confidence interval is a range of values likely to contain the true population parameter with a specified level of confidence (90%, 95%, 99%)
Margin of error is the maximum expected difference between the sample estimate and the true population value, often reported alongside confidence intervals
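A minimal simulation of the Central Limit Theorem and a 95% confidence interval for a mean, assuming NumPy and SciPy are available; the skewed exponential "population" is purely illustrative:

```python
# Minimal sketch: Central Limit Theorem and a 95% confidence interval for a mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
population = rng.exponential(scale=10, size=1_000_000)   # heavily skewed population

# Sampling distribution of the mean: means of many samples of size 50 look roughly normal
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]
print(f"Mean of sample means: {np.mean(sample_means):.2f} "
      f"(population mean ~ {population.mean():.2f})")

# 95% confidence interval from a single sample, using the t distribution
sample = rng.choice(population, size=50)
margin_of_error = stats.t.ppf(0.975, df=len(sample) - 1) * stats.sem(sample)
print(f"95% CI: {sample.mean() - margin_of_error:.2f} to {sample.mean() + margin_of_error:.2f}")
```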
Hypothesis Testing
Hypothesis testing is a statistical method for making decisions or inferences about a population based on sample data
Null hypothesis (H0) is a statement of no effect or no difference, assumed to be true unless there is strong evidence against it
Alternative hypothesis (Ha or H1) is the claim that contradicts the null hypothesis, representing the effect or difference of interest
Type I error (false positive) occurs when the null hypothesis is rejected when it is actually true, denoted by α (significance level)
Type II error (false negative) happens when the null hypothesis is not rejected when it is actually false, denoted by β
p-value is the probability of obtaining a sample statistic at least as extreme as the observed value, assuming the null hypothesis is true
A small p-value (typically < 0.05) provides evidence against the null hypothesis and suggests statistical significance
Test statistic is a standardized value calculated from the sample data used to determine the p-value and make a decision about the null hypothesis (z-score, t-score, chi-square)
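The sketch below runs a two-sample t-test on simulated data with SciPy (assumed available); the group means, spread, and sample sizes are arbitrary illustration values:

```python
# Minimal sketch of a two-sample t-test with SciPy; the data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(loc=50, scale=10, size=40)   # e.g. control group scores
group_b = rng.normal(loc=55, scale=10, size=40)   # e.g. treatment group scores

# H0: the two groups have the same mean; Ha: the means differ
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t statistic = {t_stat:.2f}, p-value = {p_value:.4f}")

alpha = 0.05  # significance level, the accepted Type I error rate
if p_value < alpha:
    print("Reject H0: the difference is statistically significant")
else:
    print("Fail to reject H0: not enough evidence of a difference")
```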
Data Visualization Techniques
Scatter plot displays the relationship between two continuous variables, with each point representing an observation
Positive correlation shows an upward trend, negative correlation shows a downward trend, and no correlation appears as a random scatter
Line graph connects data points in chronological order, useful for showing trends over time
Bar chart compares categories using rectangular bars, with the length proportional to the value
Histogram visualizes the distribution of a continuous variable by dividing the range into bins and displaying the frequency or density of observations in each bin
Box plot summarizes the distribution of a variable using five key statistics: minimum, first quartile, median, third quartile, and maximum
Outliers are plotted as individual points beyond the whiskers, which extend up to 1.5 times the interquartile range beyond the box edges
Heatmap represents data values using colors, often in a grid format, to identify patterns and clusters
Pie chart displays proportions or percentages of a whole, with each slice representing a category
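A minimal sketch of three of these plots with Matplotlib (assumed available), drawn from simulated data:

```python
# Minimal sketch of a scatter plot, histogram, and box plot with Matplotlib; data are simulated.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)          # positively correlated with x
values = rng.normal(loc=0, scale=1, size=1_000)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].scatter(x, y, s=10)               # scatter plot: relationship between two variables
axes[0].set_title("Scatter plot")

axes[1].hist(values, bins=30)             # histogram: distribution of a continuous variable
axes[1].set_title("Histogram")

axes[2].boxplot(values)                   # box plot: five-number summary plus outliers
axes[2].set_title("Box plot")

plt.tight_layout()
plt.show()
```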
Applications in Data Science
Exploratory data analysis (EDA) involves summarizing and visualizing data to uncover patterns, relationships, and anomalies before applying formal modeling techniques
Predictive modeling uses historical data to build models that can forecast future outcomes or behaviors (customer churn, sales revenue)
Clustering is an unsupervised learning technique that groups similar observations based on their features without predefined labels (k-means, hierarchical clustering)
Anomaly detection identifies rare or unusual observations that deviate significantly from the norm, useful for fraud detection and quality control
A/B testing compares two versions of a product or service to determine which performs better based on a specific metric (click-through rate, conversion rate)
Time series analysis examines data collected over regular intervals to extract meaningful statistics, uncover trends, and make forecasts (stock prices, weather patterns)
Natural language processing (NLP) applies statistical and computational techniques to analyze and understand human language data (sentiment analysis, topic modeling)
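As one worked example from this list, an A/B test on conversion rates can be evaluated with a chi-square test of independence; the sketch below assumes SciPy is available, and the conversion counts are made up for illustration:

```python
# Minimal sketch of an A/B test on conversion rates using a chi-square test of independence.
import numpy as np
from scipy import stats

# Rows: version A and version B; columns: converted vs. not converted
observed = np.array([
    [120, 880],   # version A: 12.0% conversion
    [150, 850],   # version B: 15.0% conversion
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")
# A small p-value suggests the difference in conversion rates is unlikely to be due to chance alone.
```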