🎲 Data Science Statistics Unit 1 – Intro to Probability & Stats for Data Science

Probability and statistics form the backbone of data science, providing essential tools for analyzing and interpreting data. This unit covers key concepts like probability, types of data, descriptive statistics, and inferential methods, laying the groundwork for more advanced techniques. From basic probability rules to hypothesis testing and data visualization, these foundational skills enable data scientists to extract meaningful insights from complex datasets. Understanding these concepts is crucial for making informed decisions, building predictive models, and communicating findings effectively in various data science applications.

Key Concepts and Definitions

  • Probability quantifies the likelihood of an event occurring and ranges from 0 to 1
    • 0 indicates an impossible event, while 1 represents a certain event
  • Statistics involves collecting, analyzing, and interpreting data to make informed decisions
  • Population refers to the entire group of individuals or objects under study
  • Sample is a subset of the population used to draw inferences about the whole
  • Variable is a characteristic or attribute that can take on different values (age, height, income)
    • Categorical variables have distinct categories or groups (gender, race, marital status)
    • Continuous variables can take on any value within a range (weight, temperature, time)
  • Distribution describes how data is spread out or dispersed across different values
  • Hypothesis is a statement or claim about a population parameter that can be tested using sample data
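To make the variable types concrete, here is a minimal sketch using pandas (assuming it is installed); the small dataset and column names are made up for illustration. It shows how a sample of observations can mix categorical and continuous variables.

```python
import pandas as pd

# Hypothetical sample of individuals drawn from a larger population
sample = pd.DataFrame({
    "marital_status": ["single", "married", "married", "single"],  # categorical variable
    "age": [23, 35, 41, 29],                                       # continuous variable
    "income": [42000.0, 58000.5, 61000.0, 47500.0],                # continuous variable
})

# Categorical columns can be stored with an explicit dtype
sample["marital_status"] = sample["marital_status"].astype("category")

print(sample.dtypes)      # shows which columns are categorical vs. numeric
print(sample.describe())  # summary statistics for the continuous variables
```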

Probability Basics

  • Probability is expressed as a number between 0 and 1, often as a decimal or fraction
  • The sum of probabilities for all possible outcomes in a sample space equals 1
  • Independent events have no influence on each other's occurrence (flipping a coin multiple times)
  • Dependent events affect the probability of subsequent events (drawing cards without replacement)
  • Conditional probability measures the likelihood of an event occurring given that another event has already occurred, denoted as P(A|B)
  • Bayes' theorem relates conditional probabilities and can be used to update probabilities based on new information
  • Expected value is the average outcome of an experiment if repeated many times, calculated by multiplying each possible outcome by its probability and summing the results
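The following sketch works through Bayes' theorem and expected value in plain Python; the diagnostic-test probabilities are made-up numbers chosen only to illustrate how a prior gets updated.

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Illustrative (made-up) numbers for a diagnostic-test style example
p_disease = 0.01              # P(A): prior probability of having the condition
p_pos_given_disease = 0.95    # P(B|A): test sensitivity
p_pos_given_healthy = 0.05    # false-positive rate

# Total probability of a positive test, P(B)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior probability of the condition given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")

# Expected value: each outcome weighted by its probability (fair six-sided die)
outcomes = [1, 2, 3, 4, 5, 6]
expected_value = sum(x * (1 / 6) for x in outcomes)
print(f"Expected value of a die roll = {expected_value:.2f}")  # 3.50
```

Note how a small prior keeps the posterior modest even with a sensitive test, which is exactly the kind of update Bayes' theorem formalizes.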

Types of Data and Distributions

  • Nominal data consists of categories with no inherent order (colors, breeds of dogs)
  • Ordinal data has categories with a natural order but no consistent scale (rankings, survey responses)
  • Interval data has ordered categories with consistent intervals but no true zero (temperature in Celsius)
  • Ratio data possesses all properties of interval data plus a true zero (height, weight, income)
  • Normal distribution is a symmetric, bell-shaped curve characterized by its mean and standard deviation
    • Approximately 68% of data falls within one standard deviation of the mean, 95% within two, and 99.7% within three
  • Binomial distribution models the number of successes in a fixed number of independent trials with two possible outcomes (coin flips, pass/fail exams)
  • Poisson distribution describes the probability of a given number of events occurring in a fixed interval of time or space (number of customers arriving per hour)
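A minimal sketch with SciPy (assuming scipy is installed) shows these three distributions in action; the parameter values are arbitrary examples, not prescribed by the unit.

```python
from scipy import stats

# Normal distribution: ~68/95/99.7% of mass within 1/2/3 standard deviations of the mean
normal = stats.norm(loc=0, scale=1)
print(normal.cdf(1) - normal.cdf(-1))    # ≈ 0.6827
print(normal.cdf(2) - normal.cdf(-2))    # ≈ 0.9545

# Binomial: probability of exactly 7 heads in 10 fair coin flips
print(stats.binom.pmf(k=7, n=10, p=0.5))

# Poisson: probability of exactly 3 arrivals in an hour, given an average of 5 per hour
print(stats.poisson.pmf(k=3, mu=5))
```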

Descriptive Statistics

  • Measures of central tendency summarize data with a single value representing the center or typical value
    • Mean is the arithmetic average of a dataset, calculated by summing all values and dividing by the number of observations
    • Median is the middle value when data is ordered from lowest to highest, resistant to outliers
    • Mode is the most frequently occurring value in a dataset
  • Measures of dispersion quantify the spread or variability of data
    • Range is the difference between the maximum and minimum values
    • Variance measures the average squared deviation from the mean, denoted as σ² for a population and s² for a sample
    • Standard deviation is the square root of variance, expressed in the same units as the original data
  • Skewness describes the asymmetry of a distribution, with positive skew having a longer right tail and negative skew having a longer left tail
  • Kurtosis measures the thickness of the tails relative to a normal distribution, with higher values indicating more extreme outliers
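All of these summaries are one-liners in NumPy and SciPy (assuming both are installed); the exam-score values below are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical small dataset (e.g., exam scores)
data = np.array([55, 61, 64, 70, 72, 75, 75, 80, 85, 98])

print("mean:", np.mean(data))
print("median:", np.median(data))
print("mode:", stats.mode(data, keepdims=False).mode)   # SciPy >= 1.9
print("range:", np.max(data) - np.min(data))
print("sample variance:", np.var(data, ddof=1))          # ddof=1 gives the sample estimate s²
print("sample std dev:", np.std(data, ddof=1))
print("skewness:", stats.skew(data))
print("excess kurtosis:", stats.kurtosis(data))          # 0 for a normal distribution
```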

Inferential Statistics

  • Inferential statistics uses sample data to make generalizations or predictions about a larger population
  • Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole
    • Simple random sampling gives each member of the population an equal chance of being selected
    • Stratified sampling divides the population into homogeneous subgroups before sampling to ensure representativeness
  • Sampling distribution is the distribution of a sample statistic over many samples of the same size
  • Central Limit Theorem states that the sampling distribution of the mean approaches a normal distribution as sample size increases, regardless of the shape of the population distribution
  • Confidence interval is a range of values likely to contain the true population parameter with a specified level of confidence (90%, 95%, 99%)
  • Margin of error is the maximum expected difference between the sample estimate and the true population value, often reported alongside confidence intervals
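The sketch below, using NumPy and SciPy with a made-up skewed population, illustrates two of these ideas: the Central Limit Theorem (sample means look roughly normal even when the population is not) and a 95% confidence interval for the mean based on the t distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Central Limit Theorem: means of repeated samples from a skewed (exponential)
# population are approximately normally distributed around the population mean (≈ 2.0)
population = rng.exponential(scale=2.0, size=100_000)
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]
print("mean of sample means:", np.mean(sample_means))

# 95% confidence interval for the mean of one sample, using the t distribution
sample = rng.choice(population, size=50)
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(),
                      scale=stats.sem(sample))
print("95% CI for the population mean:", ci)
```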

Hypothesis Testing

  • Hypothesis testing is a statistical method for making decisions or inferences about a population based on sample data
  • Null hypothesis (H₀) is a statement of no effect or no difference, assumed to be true unless there is strong evidence against it
  • Alternative hypothesis (Hₐ or H₁) is the claim that contradicts the null hypothesis, representing the effect or difference of interest
  • Type I error (false positive) occurs when the null hypothesis is rejected when it is actually true, denoted by α (significance level)
  • Type II error (false negative) happens when the null hypothesis is not rejected when it is actually false, denoted by β
  • p-value is the probability of obtaining a sample statistic at least as extreme as the observed value, assuming the null hypothesis is true
    • A small p-value (typically < 0.05) provides evidence against the null hypothesis and suggests statistical significance
  • Test statistic is a standardized value calculated from the sample data used to determine the p-value and make a decision about the null hypothesis (z-score, t-score, chi-square)
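A minimal two-sample t-test in SciPy ties these pieces together; the "control" and "treatment" data are simulated here purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical measurements from two groups (e.g., control vs. treatment)
control = rng.normal(loc=50.0, scale=5.0, size=40)
treatment = rng.normal(loc=53.0, scale=5.0, size=40)

# H0: the two group means are equal; Ha: they differ (two-sided test)
t_stat, p_value = stats.ttest_ind(control, treatment)

alpha = 0.05  # significance level (probability of a Type I error)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the difference is statistically significant")
else:
    print("Fail to reject H0: insufficient evidence of a difference")
```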

Data Visualization Techniques

  • Scatter plot displays the relationship between two continuous variables, with each point representing an observation
    • Positive correlation shows an upward trend, negative correlation shows a downward trend, and no correlation appears as a random scatter
  • Line graph connects data points in chronological order, useful for showing trends over time
  • Bar chart compares categories using rectangular bars, with the length proportional to the value
  • Histogram visualizes the distribution of a continuous variable by dividing the range into bins and displaying the frequency or density of observations in each bin
  • Box plot summarizes the distribution of a variable using five key statistics: minimum, first quartile, median, third quartile, and maximum
    • Outliers are plotted as individual points beyond the whiskers, which extend 1.5 times the interquartile range from the box edges
  • Heatmap represents data values using colors, often in a grid format, to identify patterns and clusters
  • Pie chart displays proportions or percentages of a whole, with each slice representing a category
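A short matplotlib sketch (assuming matplotlib and NumPy are installed, with simulated data) shows three of these plot types side by side: a scatter plot of two correlated variables, a histogram, and a box plot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)   # positively correlated with x

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].scatter(x, y, s=10)   # scatter plot: relationship between two continuous variables
axes[0].set_title("Scatter plot")

axes[1].hist(x, bins=20)      # histogram: distribution of a continuous variable
axes[1].set_title("Histogram")

axes[2].boxplot(x)            # box plot: five-number summary with outliers
axes[2].set_title("Box plot")

plt.tight_layout()
plt.show()
```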

Applications in Data Science

  • Exploratory data analysis (EDA) involves summarizing and visualizing data to uncover patterns, relationships, and anomalies before applying formal modeling techniques
  • Predictive modeling uses historical data to build models that can forecast future outcomes or behaviors (customer churn, sales revenue)
    • Regression models predict continuous target variables (linear regression, polynomial regression)
    • Classification models predict categorical target variables (logistic regression, decision trees, support vector machines)
  • Clustering is an unsupervised learning technique that groups similar observations based on their features without predefined labels (k-means, hierarchical clustering)
  • Anomaly detection identifies rare or unusual observations that deviate significantly from the norm, useful for fraud detection and quality control
  • A/B testing compares two versions of a product or service to determine which performs better based on a specific metric (click-through rate, conversion rate)
  • Time series analysis examines data collected over regular intervals to extract meaningful statistics, uncover trends, and make forecasts (stock prices, weather patterns)
  • Natural language processing (NLP) applies statistical and computational techniques to analyze and understand human language data (sentiment analysis, topic modeling)
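As one example of these applications, the sketch below analyzes a hypothetical A/B test with a chi-square test of independence from SciPy; the visitor and conversion counts are invented for illustration.

```python
from scipy import stats

# Hypothetical A/B test: [conversions, non-conversions] for two page variants
variant_a = [120, 2380]   # 120 conversions out of 2500 visitors (4.8%)
variant_b = [155, 2345]   # 155 conversions out of 2500 visitors (6.2%)

# Chi-square test of independence: H0 says conversion rate does not depend on the variant
chi2, p_value, dof, expected = stats.chi2_contingency([variant_a, variant_b])

print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The variants differ significantly in conversion rate")
else:
    print("No significant difference detected")
```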


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.