What Are Random Variables and Probability in Data Science?
Have you ever wondered how we predict things that happen by chance, like flipping a coin or waiting for an elevator? In data science, we use statistics to solve these mysteries! This guide answers questions about random variables, probability density functions (PDFs), and cumulative distribution functions (CDFs)—tools that help us understand probability.
These ideas come from the “Statistics for Data Science” notes by Teesside University London. We’ll break them down with examples, equations, and easy explanations so anyone can follow along.
What Is a Random Variable?
A random variable is a number tied to something random. What does that look like? Imagine flipping a coin twice and counting the heads. That number is a random variable because it depends on luck!
Here’s an example from the notes:
- You flip a coin twice. The possible results are:
- Heads, Heads (hh)
- Heads, Tails (ht)
- Tails, Heads (th)
- Tails, Tails (tt)
- We call the number of heads “X.” So:
X(hh) = 2, X(ht) = 1, X(th) = 1, X(tt) = 0
X can be 0, 1, or 2, and its value can change from one pair of flips to the next! The formal way to write it is:
X: S → ℝ
This means X takes outcomes from a sample space (S—all possible flips) and gives them numbers.
- X: Represents a function or random variable.
- S: The sample space or domain of X.
- →: Indicates a mapping or function.
- ℝ: The set of real numbers (the codomain of X, where its values live).
- X: S → ℝ: X assigns a real number to each element in S.
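To make the idea of "a function from outcomes to numbers" concrete, here is a minimal Python sketch of the two-flip example. The names `sample_space` and `X` are chosen here for illustration and do not come from the notes.

```python
from itertools import product

# Sample space S for two coin flips: every ordered pair of 'h' and 't'.
sample_space = ["".join(flips) for flips in product("ht", repeat=2)]

def X(outcome: str) -> int:
    """Random variable X: map an outcome such as 'ht' to its number of heads."""
    return outcome.count("h")

# Print the value X assigns to each outcome in S.
for outcome in sample_space:
    print(f"X({outcome}) = {X(outcome)}")
```

The function itself is not random at all. The randomness comes from which outcome in S actually happens.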
What Are the Types of Random Variables?
Random variables come in two types. What are they?
- Discrete Random Variable: These are numbers you can count, like how many heads you get. Examples:
- How many times a die lands on 4 in 100 rolls.
- How many tails in 20 coin flips.
- Continuous Random Variable: These are numbers you measure, like time or height. Examples:
- Time waiting for an elevator (e.g., 1.3 minutes).
- A person’s height (e.g., 4.8 feet).
How do you know which is which? If you can count it (like candies), it’s discrete. If you have to measure it (like a length with a ruler), it’s continuous.
What Does a Random Variable Need?
Think of a random variable as a magical box that gives you different numbers based on chance! To work properly, it needs two important things:
A List of Possible Numbers (Range)
A random variable can only give certain numbers as results. These numbers depend on what you are looking at:
- If you flip a coin, the possible results could be 0 for tails and 1 for heads.
- If you roll a six-sided die, the possible results are 1, 2, 3, 4, 5, and 6.
- If you measure how tall your friends are, the possible results could be any number like 4.5 feet, 5.2 feet, or 6 feet.
A Rule for Figuring Out How Likely Each Number Is
Not all numbers are equally likely! We need a way to know which numbers are more common and which are rare:
- For a fair coin, heads and tails each happen half the time (50% each).
- For a fair die, each number (1 to 6) happens one-sixth of the time (about 16.7% each).
- If you measure people’s heights, shorter and taller heights might be less common, and most people might be close to an average height.
This can be described using special rules like a Probability Density Function (PDF) for continuous numbers or a Probability Mass Function (PMF) for counting numbers.
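For a discrete random variable, the "rule" can be written out as a small table. The sketch below builds one for the number of heads in two fair coin flips by listing every outcome and counting. The helper name `pmf_heads` is made up for this example, and the coin is assumed to be fair.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def pmf_heads(n_flips: int) -> dict[int, Fraction]:
    """PMF of the number of heads in n_flips fair coin flips, by enumeration."""
    outcomes = list(product("ht", repeat=n_flips))
    # Count how many equally likely outcomes give each number of heads.
    counts = Counter(outcome.count("h") for outcome in outcomes)
    return {k: Fraction(c, len(outcomes)) for k, c in sorted(counts.items())}

pmf = pmf_heads(2)
for value, prob in pmf.items():
    print(f"P(X = {value}) = {prob}")
```

The probabilities in the table add up to 1, which is the discrete version of "all the chances together make 100%".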
Why Does This Matter?
Understanding random variables helps us predict things! If we know the possible numbers and how often they happen, we can guess what might come next. Whether it’s rolling dice in a game or predicting tomorrow’s weather, random variables help make sense of the world around us!
What Is a Probability Density Function (PDF)?
A Probability Density Function (PDF) is a mathematical tool used to describe the likelihood of different outcomes for a continuous random variable—something that can take any value within a specific range. For example, imagine waiting for an elevator, where the wait time could be anywhere between 0 and 2 minutes. The PDF doesn’t give exact probabilities for specific points (like 1 or 2 minutes) but instead describes the relative likelihood of all possible values in between (e.g., 0.7 or 1.3 minutes).
Example: Waiting for an Elevator
Let’s say X represents the wait time for an elevator, ranging from 0 to 2 minutes. The PDF for this scenario is defined as follows:
- f(x) = x, if 0 ≤ x ≤ 1 (the density rises as the wait approaches 1 minute).
- f(x) = 2 - x, if 1 < x ≤ 2 (the density falls as the wait goes past 1 minute).
- f(x) = 0, otherwise (there’s no chance of waiting -1 or 3 minutes).
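The three pieces of the elevator PDF translate directly into a small Python function. This is only a sketch of the example's formula; the name `elevator_pdf` is an assumption made here.

```python
def elevator_pdf(x: float) -> float:
    """Density of the elevator wait time X (in minutes) from the example."""
    if 0 <= x <= 1:
        return x        # rising side of the hill
    if 1 < x <= 2:
        return 2 - x    # falling side of the hill
    return 0.0          # impossible wait times

for x in (-1, 0.5, 1, 1.3, 3):
    print(x, elevator_pdf(x))
```

Note that the returned values are densities, not probabilities. Probabilities only appear once the function is integrated over a range.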
Interpreting the PDF
The PDF can be visualized as a hill:
- It starts at 0, climbs to a peak at 1 minute, and then descends back to 0 by 2 minutes.
- The total area under this “hill” equals 1, meaning all possible probabilities sum to 100%.
Let’s break it down mathematically:
- From 0 to 1: ∫₀¹ x dx = [x²/2]₀¹ = (1²/2) - (0²/2) = 0.5
- From 1 to 2: ∫₁² (2 - x) dx = [2x - x²/2]₁² = (4 - 2) - (2 - 0.5) = 0.5
- Total Area: 0.5 + 0.5 = 1
The symbol ∫ represents integration, which is a way to “add up all the tiny pieces” under the curve. Since the total area is 1, the PDF is valid.
Using the PDF to Calculate Probabilities
Suppose you want to find the probability of waiting less than 0.5 minutes. You can calculate this using the PDF:
P(0 ≤ X ≤ 0.5) = ∫₀⁰·⁵ x dx = [x²/2]₀⁰·⁵ = (0.5²/2) - (0²/2) = 0.125
This means there’s a 12.5% chance (or about 1 in 8) that you’ll wait less than half a minute.
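If you would rather let a computer "add up the tiny pieces", the same probability can be approximated numerically. The sketch below uses the midpoint rule, a simple numerical integration method chosen here for illustration; `elevator_pdf` and `probability_between` are hypothetical names.

```python
def elevator_pdf(x: float) -> float:
    """Density of the elevator wait time X (in minutes) from the example."""
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0

def probability_between(a: float, b: float, n: int = 10_000) -> float:
    """Approximate P(a <= X <= b) by midpoint-rule integration of the PDF."""
    width = (b - a) / n
    # Sum the density at the middle of each thin slice, times the slice width.
    return sum(elevator_pdf(a + (i + 0.5) * width) for i in range(n)) * width

p_short_wait = probability_between(0, 0.5)
print(round(p_short_wait, 6))
```

The result should agree with the hand calculation of 0.125 up to rounding error.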
Rules for a Valid PDF
A PDF must satisfy the following conditions:
- Non-Negativity: f(x) ≥ 0 for all x. (Probabilities can’t be negative.)
- Smoothness: The function does not have to be perfectly smooth. It can have a corner (like at x = 1 in this example) or a few jumps, as long as the area under it can still be computed.
- Total Area Equals 1: ∫₋∞⁺∞ f(x) dx = 1 (The entire area under the curve must sum to 100%.)
- Probability Between Two Points: P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx (To find the probability of X being between a and b, integrate the PDF over that range.)
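The two numeric rules (non-negativity and total area of 1) can be spot-checked in code. The sketch below is a rough check, not a proof: it samples the function on a grid and integrates with the midpoint rule, and it assumes the function is zero outside the interval you pass in. All names, including the counterexample `not_a_pdf`, are invented for this illustration.

```python
def integrate(f, a: float, b: float, n: int = 20_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

def looks_like_valid_pdf(f, low: float, high: float, tol: float = 1e-6) -> bool:
    """Check f >= 0 on a grid and total area close to 1 over [low, high].

    Assumes f is zero outside [low, high].
    """
    grid = [low + (high - low) * i / 1000 for i in range(1001)]
    non_negative = all(f(x) >= 0 for x in grid)
    return non_negative and abs(integrate(f, low, high) - 1) < tol

def elevator_pdf(x: float) -> float:
    """Density of the elevator wait time from the example."""
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0

def not_a_pdf(x: float) -> float:
    """Hypothetical counterexample: its area over [0, 2] is 2, not 1."""
    return x if 0 <= x <= 2 else 0.0

print(looks_like_valid_pdf(elevator_pdf, 0, 2))
print(looks_like_valid_pdf(not_a_pdf, 0, 2))
```

The elevator function passes both checks, while the counterexample fails because its area is too large.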
Key Terms and Symbols
Here’s a quick guide to the mathematical notation used:
- f(x): The PDF function; it describes the likelihood of x.
- X: The random variable being measured (e.g., wait time).
- ∫: The integral, a mathematical operation that sums up tiny pieces of a function.
- dx: An infinitesimally small piece of x being summed.
- P(a ≤ X ≤ b): The probability that X falls between a and b.
- ≥: Greater than or equal to; used to ensure f(x) is non-negative.
- ∫₋∞⁺∞: From negative infinity to infinity; covers all possible values of X.
In summary, a PDF is a powerful tool for understanding how probabilities are distributed across a range of possible outcomes. It’s particularly useful for continuous random variables, where exact probabilities for specific points are less meaningful than the overall distribution.
What Is a Cumulative Distribution Function (CDF)?
A CDF is like a running total of chances. It tells you the probability that something (like waiting for an elevator) is less than or equal to a certain time. Imagine a progress bar that starts at 0% and fills up to 100% as time goes on.
Using the same elevator example (waiting 0 to 2 minutes), here’s the CDF:
- F(x) = 0, if x < 0 (no chance before 0 minutes).
- F(x) = x²/2, if 0 ≤ x ≤ 1 (grows slowly at first).
- F(x) = 2x - (x²/2) - 1, if 1 < x ≤ 2 (keeps growing, but curves).
- F(x) = 1, if x > 2 (100% chance by 2 minutes).
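The four pieces of the CDF can be written as one Python function. This is a sketch of the formulas listed above, with `elevator_cdf` as an assumed name.

```python
def elevator_cdf(x: float) -> float:
    """P(X <= x) for the elevator wait time X (in minutes) from the example."""
    if x < 0:
        return 0.0
    if x <= 1:
        return x**2 / 2
    if x <= 2:
        return 2 * x - x**2 / 2 - 1
    return 1.0

for x in (-1, 0.5, 1, 1.5, 2, 3):
    print(x, elevator_cdf(x))
```

Printing a few values shows the "progress bar" behaviour: the output never decreases as x grows, and it stays between 0 and 1.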
How Do You Use a CDF?
Want to know the chance of waiting 0.5 minutes or less?
- F(0.5) = (0.5²/2) = 0.25/2 = 0.125.
- That’s 12.5%—about 1 in 8 times.
How about 1.5 minutes or less?
- F(1.5) = 2(1.5) - (1.5²/2) - 1 = 3 - 1.125 - 1 = 0.875.
- That’s 87.5%—very likely!
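What about 1 minute or less?
- F(1) = (1²/2) = 0.5.
- That’s 50%: half the time you’ll wait 1 minute or less. (The chance of waiting exactly 1 minute is 0, because a continuous random variable has zero probability at any single point.)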
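Because the CDF is a running total, the chance of landing between two times is the difference of two CDF values. As a worked example (the interval from 0.5 to 1.5 minutes is picked here purely for illustration), reusing F(0.5) and F(1.5) from above:

```latex
\begin{align*}
P(0.5 < X \le 1.5) &= F(1.5) - F(0.5) \\
                   &= 0.875 - 0.125 \\
                   &= 0.75
\end{align*}
```

So about three waits out of four fall between half a minute and a minute and a half.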
The CDF starts at 0 and climbs to 1, showing how the chances build up.
How Is the CDF Made?
The CDF comes from adding up the PDF step by step. Here’s how:
- For x < 0:
- F(x) = ∫_{-∞}^x 0 dt = 0 (nothing happens before 0, so it’s 0).
- For 0 ≤ x ≤ 1:
- F(x) = ∫_0^x t dt = [t²/2]_0^x = x²/2 (add up from 0 to x).
- For 1 < x ≤ 2:
- F(x) = ∫_0^1 t dt + ∫_1^x (2 - t) dt = 0.5 + [2x - x²/2 - 1.5] = 2x - (x²/2) - 1 (add the first part, then from 1 to x).
- For x > 2:
- F(x) = ∫_{-∞}^x f(t) dt = ∫_0^2 f(t) dt = 1 (f is 0 beyond 2, so everything’s added up by 2 and it’s 100%).
It’s like counting how much chance you’ve collected as you go!
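One way to convince yourself that the CDF really is the accumulated PDF is to integrate the PDF numerically and compare the result with the closed-form pieces. The sketch below does that with the midpoint rule. The function names are assumptions made for this example, and the comparison is approximate rather than exact.

```python
def elevator_pdf(x: float) -> float:
    """Density of the elevator wait time from the example."""
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0

def elevator_cdf(x: float) -> float:
    """Closed-form CDF pieces derived in the text."""
    if x < 0:
        return 0.0
    if x <= 1:
        return x**2 / 2
    if x <= 2:
        return 2 * x - x**2 / 2 - 1
    return 1.0

def cdf_by_integration(x: float, n: int = 20_000) -> float:
    """Accumulate the PDF from 0 up to x (the density is 0 below 0 and above 2)."""
    if x <= 0:
        return 0.0
    upper = min(x, 2.0)
    width = upper / n
    return sum(elevator_pdf((i + 0.5) * width) for i in range(n)) * width

for x in (0.5, 1.0, 1.5, 2.5):
    print(x, round(cdf_by_integration(x), 4), elevator_cdf(x))
```

The two columns should match to several decimal places at every point you try.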
English for the Equations with Statistical Symbols
- F(x): The “CDF function”—shows the total probability up to x.
- x: The value you’re checking (like wait time).
- ∫: The “integral”—means “add up all the little pieces.”
- dt or dx: A tiny bit of time or value you’re adding.
- t: A placeholder for the values you’re adding up.
- ≤: “Less than or equal to”—used to set the ranges.
- ∫_{-∞}^x: “From negative infinity to x”; covers everything up to your point.
That’s the CDF: a simple way to track how chances pile up!
Why Are These Ideas Important?
Why do random variables, PDFs, and CDFs matter? They let us:
- Predict outcomes in games (like dice rolls).
- Understand waits or measurements (like elevator times).
- Make sense of data in data science.
For example:
- What’s the chance of rain? Use a random variable.
- How long until a bus arrives? Check a PDF or CDF.
How Can You Try These Concepts?
How can you test these ideas yourself? Here are some ways:
- Coin Flip Challenge
- Flip a coin 10 times (or 5 pairs). Count heads (0, 1, 2 for pairs).
- What’s the range? How often do you get 1?
- Waiting Game
- Time something short (up to 2 minutes, like a phone ringing).
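- Compare with the elevator PDF, which gives a 0.125 chance of waiting less than 0.5 minutes. Do your timings follow a similar pattern, or would they need a different PDF?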
- Dice Roll Fun
- Roll a die 10 times. Count how many 4s. That’s discrete!
- What’s the range of possible 4s?
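If you don't have a coin, a stopwatch, or a die handy, the three activities can be simulated. The sketch below uses Python's standard `random` module. It assumes a fair coin and a fair die, and it assumes the waits follow the elevator example's triangular shape, which `random.triangular(low, high, mode)` can draw from. The variable names and the number of trials are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Coin flip challenge: number of heads in each of 10,000 pairs of flips.
pairs = [sum(random.random() < 0.5 for _ in range(2)) for _ in range(10_000)]
share_one_head = pairs.count(1) / len(pairs)

# Waiting game: simulated waits between 0 and 2 minutes, peaking at 1 minute.
waits = [random.triangular(0, 2, 1) for _ in range(10_000)]
share_short_wait = sum(w < 0.5 for w in waits) / len(waits)

# Dice roll: number of 4s in 10 rolls of a fair die.
fours = sum(random.randint(1, 6) == 4 for _ in range(10))

print("Share of pairs with exactly one head:", share_one_head)
print("Share of waits under 0.5 minutes:", share_short_wait)
print("Number of 4s in 10 rolls:", fours)
```

With this many trials, the first share should land near 0.5 and the second near 0.125, though the exact numbers change if you change the seed.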
Conclusion
What did we discover? Random variables turn random events into numbers, PDFs show how likely continuous outcomes are, and CDFs add up those chances. From coin flips to elevator waits, these tools from Andrew et al. help us explore probability in data science. They’re like a guidebook for understanding the unpredictable—super useful for tackling real-world questions!
Additional Reading and Courses
- Khan Academy - Statistics and Probability
Link: khanacademy.org/math/statistics-probability
Description: Free lessons on statistics and probability with examples like dice and coins.
- edX - Introduction to Probability
Link: edx.org/course/introduction-to-probability
Description: Free course on probability, including random variables and more.