Video/Text

Subscribers only

84 Lessons

Module 3 The certified biostatistician 2026

Lesson 1 module 3 Intro to Statistical Relationships

7 Lessons

Premium course

Lesson 1 module 3 Intro to Statistical Relationships Video

1. Are relationships between variables always linear?
2. Why and how the correlation coefficient was developed
3. Pearson and Spearman correlation formulas
4. How to interpret the correlation coefficient
5. A first hint of linear regression

1.1 Not all relationships are linear

- Relationships can be linear (positive or negative) or non-linear.
- Pearson correlation and linear regression require a linear relationship.
- Always inspect a scatter plot before choosing your statistical method.
- Non-linear patterns (e.g. warfarin–INR) need different approaches.

1.2 From Variance to Covariance

- Variance measures how one variable spreads around its own mean.
- Covariance extends this idea to two variables by multiplying their deviations together.
- A positive covariance means the variables tend to move in the same direction.
- A negative covariance means they move in opposite directions.
- Covariance cannot tell you how strong the relationship is because it depends on the units of measurement.
- Standardising covariance by dividing by both standard deviations gives us Pearson r.
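As a minimal sketch of the chain above (variance, then covariance, then Pearson r), the base R snippet below works through the arithmetic by hand. The x and y vectors are hypothetical toy data, not taken from the course material.

```r
# Hypothetical toy data (illustrative only, not course data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

dev_x <- x - mean(x)   # deviations of x from its own mean
dev_y <- y - mean(y)   # deviations of y from its own mean

var_x  <- sum(dev_x^2) / (n - 1)         # sample variance of x
cov_xy <- sum(dev_x * dev_y) / (n - 1)   # sample covariance of x and y

# Covariance depends on units: multiplying x by 1000 multiplies it by 1000
cov_rescaled <- cov(x * 1000, y)

# Dividing by both standard deviations removes the units
r_manual   <- cov_xy / (sd(x) * sd(y))
r_rescaled <- cor(x * 1000, y)           # unchanged by the rescaling

print(c(var_x = var_x, cov_xy = cov_xy, cov_rescaled = cov_rescaled,
        r_manual = r_manual, r_rescaled = r_rescaled))
```

The rescaled covariance is 1000 times larger while r is identical, which is the reason the standardised version is the one that gets interpreted.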

1.3: The Birth of the Correlation Coefficient

- Pearson r is covariance divided by the product of both standard deviations.
- This standardisation forces r to always fall between -1 and +1, making it unit-free and comparable.
- Spearman's rs uses ranks instead of raw values, making it robust to outliers and valid for ordinal data.
- Use Pearson for continuous, normally distributed, linear data with no major outliers.
- Use Spearman when data are skewed, ordinal, or contain influential outliers.
- In R, change method = "pearson" to method = "spearman"; the rest of the function call is identical.
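A small sketch of the Pearson versus Spearman contrast, assuming base R's cor() as the function being called (the summary above does not name it). The data are hypothetical and include one deliberately extreme x value.

```r
# Hypothetical data: perfectly monotonic, but the last x value is an outlier
x <- c(10, 20, 30, 40, 500)
y <- c(1, 2, 3, 4, 5)

r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")

# Spearman is simply Pearson applied to the ranks
r_on_ranks <- cor(rank(x), rank(y))

print(c(pearson = r_pearson, spearman = r_spearman, on_ranks = r_on_ranks))
```

Because the ranks of x and y agree exactly, Spearman's coefficient is 1, while the outlier pulls Pearson's r below 1.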

1.4: Reading the r Value, Sign, Magnitude and the Interpretation Table

- r has two components: sign (direction) and magnitude (strength). Read both, every time.
- A negative r is not a weak r. An r of -0.90 is stronger than an r of +0.20.
- The standard thresholds for the absolute value of r are: 0.00-0.30 negligible, 0.30-0.50 low, 0.50-0.70 moderate, 0.70-0.90 high, 0.90-1.00 very high.
- The same r value means different things in different clinical fields. Always interpret in context.
- Correlation never proves causation. Always consider confounding variables.

1.5: First Look at Linear Regression

- Correlation measures strength of association. Regression gives you an equation to predict one variable from another.
- The regression equation is Y = a + bX, where a is the intercept and b is the slope.
- The slope b tells you how much Y changes for every 1-unit increase in X. This is the clinically actionable number.
- The intercept anchors the line mathematically. It is often not meaningful on its own in clinical contexts.
- Never use a regression equation to predict values outside the range of your original data.
- Lecture 2 will show you how the best-fit line is mathematically calculated and how to assess model quality.

Interactive R session on Lesson 1 module 3

Lesson 2 module 3 Simple Linear Regression

7 Lessons

Lesson 2 module 3 Simple Linear Regression Video

You can use simple linear regression when you want to know:
1. How strong the relationship is between two variables (e.g. the relationship between weight and height).
2. The value of the dependent variable at a given value of the independent variable (e.g. the expected weight at a specific height).

2.1 — When to Use Simple Linear Regression

- SLR answers two questions: how strong is the relationship, and what is the predicted value at a given X?
- Both predictor and outcome must be continuous, and the relationship must be approximately linear.
- SLR requires four assumptions: linearity, independence, normality of residuals, and homoscedasticity.
- The assumptions are checked after fitting, using diagnostic plots from plot(model).
- When the outcome is categorical or you have multiple predictors, different models apply.

Lesson 2.2 — How R Finds the Best Line: The Least Squares Method

- Least squares finds the line that minimises the sum of squared residuals across all data points.
- Slope: b1 = sum of products of deviations / sum of squared X deviations.
- Intercept: b0 = y_bar minus (b1 times x_bar); it anchors the line through the means.
- The slope is the clinically meaningful number: it is the change in Y for each one-unit change in X.
- lm() in R performs these exact calculations; the manual method confirms what is happening inside.
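The slope and intercept formulas above can be checked against lm() with a few lines of base R. This is only a sketch: the height and weight values are hypothetical, chosen so the arithmetic is easy to follow.

```r
# Hypothetical data (illustrative only): height in cm, weight in kg
height <- c(150, 160, 170, 180, 190)
weight <- c(52, 58, 69, 77, 84)

dev_h <- height - mean(height)
dev_w <- weight - mean(weight)

# Slope: sum of products of deviations / sum of squared X deviations
b1 <- sum(dev_h * dev_w) / sum(dev_h^2)   # 830 / 1000 = 0.83

# Intercept: y_bar minus (b1 times x_bar)
b0 <- mean(weight) - b1 * mean(height)    # 68 - 0.83 * 170 = -73.1

# lm() should reproduce the same two numbers
fit <- lm(weight ~ height)
print(rbind(manual = c(b0, b1), lm = unname(coef(fit))))
```

Here each additional centimetre of height corresponds to a predicted 0.83 kg increase in weight, within the observed 150 to 190 cm range only.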

Lesson 2.3 Summary Output

- Read summary() in four sections: residuals, coefficients table, R-squared, and F-statistic.
- The Coefficients table gives the estimate, its uncertainty (SE), the t-test (Estimate / SE), and the p-value for H0: coefficient = 0.
- R-squared is the proportion of Y variance explained by the model (0 to 1).
- Adjusted R-squared penalises for the number of predictors; prefer it when comparing models with different numbers of predictors.
- A non-significant p-value with small n often means insufficient power, not that the predictor is useless.

Lesson 2.4 — SST, SSR, SSE and the ANOVA Table

- SST = SSR + SSE: total variation = explained variation + unexplained variation.
- R-squared = SSR / SST: the proportion of Y variation your model accounts for.
- The ANOVA table gives Df, Sum Sq, Mean Sq, F, and p for each source of variation.
- F = MSR / MSE: the ratio of signal (regression) to noise (residual).
- sqrt(MSE) = Residual Standard Error: the two outputs confirm each other.
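The identities above can be verified numerically. The sketch below uses hypothetical height and weight data and compares the hand-computed quantities with what summary() reports for a simple linear regression (one predictor, so the regression has 1 degree of freedom and the residuals have n - 2).

```r
# Hypothetical data (illustrative only)
height <- c(150, 160, 170, 180, 190)
weight <- c(52, 58, 69, 77, 84)
n <- length(weight)

fit   <- lm(weight ~ height)
y_hat <- fitted(fit)

sst <- sum((weight - mean(weight))^2)   # total variation
ssr <- sum((y_hat - mean(weight))^2)    # explained by the regression
sse <- sum((weight - y_hat)^2)          # left in the residuals

r_squared <- ssr / sst
msr <- ssr / 1          # regression df = 1 predictor
mse <- sse / (n - 2)    # residual df = n - 2
f_stat <- msr / mse
rse <- sqrt(mse)        # residual standard error

s <- summary(fit)
print(c(r_squared = r_squared, f_stat = f_stat, rse = rse))
print(c(r_squared = s$r.squared, f_stat = unname(s$fstatistic["value"]), rse = s$sigma))
```

The two printed rows should agree, which is the sense in which the ANOVA table and the summary() output confirm each other.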

Lesson 2.5 — The Four Diagnostic Plots

- Always run plot(model) before reporting any regression result: numbers alone can hide serious problems.
- Plot 1 (Residuals vs Fitted): checks linearity. Look for a flat horizontal red line.
- Plot 2 (Normal Q-Q): checks normality of residuals. Points should follow the diagonal.
- Plot 3 (Scale-Location): checks homoscedasticity. Look for even spread across all fitted values.
- Plot 4 (Residuals vs Leverage): detects influential observations using Cook's distance.
- An influential value is not necessarily an outlier: Cook's distance combines both leverage and residual size.

Interactive R session on Lesson 2 module 3

Lesson 3 module 3 Multiple Linear Regression

7 Lessons

Lesson 3 module 3 Multiple Linear Regression Video

1. What is multiple linear regression, and when should you use it?
2. Difference between simple and multiple linear regression
3. Decomposition of the total deviation in multiple linear regression
4. Terms used in modelling multiple linear regression
5. Application to simulated data (PS: it is simulated from the results of real data)
6. How to identify a confounder (interaction effect in the multiple linear regression equation)

3.1 What is Multiple Linear Regression and When Do You Use It?

- MLR extends SLR to predict a continuous outcome from two or more predictors simultaneously.
- The equation is Y = b0 + b1X1 + b2X2 + ... + bkXk + e.
- Each coefficient is a partial slope: the effect of one predictor while holding all others fixed.
- SLR fits a line; MLR fits a plane or hyperplane depending on the number of predictors.
- MLR is the go-to tool when you need to adjust for confounders in observational data.

3.2 Three Types of Predictors in Multiple Linear Regression

- MLR models can include three types of predictors: continuous (linear), dummy variables, and interaction terms.
- Dummy variables convert categorical variables into 0/1 flags; you need k-1 dummies for k categories.
- The reference group is the category assigned 0 across all dummies; all other group comparisons are made relative to it.
- An interaction term is the product of two predictors; it tests whether the effect of one variable depends on another.
- A significant interaction means you must interpret the main effects separately for each group.
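To make the k-1 rule concrete, here is a small sketch using base R's model.matrix(), which builds the dummy columns that lm() would use internally. The three treatment labels are hypothetical and are not taken from the course dataset.

```r
# Hypothetical 3-level factor; "Placebo" is listed first, so it is the reference
treatment <- factor(c("Placebo", "DrugA", "DrugB", "Placebo", "DrugB"),
                    levels = c("Placebo", "DrugA", "DrugB"))

# Intercept plus k - 1 = 2 dummy columns
design <- model.matrix(~ treatment)
print(design)

# Changing the reference group changes which dummies are created
treatment_ref_a <- relevel(treatment, ref = "DrugA")
design_ref_a <- model.matrix(~ treatment_ref_a)
print(colnames(design_ref_a))
```

Rows belonging to the reference group carry 0 in every dummy column, so each dummy coefficient is read as a difference from that group.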

3.3 Building Model 1: Treatment as a Dummy Predictor of HDL

- Model 1 regresses HDL on treatment (dummy: New Drug = 1, Placebo = 0).
- The intercept is the predicted mean HDL for the reference group (Placebo).
- The coefficient for dummy_treatment is the difference in mean HDL between New Drug and Placebo.
- A linear regression with a single dummy variable is mathematically identical to an independent-samples t-test (with equal variances assumed).
- The R-squared of 0.004 tells us treatment alone explains almost nothing of the variation in HDL, which points to another variable, such as a confounder, being at play.
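The equivalence between a one-dummy regression and the t-test can be seen directly in base R. The HDL values below are hypothetical (they are not the course's simulated dataset), and the t-test is run with var.equal = TRUE because that is the version the regression reproduces.

```r
# Hypothetical HDL values in mg/dL (illustrative only)
placebo  <- c(48, 52, 50, 47, 53)
new_drug <- c(51, 49, 55, 50, 54)

hdl             <- c(placebo, new_drug)
dummy_treatment <- c(rep(0, 5), rep(1, 5))   # Placebo = 0 (reference), New Drug = 1

fit <- lm(hdl ~ dummy_treatment)
tt  <- t.test(new_drug, placebo, var.equal = TRUE)

coefs <- summary(fit)$coefficients
print(coefs)
print(tt)
```

The intercept equals the Placebo mean, the dummy coefficient equals the difference in group means, and the t statistic and p-value for the dummy match those from t.test().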

3.4 Model 2: Adding Sex as a Predictor and Identifying Confounders

3.5 Model 3: Interaction Effects in Multiple Linear Regression

- An interaction term tests whether the effect of one predictor depends on the level of another predictor.
- In R, add the interaction using dummy_treatment * dummy_sex inside the lm() formula.
- When an interaction is significant, report the stratified effects (treatment effect in each sex group separately).
- In Model 3: treatment effect in females = +1.07 mg/dL (not significant); treatment effect in males = -4.44 mg/dL.
- A significant interaction is called effect modification. The overall (unadjusted) analysis was masking opposite effects in the two groups.
- Main effect coefficients change meaning when an interaction is in the model. Each now represents the effect in the reference category of the other variable.
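A short sketch of how stratified effects are recovered from an interaction model. The data are hypothetical and deliberately tiny, so the numbers differ from the Model 3 results quoted above; the coding of dummy_sex (0 = female as reference, 1 = male) is also an assumption made for this illustration.

```r
# Hypothetical HDL data (illustrative only): two observations per cell
hdl             <- c(50, 52, 52, 54, 45, 47, 41, 43)
dummy_treatment <- c(0, 0, 1, 1, 0, 0, 1, 1)   # 0 = Placebo, 1 = New Drug
dummy_sex       <- c(0, 0, 0, 0, 1, 1, 1, 1)   # assumed coding: 0 = female, 1 = male

# The * operator adds both main effects and their product term
fit <- lm(hdl ~ dummy_treatment * dummy_sex)
b   <- coef(fit)

# Main effect = treatment effect in the reference sex (females here)
effect_female <- b[["dummy_treatment"]]

# Effect in males = main effect + interaction coefficient
effect_male <- b[["dummy_treatment"]] + b[["dummy_treatment:dummy_sex"]]

print(c(effect_female = effect_female, effect_male = effect_male))
```

In this toy example the treatment raises HDL by 2 in females and lowers it by 4 in males, the same kind of opposite-direction pattern that an unstratified analysis would average away.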

Interactive R session on Lesson 3 module 3

Lesson 4 module 3 Logistic Regression

7 Lessons

Lesson 4 module 3 Logistic Regression Video

1. When to use logistic regression?
2. What are the underlying calculations of logistic regression?
3. Example and how to implement in R: univariate
4. How to interpret?
5. Example and how to implement in R: multivariate

4.1 When to Use Logistic Regression

- Identify when an outcome is binary and requires logistic regression.
- Explain why linear regression cannot model a 0/1 outcome.
- Recognise the shape of the logistic (S-curve) function.
- Describe the dataset used throughout Lecture 4.

4.2 The Logistic Function, Logit Transformation, and Odds Ratio

- Write the logistic function and explain what it outputs.
- Explain the logit transformation and why it linearises the model.
- Show that OR = exp(β₁) and ln(OR) = β₁.
- Compute odds from a probability and a probability from log-odds.
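The conversions listed above fit in a few lines of base R. This is a sketch with hypothetical numbers: the probability 0.20 and the coefficient log(2) are chosen only to keep the arithmetic clean.

```r
# Probability -> odds -> log-odds (logit)
p        <- 0.20
odds     <- p / (1 - p)     # 0.25
log_odds <- log(odds)

# Log-odds -> probability via the logistic function
p_back <- 1 / (1 + exp(-log_odds))

# Base R offers the same pair as qlogis() (logit) and plogis() (logistic)
log_odds_builtin <- qlogis(p)
p_builtin        <- plogis(log_odds)

# A hypothetical logistic regression coefficient and its odds ratio
beta1      <- log(2)
odds_ratio <- exp(beta1)    # OR = exp(beta1), so ln(OR) = beta1

print(c(odds = odds, log_odds = log_odds, p_back = p_back, odds_ratio = odds_ratio))
```

An odds ratio of 2 means the odds of the outcome double for each one-unit increase in the predictor; it does not mean the probability doubles.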

4.3 Univariate Logistic Regression in R — The Reference Value Problem

- Run glm() with family = binomial to fit a logistic model.
- Explain what "reference category" means and why it matters.
- Recognise when R has chosen the wrong reference (the "badmodel").
- Use relevel() to set the correct reference category.

4.4 How to Interpret Logistic Regression Output

- Extract the odds ratio using exp(coef(model)).
- Convert log-odds to probability using the logistic formula.
- Report absolute odds and probabilities for each group.
- Write a complete, clinically meaningful interpretation sentence.

4.5 Multivariate Logistic Regression in R

- Add multiple predictors to a logistic model using glm().
- Read a correlation matrix and corrplot for predictors.
- Spot the reference value trap in a multivariate badmodel.
- Fix both reference categories and build goodmodel2.
- Write an adjusted interpretation for each predictor in the model.

Interactive R session on Lesson 4 module 3

Lesson 5 module 3 Conditional Logistic Regression

5 Lessons

Lesson 5 module 3 Conditional Logistic Regression Video

- Conditional Logistic Regression
- Multinomial Logistic Regression
- Ordinal Logistic Regression

5.1 When to Use Conditional Logistic Regression

5.2 Running clogit() and Reading the Output

- Load the survival package and fit a model using clogit().
- Explain the role of strata(match) in the formula.
- Read the clogit output: coef, exp(coef), se, z, and p-value.
- Interpret each OR and its 95% CI for smk, sbp, and ecg.
- Use anova() to test whether an interaction model is needed.

5.3 Three Methods, One Truth

- Run all three approaches for the MI dataset in R.
- Explain why factor(match) inflates ORs and why ignoring strata deflates them.
- Read the three-method comparison table and identify the direction of bias.
- Write a complete interpretation of the clogit results in manuscript format.

Interactive R session on Lesson 5 module 3

Lesson 6 module 3 Ordinal Logistic Regression

7 Lessons

Lesson 6 module 3 Ordinal Logistic Regression Video

- What is ordinal logistic regression?
- When to use ordinal logistic regression?
- Don't use an ordinal model if …
- How to label dummy variables in R
- Fit an ordered logit model using the clm function
- Fit an ordered logit model using the polr and brant functions

6.1 What is Ordinal Logistic Regression?

6.2 Labelling Variables and Exploring the Data

6.3 Fitting the Ordinal Model and Testing Overall Fit

6.4 Coefficients, Odds Ratios, and Confidence Intervals

6.5 polr(), the Brant Test, and Writing Up

Interactive R session on Lesson 6 module 3

Lesson 7 module 3 Poisson Regression

5 Lessons

Lesson 7 module 3 Poisson Regression Video

Lesson 7.1 When to Use Poisson Regression

Lesson 7.2 Understanding the Poisson distribution and what λ actually does

Lesson 7.3 Building the Model: glm() with family='poisson'

Lesson 7.4 The Magic of exp(): From Log to Rate Ratio

Lesson 8 module 3 Repeated Measures ANOVA

6 Lessons

Lesson 8 module 3 Repeated Measures ANOVA Video

- When to use a Repeated Measures ANOVA
- Hypothesis for Repeated Measures ANOVA
- Logic of the Repeated Measures ANOVA
- Assumptions
- Computing and visualization

Lesson 8.1 When to Use Repeated Measures ANOVA

Lesson 8.2 The Logic and Advantage of RM-ANOVA

- Describe how total variability (SST) is partitioned in RM-ANOVA.
- Explain why SS_subjects is removed from the error term.
- Understand why RM-ANOVA is more powerful than independent ANOVA.
- State the null and alternative hypotheses for RM-ANOVA.
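The partition of SST can be sketched by hand in base R. The scores below are hypothetical (4 subjects, each measured at 3 time points), and the variable names are illustrative rather than taken from the lesson.

```r
# Hypothetical scores: rows = subjects, columns = time points
scores <- matrix(c(10, 12, 15,
                    8,  9, 13,
                   14, 15, 19,
                    9, 12, 14), nrow = 4, byrow = TRUE)
n_subj <- nrow(scores)
k      <- ncol(scores)
grand  <- mean(scores)

ss_total    <- sum((scores - grand)^2)
ss_time     <- n_subj * sum((colMeans(scores) - grand)^2)   # between time points
ss_subjects <- k * sum((rowMeans(scores) - grand)^2)        # between subjects

# Residual after removing both subject and time effects
resid    <- sweep(sweep(scores, 1, rowMeans(scores)), 2, colMeans(scores)) + grand
ss_error <- sum(resid^2)

# RM-ANOVA: subject variability is taken out of the error term
f_rm <- (ss_time / (k - 1)) / (ss_error / ((k - 1) * (n_subj - 1)))

# Independent ANOVA on the same numbers: subject variability stays in the error
f_indep <- (ss_time / (k - 1)) / ((ss_total - ss_time) / (n_subj * k - k))

print(c(ss_total = ss_total, ss_time = ss_time, ss_subjects = ss_subjects,
        ss_error = ss_error, f_rm = f_rm, f_indep = f_indep))
```

With the same ss_time in the numerator, the much smaller error term gives RM-ANOVA a far larger F, which is the source of its extra power when subjects differ consistently from one another.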

Lesson 8.3 Assumptions: Outliers, Normality, Sphericity

- Check for outliers using identify_outliers() from rstatix.
- Test normality using Shapiro-Wilk per time point and interpret p-values.
- Understand what sphericity means and use Mauchly's test to check it.
- Know what to do when each assumption is violated.

Lesson 8.4 Running the Test, Post-hoc, and Reporting

- Run anova_test() and read every column of the ANOVA table.
- Distinguish generalized vs. partial eta-squared.
- Perform pairwise post-hoc tests with Bonferroni correction.
- Write a complete, publishable report sentence.

Interactive R session on Lesson 8 module 3

Lesson 9 module 3 GEE

6 Lessons

Lesson 9 module 3 GEE Video

- What is GEE
- When to use GEE
- Computation
- Choosing the best model
- Computing the confidence interval

Lesson 9.1 What is GEE and When to Use It

- Define GEE and explain its core statistical idea.
- Identify when GEE is appropriate vs. RM-ANOVA.
- Understand the population-average interpretation of GEE results.
- Name the R package and key function used to run GEE.

Lesson 9.2 Correlation Structures in GEE

- Explain what a "working correlation structure" is in GEE.
- Read and interpret an Independence correlation matrix.
- Read and interpret an Exchangeable correlation matrix.
- Choose the appropriate structure for a given clinical scenario.
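For orientation, the two working correlation matrices named above can be written out in base R for three repeated measurements. This is only a sketch of their shape: the value alpha = 0.5 is hypothetical, whereas in a fitted GEE model the exchangeable correlation is estimated from the data.

```r
n_times <- 3   # e.g. three time points per patient

# Independence: repeated measurements are treated as uncorrelated
independence <- diag(n_times)

# Exchangeable: every pair of time points shares the same correlation alpha
alpha <- 0.5   # hypothetical value, for illustration only
exchangeable <- matrix(alpha, nrow = n_times, ncol = n_times)
diag(exchangeable) <- 1

print(independence)
print(exchangeable)
```

Reading the matrices: the diagonal is always 1 (a measurement with itself), and the off-diagonal entries are what the chosen structure assumes about any two time points within the same patient.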

Lesson 9.3 Building and Selecting the Best GEE Model

- Write correct geeglm() syntax including all required arguments.
- Understand the difference between main effects and interaction models.
- Use QIC() to compare and select the best GEE model.
- Read the geeglm summary output including the Wald statistic.

Lesson 9.4 Interpreting Results and Computing CIs

- Calculate predicted means for each group at each time point.
- Interpret interaction coefficients in clinical terms.
- Build the custom confint.geeglm() function and explain each line.
- Read the 95% CI table and identify significant effects.

Practical R session on lesson 9 GEE

Dataset: 150 observations. 50 MS patients in 2 groups (Cognition-targeted, Symptom-targeted), each measured at 3 time points. Outcome: Modified Fatigue Impact score. Covariates: Work/Social Adjustment, Hospital Anxiety, Perceived Stress.

Lesson 10 module 3 Multinomial Logistic Regression

6 Lessons

Lesson 10 module 3 Multinomial Logistic Regression Video

Lesson 10.1 What is Multinomial Logistic Regression and When to Use It

- Define multinomial logistic regression and its relationship to binary logistic regression.
- Identify when to use multinomial vs. binary, ordinal, or other regression models.
- Recognise clinical examples of multinomial outcomes.
- Name the R package and function used: nnet and multinom().

Lesson 10.2 Assumptions and the Equation

- Check all 4 assumptions before running multinom().
- Understand why multinomial regression produces K-1 equations.
- Interpret the log-odds equations for each non-reference category.
- Set the reference category correctly using relevel().

Lesson 10.3 Fitting the Model and Reading the Output

- Write correct multinom() syntax with all required arguments.
- Calculate p-values manually from z-scores (the multinom() summary does not report them).
- Interpret McFadden's pseudo R-squared for model fit.
- Use stepAIC() for model selection and anova() for model comparison.

Lesson 10.4 Interpretation: ORs, P-values, and Reporting

From exp(coef()) to a complete, publishable conclusion

- Compute and interpret Odds Ratios from exp(coef()).
- Correctly phrase OR interpretation in the multinomial context.
- Use stargazer() to generate publication-ready tables.
- Write a complete report sentence for a multinomial analysis.

Interactive R session on Multinomial Logistic Regression

Dataset: COVID-19 survey data. Outcome: covidthreat_ph (0=not a threat, 1=minor threat, 2=major threat). Predictors: gender (femaleID), age category (agecat: 18-29, 30-49, 50-64, 65+), education level (educationlev: 1-6), belief COVID is made up (CovidMadeUp: not at all, not much, some, a lot).

Lesson 11 Module 3 Survival Analysis

6 Lessons

Lesson 11 Module 3 Survival Analysis Video

The survival probability (which is also called the survivor function) S(t) is the probability that an individual survives from the time origin (e.g. diagnosis of cancer) to a specified future time t. It is fundamental to a survival analysis because survival probabilities for different values of t provide crucial summary information from time to event data.

Lesson 11.1 The Survival Function, Censoring, and the KM Life Table

- Define the survival function S(t) and explain what it measures.
- Distinguish between the three patient types: event, censored at end, lost to follow-up.
- Apply the KM formula to calculate survival probability step by step.
- Read the columns of a Kaplan-Meier life table (Nt, Dt, Ct, pt, St).
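The step-by-step Kaplan-Meier calculation can be sketched in base R without any package. The six follow-up times below are hypothetical, and the status coding (1 = event, 0 = censored) is an assumption for this illustration. The column names follow the life-table notation above, except that the censoring column Ct is omitted for brevity.

```r
# Hypothetical follow-up data (illustrative only)
time   <- c(2, 3, 3, 5, 7, 8)
status <- c(1, 1, 0, 1, 0, 1)   # assumed coding: 1 = event, 0 = censored

# Survival only drops at times where an event occurred
event_times <- sort(unique(time[status == 1]))

n_t <- sapply(event_times, function(t) sum(time >= t))                 # Nt: at risk
d_t <- sapply(event_times, function(t) sum(time == t & status == 1))   # Dt: events
p_t <- 1 - d_t / n_t     # pt: conditional probability of surviving past t
s_t <- cumprod(p_t)      # St: running product of the pt values

life_table <- data.frame(time = event_times, Nt = n_t, Dt = d_t, pt = p_t, St = s_t)
print(life_table)
```

Censored patients (times 3 and 7 here) never count as events, but they do leave the risk set, which is why Nt falls from 5 to 3 between t = 3 and t = 5.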

Lesson 11.2 The Surv Object and the Overall KM Curve

- Create a Surv object combining time and event status.
- Fit an overall KM curve using survfit(~1).
- Read the n.risk, n.event, survival, std.err, and CI columns.
- Interpret the KM plot including the confidence bands.

Lesson 11.3 Comparing Survival Curves: The Log-Rank Test

- Stratify a KM curve by group using survfit(survobj ~ sex).
- Read the by-group summary table (n, events, median, 95% CI).
- Run and interpret survdiff() for the log-rank test.
- Write a complete report sentence for a survival comparison.

Lesson 11.4 Progression-Free Survival and Combining Curves

- Define Progression-Free Survival (PFS) and distinguish it from Overall Survival.
- Understand how the pfs status vector is constructed in the lung dataset.
- Overlay OS and PFS curves on one plot using ggsurvplot_combine().
- Interpret the 4-curve combined plot stratified by sex.

Interactive R session on Lesson 11 module 3 Survival Analysis Practice

Dataset: lung (built into R's survival package). 228 patients with advanced lung cancer. Variables: time (survival days), status (1=censored, 2=dead), sex (1=male, 2=female), age, ph.ecog (ECOG score 0-5), ph.karno (Karnofsky score), meal.cal, wt.loss.

Bonus lectures sampling methods

4 Lessons

Lesson 12.1 The Sampling Breakdown

Four questions, four layers: from 100 million people to your study dataset

- Distinguish the theoretical population, study population, sampling frame, and sample.
- Explain how gaps between each layer can introduce bias.
- Differentiate probability from non-probability sampling.
- Name the five probability sampling methods covered in this lecture.

Lesson 12.2 Simple Random Sampling and Systematic Sampling

- Define simple random sampling and explain sampling with vs. without replacement.
- Calculate the sampling interval k for systematic sampling using k = N/n.
- Implement both methods in R using sample() and seq().
- Identify the specific scenario where systematic sampling fails.
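Both methods can be sketched in base R with sample() and seq(). The frame size N = 1000 and sample size n = 50 are hypothetical, and the frame is represented simply by the row numbers 1 to N.

```r
set.seed(42)   # fixed seed so the draw is reproducible

N <- 1000      # hypothetical sampling frame size
n <- 50        # hypothetical target sample size
frame <- seq_len(N)

# Simple random sampling without replacement: no unit can be drawn twice
srs <- sample(frame, size = n, replace = FALSE)

# Systematic sampling: interval k = N / n, random start within the first interval
k <- N / n
start <- sample(seq_len(k), size = 1)
systematic <- seq(from = start, by = k, length.out = n)

print(sort(srs))
print(systematic)
```

Every systematic pick is exactly k positions after the previous one, so any pattern in the frame that repeats with the same period as k would be sampled at the same phase each time.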

Lesson 12.3 Stratified and Cluster Sampling

- Define stratified random sampling and distinguish proportional from equal allocation.
- Explain when stratified sampling is preferred over SRS.
- Define cluster sampling and explain why entire clusters are selected.
- State the critical distinction: stratified samples WITHIN groups; cluster samples ENTIRE groups.

Lesson 12.4 Multistage Sampling and Choosing the Right Method

- Define multistage sampling and describe how stages nest within each other.
- Trace the Population → Clusters → Strata → Sample hierarchy.
- Apply the decision guide to match any research scenario to the correct sampling method.
- Compare all five methods across key dimensions: cost, precision, complexity, and requirements.

Bonus lecture sample size calculation

5 Lessons

Lesson 13.1 Power, Alpha, Beta, and the Three Trial Types

- Define Type I error (alpha), Type II error (beta), and power (1 - beta).
- Explain the testing margin delta and its role in each trial type.
- Distinguish superiority, non-inferiority, and equivalence trials.
- State why equivalence trials consistently require larger sample sizes than superiority trials.

Lesson 13.2 One-Sample Proportion and Mean Sample Size

- Apply the 1-sample proportion non-inferiority formula to compute n = 155.
- Apply the 1-sample proportion equivalence formula to compute n = 51.
- Apply the 1-sample mean non-inferiority formula to compute n = 6.
- Apply the 1-sample mean equivalence formula to compute n = 34.

Lesson 13.3 Two-Sample Proportion and Mean Sample Size

- Apply the 2-sample proportion non-inferiority formula with kappa to get nB = 24.
- Explain why 2-sample proportion equivalence with |pA-pB| = delta gives an astronomically large n.
- Apply the 2-sample mean non-inferiority formula to get nB = 49 per group.
- Apply the 2-sample mean equivalence formula to get nB = 107 per group.

Lesson 13.4 Time-to-Event Sample Size (Cox Proportional Hazards)

- Define the hazard ratio and explain why log(HR) is used in the formula.
- Identify the unique parameter pE and explain its role in survival sample size.
- Apply the Cox PH non-inferiority formula to compute n = 82.
- Apply the Cox PH equivalence formula to compute n = 171.

Interactive R session on Sample Size Calculation Practice

No packages required. All sample size calculations use base R functions only: qnorm(), pnorm(), sqrt(), abs(), log(). Every exercise reproduces the exact outputs from the lecture material. Always round n up to the nearest whole number in practice.

Bonus lecture Cox Proportional Hazards Regression

6 Lessons

Bonus lecture Cox Proportional Hazards Regression Video

Lesson 14.1 From KM to Cox: Why We Need a Regression Model for Survival

- Identify the three key limitations of Kaplan-Meier that require Cox regression.
- Define the hazard function h(t) and distinguish it from the survival function S(t).
- Interpret the Cox PH formula and explain what "proportional hazards" means.
- Read a hazard ratio (HR) and explain whether it indicates risk or protection.

Lesson 14.2 - Building Cox Models in R: From Simple to Multivariable

- Write and run coxph() for both simple and multivariable Cox models.
- Read all columns of the Cox summary output: coef, exp(coef), se(coef), z, p.
- Interpret the concordance index (C-statistic) and the three overall model tests.
- Identify which covariates are statistically significant and which are not.

Lesson 14.3 Interpreting Hazard Ratios and the Forest Plot

- Apply the HR interpretation template to any protective or harmful covariate.
- Read a forest plot: variable position, CI width, reference line at HR=1.
- Identify the parsimonious model (sex + ECOG) and explain why it was chosen.
- Write a complete results paragraph for a Cox regression in clinical language.

Lesson 14.4 - Testing the Proportional Hazards Assumption

- Explain the proportional hazards assumption in clinical terms and why it matters.
- Run cox.zph() and read its output: the rho, chisq, and p columns, plus the GLOBAL row.
- Interpret Schoenfeld residual plots: what a flat line means vs. what a trend means.
- Apply the decision guide: assumption holds, borderline, or violated.

Interactive R session on Cox Regression

5 exercises covering coxph(), HR interpretation, model comparison, and the PH assumption test. Each exercise uses the Mayo Clinic lung dataset.

About the teacher

Nouran Hamza
