Module 3 The certified biostatistician 2026

Premium course

Advanced

Video/Text

Subscribers only

84 Lessons

Lesson 1 Module 3 Intro to Statistical Relationships

7 Lessons

1. Are relationships between variables always linear?
2. Why and how was the correlation coefficient developed?
3. Pearson and Spearman correlation formulas
4. How to interpret the correlation coefficient
5. A hint on linear regression

Video lesson

Relationships can be linear (positive or negative) or non-linear.
Pearson correlation and linear regression require a linear relationship.
Always inspect a scatter plot before choosing your statistical method.
Non-linear patterns (e.g. warfarin–INR) need different approaches.

Text lesson

Variance measures how one variable spreads around its own mean.
Covariance extends this idea to two variables by multiplying their deviations together.
A positive covariance means the variables tend to move in the same direction.
A negative covariance means they tend to move in opposite directions.
Covariance cannot tell you how strong the relationship is because it depends on the units of measurement.
Standardising covariance by dividing by the product of both standard deviations gives Pearson r.

Text lesson

Pearson r is covariance divided by the product of both standard deviations.
This standardisation forces r to always fall between -1 and +1, making it unit-free and comparable.
Spearman rs uses ranks instead of raw values, making it robust to outliers and valid for ordinal data.
Use Pearson for continuous, normally distributed, linear data with no major outliers.
Use Spearman when data are skewed, ordinal, or contain influential outliers.
In R: change method="pearson" to method="spearman"; the rest of the function call is identical.

Text lesson
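The standardisation described above can be checked with a minimal base R sketch. The data values are hypothetical (made up for illustration, not taken from the course), and the variable names are assumptions.

```r
# Hypothetical toy data (made-up values): x = dose, y = response
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.3, 8.8)

# Pearson r by hand: covariance divided by the product of both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The same quantity from the built-in function
r_pearson <- cor(x, y, method = "pearson")

# Spearman: only the method argument changes
r_spearman <- cor(x, y, method = "spearman")

# Spearman is Pearson applied to the ranks of the data
r_rank <- cor(rank(x), rank(y), method = "pearson")

print(c(manual = r_manual, pearson = r_pearson, spearman = r_spearman))
```

Because y increases monotonically with x in this toy example, the Spearman coefficient reaches 1 even though the points do not lie exactly on a straight line.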

r has two components: sign (direction) and magnitude (strength). Read both, every time.
A negative r is not a weak r. An r of -0.90 is stronger than an r of +0.20.
The standard thresholds for |r| are: 0.00-0.30 negligible, 0.30-0.50 low, 0.50-0.70 moderate, 0.70-0.90 high, 0.90-1.00 very high.
The same r value means different things in different clinical fields. Always interpret in context.
Correlation never proves causation. Always consider confounding variables.

Text lesson

Correlation measures strength of association. Regression gives you an equation to predict one variable from another.
The regression equation is Y = a + bX, where a is the intercept and b is the slope.
The slope b tells you how much Y changes for every 1-unit increase in X. This is the clinically actionable number.
The intercept anchors the line mathematically. It is often not meaningful on its own in clinical contexts.
Never use a regression equation to predict values outside the range of your original data.
Lecture 2 will show you how the best-fit line is mathematically calculated and how to assess model quality.

Text lesson

Lesson 2 Module 3 Simple Linear Regression

7 Lessons

You can use simple linear regression when you want to know:
1- How strong the relationship is between two variables (e.g. the relationship between weight and height).
2- The value of the dependent variable at a certain value of the independent variable (e.g. the weight at a specific height).

Video lesson

SLR answers two questions: how strong is the relationship, and what is the predicted value at a given X?
Both predictor and outcome must be continuous and the relationship must be approximately linear.
SLR requires four assumptions: linearity, independence, normality of residuals, and homoscedasticity.
The assumptions are checked after fitting, using diagnostic plots from plot(model).
When the outcome is categorical or you have multiple predictors, different models apply.

Text lesson

Least squares finds the line that minimises the sum of squared residuals across all data points.
Slope b1 = sum of products of deviations / sum of squared X deviations.
Intercept b0 = y_bar - (b1 * x_bar); it anchors the line through the means.
The slope is the clinically meaningful number: it is the change in Y for each one-unit change in X.
lm() in R performs these exact calculations; the manual method confirms what is happening inside.

Text lesson
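The slope and intercept formulas above can be verified against lm() with a short base R sketch. The height and weight values are hypothetical, chosen only for illustration.

```r
# Hypothetical toy data (made-up values): height in cm (x), weight in kg (y)
x <- c(150, 155, 160, 165, 170, 175, 180)
y <- c(52, 55, 61, 63, 70, 72, 78)

# Slope: sum of products of deviations / sum of squared x deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
b0 <- mean(y) - b1 * mean(x)

# The same fit from lm()
model <- lm(y ~ x)

print(c(b0 = b0, b1 = b1))
print(coef(model))
```

The two printed pairs should agree up to floating-point rounding, which is the point of doing the manual calculation once.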

Read summary() in four sections: residuals, coefficients table, R-squared, and F-statistic.
The Coefficients table gives: the estimate, its uncertainty (SE), the t-test (Estimate / SE), and the p-value for H0: coefficient = 0.
R-squared is the proportion of Y variance explained by the model (0 to 1).
Adjusted R-squared penalises for the number of predictors; always prefer this when comparing models.
A non-significant p-value with small n often means insufficient power, not that the predictor is useless.

Text lesson

SST = SSR + SSE: total variation = explained variation + unexplained variation.
R-squared = SSR / SST: the proportion of Y variation your model accounts for.
The ANOVA table gives Df, Sum Sq, Mean Sq, F, and p for each source of variation.
F = MSR / MSE: the ratio of signal (regression) to noise (residual).
sqrt(MSE) = Residual Standard Error: the two outputs confirm each other.

Text lesson
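As a rough check of the decomposition above, the following base R sketch computes each sum of squares by hand and compares the results with what summary() reports. The data are hypothetical; the degrees of freedom (1 and n - 2) assume a single predictor.

```r
# Hypothetical toy data (made-up values): height in cm (x), weight in kg (y)
x <- c(150, 155, 160, 165, 170, 175, 180)
y <- c(52, 55, 61, 63, 70, 72, 78)

model <- lm(y ~ x)
y_hat <- fitted(model)
n <- length(y)

SST <- sum((y - mean(y))^2)      # total variation
SSR <- sum((y_hat - mean(y))^2)  # variation explained by the regression
SSE <- sum((y - y_hat)^2)        # unexplained (residual) variation

r_squared <- SSR / SST

MSR <- SSR / 1         # one predictor, so 1 regression degree of freedom
MSE <- SSE / (n - 2)   # residual degrees of freedom
F_stat <- MSR / MSE
rse <- sqrt(MSE)       # residual standard error

s <- summary(model)
print(c(R2 = r_squared, F = F_stat, RSE = rse))
print(c(R2 = s$r.squared, F = s$fstatistic[["value"]], RSE = s$sigma))
```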

Always run plot(model) before reporting any regression result; numbers alone can hide serious problems.
Plot 1 (Residuals vs Fitted): checks linearity; look for a flat horizontal red line.
Plot 2 (Normal Q-Q): checks normality of residuals; points should follow the diagonal.
Plot 3 (Scale-Location): checks homoscedasticity; look for even spread across all fitted values.
Plot 4 (Residuals vs Leverage): detects influential observations using Cook's distance.
An influential value is not necessarily an outlier; Cook's distance combines both leverage and residual size.

Text lesson

Lesson 3 Module 3 Multiple Linear Regression

7 Lessons

1- What is multiple linear regression? When should you use it?
2- Difference between simple and multiple linear regression
3- Decomposition of the total deviation in multiple linear regression
4- Terms used in modeling multiple linear regression
5- Application to simulated data (note: simulated from the results of real data)
6- How to identify a confounder? (Interaction effect in the multiple linear regression equation)

Video lesson

MLR extends SLR to predict a continuous outcome from two or more predictors simultaneously.
The equation is Y = b0 + b1X1 + b2X2 + ... + bkXk + e.
Each coefficient is a partial slope: the effect of one predictor while holding all others fixed.
SLR fits a line; MLR fits a plane or hyperplane depending on the number of predictors.
MLR is the go-to tool when you need to adjust for confounders in observational data.

Text lesson

MLR models can include three types of predictors: continuous (linear), dummy variables, and interaction terms.
Dummy variables convert categorical variables into 0/1 flags; you need k-1 dummies for k categories.
The reference group is the category assigned 0 across all dummies; all other group comparisons are made relative to it.
An interaction term is the product of two predictors; it tests whether the effect of one variable depends on another.
A significant interaction means you must interpret the main effects separately for each group.

Text lesson

Model 1 regresses HDL on treatment (dummy: New Drug = 1, Placebo = 0).
The intercept = predicted mean HDL for the reference group (Placebo).
The coefficient for dummy_treatment = difference in mean HDL between New Drug and Placebo.
Regression on a single dummy variable is mathematically identical to an independent samples t-test (the equal-variance version).
The R-squared of 0.004 tells us treatment alone explains almost nothing, which prompts the search for another variable at play (a confounder or effect modifier).

Text lesson
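The equivalence between a single-dummy regression and the t-test can be illustrated with a minimal base R sketch. The data below are simulated with arbitrary, hypothetical parameters and are not the course's HDL dataset; only the variable name dummy_treatment echoes the lesson.

```r
set.seed(1)

# Hypothetical simulated data (NOT the course dataset): 20 placebo, 20 new drug
dummy_treatment <- rep(c(0, 1), each = 20)
hdl <- rnorm(40, mean = 50 + 2 * dummy_treatment, sd = 8)

# Regression of the outcome on the single 0/1 dummy
model1 <- lm(hdl ~ dummy_treatment)
p_lm <- summary(model1)$coefficients["dummy_treatment", "Pr(>|t|)"]

# Independent samples t-test with pooled (equal) variance
tt <- t.test(hdl[dummy_treatment == 1], hdl[dummy_treatment == 0],
             var.equal = TRUE)

mean_placebo <- mean(hdl[dummy_treatment == 0])
mean_drug <- mean(hdl[dummy_treatment == 1])

print(coef(model1))                       # intercept and group difference
print(c(mean_placebo, mean_drug - mean_placebo))
print(c(p_lm = p_lm, p_ttest = tt$p.value))
```

Note that t.test() defaults to the Welch (unequal-variance) version, so var.equal = TRUE is needed for the p-values to match exactly.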

An interaction term tests whether the effect of one predictor depends on the level of another predictor.
In R, add the interaction using dummy_treatment * dummy_sex inside the lm() formula.
When an interaction is significant, report the stratified effects (treatment effect in each sex group separately).
In Model 3: treatment effect in females = +1.07 mg/dL (not significant); treatment effect in males = -4.44 mg/dL.
A significant interaction is called effect modification. The overall (unadjusted) analysis was masking opposite effects in the two groups.
Main effect coefficients change meaning when an interaction is in the model. Each now represents the effect in the reference category of the other variable.

Text lesson

Lesson 4 Module 3 Logistic Regression

7 Lessons

1- When to use logistic regression?
2- What are the underlying calculations of logistic regression?
3- Univariate example and how to implement it in R
4- How to interpret?
5- Multivariate example and how to implement it in R

Video lesson

Identify when an outcome is binary and requires logistic regression.
Explain why linear regression cannot model a 0/1 outcome.
Recognise the shape of the logistic (S-curve) function.
Describe the dataset used throughout Lecture 4.

Text lesson

Write the logistic function and explain what it outputs.
Explain the logit transformation and why it linearises the model.
Show that OR = exp(β₁) and ln(OR) = β₁.
Compute odds from a probability and a probability from log-odds.

Text lesson
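The conversions listed above (probability to odds, log-odds back to probability, and OR = exp(β₁)) can be sketched in base R. The probability and the coefficients b0 and b1 are hypothetical values picked for illustration.

```r
# Probability -> odds -> log-odds (logit)
p <- 0.20
odds <- p / (1 - p)      # 0.25
log_odds <- log(odds)    # the logit of p; qlogis(p) gives the same value

# Log-odds -> probability via the logistic function; plogis() does the same
p_back <- 1 / (1 + exp(-log_odds))

# Hypothetical coefficients of a model logit(p) = b0 + b1 * exposed
b0 <- -1.5
b1 <- 0.7

p_unexposed <- plogis(b0)        # predicted probability when exposed = 0
p_exposed <- plogis(b0 + b1)     # predicted probability when exposed = 1

# The ratio of the two odds equals exp(b1)
odds_ratio <- (p_exposed / (1 - p_exposed)) / (p_unexposed / (1 - p_unexposed))

print(c(odds = odds, log_odds = log_odds, p_back = p_back))
print(c(odds_ratio = odds_ratio, exp_b1 = exp(b1)))
```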

Run glm() with family = binomial to fit a logistic model.
Explain what "reference category" means and why it matters.
Recognise when R has chosen the wrong reference (the "badmodel").
Use relevel() to set the correct reference category.

Text lesson

Extract the odds ratio using exp(coef(model)).
Convert log-odds to probability using the logistic formula.
Report absolute odds and probabilities for each group.
Write a complete, clinically meaningful interpretation sentence.

Text lesson
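A minimal sketch of the reference-category issue and the odds ratio extraction, using simulated data. The variables exposure and disease, the probabilities, and the seed are all hypothetical; only the names badmodel and goodmodel echo the lesson, and the course's own dataset is not reproduced here.

```r
set.seed(42)

# Hypothetical simulated data: exposure status and a binary disease outcome
n <- 200
exposure <- factor(sample(c("exposed", "unexposed"), n, replace = TRUE))
disease <- rbinom(n, size = 1,
                  prob = ifelse(exposure == "exposed", 0.45, 0.20))

# R orders factor levels alphabetically, so "exposed" becomes the reference
print(levels(exposure))

# Default reference: the coefficient compares unexposed vs exposed
badmodel <- glm(disease ~ exposure, family = binomial)
or_bad <- exp(coef(badmodel))[["exposureunexposed"]]

# Set "unexposed" as the reference so the OR reads exposed vs unexposed
exposure2 <- relevel(exposure, ref = "unexposed")
goodmodel <- glm(disease ~ exposure2, family = binomial)
or_good <- exp(coef(goodmodel))[["exposure2exposed"]]

# Cross-check against the odds ratio from the 2x2 table
tab <- table(exposure2, disease)
or_table <- (tab["exposed", "1"] * tab["unexposed", "0"]) /
            (tab["exposed", "0"] * tab["unexposed", "1"])

print(c(or_bad = or_bad, or_good = or_good, or_table = or_table))
```

The two models fit the data equally well; changing the reference only inverts the odds ratio, which is why the choice matters for interpretation rather than for fit.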

Add multiple predictors to a logistic model using glm().
Read a correlation matrix and corrplot for predictors.
Spot the reference value trap in a multivariate badmodel.
Fix both reference categories and build goodmodel2.
Write an adjusted interpretation for each predictor in the model.

Text lesson

Lesson 5 Module 3 Conditional Logistic Regression

5 Lessons

Conditional Logistic Regression
Multinomial Logistic Regression
Ordinal Logistic Regression

Video lesson

Load the survival package and fit a model using clogit().
Explain the role of strata(match) in the formula.
Read the clogit output: coef, exp(coef), se, z, and p-value.
Interpret each OR and its 95% CI for smk, sbp, and ecg.
Use anova() to test whether an interaction model is needed.

Text lesson

Run all three approaches for the MI dataset in R.
Explain why factor(match) inflates ORs and why ignoring strata deflates them.
Read the three-method comparison table and identify the direction of bias.
Write a complete interpretation of the clogit results in manuscript format.

Text lesson

Lesson 6 Module 3 Ordinal Logistic Regression

7 Lessons

What is ordinal logistic regression?
When to use ordinal logistic regression?
Don't use an ordinal model if …
How to label dummy variables in R
Fit an ordered logit model using the clm function
Fit an ordered logit model using the polr and brant functions

Video lesson

Lesson 7 Module 3 Poisson Regression

5 Lessons

Lesson 8 Module 3 Repeated Measures ANOVA

6 Lessons

When to use a Repeated Measures ANOVA
Hypothesis for Repeated Measures ANOVA
Logic of the Repeated Measures ANOVA
Assumptions
Computing and visualization

Video lesson

Describe how total variability (SST) is partitioned in RM-ANOVA.
Explain why SS_subjects is removed from the error term.
Understand why RM-ANOVA is more powerful than independent ANOVA.
State the null and alternative hypotheses for RM-ANOVA.

Text lesson

Check for outliers using identify_outliers() from rstatix.
Test normality using Shapiro-Wilk per time point and interpret p-values.
Understand what sphericity means and use Mauchly's test to check it.
Know what to do when each assumption is violated.

Text lesson

Run anova_test() and read every column of the ANOVA table.
Distinguish generalized vs. partial eta-squared.
Perform pairwise post-hoc tests with Bonferroni correction.
Write a complete, publishable report sentence.

Text lesson

Lesson 9 Module 3 GEE

6 Lessons

What is GEE
When to use GEE
Computation
Choosing the best model
Computing the confidence interval

Video lesson

Define GEE and explain its core statistical idea.
Identify when GEE is appropriate vs. RM-ANOVA.
Understand the population-average interpretation of GEE results.
Name the R package and key function used to run GEE.

Text lesson

Explain what a "working correlation structure" is in GEE.
Read and interpret an Independence correlation matrix.
Read and interpret an Exchangeable correlation matrix.
Choose the appropriate structure for a given clinical scenario.

Text lesson

Write correct geeglm() syntax including all required arguments.
Understand the difference between main effects and interaction models.
Use QIC() to compare and select the best GEE model.
Read the geeglm summary output including the Wald statistic.

Text lesson

Calculate predicted means for each group at each time point.
Interpret interaction coefficients in clinical terms.
Build the custom confint.geeglm() function and explain each line.
Read the 95% CI table and identify significant effects.

Text lesson

Dataset: 150 observations from 50 MS patients in 2 groups (Cognition-targeted, Symptom-targeted), each measured at 3 time points.
Outcome: Modified Fatigue Impact score.
Covariates: Work/Social Adjustment, Hospital Anxiety, Perceived Stress.

Text lesson

Lesson 10 Module 3 Multinomial Logistic Regression

6 Lessons

Define multinomial logistic regression and its relationship to binary logistic regression.
Identify when to use multinomial vs. binary, ordinal, or other regression models.
Recognise clinical examples of multinomial outcomes.
Name the R package and function used: nnet and multinom().

Text lesson

Check all 4 assumptions before running multinom().
Understand why multinomial regression produces K-1 equations.
Interpret the log-odds equations for each non-reference category.
Set the reference category correctly using relevel().

Text lesson

Write correct multinom() syntax with all required arguments.
Calculate p-values manually from z-scores (the summary() of a multinom() fit does not print them).
Interpret McFadden's pseudo R-squared for model fit.
Use stepAIC() for model selection and anova() for model comparison.

Text lesson
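The manual p-value step mentioned above amounts to a two-sided Wald test on z = coefficient / standard error. The sketch below uses hypothetical coefficient and standard error matrices (made-up numbers, laid out one row per non-reference outcome category) so that it runs in base R without nnet; with a real fit, the two matrices would typically come from summary(fit)$coefficients and summary(fit)$standard.errors.

```r
# Hypothetical coefficients (made-up values), one row per non-reference category
coefs <- matrix(c(0.49, 1.10, -0.30, 0.98), nrow = 2,
                dimnames = list(c("1", "2"), c("(Intercept)", "femaleID")))

# Hypothetical standard errors with the same layout
ses <- matrix(c(0.25, 0.40, 0.20, 0.50), nrow = 2,
              dimnames = dimnames(coefs))

# Wald z-scores: estimate divided by its standard error
z <- coefs / ses

# Two-sided p-values from the standard normal distribution
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(round(z, 2))
print(round(p, 4))
```

A z-score of 1.96 corresponds to a two-sided p-value of about 0.05, which is a quick sanity check on the output.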

Compute and interpret Odds Ratios from exp(coef()).
Correctly phrase OR interpretation in the multinomial context.
Use stargazer() to generate publication-ready tables.
Write a complete report sentence for a multinomial analysis.

Text lesson

Dataset: COVID-19 survey data.
Outcome: covidthreat_ph (0=not a threat, 1=minor threat, 2=major threat).
Predictors: gender (femaleID), age category (agecat: 18-29, 30-49, 50-64, 65+), education level (educationlev: 1-6), belief COVID is made up (CovidMadeUp: not at all, not much, some, a lot).

Text lesson

Lesson 11 Module 3 Survival Analysis

6 Lessons

The survival probability (also called the survivor function) S(t) is the probability that an individual survives from the time origin (e.g. diagnosis of cancer) to a specified future time t. It is fundamental to survival analysis because survival probabilities for different values of t provide crucial summary information from time-to-event data.

Video lesson

Define the survival function S(t) and explain what it measures.
Distinguish between the three patient types: event, censored at end, lost to follow-up.
Apply the KM formula to calculate survival probability step by step.
Read the columns of a Kaplan-Meier life table (Nt, Dt, Ct, pt, St).

Text lesson
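The step-by-step Kaplan-Meier calculation can be sketched in base R as follows. The follow-up times and statuses are hypothetical (made up for illustration), status is coded 1 = event and 0 = censored as an assumption, and the Ct column of the life table is left out to keep the sketch short.

```r
# Hypothetical follow-up data (made-up values): time in months,
# status 1 = event, 0 = censored
time   <- c(2, 3, 3, 5, 6, 8, 8, 9, 12, 12)
status <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)

# Distinct times at which at least one event occurred
event_times <- sort(unique(time[status == 1]))

# Nt: number still at risk just before each event time
Nt <- sapply(event_times, function(t) sum(time >= t))

# Dt: number of events at each event time
Dt <- sapply(event_times, function(t) sum(time == t & status == 1))

# pt: conditional probability of surviving past t, given at risk at t
pt <- 1 - Dt / Nt

# St: Kaplan-Meier estimate, the running product of the pt values
St <- cumprod(pt)

km <- data.frame(time = event_times, Nt = Nt, Dt = Dt, pt = pt, St = St)
print(km)
```

Censored patients never trigger a drop in the curve; they only leave the at-risk count Nt at later event times.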

Create a Surv object combining time and event status.
Fit an overall KM curve using survfit(~1).
Read the n.risk, n.event, survival, std.err, and CI columns.
Interpret the KM plot including the confidence bands.

Text lesson

Stratify a KM curve by group using survfit(survobj ~ sex).
Read the by-group summary table (n, events, median, 95% CI).
Run and interpret survdiff() for the log-rank test.
Write a complete report sentence for a survival comparison.

Text lesson

Define Progression-Free Survival (PFS) and distinguish it from Overall Survival (OS).
Understand how the pfs status vector is constructed in the lung dataset.
Overlay OS and PFS curves on one plot using ggsurvplot_combine().
Interpret the 4-curve combined plot stratified by sex.

Text lesson

Dataset: lung (built into R's survival package). 228 patients with advanced lung cancer.
Variables: time (survival days), status (1=censored, 2=dead), sex (1=male, 2=female), age, ph.ecog (ECOG score 0-5), ph.karno (Karnofsky score), meal.cal, wt.loss.

Text lesson

Bonus lecture Sampling Methods

4 Lessons

Distinguish the theoretical population, study population, sampling frame, and sample.
Explain how gaps between each layer can introduce bias.
Differentiate probability from non-probability sampling.
Name the five probability sampling methods covered in this lecture.

Text lesson

Define simple random sampling and explain sampling with vs. without replacement.
Calculate the sampling interval k for systematic sampling using k = N/n.
Implement both methods in R using sample() and seq().
Identify the specific scenario where systematic sampling fails.

Text lesson
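Both methods can be sketched with sample() and seq() in base R. The population size, sample size, and seed below are hypothetical, and the sampling frame is simply the integers 1 to N standing in for patient IDs.

```r
set.seed(123)

# Hypothetical sampling frame: IDs 1..N (values chosen for illustration)
N <- 100
n <- 10
frame <- 1:N

# Simple random sampling without replacement: no ID can appear twice
srs_without <- sample(frame, size = n, replace = FALSE)

# Simple random sampling with replacement: the same ID may be drawn again
srs_with <- sample(frame, size = n, replace = TRUE)

# Systematic sampling: interval k = N / n, random start within the first k IDs
k <- N / n
start <- sample(1:k, size = 1)
systematic <- seq(from = start, by = k, length.out = n)

print(sort(srs_without))
print(sort(srs_with))
print(systematic)
```

This sketch assumes N is an exact multiple of n; when it is not, k is usually rounded and the final sample size can differ slightly from n.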

Define stratified random sampling and distinguish proportional from equal allocation.
Explain when stratified sampling is preferred over SRS.
Define cluster sampling and explain why entire clusters are selected.
State the critical distinction: stratified sampling draws a sample WITHIN every group; cluster sampling selects ENTIRE groups.

Text lesson

Define multistage sampling and describe how stages nest within each other.
Trace the Population → Clusters → Strata → Sample hierarchy.
Apply the decision guide to match any research scenario to the correct sampling method.
Compare all five methods across key dimensions: cost, precision, complexity, and requirements.

Text lesson

Bonus lecture Sample Size Calculation

5 Lessons

Define Type I error (alpha), Type II error (beta), and power (1 - beta).
Explain the testing margin delta and its role in each trial type.
Distinguish superiority, non-inferiority, and equivalence trials.
State why equivalence trials consistently require larger sample sizes than superiority trials.

Text lesson

Apply the 1-sample proportion non-inferiority formula to compute n = 155.
Apply the 1-sample proportion equivalence formula to compute n = 51.
Apply the 1-sample mean non-inferiority formula to compute n = 6.
Apply the 1-sample mean equivalence formula to compute n = 34.

Text lesson

Apply the 2-sample proportion non-inferiority formula with kappa to get nB = 24.
Explain why 2-sample proportion equivalence with |pA-pB| = delta makes n astronomically large.
Apply the 2-sample mean non-inferiority formula to get nB = 49 per group.
Apply the 2-sample mean equivalence formula to get nB = 107 per group.

Text lesson

Define the hazard ratio and explain why log(HR) is used in the formula.
Identify the unique parameter pE and explain its role in survival sample size.
Apply the Cox PH non-inferiority formula to compute n = 82.
Apply the Cox PH equivalence formula to compute n = 171.

Text lesson

No packages required. All sample size calculations use base R functions only: qnorm(), pnorm(), sqrt(), abs(), log().
Every exercise reproduces the exact outputs from the lecture material.
Always round n up to the next whole number in practice.

Text lesson
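As an illustration of the base R workflow, here is a minimal sketch for a 1-sample proportion non-inferiority calculation. It assumes the common normal-approximation formula n = p(1 - p) * ((z_(1-alpha) + z_(1-beta)) / (p - p0 - delta))^2, which may differ in detail from the one used in the lecture, and the inputs are hypothetical, so the result is not meant to reproduce any of the lecture's values.

```r
# Hypothetical design inputs (NOT the lecture's values)
alpha <- 0.05   # one-sided Type I error
beta  <- 0.20   # Type II error, so power = 1 - beta = 0.80
p     <- 0.5    # assumed true proportion
p0    <- 0.3    # reference proportion
delta <- -0.1   # non-inferiority margin

# Standard normal quantiles for the chosen error rates
z_alpha <- qnorm(1 - alpha)
z_beta  <- qnorm(1 - beta)

# Assumed normal-approximation formula for the required sample size
n_raw <- p * (1 - p) * ((z_alpha + z_beta) / (p - p0 - delta))^2

# Always round up: a fractional patient cannot be recruited
n <- ceiling(n_raw)

# Power actually achieved at the rounded-up sample size
power <- pnorm(abs(p - p0 - delta) / sqrt(p * (1 - p) / n) - z_alpha)

print(c(n_raw = n_raw, n = n, power = power))
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the target.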

Bonus lecture Cox Proportional Hazards Regression

6 Lessons

Identify the three key limitations of Kaplan-Meier that require Cox regression.
Define the hazard function h(t) and distinguish it from the survival function S(t).
Interpret the Cox PH formula and explain what "proportional hazards" means.
Read a hazard ratio (HR) and explain whether it indicates risk or protection.

Text lesson

Write and run coxph() for both simple and multivariable Cox models.
Read all columns of the Cox summary output: coef, exp(coef), se(coef), z, p.
Interpret the concordance index (C-statistic) and the three overall model tests.
Identify which covariates are statistically significant and which are not.

Text lesson

Apply the HR interpretation template to any protective or harmful covariate.
Read a forest plot: variable position, CI width, reference line at HR=1.
Identify the parsimonious model (sex + ECOG) and explain why it was chosen.
Write a complete results paragraph for a Cox regression in clinical language.

Text lesson

Explain the proportional hazards assumption in clinical terms and why it matters.
Run cox.zph() and read its output: the rho, chisq, and p columns, plus the GLOBAL row.
Interpret Schoenfeld residual plots: what a flat line means vs. what a trend means.
Apply the decision guide: assumption holds, borderline, or violated.

Text lesson

5 exercises covering coxph(), HR interpretation, model comparison, and the PH assumption test.
Each exercise uses the Mayo Clinic lung dataset.

Text lesson

About the teacher

Nouran Hamza
