Biostatistician Interview
Analysis of an oncological dataset
Analyzing a dataset related to lung cancer patients to demonstrate my skills in data analysis and statistical modeling
Overview
This article provides an overview of the technical project I completed as part of a biostatistician interview process. The assignment involved analyzing an oncology dataset using R, with the objective of demonstrating statistical and analytical skills through a series of exercises. Although I did not secure the position, the feedback received has been invaluable for my professional growth in the field of clinical trial statistics.
Interactive Slides
Project Description
Dataset
The dataset provided for the project focused on lung cancer patients, with SurvTime as the primary response variable, representing survival time in days. Key covariates included:
- Cell: Type of cancer cell
- Therapy: Type of therapy (standard or test)
- Prior: Prior therapy status (0 for no, 10 for yes)
- Age: Age in years
- DiagTime: Time in months from diagnosis to trial entry
- Kps: Performance status
A censoring indicator variable was also included to distinguish censored observations from observed events. The dataset required transformation and preparation before analysis, in line with clinical research standards.
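As a rough illustration of the preparation step, the sketch below builds a data frame with the column names listed above. It is only a stand-in: it assumes the interview data resemble the `veteran` dataset shipped with the survival package, which has a similar set of variables. The name `Status` for the censoring indicator and its coding (1 = event, 0 = censored) are assumptions, not details taken from the assignment.

```r
library(survival)

# Stand-in data: survival::veteran, NOT the interview dataset itself.
vet <- survival::veteran

# Rename and recode so the columns match the names used in this article.
lung <- data.frame(
  SurvTime = vet$time,      # survival time in days
  Status   = vet$status,    # assumed coding: 1 = event, 0 = censored
  Cell     = vet$celltype,  # factor with the cancer cell types
  Therapy  = factor(vet$trt, levels = c(1, 2),
                    labels = c("standard", "test")),
  Prior    = factor(vet$prior, levels = c(0, 10),
                    labels = c("no", "yes")),
  Age      = vet$age,       # years
  DiagTime = vet$diagtime,  # months from diagnosis to trial entry
  Kps      = vet$karno      # performance status
)

# Quick structural check of the prepared data.
str(lung)
```

Recoding Prior and Therapy as labelled factors keeps the 0/10 and numeric codes from being treated as continuous values in later models.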
Exercises and Analysis
The project required solving six exercises, each designed to assess different analytical skills:
- Maximum Survival Time for Adeno Cell Type: Identifying the longest survival time for the adeno cell type.
- Average Age of Subjects: Calculating the mean age of study participants.
- Cell Type Frequency: Determining which cell type appeared most frequently.
- Descriptive Statistics: Generating descriptive statistics for all numeric variables.
- Survival Analysis: Performing survival analysis using Kaplan-Meier curves and Cox regression.
- Multivariable Analysis: Analyzing how the effect of age on the hazard varies across cancer cell types.
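The first four exercises are simple summaries. A minimal sketch in base R follows, again using `survival::veteran` as an assumed stand-in for the interview data, so the column names (`time`, `age`, `celltype`) are those of the stand-in rather than of the original file.

```r
library(survival)

# Stand-in data: survival::veteran, NOT the interview dataset itself.
lung <- survival::veteran

# 1. Maximum survival time (days) among patients with the adeno cell type.
max_adeno <- max(lung$time[lung$celltype == "adeno"])

# 2. Mean age of all subjects.
mean_age <- mean(lung$age)

# 3. Most frequent cell type.
cell_counts <- table(lung$celltype)
most_common <- names(cell_counts)[which.max(cell_counts)]

# 4. Descriptive statistics for every numeric column.
numeric_cols <- vapply(lung, is.numeric, logical(1))
desc_stats <- summary(lung[, numeric_cols])

print(max_adeno)
print(mean_age)
print(most_common)
print(desc_stats)
```

Note that `which.max` returns only the first maximum, so a tie between cell types would need to be checked against `cell_counts` directly.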
Methodology and Tools
The analysis used several R packages for data cleaning, statistical analysis, and visualization, including tidyverse, survival, and ggsurvfit. The results were presented as interactive slides built with revealjs, showcasing the statistical findings and interpretations.
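For the survival exercises, one possible approach with the survival and ggsurvfit packages is sketched below. It is an illustration under assumptions, not the analysis submitted for the interview: the data are again the `survival::veteran` stand-in, and the choice of covariates and the age-by-cell-type interaction term are one reasonable reading of the multivariable exercise.

```r
library(survival)

# Stand-in data: survival::veteran, NOT the interview dataset itself.
lung <- survival::veteran
lung$trt <- factor(lung$trt, levels = c(1, 2),
                   labels = c("standard", "test"))

# Kaplan-Meier estimate of survival by therapy arm.
km_fit <- survfit(Surv(time, status) ~ trt, data = lung)
print(km_fit)

# Log-rank test comparing the two arms.
logrank <- survdiff(Surv(time, status) ~ trt, data = lung)
print(logrank)

# Cox model with main effects only (covariate choice is an assumption).
cox_main <- coxph(Surv(time, status) ~ trt + celltype + age + karno,
                  data = lung)
print(summary(cox_main))

# Same model plus an age-by-cell-type interaction, so the hazard ratio
# for age is allowed to differ between cell types.
cox_int <- coxph(Surv(time, status) ~ trt + karno + age * celltype,
                 data = lung)

# Likelihood ratio test of the interaction (the models are nested).
print(anova(cox_main, cox_int))

# Optional plot, drawn only if ggsurvfit is installed.
if (requireNamespace("ggsurvfit", quietly = TRUE)) {
  library(ggsurvfit)
  p <- survfit2(Surv(time, status) ~ trt, data = lung) |>
    ggsurvfit() +
    add_confidence_interval() +
    add_risktable()
  print(p)
}
```

Before interpreting the hazard ratios, the proportional hazards assumption would normally be checked, for example with `cox.zph(cox_main)`.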