Chunhui Gu

7900 Cambridge St, Houston, Texas 77054 · (404) 952-6630 · chunhui.gu@uth.tmc.edu

After five years as a medical student, I found that I was more interested in, and better at, quantitative sciences and programming. I still love medicine but grew tired of its cramming-style education, and I hope to advance medicine from another angle with my multidisciplinary background. I consider myself a "full-stack" biomedical data scientist with 5 years of experience in biostatistics, computer science, and bioinformatics. I am currently a PhD candidate in Biostatistics at UTHealth School of Public Health and work jointly as a GRA in the MD Anderson Biostatistics and Clinical Cancer Prevention departments. I am looking for a full-time position in biomedical data science and machine learning.

Portfolio

Project 1
Multi-cancer early detection calculator

This is a web application to simulate the performance of a multi-cancer early detection test in reducing late-stage cancer incidence under different scenarios.

Project 2
GrapePi: Graph Neural Networks for Protein Identification

A comprehensive graph deep learning framework and CLI software for enhancing mass spectrometry proteomics protein detection with protein-protein interaction information.

Project 3
Linux Server bioinformatics pipeline

A comprehensive pipeline for batch-mode post-translational modification (PTM) searches with DIA-NN on a Linux server, followed by data aggregation and quality control.


Interests

I am keen on adapting advanced statistical and computer science techniques, as well as devising novel computational statistical methods, to address challenges in biomedicine.


Experience

Data Scientist Intern

Novo Nordisk, Plainsboro, New Jersey, USA
  • Drafted a study protocol to test the impact of different time-zero (start of follow-up) settings in a longitudinal target trial emulation study of chronic diseases without an active comparator.
  • Developed validation study to justify or disprove proposed time-zero settings by adapting previous active-user comparator RWE studies into non-user comparator design and compared results with well-studied treatment effects.
  • Extracted data from a large RWE database (Optum Clinformatics Data Mart) using SQL and R (dplyr), and harmonized the data to create cohorts under different time-zero settings.
  • Assessed performance of different settings of time-zero using survival analysis models (Cox model).
  • Automated cohort creation pipeline and created Shiny app/user-friendly R package for selection and visualization.
June 2024 - August 2024

Graduate Research Assistant

Department of Biostatistics, The University of Texas MD Anderson Cancer Center
Advisor: Prof. Irajizad Ehsan, Ph.D.
Supervisor: Prof. Kim-Anh Do, Ph.D
  • Learned structure and properties of mass spectrometry proteomics data
  • Built in-house R programming package for automating proteomics analysis pipeline including special data structure, statistical testing, and visualization modules
  • Identified missing proteins using combined information in mass spectrometry protein expression and mRNA expression

Using graph neural networks with protein-protein interaction information to enhance protein identification
  • Reviewed the status quo in using information from other sources to improve mass spectrometry-based protein detection
  • Learned current state-of-the-art graph neural network models, such as GCNConv, GraphSAGE, and GAT
  • Implemented graph neural network deep learning framework for enhancing mass spectrometry proteomics protein detection with protein-protein interaction information

Does Cancer History Drive COVID-19 Outcome? A Large-scale Matched Cohort Analysis
  • Debugged SAS code from other biostatisticians
  • Translated SAS code into R code to test the compatibility of R and SAS in our routine analyses
  • Developed R in-house packages to automate analysis pipeline
  • Offered additional expertise from clinical medicine and biostatistics perspective in topic development and modeling
July 2022 - Present

Assistant Bioinformatics Analyst

Children's Hospital of Atlanta
Advisor: Prof. Rabindra M Tirouvanziam, Ph.D.
  • Processed raw data from microarray and RNA-seq using R (beadarray) and Linux command line (bowtie)
  • Analyzed microarray and RNA-seq data with differential expression analysis and gene enrichment analysis using R (limma and DESeq2). Summarized analysis results with chromosome ideograms, heatmaps, bubble charts, and other plots using JavaScript, Python, and R
  • Built well-documented in-house R packages for automated RNA-seq data analysis and plotting
  • Researched genes associated with a good prognosis in severe influenza infection using PCA, Ingenuity Pathway Analysis, modular analysis, CIBERSORT, and other methods
May 2019 - August 2020

Research Assistant

Department of Biostatistics and Bioinformatics, Emory University
Advisor: Prof. Nelson Chen, Ph.D.
Expansion of a Risk Prediction Model for Healthcare facility-onset Clostridioides difficile infection in Patients Receiving Systemic Antibiotics
  • Conducted ANOVA to compare covariate means between case and control groups; constructed a multivariate logistic regression model with variables that were significantly associated with hospital-onset CDI and evaluable at hospital admission.
  • Evaluated regression model and point-based risk prediction model by receiver operating characteristic curve (ROC curve), positive predictive value, negative predictive value, sensitivity, specificity, and accuracy at various point cutoffs.

Oral Thiazide Diuretic Comparison in Acute Decompensated Heart Failure
  • Compared the efficacy and safety of two thiazide diuretics in terms of weight change, 24-hour urine output (UOP), and length of stay using chi-square tests and ANCOVA in SAS.
  • Used a multivariate logistic regression model to compare ICU transfer rates between the two groups after adjusting for other factors.
April 2018 - May 2020

Education

University of Texas Health Science Center at Houston

Doctor of Philosophy
Biostatistics

GPA: 3.916/4.0

Core courses: Generalized Linear Regression, Linear Models, Stochastic Process, Statistical Computing

September 2020 - Present

Rollins School of Public Health, Emory University

Master of Science
Biostatistics

GPA: 3.961/4.0

Core courses: Probability and Distribution Theory, Statistical Inference, Linear Regression, Categorical Analysis, Survival Analysis

September 2018 - May 2020

Fudan University

Bachelor of Medicine and Bachelor of Surgery
Clinical Medicine

Rank: 80/203

Core base courses: Biochemistry, Biology, Anatomy, Physiology, Immunology, Microbiology, Pharmacology

September 2013 - July 2018

Skills

Programming Languages & Tools

Statistical Software and Programming: Python (mastery), R programming (mastery), Linux command line (familiar), SQL (proficiency), SAS (proficiency), Java (proficiency), C (familiar), SPSS (familiar), MongoDB (familiar)
Techniques: Propensity-score methods, Linear and non-linear regression, Bayesian modeling, Survival analysis, Complex survey design and analysis, Clinical trials, Bootstrap, Time series, Differential expression analysis
Software: Microsoft Office (Word, Excel, and PowerPoint), Photoshop, Premiere, and Lightroom



Awards

  • Fudan University Outstanding Graduate Student Scholar 2018
  • Third-prize team in the Clove Programming Competition 2016
  • Fudan University Academic Excellence Scholarship 2013/2014/2015
  • Excellent Volunteer of Shanghai Children's Museum 2014