Coding Projects

A selection of my work from over the years.

Diabetes Classification Case Studies

I developed two machine learning case studies using harmonized open-source CGM datasets from the Glucose-ML collection to evaluate how well diabetes status can be classified from participant-level glucose features. The first, a benchmark, trained logistic regression, random forest, and XGBoost models across 13 datasets to classify participants as having T1D, T2D, prediabetes, or no diabetes. The second compared single-dataset and multi-dataset training to evaluate how dataset diversity and sample size influence classification performance.
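As a rough illustration of what "participant-level glucose features" can look like, here is a minimal sketch that collapses one participant's CGM readings into a few summary values. The function name `glucose_features`, the specific features, and the 70 to 180 mg/dL range are assumptions chosen for illustration, not necessarily the feature set used in the case studies.

```python
from statistics import mean, pstdev


def glucose_features(readings_mg_dl, low=70, high=180):
    """Summarize one participant's CGM readings (mg/dL) into a feature dict.

    Hypothetical helper: the feature set and thresholds are illustrative.
    """
    if not readings_mg_dl:
        raise ValueError("no readings for this participant")

    avg = mean(readings_mg_dl)
    sd = pstdev(readings_mg_dl)  # population standard deviation
    in_range = sum(low <= g <= high for g in readings_mg_dl)

    return {
        "mean": avg,
        "sd": sd,
        "cv": sd / avg,  # coefficient of variation
        "time_in_range": in_range / len(readings_mg_dl),
    }


if __name__ == "__main__":
    print(glucose_features([100, 100, 200, 200]))
```

One row of features like this per participant could then be fed to any tabular classifier.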

Python • Scikit-Learn • Machine Learning

Glucose-ML workflow preview

Glucose-ML Standardization Workflows

I developed a Python-based data acquisition and harmonization pipeline for the Glucose-ML collection to support reproducible use of open-source continuous glucose monitoring (CGM) datasets. The workflow automates dataset downloads from public repositories, validates and extracts raw files, and standardizes heterogeneous CGM data into consistent participant-level formats. It harmonizes dataset-specific differences in file structure, timestamps, glucose units, metadata, and coverage criteria to generate analysis-ready CGM time series for downstream machine learning and statistical analyses.
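The timestamp and unit harmonization described above can be sketched as follows. This is a simplified illustration, not the pipeline's actual code: the function names, the list of timestamp formats, and the output field names are all assumptions. The conversion factor follows from the molar mass of glucose (about 180.16 g/mol), so 1 mmol/L is roughly 18.016 mg/dL.

```python
from datetime import datetime

# 1 mmol/L of glucose is about 18.016 mg/dL (molar mass ~180.16 g/mol).
MGDL_PER_MMOLL = 18.016

# Hypothetical set of timestamp layouts seen across source datasets.
TIMESTAMP_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%m/%d/%Y %H:%M",
    "%Y-%m-%dT%H:%M:%S",
)


def parse_timestamp(text):
    """Try each known layout in turn and return a datetime."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {text!r}")


def harmonize_record(participant_id, timestamp, glucose, unit):
    """Return one reading in a common schema: ISO timestamp, mg/dL."""
    unit = unit.strip().lower()
    if unit == "mmol/l":
        value = float(glucose) * MGDL_PER_MMOLL
    elif unit == "mg/dl":
        value = float(glucose)
    else:
        raise ValueError(f"unsupported glucose unit: {unit!r}")

    return {
        "participant_id": str(participant_id),
        "timestamp": parse_timestamp(timestamp).isoformat(),
        "glucose_mg_dl": round(value, 1),
    }


if __name__ == "__main__":
    print(harmonize_record("p01", "03/15/2021 08:05", "5.5", "mmol/L"))
```

Failing loudly on an unknown unit or timestamp layout, rather than guessing, keeps silent conversion errors out of the harmonized output.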

Python • Bash

Glucose-ML workflow preview

GONE Workflow

I developed a Snakemake-based pipeline to run GONE analyses for estimating recent effective population size from whole-genome sequencing data. The workflow automates population assignment, VCF filtering, SNP subsampling, and conversion to PLINK formats, and incorporates multiple random seed replicates to improve the robustness of inference. It also integrates downstream analyses, including runs of homozygosity and nucleotide diversity, enabling scalable and reproducible population genomic analyses.
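The idea behind seeded SNP subsampling replicates can be sketched in plain Python. This is an illustration under assumptions, not the workflow's actual rule: the function names are hypothetical, and SNPs are represented as simple ID strings rather than VCF records.

```python
import random


def subsample_snps(snp_ids, n_keep, seed):
    """Pick n_keep SNPs reproducibly, preserving their original order."""
    if n_keep >= len(snp_ids):
        return list(snp_ids)
    rng = random.Random(seed)  # private generator, so replicates are independent
    keep = sorted(rng.sample(range(len(snp_ids)), n_keep))
    return [snp_ids[i] for i in keep]


def replicate_subsamples(snp_ids, n_keep, seeds):
    """Build one subsample per seed, keyed by the seed that produced it."""
    return {seed: subsample_snps(snp_ids, n_keep, seed) for seed in seeds}


if __name__ == "__main__":
    snps = [f"chr1:{pos}" for pos in range(1000, 1100)]
    for seed, subset in replicate_subsamples(snps, 5, [1, 2, 3]).items():
        print(seed, subset)
```

Recording the seed alongside each replicate means any single run can be regenerated exactly, while the spread across seeds gives a sense of how sensitive the estimate is to which SNPs were drawn.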

Snakemake Pipeline • Python • Bioinformatics • HPC

GONE workflow preview

Stairway Workflow

I developed a Snakemake-based pipeline to perform Stairway Plot analyses for reconstructing long-term effective population size (Ne) trajectories from reference haplotypes. The workflow integrates outputs from PAV and MSMC, automates input preparation (including reference genome configuration and chromosome extraction), and incorporates validation steps to ensure reliable downstream inference. Applied across 100+ custom reference genomes from the California Conservation Genomics Project, this pipeline supports scalable and reproducible demographic reconstruction.

Snakemake Pipeline • Python • Bash • R • Bioinformatics • HPC

Stairway workflow preview

NCBI SRA/BioSample Metadata Generator

I developed a Python tool to convert MongoDB sample metadata into structured SRA and BioSample sheets, streamlining the NCBI submission process. The tool handles read pair matching, validates sample metadata, and adapts formatting based on species type (plant, vertebrate, or invertebrate).
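Read pair matching of the kind mentioned above can be sketched as grouping FASTQ file names by sample and read number. The `_R1`/`_R2` naming convention, the regular expression, and the function name `match_read_pairs` are assumptions for illustration; the actual tool may identify pairs differently.

```python
import re
from collections import defaultdict

# Assumed naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz,
# optionally with a numeric suffix such as _001 before the extension.
READ_PATTERN = re.compile(
    r"^(?P<sample>.+)_R(?P<read>[12])(_\d+)?\.f(ast)?q(\.gz)?$"
)


def match_read_pairs(filenames):
    """Return ({sample: (R1, R2)}, sorted list of files with no mate)."""
    by_sample = defaultdict(dict)
    unmatched = []

    for name in filenames:
        m = READ_PATTERN.match(name)
        if m is None:
            unmatched.append(name)  # not a recognizable read file
            continue
        by_sample[m["sample"]][m["read"]] = name

    pairs = {}
    for sample, reads in by_sample.items():
        if "1" in reads and "2" in reads:
            pairs[sample] = (reads["1"], reads["2"])
        else:
            unmatched.extend(reads.values())  # R1 without R2, or vice versa

    return pairs, sorted(unmatched)


if __name__ == "__main__":
    files = ["A_R1.fastq.gz", "A_R2.fastq.gz", "B_R1.fastq.gz", "notes.txt"]
    print(match_read_pairs(files))
```

Returning the unmatched files separately makes it easy to flag incomplete samples before a submission sheet is generated.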

Python • MongoDB • NCBI