Glucose-ML

Emory University - Augmented Health Lab

Overview

Glucose-ML is an evolving collection of continuous glucose monitoring (CGM) datasets designed to support data-centric machine learning research in diabetes. The broader goal of the project is to make CGM data more accessible to researchers by aggregating heterogeneous CGM datasets and standardizing them into a common format, accelerating the development of next-generation computing systems. At launch, the Glucose-ML database comprises 44.9 million CGM samples across 20+ datasets collected from 4,400+ individuals with Type 1 diabetes, Type 2 diabetes, prediabetes, or no diabetes.

To support researchers and promote diabetes-related research, Glucose-ML provides automated tools that download and harmonize multiple CGM datasets along with their corresponding demographic metadata. In addition, the Glucose-ML interactive database is designed to let users visually explore, compare, and filter datasets to find the ones that meet their research needs.

The Problem We Solved

CGM data is often disorganized and fragmented across studies, which creates a real barrier to reproducible research. Datasets can differ in structure, naming conventions, metadata quality, temporal resolution, and overall completeness. In practice, this means that a large amount of potentially useful CGM data is not immediately ready for machine learning or large-scale comparative analysis. These inconsistencies make it difficult to combine data sources or build workflows that generalize across cohorts.

Think about how inconvenient it would be to download genomics data without NCBI. This is the issue Glucose-ML set out to address in the diabetes world.

My Contributions

My role has focused on building the infrastructure that helps turn messy, heterogeneous CGM datasets into structured and research-ready resources. I develop Python-based workflows to ingest, clean, and standardize glucose data while organizing metadata in a way that supports downstream analysis and public-facing use.

This includes writing dataset-specific harmonization scripts that transform different CGM datasets and their metadata into a Glucose-ML-ready format. I work closely with the frontend and backend teams to design and engineer the database from the ground up. I have also contributed to the generation of derived metrics, statistical outputs, and structured data products that support the website and the broader research goals of the project.
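
To give a rough sense of what a dataset-specific harmonization step can look like, here is a minimal sketch using only the Python standard library. It is an illustration and not the actual Glucose-ML code: the dataset names (study_a, study_b), the source column names, and the output fields (subject_id, timestamp, glucose_mg_dl) are all hypothetical. The one fixed fact it relies on is the unit conversion, where 1 mmol/L of glucose is about 18.016 mg/dL.

```python
import csv
import io
from datetime import datetime

# 1 mmol/L of glucose is about 18.016 mg/dL (molar mass ~180.16 g/mol).
MGDL_PER_MMOLL = 18.016

# Hypothetical per-dataset specs: source column names, timestamp format, units.
# Real datasets would each need their own entry and possibly custom handling.
DATASET_SPECS = {
    "study_a": {"id": "PtID", "time": "DeviceDtTm", "glucose": "Glucose",
                "time_format": "%m/%d/%Y %H:%M", "unit": "mg/dL"},
    "study_b": {"id": "subject", "time": "ts", "glucose": "bg_mmol",
                "time_format": "%Y-%m-%dT%H:%M:%S", "unit": "mmol/L"},
}


def harmonize(csv_text, dataset):
    """Map one dataset's CSV rows onto a shared (assumed) schema."""
    spec = DATASET_SPECS[dataset]
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        value = raw[spec["glucose"]].strip()
        if not value:
            continue  # skip rows with a missing glucose reading
        glucose = float(value)
        if spec["unit"] == "mmol/L":
            glucose *= MGDL_PER_MMOLL  # normalize everything to mg/dL
        ts = datetime.strptime(raw[spec["time"]].strip(), spec["time_format"])
        rows.append({
            "dataset": dataset,
            "subject_id": raw[spec["id"]].strip(),
            "timestamp": ts.isoformat(),  # ISO 8601 strings sort chronologically
            "glucose_mg_dl": round(glucose, 1),
        })
    # Order readings per subject by time so downstream code can rely on it.
    rows.sort(key=lambda r: (r["subject_id"], r["timestamp"]))
    return rows


if __name__ == "__main__":
    sample = "subject,ts,bg_mmol\nx,2021-03-04T12:00:00,5.5\n"
    for row in harmonize(sample, "study_b"):
        print(row)
```

Keeping the per-dataset differences in a small spec table, with one shared function doing the conversion, is one way to keep the processing reproducible as more datasets are added. Datasets with unusual quirks would still need custom code.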

A major part of the work is not just processing data, but creating systems that make that processing reproducible, scalable, and understandable across many datasets with different quirks and constraints.