California Conservation Genomics Project
UC Los Angeles & UC Santa Cruz
Overview
The California Conservation Genomics Project (CCGP) is a large-scale, collaborative whole-genome sequencing (WGS) project focused on generating genomic resources to support biodiversity research, conservation, and population health monitoring across California. The project brings together researchers, institutions, and sequencing efforts to help build one of the most comprehensive multispecies regional genomics datasets assembled to date.
The goal of the project is to provide scientists and policymakers with high-quality genomic data and analyses so they can better conserve the state's species and habitats, protect natural resources, and inform strategies that keep California's people, places, and wildlife resilient to climate change now and in the coming decades.
The Problem We Solved
Next-generation sequencing (NGS) has transformed biological research, but large-scale population genomics remains difficult to execute due to the high cost, computational demands, and complexity of organizing sequencing data across many species and projects.
CCGP addresses this challenge by bringing together experts in genomics and conservation science to build shared resources and infrastructure. The project has generated a large-scale genomic library of 20,000+ sequenced individuals spanning 150+ genera and 230+ species across California's diverse ecosystems. In addition, high-quality reference genomes and population-level analysis tools have been developed to make downstream research more accessible and scalable.
My Contributions
As the CCGP's Data Wrangler, I build and maintain the infrastructure that supports large-scale genomics research. I develop and deploy Snakemake-based bioinformatics workflows for population-level whole-genome sequencing analyses, including pipelines that map sequencing reads to reference genomes and generate variant-level outputs for downstream research.
Alongside pipeline development, I manage the project's data systems, including a MongoDB database of 20,000+ samples and the storage and organization of 140,000+ FASTQ files across AWS S3 and Google Cloud. A major part of this work involves maintaining data quality through metadata validation and sequencing quality control to ensure we are producing biologically meaningful datasets.
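As an illustration of what a metadata validation check can look like, here is a minimal Python sketch. It is not the project's actual code: the field names (sample_id, species, latitude, longitude, fastq_files), the required-field list, and the paired-end assumption are all hypothetical and chosen only for the example.

```python
# Hypothetical required fields; the project's real schema is not described here.
REQUIRED_FIELDS = ("sample_id", "species", "latitude", "longitude")


def validate_sample(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one sample record."""
    problems = []

    # Flag required fields that are absent or empty.
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            problems.append(f"missing {name}")

    # Coordinates must fall inside valid geographic ranges.
    lat, lon = record.get("latitude"), record.get("longitude")
    if isinstance(lat, (int, float)) and not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if isinstance(lon, (int, float)) and not -180 <= lon <= 180:
        problems.append("longitude out of range")

    # Assumption: paired-end sequencing, so FASTQ files come in R1/R2 pairs.
    fastqs = record.get("fastq_files", [])
    if not fastqs or len(fastqs) % 2 != 0:
        problems.append("expected paired FASTQ files")

    return problems
```

A record that returns an empty list passes these checks; anything else can be routed back to the submitting researcher before the sample enters the database.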
I also coordinate the submission of sample-level datasets to NCBI repositories (SRA, BioProject, and BioSample) and collaborate with 70+ consortium researchers during quality control and data delivery. In addition, I provide ongoing technical and research support and have contributed to population-level genomics analyses (see main page).