EasyGiraffe: Validating Polygenic Variant Extraction Using Graph-Based Genomics

Knowledge Graphs Meet Genomics: A Transatlantic Collaboration

Knowledge graphs (KGs) are structured representations of information that model real-world entities and their relationships in a network format. Unlike traditional databases that store data in tables, KGs organize information as nodes (entities) connected by edges (relationships) that may carry labels or attributes giving those connections semantic meaning. As directed, labeled graphs, KGs associate meaning with nodes and edges; information can be incorporated through manual curation, semi-automated extraction, or fully automated data integration. Once established, these graphs can be explored efficiently through graph traversal, search, and query operations, enabling complex reasoning over large, heterogeneous datasets.
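To make the node-and-edge idea concrete, here is a minimal sketch of a directed, labeled graph in Python. The class name, the relation labels, and the three triples are illustrative assumptions, not part of EasyGiraffe or of any hackathon dataset.

```python
from collections import defaultdict


class KnowledgeGraph:
    """A tiny directed, labeled graph: subject --relation--> object."""

    def __init__(self):
        # Maps each subject node to a list of (relation, object) edges.
        self._out = defaultdict(list)

    def add(self, subject, relation, obj):
        self._out[subject].append((relation, obj))

    def neighbors(self, subject, relation=None):
        """Objects reachable from `subject`, optionally filtered by edge label."""
        return [o for r, o in self._out.get(subject, []) if relation in (None, r)]


kg = KnowledgeGraph()
# Illustrative triples only; a curated KG would use stable identifiers.
kg.add("HBB", "associated_with", "Sickle Cell Anemia")
kg.add("HBB", "located_on", "chr11")
kg.add("rs334", "variant_of", "HBB")

# Two-hop navigation: variant -> gene -> disease.
diseases = [
    disease
    for gene in kg.neighbors("rs334", "variant_of")
    for disease in kg.neighbors(gene, "associated_with")
]
print(diseases)
```

The two-hop query at the end is the kind of graph navigation described above: each hop follows an edge with a specific label rather than joining tables.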

In parallel, biomedical investigators are increasingly adopting large language models (LLMs) to accelerate discovery. However, LLMs often lack transparency into the evidence underpinning their outputs. This opacity is problematic for biomedical research, where verifiable, evidence-based reasoning is essential. Furthermore, generative artificial intelligence (GenAI) systems are prone to hallucinations, generating plausible-sounding but false statements. To address these challenges, the graph-based retrieval-augmented generation (GraphRAG) framework integrates knowledge graph structure into the retrieval and reasoning process of LLMs. GraphRAG reduces hallucinations observed in free-form GenAI systems by constraining generation to graph-anchored evidence, and scales to heterogeneous biomedical datasets without discarding graph topology.
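The core GraphRAG idea, constraining generation to graph-anchored evidence, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names, the triples, and the prompt wording are hypothetical, and a real system would add entity linking, multi-hop retrieval, and an actual LLM call.

```python
def retrieve_evidence(triples, entities):
    """Keep only triples that touch an entity mentioned in the question."""
    wanted = set(entities)
    return [t for t in triples if t[0] in wanted or t[2] in wanted]


def build_prompt(question, evidence):
    """Format a prompt that restricts the model to the retrieved triples."""
    lines = [f"[{i}] {s} --{r}--> {o}" for i, (s, r, o) in enumerate(evidence, 1)]
    return (
        "Answer using only the numbered evidence and cite it.\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}"
    )


# Hypothetical triples for illustration.
TRIPLES = [
    ("rs334", "variant_of", "HBB"),
    ("HBB", "associated_with", "Sickle Cell Anemia"),
    ("BRCA1", "associated_with", "Breast Cancer"),
]

evidence = retrieve_evidence(TRIPLES, ["HBB"])
prompt = build_prompt("Which disease is linked to HBB?", evidence)
print(prompt)
```

Because only the two HBB triples reach the prompt, any answer the model gives can be traced back to a numbered edge in the graph.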

The Transatlantic Hackathon

EasyGiraffe was developed as one of seven transatlantic projects during the NVIDIA - AWS Open Data Knowledge Graph Hackathon held in Arlington, VA, USA in November 2025. This collaborative event brought together interdisciplinary teams of scientists, bioinformaticians, computational biologists, data scientists, and software engineers to prototype and deploy pipelines that connect heterogeneous biomedical datasets.

The hackathon leveraged:

  • AWS Open Data resources (Registry of Open Data on AWS)
  • Amazon Neptune (a managed graph database service on AWS)
  • Open-source tools to construct and integrate biomedical KGs

Collectively, these projects demonstrate methods for constructing KGs from existing biomedical datasets and show how GraphRAG can be deployed in real-world biomedical contexts, contributing new KGs and promoting evidence-grounded, graph-aware GenAI methods.

Transforming Cancer Genomics Through Pangenome Analysis

Large cancer genome repositories such as The Cancer Genome Atlas (TCGA) offer a wealth of genomic data with the potential to transform cancer diagnostics, therapeutics, and precision medicine. However, extracting meaningful polygenic insights from such complex and heterogeneous datasets remains a major challenge.

The Problem with Linear Reference Genomes

Traditional methods for variant calling rely heavily on alignment against a single linear reference genome, typically GRCh38. While effective in many settings, this approach often fails to account for the extensive genomic variation across human populations, leading to biases in variant detection and interpretation.

The consequences are significant:

  • Missed variants in populations not well-represented in the reference genome
  • Reduced accuracy in detecting structural and polygenic variants
  • Limitations in cancer genomics where multicentric and population-specific variants play pivotal roles

The Pangenomics Revolution

Recent advancements in pangenomics have introduced graph-based representations of the genome, which encode known population-level variation directly into the reference itself. Tools such as the Variation Graph (VG) toolkit enable the construction of sequence graphs from a linear reference plus known variants or from assembled genomes, producing pangenome structures that better reflect the genetic diversity of the sample cohort.
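A toy example helps show what "encoding variation into the reference" means. The sketch below is a hand-built four-node sequence graph with one SNP bubble; the node IDs and sequences are made up for illustration and do not reflect the VG toolkit's internal data structures.

```python
# Nodes hold sequence; edges say which nodes may follow each other.
# Nodes 2 and 3 are the two alleles of a single SNP (A or C).
NODES = {1: "GATT", 2: "A", 3: "C", 4: "CAGG"}
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: []}


def haplotypes(node, prefix=""):
    """Yield every sequence spelled by a path from `node` to a sink node."""
    seq = prefix + NODES[node]
    if not EDGES[node]:
        yield seq
        return
    for nxt in EDGES[node]:
        yield from haplotypes(nxt, seq)


paths = sorted(haplotypes(1))
print(paths)
```

A linear reference would contain only one of these two paths, so reads carrying the other allele would align with a mismatch. In the graph, both alleles are first-class paths.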

The GIRAFFE Mapper

The GIRAFFE mapper, a component of the VG toolkit, accelerates this process by providing fast and accurate read alignment against variation graphs. These graphs:

  • Improve alignment accuracy across diverse populations
  • Enhance detection of structural and polygenic variants
  • Better represent genetic diversity of the sample cohort
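For orientation, a typical map, pack, and call sequence with the VG toolkit can be sketched as below. The helper only builds command strings and runs nothing. The file names are hypothetical, and the flags follow commonly documented vg usage as an assumption, so check them against `vg help` for the installed version before relying on them.

```python
import shlex


def giraffe_pipeline(gbz, fastq, prefix):
    """Return the map, pack, and call steps as shell strings (nothing is run)."""
    q = shlex.quote
    gam, pack, vcf = (f"{prefix}.{ext}" for ext in ("gam", "pack", "vcf"))
    return [
        # Align reads to the graph, writing a GAM alignment file.
        f"vg giraffe -Z {q(gbz)} -f {q(fastq)} > {q(gam)}",
        # Summarize read support per graph position.
        f"vg pack -x {q(gbz)} -g {q(gam)} -o {q(pack)}",
        # Genotype variants from that support into a VCF.
        f"vg call {q(gbz)} -k {q(pack)} > {q(vcf)}",
    ]


steps = giraffe_pipeline("graph.gbz", "reads.fq", "sample1")
for step in steps:
    print(step)
```

Building the commands as data makes them easy to log, review, or hand to a workflow manager before anything touches the graph files.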

Introducing EasyGiraffe: A Validation Framework

Despite these methodological improvements, there remains a critical need for comprehensive validation frameworks. Existing pipelines handle production steps (alignment, variant calling, and indexing) but do not offer an integrated way to evaluate the accuracy and reproducibility of polygenic variant extraction.

EasyGiraffe addresses this gap by providing a simulator-based validation framework tailored for multicentric polygenic variant extraction.

Methods & Implementation

The JaSaPaGe Backbone

We used the JaSaPaGe pangenome graph developed by Kulmatov et al. as the backbone for variant extraction. This graph was constructed from whole-genome sequences of:

  • 10 Japanese individuals
  • 9 Saudi Arabian individuals

The graph encodes common population-level variants and serves as the reference structure.

Simulation Pipeline

  1. Synthetic Read Generation: Our simulator generates synthetic sequencing reads (FASTQ files) embedded with known variants
  2. Realistic Configurations: Reads reflect realistic multicentric polygenic configurations in both short-read and long-read formats
  3. Graph-Based Alignment: Reads are mapped using the GIRAFFE mapper from the VG toolkit, outputting alignment files
  4. Variant Calling: The VG pipeline produces VCF files containing detected variants
  5. Validation: Detected variants are compared against known ground truth data
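The first two steps can be illustrated with a deliberately small simulator. This is a sketch under simplifying assumptions, not the EasyGiraffe simulator itself: it handles SNPs only, produces error-free single-end reads, and uses a made-up reference string and function name.

```python
import random


def simulate_reads(reference, snps, n_reads, read_len, seed=0):
    """Embed SNPs (0-based position -> alt base) and sample error-free reads.

    Returns FASTQ text and the ground truth as a list of (pos, ref, alt).
    """
    rng = random.Random(seed)  # seeded so replicates are reproducible
    seq = list(reference)
    truth = []
    for pos, alt in sorted(snps.items()):
        truth.append((pos, seq[pos], alt))
        seq[pos] = alt
    mutated = "".join(seq)

    records = []
    for i in range(n_reads):
        start = rng.randrange(len(mutated) - read_len + 1)
        read = mutated[start:start + read_len]
        # 'I' encodes Phred quality 40 in the standard Phred+33 scheme.
        records.append(f"@read{i}_pos{start}\n{read}\n+\n{'I' * read_len}\n")
    return "".join(records), truth


REF = "ACGTACGTTAGCCGATAGGCTTAACG"  # hypothetical toy reference
fastq, truth = simulate_reads(REF, {8: "G"}, n_reads=3, read_len=10, seed=42)
print(truth)
print(fastq)
```

The returned `truth` list is what step 5 compares against: because the variants were planted deliberately, every call in the final VCF can be classified as correct or not.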

Supported Variant Types

The simulation generates multiple variant types:

  • SNPs (single-nucleotide polymorphisms)
  • MNPs (multi-nucleotide polymorphisms)
  • InDels (insertions and deletions)
  • Inversions
  • Translocations
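Each variant type corresponds to a simple edit on a sequence string. The functions below are an illustrative sketch with hypothetical names; in particular, the translocation here moves a segment within one sequence, whereas real translocations usually involve two chromosomes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def substitute(seq, pos, alt):
    """SNP when len(alt) == 1, MNP when longer: overwrite bases in place."""
    return seq[:pos] + alt + seq[pos + len(alt):]


def insert(seq, pos, ins):
    """Insertion: add `ins` before index `pos`."""
    return seq[:pos] + ins + seq[pos:]


def delete(seq, pos, length):
    """Deletion: drop `length` bases starting at `pos`."""
    return seq[:pos] + seq[pos + length:]


def invert(seq, start, end):
    """Inversion: replace seq[start:end] with its reverse complement."""
    segment = seq[start:end].translate(COMPLEMENT)[::-1]
    return seq[:start] + segment + seq[end:]


def translocate(seq, start, end, dest):
    """Move seq[start:end] to index `dest` of the remaining sequence."""
    segment, rest = seq[start:end], seq[:start] + seq[end:]
    return rest[:dest] + segment + rest[dest:]


ref = "AACCGGTT"
print(substitute(ref, 2, "T"))    # SNP
print(substitute(ref, 2, "TG"))   # MNP
print(insert(ref, 4, "TTT"))      # insertion
print(delete(ref, 2, 2))          # deletion
print(invert(ref, 0, 4))          # inversion
print(translocate(ref, 0, 2, 4))  # translocation
```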

Workflow Overview

EasyGiraffe provides a complete benchmarking framework:

Setup Process

  1. Clone the repository from GitHub
  2. Download pangenome data using download-pangenome-data.sh
  3. Install VG toolkit with install_vg.sh
  4. Install simulation tools via install-tools.sh
  5. Generate variants by running disease_to_variant_resolver.sh with a disease name (e.g., “Sickle Cell Anemia”)

Validation Workflow

Once setup is complete, you have a validated VG environment ready for:

  • Read mapping against the pangenome graph using VG GIRAFFE
  • Variant calling to produce VCF files
  • Evaluation of detected variants against known ground truth
  • Metrics calculation: Precision, recall, and F1-scores
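The metrics step can be sketched as a set comparison between ground-truth and called variants. The exact-match key of (chromosome, position, ref, alt) is an assumption made for illustration; production benchmarking tools typically also normalize variant representations before comparing.

```python
def score_calls(truth, calls):
    """Score called variants against ground truth by exact match."""
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # planted and detected
    fp = len(calls - truth)   # detected but never planted
    fn = len(truth - calls)   # planted but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical variants keyed as (chrom, pos, ref, alt).
truth = [("chr11", 100, "A", "T"), ("chr11", 250, "G", "C"),
         ("chr2", 40, "C", "CAT"), ("chr7", 900, "TG", "T")]
calls = [("chr11", 100, "A", "T"), ("chr11", 250, "G", "C"),
         ("chr2", 40, "C", "CAT"), ("chr9", 12, "A", "G")]

metrics = score_calls(truth, calls)
print(metrics)
```

Here three of four planted variants are recovered and one call is spurious, so precision, recall, and F1 all come out to 0.75.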

This makes EasyGiraffe particularly useful for validating variant calling pipelines across diverse population structures.

Key Features

Adaptable Parameters

The simulator was designed with adaptable parameters, enabling:

  • Precise control over minor allele frequencies
  • Population-specific scenarios reflecting diverse genetic backgrounds
  • Reproducibility across multiple replicates
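One way to picture these parameters is as a small, immutable configuration object driving a seeded random generator. The field names and the genotype model below (two independent allele draws per diploid individual) are assumptions for illustration, not the simulator's actual interface.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    maf: float         # minor allele frequency of the embedded variant
    n_samples: int     # diploid individuals per replicate
    n_replicates: int  # independent replicates to generate
    seed: int          # fixes the random stream for reproducibility


def simulate_genotypes(cfg):
    """Return one list of alt-allele counts (0, 1, or 2) per replicate."""
    rng = random.Random(cfg.seed)
    return [
        [sum(rng.random() < cfg.maf for _ in range(2)) for _ in range(cfg.n_samples)]
        for _ in range(cfg.n_replicates)
    ]


cfg = SimConfig(maf=0.2, n_samples=5, n_replicates=3, seed=7)
genotypes = simulate_genotypes(cfg)
print(genotypes)
```

Rerunning with the same `SimConfig` reproduces the same genotypes, while changing only `maf` models a population in which the variant is rarer or more common.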

System Requirements

  • Unix-based system with bash, wget, curl, git
  • Python 3.7+
  • conda
  • C++ compiler
  • At least 100 GB of free disk space
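A quick pre-flight check can catch missing requirements before a long download starts. The script below is a convenience sketch, not part of the EasyGiraffe repository; the compiler executable names it looks for are an assumption.

```python
import shutil
import sys

REQUIRED_TOOLS = ("bash", "wget", "curl", "git", "conda")
CXX_CANDIDATES = ("c++", "g++", "clang++")  # assumed compiler names
MIN_PYTHON = (3, 7)
MIN_FREE_GB = 100


def preflight(path=".", tools=REQUIRED_TOOLS, compilers=CXX_CANDIDATES,
              min_free_gb=MIN_FREE_GB):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append("Python 3.7+ is required")
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"missing tool: {tool}")
    if compilers and not any(shutil.which(c) for c in compilers):
        problems.append("no C++ compiler found on PATH")
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.0f} GB free, need {min_free_gb}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```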

Results & Impact

EasyGiraffe successfully:

  • Extracts pathogenic variants from user-queried phenotypic features or diseases
  • Generates synthetic sequence reads with embedded variants
  • Outputs assigned variants from the simulator with high accuracy
  • Validates variant calling pipelines across diverse population structures
  • Provides benchmarking metrics for genomic analysis accuracy

Future Vision

Future development could expand the capabilities of the simulator to:

  • Generate long reads up to 1 megabase in length
  • Broaden support for complex variant types, including large InDels, inversions, and translocations
  • Extend population diversity with additional reference genomes
  • Integrate machine learning for improved variant prediction
  • Support clinical workflows for precision medicine applications

Open Source & Availability

All tools and data used in EasyGiraffe are publicly available.

Hackathon Details:
Developed at the NVIDIA - AWS Open Data Knowledge Graph Hackathon
Arlington, VA, USA, November 2025
Submitted: 25 Nov 2025

All repositories include documentation, installation instructions, and example usage. Custom scripts and analysis pipelines are available from the corresponding authors upon reasonable request.

Conclusion

EasyGiraffe represents a significant advancement in validating polygenic variant extraction from cancer genomes. By leveraging pangenome graphs and providing a comprehensive simulation framework, it addresses critical gaps in genomic analysis accuracy and reproducibility.

The framework is particularly valuable for:

  • Cancer genomics research requiring population-specific variant detection
  • Precision medicine applications needing validated variant calling pipelines
  • Bioinformatics tool development requiring robust benchmarking frameworks
  • Population genetics studies exploring diverse genetic backgrounds

As cancer research increasingly relies on large-scale genomic data, tools like EasyGiraffe become essential for ensuring the accuracy and reliability of variant detection across diverse populations.


EasyGiraffe is an open-source project developed through collaborative bioinformatics efforts. For more information about applying pangenomics to precision medicine, contact CloudR Solutions.