Accelerating Data Harmonization Using AI Agents

Transforming Cancer Data Standards with AI

Cancer centers submit data using their own terminology, but research requires standardized codes. What if AI could compress months of manual coding work into minutes of intelligent review?

The Problem: Manual Data Harmonization is a Research Bottleneck

Cancer centers submit data using their own terminology (e.g., “abdomen/pelvis region” or “left kidney”), but NCCR requires standardized NCI Thesaurus (NCIT) C-codes for research interoperability. Today, mapping thousands of data elements from raw cancer center terminology to the appropriate C-codes is a manual process characterized by:

  • Months of manual labor by trained data curators
  • Extensive knowledge of both clinical terminology and caDSR vocabularies
  • Inconsistent mapping decisions across different reviewers
  • Significant delays before data becomes available for research
  • High costs and scalability limitations as data volume grows

This bottleneck prevents timely access to critical cancer research data, slowing the pace of scientific discovery and limiting the scale at which cancer research can operate.

The Solution: Two-Layer AI Agent System

CloudR Solutions developed a two-layer AI agent system that runs inside the data submission pipeline and restructures the harmonization process:

Layer 1 - Intelligent Mapping Agent

The first layer handles the semantic matching challenge:

  1. Ingests raw data submitted by cancer centers (e.g., radiation site: “abdomen/pelvis”)
  2. Searches a local knowledge graph constructed from caDSR/NCIT vocabularies
  3. Proposes matching C-codes with confidence scores (e.g., C12664|C12767 - 95% confidence)
  4. Flags low-confidence matches (below acceptable threshold) for human review
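
The four steps above can be sketched in a few lines of Python. This is only an illustration, not the production agent: plain string similarity from the standard library stands in for the knowledge-graph search, the two-entry vocabulary reuses the codes from the example above with assumed labels, and the names `propose_mapping`, `Proposal`, and the 0.85 threshold are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical vocabulary slice: C-code -> label. The codes come from the
# example above; the labels and the lookup structure are assumptions.
VOCABULARY = {
    "C12664": "abdomen",
    "C12767": "pelvis",
}

REVIEW_THRESHOLD = 0.85  # assumed cut-off; the real threshold is not stated


@dataclass
class Proposal:
    raw_value: str
    codes: str          # pipe-separated C-codes, e.g. "C12664|C12767"
    confidence: float   # 0.0 to 1.0
    needs_review: bool  # True when confidence falls below the threshold


def best_match(term, vocabulary):
    """Return (code, score) for the label most similar to `term`."""
    scored = [
        (SequenceMatcher(None, term.lower(), label).ratio(), code)
        for code, label in vocabulary.items()
    ]
    score, code = max(scored)
    return code, score


def propose_mapping(raw_value, vocabulary=VOCABULARY, threshold=REVIEW_THRESHOLD):
    """Split a compound value on '/', match each part, and flag weak matches."""
    parts = [p.strip() for p in raw_value.split("/") if p.strip()]
    if not parts:
        # Nothing to match: always send to a human.
        return Proposal(raw_value, "", 0.0, True)
    matches = [best_match(p, vocabulary) for p in parts]
    codes = "|".join(code for code, _ in matches)
    # The proposal is only as trustworthy as its weakest part.
    confidence = min(score for _, score in matches)
    return Proposal(raw_value, codes, confidence, confidence < threshold)


if __name__ == "__main__":
    print(propose_mapping("abdomen/pelvis"))
    print(propose_mapping("left kidney"))
```

With this toy vocabulary, “abdomen/pelvis” matches both labels exactly and is not flagged, while “left kidney” has no close label and is routed to human review.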

Layer 2 - Data Harmonization Agent

The second layer transforms approved mappings into research-ready data:

  1. Takes approved mappings from Layer 1
  2. Transforms entire cancer center dataset into harmonized format using validated C-codes
  3. Ensures full traceability back to caDSR standards
  4. Generates submission-ready data for NCCR
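
A minimal sketch of this second layer, assuming records arrive as Python dictionaries and approved mappings are keyed by the normalized raw value. The function name `harmonize` and the field names (`approved_by`, `vocabulary_version`, the `_code` and `_source` suffixes) are hypothetical and chosen only to show how each harmonized value can carry its own provenance.

```python
def harmonize(records, approved, field):
    """Apply curator-approved mappings to one field of every record.

    records  : list of dicts as submitted by a cancer center
    approved : dict mapping a normalized raw value to a mapping dict with
               "codes", "approved_by" and "vocabulary_version" keys
    field    : name of the field to harmonize

    Returns (harmonized_records, unmapped_indexes). Input records are not
    modified, and rows without an approved mapping are reported, not guessed.
    """
    harmonized, unmapped = [], []
    for index, record in enumerate(records):
        raw = record[field]
        mapping = approved.get(raw.strip().lower())
        if mapping is None:
            unmapped.append(index)
            continue
        row = dict(record)  # copy so the submitted data stays untouched
        row[f"{field}_code"] = mapping["codes"]
        # Provenance kept next to the code so every value can be traced back.
        row[f"{field}_source"] = {
            "raw_value": raw,
            "approved_by": mapping["approved_by"],
            "vocabulary_version": mapping["vocabulary_version"],
        }
        harmonized.append(row)
    return harmonized, unmapped


if __name__ == "__main__":
    records = [
        {"patient_id": "P1", "radiation_site": "Abdomen/Pelvis"},
        {"patient_id": "P2", "radiation_site": "left kidney"},
    ]
    approved = {
        "abdomen/pelvis": {
            "codes": "C12664|C12767",
            "approved_by": "curator_1",            # placeholder reviewer id
            "vocabulary_version": "example-version",  # placeholder version tag
        }
    }
    print(harmonize(records, approved, "radiation_site"))
```

Returning the unmapped row indexes separately keeps the transformation honest: nothing is submitted with a code that a curator has not approved.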

Human-in-the-Loop Review

The key innovation is that human review happens only once. Curators review the AI’s proposed mappings and their confidence scores instead of manually coding every data element from scratch.

Data curators can:

  • Approve high-confidence matches in bulk (saving hours of manual searching)
  • Focus attention on uncertain cases where human expertise adds value
  • Validate edge cases flagged by the AI system
  • Maintain data quality control without manual coding overhead
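
The review workflow described above amounts to splitting proposals into two piles. The sketch below assumes each proposal is a dictionary with a `confidence` value between 0 and 1; the function name `triage` and the 0.85 threshold are illustrative choices, not values taken from the system.

```python
def triage(proposals, threshold=0.85):
    """Split proposals into a bulk-approval batch and a review queue.

    The review queue is sorted with the least certain proposals first so
    that curators spend their time where the model is weakest.
    """
    bulk = [p for p in proposals if p["confidence"] >= threshold]
    review = sorted(
        (p for p in proposals if p["confidence"] < threshold),
        key=lambda p: p["confidence"],
    )
    return bulk, review


if __name__ == "__main__":
    proposals = [
        {"raw_value": "abdomen/pelvis", "codes": "C12664|C12767", "confidence": 0.95},
        {"raw_value": "left kidney", "codes": "", "confidence": 0.40},
        {"raw_value": "pelvis region", "codes": "C12767", "confidence": 0.63},
    ]
    bulk, review = triage(proposals)
    print(len(bulk), "ready for bulk approval")
    print([p["raw_value"] for p in review], "queued for review")
```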

Benefits and Innovation

AI-Powered Semantic Matching with Confidence Scoring

Instead of humans manually searching caDSR to find appropriate codes, AI agents leverage a knowledge graph to instantly propose matches and quantify their certainty. Humans review only the proposals, approving high-confidence matches in bulk and adjudicating uncertain cases—dramatically reducing manual effort while maintaining data quality control.

The system brings AI’s pattern recognition and vocabulary knowledge to the mapping task, while keeping human expertise where it matters most: validating edge cases and low-confidence matches.

Key Advantages

  • Speed: Months of work → Minutes of review
  • Scale: Effortlessly handles growing data volumes
  • Consistency: Standardized mapping decisions across datasets
  • Quality: Confidence scoring ensures human oversight where needed
  • Traceability: Full audit trail back to caDSR standards
  • Cost-Effective: Dramatic reduction in manual labor costs

Alignment with NCI’s Mission and State-of-the-Art AI

This project aligns with current directions in AI systems and with the National Cancer Institute’s mission to accelerate cancer research, including the use of artificial intelligence to reach cures faster:

1. Eliminates Data Harmonization Bottlenecks

Removing barriers that slow cancer research is essential to NCI’s mission. By compressing months of manual data coding into minutes of AI-assisted review, we eliminate a critical bottleneck preventing timely research data availability.

2. Leverages AI for High-Impact Automation

Deploying AI where it can most accelerate breakthroughs is central to modern cancer research. Semantic mapping of clinical terminology to standardized vocabularies is precisely the type of pattern-matching task where AI excels—freeing human experts for higher-value scientific work.

3. Scales Research Data Infrastructure

As cancer research expands, manual harmonization effort grows in step with data volume, which makes it unsustainable at scale. AI agents absorb additional volume without a proportional increase in curator time, enabling the research capacity expansion necessary for advancing cancer treatment and prevention.

4. Accelerates Time-to-Research

Rapid translation of data into clinical insights is critical for cancer research progress. Our approach transforms data harmonization from a months-long blocking process into a near-real-time operation, so cancer research data becomes available for analysis shortly after curator review rather than sitting in processing queues.

5. Maintains Data Quality While Increasing Speed

Confidence scoring and human-in-the-loop review ensure that AI acceleration doesn’t compromise the data integrity required for reliable cancer research. This addresses the requirement for trustworthy, high-quality research infrastructure that supports NCI’s mission.

Technical Architecture

The system leverages several advanced technologies:

Knowledge Graph Construction

  • caDSR/NCIT vocabularies ingested and structured as a graph
  • Semantic relationships preserved between clinical terms and C-codes
  • Fast graph traversal for real-time matching
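
To make the graph idea concrete, here is a small sketch using only the standard library. It assumes the vocabulary can be reduced to (concept, related concept) edges plus a synonym list per C-code. The parent node `PLACEHOLDER_PARENT` and the synonym lists are invented for illustration and are not taken from NCIT or caDSR.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (concept, related_concept) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def build_term_index(synonyms):
    """Map every lower-cased clinical term to its C-code for direct lookup."""
    index = {}
    for code, terms in synonyms.items():
        for term in terms:
            index[term.lower()] = code
    return index


def graph_distance(graph, start, goal):
    """Breadth-first search: number of edges between two concepts, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # the two concepts are not connected


if __name__ == "__main__":
    # Illustrative edges only: both codes hang off an invented parent node.
    graph = build_graph([
        ("C12664", "PLACEHOLDER_PARENT"),
        ("C12767", "PLACEHOLDER_PARENT"),
    ])
    index = build_term_index({
        "C12664": ["abdomen", "abdominal region"],  # assumed synonyms
        "C12767": ["pelvis", "pelvic region"],      # assumed synonyms
    })
    print(index.get("pelvic region"))
    print(graph_distance(graph, "C12664", "C12767"))
```

A term index answers exact and synonym lookups in constant time, while the breadth-first distance gives a rough measure of how closely two concepts are related when no exact match exists.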

AI Agent Framework

  • Natural language processing for understanding clinical terminology
  • Semantic similarity scoring using embeddings and graph distances
  • Confidence calibration based on historical validation data
  • Active learning from human reviewer feedback
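
One plausible way to combine these signals is sketched below. The post does not specify the actual formula, so everything here is an assumption: the 0.7 weight, the `1 / (1 + distance)` proximity term, and the simple binned calibration that replaces a raw score with the observed accuracy of past proposals in the same score range.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def combined_score(embedding_sim, graph_distance, weight=0.7):
    """Blend embedding similarity with graph proximity (assumed weighting).

    graph_distance is the number of edges between the two concepts, or None
    when they are not connected, in which case proximity contributes nothing.
    """
    proximity = 0.0 if graph_distance is None else 1.0 / (1.0 + graph_distance)
    return weight * embedding_sim + (1 - weight) * proximity


def calibrate(raw_score, history, bins=10):
    """Replace a raw score with the observed accuracy of similar past scores.

    history is a list of (raw_score, was_correct) pairs taken from curator
    decisions. If the bucket has no history yet, the raw score is returned.
    """
    def bucket(score):
        return max(0, min(int(score * bins), bins - 1))

    outcomes = [ok for score, ok in history if bucket(score) == bucket(raw_score)]
    if not outcomes:
        return raw_score
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    history = [(0.92, True), (0.97, True), (0.91, False), (0.50, True)]
    raw = combined_score(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 0)
    print(raw, calibrate(raw, history))
```

Because `history` is just the record of curator approvals and rejections, every review decision can be appended to it, which is the simplest form of learning from reviewer feedback.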

Integration Pipeline

  • Seamless integration with existing data submission workflows
  • API-based architecture for cancer center submissions
  • Real-time validation and feedback to submitters
  • Automated quality checks throughout the pipeline
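
As an example of the kind of automated check that could run at submission time, the sketch below validates one record and returns human-readable errors for the submitter. The function name, field names, and error wording are hypothetical. The only format assumption is that a harmonized value is one or more C-codes (the letter C followed by digits) separated by `|`, as in the example earlier in this post.

```python
import re

# One or more C-codes separated by "|", e.g. "C12664|C12767".
CODE_PATTERN = re.compile(r"C\d+(\|C\d+)*")


def validate_submission(record, required_fields, code_fields):
    """Return a list of error messages; an empty list means the record passes."""
    errors = []
    for field in required_fields:
        value = record.get(field)
        if value is None or not str(value).strip():
            errors.append(f"missing required field: {field}")
    for field in code_fields:
        value = record.get(field)
        # Absent code fields are allowed here; malformed ones are not.
        if value is not None and not CODE_PATTERN.fullmatch(str(value)):
            errors.append(f"malformed C-code in {field}: {value!r}")
    return errors


if __name__ == "__main__":
    record = {"patient_id": "P1", "radiation_site_code": "C12664|C12767"}
    print(validate_submission(record, ["patient_id"], ["radiation_site_code"]))
```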

Results and Impact

The AI-assisted data harmonization system delivers transformative outcomes:

Efficiency Gains

  • 98% reduction in manual coding time
  • Near-real-time data availability for research
  • Consistent quality across all submissions

Research Acceleration

  • Cancer research data available in near real time instead of after months of delay
  • Higher data volume capacity without proportional staff increases
  • Improved data quality through standardized processes

Cost Savings

  • Dramatic reduction in data curation labor costs
  • Scalable solution that grows with data volume
  • More efficient allocation of human expertise

Future Directions

The platform can be further enhanced with:

  • Multi-modal learning incorporating both text and structured data
  • Federated learning across multiple cancer centers
  • Automated ontology updates as vocabularies evolve
  • Expanded vocabulary coverage beyond NCIT
  • Real-time submission feedback to cancer centers

Conclusion

AI-assisted data harmonization represents a paradigm shift in how cancer research handles data standardization. By intelligently automating the pattern-matching aspects of terminology mapping while maintaining human oversight for complex decisions, we’ve created a system that:

  • Accelerates research by eliminating data availability bottlenecks
  • Scales efficiently with growing data volumes
  • Maintains quality through confidence-scored human review
  • Supports NCI’s mission to leverage AI for accelerating cancer research and finding a cure faster

This approach demonstrates how AI can be deployed strategically—not to replace human expertise, but to amplify it by handling time-consuming pattern-matching tasks and allowing experts to focus on the complex edge cases where their judgment is truly needed.


CloudR Solutions continues to advance AI-powered data harmonization for cancer research at the National Cancer Institute. Contact us to learn how we can accelerate your research data workflows.