NCCR Data Platform: Large-Scale Cancer Research Infrastructure

Radu Robotin
Saturday, Nov 15, 2025

Transforming Cancer Research Through Cloud Innovation

The National Cancer Institute transforms cancer research by building a secure, scalable data platform on AWS that democratizes access to critical cancer data while meeting strict federal compliance requirements.

The Challenge: Democratizing Access to Critical Cancer Data

Cancer research depends on access to comprehensive, high-quality datasets from multiple sources. The National Cancer Institute’s Division of Cancer Control and Population Sciences (NCI-DCCPS) faced a significant challenge: how to provide researchers with secure, flexible access to sensitive cancer data from multiple registries while meeting strict federal compliance requirements.

The existing approach required researchers to navigate complex data request processes, often waiting weeks for standardized datasets that didn’t perfectly match their research needs. Meanwhile, valuable research time was lost to data preparation instead of scientific discovery.

The Solution: A Modern Data Lake Architecture

CloudR Solutions designed and implemented the National Childhood Cancer Registry (NCCR) Data Platform using a modern data lake architecture that separates raw data ingestion from research-ready datasets, enabling secure data validation and transformation while maintaining complete audit trails.

Core AWS Services

Amazon S3 provides the foundation with intelligent tiering and lifecycle management. The dual-bucket approach (raw and clean data buckets) creates a secure boundary between unvalidated and research-ready data.
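
As a rough illustration of the tiering and lifecycle management mentioned above, the sketch below builds a lifecycle configuration in the JSON shape S3 accepts. The rule ID, the `incoming/` prefix, and the 365-day retention value are assumptions made for this example, not details of the NCCR platform.

```python
import json


def build_lifecycle_rules(raw_retention_days: int) -> dict:
    """Return a lifecycle configuration in the shape S3 expects.

    All names and values here are hypothetical placeholders.
    """
    if raw_retention_days <= 0:
        raise ValueError("raw_retention_days must be positive")
    return {
        "Rules": [
            {
                "ID": "tier-then-expire-raw",        # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": "incoming/"},   # hypothetical prefix
                # Move objects into Intelligent-Tiering right away.
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
                # Delete raw objects after the assumed retention window.
                "Expiration": {"Days": raw_retention_days},
            }
        ]
    }


config = build_lifecycle_rules(raw_retention_days=365)
print(json.dumps(config, indent=2))
```

A dictionary like this could be applied to the raw bucket through the S3 API or expressed as a lifecycle rule in infrastructure code; the clean bucket would likely need different retention settings.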

AWS Glue powers the platform’s data transformation engine, converting raw CSV files into optimized Parquet format while applying data quality rules and schema validation. The serverless nature allows automatic scaling based on data volume.
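
To make the idea of data quality rules and schema validation concrete, here is a minimal standard-library sketch of row-level validation on a CSV file. It is not the platform's Glue job: the column names and rules are hypothetical, and the Parquet conversion step is left out.

```python
import csv
import io

# Hypothetical schema: column name -> check returning True when the value is valid.
SCHEMA = {
    "patient_id": lambda v: v.strip() != "",
    "diagnosis_year": lambda v: v.isdecimal() and 1900 <= int(v) <= 2100,
    "age_at_diagnosis": lambda v: v.isdecimal() and 0 <= int(v) <= 120,
}


def validate_csv(text):
    """Split CSV rows into valid rows and (line number, failed columns) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(SCHEMA) - set(reader.fieldnames or [])
    if missing:
        # Schema validation: refuse files that lack required columns.
        raise ValueError(f"missing columns: {sorted(missing)}")
    valid, rejected = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        bad = [col for col, check in SCHEMA.items() if not check(row[col] or "")]
        if bad:
            rejected.append((line_no, bad))
        else:
            valid.append(row)
    return valid, rejected


SAMPLE = (
    "patient_id,diagnosis_year,age_at_diagnosis\n"
    "A001,2015,7\n"
    ",2016,9\n"
    "A003,20x6,12\n"
)

valid, rejected = validate_csv(SAMPLE)
print(len(valid), "valid rows")
print("rejected:", rejected)
```

Keeping rejected rows together with the reason they failed, rather than dropping them silently, is one way to support the audit trail described earlier.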

AWS Step Functions serves as the orchestration backbone, managing complex ETL workflows with parallel processing capabilities. This was critical for handling multiple data sources simultaneously while maintaining data lineage and error handling.
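
The parallel processing and error handling described above might look roughly like the following Amazon States Language definition, built here as a Python dictionary. This is a sketch under assumptions: the state names, Glue job names, and retry settings are invented for illustration, with one branch per data source.

```python
import json

# One branch per source; these identifiers are hypothetical.
SOURCES = ["ctc", "area-based-measures", "cog", "medical-claims", "pharmacy-claims"]


def glue_task(job_name):
    """A Task state that starts a Glue job and waits for it to finish."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }
        ],
        "End": True,
    }


def build_state_machine(sources):
    """Build a definition that transforms all sources in parallel."""
    branches = [
        {
            "StartAt": f"Transform-{s}",
            "States": {f"Transform-{s}": glue_task(f"transform-{s}")},
        }
        for s in sources
    ]
    return {
        "Comment": "Illustrative ETL workflow with one parallel branch per source",
        "StartAt": "TransformAllSources",
        "States": {
            "TransformAllSources": {
                "Type": "Parallel",
                "Branches": branches,
                # Any branch failure routes to a single failure state.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RecordFailure"}],
                "Next": "PublishCleanData",
            },
            "PublishCleanData": {"Type": "Succeed"},
            "RecordFailure": {
                "Type": "Fail",
                "Error": "EtlFailed",
                "Cause": "At least one source failed to transform",
            },
        },
    }


definition = build_state_machine(SOURCES)
print(json.dumps(definition, indent=2))
```

With a Parallel state, the workflow moves on only after every branch succeeds, so a failure in one source does not publish a partial dataset.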

Amazon Athena enables researchers to write custom SQL queries for precise data extraction, moving away from one-size-fits-all datasets to personalized research packages.
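
As an illustration of how a variable selection could be turned into a query, the sketch below builds a SQL string from an allow-list of variables. The table name, the variable names, and the year filter are all hypothetical; the platform's actual schema and query logic are not described in this article.

```python
import re

# Hypothetical subset of the available variables.
AVAILABLE_VARIABLES = {
    "patient_id",
    "diagnosis_year",
    "cancer_site",
    "age_at_diagnosis",
    "registry",
}

_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def build_cut_query(table, variables, min_year, max_year):
    """Build a SELECT for the requested variables, rejecting anything unknown."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"invalid table name: {table!r}")
    if not variables:
        raise ValueError("at least one variable is required")
    unknown = [v for v in variables if v not in AVAILABLE_VARIABLES]
    if unknown:
        raise ValueError(f"unknown variables: {unknown}")
    columns = ", ".join(f'"{v}"' for v in variables)
    # int() ensures the year bounds are numbers, never arbitrary SQL text.
    return (
        f'SELECT {columns} FROM "{table}" '
        f'WHERE "diagnosis_year" BETWEEN {int(min_year)} AND {int(max_year)}'
    )


query = build_cut_query("nccr_clean", ["patient_id", "cancer_site"], 2010, 2020)
print(query)
```

Checking every requested name against a fixed allow-list means user input never reaches the query as free text, which matters when researchers define their own cuts.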

Amazon QuickSight delivers embedded analytics directly in the research interface, allowing interactive data exploration and cohort discovery without requiring researchers to download entire datasets.

External Data Sources

The platform integrates five major data sources:

Consolidated Tumor Case (CTC) - adjudicated data from population-based cancer registries, including SEER registries
Area-Based Measures - non-clinical variables from the U.S. Census Bureau that provide broader health context
Children’s Oncology Group (COG) - data from COG studies including clinical trials and registry protocols
Medical Claims - diagnosis, enrollment, and procedure data from insurance claims
Pharmacy Claims - outpatient pharmacy medication dispensing data

Security and Compliance: Built-in from Day One

Meeting FISMA requirements while enabling research flexibility required security controls at every layer:

Network Isolation: Data uploads restricted to specific CIDR blocks with VPC isolation
Encryption: End-to-end encryption, with AWS KMS keys for data at rest and TLS for data in transit
Access Control: Integration with NCI’s SEER system via Amazon Cognito with granular role-based permissions
Audit Trail: Every data access logged using CloudTrail and CloudWatch monitoring
Data Lifecycle: Automated retention policies with S3 Intelligent-Tiering for cost optimization
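
The network isolation and encryption controls listed above could be expressed in part as an S3 bucket policy. The following is a minimal sketch, assuming a hypothetical bucket name and a documentation-only CIDR range; it is not the platform's actual policy.

```python
import ipaddress
import json


def build_upload_policy(bucket, allowed_cidrs):
    """Deny uploads from outside the allowed CIDR blocks and deny non-TLS requests."""
    for cidr in allowed_cidrs:
        # Raises ValueError if a CIDR block is malformed.
        ipaddress.ip_network(cidr)
    bucket_arn = f"arn:aws:s3:::{bucket}"
    object_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsOutsideAllowedCidrs",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": object_arn,
                "Condition": {"NotIpAddress": {"aws:SourceIp": allowed_cidrs}},
            },
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, object_arn],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


# Hypothetical bucket name; 203.0.113.0/24 is a reserved documentation range.
policy = build_upload_policy("example-raw-bucket", ["203.0.113.0/24"])
print(json.dumps(policy, indent=2))
```

Note that `aws:SourceIp` matches public source addresses only. Requests that arrive through a VPC endpoint would need a different condition key, such as `aws:SourceVpce`.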

The Game Changer: Precision Data Cuts

One of the platform’s most innovative features is its ability to generate customized, research-specific datasets on demand. Instead of providing massive, generic datasets, researchers can:

Explore interactively, using embedded QuickSight dashboards to visualize available data and define cohorts
Select precisely the variables they need from more than 500 available
Automatically generate analysis-ready packages with documentation and import scripts
Securely receive time-limited access to their custom dataset

This approach reduced data preparation time from weeks to hours while improving security by minimizing data exposure.

Infrastructure as Code: Enabling Rapid Innovation

The entire platform is defined using AWS Cloud Development Kit (CDK), providing:

Consistent Environments: Identical infrastructure across development, staging, and production
Rapid Iteration: Branch-based development with isolated testing environments
Security Testing: Automated security scanning with GitHub Advanced Security
Reliable Deployments: Rollback capabilities and infrastructure versioning

Results: Accelerating Cancer Research

The platform has transformed cancer research workflows:

98% reduction in data preparation time - from weeks to hours
Self-service analytics enabling researchers to explore data independently
Enhanced security with minimal data exposure through precision cuts
Improved collaboration through standardized data formats and shared dashboards
Faster time to insights with analysis-ready datasets and automated documentation

Key Technical Lessons Learned

Orchestration First: AWS Step Functions proved essential for managing complex, multi-stage data processing
Data Quality as Code: Implementing validation rules in Glue jobs early prevented downstream issues
Cost-Conscious Architecture: S3 Intelligent-Tiering and lifecycle policies optimize storage costs automatically
Security by Design: Building security controls into every component creates a more robust security posture
User-Centric Development: Regular feedback sessions with researchers drove feature development

Conclusion

The NCCR Data Platform demonstrates how the right combination of AWS services can solve complex scientific challenges while maintaining the highest security standards. The key to success was treating security, scalability, and user experience as equally important requirements from day one.

Every user can tailor their dataset down to the individual variable, choosing from more than 500. The resulting cohort is visualized immediately in QuickSight, where users can adjust and fine-tune each variable in the data request. This lets users preview their entire data request before submitting documentation for access.

For organizations facing similar challenges in scientific data management, the NCCR Data Platform offers a proven blueprint: start with security, embrace infrastructure as code, invest in data quality, and build for your users’ actual workflows.

CloudR Solutions specializes in building FISMA-compliant data platforms on AWS for healthcare and life sciences organizations. Contact us to learn how we can accelerate your research mission.