Transforming Cancer Research Through Cloud Innovation
The National Cancer Institute transforms cancer research by building a secure, scalable data platform on AWS that democratizes access to critical cancer data while meeting strict federal compliance requirements.
The Challenge: Democratizing Access to Critical Cancer Data
Cancer research depends on access to comprehensive, high-quality datasets from multiple sources. The National Cancer Institute’s Division of Cancer Control and Population Sciences (NCI-DCCPS) faced a significant challenge: how to provide researchers with secure, flexible access to sensitive cancer data from multiple registries while meeting strict federal compliance requirements.
The existing approach required researchers to navigate complex data request processes, often waiting weeks for standardized datasets that didn’t perfectly match their research needs. Meanwhile, valuable research time was lost to data preparation instead of scientific discovery.
The Solution: A Modern Data Lake Architecture
CloudR Solutions designed and implemented the National Childhood Cancer Registry (NCCR) Data Platform using a modern data lake architecture that separates raw data ingestion from research-ready datasets, enabling secure data validation and transformation while maintaining complete audit trails.
Core AWS Services
Amazon S3 provides the foundation with intelligent tiering and lifecycle management. The dual-bucket approach (raw and clean data buckets) creates a secure boundary between unvalidated and research-ready data.
AWS Glue powers the platform’s data transformation engine, converting raw CSV files into optimized Parquet format while applying data quality rules and schema validation. The serverless nature allows automatic scaling based on data volume.
AWS Step Functions serves as the orchestration backbone, managing complex ETL workflows with parallel processing capabilities. This was critical for handling multiple data sources simultaneously while maintaining data lineage and error handling.
Amazon Athena enables researchers to write custom SQL queries for precise data extraction, moving away from one-size-fits-all datasets to personalized research packages.
Amazon QuickSight delivers embedded analytics directly in the research interface, allowing interactive data exploration and cohort discovery without requiring researchers to download entire datasets.
External Data Sources
The platform integrates five major data sources:
- Consolidated Tumor Case (CTC) - adjudicated data from population-based cancer registries, including SEER registries
- Area-Based Measures - non-clinical variables from US Census Department providing broader health context
- Children’s Oncology Group (COG) - data from COG studies including clinical trials and registry protocols
- Medical Claims - diagnosis, enrollment, and procedure data from insurance claims
- Pharmacy Claims - outpatient pharmacy medication dispensing data
Security and Compliance: Built-in from Day One
Meeting FISMA requirements while enabling research flexibility required security controls at every layer:
- Network Isolation: Data uploads restricted to specific CIDR blocks with VPC isolation
- Encryption: End-to-end encryption using AWS KMS for data at rest and in transit
- Access Control: Integration with NCI’s SEER system via Amazon Cognito with granular role-based permissions
- Audit Trail: Every data access logged using CloudTrail and CloudWatch monitoring
- Data Lifecycle: Automated retention policies with S3 Intelligent-Tiering for cost optimization
The Game Changer: Precision Data Cuts
One of the platform’s most innovative features is its ability to generate customized, research-specific datasets on demand. Instead of providing massive, generic datasets, researchers can:
- Explore interactively using embedded QuickSight dashboards to visualize available data and define cohorts
- Select precisely the variables they need from over 500 available variables
- Generate automatically analysis-ready packages with documentation and import scripts
- Receive securely time-limited access to their custom dataset
This approach reduced data preparation time from weeks to minutes while improving security by minimizing data exposure.
Infrastructure as Code: Enabling Rapid Innovation
The entire platform is defined using AWS Cloud Development Kit (CDK), providing:
- Consistent Environments: Identical infrastructure across development, staging, and production
- Rapid Iteration: Branch-based development with isolated testing environments
- Security Testing: Automated security scanning with GitHub Advanced Security
- Reliable Deployments: Rollback capabilities and infrastructure versioning
Results: Accelerating Cancer Research
The platform has transformed cancer research workflows:
- 98% reduction in data preparation time - from weeks to hours
- Self-service analytics enabling researchers to explore data independently
- Enhanced security with minimal data exposure through precision cuts
- Improved collaboration through standardized data formats and shared dashboards
- Faster time to insights with analysis-ready datasets and automated documentation
Key Technical Lessons Learned
- Orchestration First: AWS Step Functions proved essential for managing complex, multi-stage data processing
- Data Quality as Code: Implementing validation rules in Glue jobs early prevented downstream issues
- Cost-Conscious Architecture: S3 Intelligent-Tiering and Lifecycle Policies optimize storage costs automatically
- Security by Design: Building security controls into every component creates a more robust security posture
- User-Centric Development: Regular feedback sessions with researchers drove feature development
Conclusion
The NCCR Data Platform demonstrates how the right combination of AWS services can solve complex scientific challenges while maintaining the highest security standards. The key to success was treating security, scalability, and user experience as equally important requirements from day one.
Every user can craft their dataset down to the variable (out of more than 500 variables). The cohort is immediately visualized in QuickSight where users can adjust and fine-tune each variable in the data request. This powerful feature allows users to visualize across their data request before submitting documentation for access.
For organizations facing similar challenges in scientific data management, the NCCR Data Platform offers a proven blueprint: start with security, embrace infrastructure as code, invest in data quality, and build for your users’ actual workflows.
CloudR Solutions specializes in building FISMA-compliant data platforms on AWS for healthcare and life sciences organizations. Contact us to learn how we can accelerate your research mission.
