Project Overview
Leanda.io is an extensible open science data repository developed by ArqiSoft that revolutionizes how researchers consume, process, visualize, and analyze scientific data. Built on a microservice-based architecture, Leanda.io supports multiple data types, formats, and volumes while providing real-time data curation, machine learning capabilities, and ontology-based property assignment. Our challenge was to migrate this complex distributed application from its original infrastructure to the AWS cloud while maintaining its performance and scalability and adhering to cloud-native best practices.
The Challenge
Leanda.io’s architecture consists of numerous interconnected microservices supporting diverse scientific domains, from chemical structures and reactions to microscopy files and machine learning models. The system requires:
- Multiple Data Domain Services: Handling generic images, PDFs, office files, tabular data (CSV/TSV), chemical structures (SDF, MOL), chemical reactions (RXN), crystals, microscopy files, and ML models
- Real-time Processing Pipeline: Automated data mining, text extraction, and format conversion during file deposition
- Resource-Intensive ML Framework: Embedded machine learning capabilities for drug discovery workflows, model training, and predictive analytics
- Complex Service Dependencies: Infrastructure, monitoring, and application services that must work seamlessly together
- Security Model: Support for private, shared, and public data access patterns
Migrating such a sophisticated distributed system required careful orchestration of containerized services, robust networking, and adherence to AWS security and operational standards.
Solution Architecture
We implemented a comprehensive AWS migration strategy leveraging modern container orchestration and infrastructure-as-code practices:
Container Orchestration with ECS
Amazon Elastic Container Service (ECS) served as the foundation for running Leanda.io’s microservices architecture. We:
- Containerized all application components using Docker, ensuring consistent environments across development and production
- Configured ECS task definitions for each microservice, specifying resource requirements, networking modes, and health checks
- Implemented ECS Service Auto Scaling to handle varying workload demands, particularly for resource-intensive ML operations
- Utilized ECS task placement strategies to optimize resource utilization across the cluster
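As a rough illustration of what a per-service task definition with resource requirements, networking mode, and a health check can look like, the sketch below builds the payload as a plain Python dict. The key names follow the ECS RegisterTaskDefinition request shape; the service name, image URI, port, CPU and memory figures, and the `/health` path are all hypothetical placeholders, not Leanda.io's actual settings.

```python
import json


def build_task_definition(family, image, cpu, memory_mib, port, health_path="/health"):
    """Return an ECS task definition payload as a plain dict.

    All argument values are illustrative assumptions; only the key names
    come from the ECS RegisterTaskDefinition request shape.
    """
    health_command = f"curl -f http://localhost:{port}{health_path} || exit 1"
    return {
        "family": family,
        "networkMode": "awsvpc",
        "requiresCompatibilities": ["EC2"],
        # Task-level cpu and memory are passed as strings in the ECS API.
        "cpu": str(cpu),
        "memory": str(memory_mib),
        "containerDefinitions": [
            {
                "name": family,
                "image": image,
                "essential": True,
                "portMappings": [{"containerPort": port, "protocol": "tcp"}],
                "healthCheck": {
                    "command": ["CMD-SHELL", health_command],
                    "interval": 30,
                    "timeout": 5,
                    "retries": 3,
                    "startPeriod": 60,
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical service name and image URI, for illustration only.
    definition = build_task_definition(
        "imaging-service",
        "123456789012.dkr.ecr.us-east-1.amazonaws.com/imaging-service:1.0.0",
        cpu=512,
        memory_mib=1024,
        port=8080,
    )
    print(json.dumps(definition, indent=2))
```

A dict like this could be passed to an SDK call or serialized to JSON for a CLI or template; the health check command assumes `curl` is present in the image.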
Container Registry with ECR
Amazon Elastic Container Registry (ECR) provided secure, scalable storage for all container images:
- Created dedicated repositories for each microservice component
- Implemented image scanning for vulnerability detection
- Configured lifecycle policies to automatically clean up unused images and manage storage costs
- Integrated with ECS for seamless image deployment and version management
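The lifecycle policies mentioned above are JSON documents attached to each repository. The following sketch generates one such document; the 14-day window, the count of 20 retained images, and the `v` tag prefix are assumed values chosen for illustration and are not taken from the project.

```python
import json


def build_lifecycle_policy(untagged_days=14, keep_tagged=20, tag_prefixes=("v",)):
    """Return an ECR lifecycle policy document as a JSON string.

    The thresholds and tag prefixes are illustrative assumptions.
    """
    rules = [
        {
            "rulePriority": 1,
            "description": f"Expire untagged images older than {untagged_days} days",
            "selection": {
                "tagStatus": "untagged",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": untagged_days,
            },
            "action": {"type": "expire"},
        },
        {
            "rulePriority": 2,
            "description": f"Keep only the newest {keep_tagged} release images",
            "selection": {
                "tagStatus": "tagged",
                "tagPrefixList": list(tag_prefixes),
                "countType": "imageCountMoreThan",
                "countNumber": keep_tagged,
            },
            "action": {"type": "expire"},
        },
    ]
    return json.dumps({"rules": rules}, indent=2)


if __name__ == "__main__":
    print(build_lifecycle_policy())
```

Rules are evaluated in priority order, so the cheaper cleanup of untagged images is listed first in this sketch.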
Infrastructure as Code with CloudFormation
We used AWS CloudFormation to define and provision the entire infrastructure stack:
- Templated all infrastructure components for reproducibility and version control
- Implemented nested stacks for modular infrastructure management
- Automated deployment workflows with parameterized templates for different environments
- Enabled disaster recovery through infrastructure recreation from templates
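To make the nested-stack and parameterized-template idea concrete, here is a minimal sketch that generates a parent CloudFormation template as a Python dict. The stack logical IDs, template file names, bucket URL, and the `Environment` parameter are hypothetical; the real templates are not shown in this case study.

```python
import json


def build_parent_template(template_bucket_url, nested_stacks):
    """Build a parent template whose resources are nested stacks.

    nested_stacks maps logical ID -> template file name, in creation order.
    Each stack depends on the previous one, giving a simple linear ordering.
    """
    resources = {}
    previous = None
    for logical_id, template_file in nested_stacks.items():
        resource = {
            "Type": "AWS::CloudFormation::Stack",
            "Properties": {
                "TemplateURL": f"{template_bucket_url}/{template_file}",
                # Pass the environment name down to every child stack.
                "Parameters": {"Environment": {"Ref": "Environment"}},
            },
        }
        if previous is not None:
            resource["DependsOn"] = previous
        resources[logical_id] = resource
        previous = logical_id
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Parent stack (illustrative sketch)",
        "Parameters": {
            "Environment": {
                "Type": "String",
                "AllowedValues": ["dev", "staging", "prod"],
                "Default": "dev",
            }
        },
        "Resources": resources,
    }


if __name__ == "__main__":
    template = build_parent_template(
        "https://example-bucket.s3.amazonaws.com/templates",  # hypothetical URL
        {"NetworkStack": "network.yaml", "ClusterStack": "cluster.yaml"},
    )
    print(json.dumps(template, indent=2))
```

Because the whole environment is described by one parameterized parent template, recreating it in another region or account is largely a matter of deploying the same template with different parameter values.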
Network Architecture and VPC Configuration
We designed a secure, scalable VPC architecture following AWS best practices:
- Multi-AZ Deployment: Configured resources across multiple availability zones for high availability
- Network Segmentation: Implemented public and private subnets to isolate application layers
- Security Groups: Created fine-grained security group rules to control inter-service communication
- NAT Gateways: Enabled private subnet resources to access external services securely
- Load Balancing: Deployed Application Load Balancers for traffic distribution and health monitoring
- Service Discovery: Configured AWS Cloud Map for microservice discovery and DNS-based routing
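The multi-AZ, public/private subnet layout described above comes down to carving a VPC address range into non-overlapping blocks. The sketch below shows one way to plan that split with Python's standard `ipaddress` module; the `10.0.0.0/16` range, the `/20` subnet size, and the availability zone names are assumptions for illustration.

```python
import ipaddress


def plan_subnets(vpc_cidr, availability_zones, new_prefix=20):
    """Assign one public and one private subnet per availability zone.

    Blocks are handed out sequentially from the VPC range, so they never
    overlap. Raises StopIteration if the VPC range is too small.
    """
    network = ipaddress.ip_network(vpc_cidr)
    blocks = network.subnets(new_prefix=new_prefix)
    plan = []
    for tier in ("public", "private"):
        for az in availability_zones:
            plan.append({"tier": tier, "az": az, "cidr": str(next(blocks))})
    return plan


if __name__ == "__main__":
    # Hypothetical VPC range and zone names.
    for subnet in plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"]):
        print(subnet)
```

In this layout only the public subnets would hold load balancers and NAT gateways, while ECS tasks and databases would sit in the private ones.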
Cloud-Native Services Integration
We enhanced Leanda.io with AWS-managed services:
- Amazon RDS: Migrated database workloads to managed relational database services
- Amazon S3: Utilized object storage for scientific data files, leveraging S3’s durability and scalability
- Amazon CloudWatch: Implemented comprehensive monitoring and logging for all services
- AWS Secrets Manager: Secured API keys, database credentials, and sensitive configuration data
- Amazon ElastiCache: Added caching layers to improve performance for frequently accessed data
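One common way to hand Secrets Manager values to ECS containers is the `secrets` list in a container definition, where each entry maps an environment variable name to a secret reference. The sketch below builds such a list for a secret that stores several JSON keys; the ARN, key names, and variable names are hypothetical, and the case study does not say which mechanism was actually used.

```python
def secret_refs(secret_arn, json_keys):
    """Map JSON keys inside one Secrets Manager secret to container env vars.

    json_keys maps environment variable name -> JSON key in the secret.
    The trailing '::' leaves the version stage and version ID unspecified.
    """
    return [
        {"name": env_name, "valueFrom": f"{secret_arn}:{json_key}::"}
        for env_name, json_key in json_keys.items()
    ]


if __name__ == "__main__":
    # Hypothetical ARN and keys, for illustration only.
    arn = "arn:aws:secretsmanager:us-east-1:123456789012:secret:example/db-AbCdEf"
    for ref in secret_refs(arn, {"DB_USER": "username", "DB_PASSWORD": "password"}):
        print(ref)
```

With this approach the credentials never appear in the task definition itself, and the task execution role needs permission to read the referenced secret.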
Technical Implementation
Docker Containerization
We optimized each microservice’s Docker configuration:
- Multi-stage builds to reduce image sizes
- Alpine Linux base images where appropriate for minimal footprint
- Proper signal handling for graceful container shutdown
- Health check endpoints for container orchestration
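Graceful shutdown matters on ECS because a stopping task receives SIGTERM first and is killed only after a stop timeout. The sketch below shows the general pattern for a Python worker: a signal handler sets a flag, and the work loop finishes the current item before exiting. The function names and the polling interval are illustrative, not taken from Leanda.io's services.

```python
import signal
import threading

# Set when the container has been asked to stop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the stop request; the work loop exits at its next check."""
    shutdown_requested.set()


def install_signal_handlers():
    """Register handlers. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    signal.signal(signal.SIGINT, handle_sigterm)


def run_worker(process_one_item, poll_seconds=1.0):
    """Process items until shutdown is requested.

    process_one_item() should return True if it handled an item and
    False if there was nothing to do. Returns the number of items handled.
    """
    processed = 0
    while not shutdown_requested.is_set():
        if process_one_item():
            processed += 1
        else:
            # Sleep, but wake up immediately if shutdown is requested.
            shutdown_requested.wait(poll_seconds)
    return processed
```

For the signal to reach the process at all, the container's entrypoint should run it directly (exec form) rather than through a shell that does not forward signals.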
ECS Task and Service Configuration
We carefully tuned the ECS configurations for optimal performance:
- Resource Allocation: Right-sized CPU and memory for each service based on profiling
- Networking Mode: Used awsvpc networking mode for enhanced security and performance
- Service Mesh: Considered AWS App Mesh for advanced traffic management (future enhancement)
- Fargate Option: Evaluated ECS Fargate for certain stateless services to reduce operational overhead
Monitoring and Observability
We established comprehensive monitoring:
- CloudWatch Container Insights for ECS cluster metrics
- Custom CloudWatch metrics for application-level KPIs
- CloudWatch Logs aggregation from all containers
- X-Ray integration for distributed tracing (planned)
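For the custom application-level metrics, a service typically publishes data points in the shape expected by CloudWatch's PutMetricData operation. The sketch below builds that request as a plain dict; the namespace, service name, and metric name are hypothetical examples, since the case study does not list the actual KPIs.

```python
def metric_payload(namespace, service, metric_name, value, unit="Count"):
    """Build a PutMetricData-style request for one data point.

    The 'Service' dimension lets one metric name be shared by many
    microservices while still being filterable per service.
    """
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": metric_name,
                "Dimensions": [{"Name": "Service", "Value": service}],
                "Value": float(value),
                "Unit": unit,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical namespace and metric, for illustration only.
    print(metric_payload("Leanda/Example", "file-parser", "FilesProcessed", 12))
```

With an AWS SDK such as boto3, a dict like this could be passed as keyword arguments to the CloudWatch client's `put_metric_data` call; that step is omitted here so the sketch runs without credentials.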
Results and Benefits
The migration to AWS delivered significant improvements:
Scalability
- Automatic scaling of microservices based on demand
- Independent scaling of resource-intensive ML components
- Ability to handle varying research workloads without over-provisioning
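Demand-based scaling of an ECS service is usually expressed as a scalable target plus a target-tracking policy in Application Auto Scaling. The sketch below builds both requests as dicts; the cluster and service names, task limits, 60% CPU target, and cooldowns are assumed values, not the project's actual configuration.

```python
def target_tracking_requests(cluster, service, min_tasks, max_tasks, target_cpu_percent=60.0):
    """Return (scalable_target, scaling_policy) request dicts for an ECS service."""
    common = {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
    }
    scalable_target = {**common, "MinCapacity": min_tasks, "MaxCapacity": max_tasks}
    scaling_policy = {
        **common,
        "PolicyName": f"{service}-cpu-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu_percent,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            # Scale out quickly, scale in slowly to avoid flapping.
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    }
    return scalable_target, scaling_policy


if __name__ == "__main__":
    # Hypothetical cluster and service names.
    target, policy = target_tracking_requests("example-cluster", "ml-training", 1, 8)
    print(target)
    print(policy)
```

Giving each service its own policy is what allows the ML components to scale independently of the lighter services.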
Reliability
- Multi-AZ deployment ensuring high availability
- Automated health checks and container replacement
- Disaster recovery capabilities through infrastructure-as-code
Security
- VPC isolation and security group controls
- Encryption of data at rest and in transit
- Secrets management with AWS Secrets Manager
- Compliance with research data security requirements
Operational Efficiency
- Reduced infrastructure management overhead
- Automated deployments through CloudFormation
- Centralized monitoring and logging
- Cost optimization through right-sizing and auto-scaling
Cloud-Native Architecture
- Leveraged AWS managed services to reduce operational burden
- Implemented best practices for containerized applications
- Positioned Leanda.io for future cloud-native enhancements
- Enabled seamless integration with AWS AI/ML services
Technical Challenges and Solutions
Challenge: Complex Service Dependencies
Solution: Implemented ECS service discovery and dependency ordering in CloudFormation templates to ensure services start in the correct sequence.
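Dependency ordering of this kind is a topological sort over the service graph. As a minimal sketch, the standard library's `graphlib` can derive a valid start order from a map of each service to the services it depends on; the service names below are hypothetical stand-ins, not Leanda.io's real component list.

```python
from graphlib import CycleError, TopologicalSorter


def startup_order(dependencies):
    """Return services in an order where every dependency comes first.

    dependencies maps service name -> set of services it depends on.
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical services: infrastructure first, then application layers.
    graph = {
        "message-bus": set(),
        "database": set(),
        "core-api": {"message-bus", "database"},
        "chemical-parser": {"message-bus"},
        "web-ui": {"core-api"},
    }
    try:
        print(startup_order(graph))
    except CycleError as error:
        print("Circular dependency:", error)
```

The same ordering can then be encoded in templates, for example through explicit dependency declarations between stacks or services.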
Challenge: Resource-Intensive ML Services
Solution: Configured dedicated ECS capacity providers with GPU-enabled instances for ML workloads, while using standard instances for other services.
Challenge: Data Migration
Solution: Developed a phased migration strategy, transferring data to S3 with versioning enabled, ensuring no data loss during cutover.
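One way to back up a "no data loss" claim during a phased migration is to compare checksums of every object before and after transfer. The sketch below is an assumed verification step, not the procedure actually used in the project: it hashes content in chunks and compares a source manifest against a destination manifest, where each manifest maps an object key to its SHA-256 digest.

```python
import hashlib


def sha256_of_chunks(chunks):
    """Hash an iterable of byte chunks without loading a whole file in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_manifest, destination_manifest):
    """Compare two {object_key: sha256_hex} manifests.

    Returns keys missing at the destination and keys whose content differs.
    """
    missing = sorted(k for k in source_manifest if k not in destination_manifest)
    changed = sorted(
        k
        for k, digest in source_manifest.items()
        if k in destination_manifest and destination_manifest[k] != digest
    )
    return {"missing": missing, "changed": changed}


if __name__ == "__main__":
    # Hypothetical object keys and contents.
    source = {"data/a.sdf": sha256_of_chunks([b"mol-1"]), "data/b.csv": sha256_of_chunks([b"x,y"])}
    destination = {"data/a.sdf": sha256_of_chunks([b"mol-1"])}
    print(find_mismatches(source, destination))
```

Cutover for a given phase would proceed only once both lists come back empty for that phase's objects.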
Challenge: Network Performance
Solution: Optimized VPC design with VPC endpoints for AWS services, reducing data transfer costs and improving performance.
About Leanda.io
Leanda.io represents the future of open science data management. By providing a unified platform for data curation, processing, and machine learning, it addresses the disconnect between domain-specific databases, publishers’ repositories, and semantic web knowledge bases. Its microservice architecture allows the scientific community to extend functionality without modifying core systems, making it ideal for collaborative research across chemistry, materials science, microscopy, and other scientific domains.
The platform supports:
- Seamless format conversion between scientific data types
- Real-time automated and manual data curation
- Embedded machine learning framework for drug discovery
- Hierarchical data categorization and metadata assignment
- Public, private, and shared data security models
Conclusion
Migrating Leanda.io to the AWS cloud demonstrated our expertise in handling complex distributed applications, container orchestration, and cloud-native architecture design. By leveraging ECS, ECR, CloudFormation, and Docker, we successfully transformed a resource-intensive scientific platform into a scalable, secure, and highly available cloud service. The implementation adheres to AWS best practices and positions Leanda.io for continued growth and enhancement within the open science community.
This project showcases our ability to:
- Migrate sophisticated multi-service applications to AWS
- Design secure, scalable VPC architectures
- Implement infrastructure-as-code for repeatability and reliability
- Optimize container orchestration for diverse workload patterns
- Integrate cloud-native services for enhanced functionality
For organizations seeking to migrate complex applications to the cloud or modernize their infrastructure, our experience with Leanda.io demonstrates the technical depth and architectural expertise needed for successful cloud transformations.
Learn more about Leanda.io at github.com/ArqiSoft/leanda.io
