Cloud-Native Architecture for a Music Analytics Platform
The platform had grown fast. Millions of streams analyzed daily, dashboards used by labels and distributors, and a data pipeline that mostly worked — until it didn't. Query times were climbing. Deployments were manual and fragile. And the AWS bill kept going up without a clear picture of what was driving it.
The team was strong but stretched thin. They needed someone who could look at the full stack — infrastructure, CI/CD, cost structure — and tell them where the leverage was.
What we found
The application was running on oversized EC2 instances with no autoscaling. The database layer was a single RDS instance handling both transactional writes and heavy analytical reads. Deployments were SSH-and-pray: a series of shell scripts that worked on a good day and produced unclear failures on a bad one.
Cost allocation was nonexistent. The team knew the monthly total but couldn't attribute spend to specific services, environments, or workloads. Dev and staging environments ran 24/7 at production scale.
None of this was unusual. It's what happens when a product grows faster than the infrastructure team around it.
What we changed
Compute — Migrated core services to ECS Fargate with autoscaling policies tuned to actual traffic patterns. The analytics ingestion pipeline moved to a combination of SQS and Lambda for bursty workloads, replacing a set of always-on workers that spent most of their time idle.
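As a rough sketch of that shape in AWS CDK (TypeScript): a Fargate service scaling on CPU, plus an SQS queue drained by a Lambda for the bursty ingestion path. The names, sizes, thresholds, and container image below are illustrative placeholders, not the platform's actual configuration.

```typescript
import { App, Stack, Duration } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

const app = new App();
const stack = new Stack(app, 'AnalyticsComputeStack');

// Core API on Fargate behind an ALB, scaled on load instead of fixed EC2 capacity.
const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 2 });
const cluster = new ecs.Cluster(stack, 'Cluster', { vpc });

const api = new ecsPatterns.ApplicationLoadBalancedFargateService(stack, 'ApiService', {
  cluster,
  cpu: 512,
  memoryLimitMiB: 1024,
  desiredCount: 2,
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry('example/analytics-api:latest'), // placeholder image
  },
});

// Scaling bounds and target tuned to observed traffic, not guesses.
const scaling = api.service.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 10 });
scaling.scaleOnCpuUtilization('CpuScaling', {
  targetUtilizationPercent: 60,
  scaleInCooldown: Duration.minutes(5),
  scaleOutCooldown: Duration.minutes(1),
});

// Bursty ingestion: events queue up in SQS and a Lambda drains them,
// replacing always-on worker instances that sat idle between spikes.
const ingestQueue = new sqs.Queue(stack, 'IngestQueue', {
  visibilityTimeout: Duration.minutes(5),
});

const ingestFn = new lambda.Function(stack, 'IngestFn', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async () => {};'), // placeholder handler
  timeout: Duration.minutes(1),
});
ingestFn.addEventSource(new SqsEventSource(ingestQueue, { batchSize: 10 }));
```

The queue-and-Lambda pairing is the point: during a spike the queue absorbs the burst and Lambda scales with it, and between spikes nothing sits idle on the bill.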
Data layer — Split the read and write paths. Transactional data stayed on RDS with a right-sized instance class. Analytical queries moved to a read replica initially, then to Athena over S3 for the heavier historical analyses. Query times for the main dashboard dropped from 8–12 seconds to under 2 seconds.
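A minimal CDK sketch of the read/write split, assuming PostgreSQL on RDS; the engine version, instance classes, and names are illustrative. The later Athena-over-S3 layer for historical analysis isn't shown here.

```typescript
import { App, Stack } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

const app = new App();
const stack = new Stack(app, 'AnalyticsDataStack');
const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 2 });

// Right-sized primary handles transactional writes only.
const primary = new rds.DatabaseInstance(stack, 'Primary', {
  engine: rds.DatabaseInstanceEngine.postgres({ version: rds.PostgresEngineVersion.VER_14 }),
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
  vpc,
  multiAz: true,
});

// Dashboard reads go to a replica, keeping analytical load off the write path.
new rds.DatabaseInstanceReadReplica(stack, 'AnalyticsReplica', {
  sourceDatabaseInstance: primary,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
  vpc,
});
```

The application then points its reporting queries at the replica endpoint and everything else at the primary, so a heavy dashboard query can no longer slow down writes.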
CI/CD — Replaced the shell-script deployment process with GitHub Actions pipelines deploying to ECS via AWS CDK. Blue-green deployments with automated rollback. The team went from deploying once a week (because it was stressful) to deploying multiple times a day (because it wasn't).
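The blue-green mechanics live in the infrastructure definition rather than in the workflow file; the GitHub Actions side is essentially a cdk deploy on merge. One common way to get blue-green with automated rollback on ECS is the CodeDeploy deployment controller, sketched below with placeholder names and images; treat it as an illustration of the pattern rather than the exact stack.

```typescript
import { App, Stack, Duration } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import * as codedeploy from 'aws-cdk-lib/aws-codedeploy';

const app = new App();
const stack = new Stack(app, 'AnalyticsDeployStack');
const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 2 });
const cluster = new ecs.Cluster(stack, 'Cluster', { vpc });

// The service is managed by CodeDeploy rather than the default rolling update,
// which is what enables blue-green traffic shifting.
const taskDef = new ecs.FargateTaskDefinition(stack, 'TaskDef', { cpu: 512, memoryLimitMiB: 1024 });
taskDef.addContainer('app', {
  image: ecs.ContainerImage.fromRegistry('example/analytics-api:latest'), // placeholder image
  portMappings: [{ containerPort: 80 }],
});

const service = new ecs.FargateService(stack, 'Service', {
  cluster,
  taskDefinition: taskDef,
  desiredCount: 2,
  deploymentController: { type: ecs.DeploymentControllerType.CODE_DEPLOY },
});

// One listener, two target groups: traffic flips from blue to green on deploy.
const lb = new elbv2.ApplicationLoadBalancer(stack, 'Alb', { vpc, internetFacing: true });
const listener = lb.addListener('Http', { port: 80 });

const blueGroup = new elbv2.ApplicationTargetGroup(stack, 'Blue', {
  vpc, port: 80, targetType: elbv2.TargetType.IP,
});
const greenGroup = new elbv2.ApplicationTargetGroup(stack, 'Green', {
  vpc, port: 80, targetType: elbv2.TargetType.IP,
});
listener.addTargetGroups('Default', { targetGroups: [blueGroup] });
service.attachToApplicationTargetGroup(blueGroup);

// A failed deployment rolls back automatically instead of paging someone.
new codedeploy.EcsDeploymentGroup(stack, 'BlueGreenDg', {
  service,
  blueGreenDeploymentConfig: {
    blueTargetGroup: blueGroup,
    greenTargetGroup: greenGroup,
    listener,
    terminationWaitTime: Duration.minutes(10),
  },
  autoRollback: { failedDeployment: true, stoppedDeployment: true },
});
```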
FinOps — Implemented AWS cost allocation tags across all resources. Set up non-production environments to scale down overnight and on weekends. Moved to Savings Plans for baseline compute. Introduced a monthly cost review cadence with the engineering leads.
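In CDK terms, the tagging and scale-down pieces look roughly like the sketch below; the tag keys, schedule, and staging service are illustrative. One practical note: user-defined tags only show up in Cost Explorer after they're activated as cost allocation tags in the Billing console.

```typescript
import { App, Stack, Tags } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as appscaling from 'aws-cdk-lib/aws-applicationautoscaling';

const app = new App();
const stack = new Stack(app, 'StagingStack');

// Cost allocation tags applied to everything in the app, so spend can be
// grouped by service and environment instead of one undifferentiated total.
Tags.of(app).add('service', 'analytics-dashboard');
Tags.of(app).add('environment', 'staging');
Tags.of(app).add('owner', 'platform-team');

const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 2 });
const cluster = new ecs.Cluster(stack, 'Cluster', { vpc });
const svc = new ecsPatterns.ApplicationLoadBalancedFargateService(stack, 'Staging', {
  cluster,
  taskImageOptions: { image: ecs.ContainerImage.fromRegistry('example/analytics-api:latest') },
});

// Non-production capacity drops to zero outside working hours (UTC)
// and comes back before the team starts in the morning.
const scaling = svc.service.autoScaleTaskCount({ minCapacity: 0, maxCapacity: 2 });
scaling.scaleOnSchedule('ScaleDownEvenings', {
  schedule: appscaling.Schedule.cron({ hour: '19', minute: '0' }),
  minCapacity: 0,
  maxCapacity: 0,
});
scaling.scaleOnSchedule('ScaleUpMornings', {
  schedule: appscaling.Schedule.cron({ hour: '7', minute: '0' }),
  minCapacity: 1,
  maxCapacity: 2,
});
```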
The numbers
60% reduction in p95 query latency on the primary analytics dashboard. Users noticed before we told them.
38% reduction in monthly AWS spend — achieved in the first billing cycle after migration, without removing any capabilities. Further savings followed as reserved capacity and Savings Plans took effect.
Deployments went from weekly to daily with zero-downtime blue-green releases and automated rollback. The deploy process went from a 45-minute runbook to a merged PR.
Cost visibility went from zero to granular. Every service, every environment, every workload — attributable and reviewable. The team could finally make informed trade-offs about where to spend and where to cut.
What made this work
This wasn't a rebuild. It was a sequence of targeted changes, ordered by impact, delivered over a focused advisory engagement. The existing team executed the migration — we provided the architecture, the sequencing, and the AWS-specific expertise to do it without disrupting the product.
The platform's core value was in its data and its user experience. The infrastructure just needed to stop being the bottleneck.