The No-BS Guide to Cloud Migration: Lessons from 30+ Migrations
What actually goes wrong in cloud migrations — and how to avoid the mistakes that cost companies months of rework and six-figure recovery bills.
Why Migrations Fail
After 30+ cloud migrations, we've seen the same failure modes repeat. They're almost never technical. They're organizational, architectural, and planning failures. The technology is well understood. What is not well understood is the discipline required to execute a migration without disrupting a business that still runs on the system you're replacing.
The most expensive migrations we've seen had three things in common: no clear owner, underestimated dependency mapping, and pressure to go fast. Speed kills migrations.
Start with a Dependency Map, Not a Timeline
The first thing we do on every migration engagement is a full dependency audit. That means mapping every service, database, queue, cron job, shared file system, and third-party integration before touching anything. This alone takes 2–3 weeks and clients often push back on it. It's the most valuable thing we do.
Undiscovered dependencies are the number one cause of production incidents during migrations. That legacy batch job that writes to a shared NFS mount. The undocumented API that a vendor integration calls directly. The cron that runs twice a month and silently fails without anyone noticing. Find all of them before you move anything.
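A dependency audit is easier to act on once it is written down as a graph. The sketch below is one minimal way to do that in Python, using the standard-library graphlib module. The component names and the DEPENDENCIES inventory are hypothetical placeholders, not taken from a real engagement.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each component maps to the things it depends on.
DEPENDENCIES = {
    "billing-api": {"orders-db", "payments-vendor"},
    "nightly-batch": {"orders-db", "shared-nfs"},
    "reporting-cron": {"shared-nfs"},
    "orders-db": set(),
    "shared-nfs": set(),
    "payments-vendor": set(),
}


def migration_order(deps):
    """Return components so that each one appears after everything it depends on."""
    return list(TopologicalSorter(deps).static_order())


def dependents_of(deps, target):
    """Return every component that directly or transitively depends on target."""
    found = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for name, needs in deps.items():
            if current in needs and name not in found:
                found.add(name)
                frontier.append(name)
    return found


def undeclared(deps):
    """Return dependencies that are referenced but never inventoried themselves."""
    referenced = set()
    for needs in deps.values():
        referenced |= set(needs)
    return referenced - set(deps)


if __name__ == "__main__":
    print("Suggested order:", migration_order(DEPENDENCIES))
    print("Affected by moving shared-nfs:", sorted(dependents_of(DEPENDENCIES, "shared-nfs")))
    print("Referenced but not inventoried:", sorted(undeclared(DEPENDENCIES)))
```

The undeclared check is the useful part in practice: anything that shows up only as someone else's dependency is a candidate for the "nobody knew it existed" category.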
Lift-and-Shift Is Usually the Wrong Starting Point
Lift-and-shift (moving VMs as-is to the cloud) is fast but creates technical debt from day one. You pay cloud prices for an architecture designed for on-premises hardware. You miss out on containerization, managed services, auto-scaling, and every other benefit that made the cloud attractive.
We recommend a 're-platform' approach for most workloads: move to containers (ECS or Kubernetes), swap self-managed databases for RDS/Aurora, and replace SFTP servers with S3. The extra work pays back within 12 months in operational savings and reliability gains.
Run Parallel in Production, Not in Staging
Staging environments lie. Production traffic is messier, more varied, and higher volume than any staging setup can replicate. The only way to validate a migration is to run the new system in parallel with real production traffic and compare outputs.
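To make "compare outputs" concrete, here is a minimal sketch of a shadow comparison loop. It assumes both systems can be called as plain Python functions (old_handler and new_handler are hypothetical stand-ins for real service clients) and that responses can be compared with ==. Real responses usually need normalization first, for example stripping timestamps or generated IDs.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowReport:
    """Running tally of how often the new system disagreed with the old one."""
    total: int = 0
    mismatches: list = field(default_factory=list)

    @property
    def mismatch_rate(self):
        return len(self.mismatches) / self.total if self.total else 0.0


def shadow_compare(requests, old_handler, new_handler, report):
    """Answer every request from the old system; replay it on the new one and record diffs."""
    responses = []
    for request in requests:
        expected = old_handler(request)
        responses.append(expected)  # callers only ever see the old system's answer
        report.total += 1
        try:
            actual = new_handler(request)
        except Exception as exc:  # a crash in the new system must not affect users
            report.mismatches.append((request, expected, repr(exc)))
            continue
        if actual != expected:
            report.mismatches.append((request, expected, actual))
    return responses
```

The design point is that the new system's result is recorded but never returned, so a bug in it costs a log entry rather than an incident.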
We use blue-green or shadow traffic patterns for every migration. The new system starts at 0% of live traffic: it receives mirrored requests while we compare its results against the old system in the background. Then we ramp up gradually: 1%, 5%, 20%, 50%, 100%. This means the rollback plan is always 'flip traffic back', not 'restore from backup at 2am'.
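A gradual ramp can be sketched as a deterministic percentage router. The example below is an illustration, not a description of any particular load balancer: bucket and route are hypothetical helper names, and in production this logic usually lives in the load balancer, service mesh, or feature-flag system rather than in application code.

```python
import hashlib

# Percent of live traffic served by the new system at each stage of the ramp.
RAMP_STAGES = [0, 1, 5, 20, 50, 100]


def bucket(request_key):
    """Map a key (for example a user ID) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(request_key, new_percent):
    """Return 'new' for roughly new_percent of keys, otherwise 'old'."""
    return "new" if bucket(request_key) < new_percent else "old"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in RAMP_STAGES:
        on_new = sum(route(u, percent) == "new" for u in users)
        print(f"{percent:>3}% target: {on_new} of {len(users)} users on the new system")
```

Hashing the key rather than picking at random keeps each user on the same side between requests, and a user who moved to the new system at 5% stays there at 20%. Rolling back is a matter of setting new_percent to 0.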
Database Migrations Are the Hard Part
Application migrations are relatively easy. Database migrations are not. The database contains years of accumulated schema decisions, indexes tuned for specific query patterns, and often implicit assumptions baked into application code.
We use CDC (Change Data Capture) tools like Debezium or AWS DMS to replicate data in near real time from the old database to the new one during the migration window. This keeps both systems in sync until cutover and keeps rollback fast, because the old database is still current up to the moment you switch. Never plan a big-bang database cutover.
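Replication is only useful if you can show the two databases actually agree before cutover. The sketch below is one way to check that, assuming rows have already been fetched as dictionaries with a primary key column named id. It does not call Debezium or AWS DMS, and the helper names are hypothetical. For large tables you would compare per-chunk checksums computed inside each database rather than pulling rows out.

```python
import hashlib
import json


def row_fingerprint(row):
    """Hash a row's contents so that column order does not affect the result."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_tables(source_rows, target_rows, key="id"):
    """Compare two row sets by primary key and report missing, extra, and changed rows."""
    source = {row[key]: row_fingerprint(row) for row in source_rows}
    target = {row[key]: row_fingerprint(row) for row in target_rows}
    shared = source.keys() & target.keys()
    return {
        "missing_in_target": sorted(source.keys() - target.keys()),
        "extra_in_target": sorted(target.keys() - source.keys()),
        "changed": sorted(k for k in shared if source[k] != target[k]),
    }


def safe_to_cut_over(diff):
    """True only when the target has every source row, nothing extra, and no drift."""
    return not any(diff.values())
```

Running a check like this repeatedly during the replication window, not just once at the end, shows whether drift is shrinking toward zero or whether some table is silently falling behind.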
Invest in Observability Before Cutover
Set up your monitoring, alerting, and tracing before you cut over any traffic. You need a baseline for how the new system performs under representative load (shadow or replayed traffic) so you can distinguish 'this is normal cloud behavior' from 'something is wrong' when you flip the switch.
Our standard stack for migrations: CloudWatch or Datadog for metrics, OpenTelemetry for traces, and structured logs in CloudWatch Logs, queried with Logs Insights. Make sure you can answer 'is this working?' in under 30 seconds on migration day.
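A baseline is most useful when the comparison against it is automated. The following is a minimal sketch of that idea using only the Python standard library. It assumes latency samples in milliseconds have already been exported from your metrics system, and the 1.5x tolerance is an arbitrary illustrative threshold, not a recommendation.

```python
import statistics


def p95(samples):
    """Return the 95th percentile of latency samples (requires at least two samples)."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def check_against_baseline(baseline_ms, current_ms, tolerance=1.5):
    """Flag a regression when the current p95 exceeds the baseline p95 by more than tolerance x."""
    base = p95(baseline_ms)
    now = p95(current_ms)
    return {
        "baseline_p95": base,
        "current_p95": now,
        "regressed": now > base * tolerance,
    }


if __name__ == "__main__":
    baseline = [40, 42, 45, 47, 50, 52, 55, 60, 70, 90]     # hypothetical pre-cutover samples
    after_cutover = [41, 44, 46, 48, 51, 55, 58, 65, 80, 95]  # hypothetical post-cutover samples
    print(check_against_baseline(baseline, after_cutover))
```

A check like this is what turns 'is this working?' into a yes or no answer instead of a dashboard-reading exercise.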
Work With Us
Ready to put this into practice?
iSpecia builds what you've been reading about. Tell us your challenge.