Recently, we had a production outage lasting a few minutes due to a database migration on one of our Ruby on Rails apps. The deployment went fine through a few earlier stages, but the problem only showed up at the last and largest stage. Here is exactly what happened during the deploy:
- New code was deployed. A restart was pending, so the servers were still running the old code.
- Migrations ran.
- One migration removed a column that was used in the old code, but no longer used in the new code.
- The next migration was a data migration that inserted one row per user into a table. It was very slow, taking 5+ minutes.
- The old code failed because it tried to use a column that no longer existed in the database. To make things worse, the column was referenced on every page load within the app.
- The long-running data migration never finished because it ran into a timeout.
- The servers weren’t restarted because the migration had failed. So, the new code wasn’t served at all.
- There was no automatic database rollback to restore the system into a good state with the old code.
The team was able to resolve the issues within the next 5 minutes, but it was the worst system outage we’ve seen in years. If you deal with a large Ruby on Rails app, the following safeguards can help you avoid such problems:
- Do not remove a column from the database while the currently deployed code still uses it. Drop it in a later release (see the first sketch after this list).
- Have a rollback policy for deployments that fail at the migration step, so the system can be automatically restored to a known good state (a rollback sketch follows this list).
- Treat data migrations as a potential performance problem and always test them against a realistically sized dataset before the production release.
- If possible, run your data migrations separately from schema migrations so that you don’t incur deployment delays for optional new data (see the backfill sketch after this list).
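
For the first safeguard, the usual pattern is a two-release column removal: the first release stops referencing the column (Active Record's `ignored_columns` helps here), and only a later release actually drops it. A minimal sketch, using a hypothetical `legacy_plan` column on `users`:

```ruby
# Release 1: the new code stops reading the column but leaves it in the schema.
# ignored_columns (Rails 5+) tells Active Record to stop loading the column,
# so both the old and the new code can run against the unchanged schema.
class User < ApplicationRecord
  self.ignored_columns += ["legacy_plan"] # hypothetical column name
end

# Release 2, deployed later: drop the column once no running code references it.
class RemoveLegacyPlanFromUsers < ActiveRecord::Migration[7.0]
  def change
    remove_column :users, :legacy_plan, :string
  end
end
```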
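
For the rollback policy, one option is to record the schema version before migrating and migrate back down to it when the migration step fails, so the still-running old code keeps a schema it understands. A rough sketch of such a deploy step, assuming your migrations define reversible down steps (the rails commands are standard, the surrounding script is hypothetical):

```ruby
# deploy/migrate_with_rollback.rb -- hypothetical deploy hook.
# Record the schema version the old code was built against.
previous_version = `bin/rails db:version`.split.last

unless system("bin/rails db:migrate")
  # The migration step failed: migrate back down to the recorded version so
  # the servers can keep serving the old code against the old schema.
  system("bin/rails db:migrate VERSION=#{previous_version}")
  abort "db:migrate failed; rolled schema back to version #{previous_version}"
end
```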
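
For the last two points, a slow backfill can be moved out of the schema-migration step entirely, for example into a rake task that runs after the deploy and writes in batches. A sketch, assuming a hypothetical `user_settings` table that needs one row per user:

```ruby
# lib/tasks/backfill.rake -- hypothetical task name and table.
namespace :backfill do
  desc "Insert a default user_settings row for every user, in batches"
  task user_settings: :environment do
    User.in_batches(of: 1_000) do |batch|
      rows = batch.ids.map { |id| { user_id: id } }
      # insert_all (Rails 6+) skips validations and callbacks and issues one
      # INSERT per batch, so no single statement or transaction runs long
      # enough to hit a timeout.
      UserSetting.insert_all(rows)
    end
  end
end
```

Running it as, say, `bin/rails backfill:user_settings` after the servers have restarted means a slow backfill can delay the new data, but never the deploy itself.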