Data migration on a live database

A project I used to work on had a Spring Boot backend. We used Flyway scripts to update the PostgreSQL database alongside the backend code.

Updating data leads to challenges, especially with live data that is already stored in the database.

A simple and common scenario would be renaming a field from country to alphaCountryCode. In such a case, it is sufficient to simply rename the column using a SQL script:

ALTER TABLE items RENAME COLUMN country TO alpha_country_code;
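
With Flyway, such a statement would go into a versioned migration script, e.g. a file named V7__rename_country_to_alpha_country_code.sql (the version number and file name here are just an example), which is applied automatically on startup before the new backend code runs.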

But what if existing data needed to be migrated because the business logic changed?

Dry run and logging

At first, I tried to execute a dry run in SQL. All changes needed to be logged so that we would be able to investigate possible issues later on. I ended up with a script similar to the one below, but it just does not feel satisfying:

BEGIN;

-- Add the new column if it does not exist yet.
DO
$$
BEGIN
    IF NOT EXISTS (SELECT 1
                   FROM information_schema.columns
                   WHERE table_name = 'items'
                     AND column_name = 'name') THEN
        ALTER TABLE items
            ADD COLUMN name CHAR(1);
    END IF;
END
$$;

-- Temporary table to record what the migration would change.
CREATE TEMP TABLE dryrun_check
(
    uuid   UUID PRIMARY KEY,
    name   CHAR(1),
    status TEXT CHECK (status IN ('UPDATED', 'NOT_UPDATED_OR_NULL'))
) ON COMMIT PRESERVE ROWS;

-- Apply the new naming logic and record every affected row.
WITH upd AS (
    UPDATE items item
        SET name = CASE
                       WHEN item.type IN ('REGULAR', 'LARGE') THEN 'B'
                       WHEN item.type = 'SMALL' THEN 'S'
                       ELSE 'X'
            END
        FROM collections col
        WHERE col.uuid = item.col_uuid
        RETURNING item.uuid)
INSERT
INTO dryrun_check (uuid, name, status)
SELECT *
FROM (SELECT item.uuid,
             item.name,
             'UPDATED' AS status
      FROM items item
               JOIN upd u ON u.uuid = item.uuid
      UNION ALL
      SELECT item.uuid,
             item.name,
             'NOT_UPDATED_OR_NULL' AS status
      FROM items item
               LEFT JOIN upd u ON u.uuid = item.uuid
      WHERE u.uuid IS NULL
         OR item.name IS NULL) src
ON CONFLICT (uuid) DO NOTHING;

-- Log every recorded row as JSON via RAISE NOTICE.
DO
$$
DECLARE
    r RECORD;
BEGIN
    FOR r IN
        SELECT *
        FROM dryrun_check
        ORDER BY uuid
        LOOP
            RAISE NOTICE '%',
                json_build_object(
                        'uuid', r.uuid,
                        'name', r.name,
                        'status', r.status
                    )::text;
        END LOOP;
END
$$;

-- Dry run: discard all changes.
ROLLBACK;

A better approach?

An event listener triggered by the ApplicationReadyEvent is perfect for this scenario. First, the Flyway scripts are executed, then the application boots up, and lastly, the migration method gets executed.

For me as a developer with a stronger focus on the backend, applying new logic to data is much more intuitive in backend code than in SQL scripts. The compiler is extremely helpful, too. It is also possible to write tests for such methods to assert that the outcome is as intended. If there are any issues or unexpected results during the migration, it is much easier to log what has happened. I would rather analyse my structured log files than try to figure out the root cause of a problem from error messages thrown during script execution.

import org.slf4j.LoggerFactory
import org.springframework.boot.context.event.ApplicationReadyEvent
import org.springframework.context.event.EventListener
import org.springframework.stereotype.Service
import org.springframework.transaction.annotation.Transactional

@Service
class ItemsMigrationProcessor(
    private val itemRepository: ItemRepository,
) {

    private val logger = LoggerFactory.getLogger(javaClass)

    @EventListener(ApplicationReadyEvent::class)
    @Transactional
    fun migrateItems() {
        logger.atDebug().log { "Starting to migrate items" }
        val items = itemRepository.findAll()
        items.forEach {
            it.editSomeStuff()
        }
        // no explicit save here; with a JPA repository the managed
        // entities are flushed on commit via dirty checking
        logger.atDebug().log { "Migrated [${items.size}] items" }
    }

    private fun Item.editSomeStuff() {
        logger.atDebug().log { "Migrating item ${this.uuid}" }
        this.name = when (this.type) {
            Type.REGULAR, Type.LARGE -> "B"
            Type.SMALL -> "S"
        }
        logger.atDebug().log { "Migrated item ${this.uuid} to name ${this.name}" }
    }
}
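
To illustrate the testability claim, here is a minimal test sketch. It assumes JUnit 5 and MockK for stubbing the repository (both are my assumptions, not necessarily the project's actual test stack), as well as a hypothetical Item constructor:

import io.mockk.every
import io.mockk.mockk
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test
import java.util.UUID

class ItemsMigrationProcessorTest {

    private val itemRepository = mockk<ItemRepository>()
    private val processor = ItemsMigrationProcessor(itemRepository)

    @Test
    fun `sets the name according to the item type`() {
        // hypothetical constructor, adjust to the real Item entity
        val small = Item(uuid = UUID.randomUUID(), type = Type.SMALL, name = null)
        val large = Item(uuid = UUID.randomUUID(), type = Type.LARGE, name = null)
        every { itemRepository.findAll() } returns listOf(small, large)

        processor.migrateItems()

        assertEquals("S", small.name)
        assertEquals("B", large.name)
    }
}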

Disadvantages

Using the Spring Boot application to migrate data has some downsides too!

1. Limited memory on large datasets

Make sure the application has enough memory to update all entries at once. Alternatively, do not update all data in one run, but use chunks instead (see the sketch below).

Issues like this are easy to overlook, because tests usually focus on correct logic, not on performance or memory constraints.
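
What chunked processing could look like, as a sketch: it assumes ItemRepository is a Spring Data repository that supports paging (findAll(Pageable)) and that a TransactionTemplate bean is available, which Spring Boot auto-configures when a transaction manager is present.

import org.slf4j.LoggerFactory
import org.springframework.boot.context.event.ApplicationReadyEvent
import org.springframework.context.event.EventListener
import org.springframework.data.domain.PageRequest
import org.springframework.data.domain.Pageable
import org.springframework.data.domain.Sort
import org.springframework.stereotype.Service
import org.springframework.transaction.support.TransactionTemplate

@Service
class ChunkedItemsMigrationProcessor(
    private val itemRepository: ItemRepository,
    private val transactionTemplate: TransactionTemplate,
) {

    private val logger = LoggerFactory.getLogger(javaClass)

    @EventListener(ApplicationReadyEvent::class)
    fun migrateItemsInChunks() {
        // 500 is an arbitrary example size; sorting by uuid keeps the pages stable
        var pageable: Pageable = PageRequest.of(0, 500, Sort.by("uuid"))
        var hasNext = true
        while (hasNext) {
            // each chunk is loaded and committed in its own transaction,
            // so only one page of entities is held in memory at a time
            hasNext = transactionTemplate.execute {
                val chunk = itemRepository.findAll(pageable)
                chunk.forEach { it.editSomeStuff() }
                chunk.hasNext()
            } ?: false
            logger.atDebug().log { "Migrated chunk ${pageable.pageNumber}" }
            pageable = pageable.next()
        }
    }

    // same mapping as in the processor above
    private fun Item.editSomeStuff() {
        this.name = when (this.type) {
            Type.REGULAR, Type.LARGE -> "B"
            Type.SMALL -> "S"
        }
    }
}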

2. Execution time on large datasets

With Flyway, the script is executed before the application has booted. Our migration processor, on the other hand, has to wait until the application has finished starting before it can run.

In most deployment processes, the application goes live as soon as it has started up, so processes such as the data migration run on a live system. Keep in mind that users work with outdated data if a freshly deployed application is accessible before the migration has finished.

3. Testability does not mean all scenarios get tested

Tests need realistic data and a proper test setup to be able to predict issues; a handful of handcrafted entities rarely covers the edge cases hiding in real production data.

4. Clean up

Do not forget to remove migration methods: even if they are idempotent, the new business logic is already applied by the application, and database entries only need to be migrated once. In general, it is preferable to remove such code once it has served its purpose.
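
Until that cleanup happens, a guard can at least make the method safe to run on every start. A minimal sketch of how migrateItems from above could be guarded, assuming an item that has not been migrated yet can be recognised by its name still being null (that criterion is an assumption for this example):

    @EventListener(ApplicationReadyEvent::class)
    @Transactional
    fun migrateItems() {
        // only touch items that have not been migrated yet,
        // so re-running the method after a restart changes nothing
        val unmigrated = itemRepository.findAll().filter { it.name == null }
        unmigrated.forEach { it.editSomeStuff() }
        logger.atDebug().log { "Migrated [${unmigrated.size}] items" }
    }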

Conclusion

  1. I am more familiar with writing backend code
  2. The migration code is testable, but requires proper test setup
  3. Logged errors are easier to analyse