Part 15 of 18

Diff, Snapshot, and Reverse Engineering: Onboarding Existing Databases

Most Liquibase tutorials start with a blank database. Most real projects start with a production database that has been evolving for years without version control. This article covers the complete workflow for adopting Liquibase on an existing database — generating a baseline changelog, bootstrapping Liquibase tracking, detecting drift between environments, and maintaining snapshots for offline comparisons.


The Onboarding Problem

An existing database has no DATABASECHANGELOG table. Running liquibase update with a fresh changelog would try to create tables that already exist, causing immediate failures. The solution has two phases:

  1. Capture the current state — generate a changelog that represents the database as it exists today
  2. Bootstrap Liquibase — tell Liquibase that the current state is already applied, without re-running the DDL

Phase 1: Generate a Baseline Changelog

generateChangelog

generateChangelog connects to an existing database and generates a changelog file representing its current schema. It is equivalent to a diff between the live database and an empty database.

liquibase generateChangelog \
  --changelog-file=db/changelog/baseline/baseline.yaml \
  --url=jdbc:mysql://localhost:3306/ecommerce \
  --username=lb_user \
  --password=lb_pass

Output: a YAML file with one changeset per database object (tables, indexes, foreign keys, views, stored procedures). Each changeset gets an auto-generated ID.

What it captures:

  • Tables and columns (with types, defaults, nullability)
  • Primary keys, unique constraints
  • Indexes and foreign keys
  • Views (as createView changesets)
  • Stored procedures and functions (if the database supports exporting them)

What it does not capture well:

  • ON UPDATE CURRENT_TIMESTAMP (MySQL-specific, not in the standard DDL model)
  • AUTO_INCREMENT starting values
  • Column comments
  • CHECK constraints (MySQL pre-8.0.16 silently ignores these)
  • Table-level charset and collation (may not appear in the generated changeset)
  • Triggers (often incomplete or missing)

The generated file needs review before use. It is a starting point, not production-ready output.

Cleaning up the generated changelog

Open the generated file and inspect it. Common cleanup tasks:

1. Remove auto-generated changeset IDs and replace with your convention:

Generated:

- changeSet:
    id: "1712345678901-1"
    author: lb_user (generated)

Cleaned:

- changeSet:
    id: "baseline-001"
    author: abhay
    comment: Baseline — users table as of 2026-06-01

2. Add charset/collation to createTable changesets:

Generated:

- createTable:
    tableName: users
    columns: [...]

Cleaned — add modifySql:

- createTable:
    tableName: users
    columns: [...]
modifySql:
  - dbms: mysql
    append:
      value: " ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC"

3. Add ON UPDATE CURRENT_TIMESTAMP back as a sql changeset for updated_at columns that had it — the generated changeset will have dropped it.

4. Remove objects you don’t want tracked — temporary tables, legacy views that are about to be dropped, etc.

5. Add rollback blocks where they are missing (the generated file often lacks rollback for complex changesets).

Limiting what generateChangelog captures

Use --diff-types to control which object types are included:

# Only capture tables and views, skip indexes and stored procedures
liquibase generateChangelog \
  --changelog-file=db/changelog/baseline/baseline.yaml \
  --diff-types=tables,views

Available diff types: tables, columns, indexes, foreignkeys, primarykeys, uniqueconstraints, data, views, sequences, storedprocedures.

For most onboarding scenarios, start with tables only, then add indexes and foreignkeys after reviewing the table definitions. Views and stored procedures are better managed with runOnChange: true changesets written manually.


Phase 2: Bootstrap Liquibase on Existing Databases

After generating and cleaning the baseline changelog, you have two options for telling Liquibase that these changesets have already been applied.

changelogSync marks all changesets in the changelog as applied in DATABASECHANGELOG without executing any SQL:

liquibase changelogSync \
  --changelog-file=db/changelog/baseline/baseline.yaml

After this runs, liquibase status reports 0 pending changesets. Future liquibase update runs will skip the baseline and only apply new changesets.

Limitation: changelogSync marks the entire changelog file at once. If the database is slightly different from the changelog (e.g., some tables are missing), it will be marked as if those tables were created — and subsequent runs won’t try to create them, leaving the schema inconsistent.

Add preconditions to each baseline changeset so they self-detect whether their work is already done:

- changeSet:
    id: "baseline-001"
    author: abhay
    preConditions:
      onFail: MARK_RAN
      not:
        - tableExists:
            tableName: users
    changes:
      - createTable:
          tableName: users
          columns: [...]

This approach is more robust because each changeset checks before running. Tables that exist are skipped (MARK_RAN); tables that don’t exist are created. This handles the case where the database is partially in sync with the changelog.

Combining both approaches: For a large existing database, run changelogSync first to bootstrap quickly, then add MARK_RAN preconditions to future changesets that need to be idempotent.


Snapshot: Capturing Database State for Later Comparison

snapshot saves the current database schema to a JSON file without connecting to a database:

liquibase snapshot \
  --snapshot-format=json \
  --output-file=snapshots/ecommerce-prod-20260601.json

The snapshot captures:

  • All tables, columns, types, defaults
  • Indexes, foreign keys, constraints
  • Views, sequences

Use cases for snapshots:

  1. Offline diff: Compare two database states without needing both databases connected simultaneously
  2. Pre-deployment baseline: Capture prod state before a deployment for rollback comparison
  3. Drift detection: Compare a snapshot taken last week against today’s schema

Using a snapshot in diff

# Compare current database against a saved snapshot
liquibase diff \
  --url=jdbc:mysql://localhost:3306/ecommerce \
  --username=lb_user \
  --password=lb_pass \
  --reference-url="offline:mysql?snapshot=snapshots/ecommerce-prod-20260601.json"

The offline:mysql?snapshot= URL tells Liquibase to use the JSON snapshot as the reference database. This is how you detect schema drift between a production deployment and the current state.

Pre-deployment snapshot workflow

# Step 1: Capture state before deployment
liquibase snapshot \
  --snapshot-format=json \
  --output-file=snapshots/pre-deploy-v1.2.0.json

# Step 2: Deploy
liquibase tag v1.2.0
liquibase updateSQL        # Review
liquibase update           # Deploy

# Step 3: If something went wrong, compare current state to pre-deploy snapshot
liquibase diff \
  --url=jdbc:mysql://prod-host:3306/ecommerce \
  --reference-url="offline:mysql?snapshot=snapshots/pre-deploy-v1.2.0.json"

This gives you a forensic record of exactly what changed during the deployment.


diffChangelog: Generate a Changelog from Schema Drift

diffChangelog compares two databases (or a database and a snapshot) and generates a changelog of the differences — not a full schema, just the delta:

# Compare dev to prod and generate the missing changesets
liquibase diffChangelog \
  --changelog-file=db/changelog/migrations/drift-fix.yaml \
  --url=jdbc:mysql://prod-host:3306/ecommerce \
  --reference-url=jdbc:mysql://dev-host:3306/ecommerce \
  --reference-username=lb_user \
  --reference-password=lb_pass

Workflow: Dev database has changes that production is missing. diffChangelog generates the changesets needed to bring production in sync with dev.

Output example:

databaseChangeLog:
  - changeSet:
      id: "diff-1717200000000-1"
      author: lb_user (generated)
      changes:
        - addColumn:
            tableName: users
            columns:
              - column:
                  name: phone
                  type: VARCHAR(20)

As with generateChangelog, the output must be reviewed and cleaned before use — rename IDs, add rollback, add modifySql for MySQL options.

diffChangelog vs writing changesets manually

ApproachBest For
diffChangelogCapturing drift that happened outside Liquibase (hotfixes, DBA changes)
Manual changesetsNormal development workflow

diffChangelog is a recovery tool, not a development workflow. If developers are regularly using diffChangelog to generate changesets, it means schema changes are happening outside Liquibase — which defeats the purpose of version control.


Detecting Drift Between Environments

Use diff regularly as part of your release process to detect schema drift before it causes a deployment failure:

# Staging should match production — find any differences
liquibase diff \
  --url=jdbc:mysql://staging-host:3306/ecommerce \
  --username=lb_user \
  --password=lb_pass \
  --reference-url=jdbc:mysql://prod-host:3306/ecommerce \
  --reference-username=lb_user \
  --reference-password=lb_pass

A clean output:

Diff Results:
Reference Database: jdbc:mysql://prod-host:3306/ecommerce
Comparison Database: jdbc:mysql://staging-host:3306/ecommerce

Differences: NONE

Any reported differences are either:

  1. Expected — staging has migrations that haven’t been deployed to prod yet
  2. Unexpected — drift caused by manual changes, rollback failures, or inconsistent deployments

Make liquibase diff between staging and prod a step in your release checklist.


Complete Onboarding Workflow

Here is the full sequence for onboarding an existing production MySQL database:

# Step 1: Generate baseline changelog from production
liquibase generateChangelog \
  --changelog-file=db/changelog/baseline/baseline.yaml \
  --url=jdbc:mysql://prod-host:3306/ecommerce \
  --diff-types=tables,columns,indexes,foreignkeys,primarykeys,uniqueconstraints

# Step 2: Review and clean the generated file
# - Fix changeset IDs
# - Add charset/collation via modifySql
# - Add missing rollback blocks
# - Remove objects you don't want tracked

# Step 3: Test against a copy of production
liquibase validateChangeLog \
  --changelog-file=db/changelog/baseline/baseline.yaml

# Step 4: Bootstrap Liquibase on production
# (mark the baseline as already applied without running DDL)
liquibase changelogSync \
  --changelog-file=db/changelog/baseline/baseline.yaml \
  --url=jdbc:mysql://prod-host:3306/ecommerce

# Step 5: Verify the baseline is registered
liquibase status    # Should show 0 pending changesets

# Step 6: Tag the baseline
liquibase tag baseline-2026-06-01

# Step 7: All future changes go in new changelog files
# db/changelog/migrations/2026/06/20260602-001-add-phone-column.yaml

Common Mistakes

Using the generated changelog as-is without review: generateChangelog produces a functional but incomplete representation of the schema. MySQL-specific features like ON UPDATE CURRENT_TIMESTAMP, charset/collation, and triggers are often missing or incorrect. Always review the output before committing it.

Running changelogSync without verifying the schema matches: changelogSync marks changesets as applied unconditionally. If the database schema doesn’t actually match the changelog (e.g., a column is a different type), Liquibase won’t know — it will just skip those changesets. Use diff before changelogSync to verify the schema matches.

Using diffChangelog as the normal development workflow: Generating changesets from drift means schema changes happened outside Liquibase. This is a recovery path, not a development pattern. All planned schema changes should be written as explicit changesets first, then applied to the database.


Best Practices

  • generateChangelog is the starting point, not the finished product — always review, clean, and test it before committing
  • snapshot before every production deployment — it is your forensic record and enables offline diff comparison after the fact
  • changelogSync with diff validation — run diff between the generated baseline and the live database before bootstrapping to confirm they match
  • Tag immediately after changelogSyncliquibase tag baseline-YYYY-MM-DD gives you a rollback point from day one
  • Use diffChangelog only for drift recovery — not as a routine changeset generation method
  • Add MARK_RAN preconditions to baseline changesets for robustness when environments might be partially in sync

What You’ve Learned

  • generateChangelog reverse-engineers an existing database into a Liquibase changelog — it needs review and cleanup before use
  • changelogSync marks changesets as applied without executing DDL — the bootstrap step for existing databases
  • snapshot saves database state to JSON for offline comparison — capture it before every production deployment
  • diff compares two live databases or a database against a snapshot — use it to detect drift
  • diffChangelog generates a delta changelog from schema differences — a recovery tool, not a development workflow
  • The full onboarding sequence: generate → review → validate → changelogSync → tag → continue with normal workflow

Next: Article 16 — Rollback in Production: Tag-Based Strategies and CI Validation — operationalising rollback so it is fast, tested, and reliable when you need it at 2am.