feat(migration): add lock heartbeats, predictive dry-run planning, and stricter ledger option validation

2026-04-14 12:31:34 +00:00
parent 19ebdee31a
commit 1b4358aca5
17 changed files with 695 additions and 180 deletions
@@ -28,8 +28,8 @@ Report bugs and security issues at [community.foss.global](https://community.fos
 | **Unified mongo + S3** | A single semver represents your combined data version. One step can touch both. |
 | **Drivers exposed via context** | `ctx.db`, `ctx.mongo`, `ctx.bucket`, `ctx.s3` — write any operation you can write directly |
 | **Idempotent** | `run()` is a no-op when data is at `targetVersion`. Costs one ledger read on the happy path. |
-| **Resumable** | Mark a step `.resumable()` and it gets `ctx.checkpoint.read/write/clear` for restartable bulk operations |
-| **Lockable** | Mongo-backed lock with TTL serializes concurrent SaaS instances — safe for rolling deploys |
+| **Resumable** | Mark a step `.resumable()` and it gets `ctx.checkpoint.read/write/clear` for restartable bulk operations; successful runs clear the step checkpoint automatically |
+| **Lockable** | Mongo-backed lock uses atomic updates plus TTL heartbeats to serialize concurrent SaaS instances — safe for rolling deploys |
 | **Fresh-install fast path** | Configure `freshInstallVersion` to skip migrations on a brand-new database |
 | **Dry-run** | `dryRun: true` or `.plan()` returns the execution plan without writing anything |
 | **Structured errors** | All failures throw `SmartMigrationError` with a stable `code` field for branching |
@@ -83,8 +83,9 @@ migration
      const cursor = ctx.bucket!.createCursor('uploads/');
      const startToken = await ctx.checkpoint!.read<string>('cursorToken');
      if (startToken) cursor.setToken(startToken);
-      while (await cursor.hasMore()) {
-        for (const key of (await cursor.next()) ?? []) {
+      while (cursor.hasMore()) {
+        const { keys } = await cursor.next();
+        for (const key of keys) {
          await ctx.bucket!.fastMove({
            sourcePath: key,
            destinationPath: 'media/' + key.slice('uploads/'.length),
@@ -134,7 +135,7 @@ If no step's `toVersion` is greater than `currentVersion` (the ledger is past th

 The ledger is the source of truth for "what data version are we at, what steps have been applied, who holds the lock right now." It is persisted in one of two backends:

- **`mongo` (default when `db` is provided)** — backed by smartdata's `EasyStore`, stored as a single document. Lock semantics work safely across multiple SaaS instances. **Recommended.**
+- **`mongo` (default when `db` is provided)** — stored as a single document in smartdata's `SmartdataEasyStore` collection. Lock acquisition / renewal uses atomic mongo updates and works safely across multiple SaaS instances. **Recommended.**
 - **`s3` (default when only `bucket` is provided)** — a single JSON object at `<bucket>/.smartmigration/<ledgerName>.json`. Lock is best-effort because S3 has no atomic CAS without additional infrastructure; do not use for multi-instance deployments without external coordination.

 If you pass both `db` and `bucket`, mongo is used.
@@ -217,9 +218,9 @@ migration
    const cursor = ctx.bucket!.createCursor('attachments/legacy/', { pageSize: 200 });
    const start = await ctx.checkpoint!.read<string>('cursor');
    if (start) cursor.setToken(start);
-    while (await cursor.hasMore()) {
-      const batch = (await cursor.next()) ?? [];
-      for (const key of batch) {
+    while (cursor.hasMore()) {
+      const { keys } = await cursor.next();
+      for (const key of keys) {
        await ctx.bucket!.fastMove({
          sourcePath: key,
          destinationPath: 'attachments/' + key.slice('attachments/legacy/'.length),
@@ -232,6 +233,7 @@ migration
 ```

 If the process crashes mid-migration, the next call to `run()` will resume from the last persisted cursor token.
+If the step completes successfully, smartmigration clears that step's checkpoint bag automatically.
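
The resume behaviour can be illustrated without a real bucket. The sketch below is not smartmigration's implementation: `CheckpointBagSketch`, `processPages`, and the `failAtPage` crash switch are hypothetical stand-ins, and the calls are synchronous for brevity, whereas the README's `ctx.checkpoint` calls are awaited. It only shows the read / write / clear pattern around a paged loop.

```typescript
// Hypothetical in-memory stand-in for a step's checkpoint bag.
class CheckpointBagSketch {
  private data = new Map<string, unknown>();
  read<T>(key: string): T | undefined {
    return this.data.get(key) as T | undefined;
  }
  write(key: string, value: unknown): void {
    this.data.set(key, value);
  }
  clear(): void {
    this.data.clear();
  }
}

// Processes `items` page by page and persists the next index after each page.
// `failAtPage` (hypothetical) simulates a crash just before that page is handled.
function processPages(
  items: string[],
  pageSize: number,
  checkpoint: CheckpointBagSketch,
  processed: string[],
  failAtPage?: number,
): void {
  let index = checkpoint.read<number>('nextIndex') ?? 0;
  while (index < items.length) {
    if (failAtPage !== undefined && index / pageSize === failAtPage) {
      throw new Error('simulated crash');
    }
    const page = items.slice(index, index + pageSize);
    processed.push(...page);
    index += page.length;
    checkpoint.write('nextIndex', index);
  }
  // The README says the runner clears the bag after success; the sketch does it inline.
  checkpoint.clear();
}

const bag = new CheckpointBagSketch();
const done: string[] = [];
try {
  processPages(['a', 'b', 'c', 'd', 'e'], 2, bag, done, 1); // crashes before the second page
} catch {
  // the checkpoint still holds the index of the first unprocessed item
}
processPages(['a', 'b', 'c', 'd', 'e'], 2, bag, done); // resumes at 'c'
```

After the second call every item has been processed exactly once and the bag is empty again.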

 ### Mongo + S3 in lockstep

@@ -276,6 +278,7 @@ await migration.run(); // for a fresh DB, runs zero steps and stamps version 5.0
 const planned = await migration.plan();
 console.log(`would apply ${planned.stepsSkipped.length} steps:`,
  planned.stepsSkipped.map((s) => `${s.id}(${s.fromVersion}→${s.toVersion})`).join(' → '));
+console.log(`predicted version after run: ${planned.currentVersionAfter}`);

 // or — same thing, via dryRun option
 const m = new SmartMigration({ targetVersion: '2.0.0', db, dryRun: true });
@@ -283,6 +286,8 @@ const m = new SmartMigration({ targetVersion: '2.0.0', db, dryRun: true });
 const result = await m.run(); // returns plan, doesn't write
 ```

+`plan()` / `dryRun` resolve the same fresh-install shortcut that `run()` would use, so the returned `currentVersionAfter` is the predicted post-run version. For S3-backed ledgers, planning does not create the `.smartmigration/<ledger>.json` sidecar.
+
 ## API reference

 ### `new SmartMigration(options: ISmartMigrationOptions)`
@@ -306,6 +311,9 @@ The constructor throws `SmartMigrationError` with one of these `code`s on bad in
 - `INVALID_VERSION` — `targetVersion` is not a valid semver
 - `NO_RESOURCES` — neither `db` nor `bucket` provided
 - `LEDGER_BACKEND_MISMATCH` — explicit `ledgerBackend` doesn't match the resources you provided
+- `INVALID_LEDGER_NAME` — `ledgerName` is blank or not a string
+- `INVALID_LOCK_WAIT_MS` — `lockWaitMs` is not an integer `>= 0`
+- `INVALID_LOCK_TTL_MS` — `lockTtlMs` is not an integer `>= 1`
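
As a rough illustration of the three new validation rules, the following sketch checks the same conditions and throws an error carrying the matching `code`. `OptionErrorSketch` and `validateLedgerOptions` are hypothetical names, not the library's API, and treating `undefined` as "use the default" is an assumption.

```typescript
// Hypothetical stand-in for SmartMigrationError; only the `code` field matters here.
class OptionErrorSketch extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}

interface ILedgerOptionsSketch {
  ledgerName?: unknown;
  lockWaitMs?: unknown;
  lockTtlMs?: unknown;
}

// Mirrors the documented rules. Assumption: `undefined` means "fall back to the default".
function validateLedgerOptions(opts: ILedgerOptionsSketch): void {
  const { ledgerName, lockWaitMs, lockTtlMs } = opts;
  if (ledgerName !== undefined && (typeof ledgerName !== 'string' || ledgerName.trim() === '')) {
    throw new OptionErrorSketch('INVALID_LEDGER_NAME', 'ledgerName must be a non-blank string');
  }
  if (lockWaitMs !== undefined && !(Number.isInteger(lockWaitMs) && (lockWaitMs as number) >= 0)) {
    throw new OptionErrorSketch('INVALID_LOCK_WAIT_MS', 'lockWaitMs must be an integer >= 0');
  }
  if (lockTtlMs !== undefined && !(Number.isInteger(lockTtlMs) && (lockTtlMs as number) >= 1)) {
    throw new OptionErrorSketch('INVALID_LOCK_TTL_MS', 'lockTtlMs must be an integer >= 1');
  }
}

// Returns the error code for a set of options, or 'OK' when they pass.
function codeFor(opts: ILedgerOptionsSketch): string {
  try {
    validateLedgerOptions(opts);
    return 'OK';
  } catch (err) {
    return err instanceof OptionErrorSketch ? err.code : 'UNEXPECTED';
  }
}
```

Under these assumptions, `codeFor({ lockTtlMs: 0 })` yields `INVALID_LOCK_TTL_MS`, while `codeFor({ lockWaitMs: 0 })` passes because zero is an allowed wait.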

 ### `migration.step(id: string).from(v).to(v).[description(t)].[resumable()].up(handler)`

@@ -321,7 +329,7 @@ interface IMigrationRunResult {
  currentVersionAfter: string;
  targetVersion: string;
  wasUpToDate: boolean;        // true if no steps ran
-  wasFreshInstall: boolean;    // true if freshInstallVersion was used
+  wasFreshInstall: boolean;    // true if startup took the fresh-install shortcut
  stepsApplied: IMigrationStepResult[];
  stepsSkipped: IMigrationStepResult[];
  totalDurationMs: number;
@@ -330,12 +338,13 @@ interface IMigrationRunResult {

 Throws `SmartMigrationError` with these `code`s:
 - `LOCK_TIMEOUT` — could not acquire lock within `lockWaitMs`
+- `LOCK_LOST` — the runner lost its lock while a migration was still in flight
 - `STEP_FAILED` — a step's handler threw; the failure is persisted to the ledger
 - `CHAIN_*`, `DUPLICATE_STEP_ID`, `NON_INCREASING_STEP`, `TARGET_NOT_REACHABLE`, `DOWNGRADE_NOT_SUPPORTED` — chain validation / planning errors
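
Because `code` is stable, callers can branch on it rather than on message text. The sketch below groups the codes listed above into coarse reactions. `classifyFailure` and the action names are hypothetical, chosen only for illustration, and the function inspects any object with a `code` field so it does not depend on the real `SmartMigrationError` class.

```typescript
// Hypothetical action labels; a real deployment would pick its own policy.
type TFailureAction = 'retry-later' | 'abort-and-alert' | 'fix-step' | 'fix-chain' | 'unknown';

// Planning / chain validation codes named in the README besides the CHAIN_* family.
const planningCodes = [
  'DUPLICATE_STEP_ID',
  'NON_INCREASING_STEP',
  'TARGET_NOT_REACHABLE',
  'DOWNGRADE_NOT_SUPPORTED',
];

function classifyFailure(err: unknown): TFailureAction {
  // Read `code` defensively so arbitrary thrown values are handled.
  const code =
    typeof err === 'object' && err !== null && 'code' in err
      ? String((err as { code: unknown }).code)
      : '';
  if (code === 'LOCK_TIMEOUT') return 'retry-later';
  if (code === 'LOCK_LOST') return 'abort-and-alert';
  if (code === 'STEP_FAILED') return 'fix-step';
  if (code.startsWith('CHAIN_') || planningCodes.includes(code)) return 'fix-chain';
  return 'unknown';
}
```

A typical call site would wrap `await migration.run()` in `try` / `catch` and pass the caught value to a function like this one.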

 ### `migration.plan(): Promise<IMigrationRunResult>`

-Same as `run()` but does not acquire the lock or execute anything. Useful for `--dry-run` style probes in CI.
+Same as `run()` but does not acquire the lock or execute anything. Useful for `--dry-run` style probes in CI. `stepsSkipped` lists the steps that would run, and `currentVersionAfter` is the predicted post-run version after fresh-install resolution.

 ### `migration.getCurrentVersion(): Promise<string | null>`

@@ -347,6 +356,10 @@ Returns the current data version from the ledger, or `null` if the ledger has ne

 Another instance crashed while holding the lock. Wait for `lockTtlMs` (default 10 minutes) for the lock to expire, or manually clear the `lock` field on the ledger document.

+### `LOCK_LOST` during a migration
+
+Either the runner stopped renewing its lock, or another instance replaced the lock document, while a step was still running. Check for long-running steps, make sure `lockTtlMs` is comfortably larger than the expected renewal cadence, and investigate other processes touching the same ledger.
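
To make the failure mode concrete, here is a minimal in-memory model of a TTL lock with heartbeat renewal. It is not the library's mongo implementation: `InMemoryLockSketch`, its method names, and the explicit `now` parameter are all hypothetical. The point is only that a renewal succeeds while the caller still holds an unexpired lock and fails once the TTL has lapsed or another holder has taken over, which is the situation `LOCK_LOST` reports.

```typescript
interface ILockDocSketch {
  holderId: string;
  expiresAt: number; // epoch-style milliseconds
}

// Hypothetical single-process model; a real backend would do each check-and-set atomically.
class InMemoryLockSketch {
  private lock: ILockDocSketch | null = null;

  // Acquire only if the lock is free, expired, or already ours.
  acquire(holderId: string, ttlMs: number, now: number): boolean {
    const heldByOther =
      this.lock !== null && this.lock.expiresAt > now && this.lock.holderId !== holderId;
    if (heldByOther) return false;
    this.lock = { holderId, expiresAt: now + ttlMs };
    return true;
  }

  // Heartbeat: extend the expiry only if we still hold an unexpired lock.
  renew(holderId: string, ttlMs: number, now: number): boolean {
    if (this.lock === null || this.lock.holderId !== holderId || this.lock.expiresAt <= now) {
      return false; // the caller would surface this as LOCK_LOST
    }
    this.lock.expiresAt = now + ttlMs;
    return true;
  }
}

const lock = new InMemoryLockSketch();
const aAcquired = lock.acquire('instance-a', 1000, 0); // A holds until t=1000
const aRenewed = lock.renew('instance-a', 1000, 400); // heartbeat, A now holds until t=1400
const bBlocked = lock.acquire('instance-b', 1000, 500); // refused, A is still live
const bTookOver = lock.acquire('instance-b', 1000, 1500); // A missed its heartbeat and expired
const aLost = !lock.renew('instance-a', 1000, 1600); // A's late heartbeat is rejected
```

In this model, a holder that renews well before `expiresAt` keeps the lock; the TTL only needs to cover the gap between two heartbeats plus some slack.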
+
 ### `CHAIN_GAP` at startup

 Two adjacent steps have mismatched versions: `step[N].to !== step[N+1].from`. Steps must form a contiguous chain in registration order. Fix the version on the offending step.
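
The contiguity rule can be checked with a single pass over the steps in registration order. The sketch below is an illustration of the stated condition `step[N].to !== step[N+1].from`, not the library's validator; `IStepSketch` and `findChainGap` are hypothetical names.

```typescript
interface IStepSketch {
  id: string;
  from: string;
  to: string;
}

// Returns the first adjacent pair whose versions do not line up, or null if the chain is contiguous.
function findChainGap(steps: IStepSketch[]): { before: string; after: string } | null {
  for (let i = 0; i + 1 < steps.length; i++) {
    if (steps[i].to !== steps[i + 1].from) {
      return { before: steps[i].id, after: steps[i + 1].id };
    }
  }
  return null;
}

const contiguous: IStepSketch[] = [
  { id: 'a', from: '1.0.0', to: '1.1.0' },
  { id: 'b', from: '1.1.0', to: '2.0.0' },
];
const broken: IStepSketch[] = [
  { id: 'a', from: '1.0.0', to: '1.1.0' },
  { id: 'c', from: '1.2.0', to: '2.0.0' }, // starts at 1.2.0, but the previous step ended at 1.1.0
];
const gapInContiguous = findChainGap(contiguous);
const gapInBroken = findChainGap(broken);
```

For the `broken` chain the function reports the pair `a` / `c`, which points at the step whose `from` or `to` needs correcting.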