It started with a locker
I was the sole programmer on a project under R&D, writing the firmware, the mobile app, and the backend all at once. One of those "figure it out as we go" situations that forces us to actually think.
At some point our R&D head mentioned the system might be deployed to out-of-city clients. That one sentence changed everything. Suddenly this wasn't a local lab setup. It was something that would need to be maintained remotely. And that meant one thing: how do we update firmware on a device we can't physically touch?
I looked around. Most OTA solutions were LAN-only, enterprise-grade overkill, or tightly coupled to specific vendor platforms. None of them felt right for a small custom ESP32 project. So I built something minimal myself: a quick API, some OTA logic in the firmware, all deployed to a VPS. Done, right?
Not really. It worked, but it was messy. The firmware and the backend were tightly coupled in ways that made changes fragile. Reusing any of it later would be painful. So I split them, and that split was the moment VoyagerOTA went from a quick utility to an actual platform.
The name
I've always been obsessed with space. So when it came time to name this thing, I went straight to the Voyager probes: NASA's spacecraft that have been flying since 1977, still sending data from the edge of the solar system. They carry a Golden Record: music, greetings, and sounds from Earth.
But the part that really got me: Voyager 1 developed a software problem billions of miles away. NASA tracked down engineers who had worked on the original code, studied plans from the 70s, ran tests, and sent a software fix across interstellar space. And it worked.
What the platform does
VoyagerOTA manages over-the-air firmware updates with a proper lifecycle: Draft, Staging, Production, and Revoked. Releases don't just get uploaded and served; they go through a validation pipeline before hitting any real device.
The idea is simple: a bad firmware update on a remote device is a catastrophic outcome. So the system tries hard to stop us from doing something stupid before we do it.
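To make that concrete, here's an illustrative sketch of how a lifecycle like this can be enforced with an explicit transition map. The state names are VoyagerOTA's; the code itself, and the exact allowed edges, are my assumption, not the actual implementation:

// Illustrative only: the states come from the platform, the code is a sketch.
type ReleaseState = "Draft" | "Staging" | "Production" | "Revoked";

const ALLOWED: Record<ReleaseState, ReleaseState[]> = {
  Draft: ["Staging"],               // must pass validation to reach Staging
  Staging: ["Production", "Draft"], // promote, or send back for rework
  Production: ["Revoked"],          // no silent downgrades out of Production
  Revoked: [],                      // terminal state
};

function assertTransition(from: ReleaseState, to: ReleaseState): void {
  if (!ALLOWED[from].includes(to)) {
    throw new Error(`Illegal release transition: ${from} -> ${to}`);
  }
}

Modeling the transitions as data means an invalid jump like Draft straight to Production fails loudly instead of slipping through.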
The bugs: sharing my own rambling comments here, but their English is polished for the blog :P
Bug #1: The semver comparison that lied
VoyagerOTA enforces monotonic semver: we can't upload a release that's "older" than what's already out. Makes sense. But the comparison logic had a silent flaw.
The original approach converted a version like 1.10.2 into a flat number (1102) by just concatenating the parts. Fast, simple. Also completely wrong for any version with double-digit minor or patch numbers.
// ! [PRI-0]: Critical edge case in version comparison
//
// ! Problem:
// ! x = 1.10.2 → 1102
// ! y = 1.9.10 → 1910
// ! x > y == false WRONG. semver says x > y
//
// After fix — pad MINOR and PATCH to 5 digits:
// ! x` = 1.10.2 → 10001000002
// ! y` = 1.9.10 → 10000900010
// ! x` > y` == true ✓
//
// * Patched via getNormalizedVersion()
The fix was straightforward once spotted: pad the minor and patch components to a fixed width before concatenation. But if this had slipped into production, devices could have been served stale firmware marked as "latest."
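For reference, here's a minimal sketch of what getNormalizedVersion() might look like. The actual implementation isn't shown in this post, so the parsing details are my assumption:

// Sketch of the fixed-width normalization; the real getNormalizedVersion()
// in VoyagerOTA may differ in details.
function getNormalizedVersion(version: string): bigint {
  const [major, minor, patch] = version.split(".").map(Number);
  // Pad minor and patch to 5 digits so 1.10.2 -> 10001000002 and
  // 1.9.10 -> 10000900010: numeric comparison now agrees with semver.
  const pad = (n: number): string => String(n).padStart(5, "0");
  return BigInt(`${major}${pad(minor)}${pad(patch)}`);
}

getNormalizedVersion("1.10.2") > getNormalizedVersion("1.9.10"); // true

Using BigInt keeps the comparison safe even when the padded number outgrows the 53-bit integer range of a plain Number.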
Bug #2: The dual-write problem on release restoration
This was the hardest one for me; I had to keep reading architecture blogs, Reddit threads, and anything else covering this specific problem. When a user deletes or restores a firmware release, two things need to happen atomically: update the database record and update the binary file in storage. The problem is that those are two separate systems. If one succeeds and the other fails, we end up in a partial state: the database says the file exists, but it doesn't. Or vice versa.
The dual-write issue. Code from the ArtifactReleaseService module:
if (release.isStaging()) {
  const transaction = await db.transaction();
  try {
    const artifact = await ArtifactDAL.findProcessedArtifact(release.getId());
    await ReleaseDAL.deleteReleaseByPublicId(releaseId, transaction);
    await ArtifactDAL.deleteArtifactByReleaseId(release.getId(), transaction);
    const filename = artifact!.getFileName();
    await transaction.commit();
    // queue job dispatched OUTSIDE the transaction
    // if this fails, DB is updated but file is never deleted
    await this._queue.putJob({ filename: filename, mode: "soft-delete" });
    Logger.info("Release successfully deleted");
  } catch (error) {
    Logger.error(error as string);
    await transaction.rollback();
    throw error;
  }
}
// ! Release Restoration — dual-write issue
// ! State can be partial for some time.
// ! Either: file restored first, or a separate db flag first?
// !
// ! Options considered:
// ! Outbox pattern (polling outbox table via worker)
// ! CDC (Change Data Capture)
// !
// ! CDC would be massive technical debt for this scale.
// ! Polling has db perf implications even on indexed records.
// !
// * Fixed via Outbox pattern:
// * 1. Write event to Outbox table (releaseId, artifactId,
// * event type, state: pending)
// * 2. Relay service polls outbox, pushes to BullMQ queue
// * 3. Worker picks up job → executes delete or restore
// * 4. State tracked: pending → processing → processed
if (release.isStaging()) {
  const transaction = await db.transaction();
  try {
    const artifact = await ArtifactDAL.findProcessedArtifact(release.getId());
    await ReleaseDAL.deleteReleaseByPublicId(releaseId, transaction);
    await ArtifactDAL.deleteArtifactByReleaseId(release.getId(), transaction);
    // outbox row written atomically with the delete
    // relay service will pick this up and push to BullMQ
    await OutBoxDAL.createOutbox(
      release.getId(),
      artifact!.getId(),
      "pending",
      "delete",
      transaction,
    );
    await transaction.commit();
    Logger.info("Release has been accepted for deletion!");
  } catch (error) {
    Logger.error(error as string);
    await transaction.rollback();
    throw error;
  }
}
The Outbox pattern was the right call here. It trades some latency for proper consistency tracking: every operation has a record, every state transition is visible.
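The relay service itself isn't shown in this post, so here's a rough sketch of step 2 from the comment above. The OutBoxDAL methods (findPendingOutboxRows, markProcessing) and the loop shape are hypothetical; only createOutbox appears in the real code:

// Rough sketch of the relay loop. It polls the outbox table for
// pending rows and forwards each one to the BullMQ queue.
import { Queue } from "bullmq";

const fileJobs = new Queue("file-jobs");

async function relayOutboxOnce(): Promise<void> {
  const rows = await OutBoxDAL.findPendingOutboxRows(); // state = "pending"
  for (const row of rows) {
    // Enqueue first, then advance the state. If the relay crashes between
    // the two steps the row gets re-published, so the worker has to be
    // idempotent (at-least-once delivery, the usual Outbox trade-off).
    await fileJobs.add(row.eventType, {
      releaseId: row.releaseId,
      artifactId: row.artifactId,
    });
    await OutBoxDAL.markProcessing(row.id); // worker marks "processed" when done
  }
}

// Poll on an interval: small enough for acceptable latency, large
// enough to keep the indexed outbox query cheap.
setInterval(() => relayOutboxOnce().catch((e) => Logger.error(String(e))), 5000);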
Where it goes from here
The client SDK handles the version comparison and binary fetching on the ESP32 side.
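The post doesn't detail the SDK internals, and the real thing is ESP32 firmware code rather than TypeScript, but conceptually the device-side flow looks something like the sketch below. Every endpoint and field name here is hypothetical:

// Conceptual sketch only; the real SDK runs on the ESP32, and the
// route and response shape are made up for illustration.
async function checkForUpdate(runningVersion: string): Promise<void> {
  const res = await fetch("https://ota.example.com/api/releases/latest");
  const latest: { version: string; binaryUrl: string } = await res.json();

  // Same fixed-width normalization as the server (see Bug #1), so both
  // sides agree on what "newer" means.
  if (getNormalizedVersion(latest.version) > getNormalizedVersion(runningVersion)) {
    // On a real device: stream the binary to the OTA partition,
    // verify the checksum, then reboot into the new firmware.
    await fetch(latest.binaryUrl);
  }
}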
Is it production-ready for millions of devices? Not today. Is it the right architecture to grow into that? I think so, though I'm not certain yet. And honestly, building this taught me more about backend systems, consistency problems, and queue-based architectures than any tutorial ever could have.