A Death Of A Routing Node

On the frigid evening of Wednesday, January 31st, 2024, as a weak winter sun was setting over Central Europe's data centers, something went very, very wrong for the node known by the alias XMRK.

XMRK was a big routing node, with an IP address placing it in south-central Germany, hosting channels holding several million US dollars' worth of BTC.

At around midnight GMT, a node runner wrote in the popular PLEBNET Telegram group:

"Probably need to drink to xmrk tonight... ~28BTC node might be toast"

A few minutes later, in Alex Bosworth's Telegram group, a node runner wrote:

"Alex, any advice for xmrk? Looks like he came back online with possible data loss. So far 13 FCs and all got revoked"

Bosworth replied:

"don’t come back online if you have data loss"

So: What went wrong? Did this node runner actually lose 28 BTC?

You can read about the whole saga in this Stacker News thread, but let me see if I can explain it here.

Lightning nodes shut down all the time, and it is rarely a problem

Lightning nodes (especially those that use under-powered hardware) do frequently shut down.

Ideally a node's shutdown will be planned, and that node runner will have a chance to request mutual closures. This minimizes costs.

Sometimes that is not possible, most commonly when a node runner cannot boot his node and therefore has to use his Static Channel Backup, resulting in force-closed channels.

As we've seen, force closures can get expensive when there are many pending HTLCs. But that's relatively rare, and in any case, force closures do not lead to loss of funds.

The big danger is also very, very rare: Penalty transactions.

XMRK didn't just have forced closures: He was hit with Penalty Transactions.

Remember, when we talked about Watchtowers, we discussed the important role of penalty transactions in securing the network: If your channel partner tries to cheat, and tries to claim a channel balance that does not match the agreed balance in your channel, your node can instantly "go nuclear", and write a transaction to the blockchain that will grant the ENTIRE balance of the channel to the non-cheating partner.
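To make the penalty rule concrete, here is a minimal toy sketch in Python. It is not how lnd or any real implementation works: the ToyChannel class, its method names, and the balances are all hypothetical, and it ignores fees, reserves, timelocks, and the fact that the other side (or its watchtower) has to notice and act in time. It only shows the rule itself: publish the latest state and everyone gets their agreed balance; publish an older one and the other side takes everything.

```python
from dataclasses import dataclass, field


@dataclass
class ToyChannel:
    """Hypothetical toy channel: a list of numbered balance states, nothing more."""
    capacity_sat: int
    states: list = field(default_factory=list)  # states[i] = (alice_sat, bob_sat)

    def update(self, alice_sat: int, bob_sat: int) -> int:
        """Record a new agreed state; every earlier state is now outdated."""
        if alice_sat + bob_sat != self.capacity_sat:
            raise ValueError("balances must sum to the channel capacity")
        self.states.append((alice_sat, bob_sat))
        return len(self.states) - 1

    def latest(self) -> int:
        """Index of the most recent agreed state."""
        return len(self.states) - 1

    def settle(self, broadcaster: str, state_num: int) -> dict:
        """Payout when `broadcaster` publishes state `state_num` on-chain."""
        if state_num == self.latest():
            alice_sat, bob_sat = self.states[state_num]
            return {"alice": alice_sat, "bob": bob_sat}
        # Outdated state: the other party claims the ENTIRE channel balance.
        other = "bob" if broadcaster == "alice" else "alice"
        return {broadcaster: 0, other: self.capacity_sat}


channel = ToyChannel(capacity_sat=1_000_000)
channel.update(600_000, 400_000)  # state 0
channel.update(300_000, 700_000)  # state 1, the latest

honest = channel.settle("alice", channel.latest())
stale = channel.settle("alice", 0)
print(honest)  # alice keeps her agreed 300,000 sats
print(stale)   # alice published old state and ends up with nothing
```

Notice that the sketch never asks WHY the old state was published. A cheater and a node that simply forgot its latest state look exactly the same on-chain.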

In the case of XMRK, however...

XMRK was not trying to cheat

Nobody knows for sure how it happened, but it seems that XMRK inadvertently published outdated channel state.

This is why, in his cryptic message, Bosworth wrote:

"don’t come back online if you have data loss"

The problem was that something had gone terribly wrong in XMRK's node database: it had "forgotten" some number of recent transactions and therefore, by mistake, tried to claim balances that did not match the agreed balances in those channels.

How did this happen?

It looks like some kind of nasty hardware problem.

In his Stacker News post, XMRK wrote:

"I stopped lnd, installed updates for 30-60 minutes, wanted to reboot because systemd did not respond, but even reboot did nothing. Decided to do sync, wait a few seconds and do reboot -f. As if I haven't heard about umount or mount -o remount,ro ... sigh. lnd data are stored on ZFS pool as RAID1. Working theory is that important data stayed only in cache, lnd was running on 32 GB RAM machine while channel.db has 14GB so enough to cache everything. Still surprises me, usually lnd is writing at least 1 MB/s, as seen in dstat, perhaps some data is written and some is not?"

It seems that something was not right with XMRK's database, so he rebooted, and when he rebooted, his node auto-started, and tried to claim incorrect balances on the network.

His channel partners (and/or their active watchtowers) then issued penalty transactions, and XMRK lost all of his balance in those channels instantly.

This is why I tell you not to auto-start your node with systemd

Keep in mind that 90%+ of the node runner community might disagree with me, but I never allow my nodes to auto-start on boot as I described here.

And: This is why you can't properly back up your node

You'll notice that nowhere in this tutorial have I told you how to "back up" your node.

We've definitely spent time on disaster recovery, but that's not really a backup, it's just a safe way to close all your channels if your node is compromised or destroyed.

The somewhat stunning fact is this:

You basically can NOT back up your node.

This is for a very simple and quite terrifying reason: The only way to safely back up a node (and retain the channel state), would be to back up the entire channel database immediately after EACH INDIVIDUAL TRANSACTION.

And that amount of disk activity, if you're doing thousands of transactions per day, would likely grind your computer to a halt.
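A rough back-of-the-envelope calculation shows the scale. This sketch assumes the most naive approach possible (copying the whole database file after every transaction), uses the 14 GB channel.db size that XMRK reported, and picks 2,000 transactions per day as an arbitrary stand-in for "thousands". Those numbers are illustrative assumptions, not measurements.

```python
# Assumptions for illustration only.
CHANNEL_DB_BYTES = 14 * 10**9    # 14 GB, the channel.db size XMRK reported
TRANSACTIONS_PER_DAY = 2_000     # assumed stand-in for "thousands per day"
SECONDS_PER_DAY = 24 * 60 * 60

# One full copy of the database per transaction.
bytes_per_day = CHANNEL_DB_BYTES * TRANSACTIONS_PER_DAY
sustained_mb_per_s = bytes_per_day / SECONDS_PER_DAY / 10**6

print(f"{bytes_per_day / 10**12:.0f} TB/day, ~{sustained_mb_per_s:.0f} MB/s sustained")
```

Under those assumptions you would be writing tens of terabytes per day, around the clock, just to keep the backup current.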

I don't get it

I know, it's confusing. But think of it this way. Let's imagine a situation where we were backing up our node every ten minutes, and averaging about one transaction per minute.

Minute	Event
1	Transaction
2	Transaction
3	Transaction
4	Transaction
5	Transaction
6	Transaction
7	Transaction
8	Transaction
9	Transaction
10	Transaction, Backup
11	Transaction
12	Transaction
13	Transaction
14	Transaction

So you can see at minute 10, in addition to a Transaction, you have backed up your node.

Now, we are at minute 15, and your node suddenly blows up, and you can't boot it.

What do you do?

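You see, if you used that backup from minute 10 -- you would be fucked! That backup has the wrong channel balance: it is missing the transactions from minutes 11 through 14, so a node restored from it would publish outdated state, which is exactly what gets you hit with a penalty transaction.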
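The timeline above can be replayed in a few lines of Python. This is only a sketch of the reasoning, with hypothetical names, and it assumes the backup at minute 10 includes that minute's transaction.

```python
def transactions_missing_from_backup(backup_minute, transaction_minutes):
    """Return the transactions that happened after the backup was taken."""
    return [m for m in transaction_minutes if m > backup_minute]


transaction_minutes = list(range(1, 15))  # minutes 1..14, as in the table
backup_minute = 10                        # the only backup we have

missing = transactions_missing_from_backup(backup_minute, transaction_minutes)
backup_is_stale = len(missing) > 0

print(missing)          # the transactions the backup knows nothing about
print(backup_is_stale)  # a stale backup means outdated channel state
```

The backup only stops being stale if the list of missing transactions is empty, which means taking it after the very last transaction. That is why a periodic backup, no matter how frequent, is not good enough.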

Smart people are trying to fix this

It's not a good thing that node runners can "accidentally" publish outdated state, and then be subjected to a penalty transaction. A couple teams of people who are significantly smarter than myself are currently working on this, and trying to figure out a "safe" way that a node could somehow "recover" its balances, and keep going, after data loss.
