
it doesn't seem possible to simulate a failed node with the ondisk example #34

Closed
ouvaa opened this issue Mar 30, 2024 · 10 comments

@ouvaa

ouvaa commented Mar 30, 2024

I've tried running the ondisk example's replica ID 3 in another new directory and it doesn't work.

When run again in a new folder, the error below appears.

2024-03-30 14:17:23.563027 I | dragonboat: LogDB info received, shard 14, busy false
2024-03-30 14:17:23.569342 I | logdb: using plain logdb
2024-03-30 14:17:23.569446 I | dragonboat: LogDB info received, shard 15, busy false
2024-03-30 14:17:23.570693 I | dragonboat: logdb memory limit: 8192 MBytes
2024-03-30 14:17:23.571517 I | dragonboat: NodeHost ID: f4c83988-8282-4f78-ac09-8b9fddfb3614
2024-03-30 14:17:23.571525 I | dragonboat: using regular node registry
2024-03-30 14:17:23.571529 I | dragonboat: filesystem error injection mode enabled: false
2024-03-30 14:17:23.573204 I | dragonboat: transport type: go-tcp-transport
2024-03-30 14:17:23.573214 I | dragonboat: logdb type: sharded-pebble
2024-03-30 14:17:23.573218 I | dragonboat: nodehost address: localhost:63003
2024-03-30 14:17:23.576440 I | dragonboat: [00128:00003] replaying raft logs
Usage - 
put key value
get key
2024-03-30 14:17:23.584012 I | dragonboat: [00128:00003] initialized using <00128:00003:0>
2024-03-30 14:17:23.584028 I | dragonboat: [00128:00003] initial index set to 0
2024-03-30 14:17:24.607819 C | raft: invalid commitTo index 6, lastIndex() 3
panic: invalid commitTo index 6, lastIndex() 3

goroutine 513 [running]:
github.com/lni/goutils/logutil/capnslog.(*PackageLogger).Panicf(0xc0000136e0, {0xb75303?, 0x778826b11108?}, {0xc0002ba1a0?, 0xc00019cfb0?, 0xc000973a38?})
	/root/go/pkg/mod/github.com/lni/goutils@v1.3.1-0.20220604063047-388d67b4dbc4/logutil/capnslog/pkg_logger.go:88 +0xb5
github.com/lni/dragonboat/v4/logger.(*capnsLog).Panicf(0xc00019cf90?, {0xb75303?, 0x411005?}, {0xc0002ba1a0?, 0xa7a8e0?, 0xa73a01?})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/logger/capnslogger.go:74 +0x25
github.com/lni/dragonboat/v4/logger.(*dragonboatLogger).Panicf(0x8?, {0xb75303, 0x29}, {0xc0002ba1a0, 0x2, 0x2})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/logger/logger.go:135 +0x51
github.com/lni/dragonboat/v4/internal/raft.(*entryLog).commitTo(0xc0003ad420, 0x6)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/logentry.go:341 +0xf3
github.com/lni/dragonboat/v4/internal/raft.(*raft).handleHeartbeatMessage(_, {0x11, 0x3, 0x2, 0x80, 0x6, 0x0, 0x0, 0x6, 0x0, ...})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/raft.go:1398 +0x3c
github.com/lni/dragonboat/v4/internal/raft.(*raft).handleFollowerHeartbeat(_, {0x11, 0x3, 0x2, 0x80, 0x6, 0x0, 0x0, 0x6, 0x0, ...})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/raft.go:2134 +0x65
github.com/lni/dragonboat/v4/internal/raft.defaultHandle(_, {0x11, 0x3, 0x2, 0x80, 0x6, 0x0, 0x0, 0x6, 0x0, ...})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/raft.go:2332 +0x75
github.com/lni/dragonboat/v4/internal/raft.(*raft).Handle(_, {0x11, 0x3, 0x2, 0x80, 0x6, 0x0, 0x0, 0x6, 0x0, ...})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/raft.go:1601 +0x102
github.com/lni/dragonboat/v4/internal/raft.(*Peer).Handle(_, {0x11, 0x3, 0x2, 0x80, 0x6, 0x0, 0x0, 0x6, 0x0, ...})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/peer.go:192 +0x185
github.com/lni/dragonboat/v4.(*node).handleReceivedMessages(0xc0001b0008)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/node.go:1364 +0x19b
github.com/lni/dragonboat/v4.(*node).handleEvents(0xc0001b0008)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/node.go:1178 +0xaf
github.com/lni/dragonboat/v4.(*node).stepNode(_)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/node.go:1144 +0x145
github.com/lni/dragonboat/v4.(*engine).processSteps(0xc000142a00, 0x1, 0xc000975de0?, 0xc000210090, {0x10a6480, 0x1?, 0x0}, 0xc0004c70e0?)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/engine.go:1320 +0x25d
github.com/lni/dragonboat/v4.(*engine).stepWorkerMain(0xc000142a00, 0x1)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/engine.go:1254 +0x3c6
github.com/lni/dragonboat/v4.newExecEngine.func1()
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/engine.go:1047 +0x5d
github.com/lni/goutils/syncutil.(*Stopper).runWorker.func1()
	/root/go/pkg/mod/github.com/lni/goutils@v1.3.1-0.20220604063047-388d67b4dbc4/syncutil/stopper.go:79 +0x123
created by github.com/lni/goutils/syncutil.(*Stopper).runWorker in goroutine 1
	/root/go/pkg/mod/github.com/lni/goutils@v1.3.1-0.20220604063047-388d67b4dbc4/syncutil/stopper.go:74 +0xc6

@kevburnsjr
Contributor

This looks like a scenario where the node lost commits at index 4 and 5 and can't recover because the log has already been compacted. I believe the correct operation here is to replace the node rather than trying to recover it.

Does this happen on all nodes or just this one?

@sprappcom

sprappcom commented May 1, 2024

@kevburnsjr just one. Can you provide instructions on how to recover it?

Can you try deleting one of the nodes and then "recovering" it?

@kevburnsjr
Contributor

kevburnsjr commented May 1, 2024

  1. Stop the nodehost
  2. Delete all data on disk
  3. Start the nodehost

It should have a new node host id.
You may need to configure it to join the cluster.
It will then pull a snapshot from one of the other hosts and start replaying logs.
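
For illustration, a minimal sketch of steps 1-3 in Go, using a placeholder data directory and the RaftAddress from the log above; treat it as a sketch, not the exact ondisk example code.

package main

import (
	"log"

	"github.com/lni/dragonboat/v4"
	"github.com/lni/dragonboat/v4/config"
)

func main() {
	// Step 2 happened out of band: the old data directory was wiped, so this
	// NodeHost starts from an empty on-disk state and generates a new
	// NodeHost ID.
	nhc := config.NodeHostConfig{
		NodeHostDir:    "/tmp/ondisk-node-c", // placeholder, now-empty directory
		RTTMillisecond: 200,
		RaftAddress:    "localhost:63003",
	}
	nh, err := dragonboat.NewNodeHost(nhc)
	if err != nil {
		log.Fatalf("failed to start nodehost: %v", err)
	}
	defer nh.Close()
	// The replica still has to be joined back into the shard; see the
	// RequestAddReplica / StartReplica discussion later in this thread.
}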

@sprappcom

sprappcom commented May 1, 2024

@kevburnsjr are you sure? Can you show it in something like an ASCII-cast video? I think that is what I did and it's not working.

Can you show the code as well? Also, how do you configure it to join the cluster?

I've tried this before. I hope you can show a video version where we can see how it's truly done.

Please do the on-disk version.

Please show it. Thanks in advance.

@kevburnsjr
Contributor

Firstly, I would recommend that you experiment with this on a test cluster. Make sure you are simulating failures and learning the operating procedures on a cluster that doesn't contain critical data before performing the operations on a cluster containing production data.

Joining a node to a cluster is an example of a multi-node operation.

Say we have 3 nodes, A, B and C. We'll say that node C is your failed node that we are replacing.

First you need to start the host on node C.

Then you need to call NodeHost.RequestAddReplica on either node A or node B.

func (nh *NodeHost) RequestAddReplica(
	shardID uint64,
	replicaID uint64, 
	target Target, 
	configChangeIndex uint64,
	timeout time.Duration,
) (*RequestState, error)
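
For example, a hedged sketch of that call from node A or B: shard 128 comes from the log above, while the new replica ID (4), the timeout, and the assumption that the replacement gets a fresh replica ID rather than reusing 3 are all mine.

package ops // hypothetical helper package

import (
	"log"
	"time"

	"github.com/lni/dragonboat/v4"
)

// addReplacementReplica runs on a healthy node (A or B) and asks the shard to
// admit the replacement replica hosted at the given RaftAddress.
func addReplacementReplica(nh *dragonboat.NodeHost) {
	rs, err := nh.RequestAddReplica(
		128,               // shardID seen in the log above
		4,                 // replicaID for the replacement (assumed new, not reused)
		"localhost:63003", // target: the replacement NodeHost's RaftAddress
		0,                 // configChangeIndex; 0 skips the ordering check
		3*time.Second,
	)
	if err != nil {
		log.Fatalf("could not submit membership change: %v", err)
	}
	if r := <-rs.ResultC(); !r.Completed() {
		log.Fatalf("membership change did not complete")
	}
}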

Then, to join your replacement node to the existing cluster, you need to call NodeHost.StartReplica on node C with join set to true and an empty initialMembers map (iirc).

func (nh *NodeHost) StartReplica(
	initialMembers map[uint64]Target,
	join bool, 
	create sm.CreateStateMachineFunc, 
	cfg config.Config,
) error

Once you can successfully perform a NodeHost.ReadIndex from node C, you will know that your new node is active.
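
Putting those two calls together for node C, a hedged sketch: shard 128 comes from the log above, replica ID 4 matches the RequestAddReplica sketch, and the remaining config values are placeholders. My reading of the dragonboat docs is that a replica joining an existing shard passes an empty initialMembers map with join set to true; the ondisk example would use the on-disk variant, NodeHost.StartOnDiskReplica, with the same join semantics.

package ops // hypothetical helper package

import (
	"log"
	"time"

	"github.com/lni/dragonboat/v4"
	"github.com/lni/dragonboat/v4/config"
	sm "github.com/lni/dragonboat/v4/statemachine"
)

// joinReplacementReplica runs on node C after RequestAddReplica has completed
// on A or B. All config values below are placeholders.
func joinReplacementReplica(nh *dragonboat.NodeHost, create sm.CreateStateMachineFunc) {
	cfg := config.Config{
		ShardID:            128,
		ReplicaID:          4,
		ElectionRTT:        10,
		HeartbeatRTT:       1,
		CheckQuorum:        true,
		SnapshotEntries:    10,
		CompactionOverhead: 5,
	}
	// Empty initialMembers plus join=true: the replica was added via a
	// membership change and learns the current membership from the shard.
	if err := nh.StartReplica(nil, true, create, cfg); err != nil {
		log.Fatalf("failed to start replica: %v", err)
	}
	// Per the comment above, a successful ReadIndex shows the replica is active.
	rs, err := nh.ReadIndex(128, 3*time.Second)
	if err != nil {
		log.Printf("read index not submitted: %v", err)
		return
	}
	if r := <-rs.ResultC(); !r.Completed() {
		log.Printf("replica not ready yet")
	}
}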


I created Zongzi to simplify these types of multi-node Dragonboat operations using a gRPC coordination API that effectively turns multi-node operations like this into single-node operations. If this were a Zongzi cluster, all you would need to do is delete the data directory and restart the node, and this multi-node rejoin process would happen automatically.

If you're not using Zongzi, then you will likely end up reimplementing portions of it in your own project. Agent.Start is a good place to see the various node host startup scenarios.

@sprappcom

@kevburnsjr I hope you can show it in a video.

Can there be an example that does all of this automatically, so that when a node goes down, a new instance started in another directory just works without intervention?

@kevburnsjr
Contributor

@sprappcom

@lni can you help resolve this issue?

I've run the on-disk example with replicas 1, 2 and 3.

I stopped replica 3, deleted its data folder, and tried to start 3 again (in the same location or elsewhere), and I get the error above.

Is it possible to resolve this?

@lni
Owner

lni commented Jun 18, 2024

Hi,

Deleting a replica's data folder and then restarting the replica would violate the most fundamental requirements of Raft/Paxos. The reason is explained in detail in the following issue, and also covered in the doc/devops.md file.

lni/dragonboat#256

@lni lni closed this as completed Jun 18, 2024
@sprappcom

@lni I explained it here too:
lni/dragonboat#256
