
it doesn't seem possible to simulate a failed node with the ondisk example #34

Closed
ouvaa opened this issue Mar 30, 2024 · 10 comments

@ouvaa

ouvaa commented Mar 30, 2024

I've tried running the ondisk example's replica ID 3 in another new directory and it doesn't work.

When run again in a new folder, the error below appears.

2024-03-30 14:17:23.563027 I | dragonboat: LogDB info received, shard 14, busy false
2024-03-30 14:17:23.569342 I | logdb: using plain logdb
2024-03-30 14:17:23.569446 I | dragonboat: LogDB info received, shard 15, busy false
2024-03-30 14:17:23.570693 I | dragonboat: logdb memory limit: 8192 MBytes
2024-03-30 14:17:23.571517 I | dragonboat: NodeHost ID: f4c83988-8282-4f78-ac09-8b9fddfb3614
2024-03-30 14:17:23.571525 I | dragonboat: using regular node registry
2024-03-30 14:17:23.571529 I | dragonboat: filesystem error injection mode enabled: false
2024-03-30 14:17:23.573204 I | dragonboat: transport type: go-tcp-transport
2024-03-30 14:17:23.573214 I | dragonboat: logdb type: sharded-pebble
2024-03-30 14:17:23.573218 I | dragonboat: nodehost address: localhost:63003
2024-03-30 14:17:23.576440 I | dragonboat: [00128:00003] replaying raft logs
Usage - 
put key value
get key
2024-03-30 14:17:23.584012 I | dragonboat: [00128:00003] initialized using <00128:00003:0>
2024-03-30 14:17:23.584028 I | dragonboat: [00128:00003] initial index set to 0
2024-03-30 14:17:24.607819 C | raft: invalid commitTo index 6, lastIndex() 3
panic: invalid commitTo index 6, lastIndex() 3

goroutine 513 [running]:
github.com/lni/goutils/logutil/capnslog.(*PackageLogger).Panicf(0xc0000136e0, {0xb75303?, 0x778826b11108?}, {0xc0002ba1a0?, 0xc00019cfb0?, 0xc000973a38?})
	/root/go/pkg/mod/github.com/lni/goutils@v1.3.1-0.20220604063047-388d67b4dbc4/logutil/capnslog/pkg_logger.go:88 +0xb5
github.com/lni/dragonboat/v4/logger.(*capnsLog).Panicf(0xc00019cf90?, {0xb75303?, 0x411005?}, {0xc0002ba1a0?, 0xa7a8e0?, 0xa73a01?})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/logger/capnslogger.go:74 +0x25
github.com/lni/dragonboat/v4/logger.(*dragonboatLogger).Panicf(0x8?, {0xb75303, 0x29}, {0xc0002ba1a0, 0x2, 0x2})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/logger/logger.go:135 +0x51
github.com/lni/dragonboat/v4/internal/raft.(*entryLog).commitTo(0xc0003ad420, 0x6)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/logentry.go:341 +0xf3
github.com/lni/dragonboat/v4/internal/raft.(*raft).handleHeartbeatMessage(_, {0x11, 0x3, 0x2, 0x80, 0x6, 0x0, 0x0, 0x6, 0x0, ...})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/raft.go:1398 +0x3c
github.com/lni/dragonboat/v4/internal/raft.(*raft).handleFollowerHeartbeat(_, {0x11, 0x3, 0x2, 0x80, 0x6, 0x0, 0x0, 0x6, 0x0, ...})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/raft.go:2134 +0x65
github.com/lni/dragonboat/v4/internal/raft.defaultHandle(_, {0x11, 0x3, 0x2, 0x80, 0x6, 0x0, 0x0, 0x6, 0x0, ...})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/raft.go:2332 +0x75
github.com/lni/dragonboat/v4/internal/raft.(*raft).Handle(_, {0x11, 0x3, 0x2, 0x80, 0x6, 0x0, 0x0, 0x6, 0x0, ...})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/raft.go:1601 +0x102
github.com/lni/dragonboat/v4/internal/raft.(*Peer).Handle(_, {0x11, 0x3, 0x2, 0x80, 0x6, 0x0, 0x0, 0x6, 0x0, ...})
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/internal/raft/peer.go:192 +0x185
github.com/lni/dragonboat/v4.(*node).handleReceivedMessages(0xc0001b0008)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/node.go:1364 +0x19b
github.com/lni/dragonboat/v4.(*node).handleEvents(0xc0001b0008)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/node.go:1178 +0xaf
github.com/lni/dragonboat/v4.(*node).stepNode(_)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/node.go:1144 +0x145
github.com/lni/dragonboat/v4.(*engine).processSteps(0xc000142a00, 0x1, 0xc000975de0?, 0xc000210090, {0x10a6480, 0x1?, 0x0}, 0xc0004c70e0?)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/engine.go:1320 +0x25d
github.com/lni/dragonboat/v4.(*engine).stepWorkerMain(0xc000142a00, 0x1)
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/engine.go:1254 +0x3c6
github.com/lni/dragonboat/v4.newExecEngine.func1()
	/root/go/pkg/mod/github.com/lni/dragonboat/v4@v4.0.0-20230917160253-d9f49378cd2d/engine.go:1047 +0x5d
github.com/lni/goutils/syncutil.(*Stopper).runWorker.func1()
	/root/go/pkg/mod/github.com/lni/goutils@v1.3.1-0.20220604063047-388d67b4dbc4/syncutil/stopper.go:79 +0x123
created by github.com/lni/goutils/syncutil.(*Stopper).runWorker in goroutine 1
	/root/go/pkg/mod/github.com/lni/goutils@v1.3.1-0.20220604063047-388d67b4dbc4/syncutil/stopper.go:74 +0xc6

@kevburnsjr
Contributor

This looks like a scenario where the node lost commits at index 4 and 5 and can't recover because the log has already been compacted. I believe the correct operation here is to replace the node rather than trying to recover it.

Does this happen on all nodes or just this one?

@sprappcom

sprappcom commented May 1, 2024

@kevburnsjr just one. Can you provide instructions on how to recover it?

Can you try deleting one of the nodes and then "recovering" it?

@kevburnsjr
Contributor

kevburnsjr commented May 1, 2024

  1. Stop the nodehost
  2. Delete all data on disk
  3. Start the nodehost

It should have a new node host id.
You may need to configure it to join the cluster.
It will then pull a snapshot from one of the other hosts and start replaying logs.
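
For illustration, a minimal sketch of steps 1-3 in Go, using a placeholder data directory and the RaftAddress from the log above; treat it as a sketch, not the exact ondisk example code.

package main

import (
	"log"

	"github.com/lni/dragonboat/v4"
	"github.com/lni/dragonboat/v4/config"
)

func main() {
	// Step 2 happened out of band: the old data directory was wiped, so this
	// NodeHost starts from an empty on-disk state and generates a new
	// NodeHost ID.
	nhc := config.NodeHostConfig{
		NodeHostDir:    "/tmp/ondisk-node-c", // placeholder, now-empty directory
		RTTMillisecond: 200,
		RaftAddress:    "localhost:63003",
	}
	nh, err := dragonboat.NewNodeHost(nhc)
	if err != nil {
		log.Fatalf("failed to start nodehost: %v", err)
	}
	defer nh.Close()
	// The replica still has to be joined back into the shard; see the
	// RequestAddReplica / StartReplica discussion later in this thread.
}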

@sprappcom

sprappcom commented May 1, 2024

@kevburnsjr are you sure? Can you show it in something like an ASCII-cast video? I think that is what I did and it's not working.

Can you show the code as well? Also, how do you configure it to join the cluster?

I've tried this before. I hope you can show a video version where we can see how it's truly done.

Please do the on-disk version.

Please show it. Thanks in advance.

@kevburnsjr
Contributor

Firstly, I would recommend that you experiment with this on a test cluster. Make sure you are simulating failures and learning the operating procedures on a cluster that doesn't contain critical data before performing the operations on a cluster containing production data.

Joining a node to a cluster is an example of a multi-node operation.

Say we have 3 nodes, A, B and C. We'll say that node C is your failed node that we are replacing.

First you need to start the host on node C.

Then you need to call NodeHost.RequestAddReplica on either node A or node B.

func (nh *NodeHost) RequestAddReplica(
	shardID uint64,
	replicaID uint64, 
	target Target, 
	configChangeIndex uint64,
	timeout time.Duration,
) (*RequestState, error)
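
For example, a hedged sketch of that call from node A or B: shard 128 comes from the log above, while the new replica ID (4), the timeout, and the assumption that the replacement gets a fresh replica ID rather than reusing 3 are all mine.

package ops // hypothetical helper package

import (
	"log"
	"time"

	"github.com/lni/dragonboat/v4"
)

// addReplacementReplica runs on a healthy node (A or B) and asks the shard to
// admit the replacement replica hosted at the given RaftAddress.
func addReplacementReplica(nh *dragonboat.NodeHost) {
	rs, err := nh.RequestAddReplica(
		128,               // shardID seen in the log above
		4,                 // replicaID for the replacement (assumed new, not reused)
		"localhost:63003", // target: the replacement NodeHost's RaftAddress
		0,                 // configChangeIndex; 0 skips the ordering check
		3*time.Second,
	)
	if err != nil {
		log.Fatalf("could not submit membership change: %v", err)
	}
	if r := <-rs.ResultC(); !r.Completed() {
		log.Fatalf("membership change did not complete")
	}
}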

Then, to join your replacement node to the existing cluster, you need to call NodeHost.StartReplica on node C with join set to true and an empty initialMembers map (iirc).

func (nh *NodeHost) StartReplica(
	initialMembers map[uint64]Target,
	join bool, 
	create sm.CreateStateMachineFunc, 
	cfg config.Config,
) error

Once you can successfully perform a NodeHost.ReadIndex from node C, you will know that your new node is active.
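
Putting those two calls together for node C, a hedged sketch: shard 128 comes from the log above, replica ID 4 matches the RequestAddReplica sketch, and the remaining config values are placeholders. My reading of the dragonboat docs is that a replica joining an existing shard passes an empty initialMembers map with join set to true; the ondisk example would use the on-disk variant, NodeHost.StartOnDiskReplica, with the same join semantics.

package ops // hypothetical helper package

import (
	"log"
	"time"

	"github.com/lni/dragonboat/v4"
	"github.com/lni/dragonboat/v4/config"
	sm "github.com/lni/dragonboat/v4/statemachine"
)

// joinReplacementReplica runs on node C after RequestAddReplica has completed
// on A or B. All config values below are placeholders.
func joinReplacementReplica(nh *dragonboat.NodeHost, create sm.CreateStateMachineFunc) {
	cfg := config.Config{
		ShardID:            128,
		ReplicaID:          4,
		ElectionRTT:        10,
		HeartbeatRTT:       1,
		CheckQuorum:        true,
		SnapshotEntries:    10,
		CompactionOverhead: 5,
	}
	// Empty initialMembers plus join=true: the replica was added via a
	// membership change and learns the current membership from the shard.
	if err := nh.StartReplica(nil, true, create, cfg); err != nil {
		log.Fatalf("failed to start replica: %v", err)
	}
	// Per the comment above, a successful ReadIndex shows the replica is active.
	rs, err := nh.ReadIndex(128, 3*time.Second)
	if err != nil {
		log.Printf("read index not submitted: %v", err)
		return
	}
	if r := <-rs.ResultC(); !r.Completed() {
		log.Printf("replica not ready yet")
	}
}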


I created Zongzi to simplify these types of multi-node Dragonboat operations using a gRPC coordination API that effectively turns multi-node operations like this into single-node operations. If this were a Zongzi cluster, all you would need to do is delete the data directory and restart the node, and this multi-node rejoin process would happen automatically.

If you're not using Zongzi, then you will likely end up reimplementing portions of it in your own project. Agent.Start is a good place to see the various node host startup scenarios.

@sprappcom

@kevburnsjr I hope you can show it in a video.

Can there be an example that does all of this automatically, so that when a node goes down, a new instance started in another directory just works without intervention?

@kevburnsjr
Contributor

@sprappcom

@lni can you help resolve this issue?

I've run the on-disk example with replicas 1, 2 and 3.

I stopped replica 3, deleted its data folder, and tried to start 3 again (in the same location or elsewhere), and I get the error above.

Is it possible to resolve this?

@lni
Owner

lni commented Jun 18, 2024

Hi,

Deleting a replica's data folder and then restarting the replica would violate the most fundamental requirements of Raft/Paxos. The reason is explained in detail in the following issue, and also covered in the doc/devops.md file.

lni/dragonboat#256

@lni lni closed this as completed Jun 18, 2024
@sprappcom

@lni I explained it here too:
lni/dragonboat#256
