It doesn't seem possible to simulate a failed node with the ondisk example #34
Comments
This looks like a scenario where the node lost commits at index 4 and 5 and can't recover because the log has already been compacted. I believe the correct operation here is to replace the node rather than trying to recover it. Does this happen on all nodes or just this one?
@kevburnsjr just one. Can you provide instructions on how to recover it? Can you try deleting one of the nodes and "recovering" it?
It should have a new node host ID.
@kevburnsjr are you sure? Can you show it, for example as an ASCII video? I think that is what I did and it's not working. Can you show the code in the video as well? Also, how do you configure the node to join the cluster? I've tried before. I hope you can show a video where we can see how it's truly done. Please show the on-disk version. Thanks in advance.
Firstly, I would recommend that you experiment with this on a test cluster. Make sure you are simulating failures and learning the operating procedures on a cluster that doesn't contain critical data before performing the operations on a cluster containing production data.

Joining a node to a cluster is an example of a multi-node operation.

Say we have 3 nodes,

First you need to start the host on node

Then you need to call NodeHost.RequestAddReplica on either node

    func (nh *NodeHost) RequestAddReplica(
        shardID uint64,
        replicaID uint64,
        target Target,
        configChangeIndex uint64,
        timeout time.Duration,
    ) (*RequestState, error)

Then, to join your replacement node to the existing cluster, you need to call NodeHost.StartReplica on node

    func (nh *NodeHost) StartReplica(
        initialMembers map[uint64]Target,
        join bool,
        create sm.CreateStateMachineFunc,
        cfg config.Config,
    ) error

Once you can successfully perform a NodeHost.ReadIndex from node

I created Zongzi to simplify these types of multi-node Dragonboat operations using a GRPC coordination API that effectively turns multi-node operations like this into single-node operations. If this were a Zongzi cluster, all you would need to do is delete the data directory and restart the node, and this multi-node rejoin process would happen automatically.

If you're not using Zongzi, then you will likely end up reimplementing portions of it in your own project. Agent.Start is a good place to see the various node host startup scenarios.
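The ordering described in this comment can be sketched roughly as follows. This is only an illustration of the sequence, not Dragonboat code: `fakeCluster` and `rejoin` are hypothetical stand-ins whose methods loosely mirror the shape of the `RequestAddReplica` and `StartReplica` signatures quoted above, with the state machine factory, config, config change index, and `RequestState` left out. It assumes the replacement is given a fresh replica ID and is started with `join` set to true and an empty initial member map. The shard ID and addresses are made up.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// fakeCluster is a hypothetical in-memory stand-in for one shard.
// It only tracks membership and is not part of Dragonboat.
type fakeCluster struct {
	members map[uint64]string // replicaID -> target address
	started map[uint64]bool   // replicas started in this sketch
}

// RequestAddReplica mimics the shape of NodeHost.RequestAddReplica:
// a surviving member records the new replica in the membership.
func (c *fakeCluster) RequestAddReplica(shardID, replicaID uint64, target string, timeout time.Duration) error {
	if _, ok := c.members[replicaID]; ok {
		return fmt.Errorf("replica %d is already a member", replicaID)
	}
	c.members[replicaID] = target
	return nil
}

// StartReplica mimics the shape of NodeHost.StartReplica on the new host.
// Only the join path is modelled here.
func (c *fakeCluster) StartReplica(initialMembers map[uint64]string, join bool, replicaID uint64) error {
	if !join {
		return errors.New("this sketch only models joining an existing shard")
	}
	if len(initialMembers) != 0 {
		return errors.New("initialMembers must be empty when joining")
	}
	if _, ok := c.members[replicaID]; !ok {
		return errors.New("replica has not been added to the shard yet")
	}
	c.started[replicaID] = true
	return nil
}

// rejoin runs the two steps in order: add the new replica through an
// existing member, then start it on the replacement host with join=true.
func rejoin(c *fakeCluster, shardID, newID uint64, target string) error {
	if err := c.RequestAddReplica(shardID, newID, target, 3*time.Second); err != nil {
		return err
	}
	return c.StartReplica(map[uint64]string{}, true, newID)
}

func main() {
	// Replica 3 has failed but is still listed as a member.
	c := &fakeCluster{
		members: map[uint64]string{1: "localhost:63001", 2: "localhost:63002", 3: "localhost:63003"},
		started: map[uint64]bool{1: true, 2: true},
	}
	fmt.Println("join as replica 4:", rejoin(c, 128, 4, "localhost:63004"))
	fmt.Println("join as replica 3:", rejoin(c, 128, 3, "localhost:63003"))
}
```

In this toy model, joining under the old replica ID is rejected while joining under a new one succeeds, which matches the advice earlier in the thread to replace the node instead of trying to recover it.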
@kevburnsjr I hope you can show it on video. Can there be an example that does all of this automatically? When a node goes down, a new instance started in another directory should work without intervention.
@lni can you help resolve this issue? I've run the on-disk example as replicas 1, 2, and 3, stopped 3, deleted 3's data folder, and tried to start 3 again, in the same location or otherwise, and I get the error above. Is it possible to resolve this?
Hi, deleting a replica's data folder and then restarting the replica would violate the most fundamental requirements of Raft/Paxos. The reason is explained in detail in the following issue. It is also explained in the doc/devops.md file.
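The linked issue is not reproduced here. As one standard illustration of why Raft depends on persisted state, the sketch below shows a replica that forgets its term and vote after its data is wiped and then votes twice in the same term. The `voter` type, `doubleVote` helper, term number, and candidate names are all hypothetical and written only for this example. They are not taken from Dragonboat, and this may not be the exact argument the referenced issue makes.

```go
package main

import "fmt"

// voter models only the part of a Raft replica that must survive
// restarts: the current term and the vote cast in that term.
type voter struct {
	term     uint64
	votedFor string // empty means no vote cast in this term
}

// requestVote grants at most one vote per term, as Raft requires.
func (v *voter) requestVote(term uint64, candidate string) bool {
	if term < v.term {
		return false
	}
	if term > v.term {
		v.term, v.votedFor = term, ""
	}
	if v.votedFor == "" || v.votedFor == candidate {
		v.votedFor = candidate
		return true
	}
	return false
}

// doubleVote reports whether candidates A and B can both obtain this
// replica's vote in term 5. When wipe is true, the replica's state is
// discarded between the two requests, as if its data folder was deleted.
func doubleVote(wipe bool) bool {
	v := &voter{}
	first := v.requestVote(5, "A")
	if wipe {
		v = &voter{} // term and vote are forgotten
	}
	second := v.requestVote(5, "B")
	return first && second
}

func main() {
	fmt.Println("double vote with state kept:", doubleVote(false))
	fmt.Println("double vote after wipe:", doubleVote(true))
}
```

In a 3-replica shard, a second vote like this could let A and B each reach a majority in the same term, which is the kind of outcome the single-vote rule exists to prevent.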
@lni I explained here too:
I've tried running the ondisk replica ID 3 in a new directory and it doesn't work.
When it is run again in a new folder, the error appears.