Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AbstractClientService::connect() need an explicit check. #930

Open
yuyang0423 opened this issue Feb 6, 2023 · 0 comments
Open

AbstractClientService::connect() need an explicit check. #930

yuyang0423 opened this issue Feb 6, 2023 · 0 comments

Comments

@yuyang0423
Copy link
Contributor

Describe the bug

In a crash scenario, 80.64 was previous leader then pull out its link interface. 80.58 become new leader during its start process.
isConnected(rc, endipoint) return true and new leader start Replicator even in a down connection.
It seems sofa-bolt didn't close the connection in this short period yet.
So the trick thing is if called getAlivePeers() through cliService will return the unreachable nodes between 2023-02-06 10:40:45 and 2023-02-06 10:40:50 because r.lastRpcSendTimestamp = Utils.monotonicMs(); was updated to now.

below is some log info

2023-02-06 10:40:29.655 [JRaft-FSMCaller-Disruptor-0] INFO  StateMachineAdapter:84 - onStopFollowing: LeaderChangeContext [leaderId=192.168.80.64:8082, term=4, status=Status[ERAFTTIMEDOUT<10001>: Lost connection from leader 192.168.80.64:8082.]].
2023-02-06 10:40:29.655 [JRaft-ElectionTimer-<election--1/192.168.80.58:8082>0] INFO  NodeImpl:2650 - Node <election--1/192.168.80.58:8082> term 4 start preVote.
2023-02-06 10:40:29.658 [Bolt-conn-event-executor-15-thread-1] INFO  ClientServiceConnectionEventProcessor:50 - Peer 192.168.80.118:8082 is connected
2023-02-06 10:40:29.663 [JRaft-RPC-Processor-4] INFO  NodeImpl:2604 - Node <election--1/192.168.80.58:8082> received PreVoteResponse from 192.168.80.118:8082, term=4, granted=true.
2023-02-06 10:40:30.145 [Rpc-netty-server-worker-10-thread-9] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-rpc-executor #11,5,main].
2023-02-06 10:40:30.145 [rheakv-raft-rpc-executor #11] INFO  NodeImpl:1697 - Node <election--1/192.168.80.58:8082> received PreVoteRequest from 192.168.80.117:8082, term=5, currTerm=4, granted=true, requestLastLogId=LogId [index=45295, term=4], lastLogId=LogId [index=45295, term=4].
2023-02-06 10:40:30.148 [JRaft-RPC-Processor-5] INFO  NodeImpl:2604 - Node <election--1/192.168.80.58:8082> received PreVoteResponse from 192.168.80.117:8082, term=4, granted=true.
2023-02-06 10:40:30.148 [JRaft-RPC-Processor-5] INFO  NodeImpl:1126 - Node <election--1/192.168.80.58:8082> start vote and grant vote self, term=4.
2023-02-06 10:40:30.154 [JRaft-RPC-Processor-5] INFO  LocalRaftMetaStorage:132 - Save raft meta, path=/tmp/election/raft_data_region_-1_8082/meta, term=5, votedFor=192.168.80.58:8082, cost time=5 ms
2023-02-06 10:40:31.156 [Rpc-netty-server-worker-10-thread-9] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-rpc-executor #12,5,main].
2023-02-06 10:40:31.156 [rheakv-raft-rpc-executor #12] INFO  NodeImpl:1768 - Node <election--1/192.168.80.58:8082> received RequestVoteRequest from 192.168.80.117:8082, term=5, currTerm=5.
2023-02-06 10:40:34.659 [JRaft-RPC-Processor-8] WARN  NodeImpl:2639 - Node <election--1/192.168.80.58:8082> PreVote to 192.168.80.64:8082 error: Status[EINTERNAL<1004>: RPC exception:Invoke timeout when invoke with callback.The address is 192.168.80.64:8082].
2023-02-06 10:40:35.158 [JRaft-RPC-Processor-9] WARN  NodeImpl:2576 - Node <election--1/192.168.80.58:8082> RequestVote to 192.168.80.64:8082 error: Status[EINTERNAL<1004>: RPC exception:Invoke timeout when invoke with callback.The address is 192.168.80.64:8082].
2023-02-06 10:40:37.127 [Rpc-netty-server-worker-10-thread-10] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-rpc-executor #13,5,main].
2023-02-06 10:40:37.128 [rheakv-raft-rpc-executor #13] INFO  NodeImpl:1697 - Node <election--1/192.168.80.58:8082> received PreVoteRequest from 192.168.80.118:8082, term=6, currTerm=5, granted=true, requestLastLogId=LogId [index=45295, term=4], lastLogId=LogId [index=45295, term=4].
2023-02-06 10:40:37.401 [JRaft-VoteTimer-<election--1/192.168.80.58:8082>0] WARN  NodeImpl:2717 - Candidate node <election--1/192.168.80.58:8082> term 5 steps down when election reaching vote timeout: fail to get quorum vote-granted.
2023-02-06 10:40:37.401 [JRaft-VoteTimer-<election--1/192.168.80.58:8082>0] INFO  NodeImpl:2650 - Node <election--1/192.168.80.58:8082> term 5 start preVote.
2023-02-06 10:40:37.403 [JRaft-RPC-Processor-10] INFO  NodeImpl:2604 - Node <election--1/192.168.80.58:8082> received PreVoteResponse from 192.168.80.118:8082, term=5, granted=true.
2023-02-06 10:40:37.863 [Rpc-netty-server-worker-10-thread-9] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-rpc-executor #14,5,main].
2023-02-06 10:40:37.863 [rheakv-raft-rpc-executor #14] INFO  NodeImpl:1697 - Node <election--1/192.168.80.58:8082> received PreVoteRequest from 192.168.80.117:8082, term=6, currTerm=5, granted=true, requestLastLogId=LogId [index=45295, term=4], lastLogId=LogId [index=45295, term=4].
2023-02-06 10:40:37.864 [JRaft-RPC-Processor-11] INFO  NodeImpl:2604 - Node <election--1/192.168.80.58:8082> received PreVoteResponse from 192.168.80.117:8082, term=5, granted=true.
2023-02-06 10:40:37.864 [JRaft-RPC-Processor-11] INFO  NodeImpl:1126 - Node <election--1/192.168.80.58:8082> start vote and grant vote self, term=5.
2023-02-06 10:40:37.871 [JRaft-RPC-Processor-11] INFO  LocalRaftMetaStorage:132 - Save raft meta, path=/tmp/election/raft_data_region_-1_8082/meta, term=6, votedFor=192.168.80.58:8082, cost time=7 ms
2023-02-06 10:40:38.870 [Rpc-netty-server-worker-10-thread-9] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-rpc-executor #15,5,main].
2023-02-06 10:40:38.870 [rheakv-raft-rpc-executor #15] INFO  NodeImpl:1768 - Node <election--1/192.168.80.58:8082> received RequestVoteRequest from 192.168.80.117:8082, term=6, currTerm=6.
2023-02-06 10:40:38.872 [Rpc-netty-server-worker-10-thread-10] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-rpc-executor #16,5,main].
2023-02-06 10:40:38.873 [rheakv-raft-rpc-executor #16] INFO  NodeImpl:1768 - Node <election--1/192.168.80.58:8082> received RequestVoteRequest from 192.168.80.118:8082, term=6, currTerm=6.
2023-02-06 10:40:42.407 [JRaft-RPC-Processor-14] WARN  NodeImpl:2639 - Node <election--1/192.168.80.58:8082> PreVote to 192.168.80.64:8082 error: Status[EINTERNAL<1004>: RPC exception:Invoke timeout when invoke with callback.The address is 192.168.80.64:8082].
2023-02-06 10:40:42.867 [JRaft-RPC-Processor-15] WARN  NodeImpl:2576 - Node <election--1/192.168.80.58:8082> RequestVote to 192.168.80.64:8082 error: Status[EINTERNAL<1004>: RPC exception:Invoke timeout when invoke with callback.The address is 192.168.80.64:8082].
2023-02-06 10:40:44.401 [JRaft-VoteTimer-<election--1/192.168.80.58:8082>0] WARN  NodeImpl:2717 - Candidate node <election--1/192.168.80.58:8082> term 6 steps down when election reaching vote timeout: fail to get quorum vote-granted.
2023-02-06 10:40:44.401 [JRaft-VoteTimer-<election--1/192.168.80.58:8082>0] INFO  NodeImpl:2650 - Node <election--1/192.168.80.58:8082> term 6 start preVote.
2023-02-06 10:40:44.404 [JRaft-RPC-Processor-16] INFO  NodeImpl:2604 - Node <election--1/192.168.80.58:8082> received PreVoteResponse from 192.168.80.117:8082, term=6, granted=true.
2023-02-06 10:40:45.091 [Rpc-netty-server-worker-10-thread-10] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-rpc-executor #17,5,main].
2023-02-06 10:40:45.091 [rheakv-raft-rpc-executor #17] INFO  NodeImpl:1697 - Node <election--1/192.168.80.58:8082> received PreVoteRequest from 192.168.80.118:8082, term=7, currTerm=6, granted=true, requestLastLogId=LogId [index=45295, term=4], lastLogId=LogId [index=45295, term=4].
2023-02-06 10:40:45.092 [JRaft-RPC-Processor-17] INFO  NodeImpl:2604 - Node <election--1/192.168.80.58:8082> received PreVoteResponse from 192.168.80.118:8082, term=6, granted=true.
2023-02-06 10:40:45.092 [JRaft-RPC-Processor-17] INFO  NodeImpl:1126 - Node <election--1/192.168.80.58:8082> start vote and grant vote self, term=6.
2023-02-06 10:40:45.098 [JRaft-RPC-Processor-17] INFO  LocalRaftMetaStorage:132 - Save raft meta, path=/tmp/election/raft_data_region_-1_8082/meta, term=7, votedFor=192.168.80.58:8082, cost time=6 ms
2023-02-06 10:40:45.821 [Rpc-netty-server-worker-10-thread-9] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-rpc-executor #18,5,main].
2023-02-06 10:40:45.822 [rheakv-raft-rpc-executor #18] INFO  NodeImpl:1697 - Node <election--1/192.168.80.58:8082> received PreVoteRequest from 192.168.80.117:8082, term=7, currTerm=7, granted=true, requestLastLogId=LogId [index=45295, term=4], lastLogId=LogId [index=45295, term=4].
2023-02-06 10:40:45.833 [JRaft-RPC-Processor-19] INFO  NodeImpl:1223 - Node <election--1/192.168.80.58:8082> become leader of group, term=7, conf=192.168.80.58:8082,192.168.80.64:8082,192.168.80.117:8082,192.168.80.118:8082, oldConf=.
2023-02-06 10:40:45.848 [JRaft-RPC-Processor-19] INFO  Replicator:911 - Replicator=Replicator [state=Created, statInfo=<running=null, firstLogIndex=0, lastLogIncluded=0, lastLogIndex=0, lastTermIncluded=0>, peerId=192.168.80.64:8082, waitId=-1, type=Follower]@192.168.80.64:8082 is started
2023-02-06 10:40:45.855 [JRaft-RPC-Processor-19] INFO  Replicator:911 - Replicator=Replicator [state=Created, statInfo=<running=null, firstLogIndex=0, lastLogIncluded=0, lastLogIndex=0, lastTermIncluded=0>, peerId=192.168.80.118:8082, waitId=-1, type=Follower]@192.168.80.118:8082 is started
2023-02-06 10:40:45.855 [JRaft-RPC-Processor-19] INFO  Replicator:911 - Replicator=Replicator [state=Created, statInfo=<running=null, firstLogIndex=0, lastLogIncluded=0, lastLogIndex=0, lastTermIncluded=0>, peerId=192.168.80.117:8082, waitId=-1, type=Follower]@192.168.80.117:8082 is started
2023-02-06 10:40:45.874 [JRaft-FSMCaller-Disruptor-0] INFO  StateMachineAdapter:62 - onLeaderStart: term=7.
2023-02-06 10:40:46.860 [Rpc-netty-server-worker-10-thread-12] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #0,5,main].
2023-02-06 10:40:46.869 [Rpc-netty-server-worker-10-thread-13] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #1,5,main].
2023-02-06 10:40:46.881 [Rpc-netty-server-worker-10-thread-12] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #2,5,main].
2023-02-06 10:40:46.892 [Rpc-netty-server-worker-10-thread-12] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #3,5,main].
2023-02-06 10:40:48.857 [Append-Entries-Thread-Send2] WARN  Replicator:1201 - Fail to issue RPC to 192.168.80.64:8082, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:Invoke timeout when invoke with callback.The address is 192.168.80.64:8082]
2023-02-06 10:40:49.407 [JRaft-RPC-Processor-20] WARN  NodeImpl:2639 - Node <election--1/192.168.80.58:8082> PreVote to 192.168.80.64:8082 error: Status[EINTERNAL<1004>: RPC exception:Invoke timeout when invoke with callback.The address is 192.168.80.64:8082].
2023-02-06 10:40:50.098 [JRaft-RPC-Processor-21] WARN  NodeImpl:2576 - Node <election--1/192.168.80.58:8082> RequestVote to 192.168.80.64:8082 error: Status[EINTERNAL<1004>: RPC exception:Invoke timeout when invoke with callback.The address is 192.168.80.64:8082].
2023-02-06 10:40:53.265 [Rpc-netty-server-worker-10-thread-13] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #4,5,main].
2023-02-06 10:40:53.268 [Rpc-netty-server-worker-10-thread-13] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #5,5,main].
2023-02-06 10:40:53.274 [Rpc-netty-server-worker-10-thread-13] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #6,5,main].
2023-02-06 10:40:59.241 [Rpc-netty-server-worker-10-thread-12] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #7,5,main].
2023-02-06 10:41:07.387 [Append-Entries-Thread-Send2] WARN  Replicator:1434 - Fail to issue RPC to 192.168.80.64:8082, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:Invoke timeout when invoke with callback.The address is 192.168.80.64:8082]
2023-02-06 10:41:08.242 [Rpc-netty-server-worker-10-thread-13] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #8,5,main].
2023-02-06 10:41:17.337 [Append-Entries-Thread-Send2] WARN  Replicator:1434 - Fail to issue RPC to 192.168.80.64:8082, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[192.168.80.64:8082] fail and try to create new one]
2023-02-06 10:41:19.841 [Append-Entries-Thread-Send2] WARN  Replicator:1434 - Fail to issue RPC to 192.168.80.64:8082, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: Check connection[192.168.80.64:8082] fail and try to create new one]
2023-02-06 10:41:22.343 [Append-Entries-Thread-Send2] WARN  Replicator:1434 - Fail to issue RPC to 192.168.80.64:8082, consecutiveErrorTimes=41, error=Status[EINTERNAL<1004>: Check connection[192.168.80.64:8082] fail and try to create new one]
2023-02-06 10:41:24.846 [Append-Entries-Thread-Send2] WARN  Replicator:1434 - Fail to issue RPC to 192.168.80.64:8082, consecutiveErrorTimes=51, error=Status[EINTERNAL<1004>: Check connection[192.168.80.64:8082] fail and try to create new one]
2023-02-06 10:41:27.347 [Append-Entries-Thread-Send2] WARN  Replicator:1434 - Fail to issue RPC to 192.168.80.64:8082, consecutiveErrorTimes=61, error=Status[EINTERNAL<1004>: Check connection[192.168.80.64:8082] fail and try to create new one]

Expected behavior

Original connection is exists, but not able to use.
Do an explicit connection ping-pong check to make sure the connection is works fine.

diff --git a/jraft-core/src/main/java/com/alipay/sofa/jraft/rpc/impl/AbstractClientService.java b/jraft-core/src/main/java/com/alipay/sofa/jraft/rpc/impl/AbstractClientService.java
index fec82f7..9166db1 100644
--- a/jraft-core/src/main/java/com/alipay/sofa/jraft/rpc/impl/AbstractClientService.java
+++ b/jraft-core/src/main/java/com/alipay/sofa/jraft/rpc/impl/AbstractClientService.java
@@ -139,9 +139,10 @@ public abstract class AbstractClientService implements ClientService {
         if (rc == null) {
             throw new IllegalStateException("Client service is uninitialized.");
         }
-        if (isConnected(rc, endpoint)) {
-            return true;
-        }
+        isConnected(rc, endpoint);
         try {
             final PingRequest req = PingRequest.newBuilder() //
                 .setSendTimestamp(System.currentTimeMillis()) //

Actual behavior

trusted an unreachable connection.

Steps to reproduce

hard to reproduce.

Minimal yet complete reproducer code (or GitHub URL to code)

Environment

  • SOFAJRaft version: 1.3.11
  • JVM version (e.g. java -version):
openjdk version "1.8.0_262"
OpenJDK Runtime Environment (build 1.8.0_262-b10)
OpenJDK 64-Bit Server VM (build 25.262-b10, mixed mode)
  • OS version (e.g. uname -a):
    Linux uni58 5.4.42-200.el7.x86_64 #1 SMP Fri May 22 14:46:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Maven version:
  • IDE version:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant