
Specifying a revision for a range request in a transaction may cause data inconsistency #18667

Open
ahrtr opened this issue Oct 2, 2024 · 21 comments
Labels
backport/v3.4 backport/v3.5 priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/bug

Comments

@ahrtr
Member

ahrtr commented Oct 2, 2024

What happened?

Specifying a revision for a range request inside a transaction may cause data inconsistency: clients may get different values for the same key from different endpoints.

How can we reproduce it?

  • Step 1: start a brand-new 3-member cluster
  • Step 2: Execute for i in {1..20}; do ./etcdctl put k$i v$i; done
  • Step 3: Execute ./etcdctl compact 21
  • Step 4: Execute ./etcdctl txn --interactive
compares:
value("k1") = "v1"

success requests (get, put, del):
put k2 foo
get k1 --rev=10

failure requests (get, put, del):

The client will get an "etcdserver: mvcc: required revision has been compacted" error.

  • Step 5: Execute ./etcdctl get k2 against different endpoints; you will get different values:
$ ./etcdctl --endpoints=127.0.0.1:2379 get k2
k2
v2
$ ./etcdctl --endpoints=127.0.0.1:22379 get k2
k2
foo
$ ./etcdctl --endpoints=127.0.0.1:32379 get k2
k2
foo
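For reference, here is the same transaction expressed with clientv3; a minimal sketch following the steps above (the endpoint and keys are the ones used in the reproduction):

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    // Same TXN as the etcdctl steps: a put plus a ranged get at a
    // compacted revision, both in the success branch.
    _, err = cli.Txn(context.Background()).
        If(clientv3.Compare(clientv3.Value("k1"), "=", "v1")).
        Then(
            clientv3.OpPut("k2", "foo"),
            clientv3.OpGet("k1", clientv3.WithRev(10)), // revision 10 was compacted in step 3
        ).
        Commit()
    fmt.Println(err) // etcdserver: mvcc: required revision has been compacted
}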

Root cause

The root cause is that each etcd member strips range requests from a TXN when the requesting client isn't connected to it.

needResult := s.w.IsRegistered(id)
if needResult || !noSideEffect(&raftReq) {
    // Members that don't need to return a result to the client strip
    // the read-only range requests from the TXN before applying it.
    if !needResult && raftReq.Txn != nil {
        removeNeedlessRangeReqs(raftReq.Txn)
    }

For example, if the client connects to member 1, then etcdserver removes the range request (get k1 --rev=10 in the example above) from the TXN on members 2 and 3. Accordingly, the apply fails on member 1 because checkRange rejects the request (revision 10 is below the compacted revision 21, so it returns mvcc.ErrCompacted), while members 2 and 3 apply the TXN successfully because the range request was removed. Eventually different members end up with different data.

func checkRange(rv mvcc.ReadView, req *pb.RangeRequest) error {
    switch {
    case req.Revision == 0:
        return nil
    case req.Revision > rv.Rev():
        return mvcc.ErrFutureRev
    case req.Revision < rv.FirstRev():
        return mvcc.ErrCompacted
    }
    return nil
}

Solution

The simplest solution is to not remove range requests from the TXN on any member. The side effect is that the other members (which the client isn't connected to) would execute unnecessary range operations.

To resolve that side effect, the members the client isn't connected to would not execute the range requests; instead they would only validate them, so that all members always perform consistent validation.
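To make the invariant concrete, here is a self-contained toy model (not etcd code; all names are made up) showing why running identical validation on every member keeps them consistent even when the read itself happens on only one of them:

package main

import (
    "errors"
    "fmt"
)

var errCompacted = errors.New("mvcc: required revision has been compacted")

// rangeReq models a read at a fixed revision inside a TXN.
type rangeReq struct{ rev int64 }

// member models one etcd member's applied state.
type member struct {
    firstRev int64 // oldest revision still available after compaction
    kv       map[string]string
}

// applyTxn validates every range request on every member; only the member
// serving the client would actually execute the reads. Because validation
// is identical everywhere, the put applies on all members or on none.
func (m *member) applyTxn(reads []rangeReq, putKey, putVal string) error {
    for _, r := range reads {
        if r.rev != 0 && r.rev < m.firstRev {
            return errCompacted // deterministic outcome on every member
        }
    }
    m.kv[putKey] = putVal
    return nil
}

func main() {
    cluster := []*member{
        {firstRev: 21, kv: map[string]string{"k2": "v2"}},
        {firstRev: 21, kv: map[string]string{"k2": "v2"}},
        {firstRev: 21, kv: map[string]string{"k2": "v2"}},
    }
    for i, m := range cluster {
        err := m.applyTxn([]rangeReq{{rev: 10}}, "k2", "foo")
        fmt.Printf("member %d: err=%v, k2=%s\n", i+1, err, m.kv["k2"])
    }
    // Every member rejects the TXN with the compaction error and keeps k2=v2.
}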

Workaround

If only one member is inconsistent, just replace it. See this guide. After removing the member, delete its data before re-adding it.

If all members are inconsistent, things get trickier. You'll need to pick one member as the source of truth, force-create a single-member cluster from it, and then re-add the other members (don't forget to clear their data first).
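For reference, replacing a single bad member could look like the following; a sketch with placeholder member IDs, names, paths, and URLs:

# 1. Find and remove the inconsistent member (the ID below is a placeholder).
$ ./etcdctl member list
$ ./etcdctl member remove 8211f1d0f64f3269

# 2. Wipe the removed member's data directory (path is a placeholder).
$ rm -rf /var/lib/etcd/member

# 3. Re-add it and start it as a fresh member joining the existing cluster.
$ ./etcdctl member add infra1 --peer-urls=http://127.0.0.1:2380
$ etcd --name infra1 --initial-cluster-state existing ...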

Impact

All versions (including 3.4.x, 3.5.x and main) are affected.

  • Just searched the Kubernetes repo and confirmed that Kubernetes doesn't specify a revision for range requests in a TXN, so Kubernetes isn't affected.
  • For non-Kubernetes usage... it's uncommon for a real-world product to use the feature this way, but I'm not entirely sure it won't be.
@ahrtr ahrtr added type/bug priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 2, 2024
@ahrtr ahrtr pinned this issue Oct 2, 2024
@ahrtr
Member Author

ahrtr commented Oct 3, 2024

The simplest solution is that we never remove range requests from the TXN; see ahrtr@76ac23f.

  • The upside: it's simple.
  • The side effect: the members the client isn't connected to will execute some unnecessary range requests.

@wenjiaswe
Contributor

@ArkaSaha30 Thanks for helping out!

@wenjiaswe
Contributor

cc @ivanvc as well, as you have some context with performance tooling.

@shyamjvs
Contributor

shyamjvs commented Oct 3, 2024

The simplest solution is that we never remove range request from TXN

Sorry if I missed this part during the call. Is there an option 2 we've already excluded here, which is to treat working-as-intended range responses (like mvcc: required revision has been compacted) in the success/failure ops not as an error that requires the Tx to roll back? It would change the semantics for existing clients, but those seem to be broken anyway (in a worse way) as shown by this bug, iiuc.

@ahrtr
Member Author

ahrtr commented Oct 3, 2024

Rough steps as mentioned in the community meeting,

  • create an e2e or integration test to reproduce the issue.
  • apply the patch (ahrtr@76ac23f) to ensure it fixes the issue.
  • evaluate the performance impact.

cc @ArkaSaha30 @ivanvc

The other solution mentioned in the meeting is to keep the behaviour unchanged:

  • Ensure all members execute the same validation.
  • But only the member that the client connects to executes the range requests in a TXN.

But this solution would reduce readability and complicate the etcdserver a lot. So if the performance impact of the first solution is minimal, we should NOT go for solution 2.

@shyamjvs
Contributor

shyamjvs commented Oct 3, 2024

@ahrtr @serathius say even with the fix to execute the range request on all members, wdyt about the same issue happening when (for lack of a better word) non-deterministic failures lead to one node failing the txn.Range() while the other doesn't?

@ahrtr
Member Author

ahrtr commented Oct 3, 2024

@ahrtr @serathius say even with the fix to execute the range request on all members, wdyt about the same issue happening when (for lack of a better word) non-deterministic failures lead to one node failing the txn.Range() while the other doesn't?

It's a generic "problem", not specific to this issue.

If a node somehow fails to apply entries due to an environment issue (e.g. OOM), then etcdserver crashes. When it gets started again, it will re-apply the entries.

@shyamjvs
Contributor

shyamjvs commented Oct 3, 2024

So far the only vector I found for what I was saying above to happen is if the Tx context was cancelled -

case <-ctx.Done():
    return nil, fmt.Errorf("rangeKeys: context cancelled: %w", ctx.Err())

But fortunately (or maybe intentionally) the context that gets piped down to it from Apply is a context.TODO() that should never expire:

return a.applyV3.Apply(context.TODO(), r, a.dispatch)

We probably need to add a big bold note there saying it's risky (wrt data consistency) to change that context to anything finite (I'll send out a PR). But do the overarching risk and #18667 (comment) feel worth discussing to y'all?
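Such a note could be as simple as a comment next to that call (hypothetical wording, not the actual PR):

// WARNING: this context must stay non-cancellable (no deadline, no cancel).
// The apply path must behave deterministically on every member; a context
// that expires on one member but not another can diverge the applied state.
// See https://github.com/etcd-io/etcd/issues/18667.
return a.applyV3.Apply(context.TODO(), r, a.dispatch)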

@shyamjvs
Contributor

shyamjvs commented Oct 3, 2024

When it gets started again, it will re-apply the entries.

This part makes sense when the failure happens in the transaction-validation phase (CI isn't incremented). But cases like the one you found would be more dangerous when CI is incremented but the entry is applied asymmetrically across nodes. Lemme try digging thru the code to see if there are any potential risks there (hopefully none).

@serathius
Member

So far the only vector I found for what I was saying above to happen is if the Tx context was cancelled -

This is a great point about how context can introduce non-determinism into the apply loop, which is highly undesirable. Since the apply loop forms the core of the replicated state machine, any inconsistency caused by context cancellation at different times could lead to significant issues.

Using context here seems unnecessary and increases the risk of compromising transactionality. From what I understand, it's primarily used for passing call-stack metadata like traces and authorization information. We can definitely find alternative ways to achieve this without relying on context.

To mitigate the risk, I strongly recommend removing context entirely from the apply-loop code. Since the code within the loop should not involve any non-deterministic asynchronous calls, we can safely eliminate context. We can then re-implement trace logging without relying on it.
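A minimal sketch of what trace logging without context could look like, passing the *traceutil.Trace as an explicit argument (rangeKeys here is a hypothetical stand-in for an apply-path function, not a proposed patch):

package main

import (
    "go.etcd.io/etcd/pkg/v3/traceutil"
    "go.uber.org/zap"
)

// rangeKeys receives the trace explicitly, so there is no context whose
// cancellation could make one member bail out mid-apply.
func rangeKeys(trace *traceutil.Trace, key string) string {
    trace.Step("range keys")
    return "value-of-" + key // deterministic work only
}

func main() {
    lg, _ := zap.NewDevelopment()
    trace := traceutil.New("apply-range", lg)
    _ = rangeKeys(trace, "k1")
    trace.Log()
}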

@ahrtr
Member Author

ahrtr commented Oct 4, 2024

To mitigate the risk, I strongly recommend removing context entirely from the apply loop code.

Agree with this in principle. Clearly, it isn't a minor change. So let's get the low-hanging fix #18674 approved & merged first. Afterwards, we can revisit removing the context without rushing.

@shyamjvs
Contributor

shyamjvs commented Oct 4, 2024

@serathius @ahrtr the idea of removing context sure seems tempting. Should we discuss a few things first:

  • Given context is the canonical Go way of passing down metadata, are we sure we won't need it long-term? There are ways to use context in a non-cancellable fashion (see WithoutCancel; there's a sketch after this list).
  • Do we foresee any cases where we do want a timeout (e.g. read-only txns, which should be safe to time out)? Also, in case of server shutdown, we may want to cleanly stop an apply (without incrementing CI) when it's safe to do so (e.g. if we're executing the "check" phase of a write txn).
  • Will context be helpful in writing a certain class of robustness tests? E.g. say we want to test that entry X+1 isn't applied when entry X fails to apply (we could use context to trigger this).
  • Are there any safe alternatives to solve the overarching problem here, i.e. preventing the server from moving on when an apply fails? Ideally we want the server to halt/crash and not increment the CI at all (making it safe even across restarts).
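For reference, a small sketch of the WithoutCancel behaviour mentioned in the first bullet (Go 1.21+): values flow through, cancellation does not.

package main

import (
    "context"
    "fmt"
)

type ctxKey string

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    ctx = context.WithValue(ctx, ctxKey("user"), "alice")
    cancel()

    // WithoutCancel keeps the parent's values but detaches cancellation,
    // so metadata survives while nothing downstream can be interrupted.
    detached := context.WithoutCancel(ctx)
    fmt.Println(ctx.Err())                      // context canceled
    fmt.Println(detached.Err())                 // <nil>
    fmt.Println(detached.Value(ctxKey("user"))) // alice
}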

Also happy to discuss this over our next biweekly call, if you prefer.

@serathius
Member

Thanks for the comment, my thoughts:

  • Metadata is passed, but only through the lower levels of mvcc; it's not really used throughout most of the apply code. Let's not try to predict what we'll need in the future; let's focus on current needs. It's not like we can't revert the decision later.
  • Never; we will never need a timeout. Like I mentioned above, we cannot use a timeout because it introduces a dependency on local node performance into code that needs to be deterministic. If the apply loop is stuck, we should crash, not time out.
  • No, we have the gofail library to inject failures at exact code locations without context.
  • I don't think this is a problem at all. The server should move on if the read fails, because the read failed since the revision it tried to read was no longer available. There is no other place to validate the revision used here. Incrementing the CI was correct here; the fact that we executed the TXN in an inconsistent way is the issue.

@shyamjvs
Contributor

shyamjvs commented Oct 5, 2024

Thanks @serathius for laying out that reasoning.

Never, we will never need a timeout

Wdyt about read-only txns (which can go through the apply layer today, iiuc)? Those we might want to time out if they take too long, instead of blocking subsequent applies.

it's not like we cannot revert the decision in the future

+1 to this. The part I wasn't fully sure about is how costly/complex will it be to add that back. Seems not too bad from the size of your PRs :)

Incrementing the CI was correct here; the fact that we executed the TXN in an inconsistent way is the issue.

Agreed - and the plan laid out above makes full sense to fix this particular issue. Zooming out a bit though, I'm wondering if we need one more invariant besides "txn execution symmetry", namely "txn side-effect symmetry". FMU the former doesn't necessarily guarantee the latter, potentially because of non-determinism (your context-removal change is def going to improve things there, but we might have more risks). Since this is a different issue from the original one, I've opened #18679 to discuss more.

@serathius
Member

serathius commented Oct 5, 2024

Wdyt about read-only txns (which can go through the apply layer today iiuc)? Those we might want to timeout if taking too long instead of blocking the subsequent applies

If I remember correctly, read-only TXNs don't go through apply.

if txn.IsTxnReadonly(r) {
    trace := traceutil.New("transaction",
        s.Logger(),
        traceutil.Field{Key: "read_only", Value: true},
    )
    ctx = context.WithValue(ctx, traceutil.TraceKey{}, trace)
    if !txn.IsTxnSerializable(r) {
        err := s.linearizableReadNotify(ctx)
        trace.Step("agreement among raft nodes before linearized reading")
        if err != nil {
            return nil, err
        }
    }
    var resp *pb.TxnResponse
    var err error
    chk := func(ai *auth.AuthInfo) error {
        return txn.CheckTxnAuth(s.authStore, ai, r)
    }
    defer func(start time.Time) {
        txn.WarnOfExpensiveReadOnlyTxnRequest(s.Logger(), s.Cfg.WarningApplyDuration, start, r, resp, err)
        trace.LogIfLong(traceThreshold)
    }(time.Now())
    get := func() {
        resp, _, err = txn.Txn(ctx, s.Logger(), r, s.Cfg.ExperimentalTxnModeWriteWithSharedBuffer, s.KV(), s.lessor)
    }
    if serr := s.doSerialize(ctx, chk, get); serr != nil {
        return nil, serr
    }
    return resp, err
}

+1 to this. The part I wasn't fully sure about is how costly/complex will it be to add that back. Seems not too bad from the size of your PRs :)

Yeah, I know the code and re-read it before.

your context removal change is def going to improve there, but we might have more risks.

Yes, if I could, I would rewrite the apply code to ensure clear control of side effects. However, that would require a large investment in testing and wouldn't give a large payout, as the apply code is not frequently changed; most etcd correctness issues now come from versions before v3.4. Rewriting it might resurface old bugs and also carries a high risk of introducing new ones. That's why I think robustness testing is the better strategy: we need to build enough trust in testing that no matter how new a contributor is, they can propose improvements to any part of the etcd codebase and we can catch any concurrency bugs.

@shyamjvs
Contributor

shyamjvs commented Oct 7, 2024

Read-only TXNs don't go through apply.

Thanks for the code ref, that makes sense. IIUC there can be a case where a txn isn't classified as read-only (say there's a write in one branch), but the compare outcome means only a read-only branch actually executes. One good thing here is we should be able to know this before executing the txn itself, since we know the compare result and traverse the txn tree in the check phase. I can make a change to improve this behavior, but it's a case to keep in mind for timeouts. Wdyt?
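For illustration, a clientv3 txn of that shape; a sketch, with cli being an already-connected *clientv3.Client:

import (
    "context"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// A static check would classify this txn as read/write because of the put
// in the Else branch, yet when the compare succeeds only reads execute.
func condReadOnlyTxn(ctx context.Context, cli *clientv3.Client) error {
    _, err := cli.Txn(ctx).
        If(clientv3.Compare(clientv3.Value("k1"), "=", "v1")).
        Then(clientv3.OpGet("k1")).       // taken branch: reads only
        Else(clientv3.OpPut("k1", "v1")). // the write lives in the untaken branch
        Commit()
    return err
}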

@serathius
Member

I think #18749 should resolve the issue, but we are missing an e2e regression test. Anyone interested in implementing one?

@ahrtr
Member Author

ahrtr commented Oct 24, 2024

I think #18749 should resolve the issue,

#18749 just fixed a separate but related potential issue caught by @shyamjvs. This issue isn't fixed yet; we still need to apply a patch along the lines of ahrtr@76ac23f.

Pls refer to #18667 (comment). Also, @ArkaSaha30 is working on the e2e test.

@ahrtr
Member Author

ahrtr commented Oct 24, 2024

/assign @ArkaSaha30

@ahrtr
Member Author

ahrtr commented Nov 27, 2024

@ArkaSaha30 are you still working on this? I know that you are still working on etcd-operator. If you don't have enough bandwidth, would you mind if we assign this to other contributors?

@ArkaSaha30
Contributor

@ArkaSaha30 are you still working on this? I know that you are still working on etcd-operator. If you don't have enough bandwidth, would you mind if we assign this to other contributors?

Sure @ahrtr, it is taking me more time than expected to get familiar with writing the e2e tests. If any contributor is interested and familiar with the e2e tests, this issue can be assigned to them. Sorry for blocking this issue.
