server,tests: add additional lease metrics and test #18711

Open
wants to merge 1 commit into
base: main

vivekpatani
Contributor

@vivekpatani vivekpatani commented Oct 9, 2024

  • metrics to capture leases attached and detached
  • metrics to capture duration to grant, revoke, and renew leases
  • metric to capture initial lease count at startup

Help

  • Primarily I'd like to get feedback. I think these metrics can be useful, especially under heavy load.
  • I need help with the last part of the test, where I try to capture the lease count at initial startup. Please let me know if my understanding is wrong in this case. More specifically - this.
    • Currently that part of the test does not pass.
    • My understanding is that when the cluster recovers, the lease should still exist in the database and should be reflected in both the metric and the LeaseLeases response.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vivekpatani
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @vivekpatani. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter

codecov-commenter commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 68.80%. Comparing base (995027f) to head (97e634a).

Current head 97e634a differs from pull request most recent head ba35d47

Please upload reports for the commit ba35d47 to get more accurate results.

Files with missing lines | Patch % | Lines
server/lease/lessor.go | 70.00% | 3 Missing ⚠️

Additional details and impacted files
Files with missing lines | Coverage Δ
server/lease/metrics.go | 100.00% <100.00%> (ø)
server/lease/lessor.go | 88.83% <70.00%> (-0.50%) ⬇️

... and 19 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18711      +/-   ##
==========================================
+ Coverage   68.79%   68.80%   +0.01%     
==========================================
  Files         420      420              
  Lines       35523    35538      +15     
==========================================
+ Hits        24437    24453      +16     
+ Misses       9658     9655       -3     
- Partials     1428     1430       +2     

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines +53 to +65
	leaseAttached = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "etcd_debugging",
		Subsystem: "lease",
		Name:      "attach_total",
		Help:      "The number of leases that are attached to a lease item.",
	})

	leaseDetached = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "etcd_debugging",
		Subsystem: "lease",
		Name:      "detach_total",
		Help:      "The number of leases that are detached from a lease item.",
	})
Member

I feel like we can consolidate the two metrics (leaseAttached and leaseDetached) into one, something like the following:

	leaseTotalAttachedKeys = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "etcd_debugging",
		Subsystem: "lease",
		Name:      "attached_keys_total",
		Help:      "Total number of attached key for each lease",
	},
		[]string{"lease_id"},
	)

Contributor Author

I don't have enough background on cardinality, but my understanding is that adding a key multiplies the cardinalities. If that's not the case, then the idea sounds good. @ahrtr thoughts?

Member

adding a key multiplies the cardinalities

Yes, that's true. It isn't a good idea to add the label "lease_id". The official Prometheus documentation makes this clear: https://prometheus.io/docs/practices/naming/#labels
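
To make the cardinality concern concrete, here is a minimal sketch in plain Go (it does not use the Prometheus client, and the `maxSeries` helper is hypothetical). It assumes the usual rule of thumb that the upper bound on time series for one metric is the product of the number of distinct values of each label.

```go
package main

// maxSeries returns the upper bound on the number of time series for one
// metric: the product of the number of distinct values of each label.
// A metric with no labels has exactly one series.
func maxSeries(distinctValuesPerLabel ...int) int {
	total := 1
	for _, n := range distinctValuesPerLabel {
		total *= n
	}
	return total
}
```

For example, `maxSeries(3, 5)` gives 15 series for a metric with 3 methods and 5 status codes. Adding a label that takes 10000 distinct lease IDs, `maxSeries(3, 5, 10000)`, raises the bound to 150000, and lease IDs keep being created for as long as the cluster runs.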

Member

So

  • either keep it as it is (two metrics, leaseAttached and leaseDetached), but shouldn't they be counters instead of gauges?
  • or consolidate them into one as mentioned above, but without the label "lease_id". In this case, it should be Gauge for sure.
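
The two options can be sketched side by side. This is an illustration built on `sync/atomic` rather than the Prometheus client, and all type and method names are hypothetical. Two monotonically increasing counters record every attach and detach, while a single gauge tracks only the current number of attached keys.

```go
package main

import "sync/atomic"

// attachCounters models option one: two counters that only ever go up.
type attachCounters struct {
	attached int64
	detached int64
}

func (c *attachCounters) Attach(n int64) { atomic.AddInt64(&c.attached, n) }
func (c *attachCounters) Detach(n int64) { atomic.AddInt64(&c.detached, n) }

// Current derives the number of keys attached right now from the counters.
func (c *attachCounters) Current() int64 {
	return atomic.LoadInt64(&c.attached) - atomic.LoadInt64(&c.detached)
}

// attachedKeysGauge models option two: one value that moves up and down.
type attachedKeysGauge struct {
	current int64
}

func (g *attachedKeysGauge) Attach(n int64) { atomic.AddInt64(&g.current, n) }
func (g *attachedKeysGauge) Detach(n int64) { atomic.AddInt64(&g.current, -n) }
func (g *attachedKeysGauge) Value() int64   { return atomic.LoadInt64(&g.current) }
```

The gauge value can always be derived from the two counters, but the attach and detach rates cannot be recovered from the gauge alone, so the choice depends on whether churn or current state is the thing being observed.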

Member

@serathius serathius Oct 24, 2024

I have a background in metrics cardinality. A "lease_id" label is a big no-no. Please note that not everything can be expressed in metrics; they are aggregations that let you observe the state of a program. Some things, especially for debugging purposes are better expressed as events. So write logs.

Contributor Author

@vivekpatani vivekpatani Nov 4, 2024

or consolidate them into one as mentioned above, but without the label "lease_id". In this case, it should be Gauge for sure.

This seems like a good idea. Thoughts? @serathius

Some things, especially for debugging purposes are better expressed as events. So write logs

Logs are great, but it's easy to miss a trend when there are only logs. It's easier to set up a trigger on the trend of a metric than on logs, if that makes sense.

@@ -49,11 +49,88 @@ var (
// 1 second -> 3 months
Buckets: prometheus.ExponentialBuckets(1, 2, 24),
})

leaseAttached = prometheus.NewCounter(prometheus.CounterOpts{
Member

What's the purpose of the metrics about lease attachments? What are you trying to measure? number of keys per lease?

Contributor Author

number of keys per lease?

Precisely, yes.

Contributor Author

@serathius ping^

- metrics to capture leases attached and detached
- metric to capture initial lease count at startup

Signed-off-by: vivekpatani <9080894+vivekpatani@users.noreply.github.com>
Help: "Error count by type to count for lease grants.",
}, []string{"error"})

leaseRevokeError = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Member

This should be covered by RPC metrics

Contributor Author

Could you perhaps show me an example of how to get specific errors for a metric like this? @serathius

Member

Help: "The number of leases that are detached from a lease item.",
})

initLeaseCount = prometheus.NewCounter(prometheus.CounterOpts{
Member

Why do we need this metric?

Contributor Author

Replied here. We could also roll this up into another metric, but I don't want to affect the behavior of existing metrics, hence the separate metric.

@@ -821,6 +830,7 @@ func (le *lessor) initAndRecover() {
}
}
le.leaseExpiredNotifier.Init()
initLeaseCount.Add(float64(len(lpbs)))
Member

I think the init can be called multiple times during the server lifecycle.
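
If the recovery path can run more than once, the choice between accumulating and overwriting matters. The following is a minimal sketch using plain Go values in place of Prometheus metrics (the `initMetrics` type and its fields are hypothetical, not etcd code). A counter-style `Add` double-counts the recovered leases on a second pass, while a gauge-style `Set` keeps reporting the latest count.

```go
package main

// initMetrics stands in for two ways of recording the lease count seen
// during recovery. Both fields are hypothetical and exist only for
// illustration.
type initMetrics struct {
	addStyle float64 // updated like Counter.Add: accumulates across calls
	setStyle float64 // updated like Gauge.Set: overwritten on each call
}

// recordRecovered is called once per recovery pass with the number of
// leases loaded from the backend.
func (m *initMetrics) recordRecovered(leases int) {
	m.addStyle += float64(leases)
	m.setStyle = float64(leases)
}
```

After two recovery passes that each load 100 leases, `addStyle` reads 200 while `setStyle` reads 100.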

Contributor Author

My understanding, from tracing the call stack backwards, was that this is only ever invoked once per etcd member, but I might've missed something. NewServer -> NewLessor -> newLessor -> l.initAndRecover?

The intuition behind this metric is to see how many leases exist at startup. An issue recently occurred where we accumulated a lot of leases, and we wanted a number that gives us an approximate idea of how long it will take for these leases to be revoked.

@serathius

@serathius
Member

I'm really confused about the motivation behind these metrics. Instead of trying to add a metric for everything just in case, can you describe what you want to achieve? We can discuss and design a metric for each use case.

@vivekpatani
Contributor Author

Sorry for the confusion.

We observed a lot of lease churn recently, and metrics would help us get to the bottom of it.

For intuition

  • error based metrics - Trying to see what kind of errors we see when the lease churn is high, and what exactly are we hitting would be helpful.
  • leaseAttachAndDetach - Creating a lease and attaching it to a resource are separate operations; fine grained metrics are helpful to see if the lease got attached to the resource that it was intended for.
  • Duration based metrics - as you said, these are available from the existing gRPC metrics. So not needed.

I'm open to ideas on how to implement this better or how to reuse existing metrics to derive these. Thanks for taking a look @serathius.

@vivekpatani
Contributor Author

Bump @serathius or @ahrtr, thanks.^

@serathius
Member

Trying to see what kind of errors we see when the lease churn is high, and what exactly are we hitting would be helpful.

Error metrics for LeaseGrant, LeaseRevoke, and lease refresh (LeaseKeepAlive) should also be available in the QPS metrics.

fine grained metrics are helpful to see if the lease got attached to the resource that it was intended for.

How would you know that by metric?

@@ -280,10 +280,12 @@ func (le *lessor) SetCheckpointer(cp Checkpointer) {

func (le *lessor) Grant(id LeaseID, ttl int64) (*Lease, error) {
if id == NoLease {
leaseGrantError.WithLabelValues(ErrLeaseNotFound.Error()).Inc()
Member

Is ErrLeaseNotFound mapped to a proper gRPC error code? If so, then you can just check the QPS metrics for the LeaseGrant method and this specific error code.

Member

FYI.

ErrLeaseNotFound = Error(ErrGRPCLeaseNotFound)

grpc_server_handled_total{grpc_code="Aborted",grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="LeaseRevoke",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="LeaseKeepAlive",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"} 0
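
As a rough illustration of reading these samples programmatically, the sketch below sums the non-OK `grpc_server_handled_total` samples for the `etcdserverpb.Lease` service, keyed by method and code. It uses only the Go standard library and a simplified line format (it assumes label values contain no escaped quotes or closing braces, which holds for the samples above). The `errorsByMethod` helper is hypothetical.

```go
package main

import (
	"bufio"
	"regexp"
	"strconv"
	"strings"
)

var (
	// sampleRe matches one grpc_server_handled_total sample line.
	sampleRe = regexp.MustCompile(`^grpc_server_handled_total\{([^}]*)\}\s+(\S+)$`)
	// labelRe matches a single name="value" label pair.
	labelRe = regexp.MustCompile(`(\w+)="([^"]*)"`)
)

// errorsByMethod parses Prometheus text exposition output and returns the
// non-OK request counts for the Lease service, keyed by "method/code".
func errorsByMethod(exposition string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		m := sampleRe.FindStringSubmatch(strings.TrimSpace(sc.Text()))
		if m == nil {
			continue // comment, blank line, or a different metric
		}
		labels := make(map[string]string)
		for _, kv := range labelRe.FindAllStringSubmatch(m[1], -1) {
			labels[kv[1]] = kv[2]
		}
		if labels["grpc_service"] != "etcdserverpb.Lease" || labels["grpc_code"] == "OK" {
			continue
		}
		v, err := strconv.ParseFloat(m[2], 64)
		if err != nil {
			continue // skip samples with an unparsable value
		}
		out[labels["grpc_method"]+"/"+labels["grpc_code"]] += v
	}
	return out
}
```

In practice the same breakdown is usually obtained with a query in the monitoring system rather than by parsing the text output; the parser only shows which labels carry the method and the error code.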


5 participants