server,tests: add additional lease metrics and test #18711
base: main
Codecov Report
... and 19 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18711      +/-   ##
==========================================
+ Coverage   68.79%   68.80%   +0.01%
==========================================
  Files         420      420
  Lines       35523    35538      +15
==========================================
+ Hits        24437    24453      +16
+ Misses       9658     9655       -3
- Partials     1428     1430       +2
	leaseAttached = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "etcd_debugging",
		Subsystem: "lease",
		Name:      "attach_total",
		Help:      "The number of leases that are attached to a lease item.",
	})

	leaseDetached = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "etcd_debugging",
		Subsystem: "lease",
		Name:      "detach_total",
		Help:      "The number of leases that are detached from a lease item.",
	})
I feel like we can consolidate the two metrics (leaseAttached and leaseDetached) into one, something like below:
leaseTotalAttachedKeys = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Namespace: "etcd_debugging",
Subsystem: "lease",
Name: "attached_keys_total",
Help: "Total number of attached keys for each lease",
},
[]string{"lease_id"},
)
I don't have much background on cardinality, but my understanding is that adding a key multiplies the cardinalities. If it doesn't, then the idea sounds good. @ahrtr, thoughts?
> adding a key multiplies the cardinalities

Yes, that's true. It isn't a good idea to add a "lease_id" label. The official Prometheus documentation explains this clearly: https://prometheus.io/docs/practices/naming/#labels
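To make the cardinality concern concrete, here is a minimal sketch using only the standard library. It computes an upper bound on the number of time series a metric can produce, which is the product of the distinct values of each label. The function name `seriesCount` and the lease count are hypothetical, chosen for illustration only.

```go
package main

import "fmt"

// seriesCount returns an upper bound on how many distinct time series a
// metric can produce, given the number of distinct values each label can
// take. Every extra label multiplies the total, which is why a label with
// unbounded values (such as a per-lease ID) is expensive.
func seriesCount(distinctValuesPerLabel map[string]int) int {
	total := 1
	for _, n := range distinctValuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical numbers, for illustration only.
	unlabeled := seriesCount(map[string]int{})
	perLease := seriesCount(map[string]int{"lease_id": 100000})

	fmt.Println("series without labels:", unlabeled)
	fmt.Println("series with lease_id: ", perLease)
}
```

An unlabeled metric stays at a single series no matter how many leases exist, whereas a `lease_id` label grows with the lease count.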
So:
- either keep it as is (two metrics, leaseAttached and leaseDetached), but shouldn't they be counters instead of gauges?
- or consolidate them into one as mentioned above, but without the label "lease_id". In this case, it should be Gauge for sure.
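The two options can be compared with a small standard-library sketch. `leaseStats` and its fields are hypothetical stand-ins, not etcd code: two counter-style totals that only grow, and one gauge-style value that moves both ways. With the Prometheus Go client, the gauge variant would presumably be a single unlabeled `prometheus.Gauge` updated with `Inc`/`Dec` or `Add`/`Sub`.

```go
package main

import "fmt"

// leaseStats is a hypothetical stand-in for the metrics under discussion.
// It is not safe for concurrent use; a real metric type would be.
type leaseStats struct {
	attachTotal  int64 // counter-style: only ever increases
	detachTotal  int64 // counter-style: only ever increases
	attachedKeys int64 // gauge-style: keys currently attached, across all leases
}

// attach records n keys being attached to some lease.
func (s *leaseStats) attach(n int64) {
	s.attachTotal += n
	s.attachedKeys += n
}

// detach records n keys being detached from some lease.
func (s *leaseStats) detach(n int64) {
	s.detachTotal += n
	s.attachedKeys -= n
}

func main() {
	var s leaseStats
	s.attach(5)
	s.detach(2)

	// The gauge always equals the difference of the two counters.
	fmt.Println("attach_total: ", s.attachTotal)
	fmt.Println("detach_total: ", s.detachTotal)
	fmt.Println("attached_keys:", s.attachedKeys)
}
```

The counters keep the attach and detach rates separately observable, while the gauge directly answers "how many keys are attached right now". Neither form can break the number down per lease.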
I have a background in metrics cardinality. label_id is a big no-no. Please note that not everything can be expressed in metrics; they are aggregations that allow you to observe the state of a program. Some things, especially for debugging purposes are better expressed as events. So write logs.
> or consolidate them into one as mentioned above, but without the label "lease_id". In this case, it should be Gauge for sure.

This seems like a good idea. Thoughts, @serathius?

> Some things, especially for debugging purposes are better expressed as events. So write logs

Logs are great, but it's easy to miss a trend when all you have is logs. It's easier to set up a trigger on the trend of a metric than on logs, if that makes sense.
@@ -49,11 +49,88 @@ var (
		// 1 second -> 3 months
		Buckets: prometheus.ExponentialBuckets(1, 2, 24),
	})

	leaseAttached = prometheus.NewCounter(prometheus.CounterOpts{
What's the purpose of the metrics about lease attachments? What are you trying to measure: the number of keys per lease?
> number of keys per lease?

Precisely, yes.
@serathius ping^
> number of keys per lease?

It can only report the total number of keys attached across all leases, because there is no lease_id label.
- metrics to capture leases attached and detached
- metric to capture initial lease count at startup

Signed-off-by: vivekpatani <9080894+vivekpatani@users.noreply.github.com>
		Help: "Error count by type to count for lease grants.",
	}, []string{"error"})

	leaseRevokeError = prometheus.NewGaugeVec(prometheus.GaugeOpts{
This should be covered by RPC metrics
Could you perhaps show me an example of how to get specific errors for a metric like this? @serathius
		Help: "The number of leases that are detached from a lease item.",
	})

	initLeaseCount = prometheus.NewCounter(prometheus.CounterOpts{
Why do we need this metric?
Replied here. We could also roll this up into another metric, but I don't want to affect the behavior of existing metrics, hence the separate metric.
@@ -821,6 +830,7 @@ func (le *lessor) initAndRecover() {
		}
	}
	le.leaseExpiredNotifier.Init()
	initLeaseCount.Add(float64(len(lpbs)))
I think the init can be called multiple times during server lifecycle.
My understanding, from walking the call stack in reverse, was that this is only ever invoked once per etcd member, but I might have missed something: NewServer -> NewLessor -> newLessor -> l.initAndRecover?
The intuition behind this metric is to see how many leases exist at startup. We recently hit an issue where we had accumulated a lot of leases, and we wanted a number that would give us an approximate idea of how long it would take for those leases to be revoked.
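Whether the initialization path runs once or several times matters for how the startup count is recorded. The sketch below uses only the standard library, and `recoveredLeases` is a hypothetical type, not etcd code. It contrasts accumulating the count (what `Add` on a counter does) with overwriting it (what `Set` on a gauge does) when the recording runs twice.

```go
package main

import "fmt"

// recoveredLeases is a hypothetical stand-in that tracks the same value
// with two update styles so they can be compared.
type recoveredLeases struct {
	added float64 // updated like a counter's Add: accumulates across calls
	set   float64 // updated like a gauge's Set: last write wins
}

// record simulates one pass of the init/recover path finding n leases.
func (r *recoveredLeases) record(n int) {
	r.added += float64(n)
	r.set = float64(n)
}

func main() {
	var r recoveredLeases

	// Assume, as raised in the review, that init runs twice in one
	// process lifetime and finds 10 leases each time.
	r.record(10)
	r.record(10)

	fmt.Println("accumulated with Add:", r.added)
	fmt.Println("overwritten with Set:", r.set)
}
```

If the function is only ever called once per process, both styles report the same number. If it can run again, only the overwrite style still reflects the lease count at the most recent initialization.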
I'm really confused about the motivation behind those metrics. Instead of trying to add a metric for everything just in case, can you describe what you want to achieve? We can discuss and design a metric for each use case.
Sorry for the confusion. We observed a lot of lease churn in the recent past, and metrics would help us get to the bottom of it. For intuition
I'm open to ideas on how to implement this better or how to reuse existing metrics to derive these. Thanks for taking a look, @serathius.
Bump @serathius or @ahrtr, thanks.^
Error metrics LeaseGrant, LeaseRevoke, LeaseRefresh should also be available in the QPS metrics.
How would you know that by metric?
@@ -280,10 +280,12 @@ func (le *lessor) SetCheckpointer(cp Checkpointer) {

func (le *lessor) Grant(id LeaseID, ttl int64) (*Lease, error) {
	if id == NoLease {
		leaseGrantError.WithLabelValues(ErrLeaseNotFound.Error()).Inc()
Is ErrLeaseNotFound mapped to a proper gRPC error code? If so, you can just check the QPS metrics for the LeaseGrant method and this specific error code.
FYI.
etcd/api/v3rpc/rpctypes/error.go
Line 181 in c1171da
ErrLeaseNotFound = Error(ErrGRPCLeaseNotFound)
grpc_server_handled_total{grpc_code="Aborted",grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="LeaseRevoke",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="LeaseKeepAlive",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"} 0
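As a rough illustration of reading lease errors from the existing RPC metrics, the sketch below parses `grpc_server_handled_total` samples in the text format shown above and groups them by method and code. It uses only the standard library. The names `leaseRPCCounts` and `rpcKey` are hypothetical, and the regex-based parsing is a simplification that assumes label values contain no escaped quotes.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	// Matches one grpc_server_handled_total sample: label set, then value.
	sampleRe = regexp.MustCompile(`^grpc_server_handled_total\{(.*)\}\s+(\S+)$`)
	// Matches one name="value" pair inside the label set.
	labelRe = regexp.MustCompile(`(\w+)="([^"]*)"`)
)

// rpcKey identifies a (method, code) pair, e.g. LeaseGrant / Aborted.
type rpcKey struct{ method, code string }

// leaseRPCCounts sums grpc_server_handled_total samples for the
// etcdserverpb.Lease service, grouped by gRPC method and status code.
// Lines that are not matching samples are skipped.
func leaseRPCCounts(exposition string) (map[rpcKey]float64, error) {
	counts := make(map[rpcKey]float64)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		m := sampleRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		labels := make(map[string]string)
		for _, lm := range labelRe.FindAllStringSubmatch(m[1], -1) {
			labels[lm[1]] = lm[2]
		}
		if labels["grpc_service"] != "etcdserverpb.Lease" {
			continue
		}
		v, err := strconv.ParseFloat(m[2], 64)
		if err != nil {
			return nil, fmt.Errorf("parse value in %q: %w", line, err)
		}
		counts[rpcKey{labels["grpc_method"], labels["grpc_code"]}] += v
	}
	return counts, sc.Err()
}

func main() {
	// The three samples quoted in the thread.
	input := `grpc_server_handled_total{grpc_code="Aborted",grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="LeaseRevoke",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="LeaseKeepAlive",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"} 0`

	counts, err := leaseRPCCounts(input)
	if err != nil {
		panic(err)
	}
	for _, method := range []string{"LeaseGrant", "LeaseRevoke", "LeaseKeepAlive"} {
		fmt.Println(method, "Aborted:", counts[rpcKey{method, "Aborted"}])
	}
}
```

Which `grpc_code` a given lessor error appears under depends on how that error is mapped to a gRPC status, so the code to filter on has to be checked against the mapping in `rpctypes`.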
Help
LeaseLeases
response.