etcd_cluster_version metric is missing #18805
Comments
And here are all the available metrics:
I couldn't reproduce this issue in 3.5.16. Please try the latest version (3.5.16) and provide the exact steps to reproduce it.
Discussed during the sig-etcd triage meeting. @ugur99, we'll need more info on how to reproduce this. We could discuss it in the sig-etcd Slack, or would you be able to provide precise step-by-step instructions here on how to recreate your scenario? At the moment there is too much ambiguity to recreate your setup.
Yep, you are totally right @jmhbnz. I think I can reproduce the issue locally with the following steps:
# download etcd binaries
mkdir -p /tmp/etcd-download
curl -L https://storage.googleapis.com/etcd/v3.5.12/etcd-v3.5.12-linux-arm64.tar.gz -o /tmp/etcd-v3.5.12-linux-arm64.tar.gz
tar xzvf /tmp/etcd-v3.5.12-linux-arm64.tar.gz -C /tmp/etcd-download --strip-components=1
cp /tmp/etcd-download/etcd /usr/bin/etcd
cp /tmp/etcd-download/etcdctl /usr/bin/etcdctl
cp /tmp/etcd-download/etcdutl /usr/bin/etcdutl
# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=etcd
[Service]
Type=notify
User=root
EnvironmentFile=/etc/etcd.env
ExecStart=/usr/bin/etcd
NotifyAccess=all
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target
# cat /etc/etcd.env
# Environment file for etcd v3.5.12
ETCD_DATA_DIR=/var/lib/etcd
ETCD_ADVERTISE_CLIENT_URLS=https://<node2_host_ip>:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://<node2_host_ip>:2380
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_METRICS=basic
ETCD_LISTEN_METRICS_URLS=http://<node2_host_ip>:2381,http://127.0.0.1:2381
ETCD_LISTEN_CLIENT_URLS=https://<node2_host_ip>:2379,https://127.0.0.1:2379
ETCD_ELECTION_TIMEOUT=5000
ETCD_HEARTBEAT_INTERVAL=250
ETCD_INITIAL_CLUSTER_TOKEN=k8s_etcd
ETCD_LISTEN_PEER_URLS=https://<node2_host_ip>:2380
ETCD_NAME=etcd2
ETCD_PROXY=off
ETCD_INITIAL_CLUSTER=etcd1=https://<node1_host_ip>:2380,etcd2=https://<node2_host_ip>:2380
ETCD_AUTO_COMPACTION_RETENTION=8
ETCD_SNAPSHOT_COUNT=10000
ETCD_QUOTA_BACKEND_BYTES=6147483648
# Flannel need etcd v2 API
ETCD_ENABLE_V2=true
# TLS settings
ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-node2.pem
ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-node2-key.pem
ETCD_CLIENT_CERT_AUTH=true
ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_PEER_CERT_FILE=/etc/ssl/etcd/ssl/member-node2.pem
ETCD_PEER_KEY_FILE=/etc/ssl/etcd/ssl/member-node2-key.pem
ETCD_PEER_CLIENT_CERT_AUTH=True
# CLI settings
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-node2-key.pem
ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-node2.pem
# ETCD 3.5.x issue
# https://groups.google.com/a/kubernetes.io/g/dev/c/B7gJs88XtQc/m/rSgNOzV2BwAJ?utm_medium=email&utm_source=footer
ETCD_EXPERIMENTAL_INITIAL_CORRUPT_CHECK=True
# ETCD tracing configuration
ETCD_EXPERIMENTAL_ENABLE_DISTRIBUTED_TRACING=true
ETCD_EXPERIMENTAL_DISTRIBUTED_TRACING_SAMPLING_RATE=100
ETCD_EXPERIMENTAL_DISTRIBUTED_TRACING_ADDRESS=0.0.0.0:4317
ETCD_EXPERIMENTAL_DISTRIBUTED_TRACING_SERVICE_NAME=etcd
ETCD_EXPERIMENTAL_DISTRIBUTED_TRACING_INSTANCE_ID=etcd2
# create first helper script for certificate generation
# cat make-ssl-etcd.sh
set -o errexit
set -o pipefail
MASTERS=(node1 node2)
HOSTS=(node1 node2)
CONFIG="/root/openssl.conf"
SSLDIR="/etc/ssl/etcd/ssl"
tmpdir=$(mktemp -d letcd_cacert.XXXXXX)
trap 'rm -rf "${tmpdir}"' EXIT
cd "${tmpdir}"
# Root CA
if [ -e "$SSLDIR/ca-key.pem" ]; then
    # Reuse existing CA
    cp $SSLDIR/{ca.pem,ca-key.pem} .
else
    openssl genrsa -out ca-key.pem 2048 > /dev/null 2>&1
    openssl req -x509 -new -nodes -key ca-key.pem -days 36500 -out ca.pem -subj "/CN=etcd-ca" > /dev/null 2>&1
fi
# ETCD member
if [ -n "$MASTERS" ]; then
    echo "Creating certs for masters"
    for host in "${MASTERS[@]}"; do
        echo "Creating certs for $host"
        cn="${host%%.*}"
        # Member key
        openssl genrsa -out member-${host}-key.pem 2048 > /dev/null 2>&1
        openssl req -new -key member-${host}-key.pem -out member-${host}.csr -subj "/CN=etcd-member-${cn}" -config ${CONFIG} > /dev/null 2>&1
        openssl x509 -req -in member-${host}.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out member-${host}.pem -days 36500 -extensions ssl_client -extfile ${CONFIG} > /dev/null 2>&1
        # Admin key
        openssl genrsa -out admin-${host}-key.pem 2048 > /dev/null 2>&1
        openssl req -new -key admin-${host}-key.pem -out admin-${host}.csr -subj "/CN=etcd-admin-${cn}" > /dev/null 2>&1
        openssl x509 -req -in admin-${host}.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out admin-${host}.pem -days 36500 -extensions ssl_client -extfile ${CONFIG} > /dev/null 2>&1
    done
fi
# Node keys
if [ -n "$HOSTS" ]; then
    echo "Creating certs for hosts"
    for host in "${HOSTS[@]}"; do
        echo "Creating certs for $host"
        cn="${host%%.*}"
        openssl genrsa -out node-${host}-key.pem 2048 > /dev/null 2>&1
        openssl req -new -key node-${host}-key.pem -out node-${host}.csr -subj "/CN=etcd-node-${cn}" > /dev/null 2>&1
        openssl x509 -req -in node-${host}.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out node-${host}.pem -days 36500 -extensions ssl_client -extfile ${CONFIG} > /dev/null 2>&1
    done
fi
# Install certs
if [ -e "$SSLDIR/ca-key.pem" ]; then
    # No pass existing CA
    rm -f ca.pem ca-key.pem
fi
if [ -n "$(ls -A *.pem)" ]; then
    mv *.pem ${SSLDIR}/
fi
chmod +x make-ssl-etcd.sh
# create openssl configuration file
# cat /root/openssl.conf
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
subjectAltName = @alt_names
[ ssl_client ]
extendedKeyUsage = clientAuth, serverAuth
basicConstraints = CA:FALSE
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid,issuer
subjectAltName = @alt_names
[ v3_ca ]
basicConstraints = CA:TRUE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
subjectAltName = @alt_names
authorityKeyIdentifier=keyid:always,issuer
[alt_names]
DNS.1 = localhost
DNS.2 = etcd.kube-system.svc.cluster.local
DNS.3 = etcd.kube-system.svc
DNS.4 = etcd.kube-system
DNS.5 = etcd
DNS.6 = node1
DNS.7 = node2
DNS.8 = lb-apiserver.kubernetes.local
IP.1 = 127.0.0.1
IP.2 = <node1_host_ip>
IP.3 = <node2_host_ip>
# create certificates directory on the second node
mkdir -p /etc/ssl/etcd/ssl
# copy etcd root CA and key to the second node (for simplicity I will skip controlplane joining steps; but in our environments we first join the new controlplane to the cluster and it fetches etcd CA from the secret in the cluster and we have another script that fetches etcd CA.key from the vault)
scp /etc/ssl/etcd/ssl/ca.pem /etc/ssl/etcd/ssl/ca-key.pem root@<node2_host_ip>:/etc/ssl/etcd/ssl/
# run the certificate generation script on the second node
./make-ssl-etcd.sh
# create second helper script for etcd member joining
# cat join-etcd-member.sh
#!/bin/bash
export ETCDCTL_API=3
ETCD_CLUSTER=(<node1_host_ip> <node2_host_ip>)
PEER_URL="https://<node2_host_ip>:2380"
KEY="/etc/ssl/etcd/ssl/admin-node2-key.pem"
CERT="/etc/ssl/etcd/ssl/admin-node2.pem"
CACERT="/etc/ssl/etcd/ssl/ca.pem"
LOG_FILE="/var/log/etcd_join_member.log"
# Function to log messages with timestamps
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a $LOG_FILE
}
# Function to check etcd instance status
check_etcd_status() {
    local endpoint=$1
    etcdctl --endpoints=https://$endpoint:2379 --key=$KEY --cert=$CERT --cacert=$CACERT endpoint status > /dev/null 2>&1
    return $?
}
# Function to add new etcd member
add_etcd_member() {
    local endpoint=$1
    etcdctl --endpoints=https://$endpoint:2379 --key=$KEY --cert=$CERT --cacert=$CACERT member add etcd2 --peer-urls=$PEER_URL
    return $?
}
# Retry mechanism
RETRIES=2
SLEEP_INTERVAL=2
for endpoint in "${ETCD_CLUSTER[@]}"; do
    for ((i=1; i<=RETRIES; i++)); do
        check_etcd_status $endpoint
        if [ $? -eq 0 ]; then
            log "etcd instance $endpoint is reachable"
            add_etcd_member $endpoint
            if [ $? -eq 0 ]; then
                log "Successfully added new etcd member using instance $endpoint"
                exit 0
            else
                log "Failed to add new etcd member using instance $endpoint"
            fi
            break
        else
            log "etcd instance $endpoint is not reachable (Attempt $i/$RETRIES)"
            sleep $SLEEP_INTERVAL
        fi
    done
done
log "Failed to add new etcd member after checking all instances"
exit 1
# run the etcd member joining script on the second node
./join-etcd-member.sh
# start the etcd service
systemctl start etcd
In the Flatcar Ignition template, we follow almost the same steps. After the new etcd/controlplane node is provisioned, these helper scripts set up the etcd service, download the binary, retrieve the etcd ca.key from the vault (the etcd root CA comes with the kubeadm join controlplane step), generate the certificates, and finally join the new etcd member to the etcd cluster. With this setup we can easily perform scaling operations for controlplanes running in a bare-metal environment.
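As a rough sketch of how the join could be verified after systemctl start etcd, the helper below reads the output of etcdctl member list from stdin and reports whether a named member is in the started state. The function name member_started is hypothetical and not part of the scripts above, and the parsing assumes the default comma-separated output (ID, status, name, peer URLs, client URLs, learner flag).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the member list read from stdin
# contains the given member name with status "started".
# Assumes the default "ID, status, name, peerURLs, clientURLs, isLearner" layout.
member_started() {
    local name=$1
    awk -F', ' -v n="$name" '
        $3 == n && $2 == "started" { found = 1 }
        END { exit !found }
    '
}

# Example, reusing the certificate paths from join-etcd-member.sh:
#   etcdctl --endpoints=https://127.0.0.1:2379 \
#     --key=/etc/ssl/etcd/ssl/admin-node2-key.pem \
#     --cert=/etc/ssl/etcd/ssl/admin-node2.pem \
#     --cacert=/etc/ssl/etcd/ssl/ca.pem \
#     member list | member_started etcd2 && echo "etcd2 is started"
```

A member that has been added but whose process has not yet joined is listed as unstarted with an empty name, so the check fails until the new etcd instance is actually running.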
Hi @ugur99 can you check the status of the servers with
? |
Sure @siyuanfoundation, here it is:
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.64.2:2379 | 1c5bce91c45fca0c | 3.5.12 | 4.9 MB | true | false | 4 | 2365 | 2365 | |
| 192.168.64.3:2379 | 82fc98698e1ab9a9 | 3.5.12 | 4.7 MB | false | false | 4 | 2365 | 2365 | |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Bug report criteria
What happened?
New etcd members do not expose the etcd_cluster_version metric; as far as we have noticed, this is the only metric that is missing. According to the etcd logs, the instance knows the cluster version, but the metric is still missing.
What did you expect to happen?
etcd_cluster_version should be available.
How can we reproduce it (as minimally and precisely as possible)?
Deploy a cluster using kubespray, with etcd running as a systemd service on each controlplane node. Then remove the controlplanes (and their etcd members) one by one from the cluster, reprovisioning each new controlplane from the Ignition template (we have baked all the controlplane and etcd setup/join processes into the Ignition template). We use exactly the same configuration and the same etcd member names, and followed similar steps for the etcd member join/remove process.
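One way to confirm the symptom on a reprovisioned member is to scrape its metrics listener and look for a sample line of the metric. This is a minimal sketch, assuming the listener address from ETCD_LISTEN_METRICS_URLS in the environment file (http://127.0.0.1:2381) and the standard /metrics path; the helper name has_metric is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the Prometheus text-format dump
# read from stdin contains at least one sample line for the given metric name.
# "# HELP" and "# TYPE" comment lines do not count as samples, and metrics
# that merely share the prefix (for example name_total) are not matched.
has_metric() {
    local name=$1
    grep -Eq "^${name}(\{| )"
}

# Example against the local metrics listener:
#   if curl -s http://127.0.0.1:2381/metrics | has_metric etcd_cluster_version; then
#       echo "etcd_cluster_version is exposed"
#   else
#       echo "etcd_cluster_version is missing"
#   fi
```

Running the same check against an original member and a reprovisioned one would show whether only the new member lacks the metric.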
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
# Environment file for etcd v3.5.12
ETCD_DATA_DIR=/var/lib/etcd
ETCD_ADVERTISE_CLIENT_URLS=https://12.60.87.25:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://12.60.87.25:2380
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_METRICS=basic
ETCD_LISTEN_METRICS_URLS=http://12.60.87.25:2381,http://127.0.0.1:2381
ETCD_LISTEN_CLIENT_URLS=https://12.60.87.25:2379,https://127.0.0.1:2379
ETCD_ELECTION_TIMEOUT=5000
ETCD_HEARTBEAT_INTERVAL=250
ETCD_INITIAL_CLUSTER_TOKEN=k8s_etcd
ETCD_LISTEN_PEER_URLS=https://12.60.87.25:2380
ETCD_NAME=etcd1
ETCD_PROXY=off
ETCD_INITIAL_CLUSTER=etcd1=https://12.60.87.25:2380,etcd2=https://12.60.87.90:2380,etcd3=https://12.60.87.152:2380
ETCD_AUTO_COMPACTION_RETENTION=8
ETCD_SNAPSHOT_COUNT=10000
ETCD_QUOTA_BACKEND_BYTES=6147483648
# Flannel need etcd v2 API
ETCD_ENABLE_V2=true
# TLS settings
ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-dev-k8s-control1.pem
ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-dev-k8s-control1-key.pem
ETCD_CLIENT_CERT_AUTH=true
ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_PEER_CERT_FILE=/etc/ssl/etcd/ssl/member-dev-k8s-control1.pem
ETCD_PEER_KEY_FILE=/etc/ssl/etcd/ssl/member-dev-k8s-control1-key.pem
ETCD_PEER_CLIENT_CERT_AUTH=True
# CLI settings
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-dev-k8s-control1-key.pem
ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-dev-k8s-control1.pem
# ETCD 3.5.x issue
# https://groups.google.com/a/kubernetes.io/g/dev/c/B7gJs88XtQc/m/rSgNOzV2BwAJ?utm_medium=email&utm_source=footer
ETCD_EXPERIMENTAL_INITIAL_CORRUPT_CHECK=True
# ETCD tracing configuration
ETCD_EXPERIMENTAL_ENABLE_DISTRIBUTED_TRACING=true
ETCD_EXPERIMENTAL_DISTRIBUTED_TRACING_SAMPLING_RATE=100
ETCD_EXPERIMENTAL_DISTRIBUTED_TRACING_ADDRESS=0.0.0.0:4317
ETCD_EXPERIMENTAL_DISTRIBUTED_TRACING_SERVICE_NAME=etcd
ETCD_EXPERIMENTAL_DISTRIBUTED_TRACING_INSTANCE_ID=etcd1
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output