Observing Anthos clusters on bare metal
In this lab, you explore how to collect and analyze telemetry from Anthos clusters on bare metal deployments and services.
You leverage Google Cloud’s operations suite to do so. Anthos clusters on bare metal write their logs into Cloud Logging and their metrics into Cloud Monitoring. Once the data is saved in the cloud, you can use the operations suite dashboards to understand and troubleshoot service performance and to see how it evolves over time. In addition, you install Anthos Service Mesh to collect tracing information automatically.
In this lab, you run Anthos clusters on bare metal on top of GCE VMs. This requires a little extra work because the load balancer nodes need Layer 2 connectivity, so the VMs have been created and configured to use VXLAN, which encapsulates Layer 2 connections over a Layer 3 network. In a pure bare metal deployment, you would simply skip this step and everything else would remain the same.
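For context only (the lab VMs are already configured for you), here is a minimal sketch of how a VXLAN interface between two such VMs might be created; the interface name, VNI, peer IP, and overlay address are illustrative assumptions, not the lab's actual values:

# Create a VXLAN interface on top of the underlay NIC (ens4) and peer it with another VM
ip link add vxlan0 type vxlan id 101 dev ens4 dstport 4789
bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 10.1.0.4   # flood entry pointing at the peer's underlay IP
ip addr add 10.200.0.2/24 dev vxlan0                          # overlay address in the cluster subnet
ip link set up dev vxlan0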
- Learn how to configure the level of logging in your Anthos on bare metal clusters.
- Evaluate service performance using Cloud Monitoring and Cloud Logging features within Google Cloud.
- Install Anthos Service Mesh and investigate the telemetry in Cloud Tracing.
- Enable Node Problem Detector to detect issues in your cluster nodes.
Explore the pre-created environment
This environment has been pre-configured with a hybrid Anthos cluster on bare metal running on Google Cloud’s Compute Engine VMs. The hybrid deployment model is a specialized multi-cluster deployment that enables you to run user workloads on your admin cluster. In this type of deployment, you could add more user clusters to support multiple teams or workload types.
The pre-created VMs have the following functions:
- setup-vm: a VM used to create the rest of the infrastructure. Review its startup_script to see the steps taken to create the infrastructure and install the Anthos software.
- abm-ws: an admin workstation in the same network as the cluster, used to perform the configuration, creation, and management of the Anthos cluster on bare metal.
- abm-user-cp1: the control plane node.
- abm-user-w1 & abm-user-w2: two worker nodes.
- Go to Navigation menu > Kubernetes Engine > Clusters and verify that you have a single Anthos cluster registered.
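If you prefer the command line, a quick equivalent check from Cloud Shell (assuming a recent gcloud release with the fleet commands available) is:

gcloud container fleet memberships list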
Log in to your Anthos cluster
When you create Anthos clusters in Google Cloud, AWS, or VMware, you typically use an environment-specific installation process that takes advantage of native APIs.
When you create a bare metal cluster, the installation process doesn’t automatically create machines for you (typically, they are physical machines so they can’t be created out of thin air). That doesn’t mean, however, that you can’t create “bare metal” clusters running on VMs in any of those environments.
In this lab, the “bare metal” cluster has been created on GCE VMs. It will behave almost identically to a bare metal cluster running on physical devices in your data center. Where the administration in the lab deviates from a pure bare metal scenario, the lab instructions will make it clear.
The cluster in the lab (see diagram below) is made of a single control plane node and two worker nodes. In a production environment, you might consider using three control plane nodes for high availability of both the cluster data and the control plane.
student_04_31eb84f42862@cloudshell:~ (qwiklabs-gcp-00-6b98c2173360)$ ZONE=us-central1-b
student_04_31eb84f42862@cloudshell:~ (qwiklabs-gcp-00-6b98c2173360)$ gcloud compute ssh --ssh-flag="-A" root@abm-ws \
--zone $ZONE \
--tunnel-through-iap
WARNING: The private SSH key file for gcloud does not exist.
WARNING: The public SSH key file for gcloud does not exist.
WARNING: You do not have an SSH key for gcloud.
WARNING: SSH keygen will be executed to generate a key.
This tool needs to create the directory [/home/student_04_31eb84f42862/.ssh] before being able to generate SSH keys.
Do you want to continue (Y/n)? y
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/student_04_31eb84f42862/.ssh/google_compute_engine
Your public key has been saved in /home/student_04_31eb84f42862/.ssh/google_compute_engine.pub
The key fingerprint is:
SHA256:oRWP/U924pMlvcxhuLsmsUUlgS5by1kYML4MLAMdJUo student_04_31eb84f42862@cs-421171858362-default
The key's randomart image is:
+---[RSA 3072]----+
| E.oo..o. ... |
| . o.o .=.o . . |
| . o o+.+ o o |
| oooo.= o.. |
| . So= *.=o+|
| . = *=*o|
| +.=+ |
| o ... |
| oo. |
+----[SHA256]-----+
Updating project ssh metadata...
Updating project ssh metadata...working.
Updating project ssh metadata...working..
Updating project ssh metadata...working...
Updating project ssh metadata...working...Updated [https://www.googleapis.com/compute/v1/projects/qwiklabs-gcp-00-6b98c2173360].
Updating project ssh metadata...done.
Waiting for SSH key to propagate.
WARNING:
To increase the performance of the tunnel, consider installing NumPy. For instructions,
please see https://cloud.google.com/iap/docs/using-tcp-forwarding#increasing_the_tcp_upload_bandwidth
Warning: Permanently added 'compute.3410199921532884116' (ED25519) to the list of known hosts.
WARNING:
To increase the performance of the tunnel, consider installing NumPy. For instructions,
please see https://cloud.google.com/iap/docs/using-tcp-forwarding#increasing_the_tcp_upload_bandwidth
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1062-gcp x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/pro
System information as of Sun Jul 28 15:36:06 UTC 2024
System load: 0.0 Processes: 128
Usage of /: 1.4% of 247.92GB Users logged in: 0
Memory usage: 2% IPv4 address for ens4: 10.1.0.3
Swap usage: 0%
Expanded Security Maintenance for Applications is not enabled.
1 update can be applied immediately.
To see these additional updates run: apt list --upgradable
Enable ESM Apps to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status
New release '22.04.3 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
root@abm-ws:~#
root@abm-ws:~# cd baremetal
root@abm-ws:~/baremetal# export CLUSTER_ID=abm-hybrid-cluster
root@abm-ws:~/baremetal# export KUBECONFIG=$HOME/baremetal/bmctl-workspace/$CLUSTER_ID/$CLUSTER_ID-kubeconfig
root@abm-ws:~/baremetal# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
abm-user-cp1 Ready control-plane 8h v1.26.2-gke.1001 10.200.0.3 <none> Ubuntu 20.04.6 LTS 5.15.0-1062-gcp containerd://1.6.18-gke.0
abm-user-w1 Ready worker 8h v1.26.2-gke.1001 10.200.0.4 <none> Ubuntu 20.04.6 LTS 5.15.0-1062-gcp containerd://1.6.18-gke.0
abm-user-w2 Ready worker 8h v1.26.2-gke.1001 10.200.0.5 <none> Ubuntu 20.04.6 LTS 5.15.0-1062-gcp containerd://1.6.18-gke.0
root@abm-ws:~/baremetal#
Look at the configuration file used to create the Anthos cluster on bare metal and check that the desired options match the current configuration:
root@abm-ws:~/baremetal# cat bmctl-workspace/abm-hybrid-cluster/abm-hybrid-cluster.yaml
# bmctl configuration variables. Because this section is valid YAML but not a valid Kubernetes
# resource, this section can only be included when using bmctl to
# create the initial admin/hybrid cluster. Afterwards, when creating user clusters by directly
# applying the cluster and node pool resources to the existing cluster, you must remove this
# section.
gcrKeyPath: bmctl-workspace/.sa-keys/qwiklabs-gcp-00-6b98c2173360-anthos-baremetal-gcr.json
sshPrivateKeyPath: /root/.ssh/id_rsa
gkeConnectAgentServiceAccountKeyPath: bmctl-workspace/.sa-keys/qwiklabs-gcp-00-6b98c2173360-anthos-baremetal-connect.json
gkeConnectRegisterServiceAccountKeyPath: bmctl-workspace/.sa-keys/qwiklabs-gcp-00-6b98c2173360-anthos-baremetal-register.json
cloudOperationsServiceAccountKeyPath: bmctl-workspace/.sa-keys/qwiklabs-gcp-00-6b98c2173360-anthos-baremetal-cloud-ops.json
---
apiVersion: v1
kind: Namespace
metadata:
name: cluster-abm-hybrid-cluster
---
# Cluster configuration. Note that some of these fields are immutable once the cluster is created.
# For more info, see https://cloud.google.com/anthos/clusters/docs/bare-metal/1.15/reference/cluster-config-ref#cluster_configuration_fields
apiVersion: baremetal.cluster.gke.io/v1
kind: Cluster
metadata:
name: abm-hybrid-cluster
namespace: cluster-abm-hybrid-cluster
spec:
# Cluster type. This can be:
# 1) admin: to create an admin cluster. This can later be used to create user clusters.
# 2) user: to create a user cluster. Requires an existing admin cluster.
# 3) hybrid: to create a hybrid cluster that runs admin cluster components and user workloads.
# 4) standalone: to create a cluster that manages itself, runs user workloads, but does not manage other clusters.
type: hybrid
# Cluster profile. This can be either 'default' or 'edge'.
# The edge profile is tailored for deployments on the edge locations
# and should be used together with the 'standalone' cluster type.
profile: default
# Anthos cluster version.
anthosBareMetalVersion: 1.15.0
# GKE connect configuration
gkeConnect:
projectID: qwiklabs-gcp-00-6b98c2173360
# Control plane configuration
controlPlane:
nodePoolSpec:
nodes:
# Control plane node pools. Typically, this is either a single machine
# or 3 machines if using a high availability deployment.
- address: 10.200.0.3
# Cluster networking configuration
clusterNetwork:
# Pods specify the IP ranges from which pod networks are allocated.
pods:
cidrBlocks:
- 192.168.0.0/16
# Services specify the network ranges from which service virtual IPs are allocated.
# This can be any RFC1918 range that does not conflict with any other IP range
# in the cluster and node pool resources.
services:
cidrBlocks:
- 10.96.0.0/20
# Load balancer configuration
loadBalancer:
# Load balancer mode can be either 'bundled' or 'manual'.
# In 'bundled' mode a load balancer will be installed on load balancer nodes during cluster creation.
# In 'manual' mode the cluster relies on a manually-configured external load balancer.
mode: bundled
# Load balancer port configuration
ports:
# Specifies the port the load balancer serves the Kubernetes control plane on.
# In 'manual' mode the external load balancer must be listening on this port.
controlPlaneLBPort: 443
# There are two load balancer virtual IP (VIP) addresses: one for the control plane
# and one for the L7 Ingress service.
# If you use Layer2 load balancing, the VIPs must be in the same subnet as the load balancer nodes.
# If you use bundled BGP-based load balancing (mode: 'bundled' and type: 'bgp'), the VIPs
# must not come from the same subnet as any of the nodes in the cluster.
# These IP addresses do not correspond to physical network interfaces.
vips:
# ControlPlaneVIP specifies the VIP to connect to the Kubernetes API server.
# This address must not be in the address pools below.
controlPlaneVIP: 10.200.0.99
# IngressVIP specifies the VIP shared by all services for ingress traffic.
# Allowed only in non-admin clusters.
# This address must be in the address pools below.
ingressVIP: 10.200.0.100
# AddressPools is a list of non-overlapping IP ranges for the data plane load balancer.
# All addresses must be in the same subnet as the load balancer nodes.
# Address pool configuration is only valid for 'bundled' LB mode in non-admin clusters.
addressPools:
- name: pool1
addresses:
# # Each address must be either in the CIDR form (1.2.3.0/24)
# # or range form (1.2.3.1-1.2.3.5).
- 10.200.0.100-10.200.0.200
# A load balancer node pool can be configured to specify nodes used for load balancing.
# These nodes are part of the Kubernetes cluster and run regular workloads as well as load balancers.
# If the node pool config is absent then the control plane nodes are used.
# Node pool configuration is only valid for 'bundled' LB mode.
# nodePoolSpec:
# nodes:
# - address: 10.200.0.3
# Proxy configuration
# proxy:
# url: http://[username:password@]domain
# # A list of IPs, hostnames or domains that should not be proxied.
# noProxy:
# - 127.0.0.1
# - localhost
# Logging and Monitoring
clusterOperations:
# Cloud project for logs and metrics.
projectID: qwiklabs-gcp-00-6b98c2173360
# Cloud location for logs and metrics.
location: us-central1
# Whether collection of application logs/metrics should be enabled (in addition to
# collection of system logs/metrics which correspond to system components such as
# Kubernetes control plane or cluster management agents).
enableApplication: true
# Storage configuration
storage:
# lvpNodeMounts specifies the config for local PersistentVolumes backed by mounted disks.
# These disks need to be formatted and mounted by the user, which can be done before or after
# cluster creation.
lvpNodeMounts:
# path specifies the host machine path where mounted disks will be discovered and a local PV
# will be created for each mount.
path: /mnt/localpv-disk
# storageClassName specifies the StorageClass that PVs will be created with. The StorageClass
# is created during cluster creation.
storageClassName: local-disks
# lvpShare specifies the config for local PersistentVolumes backed by subdirectories in a shared filesystem.
# These subdirectories are automatically created during cluster creation.
lvpShare:
# path specifies the host machine path where subdirectories will be created on each host. A local PV
# will be created for each subdirectory.
path: /mnt/localpv-share
# storageClassName specifies the StorageClass that PVs will be created with. The StorageClass
# is created during cluster creation.
storageClassName: local-shared
# numPVUnderSharedPath specifies the number of subdirectories to create under path.
numPVUnderSharedPath: 5
# NodeConfig specifies the configuration that applies to all nodes in the cluster.
nodeConfig:
# podDensity specifies the pod density configuration.
podDensity:
# maxPodsPerNode specifies at most how many pods can be run on a single node.
maxPodsPerNode: 250
# Authentication; uncomment this section if you wish to enable authentication to the cluster with OpenID Connect.
# authentication:
# oidc:
# # issuerURL specifies the URL of your OpenID provider, such as "https://accounts.google.com". The Kubernetes API
# # server uses this URL to discover public keys for verifying tokens. Must use HTTPS.
# issuerURL: <URL for OIDC Provider; required>
# # clientID specifies the ID for the client application that makes authentication requests to the OpenID
# # provider.
# clientID: <ID for OIDC client application; required>
# # clientSecret specifies the secret for the client application.
# clientSecret: <Secret for OIDC client application; optional>
# # kubectlRedirectURL specifies the redirect URL (required) for the gcloud CLI, such as
# # "http://localhost:[PORT]/callback".
# kubectlRedirectURL: <Redirect URL for the gcloud CLI; optional, default is "http://kubectl.redirect.invalid">
# # username specifies the JWT claim to use as the username. The default is "sub", which is expected to be a
# # unique identifier of the end user.
# username: <JWT claim to use as the username; optional, default is "sub">
# # usernamePrefix specifies the prefix prepended to username claims to prevent clashes with existing names.
# usernamePrefix: <Prefix prepended to username claims; optional>
# # group specifies the JWT claim that the provider will use to return your security groups.
# group: <JWT claim to use as the group name; optional>
# # groupPrefix specifies the prefix prepended to group claims to prevent clashes with existing names.
# groupPrefix: <Prefix prepended to group claims; optional>
# # scopes specifies additional scopes to send to the OpenID provider as a comma-delimited list.
# scopes: <Additional scopes to send to OIDC provider as a comma-separated list; optional>
# # extraParams specifies additional key-value parameters to send to the OpenID provider as a comma-delimited
# # list.
# extraParams: <Additional key-value parameters to send to OIDC provider as a comma-separated list; optional>
# # proxy specifies the proxy server to use for the cluster to connect to your OIDC provider, if applicable.
# # Example: https://user:password@10.10.10.10:8888. If left blank, this defaults to no proxy.
# proxy: <Proxy server to use for the cluster to connect to your OIDC provider; optional, default is no proxy>
# # deployCloudConsoleProxy specifies whether to deploy a reverse proxy in the cluster to allow Google Cloud
# # Console access to the on-premises OIDC provider for authenticating users. If your identity provider is not
# # reachable over the public internet, and you wish to authenticate using Google Cloud Console, then this field
# # must be set to true. If left blank, this field defaults to false.
# deployCloudConsoleProxy: <Whether to deploy a reverse proxy for Google Cloud Console authentication; optional>
# # certificateAuthorityData specifies a Base64 PEM-encoded certificate authority certificate of your identity
# # provider. It's not needed if your identity provider's certificate was issued by a well-known public CA.
# # However, if deployCloudConsoleProxy is true, then this value must be provided, even for a well-known public
# # CA.
# certificateAuthorityData: <Base64 PEM-encoded certificate authority certificate of your OIDC provider; optional>
# Node access configuration; uncomment this section if you wish to use a non-root user
# with passwordless sudo capability for machine login.
# nodeAccess:
# loginUser: <login user name>
---
# Node pools for worker nodes
apiVersion: baremetal.cluster.gke.io/v1
kind: NodePool
metadata:
name: hybrid-cluster-pool-1
namespace: cluster-abm-hybrid-cluster
spec:
clusterName: abm-hybrid-cluster
nodes:
- address: 10.200.0.4
- address: 10.200.0.5
root@abm-ws:~/baremetal#
Log in to your Anthos cluster on bare metal from the Console
- In Cloud Shell, create a Kubernetes Service Account on your cluster and grant it the cluster-admin role (example commands are sketched after this list).
- Create a token that you can use to log in to the cluster from the Console (also shown in the sketch below).
- Select the token in the SSH session (this copies the token; don't try to copy with CTRL+C).
- Find the abm-hybrid-cluster entry in the cluster list shown in the Console and click the Actions icon (three dots) at the far right of the row.
- Select Log in, select Token, then paste the token from your clipboard into the provided field. Click Login. When you're done, it should look like this:
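The exact commands are not reproduced in this transcript. A sketch of the first two steps, run on the admin workstation with the kubeconfig exported earlier and using a hypothetical service account name console-admin-sa (any name works), might look like this:

kubectl create serviceaccount console-admin-sa
kubectl create clusterrolebinding console-admin-sa-binding \
  --clusterrole cluster-admin \
  --serviceaccount default:console-admin-sa
kubectl create token console-admin-sa   # prints a bearer token to paste into the Console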
Congratulations! You have successfully signed in to your Anthos on bare metal hybrid cluster!
Understand installed observability software stack
- Check all the observability tools installed in the cluster:
root@abm-ws:~/baremetal# kubectl -n kube-system get pods -l "managed-by=stackdriver"
NAME READY STATUS RESTARTS AGE
gke-metrics-agent-8kxjj 1/1 Running 0 8h
gke-metrics-agent-q8cfx 1/1 Running 0 8h
gke-metrics-agent-qsqwk 1/1 Running 0 8h
kube-state-metrics-cc8987686-g495r 1/1 Running 0 8h
node-exporter-9v9l6 1/1 Running 0 8h
node-exporter-mf2d5 1/1 Running 0 8h
node-exporter-mhcjx 1/1 Running 0 8h
stackdriver-log-forwarder-czsjj 1/1 Running 0 8h
stackdriver-log-forwarder-f87fs 1/1 Running 0 8h
stackdriver-log-forwarder-w789h 1/1 Running 0 8h
stackdriver-metadata-agent-cluster-level-fcd8f55f9-h9ppw 1/1 Running 0 8h
root@abm-ws:~/baremetal#
Cloud Logging and Cloud Monitoring are installed and activated in each cluster when you create a new admin, user, or hybrid cluster. The Stackdriver agents include several components on each cluster:
- Stackdriver Operator (stackdriver-operator-*). Manages the lifecycle for all other Stackdriver agents deployed onto the cluster.
- Stackdriver Custom Resource. A resource that is automatically created as part of the Anthos clusters on bare metal installation process.
- Stackdriver Log Forwarder (stackdriver-log-forwarder-*). A Fluent Bit daemonset that forwards logs from each machine to Cloud Logging. The Log Forwarder buffers the log entries on the node locally and re-sends them for up to 4 hours. If the buffer gets full or if the Log Forwarder can’t reach the Cloud Logging API for more than 4 hours, logs are dropped.
- Stackdriver Metadata Collector (stackdriver-metadata-agent-*). A deployment that sends metadata for Kubernetes resources such as pods, deployments, or nodes to the Stackdriver Resource Metadata API; this data is used to enrich metric queries by enabling you to query by deployment name, node name, or even Kubernetes service name.
You can always disable Cloud Logging and Cloud Monitoring by deleting the resources using kubectl delete. If you want to use a third-party service for monitoring your cluster, check out the documentation to find guides to work with third-party solutions such as Elastic Stack, Splunk Connect, or Datadog.
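The lab does not walk through disabling these agents. As an illustration only (do not run this in the lab), and assuming the Stackdriver custom resource is exposed as a stackdriver kind in the kube-system namespace, the deletion could look like this:

kubectl get crds | grep -i stackdriver        # confirm the actual CRD name first
kubectl -n kube-system get stackdriver        # assumed kind name; verify before deleting
kubectl -n kube-system delete stackdriver stackdriver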
Explore audit logs
- Verify that the configuration file used to create the Anthos cluster on bare metal does not include the setting that disables the collection of audit logs. Notice that the output is empty:
root@abm-ws:~/baremetal# cat bmctl-workspace/abm-hybrid-cluster/abm-hybrid-cluster.yaml | grep disableCloudAuditLogging:
root@abm-ws:~/baremetal#
Access the audit logs using the gcloud CLI:
root@abm-ws:~/baremetal# export PROJECT_ID=$(gcloud config get-value project)
root@abm-ws:~/baremetal# echo "gcloud logging read 'logName="projects/${PROJECT_ID}/logs/externalaudit.googleapis.com%2Factivity"
> AND resource.type="k8s_cluster"
> AND protoPayload.serviceName="anthosgke.googleapis.com"' --limit 2 --freshness 300d" > get_audit_logs.sh
root@abm-ws:~/baremetal# sh get_audit_logs.sh
Access the audit logs from the Console by navigating to Navigation menu > Logging > Logs Explorer and entering the following query. Replace PROJECT_ID in the query with your project ID and click Run query:
logName="projects/PROJECT_ID/logs/externalaudit.googleapis.com%2Factivity"
resource.type="k8s_cluster"
protoPayload.serviceName="anthosgke.googleapis.com"
Explore cluster logs
- On the Logs Explorer screen, remove all filters from the query box and click Run query.
Notice on the left side of the screen that you can choose to explore Kubernetes Container, Cluster, and Node logs. If you don’t see them, click the “Log fields” button to view them.
- In the left-side logs field panel, select Kubernetes Node.
- Now select the abm-user-cp1 node name to further filter the logs.
Explore application logs
- Come back to the Cloud Shell window you were using to SSH into the admin workstation. Verify that the configuration file used to create the Anthos cluster on bare metal has application log and metric collection enabled:
root@abm-ws:~/baremetal# cat bmctl-workspace/abm-hybrid-cluster/abm-hybrid-cluster.yaml | grep enableApplication:
enableApplication: true
root@abm-ws:~/baremetal#
Create a hello-app application and a Kubernetes Service of type LoadBalancer to access the app:
root@abm-ws:~/baremetal# kubectl create deployment hello-app --image=gcr.io/google-samples/hello-app:2.0
deployment.apps/hello-app created
root@abm-ws:~/baremetal# kubectl expose deployment hello-app --name hello-app-service --type LoadBalancer --port 80 --target-port=8080
service/hello-app-service exposed
root@abm-ws:~/baremetal#
Get the IP for this workload:
root@abm-ws:~/baremetal# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hello-app-service LoadBalancer 10.96.15.125 10.200.0.101 80:31517/TCP 24s
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 8h
root@abm-ws:~/baremetal#
Access the external IP provided by the hello-app-service by running the following command several times:
root@abm-ws:~/baremetal# curl 10.200.0.101
Hello, world!
Version: 2.0.0
Hostname: hello-app-75d4498bb8-92766
root@abm-ws:~/baremetal#
In the Console, go to Navigation menu > Logging > Logs Explorer, enter the following query, and click Run query:
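The query itself is not reproduced in this transcript; a plausible filter that surfaces the hello-app container logs (the cluster and container names are taken from the sample entry below) is:

resource.type="k8s_container"
resource.labels.cluster_name="abm-hybrid-cluster"
resource.labels.container_name="hello-app"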
Notice that a log was created every time that you ran the curl command.
{
"insertId": "l48pxjfif6f3f",
"jsonPayload": {
"logtag": "F",
"log": "2024/07/28 15:50:21 Serving request: /"
},
"resource": {
"type": "k8s_container",
"labels": {
"pod_name": "hello-app-75d4498bb8-92766",
"namespace_name": "default",
"cluster_name": "abm-hybrid-cluster",
"container_name": "hello-app",
"project_id": "qwiklabs-gcp-00-6b98c2173360",
"location": "us-central1"
}
},
"timestamp": "2024-07-28T15:50:21.616951031Z",
"labels": {
"k8s-pod/pod-template-hash": "75d4498bb8",
"k8s-pod/app": "hello-app"
},
"logName": "projects/qwiklabs-gcp-00-6b98c2173360/logs/stderr",
"receiveTimestamp": "2024-07-28T15:50:22.384543464Z"
}
Explore cluster metrics
- In the Console, go to Navigation menu > Monitoring > Dashboards. Notice that the Anthos on bare metal installer has created some dashboards for you already.
Click the Anthos cluster control plane uptime to see the availability of your cluster. It should look something like this.
Explore the other dashboards and the metrics available.
Explore application metrics
- In the Console, go to Navigation menu > Monitoring > Dashboards.
- In the dashboards list, locate the Anthos cluster pod status dashboard and click Open dashboard settings icon > Copy Dashboard.
- Name it Hello App Anthos cluster pod status and click Copy.
- Click the newly created dashboard. You see that it's gathering a variety of metrics for all pods in your cluster. Let's change it so that it only monitors the hello-app pod that we created earlier. To do so, click the Edit Dashboard button at the top right of the screen.
- Change all the metrics to only monitor your hello-app pod. To do so, select one of the graphs and notice that an Options tab opens on the left side of the screen. Remove all filters, and create a single filter with the following details. Then, click Done and do the same in the next graph.
Congratulations! You have successfully created a dashboard in Google Cloud to monitor your hello-app pod running on your on-premises cluster.
Install Anthos Service Mesh
- Come back to the Cloud Shell window you were using to SSH into the admin workstation. Create an overlay file that will be used in the Anthos Service Mesh installation to enable Cloud Trace:
root@abm-ws:~/baremetal# cat <<'EOF' > cloud-trace.yaml
> apiVersion: install.istio.io/v1alpha1
> kind: IstioOperator
> spec:
>   meshConfig:
>     enableTracing: true
>   values:
>     global:
>       proxy:
>         tracer: stackdriver
> EOF
root@abm-ws:~/baremetal#
Download the asmcli tool to install Anthos Service Mesh on your cluster:
root@abm-ws:~/baremetal# curl https://storage.googleapis.com/csm-artifacts/asm/asmcli_1.15 > asmcli
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 182k 100 182k 0 0 1394k 0 --:--:-- --:--:-- --:--:-- 1394k
root@abm-ws:~/baremetal# chmod +x asmcli
root@abm-ws:~/baremetal#
Clean up the environment variables to make sure there are no conflicts with Kubeconfig when installing Anthos Service Mesh:
root@abm-ws:~/baremetal# export PROJECT_ID=
root@abm-ws:~/baremetal# export CLUSTER_NAME=
root@abm-ws:~/baremetal# export CLUSTER_LOCATION=
root@abm-ws:~/baremetal#
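Note that the install command below references $FLEET_PROJECT_ID, which is not set in the transcript above. Assuming the fleet host project is the same as your lab project, one way to set it on the admin workstation is:

export FLEET_PROJECT_ID=$(gcloud config get-value project)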
Install Anthos Service Mesh in your cluster using the Cloud Trace overlay:
root@abm-ws:~/baremetal# ./asmcli install \
> --kubeconfig ./bmctl-workspace/abm-hybrid-cluster/abm-hybrid-cluster-kubeconfig \
> --fleet_id $FLEET_PROJECT_ID \
> --output_dir . \
> --platform multicloud \
> --enable_all \
> --ca mesh_ca \
> --custom_overlay cloud-trace.yaml
asmcli: Reading cluster information for abm-hybrid-cluster-admin@abm-hybrid-cluster
asmcli: Setting up necessary files...
asmcli: Using /root/baremetal/bmctl-workspace/abm-hybrid-cluster/abm-hybrid-cluster-kubeconfig as the kubeconfig...
asmcli: Checking installation tool dependencies...
asmcli: kubeconfig set to /root/baremetal/bmctl-workspace/abm-hybrid-cluster/abm-hybrid-cluster-kubeconfig
asmcli: using context abm-hybrid-cluster-admin@abm-hybrid-cluster
asmcli: Getting account information...
asmcli: Downloading kpt..
asmcli: Downloading ASM..
asmcli: Downloading ASM kpt package...
fetching package "/asm" from "https://github.com/GoogleCloudPlatform/anthos-service-mesh-packages" to "asm"
fetching package "/samples" from "https://github.com/GoogleCloudPlatform/anthos-service-mesh-packages" to "samples"
asmcli: Verifying cluster registration.
asmcli: Verified cluster is registered to qwiklabs-gcp-00-6b98c2173360
asmcli: Enabling required APIs...
asmcli: Verifying cluster registration.
asmcli: Verified cluster is registered to qwiklabs-gcp-00-6b98c2173360
asmcli: Verifying cluster registration.
asmcli: Verified cluster is registered to qwiklabs-gcp-00-6b98c2173360
asmcli: Checking for project qwiklabs-gcp-00-6b98c2173360...
asmcli: Adding labels to abm-hybrid-cluster
authority:
identityProvider: https://gkehub.googleapis.com/projects/qwiklabs-gcp-00-6b98c2173360/locations/global/memberships/abm-hybrid-cluster
issuer: https://kubernetes.default.svc.cluster.local
oidcJwks: eyJrZXlzIjpbeyJ1c2UiOiJzaWciLCJrdHkiOiJSU0EiLCJraWQiOiJzdEpia1JGQ0I2WDd0ODhjUU9FSWRWX1VlRmJwYmVhakRXME1xdnpGNG1FIiwiYWxnIjoiUlMyNTYiLCJuIjoicl9KTzI0WGpoa0hJRTZSMy1RWldTOWZXS1hEay1UYkQ2Tnp1THVJdkdZQmdoXzM1U2s4ZnExYmtnemE1SjQyUldHUXo4elJIVXo0bmtyRy1CV3VLY0dCckZTZEJ5ZWJKTzRfNnJ2WFNNOHFGNV8zS2VnQXNib0U5T3ZVbzl2VThfdmVRQm01OFVkanhNaVdMMTFSQkxtRThIUi1OamJtTUtqSW52Q0ttck9GUTNMTmI4azNnTkdTTDdFZTdyUEE0VnNIdXJBSHNjaTlJTFdidnBLYTF1YXAzdy0xSFZ2WGY3RC0tVGtsRVBqVC16MXNlX3dzT0FsSm41azlaX0xhUktqV3NRS2w1TFVYNEVoZUVZUnJTRlh3WUx0V0w3NDV2a1pxTklUblV1M0ZvUTMzVS05S0RIdVBxSXpSUWxURElxRGh2Q2xZQ0RDVlVWaEsxTjUzTlR3IiwiZSI6IkFRQUIifV19
workloadIdentityPool: qwiklabs-gcp-00-6b98c2173360.svc.id.goog
createTime: '2024-07-28T07:15:15.483900886Z'
description: abm-hybrid-cluster
endpoint:
kubernetesMetadata:
kubernetesApiServerVersion: v1.26.2-gke.1001
memoryMb: 33526
nodeCount: 2
nodeProviderId: baremetal
updateTime: '2024-07-28T16:06:40.489873988Z'
vcpuCount: 8
kubernetesResource:
resourceOptions:
connectVersion: 20230224-00-00
k8sVersion: '1.26'
externalId: d32d1228-bae4-46d8-9a8c-9bc53b7bc29b
labels:
mesh_id: proj-1061033565780
lastConnectionTime: '2024-07-28T16:06:47.377415161Z'
name: projects/qwiklabs-gcp-00-6b98c2173360/locations/global/memberships/abm-hybrid-cluster
state:
code: READY
uniqueId: 2268ee2b-0753-4b5e-a5eb-9f7f3648cc7a
updateTime: '2024-07-28T16:07:02.476725940Z'
asmcli: Querying for core/account...
asmcli: Binding abm-builder-sa@qwiklabs-gcp-00-6b98c2173360.iam.gserviceaccount.com to cluster admin role...
clusterrolebinding.rbac.authorization.k8s.io/abm-builder-sa-cluster-admin-binding created
asmcli: Creating istio-system namespace...
namespace/istio-system created
asmcli: Initializing meshconfig API...
asmcli: Cluster has Membership ID abm-hybrid-cluster in the Hub of project qwiklabs-gcp-00-6b98c2173360
asmcli: Binding serviceAccount:abm-builder-sa@qwiklabs-gcp-00-6b98c2173360.iam.gserviceaccount.com to required IAM roles...
asmcli: Configuring kpt package...
asm/
set 19 field(s) of setter "gcloud.core.project" to value "qwiklabs-gcp-00-6b98c2173360"
asm/
set 2 field(s) of setter "gcloud.project.projectNumber" to value "1061033565780"
asm/
set 16 field(s) of setter "gcloud.container.cluster" to value "abm-hybrid-cluster"
asm/
set 16 field(s) of setter "gcloud.compute.location" to value "global"
asm/
set 1 field(s) of setter "gcloud.compute.network" to value "default"
asm/
set 3 field(s) of setter "gcloud.project.environProjectNumber" to value "1061033565780"
asm/
set 2 field(s) of setter "anthos.servicemesh.rev" to value "asm-1157-23"
asm/
set 3 field(s) of setter "anthos.servicemesh.tag" to value "1.15.7-asm.23"
asm/
set 3 field(s) of setter "anthos.servicemesh.trustDomain" to value "qwiklabs-gcp-00-6b98c2173360.svc.id.goog"
asm/
set 1 field(s) of setter "anthos.servicemesh.tokenAudiences" to value "istio-ca,qwiklabs-gcp-00-6b98c2173360.svc.id.goog"
asm/
set 3 field(s) of setter "anthos.servicemesh.created-by" to value "asmcli-1.15.7-asm.23.config1"
asm/
set 2 field(s) of setter "anthos.servicemesh.idp-url" to value "https://gkehub.googleapis.com/projects/qwiklabs-gcp-00-6b98c2173360/locations/global/memberships/abm-hybrid-cluster"
asm/
set 2 field(s) of setter "anthos.servicemesh.trustDomainAliases" to value "qwiklabs-gcp-00-6b98c2173360.svc.id.goog"
namespace/istio-system labeled
asmcli: Installing validation webhook fix...
service/istiod created
asmcli: Installing ASM control plane...
Thank you for installing Istio 1.15. Please take a few minutes to tell us about your install/upgrade experience! https://forms.gle/SWHFBmwJspusK1hv6
asmcli: ...done!
asmcli: Installing ASM CanonicalService controller in asm-system namespace...
namespace/asm-system created
customresourcedefinition.apiextensions.k8s.io/canonicalservices.anthos.cloud.google.com created
role.rbac.authorization.k8s.io/canonical-service-leader-election-role created
clusterrole.rbac.authorization.k8s.io/canonical-service-manager-role created
clusterrole.rbac.authorization.k8s.io/canonical-service-metrics-reader created
serviceaccount/canonical-service-account created
rolebinding.rbac.authorization.k8s.io/canonical-service-leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/canonical-service-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/canonical-service-proxy-rolebinding created
service/canonical-service-controller-manager-metrics-service created
deployment.apps/canonical-service-controller-manager created
asmcli: Waiting for deployment...
deployment.apps/canonical-service-controller-manager condition met
asmcli: ...done!
asmcli:
asmcli: *****************************
client version: 1.15.7-asm.23
istiod version: 1.15.7
istiod version: 1.15.7-asm.23
data plane version: none
asmcli: *****************************
asmcli: The ASM control plane installation is now complete.
asmcli: To enable automatic sidecar injection on a namespace, you can use the following command:
asmcli: kubectl label namespace <NAMESPACE> istio-injection- istio.io/rev=asm-1157-23 --overwrite
asmcli: If you use 'istioctl install' afterwards to modify this installation, you will need
asmcli: to specify the option '--set revision=asm-1157-23' to target this control plane
asmcli: instead of installing a new one.
asmcli: To finish the installation, enable Istio sidecar injection and restart your workloads.
asmcli: For more information, see:
asmcli: https://cloud.google.com/service-mesh/docs/proxy-injection
asmcli: The ASM package used for installation can be found at:
asmcli: /root/baremetal/asm
asmcli: The version of istioctl that matches the installation can be found at:
asmcli: /root/baremetal/istio-1.15.7-asm.23/bin/istioctl
asmcli: A symlink to the istioctl binary can be found at:
asmcli: /root/baremetal/istioctl
asmcli: The combined configuration generated for installation can be found at:
asmcli: /root/baremetal/asm-1157-23-manifest-raw.yaml
asmcli: The full, expanded set of kubernetes resources can be found at:
asmcli: /root/baremetal/asm-1157-23-manifest-expanded.yaml
asmcli: *****************************
asmcli: Successfully installed ASM.
root@abm-ws:~/baremetal#
Explore application tracing
- Create and label a demo namespace to enable automatic sidecar injection:
root@abm-ws:~/baremetal# export ASM_REV=$(kubectl -n istio-system get pods -l app=istiod -o json | jq -r '.items[0].metadata.labels["istio.io/rev"]')
root@abm-ws:~/baremetal# kubectl create namespace demo
namespace/demo created
root@abm-ws:~/baremetal# kubectl label namespace demo istio.io/rev=$ASM_REV --overwrite
namespace/demo labeled
root@abm-ws:~/baremetal#
Install the Hipster Shop application in the demo namespace:
root@abm-ws:~/baremetal# kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/master/release/kubernetes-manifests.yaml -n demo
deployment.apps/currencyservice created
service/currencyservice created
serviceaccount/currencyservice created
deployment.apps/loadgenerator created
serviceaccount/loadgenerator created
deployment.apps/productcatalogservice created
service/productcatalogservice created
serviceaccount/productcatalogservice created
deployment.apps/checkoutservice created
service/checkoutservice created
serviceaccount/checkoutservice created
deployment.apps/shippingservice created
service/shippingservice created
serviceaccount/shippingservice created
deployment.apps/cartservice created
service/cartservice created
serviceaccount/cartservice created
deployment.apps/redis-cart created
service/redis-cart created
deployment.apps/emailservice created
service/emailservice created
serviceaccount/emailservice created
deployment.apps/paymentservice created
service/paymentservice created
serviceaccount/paymentservice created
deployment.apps/frontend created
service/frontend created
service/frontend-external created
serviceaccount/frontend created
deployment.apps/recommendationservice created
service/recommendationservice created
serviceaccount/recommendationservice created
deployment.apps/adservice created
service/adservice created
serviceaccount/adservice created
root@abm-ws:~/baremetal# kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/master/release/istio-manifests.yaml -n demo
serviceentry.networking.istio.io/allow-egress-googleapis created
serviceentry.networking.istio.io/allow-egress-google-metadata created
virtualservice.networking.istio.io/frontend created
resource mapping not found for name: "istio-gateway" namespace: "" from "https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/master/release/istio-manifests.yaml": no matches for kind "Gateway" in version "gateway.networking.k8s.io/v1beta1"
ensure CRDs are installed first
resource mapping not found for name: "frontend-route" namespace: "" from "https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/master/release/istio-manifests.yaml": no matches for kind "HTTPRoute" in version "gateway.networking.k8s.io/v1beta1"
ensure CRDs are installed first
root@abm-ws:~/baremetal# kubectl patch deployments/productcatalogservice -p '{"spec":{"template":{"metadata":{"labels":{"version":"v1"}}}}}' -n demo
deployment.apps/productcatalogservice patched
root@abm-ws:~/baremetal#
To be able to access the application from outside the cluster, install the ingress Gateway:
root@abm-ws:~/baremetal# git clone https://github.com/GoogleCloudPlatform/anthos-service-mesh-packages
Cloning into 'anthos-service-mesh-packages'...
remote: Enumerating objects: 12080, done.
remote: Counting objects: 100% (1485/1485), done.
remote: Compressing objects: 100% (506/506), done.
remote: Total 12080 (delta 1109), reused 1328 (delta 976), pack-reused 10595
Receiving objects: 100% (12080/12080), 3.04 MiB | 20.89 MiB/s, done.
Resolving deltas: 2% (169/8295)
Resolving deltas: 100% (8295/8295), done.
root@abm-ws:~/baremetal# kubectl apply -f anthos-service-mesh-packages/samples/gateways/istio-ingressgateway -n demo
horizontalpodautoscaler.autoscaling/istio-ingressgateway created
deployment.apps/istio-ingressgateway created
poddisruptionbudget.policy/istio-ingressgateway created
role.rbac.authorization.k8s.io/istio-ingressgateway created
rolebinding.rbac.authorization.k8s.io/istio-ingressgateway created
service/istio-ingressgateway created
serviceaccount/istio-ingressgateway created
root@abm-ws:~/baremetal#
Configure the Gateway:
root@abm-ws:~/baremetal# kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/master/release/istio-manifests.yaml -n demo
serviceentry.networking.istio.io/allow-egress-googleapis unchanged
serviceentry.networking.istio.io/allow-egress-google-metadata unchanged
virtualservice.networking.istio.io/frontend unchanged
resource mapping not found for name: "istio-gateway" namespace: "" from "https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/master/release/istio-manifests.yaml": no matches for kind "Gateway" in version "gateway.networking.k8s.io/v1beta1"
ensure CRDs are installed first
resource mapping not found for name: "frontend-route" namespace: "" from "https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/master/release/istio-manifests.yaml": no matches for kind "HTTPRoute" in version "gateway.networking.k8s.io/v1beta1"
ensure CRDs are installed first
root@abm-ws:~/baremetal#
View the Hipster Shop pods that have been created in the demo namespace. Notice that they show 2/2 in the READY column, which means both containers are ready: the application container and the mesh sidecar container.
root@abm-ws:~/baremetal# kubectl get pods -n demo
NAME READY STATUS RESTARTS AGE
adservice-7d94897dd6-zvnwk 2/2 Running 0 115s
cartservice-55dbff556f-845j4 2/2 Running 0 116s
checkoutservice-86cd7d575b-q2frb 2/2 Running 0 117s
currencyservice-65cb46f6db-xzq8q 2/2 Running 0 118s
emailservice-59f785bbb-ktb6g 2/2 Running 0 116s
frontend-6d4b46bd79-724cv 2/2 Running 0 116s
istio-ingressgateway-78d5d78c6-n8vmd 1/1 Running 0 70s
istio-ingressgateway-78d5d78c6-rbfg9 1/1 Running 0 70s
istio-ingressgateway-78d5d78c6-w87n2 1/1 Running 0 70s
loadgenerator-d9c69fc8f-7p9p2 2/2 Running 0 118s
paymentservice-dcd974b78-h7x4p 2/2 Running 0 116s
productcatalogservice-798c7c4b65-6wjcp 2/2 Running 0 109s
recommendationservice-6969c69d8-pf6l5 2/2 Running 0 115s
redis-cart-79b899577-dt2mq 2/2 Running 0 116s
shippingservice-6fc554769d-74zf8 2/2 Running 0 117s
root@abm-ws:~/baremetal#
Get the external IP from the istio-ingressgateway to access the Hipster Shop that you just deployed:
root@abm-ws:~/baremetal# kubectl get svc istio-ingressgateway -n demo
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
istio-ingressgateway LoadBalancer 10.96.7.84 10.200.0.103 15021:31695/TCP,80:31280/TCP,443:32275/TCP 98s
root@abm-ws:~/baremetal#
Access the Hipster Shop using the IP you copied in the previous step:
root@abm-ws:~/baremetal# curl 10.200.0.103
curl: (7) Failed to connect to 10.200.0.103 port 80: Connection refused
root@abm-ws:~/baremetal# curl 10.200.0.103
curl: (7) Failed to connect to 10.200.0.103 port 80: Connection refused
root@abm-ws:~/baremetal# curl 10.200.0.103
curl: (7) Failed to connect to 10.200.0.103 port 80: Connection refused
root@abm-ws:~/baremetal#
In the Console, go to Navigation menu > Trace > Trace List and select a trace from the graph that you would like to review. You see the breakdown by service so that you can investigate bottlenecks and networking issues.
Detect and repair node problems
Many node problems can affect the pods running on the node, such as issues in the kernel, hardware, or container runtime. These problems are invisible to the upstream layers in the cluster management stack.
Node Problem Detector (NPD) detects common node problems and reports node events and conditions. NPD runs as a systemd service on each node and can be enabled and disabled.
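Because NPD runs as a systemd unit, you can also inspect it directly from a node shell. A quick sketch, assuming the unit is named node-problem-detector (verify the name on your nodes):

systemctl status node-problem-detector
journalctl -u node-problem-detector --since "1 hour ago"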
- Come back to the Cloud Shell window you were using to SSH into the admin workstation. Check that the following conditions are being checked on the worker nodes by default:
root@abm-ws:~/baremetal# kubectl describe node abm-user-w2
Name: abm-user-w2
Roles: worker
Labels: baremetal.cluster.gke.io/cgroup=v1
baremetal.cluster.gke.io/k8s-ip=10.200.0.5
baremetal.cluster.gke.io/namespace=cluster-abm-hybrid-cluster
baremetal.cluster.gke.io/node-pool=hybrid-cluster-pool-1
baremetal.cluster.gke.io/version=1.15.0
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
cloud.google.com/gke-nodepool=hybrid-cluster-pool-1
kubernetes.io/arch=amd64
kubernetes.io/hostname=abm-user-w2
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
Annotations: baremetal.cluster.gke.io/provider: baremetal
kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
networking.gke.io/ipv4-subnet: 10.200.0.5/24
networking.gke.io/ipv6-subnet:
networking.gke.io/network-status: [{"name":"pod-network"}]
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 28 Jul 2024 07:10:15 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: abm-user-w2
AcquireTime: <unset>
RenewTime: Sun, 28 Jul 2024 16:14:29 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
ContainerRuntimeUnhealthy False Sun, 28 Jul 2024 16:11:53 +0000 Sun, 28 Jul 2024 07:15:55 +0000 ContainerRuntimeIsHealthy Container runtime on the node is functioning properly
KernelDeadlock False Sun, 28 Jul 2024 16:11:53 +0000 Sun, 28 Jul 2024 07:15:55 +0000 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Sun, 28 Jul 2024 16:11:53 +0000 Sun, 28 Jul 2024 07:15:55 +0000 FilesystemIsNotReadOnly Filesystem is not read-only
FrequentUnregisterNetDevice False Sun, 28 Jul 2024 16:11:53 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentUnregisterNetDevice node is functioning properly
KubeletUnhealthy False Sun, 28 Jul 2024 16:11:53 +0000 Sun, 28 Jul 2024 07:15:55 +0000 KubeletIsHealthy kubelet on the node is functioning properly
FrequentKubeletRestart False Sun, 28 Jul 2024 16:11:53 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Sun, 28 Jul 2024 16:11:53 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Sun, 28 Jul 2024 16:11:53 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentContainerdRestart containerd is functioning properly
MemoryPressure False Sun, 28 Jul 2024 16:11:12 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 28 Jul 2024 16:11:12 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 28 Jul 2024 16:11:12 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 28 Jul 2024 16:11:12 +0000 Sun, 28 Jul 2024 07:10:49 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.200.0.5
Hostname: abm-user-w2
Capacity:
cpu: 4
ephemeral-storage: 259966896Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16369876Ki
pods: 250
Allocatable:
cpu: 3440m
ephemeral-storage: 239585490957
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 13295828Ki
pods: 250
System Info:
Machine ID: 0eb23e41c594f0fed31719f54d8c6735
System UUID: 0eb23e41-c594-f0fe-d317-19f54d8c6735
Boot ID: ee19bf1b-568f-4e07-be1e-a1159d3919dd
Kernel Version: 5.15.0-1062-gcp
OS Image: Ubuntu 20.04.6 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.18-gke.0
Kubelet Version: v1.26.2-gke.1001
Kube-Proxy Version: v1.26.2-gke.1001
PodCIDR: 192.168.2.0/23
PodCIDRs: 192.168.2.0/23
ProviderID: baremetal://10.200.0.5
Non-terminated Pods: (16 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
asm-system canonical-service-controller-manager-8576bf9678-cfzkh 200m (5%) 200m (5%) 40Mi (0%) 600Mi (4%) 6m38s
demo emailservice-59f785bbb-ktb6g 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 4m56s
demo istio-ingressgateway-78d5d78c6-w87n2 100m (2%) 2 (58%) 128Mi (0%) 1Gi (7%) 4m10s
demo loadgenerator-d9c69fc8f-7p9p2 400m (11%) 2500m (72%) 384Mi (2%) 1536Mi (11%) 4m58s
demo paymentservice-dcd974b78-h7x4p 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 4m56s
demo productcatalogservice-798c7c4b65-6wjcp 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 4m49s
demo recommendationservice-6969c69d8-pf6l5 200m (5%) 2200m (63%) 348Mi (2%) 1474Mi (11%) 4m55s
demo shippingservice-6fc554769d-74zf8 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 4m57s
gke-connect gke-connect-agent-20230224-00-00-654bf8d884-lqdmc 0 (0%) 0 (0%) 256Mi (1%) 256Mi (1%) 8h
istio-system istiod-asm-1157-23-96886d6f4-5qdmk 500m (14%) 0 (0%) 2Gi (15%) 0 (0%) 6m53s
kube-system anetd-c2rm2 110m (3%) 500m (14%) 164Mi (1%) 128Mi (0%) 9h
kube-system gke-metrics-agent-8kxjj 50m (1%) 200m (5%) 200Mi (1%) 4608Mi (35%) 9h
kube-system kube-proxy-ln2d6 100m (2%) 0 (0%) 15Mi (0%) 0 (0%) 9h
kube-system localpv-ndrqv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9h
kube-system node-exporter-mhcjx 100m (2%) 200m (5%) 50Mi (0%) 500Mi (3%) 9h
kube-system stackdriver-log-forwarder-f87fs 200m (5%) 600m (17%) 100Mi (0%) 600Mi (4%) 9h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2760m (80%) 17200m (500%)
memory 4501Mi (34%) 15334Mi (118%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
root@abm-ws:~/baremetal#
- In the Conditions section of the output, verify that the following conditions are present:
Checks provided by the Node Problem Detector:
- FrequentContainerdRestart
- NoFrequentContainerdRestart
- KernelHasNoDeadlock
- ReadonlyFilesystem
- FilesystemIsNotReadOnly
- FrequentUnregisterNetDevice
- NoFrequentUnregisterNetDevice
- ContainerRuntimeIsHealthy
- KubeletIsHealthy
- FrequentKubeletRestart
- FrequentDockerRestart
Checks provided by Kubernetes:
- MemoryPressure
- DiskPressure
- PIDPressure
- Ready
- Also, verify that there are no Events indicating a potential problem.
Node Problem Detector is enabled by default in your cluster starting in Anthos clusters on bare metal v1.10.
- Edit the configmap to enable or disable it:
root@abm-ws:~/baremetal# kubectl edit configmap node-problem-detector-config -n cluster-abm-hybrid-cluster
configmap/node-problem-detector-config edited
root@abm-ws:~/baremetal#
Now, when you get the state of the node, nothing will have changed, but you could disable Node Problem Detector by setting the enabled value to false in this configmap (a sketch follows below).
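A sketch of what the configmap data might look like to disable NPD; the enabled key follows the documented pattern, but verify it against your Anthos version before applying:

apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector-config
  namespace: cluster-abm-hybrid-cluster
data:
  enabled: "false"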
- Run kubectl describe again to verify that no changes were made:
root@abm-ws:~/baremetal# kubectl describe node abm-user-w2
Name: abm-user-w2
Roles: worker
Labels: baremetal.cluster.gke.io/cgroup=v1
baremetal.cluster.gke.io/k8s-ip=10.200.0.5
baremetal.cluster.gke.io/namespace=cluster-abm-hybrid-cluster
baremetal.cluster.gke.io/node-pool=hybrid-cluster-pool-1
baremetal.cluster.gke.io/version=1.15.0
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
cloud.google.com/gke-nodepool=hybrid-cluster-pool-1
kubernetes.io/arch=amd64
kubernetes.io/hostname=abm-user-w2
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
Annotations: baremetal.cluster.gke.io/provider: baremetal
kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
networking.gke.io/ipv4-subnet: 10.200.0.5/24
networking.gke.io/ipv6-subnet:
networking.gke.io/network-status: [{"name":"pod-network"}]
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 28 Jul 2024 07:10:15 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: abm-user-w2
AcquireTime: <unset>
RenewTime: Sun, 28 Jul 2024 16:17:43 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
FrequentKubeletRestart False Sun, 28 Jul 2024 16:16:54 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Sun, 28 Jul 2024 16:16:54 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Sun, 28 Jul 2024 16:16:54 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentContainerdRestart containerd is functioning properly
ContainerRuntimeUnhealthy False Sun, 28 Jul 2024 16:16:54 +0000 Sun, 28 Jul 2024 07:15:55 +0000 ContainerRuntimeIsHealthy Container runtime on the node is functioning properly
KernelDeadlock False Sun, 28 Jul 2024 16:16:54 +0000 Sun, 28 Jul 2024 07:15:55 +0000 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Sun, 28 Jul 2024 16:16:54 +0000 Sun, 28 Jul 2024 07:15:55 +0000 FilesystemIsNotReadOnly Filesystem is not read-only
FrequentUnregisterNetDevice False Sun, 28 Jul 2024 16:16:54 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentUnregisterNetDevice node is functioning properly
KubeletUnhealthy False Sun, 28 Jul 2024 16:16:54 +0000 Sun, 28 Jul 2024 07:15:55 +0000 KubeletIsHealthy kubelet on the node is functioning properly
MemoryPressure False Sun, 28 Jul 2024 16:16:18 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 28 Jul 2024 16:16:18 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 28 Jul 2024 16:16:18 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 28 Jul 2024 16:16:18 +0000 Sun, 28 Jul 2024 07:10:49 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.200.0.5
Hostname: abm-user-w2
Capacity:
cpu: 4
ephemeral-storage: 259966896Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16369876Ki
pods: 250
Allocatable:
cpu: 3440m
ephemeral-storage: 239585490957
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 13295828Ki
pods: 250
System Info:
Machine ID: 0eb23e41c594f0fed31719f54d8c6735
System UUID: 0eb23e41-c594-f0fe-d317-19f54d8c6735
Boot ID: ee19bf1b-568f-4e07-be1e-a1159d3919dd
Kernel Version: 5.15.0-1062-gcp
OS Image: Ubuntu 20.04.6 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.18-gke.0
Kubelet Version: v1.26.2-gke.1001
Kube-Proxy Version: v1.26.2-gke.1001
PodCIDR: 192.168.2.0/23
PodCIDRs: 192.168.2.0/23
ProviderID: baremetal://10.200.0.5
Non-terminated Pods: (16 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
asm-system canonical-service-controller-manager-8576bf9678-cfzkh 200m (5%) 200m (5%) 40Mi (0%) 600Mi (4%) 9m51s
demo emailservice-59f785bbb-ktb6g 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 8m9s
demo istio-ingressgateway-78d5d78c6-w87n2 100m (2%) 2 (58%) 128Mi (0%) 1Gi (7%) 7m23s
demo loadgenerator-d9c69fc8f-7p9p2 400m (11%) 2500m (72%) 384Mi (2%) 1536Mi (11%) 8m11s
demo paymentservice-dcd974b78-h7x4p 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 8m9s
demo productcatalogservice-798c7c4b65-6wjcp 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 8m2s
demo recommendationservice-6969c69d8-pf6l5 200m (5%) 2200m (63%) 348Mi (2%) 1474Mi (11%) 8m8s
demo shippingservice-6fc554769d-74zf8 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 8m10s
gke-connect gke-connect-agent-20230224-00-00-654bf8d884-lqdmc 0 (0%) 0 (0%) 256Mi (1%) 256Mi (1%) 9h
istio-system istiod-asm-1157-23-96886d6f4-5qdmk 500m (14%) 0 (0%) 2Gi (15%) 0 (0%) 10m
kube-system anetd-c2rm2 110m (3%) 500m (14%) 164Mi (1%) 128Mi (0%) 9h
kube-system gke-metrics-agent-8kxjj 50m (1%) 200m (5%) 200Mi (1%) 4608Mi (35%) 9h
kube-system kube-proxy-ln2d6 100m (2%) 0 (0%) 15Mi (0%) 0 (0%) 9h
kube-system localpv-ndrqv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9h
kube-system node-exporter-mhcjx 100m (2%) 200m (5%) 50Mi (0%) 500Mi (3%) 9h
kube-system stackdriver-log-forwarder-f87fs 200m (5%) 600m (17%) 100Mi (0%) 600Mi (4%) 9h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2760m (80%) 17200m (500%)
memory 4501Mi (34%) 15334Mi (118%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
root@abm-ws:~/baremetal#
- Verify that the same conditions are being displayed:
- FrequentContainerdRestart
- NoFrequentContainerdRestart
- KernelHasNoDeadlock
- ReadonlyFilesystem
- FilesystemIsNotReadOnly
- FrequentUnregisterNetDevice
- NoFrequentUnregisterNetDevice
- ContainerRuntimeIsHealthy
- KubeletIsHealthy
- FrequentKubeletRestart
- FrequentDockerRestart
- MemoryPressure
- DiskPressure
- PIDPressure
- Ready
- Also, verify that there are no Events indicating a potential problem.
Introduction of a node problem
Let’s introduce a problem in one of the nodes to see if NPD reports it correctly.
- Open a new tab in the Google Cloud Console.
- SSH into the abm-user-w2 worker node:
- Stop the container runtime to simulate a problem in the node:
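The commands for these two steps are not shown in the transcript. A plausible sequence, assuming the worker VM name abm-user-w2 and the same zone used earlier (us-central1-b), is:

# From the new Cloud Shell tab
gcloud compute ssh --ssh-flag="-A" root@abm-user-w2 --zone us-central1-b --tunnel-through-iap
# Then, on the node, stop containerd to simulate a container runtime failure
systemctl stop containerd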
- Go back to the first Cloud Shell window and check the conditions of the node where we stopped containerd:
root@abm-ws:~/baremetal# kubectl describe node abm-user-w2
Name: abm-user-w2
Roles: worker
Labels: baremetal.cluster.gke.io/cgroup=v1
baremetal.cluster.gke.io/k8s-ip=10.200.0.5
baremetal.cluster.gke.io/namespace=cluster-abm-hybrid-cluster
baremetal.cluster.gke.io/node-pool=hybrid-cluster-pool-1
baremetal.cluster.gke.io/version=1.15.0
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
cloud.google.com/gke-nodepool=hybrid-cluster-pool-1
kubernetes.io/arch=amd64
kubernetes.io/hostname=abm-user-w2
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
Annotations: baremetal.cluster.gke.io/provider: baremetal
kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
networking.gke.io/ipv4-subnet: 10.200.0.5/24
networking.gke.io/ipv6-subnet:
networking.gke.io/network-status: [{"name":"pod-network"}]
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 28 Jul 2024 07:10:15 +0000
Taints: node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/not-ready:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: abm-user-w2
AcquireTime: <unset>
RenewTime: Sun, 28 Jul 2024 16:19:56 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
ContainerRuntimeUnhealthy True Sun, 28 Jul 2024 16:19:26 +0000 Sun, 28 Jul 2024 16:19:25 +0000 ContainerdUnhealthy cri:containerd was found unhealthy; repair flag : true
KernelDeadlock False Sun, 28 Jul 2024 16:19:26 +0000 Sun, 28 Jul 2024 07:15:55 +0000 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Sun, 28 Jul 2024 16:19:26 +0000 Sun, 28 Jul 2024 07:15:55 +0000 FilesystemIsNotReadOnly Filesystem is not read-only
FrequentUnregisterNetDevice False Sun, 28 Jul 2024 16:19:26 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentUnregisterNetDevice node is functioning properly
KubeletUnhealthy False Sun, 28 Jul 2024 16:19:26 +0000 Sun, 28 Jul 2024 07:15:55 +0000 KubeletIsHealthy kubelet on the node is functioning properly
FrequentKubeletRestart False Sun, 28 Jul 2024 16:19:26 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Sun, 28 Jul 2024 16:19:26 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Sun, 28 Jul 2024 16:19:26 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentContainerdRestart containerd is functioning properly
MemoryPressure False Sun, 28 Jul 2024 16:19:43 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 28 Jul 2024 16:19:43 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 28 Jul 2024 16:19:43 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Sun, 28 Jul 2024 16:19:43 +0000 Sun, 28 Jul 2024 16:19:43 +0000 KubeletNotReady container runtime is down
Addresses:
InternalIP: 10.200.0.5
Hostname: abm-user-w2
Capacity:
cpu: 4
ephemeral-storage: 259966896Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16369876Ki
pods: 250
Allocatable:
cpu: 3440m
ephemeral-storage: 239585490957
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 13295828Ki
pods: 250
System Info:
Machine ID: 0eb23e41c594f0fed31719f54d8c6735
System UUID: 0eb23e41-c594-f0fe-d317-19f54d8c6735
Boot ID: ee19bf1b-568f-4e07-be1e-a1159d3919dd
Kernel Version: 5.15.0-1062-gcp
OS Image: Ubuntu 20.04.6 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://Unknown
Kubelet Version: v1.26.2-gke.1001
Kube-Proxy Version: v1.26.2-gke.1001
PodCIDR: 192.168.2.0/23
PodCIDRs: 192.168.2.0/23
ProviderID: baremetal://10.200.0.5
Non-terminated Pods: (16 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
asm-system canonical-service-controller-manager-8576bf9678-cfzkh 200m (5%) 200m (5%) 40Mi (0%) 600Mi (4%) 12m
demo emailservice-59f785bbb-ktb6g 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 10m
demo istio-ingressgateway-78d5d78c6-w87n2 100m (2%) 2 (58%) 128Mi (0%) 1Gi (7%) 9m36s
demo loadgenerator-d9c69fc8f-7p9p2 400m (11%) 2500m (72%) 384Mi (2%) 1536Mi (11%) 10m
demo paymentservice-dcd974b78-h7x4p 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 10m
demo productcatalogservice-798c7c4b65-6wjcp 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 10m
demo recommendationservice-6969c69d8-pf6l5 200m (5%) 2200m (63%) 348Mi (2%) 1474Mi (11%) 10m
demo shippingservice-6fc554769d-74zf8 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 10m
gke-connect gke-connect-agent-20230224-00-00-654bf8d884-lqdmc 0 (0%) 0 (0%) 256Mi (1%) 256Mi (1%) 9h
istio-system istiod-asm-1157-23-96886d6f4-5qdmk 500m (14%) 0 (0%) 2Gi (15%) 0 (0%) 12m
kube-system anetd-c2rm2 110m (3%) 500m (14%) 164Mi (1%) 128Mi (0%) 9h
kube-system gke-metrics-agent-8kxjj 50m (1%) 200m (5%) 200Mi (1%) 4608Mi (35%) 9h
kube-system kube-proxy-ln2d6 100m (2%) 0 (0%) 15Mi (0%) 0 (0%) 9h
kube-system localpv-ndrqv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9h
kube-system node-exporter-mhcjx 100m (2%) 200m (5%) 50Mi (0%) 500Mi (3%) 9h
kube-system stackdriver-log-forwarder-f87fs 200m (5%) 600m (17%) 100Mi (0%) 600Mi (4%) 9h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2760m (80%) 17200m (500%)
memory 4501Mi (34%) 15334Mi (118%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ContainerdUnhealthy 40s health-checker Node condition ContainerRuntimeUnhealthy is now: True, reason: ContainerdUnhealthy, message: "cri:containerd was found unhealthy; repair flag : true"
Normal NodeNotReady 22s kubelet Node abm-user-w2 status is now: NodeNotReady
Warning ContainerGCFailed 1s kubelet rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: no such file or directory"
root@abm-ws:~/baremetal#
Notice that the status of the condition “ContainerRuntimeUnhealthy” is now “True”. Also, notice that the following events now appear:
- ContainerdUnhealthy
- NodeNotReady
- ContainerGCFailed
These conditions and events might take a few seconds to appear; re-run the command if you don’t see them yet.
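If you would rather poll just the node’s events instead of re-running the full describe, a sketch using a field selector:
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=abm-user-w2 --sort-by=.lastTimestamp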
- In the second Cloud Shell window, you can get more information by running journalctl. Press q to exit the pager:
root@abm-user-w2:~# journalctl -u node-problem-detector
-- Logs begin at Sun 2024-07-28 06:51:55 UTC, end at Sun 2024-07-28 16:21:00 UTC. --
Jul 28 07:15:55 abm-user-w2 systemd[1]: Started Kubernetes node problem detector.
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.590102 13978 custom_plugin_monitor.go:80] Finish parsing custom plugin monitor config file /opt/node-pro>
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.591259 13978 custom_plugin_monitor.go:80] Finish parsing custom plugin monitor config file /opt/node-pro>
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.591615 13978 custom_plugin_monitor.go:80] Finish parsing custom plugin monitor config file /opt/node-pro>
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.591937 13978 custom_plugin_monitor.go:80] Finish parsing custom plugin monitor config file /opt/node-pro>
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.592647 13978 log_monitor.go:79] Finish parsing log monitor config file /opt/node-problem-detector/config>
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.592834 13978 log_watchers.go:40] Use log watcher of plugin "kmsg"
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.593183 13978 log_monitor.go:79] Finish parsing log monitor config file /opt/node-problem-detector/config>
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.593356 13978 log_watchers.go:40] Use log watcher of plugin "journald"
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.596971 13978 k8s_exporter.go:54] Waiting for kube-apiserver to be ready (timeout 5m0s)...
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.686318 13978 node_problem_detector.go:63] K8s exporter started.
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.686361 13978 custom_plugin_monitor.go:111] Start custom plugin monitor /opt/node-problem-detector/config>
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.686379 13978 custom_plugin_monitor.go:111] Start custom plugin monitor /opt/node-problem-detector/config>
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.686388 13978 custom_plugin_monitor.go:111] Start custom plugin monitor /opt/node-problem-detector/config>
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.686406 13978 custom_plugin_monitor.go:111] Start custom plugin monitor /opt/node-problem-detector/config>
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.686416 13978 log_monitor.go:111] Start log monitor /opt/node-problem-detector/config/kernel-monitor.json
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.686490 13978 log_monitor.go:111] Start log monitor /opt/node-problem-detector/config/systemd-monitor.json
Jul 28 07:15:55 abm-user-w2 node-problem-detector[13978]: I0728 07:15:55.686691 13978 custom_plugin_monitor.go:300] Initialize condition generated: [{Type:ContainerRuntimeUnheal>
lines 1-19
Alternatively, you can check these logs in Cloud Logging. In the Console, go to Navigation menu > Logging > Logs Explorer and enter the following query. Replace PROJECT_ID in the query with your project ID, then click Run query:
resource.type="k8s_node"
resource.labels.node_name="abm-user-w2"
log_name="projects/PROJECT_ID/logs/node-problem-detector"
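If you prefer the command line, an equivalent gcloud query (a sketch; replace PROJECT_ID with your project ID) is:
gcloud logging read \
  'resource.type="k8s_node" AND resource.labels.node_name="abm-user-w2" AND log_name="projects/PROJECT_ID/logs/node-problem-detector"' \
  --project PROJECT_ID --limit 20 --freshness 1h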
Repairing the node problem
- To repair the node problem, go to the second Cloud Shell window, and start the container runtime again:
root@abm-user-w2:~# sudo systemctl start containerd
root@abm-user-w2:~#
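Optionally, confirm on the worker that the runtime is active again before checking the node (a quick sanity check):
# On abm-user-w2
sudo systemctl is-active containerd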
- Check the status of the node until the condition “ContainerRuntimeUnhealthy” is back to False (a compact one-line check is sketched after the output below):
root@abm-ws:~/baremetal# kubectl describe node abm-user-w2
Name: abm-user-w2
Roles: worker
Labels: baremetal.cluster.gke.io/cgroup=v1
baremetal.cluster.gke.io/k8s-ip=10.200.0.5
baremetal.cluster.gke.io/namespace=cluster-abm-hybrid-cluster
baremetal.cluster.gke.io/node-pool=hybrid-cluster-pool-1
baremetal.cluster.gke.io/version=1.15.0
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
cloud.google.com/gke-nodepool=hybrid-cluster-pool-1
kubernetes.io/arch=amd64
kubernetes.io/hostname=abm-user-w2
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
Annotations: baremetal.cluster.gke.io/provider: baremetal
kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
networking.gke.io/ipv4-subnet: 10.200.0.5/24
networking.gke.io/ipv6-subnet:
networking.gke.io/network-status: [{"name":"pod-network"}]
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 28 Jul 2024 07:10:15 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: abm-user-w2
AcquireTime: <unset>
RenewTime: Sun, 28 Jul 2024 16:22:18 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
FrequentKubeletRestart False Sun, 28 Jul 2024 16:22:06 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Sun, 28 Jul 2024 16:22:06 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Sun, 28 Jul 2024 16:22:06 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentContainerdRestart containerd is functioning properly
ContainerRuntimeUnhealthy False Sun, 28 Jul 2024 16:22:06 +0000 Sun, 28 Jul 2024 16:22:05 +0000 ContainerRuntimeIsHealthy Container runtime on the node is functioning properly
KernelDeadlock False Sun, 28 Jul 2024 16:22:06 +0000 Sun, 28 Jul 2024 07:15:55 +0000 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Sun, 28 Jul 2024 16:22:06 +0000 Sun, 28 Jul 2024 07:15:55 +0000 FilesystemIsNotReadOnly Filesystem is not read-only
FrequentUnregisterNetDevice False Sun, 28 Jul 2024 16:22:06 +0000 Sun, 28 Jul 2024 07:15:55 +0000 NoFrequentUnregisterNetDevice node is functioning properly
KubeletUnhealthy False Sun, 28 Jul 2024 16:22:06 +0000 Sun, 28 Jul 2024 07:15:55 +0000 KubeletIsHealthy kubelet on the node is functioning properly
MemoryPressure False Sun, 28 Jul 2024 16:22:16 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 28 Jul 2024 16:22:16 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 28 Jul 2024 16:22:16 +0000 Sun, 28 Jul 2024 07:10:15 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 28 Jul 2024 16:22:16 +0000 Sun, 28 Jul 2024 16:22:16 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.200.0.5
Hostname: abm-user-w2
Capacity:
cpu: 4
ephemeral-storage: 259966896Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16369876Ki
pods: 250
Allocatable:
cpu: 3440m
ephemeral-storage: 239585490957
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 13295828Ki
pods: 250
System Info:
Machine ID: 0eb23e41c594f0fed31719f54d8c6735
System UUID: 0eb23e41-c594-f0fe-d317-19f54d8c6735
Boot ID: ee19bf1b-568f-4e07-be1e-a1159d3919dd
Kernel Version: 5.15.0-1062-gcp
OS Image: Ubuntu 20.04.6 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.18-gke.0
Kubelet Version: v1.26.2-gke.1001
Kube-Proxy Version: v1.26.2-gke.1001
PodCIDR: 192.168.2.0/23
PodCIDRs: 192.168.2.0/23
ProviderID: baremetal://10.200.0.5
Non-terminated Pods: (16 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
asm-system canonical-service-controller-manager-8576bf9678-cfzkh 200m (5%) 200m (5%) 40Mi (0%) 600Mi (4%) 14m
demo emailservice-59f785bbb-ktb6g 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 12m
demo istio-ingressgateway-78d5d78c6-w87n2 100m (2%) 2 (58%) 128Mi (0%) 1Gi (7%) 11m
demo loadgenerator-d9c69fc8f-7p9p2 400m (11%) 2500m (72%) 384Mi (2%) 1536Mi (11%) 12m
demo paymentservice-dcd974b78-h7x4p 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 12m
demo productcatalogservice-798c7c4b65-6wjcp 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 12m
demo recommendationservice-6969c69d8-pf6l5 200m (5%) 2200m (63%) 348Mi (2%) 1474Mi (11%) 12m
demo shippingservice-6fc554769d-74zf8 200m (5%) 2200m (63%) 192Mi (1%) 1152Mi (8%) 12m
gke-connect gke-connect-agent-20230224-00-00-654bf8d884-lqdmc 0 (0%) 0 (0%) 256Mi (1%) 256Mi (1%) 9h
istio-system istiod-asm-1157-23-96886d6f4-5qdmk 500m (14%) 0 (0%) 2Gi (15%) 0 (0%) 14m
kube-system anetd-c2rm2 110m (3%) 500m (14%) 164Mi (1%) 128Mi (0%) 9h
kube-system gke-metrics-agent-8kxjj 50m (1%) 200m (5%) 200Mi (1%) 4608Mi (35%) 9h
kube-system kube-proxy-ln2d6 100m (2%) 0 (0%) 15Mi (0%) 0 (0%) 9h
kube-system localpv-ndrqv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9h
kube-system node-exporter-mhcjx 100m (2%) 200m (5%) 50Mi (0%) 500Mi (3%) 9h
kube-system stackdriver-log-forwarder-f87fs 200m (5%) 600m (17%) 100Mi (0%) 600Mi (4%) 9h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2760m (80%) 17200m (500%)
memory 4501Mi (34%) 15334Mi (118%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ContainerdUnhealthy 3m3s health-checker Node condition ContainerRuntimeUnhealthy is now: True, reason: ContainerdUnhealthy, message: "cri:containerd was found unhealthy; repair flag : true"
Normal NodeNotReady 2m45s kubelet Node abm-user-w2 status is now: NodeNotReady
Warning ContainerdStart 31s systemd-monitor Starting containerd container runtime...
Warning ContainerGCFailed 24s (x3 over 2m24s) kubelet rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: no such file or directory"
Normal ContainerRuntimeIsHealthy 23s health-checker Node condition ContainerRuntimeUnhealthy is now: False, reason: ContainerRuntimeIsHealthy, message: "Container runtime on the node is functioning properly"
Normal NodeReady 12s (x2 over 9h) kubelet Node abm-user-w2 status is now: NodeReady
root@abm-ws:~/baremetal#
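As referenced above, a compact check (a sketch) that prints only the status of that condition:
kubectl get node abm-user-w2 -o jsonpath='{.status.conditions[?(@.type=="ContainerRuntimeUnhealthy")].status}{"\n"}'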
Troubleshooting
- If you lose your SSH session to the admin workstation, set the Zone environment variable and reconnect:
ZONE=us-central1-b
gcloud compute ssh --ssh-flag="-A" root@abm-ws \
--zone $ZONE \
--tunnel-through-iap
# From the admin workstation (root@abm-ws)
export KUBECONFIG=$KUBECONFIG:~/baremetal/bmctl-workspace/abm-hybrid-cluster/abm-hybrid-cluster-kubeconfig
kubectl get nodes
In this lab, you explored logs and metrics from Anthos clusters on bare metal. You learned how to access cluster, application, and audit logs, as well as cluster and application metrics. You also installed Anthos Service Mesh to collect tracing information and enabled Node Problem Detector to detect and report issues on your cluster nodes.