Edge CPU utilization monitoring

Recently I was working with a customer that uses Bare Metal Transport Nodes (TNs) in a core part of its infrastructure. One piece of work I looked at was how this customer could improve the monitoring of those TNs.

This article describes how CPU utilization is calculated on NSX-T Edge Transport Nodes and how it can be monitored via the API, using the transport node status call:

GET https://manager-fqdn/api/v1/transport-nodes/<tn-uuid>/status

Output clipped to show the key sections:

{
"load_average": [
  6.16, // <---- 1 minute system load average (all cores, DPDK and non-DPDK)
  6.36, // <---- 5 minute system load average (all cores, DPDK and non-DPDK)
  6.42 // <---- 15 minute system load average (all cores, DPDK and non-DPDK)
],
"cpu_usage": {
  "highest_cpu_core_usage_dpdk": 0.45, // <---- Highest CPU usage among DPDK assigned cores in %
  "avg_cpu_core_usage_dpdk": 0.26, // <---- Average usage of all DPDK assigned cores in % (over 5 minute interval)
  "highest_cpu_core_usage_non_dpdk": 100.0, // <---- Highest CPU usage among non-DPDK assigned cores in %
  "avg_cpu_core_usage_non_dpdk": 58.25 // <---- Average usage of all non-DPDK assigned cores in % (over 5 minute interval)
  }
}
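
As a rough illustration of polling this call from a monitoring script, here is a minimal Python sketch using only the standard library. Several things in it are assumptions rather than anything taken from the NSX-T documentation: the helper names (build_status_url, find_key, cores_over_threshold, fetch_status) are made up for this example, HTTP basic authentication is assumed to be enabled on the manager, and the 90% threshold is arbitrary. Because the output above is clipped, the sketch searches the response recursively for "cpu_usage" rather than assuming where it sits in the full JSON document.

```python
import base64
import json
import ssl
import urllib.request


def build_status_url(manager_fqdn, tn_uuid):
    """Build the transport node status URL used in this article."""
    return f"https://{manager_fqdn}/api/v1/transport-nodes/{tn_uuid}/status"


def find_key(obj, key):
    """Depth-first search for the first occurrence of key in nested JSON data."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def cores_over_threshold(status, threshold=90.0):
    """Return the 'highest_*' cpu_usage fields at or above threshold (in %)."""
    cpu_usage = find_key(status, "cpu_usage") or {}
    return {
        name: value
        for name, value in cpu_usage.items()
        if name.startswith("highest_") and value >= threshold
    }


def fetch_status(manager_fqdn, tn_uuid, username, password, verify_tls=True):
    """GET the status document (assumes basic authentication is accepted)."""
    request = urllib.request.Request(build_status_url(manager_fqdn, tn_uuid))
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    context = ssl.create_default_context()
    if not verify_tls:
        # Lab use only: skips certificate validation for self-signed managers.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)
```

A monitoring job could call fetch_status() on a schedule and raise an alert whenever cores_over_threshold() returns a non-empty dictionary. With the sample output above, only "highest_cpu_core_usage_non_dpdk" would be flagged.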

The "load_average" values are read from the /proc/loadavg file. The "cpu_usage" values are collected using the mpstat utility, which reports the per-core CPU usage for a given period. The per-core figures are then divided into two groups (DPDK and non-DPDK cores) and averaged.
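
The grouping and averaging step can be sketched in a few lines of Python. This is not the script that runs on the Edge, only an illustration of the calculation described above. The function names, the per-core input dictionary and the dpdk_cores set are all hypothetical: the article does not say how the real script parses mpstat or how it determines which cores are assigned to DPDK.

```python
from statistics import mean


def parse_loadavg(line):
    """Return the 1, 5 and 15 minute load averages from a /proc/loadavg line."""
    one, five, fifteen = line.split()[:3]
    return [float(one), float(five), float(fifteen)]


def summarize_cores(per_core_usage, dpdk_cores):
    """Split per-core usage (%) into DPDK and non-DPDK groups and summarize each.

    per_core_usage: hypothetical mapping of core id -> usage in %
    dpdk_cores:     hypothetical set of core ids assigned to DPDK
    """
    groups = {"dpdk": [], "non_dpdk": []}
    for core, usage in per_core_usage.items():
        groups["dpdk" if core in dpdk_cores else "non_dpdk"].append(usage)

    summary = {}
    for name, values in groups.items():
        if values:  # skip a group that has no cores
            summary[f"highest_cpu_core_usage_{name}"] = max(values)
            summary[f"avg_cpu_core_usage_{name}"] = round(mean(values), 2)
    return summary
```

For example, summarize_cores({0: 0.45, 1: 0.07, 2: 100.0, 3: 16.5}, dpdk_cores={0, 1}) yields field names in the same shape as the "cpu_usage" object returned by the API.
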
The Python script that collects this data is located on the Edge in /opt/vmware/nsx-edge/bin:

root@bm-edge01:/opt/vmware/nsx-edge/bin# ./cpu_usage.py

The output is then stored in the following file: /var/run/vmware/edge/cpu_usage.json

The averages for DPDK and non-DPDK cores are calculated over a 300 second interval.
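
If you have shell access to the Edge, the stored file can also be inspected directly. The sketch below simply loads and pretty-prints whatever JSON the file contains. It makes no assumption about the file's schema, which this article does not document, and the load_cpu_usage helper is a name invented for the example.

```python
import json
from pathlib import Path

# Path taken from this article; adjust if your Edge version stores it elsewhere.
CPU_USAGE_FILE = Path("/var/run/vmware/edge/cpu_usage.json")


def load_cpu_usage(path=CPU_USAGE_FILE):
    """Return the parsed JSON content of the file, or None if it does not exist."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    data = load_cpu_usage()
    if data is None:
        print(f"{CPU_USAGE_FILE} not found")
    else:
        print(json.dumps(data, indent=2))
```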