We use Prometheus to provide real-time monitoring of our hardware. The master is dementors, which uses the Node Exporter to collect data from other servers.
We monitor servers, desktops, and staff VMs, but not the hozer boxes. Additionally, we don't receive email alerts for staff VMs. Monitoring for the networking switch, blackhole, is currently under development.
Alerts can be viewed at prometheus.ocf.berkeley.edu/alerts. They are configured at this folder in the Puppet configs.
Alerts can additionally be configured using Alertmanager, which sends alert notifications by email and Slack. Alerts can be inhibited or silenced. Alertmanager documentation can be found here.
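As a rough illustration of what silencing an alert involves, the sketch below builds a request body in the shape accepted by Alertmanager's v2 silences endpoint (`POST /api/v2/silences`). The helper name `build_silence`, the `instance` matcher, and the example values are assumptions made for illustration; they are not taken from the OCF configuration, and the sketch only builds the body without sending it.

```python
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build a silence body shaped like Alertmanager's v2 API expects.

    `matchers` maps alert label names to exact values (hypothetical input).
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Exact-match matcher on a single alert label
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in sorted(matchers.items())
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Example: silence alerts for one host for two hours (values are illustrative)
silence = build_silence(
    {"instance": "fallingrocks"},
    hours=2,
    created_by="staff",
    comment="planned maintenance",
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(silence["startsAt"], "->", silence["endsAt"])
```

In practice a silence is usually created through the Alertmanager web interface; a body like this would only matter when scripting the same action.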
Alerts are currently under development and may not be fully comprehensive.
Prometheus uses metrics to collect and visualize different types of data.
The main way Prometheus collects metrics in the OCF is Node Exporter. Another important exporter we use is the SNMP Exporter, which monitors information from printers and, possibly in the future, network switches.
A full list of exporters is available in the Prometheus documentation. In order to take advantage of these exporters, we define them in the Puppet config for the Prometheus server.
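Exporters publish their metrics over HTTP in Prometheus's plain-text exposition format, which the server scrapes periodically. The sketch below parses a small sample of that format to show what a scrape contains. It is a simplified reader, not the parser Prometheus itself uses: it does not unescape label values, and the sample values are made up for illustration.

```python
import re

# metric name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative scrape output (values are invented)
example = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
node_memory_MemTotal_bytes 1.6e+10
"""

for name, labels, value in parse_metrics(example):
    print(name, labels, value)
```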
There are three main ways to generate custom metrics:
/srv/prometheus. These automatically get bundled into Node Exporter. We do this for CUPS monitoring - here is an example of this in practice.
Prometheus supports querying a wide variety of metrics. (For a full list, go to Prometheus and use the "insert metric at cursor" dropdown.) A basic query comes in the form:
metric{label="value", label2="value2", ...}
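When queries are generated by a script rather than typed by hand, the selector form above can be assembled from a metric name and a dictionary of labels. The helper below is a minimal sketch; `build_selector` is a hypothetical name, not part of any Prometheus library, and it only handles exact-match (`=`) label matchers.

```python
def build_selector(metric, labels=None):
    """Build a selector such as metric{label="value", label2="value2"}."""

    def escape(value):
        # Escape backslashes, double quotes, and newlines inside label values
        return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if not labels:
        return metric
    pairs = ", ".join(f'{key}="{escape(value)}"' for key, value in labels.items())
    return f"{metric}{{{pairs}}}"


print(build_selector("node_network_transmit_bytes_total", {"instance": "fallingrocks"}))
```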
Some labels used frequently are:
papercut, avalanche, or supernova.
desktop, server, and staffvm.
node, printer, and slurm.
For example, to view the total RAM installed on each of the servers, you can query node_memory_MemTotal_bytes{host_type="server"}.
To view the per-second rate of a metric, use
rate(metric{label="value",...}[5m])
For example, the data sent in bytes/second over the past 5 minutes by fallingrocks can be retrieved using rate(node_network_transmit_bytes_total{instance="fallingrocks"}[5m]).
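To make the idea of a per-second rate concrete, the sketch below computes an average rate from a handful of counter samples taken inside a time window. It is a deliberately simplified model: it treats any drop in the counter as a reset to zero, but it omits the extrapolation to the window boundaries that Prometheus's own `rate()` performs. The function name and the sample data are invented for illustration.

```python
def simple_rate(samples):
    """Average per-second increase of a counter.

    `samples` is a list of (timestamp_seconds, counter_value) pairs,
    sorted by timestamp. Returns None if there is too little data.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter went down, so assume it restarted from zero
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Three scrapes one minute apart: 6000 bytes sent over 120 seconds
print(simple_rate([(0, 1000), (60, 4000), (120, 7000)]))
```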
For more info about querying, see the official documentation.
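Queries can also be run from a script through Prometheus's HTTP API, whose instant-query endpoint is `/api/v1/query`. The sketch below builds the request URL and parses an instant-vector response using only the Python standard library. The function names are hypothetical, the sample response is invented, and whether the API is reachable from a given machine is an assumption that depends on how the server is exposed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_vector(body):
    """Turn an instant-vector JSON response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each item's "value" is [timestamp, "value-as-string"]
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def instant_query(base_url, promql, timeout=10):
    """Run the query over HTTP (requires network access to the server)."""
    with urlopen(query_url(base_url, promql), timeout=timeout) as response:
        return parse_vector(response.read().decode("utf-8"))


# Invented response, shaped like the API's output, to show the parsing step
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "fallingrocks"}, "value": [1700000000, "1234.5"]},
        ],
    },
})
print(parse_vector(sample_body))
```

Calling `instant_query` with the server's base URL and one of the queries shown earlier would return one `(labels, value)` pair per matching time series.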
Queries are best used in conjunction with Grafana, which produces more readable results and lets you save them for future reference. The next section gives more details on how to do this.
The frontend for Prometheus is Grafana, which displays statistics collected by Prometheus in a user-friendly manner. Some of the more useful dashboards available are:
There are more dashboards available, which can be accessed by clicking the dropdown arrow on the top left of the Grafana page.
Configuring Grafana dashboards does not require editing Puppet configs. Simply go to Grafana, log in using your OCF account, and click the plus icon on the left toolbar to begin visually creating a custom dashboard. Grafana uses Prometheus queries to fetch the data it displays.