Linux¶
To monitor the health of your systems running the Linux operating system and identify irregular behavior, you need to see the right data at each source. In OctoPerf, you can visualize your overall Linux performance as well as system-level metrics like CPU, disk I/O, or memory usage.
A valid user account must also be entered to allow OctoPerf to connect to the SSH server and collect monitoring metrics.
Supported versions¶
The Linux monitor has been tested on the latest Debian, Fedora, and Ubuntu operating systems.
Prerequisites¶
The SSH server (daemon) must be installed (apt-get install ssh on Ubuntu).
The sysstat package must be installed on the Linux machine being monitored. The Linux monitor needs an account with the rights to execute the following commands:
- awk,
- grep,
- vmstat,
- iostat,
- top,
- netstat,
- nproc,
- pidstat,
- sysstat,
- read-only access to /proc/meminfo.
We recommend setting up a dedicated Linux account with access to those commands only.
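As a quick sanity check before configuring the monitor, the prerequisites above can be verified from a shell session opened with the dedicated account. The following is a minimal sketch, not part of OctoPerf: the function names are hypothetical, and `sysstat` is left out of the command list on the assumption that it is a package name rather than an executable.

```python
import shutil

# Commands listed in the prerequisites. `sysstat` is omitted on the assumption
# that it names the package (which provides iostat and pidstat), not a binary.
REQUIRED_COMMANDS = ["awk", "grep", "vmstat", "iostat", "top", "netstat", "nproc", "pidstat"]


def missing_commands(commands=REQUIRED_COMMANDS):
    """Return the commands that cannot be found on the current user's PATH."""
    return [name for name in commands if shutil.which(name) is None]


def can_read_file(path="/proc/meminfo"):
    """Return True if the current user can open the file for reading."""
    try:
        with open(path, "r"):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    missing = missing_commands()
    print("Missing commands:", ", ".join(missing) if missing else "none")
    print("/proc/meminfo readable:", can_read_file())
```

Run it as the monitoring user on the machine to monitor: any command reported as missing usually points to a package that is not installed or a PATH restriction on that account.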
SSH Protocol¶
The Linux monitor uses an SSH connection to retrieve counter metrics. The connection is established when starting the monitoring and stopped when monitoring finishes.
SSH Configuration¶
Edit the /etc/ssh/sshd_config file and check that at least one of the following parameters is set to yes:
- PasswordAuthentication: allows username and password authentication,
- PubkeyAuthentication: allows private key authentication,
- or PAMAuthenticationViaKbdInt: specifies whether PAM challenge-response authentication is allowed.
The SSH service needs to be restarted after any change.
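The check described above can be sketched in a few lines of Python. This is an illustrative helper only (the function name is hypothetical): it looks at values written explicitly in the file, keeps the first occurrence of each keyword, and ignores built-in defaults, `Match` blocks and the `keyword=value` syntax.

```python
AUTH_OPTIONS = (
    "passwordauthentication",
    "pubkeyauthentication",
    "pamauthenticationviakbdint",
)


def enabled_auth_options(config_text):
    """Return the authentication options explicitly set to yes (lower-cased, sorted)."""
    first_values = {}
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blank lines
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip().lower()
        # Keep only the first occurrence of each keyword.
        if keyword in AUTH_OPTIONS and keyword not in first_values:
            first_values[keyword] = value
    return sorted(name for name, value in first_values.items() if value == "yes")


if __name__ == "__main__":
    try:
        with open("/etc/ssh/sshd_config") as handle:
            print("Enabled:", enabled_auth_options(handle.read()) or "none set explicitly")
    except OSError as error:
        print("Could not read sshd_config:", error)
```

An empty result does not necessarily mean authentication is disabled, since an option that is absent from the file falls back to the server default.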
Configuration¶
The Linux monitor can log in to the operating system in two different ways.
Username and password:
- Username: Linux username,
- Password: Linux password.
Private key:
- Username: the Linux operating system username,
- Private Key: private key certificate in base64 format,
- Password: private key passphrase.
Encryption algorithms¶
OctoPerf supports the following ciphers:
- aes128-cbc,
- aes128-ctr,
- aes192-cbc,
- aes256-cbc,
- 3des-cbc,
- 3des-ctr,
- blowfish-cbc,
- arcfour,
- arcfour128,
- twofish-cbc,
- twofish128-cbc,
- twofish192-cbc,
- twofish256-cbc,
- and cast128-cbc.
OctoPerf supports the following message authentication codes (MAC):
- hmac-sha1,
- hmac-sha1-96,
- hmac-md5,
- hmac-md5-96.
If you don't know what these settings are, leave the fields at their default values. The default settings are valid for most systems.
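If a connection fails during algorithm negotiation, it can help to compare the lists above with what the server offers. The sketch below assumes you already have the server's algorithms as a comma-separated string, in the same form as the value of a `Ciphers` or `MACs` directive in sshd_config; the function name and sample values are hypothetical.

```python
# Algorithms copied from the lists above.
SUPPORTED_CIPHERS = {
    "aes128-cbc", "aes128-ctr", "aes192-cbc", "aes256-cbc",
    "3des-cbc", "3des-ctr", "blowfish-cbc", "arcfour", "arcfour128",
    "twofish-cbc", "twofish128-cbc", "twofish192-cbc", "twofish256-cbc",
    "cast128-cbc",
}
SUPPORTED_MACS = {"hmac-sha1", "hmac-sha1-96", "hmac-md5", "hmac-md5-96"}


def common_algorithms(server_list, supported):
    """Return the server algorithms (in server order) that are also in `supported`."""
    offered = [name.strip() for name in server_list.split(",") if name.strip()]
    return [name for name in offered if name in supported]


if __name__ == "__main__":
    # Hypothetical server values, for illustration only.
    server_ciphers = "chacha20-poly1305@openssh.com,aes128-ctr,aes256-gcm@openssh.com"
    server_macs = "hmac-sha2-256,hmac-sha1"
    print("Common ciphers:", common_algorithms(server_ciphers, SUPPORTED_CIPHERS))
    print("Common MACs:", common_algorithms(server_macs, SUPPORTED_MACS))
```

An empty list for either ciphers or MACs would suggest that the server configuration needs at least one algorithm from the lists above to be enabled.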
Applications¶
The Linux monitor supports monitoring specific devices like disks and even specific processes. To fine-tune the selection of monitored counters, the Linux monitor lets you select the following applications:
- Disks: hard disk drives and solid state drives,
- Network Interfaces: network cards,
- Processes: running linux processes.
The most relevant counters are then selected in the next step, depending on the applications you selected.
Monitored Counters¶
The Linux monitoring module collects the following metrics:
System:
- Processes Sleeping: The number of processes in uninterruptible sleep,
- Memory Swapped: The amount of swap memory used,
- Memory Swap In: Amount of memory swapped in from disk per second,
- Memory Swap Out: Amount of memory swapped out to disk per second,
- Blocks Received: Blocks received from a block device per second,
- Blocks Sent: Blocks sent to a block device per second,
- Interrupts: Number of interrupts per second, including the clock,
- Processes Runnable: The number of runnable processes (running or waiting for run time),
- Processes Runnable per CPU: The number of runnable processes per CPU (running or waiting for run time),
- Context switches: The number of context switches per second,
- Context switches per CPU: The number of context switches per CPU per second.
CPU:
- % CPU User: Percent of time running un-niced user processes,
- % CPU System: Percent of time running kernel processes,
- % CPU Niced: Percent of time running niced user processes,
- % CPU Idle: Percent of time spent idle,
- % CPU Wait: Percent of time spent waiting for I/O completion,
- % CPU Hardware Interrupts: Percent of time servicing hardware interrupts,
- % CPU Software Interrupts: Percent of time servicing software interrupts,
- % CPU Stolen: Percent of time stolen from this VM by the hypervisor,
- Load Avg (1min): System load average for the last minute,
- Load Avg (5min): System load average for the past 5 minutes,
- Load Avg (15min): System load average for the past 15 minutes,
- Load Avg per CPU (1min): System load average for the last minute per CPU,
- Load Avg per CPU (5min): System load average for the past 5 minutes per CPU,
- Load Avg per CPU (15min): System load average for the past 15 minutes per CPU.
Memory:
- Total Memory: Total installed RAM (Random Access Memory) in megabytes,
- Free Memory: Unused memory in megabytes (sum of MemFree and SwapFree),
- Buffered Memory: Portion of RAM dedicated to caching disk blocks, in megabytes,
- Available Memory: Estimated available memory size in megabytes,
- Cached Memory: Amount of physical memory used by cache buffers for filesystems, in megabytes,
- Swap Total: Total swap memory in megabytes,
- Swap Used: Used swap memory in megabytes (Total - Free),
- % Swap Used: Percent of used swap memory,
- Used Memory: Effectively used memory (used memory minus buffered and cached memory),
- % Used Memory: Percent of effectively used memory ((used - (buffered + cached)) / total * 100).
Disks, for each {disk}:
- Read Request Merged: The number of read requests queued to the device that were merged per second,
- Write Request Merged: The number of write requests queued to the device that were merged per second,
- Read requests/sec: The number (after merges) of read requests completed per second for the device,
- Write requests/sec: The number (after merges) of write requests completed per second for the device,
- Read KB/sec: The number of kilobytes read from the device per second,
- Write KB/sec: The number of kilobytes written to the device per second,
- I/O Request Size: The average size (in sectors) of the requests that were issued to the device,
- Disk Queue Length: The average queue length of the requests that were issued to the device,
- IO Wait: The average time (in milliseconds) for I/O requests issued to the device to be served,
- Disk Read Await: The average time (in milliseconds) for read requests issued to the device to be served,
- Disk Write Await: The average time (in milliseconds) for write requests issued to the device to be served,
- Disk Avg Service Time: The average service time (in milliseconds) for I/O requests that were issued to the device,
- % Disk Utilization: Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially.
Network, for each {network_interface}:
- Net. Received OK: Number of network packets received successfully since startup,
- Net. Received Error: Number of network packets that failed to be received since startup,
- Net. Received Dropped: Number of network packets dropped since startup,
- Net. Received Overun: Number of network packets this interface was unable to receive,
- Received Bytes: Total received bytes since system startup,
- Net. Transmitted OK: Number of network packets successfully transmitted since startup,
- Net. Transmitted Error: Number of network packets that failed to be transmitted since startup,
- Net. Transmitted Dropped: Number of network packets dropped since startup,
- Net. Transmitted Overun: Number of network packets this interface was unable to transmit since startup,
- Sent Bytes: Total number of bytes sent since startup,
- Received MB/sec: Number of MegaBytes per second received,
- Sent MB/sec: Number of MegaBytes per second sent,
- % Net. Received Error: % Invalid packets received on this interface,
- % Net. Sent Error: % Invalid packets sent on this interface.
Processes, for each {process}:
- % Memory Usage: Percent of process memory usage,
- % CPU Usage: Percent of process CPU usage.
TCP:
- Established Connections: Number of TCP established connections,
- Incoming Segments/sec: Number of TCP segments received per second,
- Outgoing Segments/sec: Number of TCP segments transmitted per second,
- Segments Retransmitted/sec: Number of TCP segments retransmitted per second,
- % Segments Retransmitted: Percent of TCP segments retransmitted.
Going Further¶
Runnable processes¶
A high number of runnable processes is not a problem in itself, but if this number keeps increasing, it indicates an issue on the system.
SWAP Memory¶
A major performance concern is a high rate of swap-ins and swap-outs. This means your host doesn't have enough physical memory to store the needed pages and is often using the disk, which is significantly slower than physical memory.
Note that a moderate amount of used swap space isn't necessarily an issue if the pages swapped out belong to idle processes. However, as the used swap space increases, so do the chances of swap activity impacting your server performance.
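To look at swap activity by hand, the `si` and `so` columns of `vmstat` report memory swapped in and out per second. The following sketch parses captured `vmstat` text and keeps the samples that show swap activity; the function names are hypothetical, and this is not necessarily how OctoPerf collects the counters.

```python
def parse_vmstat(output):
    """Parse `vmstat` text output into one dict of integers per sample line."""
    samples, columns = [], None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":  # column-name header line (r, b, swpd, ...)
            columns = fields
        elif columns and len(fields) == len(columns) and all(f.isdigit() for f in fields):
            samples.append(dict(zip(columns, map(int, fields))))
    return samples


def swapping_samples(samples):
    """Return the samples in which memory was swapped in or out."""
    return [s for s in samples if s.get("si", 0) > 0 or s.get("so", 0) > 0]


if __name__ == "__main__":
    # Illustrative output only, shaped like the result of `vmstat 1 2`.
    captured = (
        "procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----\n"
        " r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st\n"
        " 2  0   1024 123456   7890 456789    0    0     5    10  100  200  1  1 98  0  0\n"
        " 3  0   2048 120000   7890 456789   12   34     5    10  150  900  5  2 90  3  0\n"
    )
    for sample in swapping_samples(parse_vmstat(captured)):
        print("swap in:", sample["si"], "swap out:", sample["so"])
```

The same parsed samples also expose the `r` (runnable processes) and `cs` (context switches) columns, so the helper can be reused to follow those counters over time.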
Context switches¶
A context switch is the switching of the CPU from one process or thread to another. Context switching is generally computationally intensive, but the rate of switching depends entirely on the way your application is built. Several factors can increase this rate, such as the number of system calls and multitasking. Context switches are usually symptomatic of another problem, not a problem by themselves.
Load Average¶
A load average of 1.0 on a single-core, single-processor machine means that the CPU is at 100% capacity. To interpret the load average, first count the number of cores on the machine (for that purpose, we consider that 2 quad cores == 4 dual cores == 8 cores in total).
Then, as long as the load average is less than or equal to this number, your system is doing fine. When the value rises above the number of cores, you can consider that the CPU is overloaded and that at least some processes had to wait.
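The rule of thumb above amounts to dividing the load average by the number of cores and comparing the result to 1.0. A minimal sketch, with hypothetical function names:

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Return the load average divided by the number of cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def is_overloaded(load_average, cpu_count):
    """Return True when the load average exceeds the number of cores."""
    return load_per_cpu(load_average, cpu_count) > 1.0


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    if hasattr(os, "getloadavg"):  # not available on every platform
        one_min, five_min, fifteen_min = os.getloadavg()
        for label, value in (("1min", one_min), ("5min", five_min), ("15min", fifteen_min)):
            print(f"Load Avg per CPU ({label}): {load_per_cpu(value, cores):.2f}",
                  "(overloaded)" if is_overloaded(value, cores) else "")
```

For example, a load average of 9.0 on an 8-core machine gives 1.125 per core and is flagged as overloaded, while 8.0 on the same machine is exactly at capacity and is not.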
CPU Usage¶
Linux keeps statistics on how much time the CPU spends performing different tasks. Most of its time should be spent running user space programs or being idle. However, there are several other execution states, including running the kernel and servicing interrupts. Monitoring these different states will help you assess any issue during the test.
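To illustrate where such percentages come from, the kernel exposes cumulative time counters per state on the aggregate `cpu` line of /proc/stat. The sketch below turns two snapshots of that line into percentages over the interval between them. It is an illustration with hypothetical function names, not a description of how OctoPerf computes its CPU counters.

```python
import time

# First eight columns of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the cumulative time counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":  # exact match skips cpu0, cpu1, ...
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Return the share of time spent in each state between two snapshots."""
    deltas = {name: after.get(name, 0) - before.get(name, 0) for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


if __name__ == "__main__":
    try:
        with open("/proc/stat") as handle:
            first = parse_cpu_line(handle.read())
        time.sleep(1)
        with open("/proc/stat") as handle:
            second = parse_cpu_line(handle.read())
        for name, value in cpu_percentages(first, second).items():
            print(f"% CPU {name}: {value:.1f}")
    except OSError as error:
        print("Could not read /proc/stat:", error)
```

Because the counters are cumulative since boot, a single snapshot only gives an average over the whole uptime; the difference between two snapshots is what reflects activity during a test.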
Memory Usage¶
In general, Linux uses all free RAM for the buffer cache to make reading data as efficient as possible. For this reason, the "Used Memory" and "% Used Memory" counters, which exclude buffered and cached memory, give a quick overview of actual memory usage.
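As a worked example of that calculation, the sketch below derives the effectively used memory from the contents of /proc/meminfo. It assumes that "used" means MemTotal minus MemFree, which the documentation does not state explicitly, and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, separator, rest = line.partition(":")
        fields = rest.split()
        if separator and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def effective_used_percent(meminfo):
    """Return (used - (buffered + cached)) / total * 100.

    Assumption: used = MemTotal - MemFree.
    """
    total = meminfo["MemTotal"]
    used = total - meminfo["MemFree"]
    effective = used - (meminfo["Buffers"] + meminfo["Cached"])
    return 100.0 * effective / total


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            info = parse_meminfo(handle.read())
        print(f"% Used Memory: {effective_used_percent(info):.1f}")
    except (OSError, KeyError) as error:
        print("Could not compute memory usage:", error)
```

With 8000 kB total, 1000 kB free, 500 kB of buffers and 2500 kB cached, the raw usage is 87.5% but the effective usage is only 50%, which is why the effective figure is the better indicator of memory pressure.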
TCP Segments¶
The most useful overall metric regarding TCP segments is probably "% Segments Retransmitted". A TCP retransmit happens when the server does not receive the acknowledgment for a segment before a timeout. This means that for every TCP segment retransmitted, one of the users experiences a delay in response time. Even one percent of all segments being retransmitted can have a dramatic effect on response times.
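For reference, a retransmission percentage can be derived from the cumulative `OutSegs` and `RetransSegs` counters that Linux exposes on the `Tcp:` lines of /proc/net/snmp. The sketch below compares two snapshots; it is an illustration with hypothetical function names and may differ from the way OctoPerf computes the counter.

```python
def parse_tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp content as a name -> int dict."""
    tcp_lines = [line.split()[1:] for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("expected a header line and a value line starting with 'Tcp:'")
    names, values = tcp_lines[0], tcp_lines[1]
    return {name: int(value) for name, value in zip(names, values)}


def retransmitted_percent(before, after):
    """Return the percent of sent segments that were retransmitted between snapshots."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return 100.0 * retransmitted / sent if sent > 0 else 0.0


if __name__ == "__main__":
    # Illustrative snapshots only; real content comes from /proc/net/snmp.
    header = "Tcp: CurrEstab InSegs OutSegs RetransSegs\n"
    first = parse_tcp_counters(header + "Tcp: 10 10000 8000 80\n")
    second = parse_tcp_counters(header + "Tcp: 12 11500 9000 100\n")
    print(f"% Segments Retransmitted: {retransmitted_percent(first, second):.1f}")
```

Taking the snapshots at the start and end of a test keeps retransmissions that happened before the test from skewing the percentage.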