Friday, May 24, 2024

10 Key Questions When Running on Ampere Altra-Based Instances - SitePoint

This article was originally published by Ampere Computing.

You’re running your application on a new cloud instance or a server (or SUT, a system under test) and you notice there’s a performance issue. Or you want to ensure you’re getting the best performance, given the system resources at your disposal. This document discusses some basic questions you should ask and ways to answer them.

Prerequisites: Know Your VM or Server

Before you start troubleshooting or embarking on a performance analysis exercise, you need to be aware of the system resources at your disposal. System-level performance typically boils down to four components and how they interact with each other: CPU, Memory, Network, Disk. Also refer to Brendan Gregg’s excellent article Linux Performance Analysis in 60,000 Milliseconds for a great start to quickly evaluating performance issues.

This article explains how to dig deeper to understand performance issues.

Determine CPU Type

Run the $lscpu command, and it will display the CPU type, CPU frequency, number of cores and other CPU-relevant information:

ampere@colo1:~$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          160
On-line CPU(s) list:             0-159
Thread(s) per core:              1
Core(s) per socket:              80
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
CPU max MHz:                     3000.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        50.00
L1d cache:                       10 MiB
L1i cache:                       10 MiB
L2 cache:                        160 MiB
NUMA node0 CPU(s):               0-79
NUMA node1 CPU(s):               80-159
Vulnerability Itlb multibit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; CSV2, BHB
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid
                                  asimdrdm lrcpc dcpop asimddp ssbs
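If you want to check the topology in a script rather than by eye, the listing above can be parsed as plain "key: value" text. The following is a minimal sketch in Python; the function names are illustrative (not part of any tool), and the sample string reuses four values from the output above.

```python
def parse_lscpu(text):
    """Turn `lscpu` output into a dict of {field name: value string}."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Continuation lines (for example wrapped Flags) have no colon and are skipped.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def expected_cpu_count(fields):
    """Sockets x cores per socket x threads per core."""
    return (int(fields["Socket(s)"])
            * int(fields["Core(s) per socket"])
            * int(fields["Thread(s) per core"]))


# Values copied from the lscpu listing above.
sample = """\
CPU(s):                          160
Thread(s) per core:              1
Core(s) per socket:              80
Socket(s):                       2
"""
info = parse_lscpu(sample)
print(expected_cpu_count(info), "expected;", info["CPU(s)"], "reported")
```

If the two numbers differ, some cores are probably offline or hidden from the instance, which is worth knowing before you measure anything else.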

Determine Memory Configuration

Run the $free command, and it will show you information about the total amount of physical and swap memory (including the breakdown of memory usage). Run the Multichase benchmark to determine the latency, memory bandwidth and load-latency of the instance/SUT:

ampere@colo1:~$ free
              total        used        free      shared  buff/cache   available
Mem:      130256992     3422844   120742736        4208     6091412   125852984
Swap:       8388604           0     8388604
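To turn these raw kibibyte counts into a quick health figure, you can compute the share of memory that is still available. This is a small Python sketch that assumes the default column layout of free shown above; the helper name parse_free is made up for the example.

```python
def parse_free(text):
    """Return {row label: {column: KiB}} from default `free` output."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()
    table = {}
    for line in lines[1:]:
        label, *values = line.split()
        # The Swap row has fewer columns; zip() simply stops early.
        table[label.rstrip(":")] = dict(zip(columns, map(int, values)))
    return table


# Numbers copied from the free output above.
sample = """\
              total        used        free      shared  buff/cache   available
Mem:      130256992     3422844   120742736        4208     6091412   125852984
Swap:       8388604           0     8388604
"""
table = parse_free(sample)
mem = table["Mem"]
available_pct = 100.0 * mem["available"] / mem["total"]
print(f"{available_pct:.1f}% of memory available, swap used: {table['Swap']['used']} KiB")
```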

Assess Network Capability

Run the $ethtool command, and it will show you information about the hardware settings of the NIC. It is also used to control network device driver and hardware settings. If you’re running the workload in a client-server model, it’s a good idea to know the bandwidth and latency between the client and the server. For determining the bandwidth, a simple iperf3 test is sufficient, and for latency a simple ping test will give you that value. In a client-server setup it’s also advisable to keep the number of network hops to a minimum. traceroute is a network diagnostic command for displaying the route and measuring transit delays of packets across the network:

ampere@colo1:~$ ethtool -i enp1s0np0
driver: mlx5_core
version: 5.7-1.0.2
firmware-version: 16.32.1010 (RCP0000000001)
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Understand Storage Infrastructure

It is essential to know the disk capabilities before you start running the workloads. Knowing the throughput and latency of your disk and filesystems will help you plan and architect the workload effectively. Flexible I/O (or “fio”) is the tool of choice to determine these values.

Now On to the Top 10 Questions

1. Are my CPUs being used well?

One of the major components of the Total Cost of Ownership is the CPU. It is therefore worth finding out how efficiently CPUs are being used. Idle CPUs typically mean there are external dependencies, like waiting on disk or network accesses. It is always a good idea to monitor CPU utilization and to check whether core usage is uniform.

A sample output from the command $top -1 is pictured below.

2. Are my CPUs running at the highest frequencies possible?

Modern CPUs use p-states to scale the frequency and voltage at which they run, reducing the power consumption of the CPU when higher frequencies are not needed. This is called Dynamic Voltage and Frequency Scaling (DVFS) and is managed by the OS. In Linux, p-states are managed by the CPUFreq subsystem, which uses different algorithms (called governors) to determine the frequency at which the CPU runs. In general, for performance-sensitive applications, it is a good idea to ensure that the performance governor is used, and the following command uses the cpupower utility to achieve that. Keep in mind that the frequency at which a CPU should run is workload dependent:

cpupower frequency-set --governor performance

To check the frequency of the CPU while running your application, run the following command:

ampere@colo1:~$ cpupower frequency-info
analyzing CPU 0:
  driver: cppc_cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 1000 MHz - 3.00 GHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 1000 MHz and 3.00 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 1000 MHz (asserted by call to kernel)
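cpupower reports one CPU at a time. On a 160-core system it can be handy to confirm that every core uses the same governor. The sketch below reads the standard cpufreq sysfs files and counts governors; it assumes the kernel exposes /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor, and the function name is illustrative only.

```python
import glob
import os
from collections import Counter


def governor_counts(cpu_root="/sys/devices/system/cpu"):
    """Count how many CPUs use each cpufreq governor."""
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    counts = Counter()
    for path in glob.glob(pattern):
        with open(path) as f:
            counts[f.read().strip()] += 1
    return counts


# Prints an empty dict on systems that do not expose cpufreq.
print(dict(governor_counts()))
```

A result with more than one governor name suggests the setting was applied to only some of the cores.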


3. How much time am I spending in my application versus kernel time?

It is often important to find out what percentage of the CPU’s time is consumed in user space versus privileged time (i.e., kernel space). High kernel time can be justified for a certain class of workloads (network-bound workloads, for example) but can also be an indication of a problem.

The Linux tool top can be used to find out the user vs. kernel time consumption as shown below.

  • Mpstat: examine statistics per CPU and check for individual hot/busy CPUs. This is a multiprocessor statistics tool, and can report statistics per CPU (-P option)
  • CPU: Logical CPU ID, or all for summary
  • %usr: User time, excluding %nice
  • %nice: User time for processes with a niced priority
  • %sys: System time
  • %iowait: I/O wait
  • %irq: Hardware interrupt CPU usage
  • %soft: Software interrupt CPU usage
  • %steal: Time spent servicing other tenants
  • %guest: CPU time spent in guest virtual machines
  • %gnice: CPU time to run a niced guest
  • %idle: Idle

To identify CPU usage per CPU and show the user-time/kernel-time ratio, %usr, %sys, and %idle are the key values. These key values can also help identify “hot” CPUs, which can be caused by single-threaded applications or interrupt mapping.
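The percentages that mpstat and top report are derived from the cumulative counters on the "cpu" lines of /proc/stat. As a rough illustration of that arithmetic, the Python sketch below takes two such lines sampled some time apart and returns the share of time spent in each state. The sample counter values are invented for the example.

```python
# First eight counters on a /proc/stat "cpu" line, in order.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def cpu_shares(before, after):
    """Percent of elapsed CPU time per state between two 'cpu ...' lines."""
    first = [int(x) for x in before.split()[1:1 + len(FIELDS)]]
    second = [int(x) for x in after.split()[1:1 + len(FIELDS)]]
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas) or 1  # avoid dividing by zero on identical samples
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}


# Hypothetical samples taken a short interval apart.
shares = cpu_shares("cpu 1000 0 500 8000 100 0 50 0 0 0",
                    "cpu 1600 0 700 8150 120 0 80 0 0 0")
print(f"user {shares['user']:.0f}%  system {shares['system']:.0f}%  idle {shares['idle']:.0f}%")
```

The same function works on the per-core lines (cpu0, cpu1, ...) if you want the split for a single core.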

4. Do I have enough memory for my application?

When you are managing a server, you might need to install a new application, or you might notice that an application has started to slow down. For managing your system resources and understanding your installed system memory and the memory usage of the system, the $free command is a valuable tool. $vmstat is also a valuable tool to monitor memory usage and to see whether you are actively swapping memory out to swap space.

  • Free. The Linux free command shows memory and swap statistics.

    The output shows the total, used and free memory of the system. An important column is the available value, which shows the memory available to an application without the need to swap. It also accounts for the memory which cannot be reclaimed immediately.

  • Vmstat. This command provides a high-level view of system memory health, including currently free memory and paging statistics.

    The $vmstat command shows active memory being swapped out (paging).

The commands print a summary of the current status. The columns are in kilobytes by default and are:

  • Swpd: Amount of swapped-out memory
  • Free: Free available memory
  • Buff: Memory in the buffer cache
  • Cache: Memory in the page cache
  • Si: Memory swapped in (paging)
  • So: Memory swapped out (paging)

If si and so are non-zero, the system is under memory pressure and is swapping memory to the swap device.
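The si/so check is easy to automate. The sketch below parses captured vmstat output in Python and flags any sample with swap activity. The sample text is hypothetical and follows the usual two-line vmstat header; the function names are made up for this example.

```python
def swapping_samples(vmstat_text):
    """Return (si, so) for every sample row in `vmstat` output."""
    columns, rows = None, []
    for line in vmstat_text.splitlines():
        parts = line.split()
        if "si" in parts and "so" in parts:
            columns = parts  # the header line that names the columns
        elif columns and parts and parts[0].isdigit():
            row = dict(zip(columns, map(int, parts)))
            rows.append((row["si"], row["so"]))
    return rows


def under_memory_pressure(vmstat_text):
    """True if any sample shows memory being swapped in or out."""
    return any(si or so for si, so in swapping_samples(vmstat_text))


# Hypothetical capture: the second sample shows swap traffic.
sample = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 120742736 4208 6091412  0    0     1     2   50   80  1  0 99  0  0
 3  1  20480 1024000  4208 6091412  120  340   10    20  500  800 40 10 45  5  0
"""
print(swapping_samples(sample), under_memory_pressure(sample))
```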

5. Am I getting the right amount of memory bandwidth?

To understand the right amount of memory bandwidth, first get the “Max Memory Bandwidth” value of your system. The “Max Memory Bandwidth” value can be calculated from:

  • Base DRAM clock frequency
  • Number of data transfers per clock: two, in the case of “double data rate” (DDR*) memory
  • Memory bus (interface) width: for example, DDR 3 is 64 bits wide (also referred to as a line)
  • Number of interfaces: modern personal computers typically use two memory interfaces (dual-channel mode) for an effective 128-bit bus width
  • Max Memory Bandwidth = Base DRAM clock frequency * Number of data transfers per clock * Memory bus width * Number of interfaces

This value represents the theoretical maximum bandwidth of the system, also known as the “burst rate”. You can now run benchmarks like Multichase or Bandwidth against the system and verify the values.

Note: it has been observed that the burst rates may not be sustainable, and the values achieved may be a bit lower than calculated.
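As a worked example of the formula above, the Python sketch below computes the theoretical peak for an assumed configuration of eight 64-bit DDR4-3200 channels (a 1600 MHz bus clock with two transfers per clock). The configuration is an illustration only; substitute the values for your own system.

```python
def max_memory_bandwidth_gbs(clock_mhz, transfers_per_clock, bus_width_bits, interfaces):
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    bytes_per_second = (clock_mhz * 1e6 * transfers_per_clock
                        * bytes_per_transfer * interfaces)
    return bytes_per_second / 1e9


# Assumed example: 1600 MHz clock, DDR (2 transfers), 64-bit bus, 8 channels.
peak = max_memory_bandwidth_gbs(1600, 2, 64, 8)
print(f"{peak:.1f} GB/s theoretical peak")
```

For these inputs the result is 204.8 GB/s: 1600 million clocks per second, times 2 transfers, times 8 bytes, times 8 channels. A benchmark result somewhat below this figure is expected, as the note above explains.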

6. Is my workload using all my CPUs in a balanced way?

When running workloads on your server, as part of performance tuning or troubleshooting, you may want to know on which CPU core a particular process is currently scheduled and to collect performance statistics for the process running on that core. The first step is to find the process running on the CPU core. This can be done using htop. The CPU value is not shown in the default htop display. To get the CPU core value, launch $htop from the command line, press the F2 key, go to “Columns”, and add “Processor” under “Available Columns”. The currently used “CPU ID” of each process will appear under the “CPU” column.

  • How to configure $htop to show CPU/core:

  • $htop command showing cores 4-6 maxed out (htop core count starts from “1” instead of “0”):

  • $mpstat command for selected cores to examine statistics:

Once you have identified the CPU core, you can run the $mpstat command to examine statistics per CPU and check for individual hot/busy CPUs. This is a multiprocessor statistics tool and can report statistics per CPU (or core). For more information on $mpstat see the “How much time am I spending in my application versus kernel time?” section above.
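Once you have per-CPU busy percentages (for example 100 minus %idle from mpstat), spotting imbalance can be reduced to a simple comparison against the average. The Python sketch below flags CPUs that are both heavily loaded and well above the mean. The thresholds (factor and floor) are arbitrary assumptions for illustration, and the sample values are invented.

```python
from statistics import mean


def hot_cpus(busy_by_cpu, factor=2.0, floor=50.0):
    """CPUs whose busy % is at least `floor` and at least `factor` x the mean."""
    average = mean(busy_by_cpu.values())
    return sorted(cpu for cpu, busy in busy_by_cpu.items()
                  if busy >= floor and busy >= factor * average)


# Hypothetical busy percentages, keyed by zero-based CPU ID.
busy = {0: 5, 1: 4, 2: 6, 3: 98, 4: 97, 5: 99, 6: 3, 7: 5}
print(hot_cpus(busy))
```

With these sample values, CPUs 3, 4 and 5 stand out, which corresponds to cores 4-6 in htop’s one-based numbering.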

7. Is my network a bottleneck for my application?

Network bottlenecking can happen even before you saturate other resources on the server. This issue is found when a workload is being run in a client-server model. The first thing you need to do is determine how your network looks. The latency and bandwidth between the client and the server are especially important. Tools like iperf3, ping and traceroute are simple tools which can help you determine the limits of your network. Once you have determined the limits of your network, tools like $dstat and $nicstat help you monitor the network utilization and identify any bottlenecking on your system due to networking.

  • Dstat. This command is used to monitor system resources, including CPU stats, disk stats, network stats, paging stats, and system stats. For monitoring network utilization use the -n option.

    The command reports the throughput of data received and sent by the system.

  • Nicstat. This command prints network interface statistics, including throughput and utilization.

The columns include:

  • Int: interface name
  • %util: the maximum utilization
  • Sat: value reflecting interface saturation statistics
  • Values prefix “r” = read/receive
  • Values prefix “w” = write/transmit
  • 1- KB/s: Kilobytes per second
  • 2- Pk/s: Packets per second
  • 3- Avs: Average packet size in bytes
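A utilization figure like %util comes down to comparing observed throughput with the link speed. The sketch below shows that arithmetic in Python for one direction of traffic. It assumes you sample a cumulative byte counter twice (for example rx_bytes under /sys/class/net/<interface>/statistics/) and know the negotiated link speed; the 25 Gb/s link and the counter values are invented for the example.

```python
def link_utilization(bytes_before, bytes_after, interval_s, link_speed_mbps):
    """Percent of link capacity used in one direction over the interval."""
    bits_per_second = (bytes_after - bytes_before) * 8 / interval_s
    return 100.0 * bits_per_second / (link_speed_mbps * 1e6)


# Hypothetical: 1,562,500,000 bytes received in 1 second on a 25 Gb/s link.
util = link_utilization(0, 1_562_500_000, 1.0, 25_000)
print(f"{util:.1f}% of link capacity")
```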

8. Is my disk a bottleneck?

Like the network, the disk can also be the reason for a poorly performing application. When it comes to measuring disk performance, we look at the following indicators:

  • Utilization
  • Saturation
  • IOPS (Input/Output Operations Per Second)
  • Throughput
  • Response time

A good rule is that when you are selecting a server/instance for an application, you should first run a benchmark test of the disk’s I/O performance so that you get the peak value or “ceiling” of the disk performance and can determine whether the disk performance meets the needs of the application. Flexible I/O is the tool of choice to determine these values.

Once the application is running, you can use $iostat and $dstat to monitor disk resource utilization in real time.

The iostat command shows per-disk I/O statistics, providing metrics for workload characterization, utilization, and saturation.

The first output line shows a summary of the system, including the kernel version, host name, date, architecture and CPU count. The second line shows the CPU summary of the system since boot time.

For each disk device shown in the subsequent rows, it shows the basic details in the columns:

  • tps: Transactions per second
  • kB_read/s: Kilobytes read per second
  • kB_wrtn/s: Kilobytes written per second
  • kB_read: Total kilobytes read
  • kB_wrtn: Total kilobytes written

The dstat command is used to monitor system resources, including CPU stats, disk stats, network stats, paging stats, and system stats. For monitoring disk utilization use the -d option. The option shows the total read (read) and write (writ) throughput on disks.

The image below demonstrates a write-intensive workload.
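The rates that iostat prints are derived from cumulative counters in /proc/diskstats. As a sketch of how IOPS, throughput and utilization fall out of two samples of the same device line, consider the Python below. It relies on the documented field order of /proc/diskstats, where sectors are counted in 512-byte units; the device name and counter values are hypothetical.

```python
def disk_rates(before, after, interval_s):
    """IOPS, kB/s and utilization between two /proc/diskstats lines for one device."""
    first, second = before.split(), after.split()

    def delta(index):
        return int(second[index]) - int(first[index])

    return {
        "read_iops": delta(3) / interval_s,                  # reads completed
        "write_iops": delta(7) / interval_s,                 # writes completed
        "read_kB_s": delta(5) * 512 / 1024 / interval_s,     # sectors read
        "write_kB_s": delta(9) * 512 / 1024 / interval_s,    # sectors written
        "util_pct": 100.0 * delta(12) / (interval_s * 1000), # ms spent doing I/O
    }


# Hypothetical samples taken one second apart on a write-heavy device.
rates = disk_rates(
    "259 0 nvme0n1 1000 0 80000 500 2000 0 160000 900 0 1200 1400",
    "259 0 nvme0n1 1100 0 88000 550 7000 0 1184000 3400 0 2100 3950",
    1,
)
print(rates)
```

In this invented sample, writes dominate both in operations and in throughput, and the device is busy for about 90% of the interval.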

9. Am I paying a NUMA penalty?

Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.

On a NUMA system, the greater the distance between the processor and its memory bank, the slower the processor’s access to that memory bank is. For performance-sensitive applications, the OS should allocate memory from the closest possible memory bank. To monitor the memory allocation of the system or a process in real time, $numastat is a great tool to use.

The numastat command provides statistics for non-uniform memory access (NUMA) systems. These are typically systems with multiple CPU sockets.

Linux tries to allocate memory on the closest NUMA node, and $numastat shows the current statistics of the memory allocation.

  • numa_hit: Memory allocation on the intended NUMA node
  • numa_miss: Shows local allocation that should have been elsewhere
  • numa_foreign: Shows remote allocation that should have been local
  • other_node: Memory allocation on this node while the process is running elsewhere

Both numa_miss and numa_foreign show memory allocations not on the preferred NUMA node. Ideally, the values of numa_miss and numa_foreign should be kept to a minimum, as higher values result in poor memory I/O performance.

The $numastat -p <process-id> command can also be used to see the NUMA distribution of a process.
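To put a number on the NUMA penalty, you can express numa_miss as a fraction of all allocations on a node. The Python sketch below parses the default numastat table (one column per node) and computes that ratio. The counter values are invented, and the function names are only illustrative.

```python
def parse_numastat(text):
    """Return {node: {counter: value}} from default `numastat` output."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    nodes = rows[0]
    stats = {node: {} for node in nodes}
    for name, *values in rows[1:]:
        for node, value in zip(nodes, values):
            stats[node][name] = int(value)
    return stats


def miss_ratio(counters):
    """Fraction of allocations on a node that were numa_miss."""
    total = counters["numa_hit"] + counters["numa_miss"]
    return counters["numa_miss"] / total if total else 0.0


# Hypothetical counters for a two-node system.
sample = """\
                           node0           node1
numa_hit                  950000          900000
numa_miss                      0          100000
numa_foreign              100000               0
"""
stats = parse_numastat(sample)
for node, counters in stats.items():
    print(node, f"{100 * miss_ratio(counters):.1f}% misses")
```

Because the counters are cumulative since boot, comparing two captures taken before and after a run gives a clearer picture of a single workload.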

10. What’s my CPU doing when I am running my application?

When running an application on your system/instance, you will want to know what the application is doing and which CPU resources it uses. $pidstat is a command-line tool which can monitor every individual process running on the system.

pidstat breaks down the top CPU consumers into user time and system time.

This Linux tool prints CPU usage by process or thread, including user and system time. This command can also report I/O statistics of a process (-d option).

  • UID: The real user identification number of the task being monitored
  • PID: The identification number of the task being monitored
  • %usr: Percentage of CPU used by the task while executing at the user level (application), with or without nice priority
  • %system: Percentage of CPU used by the task while executing at the system level (kernel)
  • %wait: Percentage of CPU spent by the task while waiting to run
  • %CPU: Total percentage of CPU time used by the task
  • CPU: Processor/core number to which the task is attached

$pidstat -p can also be run to gather data on a particular process.
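For a single process, a similar user/system split can be approximated from the utime and stime fields of /proc/<pid>/stat, which count clock ticks. The Python sketch below is a simplified illustration and not how pidstat itself is implemented. It assumes 100 ticks per second (on a real system, query os.sysconf("SC_CLK_TCK")), and the sample lines are invented and truncated to the fields that matter.

```python
def user_system_ticks(stat_line):
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so only the text after the last ")" is split into fields.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]), int(rest[12])


def cpu_percent(before, after, interval_s, ticks_per_second=100):
    """Percent of one CPU spent in user and system mode over the interval."""
    user0, sys0 = user_system_ticks(before)
    user1, sys1 = user_system_ticks(after)
    scale = 100.0 / (interval_s * ticks_per_second)
    return {"%usr": (user1 - user0) * scale, "%system": (sys1 - sys0) * scale}


# Hypothetical samples of the same process taken one second apart.
usage = cpu_percent(
    "4242 (my app) S 1 4242 4242 0 -1 4194304 100 0 0 0 300 50 0 0 20 0 4 0 1000",
    "4242 (my app) S 1 4242 4242 0 -1 4194304 100 0 0 0 380 60 0 0 20 0 4 0 1000",
    1,
)
print(usage)
```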

Talk to our expert sales team about partnerships or learn about access to Ampere Systems through our Developer Access Programs.
