DAOS Performance Tuning¶
This section documents how to validate the performance of the baseline building blocks in a DAOS system (i.e. network and storage) and then the full stack.
Network Performance¶
The DAOS CaRT (Collective and RPC Transport) layer can validate and benchmark network communications in the same context as an application and using the same networks/tuning options as regular DAOS.
The CaRT self_test can run against the DAOS servers in a production environment in a non-destructive manner. The only requirement is to have a formatted DAOS system and the DAOS agent running on the client node where self_test is run.
Parameters¶
self_test supports different message sizes, bulk transfers, multiple targets, and the following test scenarios:
- Selftest client to servers - where self_test issues RPCs directly to a list of servers.
- Cross-servers - where self_test sends instructions to the different servers that will issue cross-server RPCs. This model supports a many-to-many communication model.
The mode is selected via the --master-endpoint option. If this option is not specified on the command line, then we are in the first mode and the self_test binary itself issues the RPCs. If one or several master endpoints are specified, then we are in the cross-server mode.
An endpoint consists of a pair of values separated by a colon. The first value is the rank, which matches the engine rank displayed in dmg system query. The second value is called the tag and identifies the service thread in the engine.
The DAOS engine uses the following mapping:
- tag 0 is used by the metadata service handling pool and container operations.
- tag 1 is used for cross-server monitoring (SWIM).
- tags 2 to [#targets + 1] are used by DAOS targets (one tag per target).
- tags [#targets + 2] to [#targets + #helpers + 1] are used by helper service threads.
As an example, an engine with targets: 16 and nr_xs_helpers: 4 would use the following tag distribution:
- tag 0: metadata service
- tag 1: monitoring service
- tag 2-17: targets 0 to 15 (16 targets total)
- tag 18-21: helper service
For a total of 22 endpoints (tags 0 to 21) exported by this engine.
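The mapping above can be turned into a small helper, which is handy when choosing endpoint ranges for self_test. This is only a sketch of the arithmetic described in this section; the function name tag_layout is made up for illustration and is not part of DAOS.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a DAOS tool): print the tag layout of one engine
# from its number of targets and helper service threads, following the
# mapping described in the text above.
tag_layout() {
    local targets=$1 helpers=$2
    echo "tag 0: metadata service"
    echo "tag 1: monitoring service"
    echo "tag 2-$(( targets + 1 )): targets 0 to $(( targets - 1 ))"
    if [ "$helpers" -gt 0 ]; then
        echo "tag $(( targets + 2 ))-$(( targets + helpers + 1 )): helper service"
    fi
    # The highest tag is the upper bound of a tag range in an endpoint list.
    echo "highest tag: $(( targets + helpers + 1 ))"
}

# Example from the text: targets: 16 and nr_xs_helpers: 4
tag_layout 16 4
```

The "highest tag" line gives the upper bound to use when an endpoint range is meant to cover every service thread of an engine.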
The RPC flow sent over the network can be configured via the --message-sizes option, which takes a list of size tuples. Performance is reported individually for each tuple. Each size integer can be prefixed with a single character to specify the underlying transport mechanism. Available types are:
- 'e' - Empty (no payload)
- 'i' - I/O vector (IOV)
- 'b' - Bulk transfer
For example, (b1000) would transfer 1000 bytes via bulk in both directions. Similarly, (i100 b1000) would use IOV to send and bulk to reply.
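As a small illustration of the tuple syntax, the sketch below assembles size tuples from a byte count computed in the shell, which avoids typing large sizes by hand. The helper name size_tuple is hypothetical; only the "(send reply)" format and the e/i/b prefixes come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build one self_test size tuple "(<send> <reply>)".
# Each argument is a size in bytes with an optional type prefix (e, i or b).
size_tuple() {
    printf '(%s %s)' "$1" "$2"
}

# 1 MiB expressed in bytes
mib=$(( 1024 * 1024 ))

# Bulk transfer on the reply side only, then on the send side only.
bulk_reply=$(size_tuple 0 "b${mib}")
bulk_send=$(size_tuple "b${mib}" 0)

# The resulting string can be passed as the value of --message-sizes.
echo "${bulk_reply} ${bulk_send}"
```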
--repetitions-per-size defines the number of samples per message size per endpoint, with a default value of 10000. --max-inflight-rpcs determines the number of RPCs issued concurrently.
self_test has many more options. For a full description of self_test usage, please run:
$ self_test --help
Example: Client-to-Servers¶
To run self_test in client-to-servers mode:
self_test -u --group-name daos_server --endpoint 0:2 --message-sizes '(0 b1048576)' --max-inflight-rpcs 16 --repetitions-per-size 100000
This will send 100k RPCs, each with an empty request, a bulk put of 1MB and then an empty reply, from the node where the self_test application is running to the first target of engine rank 0. This workload effectively simulates a 1MB fetch/read RPC over DAOS.
A 1MB update/write RPC would be simulated with the following command:
self_test -u --group-name daos_server --endpoint 0:2 --message-sizes "(b1048576 0)" --max-inflight-rpcs 16 --repetitions-per-size 100000
The RPC rate with an empty request and reply is also often useful to evaluate the maximum capabilities of the network. This can be achieved with the following command line:
self_test -u --group-name daos_server --endpoint 0:2 --message-sizes "(0 0)" --max-inflight-rpcs 16 --repetitions-per-size 100000
0 could be replaced with i2048, for instance, to send a payload of 2KB.
All three tests can be combined in a single run:
self_test -u --group-name daos_server --endpoint 0:2 --message-sizes "(0 0) (b1048576 0) (0 b1048576)" --max-inflight-rpcs 16 --repetitions-per-size 100000
RPCs can also be sent to a range of engine ranks and tags as follows:
self_test -u --group-name daos_server --endpoint 0-<MAX_RANK>:0-<MAX_TAG> --message-sizes "(0 0) (b1048576 0) (0 b1048576)" --max-inflight-rpcs 16 --repetitions-per-size 100000
Note
By default, self_test will use the network interface selected by the agent. A different interface can be forced by setting the OFI_INTERFACE and OFI_DOMAIN environment variables manually, e.g. export OFI_INTERFACE=eth0; export OFI_DOMAIN=eth0 or export OFI_INTERFACE=ib0; export OFI_DOMAIN=mlx5_0.
Note
Depending on the hardware configuration, the agent might assign a different network interface to the self_test application based on the NUMA node where the process is scheduled. It is thus recommended to use taskset to bind the self_test process to a specific core, e.g. taskset -c 1 self_test ...
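To pick a core for taskset that is local to a given network interface, the NUMA node of the interface and the CPUs of that node can be read from Linux sysfs. The sketch below assumes a Linux client with the usual sysfs layout; the function names and the optional sysfs-root argument are illustrative only.

```shell
#!/usr/bin/env bash
# Print the NUMA node of a network interface as exposed by Linux sysfs.
# A value of -1 means the kernel reports no NUMA affinity for the device.
# The optional second argument overrides the sysfs root (useful for testing).
iface_numa_node() {
    local iface=$1 sysfs=${2:-/sys}
    cat "${sysfs}/class/net/${iface}/device/numa_node"
}

# Print the CPU list of a NUMA node in the range format that taskset -c
# accepts (for example "0-17,36-53").
numa_node_cpus() {
    local node=$1 sysfs=${2:-/sys}
    cat "${sysfs}/devices/system/node/node${node}/cpulist"
}
```

With these helpers, something like node=$(iface_numa_node ib0) followed by taskset -c "$(numa_node_cpus "$node")" self_test ... would keep the process on cores local to ib0, where ib0 is only an example interface name.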
Example: Cross-Servers¶
To run self_test in cross-servers mode:
$ self_test -u --group-name daos_server --endpoint 0-<MAX_SERVER-1>:0 \
--master-endpoint 0-<MAX_RANK>:0-<MAX_TAG> \
--message-sizes "b1048576,b1048576 0,0 b1048576,i2048,i2048 0,0 i2048" \
--max-inflight-rpcs 16 --repetitions-per-size 100
The command above will run the self_test benchmark using the following message sizes:
b1048576      1MB bulk transfer, Get and Put
b1048576 0    1MB bulk transfer, Get only
0 b1048576    1MB bulk transfer, Put only
i2048         2KB iovec, Input and Output
i2048 0       2KB iovec, Input only
0 i2048       2KB iovec, Output only
Note
The number of repetitions, the maximum number of inflight RPCs and the message sizes can be adjusted to suit the particular test or experiment.
Storage Performance¶
SCM¶
DAOS provides a tool called vos_perf to benchmark the versioning object store over storage-class memory. For a full description of vos_perf usage, please run:
$ vos_perf --help
The -R option is used to define the operation to be performed:
- U for update (i.e. write) operation
- F for fetch (i.e. read) operation
- P for punch (i.e. truncate) operation
- p to display the performance result for the previous operation.
For instance, -R "U;p F;p" means update the keys, print the update rate/bandwidth, fetch the keys and then print the fetch rate/bandwidth. The number of objects, dkeys, akeys and values can be set via the -o, -d, -a and -n options, respectively. The value size is specified via the -s parameter (e.g. -s 4K for 4K value).
For instance, to measure the rate of 10M update and fetch operations in VOS mode, mount the pmem device and then run:
$ cd /mnt/daos0
$ df .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/pmem0 4185374720 49152 4118216704 1% /mnt/daos0
$ taskset -c 1 vos_perf -f ./vos -P 100G -d 10000000 -a 1 -n 1 -s 4K -z -R "U;p F;p"
Test :
VOS (storage only)
Pool :
a3b7ff28-56ff-4974-9283-62990dd770ad
Parameters :
pool size : SCM: 102400 MB, NVMe: 0 MB
credits : -1 (sync I/O for -ve)
obj_per_cont : 1 x 1 (procs)
dkey_per_obj : 10000000 (buf)
akey_per_dkey : 1
recx_per_akey : 1
value type : single
stride size : 4096
zero copy : yes
VOS file : ./vos0
Running test=UPDATE
Running UPDATE test (iteration=1)
UPDATE successfully completed:
duration : 124.478568 sec
bandwidth : 313.809 MB/sec
rate : 80335.11 IO/sec
latency : 12.448 us (nonsense if credits > 1)
Duration across processes:
MAX duration : 124.478568 sec
MIN duration : 124.478568 sec
Average duration : 124.478568 sec
Completed test=UPDATE
Running test=FETCH
Running FETCH test (iteration=1)
FETCH successfully completed:
duration : 23.884087 sec
bandwidth : 1635.503 MB/sec
rate : 418688.81 IO/sec
latency : 2.388 us (nonsense if credits > 1)
Duration across processes:
MAX duration : 23.884087 sec
MIN duration : 23.884087 sec
Average duration : 23.884087 sec
Taskset is used to change the CPU affinity of the vos_perf process.
Warning
Performance of persistent memory may be impacted by NUMA affinity. It is thus recommended to set the affinity of vos_perf to a CPU core locally attached to the persistent memory device.
The same test can be performed on the 2nd pmem device to compare the performance.
$ cd /mnt/daos1/
$ df .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/pmem1 4185374720 262144 4118003712 1% /mnt/daos1
$ taskset -c 36 vos_perf -f ./vos -P 100G -d 10000000 -a 1 -n 1 -s 4K -z -R "U;p F;p"
Test :
VOS (storage only)
Pool :
9d6c3fbd-a4f1-47d2-92a9-6112feb52e74
Parameters :
pool size : SCM: 102400 MB, NVMe: 0 MB
credits : -1 (sync I/O for -ve)
obj_per_cont : 1 x 1 (procs)
dkey_per_obj : 10000000 (buf)
akey_per_dkey : 1
recx_per_akey : 1
value type : single
stride size : 4096
zero copy : yes
VOS file : ./vos0
Running test=UPDATE
Running UPDATE test (iteration=1)
UPDATE successfully completed:
duration : 123.389467 sec
bandwidth : 316.579 MB/sec
rate : 81044.19 IO/sec
latency : 12.339 us (nonsense if credits > 1)
Duration across processes:
MAX duration : 123.389467 sec
MIN duration : 123.389467 sec
Average duration : 123.389467 sec
Completed test=UPDATE
Running test=FETCH
Running FETCH test (iteration=1)
FETCH successfully completed:
duration : 24.114830 sec
bandwidth : 1619.854 MB/sec
rate : 414682.58 IO/sec
latency : 2.411 us (nonsense if credits > 1)
Duration across processes:
MAX duration : 24.114830 sec
MIN duration : 24.114830 sec
Average duration : 24.114830 sec
Completed test=FETCH
Bandwidth can be tested by using a larger record size (i.e. -s option). For instance:
$ taskset -c 36 vos_perf -f ./vos -P 100G -d 40000 -a 1 -n 1 -s 1M -z -R "U;p F;p"
Test :
VOS (storage only)
Pool :
dc44f0dd-930e-43b1-b599-5cc141c868d9
Parameters :
pool size : SCM: 102400 MB, NVMe: 0 MB
credits : -1 (sync I/O for -ve)
obj_per_cont : 1 x 1 (procs)
dkey_per_obj : 40000 (buf)
akey_per_dkey : 1
recx_per_akey : 1
value type : single
stride size : 1048576
zero copy : yes
VOS file : ./vos0
Running test=UPDATE
Running UPDATE test (iteration=1)
UPDATE successfully completed:
duration : 21.247287 sec
bandwidth : 1882.593 MB/sec
rate : 1882.59 IO/sec
latency : 531.182 us (nonsense if credits > 1)
Duration across processes:
MAX duration : 21.247287 sec
MIN duration : 21.247287 sec
Average duration : 21.247287 sec
Completed test=UPDATE
Running test=FETCH
Running FETCH test (iteration=1)
FETCH successfully completed:
duration : 10.133850 sec
bandwidth : 3947.167 MB/sec
rate : 3947.17 IO/sec
latency : 253.346 us (nonsense if credits > 1)
Duration across processes:
MAX duration : 10.133850 sec
MIN duration : 10.133850 sec
Average duration : 10.133850 sec
Completed test=FETCH
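The figures reported in these runs can be cross-checked against each other: with synchronous I/O (credits of -1), bandwidth is the rate multiplied by the value size, and latency is the inverse of the rate. The sketch below redoes that arithmetic with awk, taking 1 MB as 1048576 bytes, which is what the output above appears to use. The function names are illustrative and not part of any DAOS tool.

```shell
#!/usr/bin/env bash
# bandwidth (MB/sec) = rate (IO/sec) * value size (bytes) / 1048576
bandwidth_mb() {
    awk -v rate="$1" -v size="$2" 'BEGIN { printf "%.3f\n", rate * size / 1048576 }'
}

# latency (us) = 1000000 / rate (IO/sec); only meaningful for synchronous I/O
latency_us() {
    awk -v rate="$1" 'BEGIN { printf "%.3f\n", 1000000 / rate }'
}

# First 4K UPDATE run above: rate of 80335.11 IO/sec with a 4096-byte value
bandwidth_mb 80335.11 4096
latency_us 80335.11
```

For the first 4K UPDATE run this reproduces the reported 313.809 MB/sec and 12.448 us.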
Note
With 3rd Gen Intel® Xeon® Scalable processors (ICX), the PMEM_NO_FLUSH environment variable can be set to 1 to take advantage of the extended asynchronous DRAM refresh (eADR) feature.
A tool called daos_perf with the same syntax as vos_perf is also available to run tests from a compute node with the full DAOS stack. Please refer to the next section for more information.
SSDs¶
Performance of SSDs can be measured directly with SPDK via the spdk_nvme_perf tool. It can be run to test bandwidth in a non-destructive way as follows:
spdk_nvme_perf -q 16 -o 1048576 -w read -c 0xff -t 60
IOPS can be measured with the following command:
spdk_nvme_perf -q 16 -o 4096 -w read -c 0xff -t 60
-q controls the queue depth, -o the I/O size, and -w the operation, which can be either (rand)read, (rand)write or (rand)rw. The test duration (in seconds) is defined by the -t parameter.
Warning
write and rw options are destructive.
This command uses all the available SSDs. Specific SSDs can be specified via the --allowed-pci-addr option followed by the PCIe addresses of the SSDs of interest.
The -c option is used to specify the CPU cores used to submit I/O in the form of a core mask. -c 0xff uses the first 8 cores.
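A core mask is a bit field in which bit N selects core N. As a minimal sketch, the shell arithmetic below derives the mask for the first N cores or for an explicit list of core IDs. The function names are made up for illustration, and the result is limited to what fits in a 64-bit shell integer.

```shell
#!/usr/bin/env bash
# Print a hexadecimal mask selecting the first N cores (N must be below 63).
first_cores_mask() {
    printf '0x%x\n' $(( (1 << $1) - 1 ))
}

# Print a hexadecimal mask selecting an explicit list of core IDs.
cores_mask() {
    local mask=0 core
    for core in "$@"; do
        mask=$(( mask | (1 << core) ))
    done
    printf '0x%x\n' "$mask"
}

first_cores_mask 8    # the first 8 cores
cores_mask 0 2 4 6    # every other core among the first 8
```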
Note
On storage nodes using Intel VMD, the --enable-vmd option must be specified.
Many more options are available. Please run spdk_nvme_perf to see the list of parameters that can be tweaked.
End-to-end Performance¶
DAOS can be benchmarked using several widely used IO benchmarks like IOR, mdtest, and FIO. There are several backends that can be used with those benchmarks.
ior¶
IOR (https://github.com/hpc/ior) can be used with the following backends:
- The IOR APIs POSIX, MPIIO and HDF5 can be used with DAOS POSIX containers that are accessed over dfuse. This works with or without the I/O interception library (libioil). Performance is significantly better when using libioil. For detailed information on dfuse usage with the I/O interception library, please refer to the [POSIX DFUSE section][7].
- A custom DFS (DAOS File System) plugin for DAOS can be used by building IOR with DAOS support and selecting API=DFS. This integrates IOR directly with the DAOS File System (libdfs), without requiring FUSE or an interception library. Please refer to the [DAOS README][10] in the hpc/ior repository for some basic instructions on how to use the DFS driver.
- When using the IOR API=MPIIO, the ROMIO ADIO driver for DAOS can be used by providing the daos:// prefix to the filename. This ADIO driver bypasses dfuse and directly invokes the libdfs calls to perform I/O to a DAOS POSIX container. The DAOS-enabled MPIIO driver is available in the upstream MPICH repository and included with Intel MPI. Please refer to the [MPI-IO documentation][8].
- An HDF5 VOL connector for DAOS is under development. This maps the HDF5 data model directly to the DAOS data model, and works in conjunction with DAOS containers of --type=HDF5 (in contrast to DAOS containers of --type=POSIX that are used for the other IOR APIs). Please refer to the [HDF5 with DAOS documentation][9].
IOR has several parameters to characterize performance. The main parameters to work with include:
- transfer size (-t)
- block size (-b)
- segment size (-s)
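These three parameters together determine how much data a run moves. The sketch below is based on the usual IOR convention, which is an assumption here and not taken from this document: each task writes one block per segment, a block is written in transfer-size chunks, and the block size is expected to be a multiple of the transfer size. The function name ior_volume is hypothetical and sizes are given in bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate the data volume of an IOR run.
# Arguments: transfer size, block size (both in bytes), segment count, task count.
ior_volume() {
    local transfer=$1 block=$2 segments=$3 tasks=$4
    if [ $(( block % transfer )) -ne 0 ]; then
        echo "block size should be a multiple of the transfer size" >&2
        return 1
    fi
    echo "transfers per block: $(( block / transfer ))"
    echo "bytes per task: $(( block * segments ))"
    echo "aggregate bytes: $(( block * segments * tasks ))"
}

# 1 MiB transfers, 16 MiB blocks, 4 segments, 8 MPI tasks
ior_volume 1048576 16777216 4 8
```

An estimate like this helps size a run so that the aggregate volume is large enough to measure steady-state behaviour while still fitting in the pool.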
For more use cases, the IO-500 workloads are a good starting point to measure performance on a system: https://github.com/IO500/io500
mdtest¶
mdtest is released in the same repository as IOR. The corresponding backends that are listed above support mdtest, except for the MPI-IO and HDF5 backends that were only designed to support IOR. The [DAOS README][10] in the hpc/ior repository includes some examples to run mdtest with DAOS.
The IO-500 workloads for mdtest provide some good criteria for performance measurements.
FIO¶
A DAOS engine is integrated into FIO and available upstream. To build it, just run:
$ git clone http://git.kernel.dk/fio.git
$ cd fio
$ ./configure
$ make install
If DAOS is installed via packages, it should be automatically detected. If not, please specify the path to the DAOS library and headers to configure as follows:
$ CFLAGS="-I/path/to/daos/install/include" LDFLAGS="-L/path/to/daos/install/lib64" ./configure
Once successfully built, one can run the default example:
$ export POOL= # your pool UUID
$ export CONT= # your container UUID
$ fio ./examples/dfs.fio
Please note that DAOS does not transfer data (i.e. zeros) over the network when reading a hole in a sparse POSIX file. Very high read bandwidth can thus be reported if fio reads unallocated extents in a file. It is thus a good practice to start fio with a first write phase.
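One way to follow this practice is a job file whose read phase starts only after a write phase has populated the file. The sketch below generates such a file from the shell. It is an illustration, not the upstream example: the job names, file name, block size and file size are arbitrary choices, and the ioengine, pool and cont option names are assumed to match those used by the upstream dfs example and should be checked against the fio documentation for the installed version.

```shell
#!/usr/bin/env bash
# Write a two-phase fio job file to the path given as first argument.
# The heredoc is quoted so that ${POOL} and ${CONT} stay literal in the file
# and are expanded by fio from the environment at run time.
write_then_read_job() {
    cat > "$1" <<'EOF'
[global]
ioengine=dfs
pool=${POOL}
cont=${CONT}
filename=fio-sketch.dat
bs=1M
size=1G

[write-phase]
rw=write

[read-phase]
stonewall
rw=read
EOF
}
```

After calling write_then_read_job ./write-then-read.fio, the job can be started with fio ./write-then-read.fio. The stonewall option makes the read job wait until the write job has completed, so the read phase does not hit unallocated extents.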
FIO can also be used to benchmark DAOS performance using dfuse and the interception library with all the POSIX based engines like sync and libaio.
daos_perf¶
Finally, DAOS provides a tool called daos_perf which allows benchmarking the DAOS object API directly. It has a syntax similar to vos_perf and, like IOR, can be run as an MPI application. For a full description of daos_perf usage, please run:
$ daos_perf --help
Like vos_perf, the -R option is used to define the operation to be performed:
- U for update (i.e. write) operation
- F for fetch (i.e. read) operation
- P for punch (i.e. truncate) operation
- p to display the performance result for the previous operation.
For instance, -R "U;p F;p" means update the keys, print the update rate/bandwidth, fetch the keys and then print the fetch rate/bandwidth. The number of objects, dkeys, akeys and values can be set via the -o, -d, -a and -n options, respectively. The value size is specified via the -s parameter (e.g. -s 4K for 4K value).
Client Tuning¶
For best performance, a DAOS client should specifically bind itself to a NUMA node instead of leaving core allocation and memory binding to chance. This allows the DAOS Agent to detect the client's NUMA affinity from its PID and automatically assign a network interface with a matching NUMA node. The network interface provided in the GetAttachInfo response is used to initialize CaRT.
To override the automatically assigned interface, the client should set the environment variable OFI_INTERFACE to match the desired network interface.
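As a sketch of explicit binding, a client can be launched under numactl so that both its CPUs and its memory stay on one NUMA node. The helper below only prints the command line instead of running it. The function name bind_cmd and the application name are placeholders, and numactl is assumed to be installed on the client node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) a command line that binds a client
# to one NUMA node for both CPU scheduling and memory allocation.
bind_cmd() {
    local node=$1
    shift
    printf 'numactl --cpunodebind=%s --membind=%s' "$node" "$node"
    printf ' %s' "$@"
    printf '\n'
}

# ./my_daos_app stands in for any DAOS client application.
bind_cmd 1 ./my_daos_app --some-option
```

Binding this way gives the agent a stable NUMA affinity to match against. Setting OFI_INTERFACE in the environment before the launch remains the way to override its choice.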
The DAOS Agent scans the client machine on the first GetAttachInfo request to determine the set of network interfaces available that support the DAOS Server's OFI provider. This request occurs as part of the initialization sequence in the libdaos daos_init() performed by each client.
Upon receipt, the Agent populates a cache of responses indexed by NUMA affinity. Provided a client application has bound itself to a specific NUMA node and that NUMA node has a network device associated with it, the DAOS Agent will provide a GetAttachInfo response with a network interface corresponding to the client's NUMA node.
When more than one appropriate network interface exists per NUMA node, the agent uses a round-robin resource allocation scheme to load balance the responses for that NUMA node.
If a client is bound to a NUMA node that has no matching network interface, then a default NUMA node is used for the purpose of selecting a response. Provided that the DAOS Agent can detect any valid network device on any NUMA node, the default response will contain a valid network interface for the client. When a default response is provided, a message in the Agent's log is emitted:
No network devices bound to client NUMA node X. Using response from NUMA Y
To improve performance, it is worth figuring out if the client bound itself to the wrong NUMA node, or if expected network devices for that NUMA node are missing from the Agent's fabric scan.
In some situations, the Agent may detect no network devices and the response cache will be empty. In such a situation, the GetAttachInfo response will contain no interface assignment and the following info message will be found in the Agent's log:
No network devices detected in fabric scan; default AttachInfo response may be incorrect
In either situation, the admin may execute the command daos_agent net-scan with appropriate debug flags to gain more insight into the configuration problem.
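Before running a fabric scan, it can help to check how often the two messages quoted above appear in the agent log. The sketch below counts them in a log file passed as an argument. The location of the agent log depends on the agent configuration and is deliberately not assumed here; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Count the two fabric-related messages quoted above in an agent log file.
# Argument: path to the agent log file.
scan_agent_log() {
    local log=$1
    local fallback empty
    # grep -c prints 0 (and returns non-zero) when nothing matches.
    fallback=$(grep -c 'No network devices bound to client NUMA node' "$log")
    empty=$(grep -c 'No network devices detected in fabric scan' "$log")
    echo "NUMA fallback responses: ${fallback}"
    echo "empty fabric scans: ${empty}"
}
```

A non-zero count for the first message suggests a client bound to a NUMA node without a matching interface, while the second points at a fabric scan that found no usable device.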
Disabling the GetAttachInfo cache:
The default configuration enables the Agent GetAttachInfo cache. If desired, the cache may be disabled prior to DAOS Agent startup by setting the Agent's environment variable DAOS_AGENT_DISABLE_CACHE=true. The cache is loaded only at Agent startup. The following debug message will be found in the Agent's log:
GetAttachInfo agent caching has been disabled
If the network configuration changes while the Agent is running, it must be restarted to gain visibility to these changes. For additional information, please refer to the System Deployment: Agent Startup documentation section.