Block Device¶
A POSIX container can be exported as a block device via the NVMe-oF protocol. This requires setting up a separate SPDK service on third-party nodes (e.g. dedicated nodes, or running on specific cores of client nodes or DAOS storage nodes) that exports DAOS containers as NVMe targets. This section describes how to configure the SPDK DAOS bdev and access it via the NVMe-oF protocol. It assumes that a DAOS system is already configured.
Configuring NVMe-oF Target¶
It is advisable to configure the host according to the SPDK performance reports.
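For example, the SPDK performance reports typically assume the CPU frequency governor is pinned to performance and a low-latency tuning profile is active; a minimal sketch, assuming the cpupower and tuned tools are available on the host:
$ sudo cpupower frequency-set -g performance
$ sudo tuned-adm profile latency-performance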
Clone the SPDK repo and switch to the daos branch:
$ git clone https://github.com/spdk/spdk.git
$ cd spdk
$ git submodule update --init
$ ./configure --with-daos
$ make -j 16
Tip
If DAOS was built from source, use --with-daos=/path/to/daos/install/dir
The output binaries are located under build/bin
Note
Prior to DAOS v2.2, single-thread performance is capped at ~250k 4k IOPS; this should be rectified by the patch on the daos branch. In the meantime, running one process per disk is preferable.
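A minimal sketch of the process-per-disk workaround, assuming arbitrary core numbers and RPC socket paths: run one nvmf_tgt instance per disk, each pinned to its own core and with its own RPC socket, and point rpc.py at the right instance when configuring it (each instance also needs its own listener port, e.g. 4420 and 4421):
# First terminal: target instance for disk1
$ sudo ./build/bin/nvmf_tgt -m [21] -r /var/tmp/spdk_disk1.sock
# Second terminal: target instance for disk2
$ sudo ./build/bin/nvmf_tgt -m [22] -r /var/tmp/spdk_disk2.sock
# Select the instance to configure with the -s option of rpc.py:
$ sudo ./scripts/rpc.py -s /var/tmp/spdk_disk1.sock nvmf_create_transport -t TCP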
Hugepages should then be configured on the host:
$ sudo HUGE_EVEN_ALLOC=yes scripts/setup.sh
Note
HUGE_EVEN_ALLOC=yes is needed to enable hugepages on all NUMA nodes in the system.
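To check that hugepages were actually reserved on every NUMA node, the per-node counters in sysfs can be inspected (this assumes the default 2 MiB hugepage size):
$ grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages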
In the first terminal, run the nvmf_tgt application:
$ sudo ./build/bin/nvmf_tgt -m [21,22,23,24]
[2023-04-21 09:09:40.791150] Starting SPDK v23.05-pre git sha1 26b9be752 / DPDK 22.11.1 initialization...
[2023-04-21 09:09:40.791194] [ DPDK EAL parameters: nvmf --no-shconf -l 21,22,23,24 --huge-unlink --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid747434 ]
TELEMETRY: No legacy callbacks, legacy socket not created
[2023-04-21 09:09:40.830768] app.c: 738:spdk_app_start: *NOTICE*: Total cores available: 4
[2023-04-21 09:09:40.859580] reactor.c: 937:reactor_run: *NOTICE*: Reactor started on core 22
[2023-04-21 09:09:40.859716] reactor.c: 937:reactor_run: *NOTICE*: Reactor started on core 23
[2023-04-21 09:09:40.859843] reactor.c: 937:reactor_run: *NOTICE*: Reactor started on core 24
[2023-04-21 09:09:40.859844] reactor.c: 937:reactor_run: *NOTICE*: Reactor started on core 21
[2023-04-21 09:09:40.878692] accel_sw.c: 601:sw_accel_module_init: *NOTICE*: Accel framework software module initialized.
Open another terminal for the configuration process. The configuration is done via the scripts/rpc.py script; afterwards it can be dumped into a JSON file that may later be passed to nvmf_tgt. The shortest way to create a couple of disks backed by DAOS DFS is to use the following script (called export_disk.sh):
# Pool/container labels (or UUIDs), disk UUID and listen address can be
# overridden via the POOL, CONT, UUID and TARGET_IP environment variables.
POOL_UUID="${POOL:-pool_label}"
CONT_UUID="${CONT:-const_label}"
DISK_UUID="${UUID:-`uuidgen`}"
NR_DISKS="${1:-1}"
BIND_IP="${TARGET_IP:-172.31.91.61}"

# Create the NVMe-oF TCP transport.
sudo ./scripts/rpc.py nvmf_create_transport -t TCP -u 2097152 -i 2097152

for i in $(seq 1 "$NR_DISKS"); do
    # Create a 1 TiB (1048576 MiB) DAOS bdev with a 4 KiB block size.
    sudo ./scripts/rpc.py bdev_daos_create disk$i ${POOL_UUID} ${CONT_UUID} 1048576 4096 --uuid ${DISK_UUID}
    # Export the bdev as a namespace of a new NVMe-oF subsystem listening on TCP port 4420.
    subsystem=nqn.2016-06.io.spdk$i:cnode$i
    sudo ./scripts/rpc.py nvmf_create_subsystem $subsystem -a -s SPDK0000000000000$i -d SPDK_Virtual_Controller_$i
    sudo ./scripts/rpc.py nvmf_subsystem_add_ns $subsystem disk$i
    sudo ./scripts/rpc.py nvmf_subsystem_add_listener $subsystem -t tcp -a ${BIND_IP} -s 4420
done
The subsystem name (NQN) and the UUID of the disk are important for multi-pathing and have to match across different controllers (nodes). The default values are intended for the dev setup on the daos2 node.
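For example, to expose the same disk from a second target node, a sketch (the UUID value and the second node's address below are placeholders) would reuse the UUID assigned on the first node so that both controllers present an identical namespace:
# Run on the second target node; the listener binds to that node's own address.
$ UUID=<uuid-of-disk1-on-first-node> POOL=denisb CONT=nvmetest TARGET_IP=<second-node-ip> sh export_disk.sh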
The script optionally takes the number of disks to export as its first argument:
denis@daos2:~/spdk> POOL=denisb CONT=nvmetest sh export_disk.sh 2
disk1
disk2
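Once the subsystems are set up, the running configuration can be dumped to a JSON file with rpc.py and passed back to nvmf_tgt on the next start, so the export does not have to be rebuilt by hand (the file name below is arbitrary):
$ sudo ./scripts/rpc.py save_config > nvmf_daos.json
$ sudo ./build/bin/nvmf_tgt -m [21,22,23,24] -c nvmf_daos.json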
Accessing the Device¶
On the node where you want to access the block device, make sure that nvme-cli is installed and the nvme-tcp module is loaded via sudo modprobe nvme-tcp. To connect to the target disks, run:
$ sudo nvme connect-all -t tcp -a 172.31.91.61 -s 4420
After successful execution, the new NVMe drives should appear in the system:
$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S4YPNE0N800124 SAMSUNG MZWLJ3T8HBLS-00007 1 3.84 TB / 3.84 TB 512 B + 0 B EPK98B5Q
/dev/nvme1n1 SPDK00000000000001 SPDK_Virtual_Controller_1 1 1.10 TB / 1.10 TB 4 KiB + 0 B 23.05
/dev/nvme2n1 SPDK00000000000002 SPDK_Virtual_Controller_2 1 1.10 TB / 1.10 TB 4 KiB + 0 B 23.05
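If the same subsystem is also exported from a second target node (as sketched above), connecting to that address as well adds a second path, and the kernel groups the paths provided native NVMe multipathing is enabled; the paths can be inspected with nvme list-subsys (the second address is a placeholder):
$ sudo nvme connect-all -t tcp -a <second-node-ip> -s 4420
$ sudo nvme list-subsys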
The block devices can now be accessed to run fio or to mount a filesystem:
$ sudo mkfs.ext4 /dev/nvme1n1
$ sudo mkdir /testfs
$ sudo mount /dev/nvme1n1 /testfs
$ df -h /testfs
Filesystem Size Used Avail Use% Mounted on
/dev/nvme1n1 1007G 1.1G 955G 1% /testfs
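The exported disks can also be exercised directly with fio; a minimal random-read sketch against the second (unused) disk, assuming fio with the libaio engine is installed:
$ sudo fio --name=randread --filename=/dev/nvme2n1 --direct=1 --ioengine=libaio \
      --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based \
      --group_reporting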
The resulting filesystem should not be modified concurrently from different client nodes. It is recommended to use the ext4 multiple mount protection (MMP) feature to avoid corrupting the filesystem from different client nodes. This feature can be enabled as follows:
$ sudo umount /testfs
$ sudo mkfs.ext4 -F -O mmp /dev/nvme1n1
$ sudo e2mmpstatus /dev/nvme1n1
e2mmpstatus: it is safe to mount '/dev/nvme1n1', MMP is clean
$ sudo mount /dev/nvme1n1 /testfs
It is possible to mount the filesystem on one node, inject data, unmount it, and then mount it on multiple nodes in read-only mode (-o ro mount option).
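For example, on each additional client node (the device name may differ from node to node):
$ sudo mount -o ro /dev/nvme1n1 /testfs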
Once all is done, clean up by running the following on the initiator side:
$ sudo nvme disconnect-all
Then shut down the nvmf_tgt; otherwise, the Linux kernel might get very upset about the missing drives.