File System

A container can be mounted as a shared POSIX namespace on multiple compute nodes. This capability is provided by the libdfs library, which implements file and directory abstractions over the native libdaos library. The POSIX emulation can be exposed directly to applications or I/O frameworks that support pluggable storage backends (e.g., frameworks like Spark or TensorFlow, or benchmarks like IOR or mdtest). It can also be exposed transparently via a FUSE daemon, optionally combined with an interception library that addresses some of the FUSE performance bottlenecks by delivering full OS bypass for POSIX read/write operations.

[Figure: ../graph/posix.png]

Performance is generally best when using the DFS API directly. Using the I/O interception library with dfuse should yield nearly the same performance for I/O operations (read/write) as the DFS API, with minimal overhead. Metadata operations (file creation, deletion, rename, etc.) over dfuse will be much slower than with the DFS API, since there is no interception to bypass the FUSE/kernel layer.

libdfs

The DAOS File System (DFS) is implemented in the libdfs library, and allows a DAOS container to be accessed as a hierarchical POSIX namespace. libdfs supports files, directories, and symbolic links, but not hard links. Access permissions are inherited from the parent pool and are not implemented on a per-file or per-directory basis.

Supported Operations

The DFS API closely mirrors the POSIX API. It includes operations to:

  • Mount: create/open superblock and root object
  • Un-mount: release open handles
  • Lookup: traverse a path and return an open file/dir handle
  • IO: read & write with an iovec
  • Stat: retrieve attributes of an entry
  • Mkdir: create a dir
  • Readdir: enumerate all entries under a directory
  • Open: create/open a file/dir
  • Remove: unlink a file/dir
  • Move: rename
  • Release: close an open handle of a file/dir
  • Extended Attributes: set, get, list, remove

POSIX Compliance

POSIX support in DAOS comes with the following limitations:

  • Hard links are currently not supported.
  • Flock operations are not supported (except possibly at the local-node level within dfuse).
  • mmap support with MAP_SHARED is consistent from a single client only, and only when data caching is enabled. Note that this is supported through DFuse only (i.e., not through the DFS API). The dfuse-data-cache=otoc container attribute allows this without enabling other caching.
  • Char devices, block devices, sockets and pipes are not supported.
  • User/group quotas are not supported.
  • Access time (atime) is always reported as the greater of the change time and the modify time.
  • The block size reported in the stat buffer is not accurate (it does not account for holes or extended attributes).
  • Various parameters reported via statfs like number of blocks, files, free/available space are not accurate.
  • O_APPEND mode for files is not supported. When O_APPEND is used on file open with dfuse, the open does not fail with an unsupported error, but appends are consistent only on the local node (dfuse instance) where the operation was executed. This means that if O_APPEND is used over dfuse with multiple dfuse instances, the appends to the file are not consistent and may corrupt the file.
  • The sticky bit, POSIX ACLs, and supplementary groups are not supported.
  • While set_uid/gid bits are stored by libdfs on setattr and returned on getattr, it is up to the caller (e.g. fuse in the case of dfuse) to implement support for setuid/gid binaries since libdfs does not provide any interface to execute binaries.
  • POSIX permissions are only stored and enforced at the DFS level and are provided for convenience purposes. Security of access to the DFS container should be properly set at the DAOS pool and/or container level using DAOS ACLs. This means that a user should not rely on those POSIX permissions for securing access to their data, since they can be bypassed through the lower-level DAOS API if the user has ACL access to the container.
  • Open-unlink semantics: This occurs when a client obtains an open handle on an object (file or directory) and accesses that object (reads/writes data or creates other files) while another client removes that same object from under it. DAOS does not track object open handles, as that would be very expensive. In such conflicting cases, the worst-case scenario is lost/leaked space: the data written to orphaned objects that have been unlinked from the namespace. DAOS implements a file system checker that can be used either to relink those orphaned objects into a lost+found directory or to remove them from the container to reclaim the space.
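
Because O_APPEND is not reliable across dfuse instances, writers that share a file need another way to avoid overlapping writes. One conflict-free pattern, shown here as an illustration rather than a DAOS-provided mechanism, is to give each writer its own fixed-size slot and to write at an explicit offset. The record size, the rank-based layout, and the function name are assumptions made for this sketch.

```python
import os
import tempfile

RECORD_SIZE = 64  # hypothetical fixed record length, in bytes


def write_record(path, writer_rank, payload):
    """Write one padded record into the slot reserved for writer_rank."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload does not fit in one record")
    record = payload.ljust(RECORD_SIZE, b"\0")
    # Open without O_APPEND: the offset is computed from the rank,
    # so no writer depends on the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, record, writer_rank * RECORD_SIZE)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "shared.log")
        write_record(target, 1, b"from rank 1")
        write_record(target, 0, b"from rank 0")
        print(os.path.getsize(target))
```

The order in which the writers run does not matter, since each one touches only its own byte range.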

Note

DFS directories do not include the . (current directory) and .. (parent directory) directory entries that are known from other POSIX filesystems. Commands like ls -al will not include these entries in their output. Those directory entries are not required by POSIX, so this is not a limitation to POSIX compliance. But scripts that parse directory listings under the assumption that those dot directories are present may need to be adapted to correctly handle this situation. Note that operations like cd . or cd .. will still succeed in dfuse-mounted POSIX containers.
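
Scripts that need to be robust to the missing dot entries can enumerate directories through an API that never reports them. The sketch below uses Python's os.scandir, which omits '.' and '..' on every filesystem; the helper names are made up for this example.

```python
import os
import tempfile


def list_entries(directory):
    """Return the sorted names in a directory.

    os.scandir() never yields '.' or '..', so the result is the same
    whether or not the underlying filesystem stores those entries.
    """
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries)


def count_subdirectories(directory):
    """Count child directories without assuming dot entries exist."""
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "sub"))
        open(os.path.join(tmp, "a.txt"), "w").close()
        print(list_entries(tmp), count_subdirectories(tmp))
```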

It is possible to use libdfs in a parallel application from multiple nodes. DFS provides two modes that offer different levels of consistency. The mode can be set at container creation time:

1) Relaxed mode for well-behaved applications that generate conflict-free operations for which a very high level of concurrency will be supported.

2) Balanced mode for applications that require stricter consistency at the cost of performance. This mode is currently not fully supported and DFS by default will use the relaxed mode.

On container access, if the container was created with balanced mode, it can be accessed in balanced mode only. If the container was created with relaxed mode, it can be accessed in relaxed or balanced mode. In either mode, there is one consistency semantic issue that is not properly handled: the open-unlink semantics described in the list of POSIX limitations above.

Other consistency issues are handled differently between the two consistency modes:

  • Same Operation Executed Concurrently (Supported in both Relaxed and Balanced Mode): For example, when clients try to create or remove the same file concurrently, one should succeed and the others will fail.
  • Create/Unlink/Rename Conflicts (Supported in Balanced Mode only): For example, a client renames a file while another client unlinks the old file at the same time.
  • Operation Atomicity (Supported only in Balanced mode): If a client crashes in the middle of a rename, the state of the container should be consistent, as if the operation never happened.
  • Visibility (Supported in Balanced and Relaxed mode): A write from one client should be visible to another client with simple coordination between the clients.
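
The first case in the list, the same operation executed concurrently, can be pictured with an exclusive create. The sketch below races several threads on one path using O_CREAT | O_EXCL and counts the winners. It runs against whatever filesystem holds the temporary directory, so it demonstrates the expected semantics rather than testing DFS itself; the function names are illustrative.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def try_create(path):
    """Return True if this caller created the file, False if it already existed."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def race_to_create(path, clients=8):
    """Run concurrent create attempts on one path and count the successes."""
    with ThreadPoolExecutor(max_workers=clients) as pool:
        return sum(pool.map(try_create, [path] * clients))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        winners = race_to_create(os.path.join(tmp, "contested"))
        print("successful creates:", winners)
```

An application written this way treats the failed creates as the normal outcome of losing the race, which is the behavior both modes are described as supporting.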

Unified NameSpace (UNS)

Many clients support links to other containers as a layer on top of DFS, where a directory in a POSIX container is interpreted as an instruction to access the root of a separate container, in much the same way as symbolic links work on Unix. DFS does not handle this directly; however, the same mechanism for accessing paths in this way is common across several higher layers.

DFuse (DAOS FUSE)

DFuse provides DAOS File System access through the standard libc/kernel/VFS POSIX infrastructure. This allows existing applications to use DAOS without modification, and provides a path to upgrade those applications to native DAOS support. Additionally, DFuse provides an interception library, libioil, that transparently allows POSIX clients to talk directly to DAOS servers, providing OS bypass for I/O without modifying or recompiling the application.

DFuse is layered on top of DFS. Data written via DFuse can be accessed by DFS and vice versa, even simultaneously from different client applications.

DFuse Daemon

The dfuse daemon runs a single instance per node to provide a user POSIX access to DAOS. It should be run with the credentials of the user, and typically will be started and stopped on each compute node as part of the prolog and epilog scripts of any resource manager or scheduler in use.

Core binding and threads

DFuse will launch one thread per available core by default, limited to 16 if not constrained by a taskset. This can be changed with the --thread-count option. To change the cores that DFuse runs on, use kernel-level tasksets, which will bind DFuse to a subset of cores. This can be done via the taskset or numactl programs or similar. In that case DFuse will again launch one thread per available core by default. Many metadata operations block a thread until they complete, so if DFuse is restricted to a small number of cores then overcommitting via the --thread-count option may be desirable.

DFuse uses two types of threads: fuse threads, which accept and process requests, and event queue progress threads. The --thread-count option dictates the total number of threads, and each event queue thread reduces the number left for fuse threads. Each event queue thread creates a DAOS event queue and so consumes additional network resources. The --eq-count option controls the number of event queues and associated threads.

In addition, DFuse always uses a single main thread and an invalidation thread to manage dentry timeouts.
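
The thread arithmetic described above can be written out as a small model. This is one reading of the text, not code taken from dfuse: it assumes the cap of 16 applies only when no taskset is in effect, that every event queue thread is subtracted from the --thread-count total, and that at least one fuse thread must remain. All names are hypothetical.

```python
from dataclasses import dataclass

DEFAULT_THREAD_CAP = 16  # cap stated in the text for the unconstrained case
EXTRA_THREADS = 2        # main thread + invalidation thread


@dataclass(frozen=True)
class ThreadPlan:
    fuse_threads: int
    eq_threads: int

    @property
    def total(self):
        """All threads, including the main and invalidation threads."""
        return self.fuse_threads + self.eq_threads + EXTRA_THREADS


def plan_threads(available_cores, eq_count, thread_count=None, taskset=False):
    """Split the thread budget between fuse threads and event queue threads."""
    if thread_count is None:
        # Default: one thread per available core, capped unless a taskset applies.
        if taskset:
            thread_count = available_cores
        else:
            thread_count = min(available_cores, DEFAULT_THREAD_CAP)
    if eq_count < 0 or eq_count >= thread_count:
        raise ValueError("eq_count must leave at least one fuse thread")
    return ThreadPlan(fuse_threads=thread_count - eq_count, eq_threads=eq_count)


if __name__ == "__main__":
    print(plan_threads(available_cores=64, eq_count=2))
    # Overcommit a 4-core taskset, as suggested for metadata-heavy workloads.
    print(plan_threads(available_cores=4, eq_count=1, thread_count=12, taskset=True))
```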

Restrictions

DFuse by default is limited to a single user. Access to the filesystem from other users, including root, will not be honored. As a consequence of this, the chown and chgrp calls are not supported. Hard links and special device files, except symbolic links, are not supported, nor are any ACLs beyond standard POSIX permissions.

DFuse can run in the foreground, keeping the terminal window open, or it can daemonize to run like a system daemon. The default is to run in the background; in that case it remains attached to the terminal until initialization completes, so that it can report status or a failure to start back to the user.

Inodes are managed on the local node by DFuse. So while inode numbers will be consistent on a node for the duration of the session, they are not guaranteed to be consistent across restarts of DFuse or across nodes.

It is not possible to see pool/container listings through DFuse. So if readdir, ls or others are used, DFuse will return ENOTSUP.

Multi-user mode

The --multi-user option puts DFuse into multi-user mode, in which it tells the kernel to make the filesystem available to all users on a node rather than only the user running the DFuse process. This makes DFuse appear like a generic multi-user filesystem, and the standard chown and chgrp calls are enabled. All filesystem entries will be owned by the user that created them, as is normal in a POSIX filesystem.

Links to other containers can be created in this mode even if the new containers are not owned by the user running DFuse. In this case the user running DFuse should be given 'r' access to the pool if required and the container create command will apply permissions required to the container at create time.

It is anticipated that in this mode DFuse will be configured to start at boot time and run as a general purpose filesystem providing access to multiple users.

Multi-user mode requires the fuse package to be reconfigured, as this capability is disabled by default. The user_allow_other setting must be present in /etc/fuse.conf or /etc/fuse3.conf; this change must be made as root and takes effect for all users on that node.
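
Before starting DFuse with --multi-user, it can be useful to confirm that the setting is actually active. The sketch below checks for an uncommented user_allow_other line in the two files named above. Which of the two files applies depends on the installation, so the helper simply reports whether either one enables it; the function names are illustrative.

```python
FUSE_CONF_PATHS = ("/etc/fuse.conf", "/etc/fuse3.conf")  # paths named in the text


def allows_other_users(conf_text):
    """Return True if the text contains an active user_allow_other line."""
    for raw_line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if line == "user_allow_other":
            return True
    return False


def multi_user_enabled(paths=FUSE_CONF_PATHS):
    """Check each candidate file; unreadable or missing files are skipped."""
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as handle:
                if allows_other_users(handle.read()):
                    return True
        except OSError:
            continue
    return False


if __name__ == "__main__":
    print("user_allow_other active:", multi_user_enabled())
```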

Launching

Via dfuse command

DFuse should be run with the credentials (user/group) of the user who will be accessing it, and who owns any pools that will be used.

There is one mandatory command-line option: the mount point at which to start dfuse. It can be supplied either via the --mountpoint option or as the first positional argument. The mount point specified should be an empty directory on the local node that is owned by the user.

Additionally, there are several optional command-line options:

Command-line Option         Description
--pool=<label|uuid>         Pool label or UUID to connect to
--container=<label|uuid>    Container label or UUID to open
--sys-name=<name>           DAOS system name
--foreground                Run in foreground
--singlethreaded            Run single threaded
--thread-count=             Number of threads to use
--multi-user                Run in multi-user mode
--read-only                 Mount in read-only mode

The --pool and --container options can also be passed as the second and third positional arguments.

When DFuse starts, it will register a single mount with the kernel, at the location specified. This mount will be visible in /proc/mounts, and possibly in the output of df. The contents of multiple pools/containers may be accessible via this single kernel mount.

Below is an example of mounting a POSIX container (mycont in pool tank) under the /scratch_fs/dfuse mountpoint.

$ mkdir /scratch_fs/dfuse

$ dfuse -m /scratch_fs/dfuse tank mycont

$ touch /scratch_fs/dfuse/foo

$ ls -l /scratch_fs/dfuse/
total 0
-rw-rw-r-- 1 samirrav samirrav 0 Sep 23 16:31 foo

$ df -h /scratch_fs/dfuse/
Filesystem      Size  Used Avail Use% Mounted on
dfuse           537G  5.1G  532G   1% /scratch_fs/dfuse
$

DFuse can be launched via fstab and the standard mount command. It will parse the -o options, extract pool= and container= if provided, and ignore any other filesystem options given.
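
The option handling just described amounts to picking two keys out of a comma-separated list. The sketch below models that behavior in Python to make it concrete; it is not the parser used by dfuse, and the function name is made up.

```python
def extract_dfuse_options(option_string):
    """Pick pool= and container= out of a mount -o string; ignore the rest."""
    wanted = ("pool", "container")
    found = {}
    for item in option_string.split(","):
        key, separator, value = item.partition("=")
        if separator and key in wanted:
            found[key] = value
    return found


if __name__ == "__main__":
    options = "pool=admin_pool,container=admin_cont,auto,x-systemd.requires=daos_agent.service"
    print(extract_dfuse_options(options))
```

Generic options such as auto, and systemd hints such as x-systemd.requires=, fall through untouched, which is why they can share an fstab entry with the DAOS-specific keys.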

The use cases below show how systemd or /etc/fstab can be used to mount a DAOS container using dfuse.

Via mount.fuse3 command

$  dmg pool create --scm-size=8G --nvme-size=64G -u samirrav@ samirrav_pool

Creating DAOS pool with manual per-engine storage allocation: 8.0 GB SCM, 64 GB NVMe (12.50% ratio)
Pool created with 11.11%,88.89% storage tier ratio
--------------------------------------------------
  UUID                 : b43b06fe-4013-4177-911c-6d230b88fe6e
  Service Ranks        : [1-5]
  Storage Ranks        : [0-7]
  Total Size           : 576 GB
  Storage tier 0 (SCM) : 64 GB (8.0 GB / rank)
  Storage tier 1 (NVMe): 512 GB (64 GB / rank)

$  daos cont create samirrav_pool samirrav_cont --type=POSIX
  Container UUID : 6efdc02c-5eaa-4a29-a34b-a062f1fe3371
  Container Label: samirrav_cont
  Container Type : POSIX

Successfully created container 6efdc02c-5eaa-4a29-a34b-a062f1fe3371
$  daos cont get-prop samirrav_pool samirrav_cont
Properties for container samirrav_cont
Name                                    Value
----                                    -----
Highest Allocated OID                   0
Checksum                                off
Checksum Chunk Size                     32 KiB
Compression                             off
Deduplication                           off
Dedupe Threshold                        4.0 KiB
EC Cell Size                            64 KiB
Performance domain affinity level of EC 1
Encryption                              off
Global Version                          2
Group                                   samirrav@
Label                                   samirrav_cont
Layout Type                             POSIX (1)
Layout Version                          1
Max Snapshot                            0
Owner                                   samirrav@
Redundancy Factor                       rd_fac0
Redundancy Level                        node (2)
Performance domain affinity level of RP 3
Server Checksumming                     off
Health                                  HEALTHY
Access Control List                     A::OWNER@:rwdtTaAo, A:G:GROUP@:rwtT
$  mkdir /scratch_fs/daos_dfuse_samir
$  mount.fuse3 dfuse /scratch_fs/daos_dfuse_samir -o pool=samirrav_pool,container=samirrav_cont
$  touch /scratch_fs/daos_dfuse_samir/foo
$  ls -l /scratch_fs/daos_dfuse_samir/
total 0
-rw-rw-r-- 1 samirrav samirrav 0 Sep 23 15:49 foo
$  df -h | grep fuse
dfuse                         537G  5.1G  532G   1% /scratch_fs/daos_dfuse_samir
$

Via fstab

Only root can run the 'mount -a' command, so this example should be run as the root user.

$  dmg pool create --scm-size=8G --nvme-size=64G admin_pool
Creating DAOS pool with manual per-engine storage allocation: 8.0 GB SCM, 64 GB NVMe (12.50% ratio)
Pool created with 11.11%,88.89% storage tier ratio
--------------------------------------------------
  UUID                 : 97196853-a487-41b2-a5d2-286e62f14e9e
  Service Ranks        : [1-5]
  Storage Ranks        : [0-7]
  Total Size           : 576 GB
  Storage tier 0 (SCM) : 64 GB (8.0 GB / rank)
  Storage tier 1 (NVMe): 512 GB (64 GB / rank)

$  daos cont create admin_pool admin_cont --type=POSIX
  Container UUID : ac4fb4db-a15e-45bf-8225-b71d34e3e578
  Container Label: admin_cont
  Container Type : POSIX

Successfully created container ac4fb4db-a15e-45bf-8225-b71d34e3e578
$  daos cont get-prop admin_pool admin_cont
Properties for container admin_cont
Name                                    Value
----                                    -----
Highest Allocated OID                   0
Checksum                                off
Checksum Chunk Size                     32 KiB
Compression                             off
Deduplication                           off
Dedupe Threshold                        4.0 KiB
EC Cell Size                            64 KiB
Performance domain affinity level of EC 1
Encryption                              off
Global Version                          2
Group                                   root@
Label                                   admin_cont
Layout Type                             POSIX (1)
Layout Version                          1
Max Snapshot                            0
Owner                                   root@
Redundancy Factor                       rd_fac0
Redundancy Level                        node (2)
Performance domain affinity level of RP 3
Server Checksumming                     off
Health                                  HEALTHY
Access Control List                     A::OWNER@:rwdtTaAo, A:G:GROUP@:rwtT
$  echo 'dfuse /scratch_fs/root_dfuse fuse3 pool=admin_pool,container=admin_cont,auto,x-systemd.requires=daos_agent.service    0 0' >> /etc/fstab
$  mkdir /scratch_fs/root_dfuse
$  df -h | grep fuse
$  mount -a
$  df -h | grep fuse
dfuse                         537G  5.1G  532G   1% /scratch_fs/root_dfuse
$

Via systemd for user

A user can mount and unmount dfuse using systemd.

$ dmg pool create --scm-size=8G --nvme-size=64G samirrav_pool -u samirrav@ -g samirrav@
Creating DAOS pool with manual per-engine storage allocation: 8.0 GB SCM, 64 GB NVMe (12.50% ratio)
Pool created with 11.11%,88.89% storage tier ratio
--------------------------------------------------
  UUID                 : a635cc99-22b3-4af4-8cee-d756463b5ca0
  Service Ranks        : [0-1]
  Storage Ranks        : [0-1]
  Total Size           : 144 GB
  Storage tier 0 (SCM) : 16 GB (8.0 GB / rank)
  Storage tier 1 (NVMe): 128 GB (64 GB / rank)

$ daos cont create samirrav_pool --type='POSIX' samirrav_cont
  Container UUID : 8dc1a401-1b55-486e-ba70-c4a713eb3c0d
  Container Label: samirrav_cont
  Container Type : POSIX

Successfully created container 8dc1a401-1b55-486e-ba70-c4a713eb3c0d
$

$ cat ~/.config/systemd/user/samirrav_dfuse.service
[Service]
ExecStart=dfuse  --foreground -m /scratch_fs/samirrav_dfuse/  --pool samirrav_pool --cont samirrav_cont
ExecStop=fusermount3 -u /scratch_fs/samirrav_dfuse/

[Install]
WantedBy=default.target
$

$ systemctl --user daemon-reload

$ systemctl --user list-unit-files | grep samirrav
samirrav_dfuse.service         disabled

$ systemctl --user status samirrav_dfuse.service
● samirrav_dfuse.service
   Loaded: loaded (/home/samirrav/.config/systemd/user/samirrav_dfuse.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

$ systemctl --user start samirrav_dfuse.service

$ systemctl --user status samirrav_dfuse.service
● samirrav_dfuse.service
   Loaded: loaded (/home/samirrav/.config/systemd/user/samirrav_dfuse.service; disabled; vendor preset: enabled)
   Active: active (running) since Thu 2022-10-20 15:41:46 UTC; 1s ago
 Main PID: 2845753 (dfuse)
   CGroup: /user.slice/user-11832957.slice/user@11832957.service/samirrav_dfuse.service
           └─2845753 /usr/bin/dfuse --foreground -m /scratch_fs/samirrav_dfuse/ --pool samirrav_pool --cont samirrav_cont

$ df -h | grep fuse
dfuse                         135G  1.3G  133G   1% /scratch_fs/samirrav_dfuse

$ touch /scratch_fs/samirrav_dfuse/test1

$ systemctl --user stop samirrav_dfuse.service
$ df -h | grep fuse
$ ls -l /scratch_fs/samirrav_dfuse/test1
ls: cannot access '/scratch_fs/samirrav_dfuse/test1': No such file or directory

$ systemctl --user start samirrav_dfuse.service
$ ls -l /scratch_fs/samirrav_dfuse/test1
-rw-rw-r-- 1 samirrav samirrav 0 Oct 20 15:42 /scratch_fs/samirrav_dfuse/test1

Via systemd for root

The root user can create the systemd unit file from /etc/fstab using the 'systemd-fstab-generator' command. Consider the /etc/fstab entry from the previous example, which references admin_pool and admin_cont. The steps below show how to generate the systemd unit file and how to start and stop the dfuse service.

$  cat /etc/fstab | grep fuse
dfuse /scratch_fs/root_dfuse fuse3 pool=admin_pool,container=admin_cont,auto,x-systemd.requires=daos_agent.service    0 0
$  /usr/lib/systemd/system-generators/systemd-fstab-generator
Failed to create unit file /tmp/-.mount, as it already exists. Duplicate entry in /etc/fstab?
Failed to create unit file /tmp/var-tmp.mount, as it already exists. Duplicate entry in /etc/fstab?
Failed to create unit file /tmp/dev-sda2.swap, as it already exists. Duplicate entry in /etc/fstab?
$  cat /tmp/scratch_fs-root_dfuse.mount
# Automatically generated by systemd-fstab-generator

[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
Before=local-fs.target
After=daos_agent.service
Requires=daos_agent.service

[Mount]
Where=/scratch_fs/root_dfuse
What=dfuse
Type=fuse3
Options=pool=admin_pool,container=admin_cont,auto,x-systemd.requires=daos_agent.service
$  cp -rf /tmp/scratch_fs-root_dfuse.mount  /usr/lib/systemd/system/
$  systemctl daemon-reload
$  systemctl status scratch_fs-root_dfuse.mount
● scratch_fs-root_dfuse.mount - /scratch_fs/root_dfuse
   Loaded: loaded (/etc/fstab; generated)
   Active: inactive (dead) since Fri 2022-09-23 15:55:33 UTC; 1min 50s ago
    Where: /scratch_fs/root_dfuse
     What: dfuse
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)

Sep 23 15:55:33 wolf-170.wolf.hpdd.intel.com systemd[1]: scratch_fs-root_dfuse.mount: Succeeded.
$  systemctl start scratch_fs-root_dfuse.mount
$  df -h | grep fuse
dfuse                         537G  5.1G  532G   1% /scratch_fs/root_dfuse
$  ls -l /scratch_fs/root_dfuse/
total 0
$  systemctl status scratch_fs-root_dfuse.mount
● scratch_fs-root_dfuse.mount - /scratch_fs/root_dfuse
   Loaded: loaded (/etc/fstab; generated)
   Active: active (mounted) since Fri 2022-09-23 15:57:53 UTC; 31s ago
    Where: /scratch_fs/root_dfuse
     What: dfuse
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
    Tasks: 63 (limit: 1648282)
   Memory: 51.5M
   CGroup: /system.slice/scratch_fs-root_dfuse.mount
           └─4173 dfuse /scratch_fs/root_dfuse -o rw pool=admin_pool container=admin_cont dev suid

Sep 23 15:57:52 wolf-170.wolf.hpdd.intel.com systemd[1]: Mounting /scratch_fs/root_dfuse...
Sep 23 15:57:53 wolf-170.wolf.hpdd.intel.com systemd[1]: Mounted /scratch_fs/root_dfuse.
$  systemctl stop scratch_fs-root_dfuse.mount
$  systemctl status scratch_fs-root_dfuse.mount
● scratch_fs-root_dfuse.mount - /scratch_fs/root_dfuse
   Loaded: loaded (/etc/fstab; generated)
   Active: inactive (dead) since Fri 2022-09-23 15:58:32 UTC; 2s ago
    Where: /scratch_fs/root_dfuse
     What: dfuse
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
    Tasks: 0 (limit: 1648282)
   Memory: 540.0K
   CGroup: /system.slice/scratch_fs-root_dfuse.mount

Sep 23 15:57:52 wolf-170.wolf.hpdd.intel.com systemd[1]: Mounting /scratch_fs/root_dfuse...
Sep 23 15:57:53 wolf-170.wolf.hpdd.intel.com systemd[1]: Mounted /scratch_fs/root_dfuse.
Sep 23 15:58:32 wolf-170.wolf.hpdd.intel.com systemd[1]: Unmounting /scratch_fs/root_dfuse...
Sep 23 15:58:32 wolf-170.wolf.hpdd.intel.com systemd[1]: scratch_fs-root_dfuse.mount: Succeeded.
Sep 23 15:58:32 wolf-170.wolf.hpdd.intel.com systemd[1]: Unmounted /scratch_fs/root_dfuse.
$
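
The unit file in this example is called scratch_fs-root_dfuse.mount because systemd names a mount unit after the escaped form of its mount path. The sketch below is a simplified re-implementation of that escaping rule, included only to show where the name comes from; on a real system the systemd-escape tool (with --path and --suffix=mount) is the authoritative way to compute it. The function names are hypothetical.

```python
import string

# Characters systemd leaves unescaped in a unit name derived from a path.
_SAFE_BYTES = set((string.ascii_letters + string.digits + ":_.").encode("ascii"))


def escape_path(path):
    """Simplified version of systemd's path escaping."""
    parts = [part for part in path.split("/") if part]
    if not parts:
        return "-"  # the root directory
    raw = "/".join(parts).encode("utf-8")
    escaped = []
    for index, byte in enumerate(raw):
        if byte == ord("/"):
            escaped.append("-")
        elif byte in _SAFE_BYTES and not (index == 0 and byte == ord(".")):
            escaped.append(chr(byte))
        else:
            # Everything else, including '-', becomes a \xNN escape.
            escaped.append("\\x%02x" % byte)
    return "".join(escaped)


def mount_unit_name(mount_path):
    """Return the mount unit file name expected for a mount point."""
    return escape_path(mount_path) + ".mount"


if __name__ == "__main__":
    print(mount_unit_name("/scratch_fs/root_dfuse"))
```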

Via systemd during system power ON

The systemd unit file from the previous example can also be used to mount dfuse when the system powers on.

$  echo -e '\n[Install]\nWantedBy = multi-user.target' >> /usr/lib/systemd/system/scratch_fs-root_dfuse.mount
$  cat /usr/lib/systemd/system/scratch_fs-root_dfuse.mount
# Automatically generated by systemd-fstab-generator

[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
Before=local-fs.target
After=daos_agent.service
Requires=daos_agent.service

[Mount]
Where=/scratch_fs/root_dfuse
What=dfuse
Type=fuse3
Options=pool=admin_pool,container=admin_cont,auto,x-systemd.requires=daos_agent.service

[Install]
WantedBy = multi-user.target

$  systemctl is-enabled  scratch_fs-root_dfuse.mount
generated
$  systemctl daemon-reload
$  rm -rf /run/systemd/generator/scratch_fs-root_dfuse.mount
$  systemctl enable  scratch_fs-root_dfuse.mount
Created symlink /etc/systemd/system/multi-user.target.wants/scratch_fs-root_dfuse.mount → /usr/lib/systemd/system/scratch_fs-root_dfuse.mount.
$  systemctl is-enabled  scratch_fs-root_dfuse.mount
enabled
$  reboot
$  dmesg | grep fuse
[   18.060203] systemd[1]: sysinit.target: Found dependency on scratch_fs-root_dfuse.mount/start
[   28.736227] fuse: init (API version 7.33)
$  df -h | grep fuse
dfuse                         537G  5.1G  532G   1% /scratch_fs/root_dfuse
$  systemctl status scratch_fs-root_dfuse.mount
● scratch_fs-root_dfuse.mount - /scratch_fs/root_dfuse
   Loaded: loaded (/etc/fstab; enabled; vendor preset: disabled)
   Active: active (mounted) since Fri 2022-09-23 16:13:35 UTC; 4min 8s ago
    Where: /scratch_fs/root_dfuse
     What: dfuse
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
    Tasks: 63 (limit: 1648282)
   Memory: 56.2M
   CGroup: /system.slice/scratch_fs-root_dfuse.mount
           └─2346 dfuse /scratch_fs/root_dfuse -o rw pool=admin_pool container=admin_cont dev suid

Sep 23 16:13:34 wolf-170.wolf.hpdd.intel.com systemd[1]: Mounting /scratch_fs/root_dfuse...
Sep 23 16:13:35 wolf-170.wolf.hpdd.intel.com systemd[1]: Mounted /scratch_fs/root_dfuse.
$

It is possible to link to other containers in DFuse, where subdirectories within a container resolve not to regular directories, but rather to the root of entirely different POSIX containers.

To create a new container and link it into the namespace of an existing one, use the following command.

$ daos container create <pool_label> <cont_label> --type POSIX --path <path_to_entry_point>

The pool should already exist, and the path should specify a location somewhere within a DFuse mount point that resolves to a POSIX container. Once a link is created, it can be accessed through the new path. Following the link is virtually transparent. No container uuid is required. If one is not supplied, it will be created.

To destroy a container again, the following command should be used.

$ daos container destroy --path <path to entry point>

This will both remove the link between the containers and remove the container that was linked to.

Links to pre-existing containers can also be created via the daos container link command.

Information about a container, for example the presence of an entry point between containers or the pool and container UUIDs of the container linked to, can be read with the following command.

$ daos container info --path <path to entry point>

An example is shown below.

$ dfuse -m /scratch_fs/dfuse --pool tank --cont mycont
$ cd /scratch_fs/dfuse/
$ ls -l
total 0
-rw-rw-r-- 1 samirrav samirrav 0 Sep 23 16:31 foo
$ daos cont create tank mycont3 --type POSIX --path ./link_to_external_container
  Container UUID : 03f9dc7d-ca6a-4f1e-8246-fd89072cfeca
  Container Label: mycont3
  Container Type : POSIX

Successfully created container 03f9dc7d-ca6a-4f1e-8246-fd89072cfeca type POSIX
$ ls -lrt
total 0
-rw-rw-r-- 1 samirrav samirrav   0 Sep 23 16:31 foo
drwxr-xr-x 1 samirrav samirrav 120 Sep 23 16:32 link_to_external_container
$ daos cont destroy --path ./link_to_external_container/
Successfully destroyed container ./link_to_external_container/
$ pwd
/scratch_fs/dfuse
$ ls -l
total 0
-rw-rw-r-- 1 samirrav samirrav 0 Sep 23 16:31 foo
$

Caching

For performance reasons, caching is enabled by default in DFuse, including both data and metadata caching. These settings can be tuned at a high level on the DFuse command line, and with fine-grained control via container attributes.

The following types of data will be cached by default.

  • Kernel caching of dentries
  • Kernel caching of negative dentries
  • Kernel caching of inodes (file sizes, permissions etc)
  • Kernel caching of file contents
  • Kernel caching of directory contents (when supported by libfuse)
  • MMAP write optimization

Warning

Caching is enabled by default in dfuse. This might cause some parallel applications to fail. Please disable caching (--disable-caching option) if you experience this or want up-to-date data sharing between nodes.

To selectively control caching within a container, the following container attributes should be used. If any attribute is set, the rest are assumed to be set to 0 or off, except dentry-dir-time, which defaults to dentry-time.

Attribute name             Description
dfuse-attr-time            How long file attributes are cached
dfuse-dentry-time          How long directory entries are cached
dfuse-dentry-dir-time      How long dentries are cached, if the entry is itself a directory
dfuse-ndentry-time         How long negative dentries are cached
dfuse-data-cache           Data caching enabled, duration or ("on"/"true"/"off"/"false"/"otoc")
dfuse-direct-io-disable    Force use of page cache for this container ("on"/"true"/"off"/"false")

For the metadata caching attributes, specify the duration for which the cache should be valid, either in seconds or with an 's', 'm', 'h' or 'd' suffix for seconds, minutes, hours or days.
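
As an illustration of the duration format, the following sketch converts such values to seconds. It implements only what the paragraph above states (a plain number, or a number with one of the four suffixes); how dfuse itself treats other inputs, such as upper-case suffixes, is not covered here, and the function name is invented for the example.

```python
import re

_UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600, "d": 86400}
_DURATION = re.compile(r"(\d+)([smhd]?)", flags=re.ASCII)


def cache_duration_seconds(value):
    """Convert a value such as '30', '5m' or '2h' to a number of seconds."""
    match = _DURATION.fullmatch(value.strip())
    if match is None:
        raise ValueError("not a valid duration: %r" % value)
    number, suffix = match.groups()
    return int(number) * _UNIT_SECONDS[suffix]


if __name__ == "__main__":
    for text in ("30", "10s", "5m", "2h", "1d"):
        print(text, "->", cache_duration_seconds(text))
```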

dfuse-data-cache can be set to a time value or to "on", "true", "off", "false" or "otoc". Any other value will log an error and result in the cache being off. With this option enabled, the O_DIRECT flag is still honored for files opened with it, and files which do not set O_DIRECT are cached. Data caching is controlled by dfuse passing a flag to the kernel on open. If data-cache is enabled then caching is allowed for files, and the timeout value is the duration between a previous close call that reduced the open count to zero and the next subsequent call to open. A value of "otoc" allows the page cache to be used for the file while it is open, but the cache is only used from open to close and is not kept across opens; this allows the use of MAP_SHARED on files.

Processes running with a working directory within the dfuse mount do not hold a reference on the directory, so cache expiry can in this case cause getcwd() to fail. Should this happen, a larger value for "dfuse-dentry-dir-time" should avoid the issue.

dfuse-direct-io-disable will enable data caching, similar to dfuse-data-cache; however, if it is enabled then the O_DIRECT flag will be ignored and all files will use the page cache. The default value is disabled.

With no options specified, the attr and dentry timeouts will be 1 second, the dentry-dir and ndentry timeouts will be 5 seconds, and data caching will be set to 10 minutes.
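
The defaulting rules for these attributes interact: the built-in defaults apply only when nothing is set, and setting any one attribute turns the others off. The sketch below models that as a lookup, with times in seconds. It is an interpretation of the rules in this section, not dfuse code; in particular, treating dfuse-direct-io-disable as one of the attributes that triggers the "rest are off" rule is an assumption.

```python
DEFAULTS = {
    "dfuse-attr-time": 1,
    "dfuse-dentry-time": 1,
    "dfuse-dentry-dir-time": 5,
    "dfuse-ndentry-time": 5,
    "dfuse-data-cache": 600,           # 10 minutes
    "dfuse-direct-io-disable": False,  # disabled by default
}

# Values used for attributes left unset once any attribute has been set.
ALL_OFF = {
    name: (False if isinstance(value, bool) else 0)
    for name, value in DEFAULTS.items()
}


def effective_cache_settings(container_attrs):
    """Resolve the cache settings for a container from its attributes."""
    chosen = {k: v for k, v in container_attrs.items() if k in DEFAULTS}
    if not chosen:
        return dict(DEFAULTS)
    settings = dict(ALL_OFF)
    settings.update(chosen)
    if "dfuse-dentry-dir-time" not in chosen:
        # dentry-dir-time falls back to dentry-time rather than to 0.
        settings["dfuse-dentry-dir-time"] = settings["dfuse-dentry-time"]
    return settings


if __name__ == "__main__":
    print(effective_cache_settings({}))
    print(effective_cache_settings({"dfuse-dentry-time": 30}))
```

The second call shows the practical consequence: setting only dfuse-dentry-time also switches data caching off, so attributes are best set as a complete group.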

Readdir caching is available when supported by libfuse; however, on many distributions the system libfuse is not able to support this feature. Libfuse version 3.5.0 or newer is required at both compile and run-time. Use dfuse --version or the runtime logs to see the fuse version used and if the feature is compiled into dfuse. Readdir caching is controlled by the dfuse-dentry-time setting.

There are two command line options to control the DFuse process itself.

Command line option   Description
--disable-caching     Disables all caching
--disable-wb-cache    Disables write-back cache

These will affect all containers accessed via DFuse, regardless of any container attributes.

Managing memory usage and disconnecting from containers

DFuse can be instructed to evict paths from local memory which drops any open handles on containers or pools as well as reducing the working set size and memory consumption. This is an asynchronous operation and there is no automatic way to tell if it's completed. In addition, any lookup of the path specified in the eviction call will cause a new lookup and prevent the eviction from completing.

Paths can be requested for eviction from dfuse using the daos filesystem evict command. This does not change any data that is stored in DAOS in any way but rather releases local resources. This command will return the inode number of the path as well as key dfuse metrics.

DFuse metrics can be queried with the daos filesystem query command which takes an optional --inode parameter. This will return information on the number of inodes held in memory, the number of open files as well as the number of pools and containers that DFuse is connected to. If the --inode option is given then this command will also report if the inode is in memory or not.

Together these two commands can be used to request eviction of a path and to poll for its release, although lookups from other processes might block the eviction process.
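The evict-then-poll pattern could be sketched as follows. This is a minimal sketch under stated assumptions: the function name evict_and_poll and the DAOS_CMD variable are hypothetical, the placement of the --inode option and path argument is assumed, and because the output format of daos filesystem query is not described here, the loop only prints each result for inspection rather than parsing it.

```shell
# Hypothetical helper: request eviction of a path, then poll dfuse metrics.
# DAOS_CMD can be overridden (for example with a stub) for dry runs.
DAOS_CMD=${DAOS_CMD:-daos}

evict_and_poll() {
    ep_mount=$1       # root of the dfuse mount
    ep_path=$2        # path to evict
    ep_tries=${3:-5}  # number of polls
    ep_delay=${4:-1}  # seconds between polls

    # Record the inode number before evicting; looking the path up again
    # afterwards would trigger a new lookup and could block the eviction.
    ep_ino=$(stat -c %i "$ep_path") || return 1
    "$DAOS_CMD" filesystem evict "$ep_path" || return 1

    ep_i=0
    while [ "$ep_i" -lt "$ep_tries" ]; do
        # Assumption: the query is pointed at the mount root, not at the
        # evicted path, and accepts the option before the path argument.
        "$DAOS_CMD" filesystem query --inode "$ep_ino" "$ep_mount"
        ep_i=$((ep_i + 1))
        sleep "$ep_delay"
    done
}
```

A call such as evict_and_poll /scratch_fs/daos /scratch_fs/daos/some_dir would then print the metrics five times, one second apart; the directory name is illustrative.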

If daos filesystem evict is passed the root of the DFuse mount then the path itself cannot be evicted - in this case all top-level entries in the directory are evicted instead and no inode number is returned.

Permissions

DFuse can serve data from any user's container, but needs appropriate permissions in order to do this.

File ownership within containers is set by the container being served: the owner of the container owns all files within that container. When looking at the container of another user, all entries within that container are therefore owned by that user, and file-based permission checks by the kernel are made on that basis.

Should write permission be granted to another user then any newly created files will also be owned by the container owner, regardless of the user used to create them. Permissions are only checked on connect, so if permissions are revoked users need to restart DFuse for these to be picked up.

Pool permissions.

DFuse needs 'r' permission for pools only.

Container permissions.

DFuse needs 'r' and 't' permissions to run: 'r' to access the data, and 't' to read the container properties that identify the container type. For older layout versions (containers created by DAOS v2.0.x and before), 'a' permission is also required to read the ACLs to know the container owner.

Write permission 'w' for the container is optional; however, without it the container will be read-only.
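The container permission rules above can be summarized in a short shell sketch. The function name dfuse_access_mode is hypothetical and the check works on a plain permission string such as "rt" or "rwt"; it does not query DAOS, and it ignores the extra 'a' requirement for older layout versions.

```shell
# Hypothetical helper: classify a container permission string for DFuse use.
# 'r' and 't' are both required; 'w' decides read-only versus read-write.
dfuse_access_mode() {
    dam_perms=$1
    case $dam_perms in *r*) ;; *) echo "unusable"; return 1 ;; esac
    case $dam_perms in *t*) ;; *) echo "unusable"; return 1 ;; esac
    case $dam_perms in
        *w*) echo "read-write" ;;
        *)   echo "read-only" ;;
    esac
}
```

With these rules, dfuse_access_mode rt prints read-only and dfuse_access_mode rwt prints read-write, while a string missing 'r' or 't' is reported as unusable.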

Stopping DFuse

When done, the file system can be unmounted via fusermount3:

$ fusermount3 -u /scratch_fs/daos

When this is done, the local DFuse daemon should shut down the mount point, disconnect from the DAOS servers, and exit. You can also verify that the mount point is no longer listed in /proc/mounts.
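The /proc/mounts check can be scripted along these lines. The function name is_mounted is hypothetical; the sketch compares the second field of each line with the mount point and does not handle mount points containing escaped characters such as spaces.

```shell
# Return success if the given mount point is listed in a mounts table.
# The second argument defaults to /proc/mounts and exists mainly for testing.
is_mounted() {
    im_point=$1
    im_table=${2:-/proc/mounts}
    # Field 2 of each /proc/mounts line is the mount point.
    awk -v mp="$im_point" '$2 == mp { found = 1 } END { exit !found }' "$im_table"
}

# Example use after unmounting:
#   if is_mounted /scratch_fs/daos; then echo "still mounted"; fi
```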

Interception Library libioil

An interception library called libioil is available to work with DFuse. This library works in conjunction with DFuse: it intercepts POSIX I/O calls and issues the I/O operations directly from the application context through libdaos, without any application changes. This provides kernel-bypass for I/O data, leading to improved performance.

Using libioil

To use the interception library, set LD_PRELOAD to point to the shared library in the DAOS install directory:

LD_PRELOAD=/path/to/daos/install/lib/libioil.so
LD_PRELOAD=/usr/lib64/libioil.so # when installed from RPMs

For instance:

$ dd if=/dev/zero of=./foo bs=1G count=20
20+0 records in
20+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 14.1946 s, 1.5 GB/s

$ LD_PRELOAD=/usr/lib64/libioil.so dd if=/dev/zero of=./bar bs=1G count=20
20+0 records in
20+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 5.0483 s, 4.3 GB/s

Alternatively, it's possible to simply link the interception library into the application at compile time with the -lioil flag.

Monitoring Activity

The interception library is intended to be transparent to the user, and no other setup should be needed beyond the above. However, this can make it hard to tell whether the library is linked correctly and working. To check, you can turn on reporting of activity by the interception library via an environment variable, in which case it will print reports to stderr.

If the D_IL_REPORT environment variable is set, the interception library will print a short summary in the shared library destructor, typically as the program exits. If it is set to a number, the library will also log that many of the first intercepted calls of each type. For example, with a value of 2 the interception library will print to stderr on the first two intercepted read calls, the first two write calls and the first two stat calls. To have all calls printed, set the value to -1. A value of 0 means to print the summary at program exit only.

D_IL_REPORT=2

For instance:

$ D_IL_REPORT=1 LD_PRELOAD=/usr/lib64/libioil.so dd if=/dev/zero of=./bar bs=1G count=20
[libioil] Intercepting write of size 1073741824
20+0 records in
20+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 5.17297 s, 4.2 GB/s

$ D_IL_REPORT=3 LD_PRELOAD=/usr/lib64/libioil.so dd if=/dev/zero of=./bar bs=1G count=5
[libioil] Intercepting write of size 1073741824
[libioil] Intercepting write of size 1073741824
[libioil] Intercepting write of size 1073741824
5+0 records in
5+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 1.27362 s, 4.2 GB/s

$ D_IL_REPORT=-1 LD_PRELOAD=/usr/lib64/libioil.so dd if=/dev/zero of=./bar bs=1G count=5
[libioil] Intercepting write of size 1073741824
[libioil] Intercepting write of size 1073741824
[libioil] Intercepting write of size 1073741824
[libioil] Intercepting write of size 1073741824
[libioil] Intercepting write of size 1073741824
5+0 records in
5+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 1.29935 s, 4.1 GB/s

Note

Some programs, including most GNU utilities from the 'coreutils' package, have a destructor function that closes stderr on exit. For many basic commands such as cp and cat, the interception library will therefore work, but the summary it generates cannot be seen.
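To check quickly whether calls are being intercepted, the per-call report lines can be counted. The sketch below assumes the "[libioil] Intercepting ..." line format shown in the transcripts above; the function name count_intercepts is hypothetical.

```shell
# Count "[libioil] Intercepting ..." report lines read from standard input.
count_intercepts() {
    grep -c '^\[libioil\] Intercepting'
}

# Example: send only stderr into the pipe and discard stdout.
#   D_IL_REPORT=-1 LD_PRELOAD=/usr/lib64/libioil.so \
#       dd if=/dev/zero of=./bar bs=1G count=5 2>&1 >/dev/null | count_intercepts
```

A count of 0 for a command that is known to perform I/O on a DFuse mount suggests that the library was not loaded or that the calls were not intercepted.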

Advanced Usage

DFuse will only create one kernel level mount point regardless of how it is launched. How POSIX containers are represented within that mount point varies depending on the DFuse command-line options. In addition to mounting a single POSIX container, DFuse can also operate in two other modes detailed below.

Pool Mode

If a pool uuid is specified but not a container uuid, then the containers can be accessed by the path <mount point>/<container uuid>. The container uuid will have to be provided from an external source.

$ daos cont create tank  mycont --type POSIX
  Container UUID : 7dee7162-8ab2-4704-ad22-8b43f2eb5279
  Container Label: mycont
  Container Type : POSIX

Successfully created container 7dee7162-8ab2-4704-ad22-8b43f2eb5279
$ daos cont create tank  mycont2 --type POSIX
  Container UUID : 8437e099-b19a-4b33-85da-81f5b3b7a833
  Container Label: mycont2
  Container Type : POSIX

Successfully created container 8437e099-b19a-4b33-85da-81f5b3b7a833
$ dfuse -m /scratch_fs/dfuse --pool tank
$ ls -l /scratch_fs/dfuse/
ls: cannot open directory '/scratch_fs/dfuse/': Operation not supported
$ ls -l /scratch_fs/dfuse/8437e099-b19a-4b33-85da-81f5b3b7a833
total 0
$ touch /scratch_fs/dfuse/7dee7162-8ab2-4704-ad22-8b43f2eb5279/foo
$ ls -l /scratch_fs/dfuse/7dee7162-8ab2-4704-ad22-8b43f2eb5279
total 0
-rw-rw-r-- 1 samirrav samirrav 0 Sep 23 17:00 foo
$ fusermount3 -u /scratch_fs/dfuse
$
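Since the mount point cannot be listed in this mode, one external source for the container uuid is the output of daos cont create itself. The sketch below extracts it, assuming the "Container UUID : <uuid>" line format shown in the transcript above; the function name cont_uuid_from_create is hypothetical.

```shell
# Print the UUID from "daos cont create" output read on standard input.
# Relies on the "Container UUID : <uuid>" line shown in the transcript.
cont_uuid_from_create() {
    awk '/Container UUID/ { print $NF; exit }'
}

# Example use (pool and container labels as in the transcript):
#   uuid=$(daos cont create tank mycont --type POSIX | cont_uuid_from_create)
#   ls -l "/scratch_fs/dfuse/$uuid"
```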

System Mode

If neither a pool nor a container is specified, then pools and containers can be accessed by the path <mount point>/<pool uuid>/<container uuid>. Note, however, that readdir() and therefore ls do not work on either the mount point or the directories representing pools here, so the pool and container uuids will have to be provided from an external source.

$ dfuse -m /scratch_fs/dfuse
$ df -h /scratch_fs/dfuse/
Filesystem      Size  Used Avail Use% Mounted on
dfuse              -     -     -    - /scratch_fs/dfuse
$ daos pool query tank | grep -- '-.*-'
Pool 3559e4b2-7f55-41ad-8d37-f279a8f3f586, ntarget=128, disabled=0, leader=5, version=1
$
$ ls -l /scratch_fs/dfuse/3559e4b2-7f55-41ad-8d37-f279a8f3f586/8437e099-b19a-4b33-85da-81f5b3b7a833
total 0
$ ls -l /scratch_fs/dfuse/3559e4b2-7f55-41ad-8d37-f279a8f3f586/7dee7162-8ab2-4704-ad22-8b43f2eb5279
total 0
-rw-rw-r-- 1 samirrav samirrav 0 Sep 23 17:00 foo
$

While this mode is not expected to be used directly by users, it is useful for the unified namespace integration.
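For scripted access in this mode, the path can be assembled from the two uuids. The sketch below assumes the "Pool <uuid>, ntarget=..." line format shown in the transcript above; both function names are hypothetical.

```shell
# Print the pool UUID from "daos pool query" output read on standard input.
# Relies on the "Pool <uuid>, ntarget=..." line shown in the transcript.
pool_uuid_from_query() {
    sed -n 's/^Pool \([0-9a-fA-F-]\{36\}\),.*/\1/p'
}

# Build <mount point>/<pool uuid>/<container uuid>, dropping one trailing
# slash from the mount point if present.
system_mode_path() {
    printf '%s/%s/%s\n' "${1%/}" "$2" "$3"
}

# Example use:
#   pool=$(daos pool query tank | pool_uuid_from_query)
#   ls -l "$(system_mode_path /scratch_fs/dfuse "$pool" "$cont_uuid")"
```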

Interception Library libpil4dfs

libpil4dfs is similar to libioil, but it intercepts not only read/write but also metadata-related functions. This provides performance similar to native DFS while keeping the POSIX interface. libpil4dfs can be used in conjunction with dfuse or without a dfuse mountpoint.

Using libpil4dfs with dfuse

Start the dfuse daemon:

dfuse -m /scratch_fs/dfuse tank mycont

To use the interception library, set LD_PRELOAD to point to the shared library in the DAOS install directory:

LD_PRELOAD=/path/to/daos/install/lib/libpil4dfs.so
or
LD_PRELOAD=/usr/lib64/libpil4dfs.so # when installed from RPMs

Example:

$ LD_PRELOAD=/usr/lib64/libpil4dfs.so mdtest -a POSIX -z 0 -F -C -i 1 -n 1667 -e 4096 -d /scratch_fs/dfuse/ -w 4096

Using libpil4dfs without dfuse

When no dfuse mountpoint is specified, several environment variables must be set to tell libpil4dfs what POSIX container to mount where in the namespace:

  • D_IL_POOL must be set to the pool label where the container to be mounted resides
  • D_IL_CONTAINER must be set to the label of the POSIX container to be mounted
  • D_IL_MOUNT_POINT shall be set to the path in the local namespace where the container should be mounted

The following example shows an (empty) POSIX container mounted on the fly by libpil4dfs under /tmp:

$ ls /tmp
daos_agent.log  runtime-root  systemd-private-6bcc82c125b84f88b78f4f52b848d0d2-chronyd.service-5LAQe4  tmpjsonlogdir.a3L8Gv
$ LD_PRELOAD=/usr/lib64/libpil4dfs.so D_IL_POOL=tank D_IL_CONTAINER=mycont D_IL_MOUNT_POINT=/tmp ls /tmp
$

Warning

The operation mode without dfuse has a lot of limitations and is not recommended for production use.
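For experiments with this mode, the three variables can be wrapped in a small helper so that they are only set for the command being run. The function name run_with_pil4dfs and the PIL4DFS_LIB variable are hypothetical; the library path defaults to the RPM location mentioned above.

```shell
# Hypothetical wrapper: run a command with libpil4dfs preloaded and the
# container described by pool label, container label and mount point.
PIL4DFS_LIB=${PIL4DFS_LIB:-/usr/lib64/libpil4dfs.so}

run_with_pil4dfs() {
    rp_pool=$1
    rp_cont=$2
    rp_mnt=$3
    shift 3
    # env applies the variables to the launched command only, so the
    # calling shell itself is not affected by LD_PRELOAD.
    env LD_PRELOAD="$PIL4DFS_LIB" \
        D_IL_POOL="$rp_pool" \
        D_IL_CONTAINER="$rp_cont" \
        D_IL_MOUNT_POINT="$rp_mnt" \
        "$@"
}

# Example use, equivalent to the transcript above:
#   run_with_pil4dfs tank mycont /tmp ls /tmp
```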

If the D_IL_REPORT environment variable is set then the interception library will print a short summary of intercepted functions accessing DAOS filesystem through POSIX as a program exits. Both "D_IL_REPORT=1" and "D_IL_REPORT=true" enable printing the summary.

$ D_IL_REPORT=1 LD_PRELOAD=/usr/lib64/libpil4dfs.so mdtest -a POSIX -z 0 -F -C -i 1 -n 1667 -e 4096 -d /scratch_fs/dfuse -w 4096
...
libpil4dfs intercepting summary for ops on DFS:
[read   ]  0
[write  ]  1667

[open   ]  1667
[stat   ]  0
[opendir]  0
[readdir]  0
[unlink ]  0
[seek   ]  1667
[mkdir  ]  2
[rmdir  ]  0
[rename ]  0
[mmap   ]  0

[op_sum ]  5003
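In the summary above, the op_sum value equals the sum of the individual counters (1667 + 1667 + 1667 + 2 = 5003). The sketch below checks this for a captured summary; that the relation holds in general is an assumption drawn from this one example, and the function name check_op_sum is hypothetical.

```shell
# Sum the per-operation counters of a libpil4dfs summary read on standard
# input and compare the total with the reported [op_sum ] value.
check_op_sum() {
    awk '
        /^\[op_sum/ { reported = $NF; next }   # the reported total
        /^\[/       { total += $NF }           # every other counter line
        END {
            printf "computed=%d reported=%d\n", total, reported
            exit (total != reported)
        }'
}
```

The function reads a saved summary on standard input and returns a non-zero exit status if the two values differ.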

Turn on compatible mode in libpil4dfs

In regular mode, libpil4dfs.so uses fake file descriptors (FDs) for efficiency: open() returns a fake fd to the application. If such an fd reaches an API that is not intercepted, the application can crash with the error "Bad File Descriptor". Compatible mode is provided to work around such situations. Setting the environment variable "D_IL_COMPATIBLE=1" turns on compatible mode, in which a kernel fd allocated by dfuse is returned to the application instead of a fake fd. This mode provides better compatibility at the cost of degraded performance in open, openat, opendir, etc. Please start dfuse with "--disable-caching" to disable caching before using compatible mode.

Child Process Inheritance

Normally, child processes inherit environment variables from their parent processes. In rare cases, e.g. scons, the environment variables are stripped off when calling execve(). It can then be useful to force the pil4dfs-related environment variables to be set in child processes by setting "D_IL_ENFORCE_EXEC_ENV=1". This flag is 0 if not set.

Directory caching

To improve performance, directories are cached in a hash table. The size of this hash table can be changed with the following environment variable:

  • D_IL_DCACHE_SIZE_BITS: the number of buckets of the hash table, expressed as a power of 2 (default value of 16).

A garbage collector is periodically triggered to remove stale entries from the hash table. The behavior of this garbage collector can be configured with the following environment variables:

  • D_IL_DCACHE_REC_TIMEOUT: the lifetime in seconds of an entry of the hash table (default value of 60).
  • D_IL_DCACHE_GC_RECLAIM_MAX: the maximum number of entries that can be reclaimed per garbage collection iteration (default value of 1000).
  • D_IL_DCACHE_GC_PERIOD: the period in seconds at which the garbage collector is triggered (default value of 120).

Note

  • The directory cache can be deactivated by setting the D_IL_DCACHE_REC_TIMEOUT environment variable to 0.
  • The garbage collector can be deactivated by setting the D_IL_DCACHE_GC_PERIOD environment variable to 0.
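The following sketch prints the settings implied by the current environment, using the documented defaults for unset variables. It reads D_IL_DCACHE_SIZE_BITS as an exponent, so that 16 bits gives 2^16 buckets; that interpretation is an assumption, and the function name dcache_settings is hypothetical.

```shell
# Print the directory-cache settings implied by the environment, falling
# back to the documented default values when a variable is unset.
dcache_settings() {
    dc_bits=${D_IL_DCACHE_SIZE_BITS:-16}
    dc_timeout=${D_IL_DCACHE_REC_TIMEOUT:-60}
    dc_period=${D_IL_DCACHE_GC_PERIOD:-120}
    dc_max=${D_IL_DCACHE_GC_RECLAIM_MAX:-1000}

    # Assumption: the bucket count is 2 raised to the configured bits.
    echo "buckets=$((1 << dc_bits))"
    if [ "$dc_timeout" -eq 0 ]; then
        echo "directory cache: disabled"
    else
        echo "entry lifetime: ${dc_timeout}s"
    fi
    if [ "$dc_period" -eq 0 ]; then
        echo "garbage collector: disabled"
    else
        echo "garbage collector: every ${dc_period}s, up to $dc_max entries per run"
    fi
}
```

With none of the variables set, the first line of output is buckets=65536.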

Limitations of libpil4dfs

Libpil4dfs is available as a preview. Some features are not implemented yet. Because libpil4dfs involves many APIs, there may be bugs and functions that are not covered or not intercepted.

Libpil4dfs suffers from the following limitations:

  • Current code was developed and tested on x86_64. We do have ongoing work to port the library to Arm64, but we have not tested on Arm64 yet.
  • Large overhead for small tasks due to slow daos_init() (order of hundreds of milliseconds)
  • Does not work with statically linked executables
  • dfuse is still required to handle some operations that libpil4dfs does not support yet.
  • Support for multiple pools and containers within a single dfuse mountpoint is not there yet (each container accessed should be mounted separately), i.e. no UNS support (concerns about the overhead of getfattr())
  • No support for creating a process with the executable and shared object files stored on DAOS yet
  • No support for applications using fork yet
  • DFS (dfs_open / dfs_lookup) does not support O_APPEND currently. We allow the O_APPEND flag in open in libpil4dfs to support bash scripts like configure. Currently, we only query the file size once, when opening the file, and then set the file pointer to the end of the file. To avoid an expensive stat, we DO NOT move the file pointer to the end of the file on subsequent writes. Further work is required for rigorous O_APPEND support.

Those unsupported features are still available through dfuse.
