DAOS System Administration¶
RAS Events¶
Reliability, Availability, and Serviceability (RAS) related events are communicated and logged within DAOS and are also written to syslog.
Event Structure¶
The following table describes the structure of a DAOS RAS event, including descriptions of mandatory and optional fields.
Field | Optional/Mandatory | Description |
---|---|---|
ID | Mandatory | Unique event identifier referenced in the manual. |
Timestamp (ts) | Mandatory | Timestamp with microsecond resolution, including the timezone offset to avoid locality issues. |
Hostname (host) | Optional | Hostname of the node involved in the event. |
Type | Mandatory | Event type of STATE_CHANGE causes an update to the Management Service (MS) database in addition to event being written to SYSLOG. INFO_ONLY type events are only written to SYSLOG. |
Severity (sev) | Mandatory | Indicates event severity, Error/Warning/Notice. |
Msg | Mandatory | Human readable message. |
PID | Optional | Identifier of the process involved in the RAS event. |
TID | Optional | Identifier of the thread involved in the RAS event. |
Rank | Optional | DAOS rank involved in the event. |
Incarnation (inc) | Optional | Incarnation version of DAOS rank involved in the event. An incarnation of an engine (engine is identified by a rank) is an internal sequence number used to order aliveness events related to an engine. |
HWID | Optional | Identifies hardware components involved in the event, e.g., PCI address for SSD, network interface. |
JOBID | Optional | Identifier of the job involved in the RAS event. |
PUUID (pool) | Optional | Pool UUID involved in the event, if any. |
CUUID (cont) | Optional | Container UUID involved in the event, if relevant. |
OID (objid) | Optional | Object identifier involved in the event, if relevant. |
Control Op (ctlop) | Optional | Recommended automatic action, if any. |
Data | Optional | Specific instance data treated as a blob. |
Below is an example of a RAS event signaling an exclusion of an unresponsive engine:
&&& RAS EVENT id: [swim_rank_dead] ts: [2021-11-21T13:32:31.747408+0000] host: [wolf-112.wolf.hpdd.intel.com] type: [STATE_CHANGE] sev: [NOTICE] msg: [SWIM marked rank as dead.] pid: [253454] tid: [1] rank: [6] inc: [63a058833280000]
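Since RAS events are forwarded to the system log, they can also be located on a storage node with standard tools; a minimal sketch, assuming daos_server runs under a systemd unit of the same name:
$ journalctl -u daos_server | grep "RAS EVENT"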
Event List¶
The following table lists supported DAOS RAS events, including IDs, type, severity, message, description, and cause.
Event | Event type | Severity | Message | Description | Cause |
---|---|---|---|---|---|
device_set_faulty | INFO_ONLY | NOTICE or ERROR | Device: <uuid> set faulty / Device: <uuid> set faulty failed: <rc> / Device: <uuid> auto faulty detect / Device: <uuid> auto faulty detect failed: <rc> | Indicates that a device has either been explicitly or automatically set as faulty. Device UUID specified in event data. | Either DMG set nvme-faulty command was used to explicitly set device as faulty or an error threshold was reached on a device which has triggered an auto faulty reaction. |
device_media_error | INFO_ONLY | ERROR | Device: <uuid> <error-type> error logged from tgt_id:<idx> | Indicates that a device media error has been detected for a specific target. The error type could be unmap, write, read or checksum (csum). Device UUID and target ID specified in event data. | Media error occurred on backing device. |
device_unplugged | INFO_ONLY | NOTICE | Device: <uuid> unplugged | Indicates device was physically removed from host. | NVMe SSD physically removed from host. |
device_plugged | INFO_ONLY | NOTICE | Detected hot plugged device: <bdev-name> | Indicates device was physically inserted into host. | NVMe SSD physically added to host. |
device_replace | INFO_ONLY | NOTICE or ERROR | Replaced device: <uuid> with device: <uuid> [failed: <rc>] | Indicates that a faulty device was replaced with a new device and if the operation failed. The old and new device IDs as well as any non-zero return code are specified in the event data. | Device was replaced using DMG nvme replace command. |
device_link_speed_changed | INFO_ONLY | NOTICE or WARNING | NVMe PCIe device at <pci-address> port-<idx>: link speed changed to <transfer-rate> (max <transfer-rate>) | Indicates that an NVMe device link speed has changed. The negotiated and maximum device link speeds are indicated in the event message field and the severity is set to warning if the negotiated speed is not at maximum capability (and notice level severity if at maximum). No other specific information is included in the event data. | Either device link speed was previously downgraded and has returned to maximum or link speed has downgraded to a value that is less than its maximum capability. |
device_link_width_changed | INFO_ONLY | NOTICE or WARNING | NVMe PCIe device at <pci-address> port-<idx>: link width changed to <pcie-link-lanes> (max <pcie-link-lanes>) | Indicates that an NVMe device link width has changed. The negotiated and maximum device link widths are indicated in the event message field and the severity is set to warning if the negotiated width is not at maximum capability (and notice level severity if at maximum). No other specific information is included in the event data. | Either device link width was previously downgraded and has returned to maximum or link width has downgraded to a value that is less than its maximum capability. |
engine_format_required | INFO_ONLY | NOTICE | DAOS engine <idx> requires a <type> format | Indicates engine is waiting for allocated storage to be formatted on instance <idx> with the dmg tool. <type> can be either SCM or Metadata. | DAOS server attempts to bring-up an engine that has unformatted storage. |
engine_died | STATE_CHANGE | ERROR | DAOS engine <idx> exited unexpectedly: <error> | Indicates engine instance <idx> exited unexpectedly. | N/A |
engine_asserted | STATE_CHANGE | ERROR | TBD | Indicates engine instance <idx> threw a runtime assertion, causing a crash. | An unexpected internal state resulted in assert failure. |
engine_clock_drift | INFO_ONLY | ERROR | clock drift detected | Indicates CART comms layer has detected clock skew between engines. | NTP may not be syncing clocks across the DAOS system. |
engine_join_failed | INFO_ONLY | ERROR | DAOS engine <idx> (rank <rank>) was not allowed to join the system | Join operation failed for the given engine instance ID and rank (if assigned). | Reason should be provided in the extended info field of the event data. |
pool_corruption_detected | INFO_ONLY | ERROR | Data corruption detected | Indicates a corruption in pool data has been detected. The event fields will contain pool and container UUIDs. | A corruption was found by the checksum scrubber. |
pool_destroy_deferred | INFO_ONLY | WARNING | pool:<uuid> destroy is deferred | Indicates a destroy operation has been deferred. | Pool destroy in progress but not complete. |
pool_rebuild_started | INFO_ONLY | NOTICE | Pool rebuild started. | Indicates a pool rebuild has started. The event data field contains pool map version and pool operation identifier. | When a pool rank becomes unavailable a rebuild will be triggered. |
pool_rebuild_finished | INFO_ONLY | NOTICE | Pool rebuild finished. | Indicates a pool rebuild has finished successfully. The event data field includes the pool map version and pool operation identifier. | N/A |
pool_rebuild_failed | INFO_ONLY | ERROR | Pool rebuild failed: <rc>. | Indicates a pool rebuild has failed. The event data field includes the pool map version and pool operation identifier. <rc> provides a string representation of DER code. | N/A |
pool_replicas_updated | STATE_CHANGE | NOTICE | List of pool service replica ranks has been updated. | Indicates a pool service replica list has changed. The event contains the new service replica list in a custom payload. | When a pool service replica rank becomes unavailable a new rank is selected to replace it (if available). |
pool_durable_format_incompat | INFO_ONLY | ERROR | incompatible layout version: <current> not in [<min>, <max>] | Indicates the given pool's layout version does not match any of the versions supported by the currently running DAOS software. | DAOS engine is started with pool data in local storage that has an incompatible layout version. |
container_durable_format_incompat | INFO_ONLY | ERROR | incompatible layout version[: <current> not in [<min>, <max>]] | Indicates the given container's layout version does not match any of the versions supported by the currently running DAOS software. | DAOS engine is started with container data in local storage that has an incompatible layout version. |
rdb_durable_format_incompatible | INFO_ONLY | ERROR | incompatible layout version[: <current> not in [<min>, <max>]] OR incompatible DB UUID: <uuid> | Indicates the given RDB's layout version does not match any of the versions supported by the currently running DAOS software, or the given RDB's UUID does not match the expected UUID (usually because the RDB belongs to a pool created by a pre-2.0 DAOS version). | DAOS engine is started with rdb data in local storage that has an incompatible layout version. |
swim_rank_alive | STATE_CHANGE | NOTICE | TBD | The SWIM protocol has detected the specified rank is responsive. | A remote DAOS engine has become responsive. |
swim_rank_dead | STATE_CHANGE | NOTICE | SWIM rank marked as dead. | The SWIM protocol has detected the specified rank is unresponsive. | A remote DAOS engine has become unresponsive. |
system_start_failed | INFO_ONLY | ERROR | System startup failed, <errors> | Indicates that a user initiated controlled startup failed. <errors> shows which ranks failed. | Ranks failed to start. |
system_stop_failed | INFO_ONLY | ERROR | System shutdown failed during <action> action, <errors> | Indicates that a user initiated controlled shutdown failed. <action> identifies the failing shutdown action and <errors> shows which ranks failed. | Ranks failed to stop. |
system_fabric_provider_changed | INFO_ONLY | NOTICE | System fabric provider has changed: | Indicates that the system-wide fabric provider has been updated. No other specific information is included in event data. | A system-wide fabric provider change has been intentionally applied to all joined ranks. |
System Logging¶
Engine logging is configured on daos_server start-up by setting the log_file and log_mask parameters in the server config file. The DD_MASK and DD_SUBSYS environment variables can be defined within the env_vars list parameter of the engine section of the server config file to tune log output.
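For illustration, a minimal engines section carrying these parameters might look like the following (the log_file path and the mask values are placeholders):
engines:
  - log_file: /tmp/daos_engine.0.log
    log_mask: INFO
    env_vars:
      - DD_MASK=mgmt,md
      - DD_SUBSYS=server,mgmt,bio,common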
Engine log levels can be changed dynamically (at runtime) by setting log masks for a set of facilities to a given level. Settings will be applied to all running DAOS I/O Engines present in the configured dmg hostlist using the dmg server set-logmasks command. The command accepts named arguments for masks [-m|--masks] (equivalent to D_LOG_MASK), streams [-d|--streams] (equivalent to DD_MASK) and subsystems [-s|--subsystems] (equivalent to DD_SUBSYS):
Usage help:
dmg server set-logmasks --help
Usage:
dmg [OPTIONS] server set-logmasks [set-logmasks-OPTIONS]
Application Options:
--allow-proxy Allow proxy configuration via environment
-i, --insecure Have dmg attempt to connect without certificates
-d, --debug Enable debug output
--log-file= Log command output to the specified file
-j, --json Enable JSON output
-J, --json-logging Enable JSON-formatted log output
-o, --config-path= Client config file path
Help Options:
-h, --help Show this help message
[set-logmasks command options]
-l, --host-list= A comma separated list of addresses <ipv4addr/hostname>
to connect to
-m, --masks= Set log masks for a set of facilities to a given level.
The input string should look like
PREFIX1=LEVEL1,PREFIX2=LEVEL2,... where the syntax is
identical to what is expected by 'D_LOG_MASK'
environment variable. If the 'PREFIX=' part is omitted,
then the level applies to all defined facilities (e.g.
a value of 'WARN' sets everything to WARN). If unset
then reset engine log masks to use the 'log_mask' value
set in the server config file (for each engine) at the
time of DAOS system format. Supported levels are FATAL,
CRIT, ERR, WARN, NOTE, INFO, DEBUG
-d, --streams= Employ finer grained control over debug streams. Mask
bits are set as the first argument passed in
D_DEBUG(mask, ...) and this input string (DD_MASK) can
be set to enable different debug streams. The expected
syntax is a comma separated list of stream identifiers
and accepted DAOS Debug Streams are
md,pl,mgmt,epc,df,rebuild,daos_default and Common Debug
Streams (GURT) are any,trace,mem,net,io. If not set,
streams will be read from server config file and if set
to an empty string then all debug streams will be
enabled
-s, --subsystems= This input string is equivalent to the use of the
DD_SUBSYS environment variable and can be set to enable
logging for specific subsystems or facilities. The
expected syntax is a comma separated list of facility
identifiers. Accepted DAOS facilities are
common,tree,vos,client,server,rdb,pool,container,object-
,placement,rebuild,tier,mgmt,bio,tests, Common
facilities (GURT) are MISC,MEM and CaRT facilities
RPC,BULK,CORPC,GRP,LM,HG,ST,IV If not set, subsystems
to enable will be read from server config file and if
set to an empty string then logging all subsystems will
be enabled
If an arg is not passed, then that logging parameter for each engine process is reset to the values set in the server config file that was used when starting daos_server.
- --masks will be reset to the value of the engine config log_mask parameter.
- --streams will be reset to the env_vars DD_MASK environment variable value, or to an empty string if not set.
- --subsystems will be reset to the env_vars DD_SUBSYS environment variable value, or to an empty string if not set.
Example usage:
dmg server set-logmasks -m DEBUG,MEM=ERR -d mgmt,md -s server,mgmt,bio,common
This example would be a runtime equivalent to setting the following in the server config file:
...
engines:
- log_mask: DEBUG,MEM=ERR
env_vars:
- DD_SUBSYS=server,mgmt,bio,common
- DD_MASK=mgmt,md
...
If the above server config file was used to start an engine process, running dmg server set-logmasks without parameters would reset logging to config values and would be equivalent to the example given above.
For more information on the usage of the masks (D_LOG_MASK), streams (DD_MASK) and subsystems (DD_SUBSYS) parameters, refer to the Debugging System section.
System Monitoring¶
The DAOS servers maintain a set of metrics on I/O and internal state of the DAOS processes. The metrics collection is very lightweight and is always enabled. It cannot be manually enabled or disabled.
The DAOS metrics can be accessed locally on each DAOS server, or remotely by configuring an HTTP endpoint on each server.
Local metrics collection with daos_metrics¶
The daos-server package includes the daos_metrics command-line tool. This tool fetches metrics from the local host only. No configuration is required to use the daos_metrics command.
By default, daos_metrics displays the metrics in a human-readable tree format. To produce CSV-formatted output, use daos_metrics --csv.
Each DAOS engine maintains its own metrics. The --srv_idx parameter can be used to specify which engine to query if there are multiple engines configured per server. The default is to query the first engine on the server (index 0). See daos_metrics -h for details on how to filter metrics.
Configuring the servers for remote metrics collection¶
Each DAOS server can be configured to provide an HTTP endpoint for metrics collection. This endpoint presents the data in a format compatible with Prometheus.
To enable remote telemetry collection, update the control plane section of your DAOS server configuration file:
telemetry_port: 9191
By default, the HTTP endpoint is disabled. The default port number is 9191,
and it is recommended to use this port as it is also the default for the
clients that will collect the metrics. Each control plane server will present
its local metrics via the endpoint: http://<host>:<port>/metrics
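Once enabled, the endpoint can be verified with any HTTP client; a quick sanity check, assuming a host named daos-server-1 and the default port:
$ curl http://daos-server-1:9191/metrics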
Remote metrics collection with dmg telemetry¶
The dmg telemetry administrative command can be used to query an individual DAOS server for metrics. Only one DAOS host may be queried at a time. The command will return information for all engines on that server, identified by the "rank" attribute. The metrics have the same names as seen on the telemetry web endpoint.
By default, the dmg telemetry command produces human-readable output. The output can be formatted in JSON by running dmg -j telemetry.
To list all metrics for the server with their name, type and description:
dmg telemetry [-l <host>] [-p <telemetry-port>] metrics list
If no host is provided, the default is localhost. The default port is 9191.
To query the values of one or more metrics on the server:
dmg telemetry [-l <host>] [-p <telemetry-port>] metrics query [-m <metric_name>]
If no host is provided, the default is localhost. The default port is 9191.
Metric names may be provided in a comma-separated list. If no metric names are provided, all metrics are queried.
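For example, to query two metrics on a specific host (assuming a host named daos-server-1; substitute metric names reported by metrics list):
$ dmg telemetry -l daos-server-1 -p 9191 metrics query -m <metric_name1>,<metric_name2>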
Remote metrics collection with Prometheus¶
Prometheus is the preferred way to collect metrics from multiple DAOS servers at the same time.
To integrate with Prometheus, add a new job to your Prometheus server's configuration file, with the targets set to the hosts and telemetry ports of your DAOS servers:
scrape_configs:
- job_name: daos
scrape_interval: 5s
static_configs:
- targets: ['<host>:<telemetry-port>']
If there is not already a Prometheus server set up, DMG offers quick setup options for DAOS.
To install and configure Prometheus on the local machine:
dmg telemetry config [-i <install-dir>]
DMG will install Prometheus in the directory given with the -i <install-dir> option. The Prometheus install path may need to be added to the system $PATH environment variable.
The Prometheus configuration file will be populated based on the DAOS server list in your dmg configuration file. The Prometheus configuration will be written to $HOME/.prometheus.yml.
To start the Prometheus server with the configuration file generated by dmg:
prometheus --config-file=$HOME/.prometheus.yml
Storage Operations¶
Storage subcommands can be used to operate on host storage.
$ dmg storage --help
Usage:
dmg [OPTIONS] storage <command>
...
Available commands:
format Format SCM and NVMe storage attached to remote servers.
identify Blink the status LED on a given VMD device for visual SSD identification.
query Query storage commands, including raw NVMe SSD device health stats and internal blobstore health info.
replace Replace a storage device that has been hot-removed with a new device.
scan Scan SCM and NVMe storage attached to remote servers.
set Manually set the device state.
Storage query subcommands can be used to get detailed information about how DAOS is using host storage.
$ dmg storage query --help
Usage:
dmg [OPTIONS] storage query <command>
...
Available commands:
list-devices List storage devices on the server
list-pools List pools on the server
usage Show SCM & NVMe storage space utilization per storage server
Space Utilization¶
To query SCM and NVMe storage space usage and show how much space is available to create new DAOS pools with, run the following command:
- Query Per-Server Space Utilization:
$ dmg storage query usage --help
Usage:
dmg [OPTIONS] storage query usage
...
The command output shows online DAOS storage utilization, only including storage statistics for devices that have been formatted by DAOS control-plane and assigned to a currently running rank of the DAOS system. This represents the storage that can host DAOS pools.
$ dmg storage query usage
Hosts SCM-Total SCM-Free SCM-Used NVMe-Total NVMe-Free NVMe-Used
----- --------- -------- -------- ---------- --------- ---------
wolf-71 6.4 TB 2.0 TB 68 % 1.5 TB 1.1 TB 27 %
wolf-72 6.4 TB 2.0 TB 68 % 1.5 TB 1.1 TB 27 %
Note that the table values are per-host (storage server), while the SCM/NVMe capacity pool component values specified in dmg pool create are per rank.
If multiple ranks (I/O processes) have been configured per host in the server configuration file daos_server.yml, then the values supplied to dmg pool create should be at most the SCM/NVMe free space divided by the number of ranks per host.
For example, if 2.0 TB SCM and 10.0 TB NVMe free space is reported by dmg storage query usage and the server configuration file used to start the system specifies 2 I/O processes (2 "server" sections), the maximum pool size that can be specified is approximately dmg pool create -s 1T -n 5T (it may be necessary to specify slightly below the maximum to account for negligible metadata overhead).
SSD Management¶
Health Monitoring¶
Useful admin dmg commands to query NVMe SSD health:
- Query Per-Server Metadata:
$ dmg storage query list-devices --help
Usage:
dmg [OPTIONS] storage query list-devices [list-devices-OPTIONS]
...
[list-devices command options]
-l, --host-list= A comma separated list of addresses <ipv4addr/hostname> to
connect to
-r, --rank= Constrain operation to the specified server rank
-b, --health Include device health in results
-u, --uuid= Device UUID (all devices if blank)
-e, --show-evicted Show only evicted faulty devices
$ dmg storage query list-pools --help
Usage:
dmg [OPTIONS] storage query list-pools [list-pools-OPTIONS]
...
[list-pools command options]
-r, --rank= Constrain operation to the specified server rank
-u, --uuid= Pool UUID (all pools if blank)
-v, --verbose Show more detail about pools
The NVMe storage query list-devices and list-pools commands query the persistently stored SMD device and pool tables, respectively. The device table maps the internal device UUID to attached VOS target IDs. The rank number of the server where the device is located is also listed, along with the current device state. The current device states are the following:
- NORMAL: a fully functional device in-use by DAOS
- EVICTED: the device is no longer in-use by DAOS
- UNPLUGGED: the device is currently unplugged from the system (may or may not be evicted)
- NEW: the device is plugged and available and not currently in-use by DAOS
To list only devices in the EVICTED state, use the (--show-evicted|-e) option to the list-devices command.
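For example:
$ dmg -l boro-11 storage query list-devices --show-evicted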
The transport address is also listed for the device. This is either the PCIe address for normal NVMe SSDs, or the BDF format address of the backing NVMe SSDs behind a VMD (Volume Management Device) address. In the example below, the last two listed devices are both VMD devices with transport addresses in the BDF format behind the VMD address 0000:5d:05.5.
The pool table maps the DAOS pool UUID to attached VOS target IDs and will list all of the server ranks that the pool is distributed on. With the additional verbose flag, the mapping of SPDK blob IDs to VOS target IDs will also be displayed.
$ dmg -l boro-11,boro-13 storage query list-devices
-------
boro-11
-------
Devices
UUID:5bd91603-d3c7-4fb7-9a71-76bc25690c19 [TrAddr:0000:8a:00.0]
Targets:[0 2] Rank:0 State:NORMAL LED:OFF
UUID:80c9f1be-84b9-4318-a1be-c416c96ca48b [TrAddr:0000:8b:00.0]
Targets:[1 3] Rank:0 State:NORMAL LED:OFF
UUID:051b77e4-1524-4662-9f32-f8e4d2542c2d [TrAddr:0000:8c:00.0]
Targets:[] Rank:0 State:NEW LED:OFF
UUID:81905b24-be44-4106-8ff9-03002e9dd86a [TrAddr:5d0505:01:00.0]
Targets:[0 2] Rank:1 State:EVICTED LED:ON
UUID:2ccb8afb-5d32-454e-86e3-762ec5dca7be [TrAddr:5d0505:03:00.0]
Targets:[1 3] Rank:1 State:NORMAL LED:OFF
$ dmg -l boro-11,boro-13 storage query list-pools
-------
boro-11
-------
Pools
UUID:08d6839b-c71a-4af6-901c-28e141b2b429
Rank:0 Targets:[0 1 2 3]
Rank:1 Targets:[0 1 2 3]
$ dmg -l boro-11,boro-13 storage query list-pools --verbose
-------
boro-11
-------
Pools
UUID:08d6839b-c71a-4af6-901c-28e141b2b429
Rank:0 Targets:[0 1 2 3] Blobs:[4294967404 4294967405 4294967407 4294967406]
Rank:1 Targets:[0 1 2 3] Blobs:[4294967410 4294967411 4294967413 4294967412]
- Query Storage Device Health Data:
$ dmg storage query list-devices --health --help
Usage:
dmg [OPTIONS] storage query list-devices [list-devices-OPTIONS]
...
[list-devices command options]
-l, --host-list= A comma separated list of addresses <ipv4addr/hostname> to
connect to
-r, --rank= Constrain operation to the specified server rank
-b, --health Include device health in results
-u, --uuid= Device UUID (all devices if blank)
-e, --show-evicted Show only evicted faulty devices
$ dmg storage scan --nvme-health --help
Usage:
dmg [OPTIONS] storage scan [scan-OPTIONS]
...
[scan command options]
-l, --host-list= A comma separated list of addresses <ipv4addr/hostname>
to connect to
-v, --verbose List SCM & NVMe device details
-n, --nvme-health Display NVMe device health statistics
The 'dmg storage scan --nvme-health' command queries the device health data, including NVMe SSD health stats and in-memory I/O error and checksum error counters, and prefixes the stat list with NVMe controller details. The 'dmg storage query list-devices --health' command displays the same health data along with the SMD UUID, bdev roles, server rank and device state.
Vendor-specific SMART stats are displayed, currently for Intel devices only. Note: A reasonable timed workload of more than 60 minutes must be run for the SMART stats to register (raw values are 65535 otherwise). Media wear percentage can be calculated by dividing the raw value by 1024 to find the percentage of the maximum rated cycles.
$ dmg -l boro-11 storage query list-devices --health --uuid=d5ec1227-6f39-40db-a1f6-70245aa079f1
-------
boro-11
-------
Devices
UUID:d5ec1227-6f39-40db-a1f6-70245aa079f1 [TrAddr:d70505:03:00.0 NSID:1]
Roles:NA Targets:[3 7] Rank:0 State:NORMAL LED:OFF
Health Stats:
Timestamp:2021-09-13T11:12:34.000+00:00
Temperature:289K(15C)
Controller Busy Time:0s
Power Cycles:0
Power On Duration:0s
Unsafe Shutdowns:0
Media Errors:0
Read Errors:0
Write Errors:0
Unmap Errors:0
Checksum Errors:0
Error Log Entries:0
Critical Warnings:
Temperature: OK
Available Spare: OK
Device Reliability: OK
Read Only: OK
Volatile Memory Backup: OK
Intel Vendor SMART Attributes:
Program Fail Count:
Normalized:100%
Raw:0
Erase Fail Count:
Normalized:100%
Raw:0
Wear Leveling Count:
Normalized:100%
Min:24
Max:25
Avg:24
End-to-End Error Detection Count:0
CRC Error Count:0
Timed Workload, Media Wear:65535
Timed Workload, Host Read/Write Ratio:65535
Timed Workload, Timer:65535
Thermal Throttle Status:0%
Thermal Throttle Event Count:0
Retry Buffer Overflow Counter:0
PLL Lock Loss Count:0
NAND Bytes Written:244081
Host Bytes Written:52114
Exclusion and Hotplug¶
- Automatic exclusion of an NVMe SSD:
Automatic exclusion based on faulty criteria is the default behavior in DAOS release 2.6. The default criteria parameters are max_io_errs: 10 and max_csum_errs: <uint32_max> (meaning that eviction due to checksum errors is essentially disabled by default).
Auto-faulty criteria parameters can be set by adding the following YAML to the engine section of the server config file:
engines:
- bdev_auto_faulty:
enable: true
max_io_errs: 1
max_csum_errs: 2
On formatting the storage for the engine, these settings result in the following daos_server log entries, which indicate that the parameters are written to the engine's NVMe config:
DEBUG 13:59:29.229795 provider.go:592: BdevWriteConfigRequest: &{ForwardableRequest:{Forwarded:false} ConfigOutputPath:/mnt/daos0/daos_nvme.conf OwnerUID:10695475 OwnerGID:10695475 TierProps:[{Class:nvme DeviceList:0000:5e:00.0 DeviceFileSize:0 Tier:1 DeviceRoles:{OptionBits:0}}] HotplugEnabled:false HotplugBusidBegin:0 HotplugBusidEnd:0 Hostname:wolf-310.wolf.hpdd.intel.com AccelProps:{Engine: Options:0} SpdkRpcSrvProps:{Enable:false SockAddr:} AutoFaultyProps:{Enable:true MaxIoErrs:1 MaxCsumErrs:2} VMDEnabled:false ScannedBdevs:}
Writing NVMe config file for engine instance 0 to "/mnt/daos0/daos_nvme.conf"
The engine's NVMe config (produced during format) then contains the following JSON to apply the criteria:
cat /mnt/daos0/daos_nvme.conf
{
"daos_data": {
"config": [
{
"params": {
"enable": true,
"max_io_errs": 1,
"max_csum_errs": 2
},
"method": "auto_faulty"
...
These engine logfile entries indicate that the settings have been read and applied:
01/12-13:59:41.36 wolf-310 DAOS[1299350/-1/0] bio INFO src/bio/bio_config.c:1016 bio_read_auto_faulty_criteria() NVMe auto faulty is enabled. Criteria: max_io_errs:1, max_csum_errs:2
- Manually exclude an NVMe SSD:
$ dmg storage set nvme-faulty --help
Usage:
dmg [OPTIONS] storage set nvme-faulty [nvme-faulty-OPTIONS]
...
[nvme-faulty command options]
-u, --uuid= Device UUID to set
-f, --force Do not require confirmation
-l, --host= Single host address <ipv4addr/hostname> to connect to
To manually evict an NVMe SSD (auto eviction is covered earlier in this section), the device state needs to be set faulty by running the following command:
$ dmg storage set nvme-faulty --host=boro-11 --uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
NOTICE: This command will permanently mark the device as unusable!
Are you sure you want to continue? (yes/no)
yes
set-faulty operation performed successfully on the following host: boro-11:10001
The device state will transition from "NORMAL" to "EVICTED" (shown above), during which time the faulty device reaction will have been triggered (all targets on the SSD will be rebuilt). The SSD will remain evicted until device replacement occurs.
If an NVMe SSD is faulty, the status LED on the VMD device will be set to an ON state, represented by a solidly ON amber light. This LED activity visually indicates a fault and that the device needs to be replaced and is no longer in use by DAOS. The LED of the VMD device will remain in this state until replaced by a new device.
Note
Full NVMe hot plug capability will be available and supported in DAOS 2.6 release. Use is currently intended for testing only and is not supported for production.
- To use a newly added (hot-inserted) SSD it needs to be unbound from the kernel driver and bound instead to a user-space driver so that the device can be used with DAOS.
To rebind a SSD on a single host, run the following command (replace SSD PCI address and hostname with appropriate values):
$ dmg storage nvme-rebind -a 0000:84:00.0 -l wolf-167
Command completed successfully
The device will now be bound to a user-space driver (e.g. VFIO) and can be accessed by DAOS I/O engine processes (and used in the following dmg storage replace nvme command as a new device).
- Once an engine is using a newly added (hot-inserted) SSD it can be added to the persistent NVMe config (stored on SCM) so that on engine restart the new device will be used.
To update the engine's persistent NVMe config with the new SSD transport address, run the following command (replace SSD PCI address, engine index and hostname with appropriate values):
$ dmg storage nvme-add-device -a 0000:84:00.0 -e 0 -l wolf-167
Command completed successfully
The optional [--tier-index|-t] command parameter can be used to specify which storage tier to insert the SSD into. If specified, the server will attempt to insert the device into the tier with the given index; if not specified, the server will attempt to insert the device into the bdev tier with the lowest index value (the first bdev tier).
The device will now be registered in the engine's persistent NVMe config so that when restarted, the newly added SSD will be used.
- Replace an excluded SSD with a New Device:
$ dmg storage replace nvme --help
Usage:
dmg [OPTIONS] storage replace nvme [nvme-OPTIONS]
...
[nvme command options]
--old-uuid= Device UUID of hot-removed SSD
--new-uuid= Device UUID of new device
-l, --host= Single host address <ipv4addr/hostname> to connect to
To replace an evicted NVMe SSD with a new device and reintegrate it into use with DAOS, run the following command:
$ dmg storage replace nvme --host=boro-11 --old-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 --new-uuid=80c9f1be-84b9-4318-a1be-c416c96ca48b
dev-replace operation performed successfully on the following host: boro-11:10001
The old, now replaced device will remain in an "EVICTED" state until it is unplugged. The new device will transition from a "NEW" state to a "NORMAL" state (shown above).
- Reuse a FAULTY Device:
In order to reuse a device that was previously set as FAULTY and evicted from the DAOS system, an admin can run the following command (setting the old device UUID to be the new device UUID):
$ dmg storage replace nvme --host=boro-11 --old-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 --new-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
NOTICE: Attempting to reuse a previously set FAULTY device!
dev-replace operation performed successfully on the following host: boro-11:10001
The FAULTY device will transition from an "EVICTED" state back to a "NORMAL" state, and will again be available for use with DAOS. The use case of this command will mainly be for testing or for accidental device eviction.
Identification¶
The SSD identification feature is simply a way to quickly and visually locate a device. It requires the use of Intel VMD (Volume Management Device), which needs to be physically available on the hardware as well as enabled in the system BIOS. The feature supports two LED device events: locating a healthy device and locating an evicted device.
- Locate a Healthy SSD:
$ dmg storage led identify --help
Usage:
dmg [OPTIONS] storage led identify [identify-OPTIONS] [ids]
...
[identify command options]
--reset Reset blinking LED on specified VMD device back to previous state
[identify command arguments]
ids: Comma-separated list of identifiers which could be either VMD backing device
(NVMe SSD) PCI addresses or device UUIDs. All SSDs selected if arg not provided.
To identify a single SSD, any of the Device-UUIDs can be used; these can be found in the output of the dmg storage query list-devices command:
$ dmg -l boro-11 storage led identify 6fccb374-413b-441a-bfbe-860099ac5e8d
---------
boro-11
---------
Devices
TrAddr:850505:0b:00.0 LED:QUICK_BLINK
The SSD PCI address can also be used in the command to identify an SSD. The PCI address should refer to a VMD backing device and can be found from either the dmg storage scan -v or dmg storage query list-devices commands:
$ dmg -l boro-11 storage led identify 850505:0b:00.0
---------
boro-11
---------
Devices
TrAddr:850505:0b:00.0 LED:QUICK_BLINK
To identify multiple SSDs, supply a comma-separated list of Device-UUIDs and/or PCI addresses, adding a custom timeout of 5 minutes for LED identification (the time to flash the LED for):
$ dmg -l boro-11 storage led identify --timeout 5 850505:0a:00.0,6fccb374-413b-441a-bfbe-860099ac5e8d,850505:11:00.0
---------
boro-11
---------
Devices
TrAddr:850505:0a:00.0 LED:QUICK_BLINK
TrAddr:850505:0b:00.0 LED:QUICK_BLINK
TrAddr:850505:11:00.0 LED:QUICK_BLINK
If a Device-UUID is specified, the command output will display the PCI address of the SSD to which the Device-UUID belongs and the LED state of that SSD. Mappings of Device-UUIDs to PCI addresses can be found in the output of the dmg storage query list-devices command.
An error will be returned if the Device-UUID or PCI address of a non-VMD enabled SSD is specified in the command.
Upon issuing a device identify command with specified device IDs and an optional custom timeout value, an admin can now quickly identify a device in question.
After issuing the identify command, the status LED on the VMD device is now set to a "QUICK_BLINK" state, representing a quick, 4Hz blinking amber light.
The device will quickly blink for the specified timeout (in minutes) or the default (2 minutes) if no value is specified on the command line, after which the LED state will return to the previous state (faulty "ON" or default "OFF").
The led identify command will set (or --reset) the state of all devices on the specified host(s) if no positional arguments are supplied.
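For example, to return a blinking LED to its previous state before the timeout expires:
$ dmg -l boro-11 storage led identify --reset 850505:0b:00.0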
- Check LED state of SSDs:
To verify the LED state of SSDs the following command can be used in a similar way to the identify command:
$ dmg -l boro-11 storage led check 850505:0a:00.0,6fccb374-413b-441a-bfbe-860099ac5e8d,850505:11:00.0
---------
boro-11
---------
Devices
TrAddr:850505:0a:00.0 LED:QUICK_BLINK
TrAddr:850505:0b:00.0 LED:QUICK_BLINK
TrAddr:850505:11:00.0 LED:QUICK_BLINK
The led check command will return the state of all devices on the specified host(s) if no positional arguments are supplied.
- Locate an Evicted SSD:
If an NVMe SSD is evicted, the status LED on the VMD device is set to a "FAULT" state, represented by a solidly "ON" amber light. No additional command apart from the SSD eviction command would be needed, and this would visually indicate that the device needs to be replaced and is no longer in use by DAOS. The LED of the VMD device would remain in this state until replaced by a new device.
System Operations¶
The DAOS server acting as the access point records details of engines that join the DAOS system. Once an engine has joined the DAOS system, it is identified by a unique system "rank". Multiple ranks can reside on the same host machine, accessible via the same network address.
A DAOS system can be shut down and restarted to perform maintenance and/or reboot hosts. Pool data and state will be maintained provided no changes are made to the rank's metadata stored on persistent memory.
Storage reformat can also be performed after system shutdown. Pools will be removed and storage wiped.
System commands will be handled by a DAOS Server acting as access point and listening on the address specified in the DMG config file "hostlist" parameter. See daos_control.yml for details.
At least one of the addresses in the hostlist parameter should match one of the "access point" addresses specified in the server config file daos_server.yml that is supplied when starting daos_server instances.
- Commands used to manage a DAOS System:
$ dmg system --help
Usage:
dmg [OPTIONS] system <command>
...
Available commands:
cleanup Clean up all resources associated with the specified machine
erase Erase system metadata prior to reformat
leader-query Query for current Management Service leader
list-pools List all pools in the DAOS system
query Query DAOS system status
start Perform start of stopped DAOS system
stop Perform controlled shutdown of DAOS system
Membership¶
The system membership refers to the DAOS engine processes that have registered, or joined, a specific DAOS system.
- Query System Membership:
$ dmg system query --help
Usage:
dmg [OPTIONS] system query [query-OPTIONS]
...
[query command options]
-r, --ranks= Comma separated ranges or individual system ranks to operate on
--rank-hosts= Hostlist representing hosts whose managed ranks are to be operated on
-v, --verbose Display more member details
The --ranks option takes a pattern describing rank ranges, e.g. 0,5-10,20-100. The --rank-hosts option takes a pattern describing host ranges, e.g. storagehost[0,5-10],10.8.1.[20-100].
The output table will provide system rank mappings to host address and instance UUID, in addition to the rank state.
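For example, to display verbose details for a subset of members:
$ dmg system query -v -r 0,5-10
$ dmg system query -v --rank-hosts storagehost[0,5-10]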
DAOS engines run a gossip-based protocol called SWIM that provides efficient and scalable fault detection. When an engine is reported as unresponsive, a RAS event is raised and the associated engine is marked as excluded in the output of dmg system query. The engine can be stopped (see next section) and then restarted to rejoin the system. A failed engine might also be excluded from the pools it hosted; please check the pool operation section on how to reintegrate an excluded engine.
Shutdown¶
When up and running, the entire system can be shutdown.
- Stop a System:
$ dmg system stop --help
Usage:
dmg [OPTIONS] system stop [stop-OPTIONS]
...
[stop command options]
-r, --ranks= Comma separated ranges or individual system ranks to operate on
--rank-hosts= Hostlist representing hosts whose managed ranks are to be operated on
--force Force stop DAOS system members
The --ranks option takes a pattern describing rank ranges, e.g. 0,5-10,20-100. The --rank-hosts option takes a pattern describing host ranges, e.g. storagehost[0,5-10],10.8.1.[20-100].
The output table will indicate action and result.
While the engines are stopped, the DAOS servers will continue to operate and listen on the management network.
Warning
All engines monitor each other and proactively exclude unresponsive members. It is critical to properly stop a DAOS system with dmg in the case of planned maintenance on all or a majority of the DAOS storage nodes. An abrupt reboot of the storage nodes might result in massive exclusions that will take time to recover from.
The force option can be passed for cases when a clean shutdown is not working. Monitoring is not disabled in this case and spurious exclusions might happen, but the engines are guaranteed to be killed.
dmg also allows stopping a subset of engines identified by ranks or hostnames, as shown below. This is useful to stop (and restart) misbehaving engines.
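For example, to stop and later restart a single misbehaving rank:
$ dmg system stop -r 5
$ dmg system start -r 5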
Start¶
The system can be started back up after a controlled shutdown.
- Start a System:
$ dmg system start --help
Usage:
dmg [OPTIONS] system start [start-OPTIONS]
...
[start command options]
-r, --ranks= Comma separated ranges or individual system ranks to operate on
--rank-hosts= Hostlist representing hosts whose managed ranks are to be operated on
The --ranks option takes a pattern describing rank ranges, e.g. 0,5-10,20-100. The --rank-hosts option takes a pattern describing host ranges, e.g. storagehost[0,5-10],10.8.1.[20-100].
The output table will indicate action and result.
DAOS I/O Engines will be started.
As for shutdown, a subsection of engines identified by ranks or hostnames can be specified on the command line. If the ranks were excluded from pools (e.g., after an unclean shutdown), they will need to be reintegrated. Please see the pool operation section for more information.
Storage Reformat¶
To reformat the system after a controlled shutdown, run the command:
$ dmg storage format --force
The --force flag indicates that a (re)format operation should be performed, disregarding existing filesystems.
- If no record of previously running ranks can be found, reformat is performed on the hosts specified in the daos_control.yml config file's hostlist parameter.
- If system membership has records of previously running ranks, storage allocated to those ranks will be formatted.
The output table will indicate action and result.
DAOS I/O Engines will be started, and all DAOS pools will have been removed.
Note
While it should not be required during normal operations, one may still want to restart the DAOS installation from scratch without using the DAOS control plane.
First, ensure all daos_server processes on all hosts have been stopped, then for each SCM mount specified in the config file (scm_mount in the servers section) umount and wipe FS signatures.
$ umount /mnt/daos0
$ umount /mnt/daos1
$ wipefs -a /dev/pmem0
$ wipefs -a /dev/pmem1
Then restart DAOS Servers and format.
System Erase¶
To erase the DAOS storage configuration, the dmg system erase command can be used. Before doing this, the affected engines need to be stopped by running dmg system stop (if necessary with the --force flag).
The erase operation will destroy any pools that may still exist, and will unconfigure the storage. It will not stop the daos_server process, so the dmg command can still be used. For example, the system can be formatted again by running dmg storage format.
Note
Note that dmg system erase does not currently reset the SCM. The /dev/pmemX devices will remain mounted, and the PMem configuration will not be reset to Memory Mode. To completely unconfigure the SCM, it is advisable to run daos_server scm reset, which will completely reset the PMem. A reboot will be required to finalize the change of the PMem allocation goals.
System Extension¶
To add a new server to an existing DAOS system, one should install:
- the relevant certificates
- the server yaml file pointing to the access points of the running DAOS system
The daos_control.yml file should also be updated to include the new DAOS server.
Then start daos_server via systemd and format the new server via dmg as follows:
$ dmg storage format -l ${new_storage_node}
new_storage_node should be replaced with the hostname or the IP address of the new storage node (comma separated list or range of hosts for multiple nodes) to be added.
Upon completion of the format operation, the new storage nodes will join the system (this can be checked with dmg system query -v).
Note
New pools created after the extension will automatically use the newly added nodes (if membership is not restricted on the dmg command line). That being said, existing pools won't be automatically extended to use the new servers. Please see the pool operation section for how to extend the pool membership.
Software Upgrade¶
The DAOS v2.0 wire protocol and persistent layout are not compatible with previous DAOS versions; upgrading requires a reformat, and all client and server nodes must be upgraded to a 2.x version.
Warning
Attempts to start DAOS v2.0 over a system formatted with a previous DAOS version will trigger a RAS event and cause all the engines to abort. Similarly, a 2.0 DAOS client or engine will refuse to communicate with a peer that runs an incompatible version.
DAOS v2.0 will maintain interoperability for both the wire protocol and persistent layout with any future v2.x versions. That being said, it is required that all engines in the same system run the same DAOS version.
Warning
Rolling upgrade is not supported at this time.
DAOS v2.2 client connections to pools which were created by DAOS v2.4 will be rejected. DAOS v2.4 clients should work with both DAOS v2.4 and DAOS v2.2 servers. To upgrade all pools to the latest format after a software upgrade, run dmg pool upgrade <pool>
Interoperability Matrix¶
The following table is intended to visually depict the interoperability policies for all major components in a DAOS system.
| Server (daos_server) | Engine (daos_engine) | Agent (daos_agent) | Client (libdaos) | Admin (dmg) |
---|---|---|---|---|---|
Server | x.y.z | x.y.z | x.(y±1) | n/a | x.y |
Engine | x.y.z | x.y.z | n/a | x.(y±1) | n/a |
Agent | x.(y±1) | n/a | n/a | x.y.z | n/a |
Client | n/a | x.(y±1) | x.y.z | n/a | n/a |
Admin | x.y | n/a | n/a | n/a | n/a |
Key:
- x.y.z: Major.Minor.Patch must be equal
- x.y: Major.Minor must be equal
- x.(y±1): Major must be equal, Minor must be equal or -1/+1 release version
- n/a: Components do not communicate
Examples:
- daos_server 2.4.0 is only compatible with daos_engine 2.4.0
- daos_agent 2.6.0 is compatible with daos_server 2.4.0 (2.5 is a development version)
- dmg 2.4.1 is compatible with daos_server 2.4.0