DAOS System Administration¶
RAS Events¶
Reliability, Availability, and Serviceability (RAS) related events are communicated and logged within DAOS and syslog.
Event Structure¶
The following table describes the structure of a DAOS RAS event, including descriptions of mandatory and optional fields.
Field | Optional/Mandatory | Description |
---|---|---|
ID | Mandatory | Unique event identifier referenced in the manual. |
Type | Mandatory | Event type of STATE_CHANGE causes an update to the Management Service (MS) database in addition to event being written to SYSLOG. INFO_ONLY type events are only written to SYSLOG. |
Timestamp | Mandatory | Resolution at the microseconds and include the timezone offset to avoid locality issues. |
Severity | Mandatory | Indicates event severity, Error/Warning/Notice. |
Msg | Mandatory | Human readable message. |
HID | Optional | Identify hardware components involved in the event. E.g., PCI address for SSD, network interface |
Rank | Optional | DAOS rank involved in the event. |
PID | Optional | Identifier of the process involved in the RAS event |
TID | Optional | Identifier of the thread involved in the RAS event. |
JOBID | Optional | Identifier of the job involved in the RAS event. |
Hostname | Optional | Hostname of the node involved in the event. |
PUUID | Optional | Pool UUID involved in the event, if any. |
CUUID | Optional | Container UUID involved in the event, if relevant. |
OID | Optional | Object identifier involved in the event, if relevant. |
Control Operation | Optional | Recommended automatic action, if any. |
Data | Optional | Specific instance data treated as a blob. |
Below is an example of a RAS event signaling an exclusion of an unresponsive engine:
&&& RAS EVENT id: [swim_rank_dead] ts: [2021-11-21T13:32:31.747408+0000] host: [wolf-112.wolf.hpdd.intel.com] type: [STATE_CHANGE] sev: [NOTICE] msg: [SWIM marked rank as dead.] pid: [253454] tid: [1] rank: [6] inc: [63a058833280000]
Event List¶
The following table lists supported DAOS RAS events, including IDs, type, severity, message, description, and cause.
Event | Event type | Severity | Message | Description | Cause |
---|---|---|---|---|---|
engine_format_required | INFO_ONLY | NOTICE | DAOS engine <idx> requires a <type> format | Indicates engine is waiting for allocated storage to be formatted on formatted on instance <idx> with dmg tool. <type> can be either SCM or Metadata. | DAOS server attempts to bring-up an engine that has unformatted storage. |
engine_died | STATE_CHANGE | ERROR | DAOS engine <idx> exited exited unexpectedly: <error> | Indicates engine instance <idx> unexpectedly. |
N/A |
engine_asserted | STATE_CHANGE | ERROR | TBD | Indicates engine instance |
An unexpected internal state resulted in assert failure. |
engine_clock_drift | INFO_ONLY | ERROR | clock drift detected | Indicates CART comms layer has detected clock skew between engines. | NTP may not be syncing clocks across DAOS system. |
pool_rebuild_started | INFO_ONLY | NOTICE | Pool rebuild started. | Indicates a pool rebuild has started. The event data field contains pool map version and pool operation identifier. | When a pool rank becomes unavailable a rebuild will be triggered. |
pool_rebuild_finished | INFO_ONLY | NOTICE | Pool rebuild finished. | Indicates a pool rebuild has finished successfully. The event data field includes the pool map version and pool operation identifier. | N/A |
pool_rebuild_failed | INFO_ONLY | ERROR | Pool rebuild failed: <rc>. | Indicates a pool rebuild has failed. The event data field includes the pool map version and pool operation identifier. <rc> provides a string representation of DER code. | N/A |
pool_replicas_updated | STATE_CHANGE | NOTICE | List of pool service replica ranks has been updated. | Indicates a pool service replica list has changed. The event contains the new service replica list in a custom payload. | When a pool service replica rank becomes unavailable a new rank is selected to replace it (if available). |
pool_durable_format_incompat | INFO_ONLY | ERROR | incompatible layout version: <current> not in [<min>, <max>] | Indicates the given pool's layout version does not match any of the versions supported by the currently running DAOS software. | DAOS engine is started with pool data in local storage that has an incompatible layout version. |
container_durable_format_incompat | INFO_ONLY | ERROR | incompatible layout version[: <current> not in [<min>, <max>] | Indicates the given container's layout version does not match any of the versions supported by the currently running DAOS software. | DAOS engine is started with container data in local storage that has an incompatible layout version. |
rdb_durable_format_incompatible | INFO_ONLY | ERROR | incompatible layout version[: <current> not in [<min>, <max>]] OR incompatible DB UUID: <uuid> | Indicates the given RDB's layout version does not match any of the versions supported by the currently running DAOS software, or the given RDB's UUID does not match the expected UUID (usually because the RDB belongs to a pool created by a pre-2.0 DAOS version). | DAOS engine is started with rdb data in local storage that has an incompatible layout version. |
swim_rank_alive | STATE_CHANGE | NOTICE | TBD | The SWIM protocol has detected the specified rank is responsive. | A remote DAOS engine has become responsive. |
swim_rank_dead | STATE_CHANGE | NOTICE | SWIM rank marked as dead. | The SWIM protocol has detected the specified rank is unresponsive. | A remote DAOS engine has become unresponsive. |
system_start_failed | INFO_ONLY | ERROR | System startup failed, <errors> | Indicates that a user initiated controlled startup failed. <errors> shows which ranks failed. | Ranks failed to start. |
system_stop_failed | INFO_ONLY | ERROR | System shutdown failed during <action> action, <errors> | Indicates that a user initiated controlled shutdown failed. <action> identifies the failing shutdown action and <errors> shows which ranks failed. | Ranks failed to stop. |
System Logging¶
Engine logging is initially configured by setting the log_file
and log_mask
parameters in the server config file. Logging is described in detail in the
Debugging System
section.
Engine log levels can be changed dynamically (at runtime) by setting log masks
for a set of facilities to a given level.
Settings will be applied to all running DAOS I/O Engines present in the configured
dmg hostlist using the command dmg server set-logmasks [<masks>]
.
The command accepts 0-1 positional arguments.
If no args are passed, then the log masks for each running engine will be reset
to the value of engine "log_mask" parameter in the server config file (as set
at the time of daos_server startup).
If a single arg is passed, then this will be used as the log masks setting.
Example usage:
dmg server set-logmasks ERR,mgmt=DEBUG
The input string should look like PREFIX1=LEVEL1,PREFIX2=LEVEL2,... where the syntax is identical to what is expected by the 'D_LOG_MASK' environment variable. If the 'PREFIX=' part is omitted, then the level applies to all defined facilities (e.g., a value of 'WARN' sets everything to WARN).
Supported priority levels for engine logging are FATAL, CRIT, ERR, WARN, NOTE, INFO, DEBUG.
System Monitoring¶
The DAOS servers maintain a set of metrics on I/O and internal state of the DAOS processes. The metrics collection is very lightweight and is always enabled. It cannot be manually enabled or disabled.
The DAOS metrics can be accessed locally on each DAOS server, or remotely by configuring an HTTP endpoint on each server.
Local metrics collection with daos_metrics¶
The daos-server
package includes the daos_metrics
command-line tool.
This tool fetches metrics from the local host only.
No configuration is required to use the daos_metric
command.
By default, daos_metrics
displays the metrics in a human-readable tree format.
To produce CSV formatted output, use daos_metrics --csv
.
Each DAOS engine maintains its own metrics.
The --srv_idx
parameter can be used to specify which engine to query, if there
are multiple engines configured per server.
The default is to query the first engine on the server (index 0).
See daos_metrics -h
for details on how to filter metrics.
Configuring the servers for remote metrics collection¶
Each DAOS server can be configured to provide an HTTP endpoint for metrics collection. This endpoint presents the data in a format compatible with Prometheus.
To enable remote telemetry collection, update the control plane section of your DAOS server configuration file:
telemetry_port: 9191
By default, the HTTP endpoint is disabled. The default port number is 9191,
and it is recommended to use this port as it is also the default for the
clients that will collect the metrics. Each control plane server will present
its local metrics via the endpoint: http://<host>:<port>/metrics
Remote metrics collection with dmg telemetry¶
The dmg telemetry
administrative command can be used to query an individual DAOS
server for metrics. Only one DAOS host may be queried at a time.
The command will return information for all engines on that server,
identified by the "rank" attribute.
The metrics have the same names as seen on the telemetry web endpoint.
By default, the dmg telemetry
command produces human readable output.
The output can be formatted in JSON by running dmg -j telemetry
.
To list all metrics for the server with their name, type and description:
dmg telemetry [-l <host>] [-p <telemetry-port>] metrics list
If no host is provided, the default is localhost. The default port is 9191.
To query the values of one or more metrics on the server:
dmg telemetry [-l <host>] [-p <telemetry-port>] metrics query [-m <metric_name>]
If no host is provided, the default is localhost. The default port is 9191.
Metric names may be provided in a comma-separated list. If no metric names are provided, all metrics are queried.
Remote metrics collection with Prometheus¶
Prometheus is the preferred way to collect metrics from multiple DAOS servers at the same time.
To integrate with Prometheus, add a new job to your Prometheus server's
configuration file, with the targets
set to the hosts and telemetry ports of
your DAOS servers:
scrape_configs:
- job_name: daos
scrape_interval: 5s
static_configs:
- targets: ['<host>:<telemetry-port>']
If there is not already a Prometheus server set up, DMG offers quick setup options for DAOS.
To install and configure Prometheus on the local machine:
dmg telemetry config [-i <install-dir>]
If no install-dir
is provided, DMG will attempt to install Prometheus in the
first writable directory found in the user's PATH
.
The Prometheus configuration file will be populated based on the DAOS server
list in your dmg
configuration file. The Prometheus configuration will be
written to $HOME/.prometheus.yml
.
To start the Prometheus server with the configuration file generated by dmg
:
prometheus --config-file=$HOME/.prometheus.yml
Storage Operations¶
Space Utilization¶
To query SCM and NVMe storage space usage and show how much space is available to create new DAOS pools with, run the following command:
$ dmg storage query usage
Hosts SCM-Total SCM-Free SCM-Used NVMe-Total NVMe-Free NVMe-Used
----- --------- -------- -------- ---------- --------- ---------
wolf-71 6.4 TB 2.0 TB 68 % 1.5 TB 1.1 TB 27 %
wolf-72 6.4 TB 2.0 TB 68 % 1.5 TB 1.1 TB 27 %
The command output shows online DAOS storage utilization, only including storage statistics for devices that have been formatted by DAOS control-plane and assigned to a currently running rank of the DAOS system. This represents the storage that can host DAOS pools.
Note that the table values are per-host (storage server) and SCM/NVMe capacity
pool component values specified in
dmg pool create
are per rank.
If multiple ranks (I/O processes) have been configured per host in the server
configuration file
daos_server.yml
then the values supplied to dmg pool create
should be
a maximum of the SCM/NVMe free space divided by the number of ranks per host.
For example, if 2.0 TB SCM and 10.0 TB NVMe free space is reported by
dmg storage query usage
and the server configuration file used to start the
system specifies 2 I/O processes (2 "server" sections), the maximum pool size
that can be specified is approximately dmg pool create -s 1T -n 5T
(may need to
specify slightly below the maximum to take account of negligible metadata
overhead).
SSD Management¶
Health Monitoring¶
Useful admin dmg commands to query NVMe SSD health:
- Query Per-Server Metadata:
dmg storage query (list-devices|list-pools)
dmg storage scan --nvme-meta
shows mapping of metadata to NVMe controllers
The NVMe storage query list-devices and list-pools commands query the persistently stored SMD device and pool tables, respectively. The device table maps the internal device UUID to attached VOS target IDs. The rank number of the server where the device is located is also listed, along with the current device state. The current device states are the following: - NORMAL: a fully functional device in-use by DAOS - EVICTED: the device is no longer in-use by DAOS - UNPLUGGED: the device is currently unplugged from the system (may or not be evicted) - NEW: the device is plugged and available and not currently in-use by DAOS
The transport address is also listed for the device. This is either the PCIe address for normal NVMe SSDs, or the BDF format address of the backing NVMe SSDs behind a VMD (Volume Management Device) address. In the example below, the last two listed devices are both VMD devices with transport addresses in the BDF format behind the VMD address 0000:5d:05.5.
The pool table maps the DAOS pool UUID to attached VOS target IDs and will list all of the server ranks that the pool is distributed on. With the additional verbose flag, the mapping of SPDK blob IDs to VOS target IDs will also be displayed.
$ dmg -l boro-11,boro-13 storage query list-devices
-------
boro-11
-------
Devices
UUID:5bd91603-d3c7-4fb7-9a71-76bc25690c19 [TrAddr:0000:8a:00.0]
Targets:[0 2] Rank:0 State:NORMAL
UUID:80c9f1be-84b9-4318-a1be-c416c96ca48b [TrAddr:0000:8b:00.0]
Targets:[1 3] Rank:0 State:NORMAL
UUID:051b77e4-1524-4662-9f32-f8e4d2542c2d [TrAddr:0000:8c:00.0]
Targets:[] Rank:0 State:NEW
UUID:81905b24-be44-4106-8ff9-03002e9dd86a [TrAddr:5d0505:01:00.0]
Targets:[0 2] Rank:1 State:EVICTED
UUID:2ccb8afb-5d32-454e-86e3-762ec5dca7be [TrAddr:5d0505:03:00.0]
Targets:[1 3] Rank:1 State:NORMAL
$ dmg -l boro-11,boro-13 storage query list-pools
-------
boro-11
-------
Pools
UUID:08d6839b-c71a-4af6-901c-28e141b2b429
Rank:0 Targets:[0 1 2 3]
Rank:1 Targets:[0 1 2 3]
$ dmg -l boro-11,boro-13 storage query list-pools --verbose
-------
boro-11
-------
Pools
UUID:08d6839b-c71a-4af6-901c-28e141b2b429
Rank:0 Targets:[0 1 2 3] Blobs:[4294967404 4294967405 4294967407 4294967406]
Rank:1 Targets:[0 1 2 3] Blobs:[4294967410 4294967411 4294967413 4294967412]
- Query Storage Device Health Data:
dmg storage query (device-health|target-health)
dmg storage scan --nvme-health
shows NVMe controller health stats
The NVMe storage query device-health and target-health commands query the device health data, including NVMe SSD health stats and in-memory I/O error and checksum error counters. The server rank and device state are also listed. The device health data can either be queried by device UUID (device-health command) or by VOS target ID along with the server rank (target-health command). The same device health information is displayed with both command options. Additionally, vendor-specific SMART stats are displayed, currently for Intel devices only. Note: A reasonable timed workload > 60 min must be ran for the SMART stats to register (Raw values are 65535). Media wear percentage can be calculated by dividing by 1024 to find the percentage of the maximum rated cycles.
$ dmg -l boro-11 storage query device-health --uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
or
$ dmg -l boro-11 storage query target-health --rank=0 --tgtid=0
-------
boro-11
-------
Devices
UUID:5bd91603-d3c7-4fb7-9a71-76bc25690c19 [TrAddr:0000:8a:00.0]
Targets:[0 1 2 3] Rank:0 State:NORMAL
Health Stats:
Timestamp:2021-09-13T11:12:34.000+00:00
Temperature:289K(15C)
Controller Busy Time:0s
Power Cycles:0
Power On Duration:0s
Unsafe Shutdowns:0
Media Errors:0
Read Errors:0
Write Errors:0
Unmap Errors:0
Checksum Errors:0
Error Log Entries:0
Critical Warnings:
Temperature: OK
Available Spare: OK
Device Reliability: OK
Read Only: OK
Volatile Memory Backup: OK
Intel Vendor SMART Attributes:
Program Fail Count:
Normalized:100%
Raw:0
Erase Fail Count:
Normalized:100%
Raw:0
Wear Leveling Count:
Normalized:100%
Min:24
Max:25
Avg:24
End-to-End Error Detection Count:0
CRC Error Count:0
Timed Workload, Media Wear:65535
Timed Workload, Host Read/Write Ratio:65535
Timed Workload, Timer:65535
Thermal Throttle Status:0%
Thermal Throttle Event Count:0
Retry Buffer Overflow Counter:0
PLL Lock Loss Count:0
NAND Bytes Written:244081
Host Bytes Written:52114
Exclusion and Hotplug¶
- Manually exclude an NVMe SSD:
dmg storage set nvme-faulty
To manually evict an NVMe SSD (auto eviction will be supported in a future release), the device state needs to be set to "FAULTY" by running the following command:
$ dmg -l boro-11 storage set nvme-faulty --uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
-------
boro-11
-------
Devices
UUID:5bd91603-d3c7-4fb7-9a71-76bc25690c19 Targets:[] Rank:1 State:FAULTY
The device state will transition from "NORMAL" to "FAULTY" (shown above), which will trigger the faulty device reaction (all targets on the SSD will be rebuilt, and the SSD will remain evicted until device replacement occurs).
Note
Full NVMe hot plug capability will be available and supported in DAOS 2.2 release. Use is currently intended for testing only and is not supported for production.
- Replace an excluded SSD with a New Device:
dmg storage replace nvme
To replace an NVMe SSD with an evicted device and reintegrate it into use with DAOS, run the following command:
$ dmg -l boro-11 storage replace nvme --old-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 --new-uuid=80c9f1be-84b9-4318-a1be-c416c96ca48b
-------
boro-11
-------
Devices
UUID:80c9f1be-84b9-4318-a1be-c416c96ca48b Targets:[] Rank:1 State:NORMAL
The old, now replaced device will remain in an "EVICTED" state until it is unplugged. The new device will transition from a "NEW" state to a "NORMAL" state (shown above).
- Reuse a FAULTY Device:
dmg storage replace nvme
In order to reuse a device that was previously set as FAULTY and evicted from the DAOS system, an admin can run the following command (setting the old device UUID to be the new device UUID):
$ dmg -l boro-11 storage replace nvme --old-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 --new-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
-------
boro-11
-------
Devices
UUID:5bd91603-d3c7-4fb7-9a71-76bc25690c19 Targets:[] Rank:1 State:NORMAL
The FAULTY device will transition from an "EVICTED" state back to a "NORMAL" state, and will again be available for use with DAOS. The use case of this command will mainly be for testing or for accidental device eviction.
Identification¶
The SSD identification feature is simply a way to quickly and visually locate a device. It requires the use of Intel VMD (Volume Management Device), which needs to be physically available on the hardware as well as enabled in the system BIOS. The feature supports two LED device events: locating a healthy device and locating an evicted device.
- Locate a Healthy SSD:
dmg storage identify vmd
To quickly identify an SSD in question, an administrator can run the following command:
$ dmg -l boro-11 storage identify vmd --uuid=6fccb374-413b-441a-bfbe-860099ac5e8d
If a non-VMD device UUID is used with the command, the following error will occur:
localhost DAOS error (-1010): DER_NOSYS
The status LED on the VMD device is now set to an "IDENTIFY" state, represented by a quick, 4Hz blinking amber light. The device will quickly blink by default for about 60 seconds and then return to the default "OFF" state. The LED event duration can be customized by setting the VMD_LED_PERIOD environment variable if a duration other than the default value is desired.
- Locate an Evicted SSD:
If an NVMe SSD is evicted, the status LED on the VMD device is set to a "FAULT" state, represented by a solidly ON amber light. No additional command apart from the SSD eviction command would be needed, and this would visually indicate that the device needs to be replaced and is no longer in use by DAOS. The LED of the VMD device would remain in this state until replaced by a new device.
System Operations¶
The DAOS server acting as the access point records details of engines that join the DAOS system. Once an engine has joined the DAOS system, it is identified by a unique system "rank". Multiple ranks can reside on the same host machine, accessible via the same network address.
A DAOS system can be shutdown and restarted to perform maintenance and/or reboot hosts. Pool data and state will be maintained providing no changes are made to the rank's metadata stored on persistent memory.
Storage reformat can also be performed after system shutdown. Pools will be removed and storage wiped.
System commands will be handled by the DAOS Server listening at the access point
address specified as the first entry in the DMG config file "hostlist" parameter.
See
daos_control.yml
for details.
The "access point" address should be the same as that specified in the server
config file
daos_server.yml
specified when starting daos_server
instances.
Membership¶
The system membership can be queried using the command:
$ dmg system query [--verbose] [--ranks <rankset>|--host-ranks <hostset>]
<rankset>
is a pattern describing rank ranges e.g., 0,5-10,20-100<hostset>
is a pattern describing host ranges e.g., storagehost[0,5-10],10.8.1.[20-100]--verbose
flag gives more information on each rank
The output table will provide system rank mappings to host address and instance UUID, in addition to the rank state.
DAOS engines run a gossip-based protocol called SWIM that provides efficient
and scalable fault detection. When an engine is reported as unresponsive, a
RAS event is raised and the associated engine is marked as excluded in the
output of dmg system query
. The engine can be stopped (see next section)
and then restarted to rejoin the system. An failed engine might also be excluded
from the pools it hosted, please check the pool operation section on how to
reintegrate an excluded engine.
Shutdown¶
When up and running, the entire system can be shutdown with the command:
$ dmg system stop [--force]
The output table will indicate action and result.
While the engines are stopped, the DAOS servers will continue to operate and listen on the management network.
Warning
All engines monitor each other and pro-actively exclude unresponsive members. It is critical to properly stop a DAOS system as with dmg in the case of a planned maintenance on all or a majority of the DAOS storage nodes. An abrupt reboot of the storage nodes might result in massive exclusion that will take time to recover.
The force option can be passed to dmg system stop for cases when a clean shutown is not working. Monitoring is not disabled in this case and spurious exclusion might happen, but the engines are guaranteed to be killed.
dmg also allows to stop a list of engines identified by ranks or hostnames. This is useful to stop (and restart) misbehaving engines.
$ dmg system stop [--force] [--ranks <rankset>|--host-ranks <hostset>]
<rankset>
is a pattern describing rank ranges e.g., 0,5-10,20-100<hostset>
is a pattern describing host ranges e.g., storagehost[0,5-10],10.8.1.[20-100]
Start¶
To start the system after a controlled shutdown, run the command:
$ dmg system start
<rankset>
is a pattern describing rank ranges e.g., 0,5-10,20-100<hostset>
is a pattern describing host ranges e.g., storagehost[0,5-10],10.8.1.[20-100]
The output table will indicate action and result.
DAOS I/O Engines will be started.
As for shutdown, a list of engines to restart can be specified on the command line:
$ dmg system start [--ranks <rankset>|--host-ranks <hostset>]
<rankset>
is a pattern describing rank ranges e.g., 0,5-10,20-100<hostset>
is a pattern describing host ranges e.g., storagehost[0,5-10],10.8.1.[20-100]
If the ranks were excluded from pools (e.g., unclean shutdown), they will need to be reintegrated. Please see the pool operation section for more information.
Storage Reformat¶
To reformat the system after a controlled shutdown, run the command:
$ dmg storage format --force
--force
flag indicates that a (re)format operation should be performed disregarding existing filesystems- if no record of previously running ranks can be found, reformat is
performed on the hosts that are specified in the
daos_control.yml
config file'shostlist
parameter. - if system membership has records of previously running ranks, storage allocated to those ranks will be formatted
The output table will indicate action and result.
DAOS I/O Engines will be started, and all DAOS pools will have been removed.
Note
While it should not be required during normal operations, one may still want to restart the DAOS installation from scratch without using the DAOS control plane.
First, ensure all daos_server
processes on all hosts have been
stopped, then for each SCM mount specified in the config file
(scm_mount
in the servers
section) umount and wipe FS signatures.
bash
$ umount /mnt/daos0
$ umount /mnt/daos1
$ wipefs -a /dev/pmem0
$ wipefs -a /dev/pmem0
Then restart DAOS Servers and format.
System Erase¶
To erase the DAOS sorage configuration, the dmg system erase
command can be used. Before doing this, the affected engines need to be
stopped by running dmg system stop
(if necessary with the --force
flag).
The erase operation will destroy any pools that may still exist, and will
unconfigure the storage. It will not stop the daos_server process, so
the dmg
command can still be used. For example, the system can be
formatted again by running dmg storage format
.
Note
Note that dmg system erase
does not currently reset the SCM.
The /dev/pmemX
devices will remain mounted,
and the PMem configuration will not be reset to Memory Mode.
To completely unconfigure the SCM, it is advisable to run
daos_server storage prepare --scm-only --reset
which will
completely reset the PMem. A reboot will be required to finalize
the change of the PMem allocation goals.
System Extension¶
To add a new server to an existing DAOS system, one should install:
- the relevant certificates
- the server yaml file pointing to the access points of the running DAOS system
The daos_control.yml file should also be updated to include the new DAOS server.
Then starts the daos_server via systemd and format the new server via dmg as follows:
$ dmg storage format -l ${new_storage_node}
new_storage_node should be replaced with the hostname or the IP address of the new storage node (comma separated list or range of hosts for multiple nodes) to be added.
Upon completion of the format operation, the new storage nodes will join
the system (this can be checked with dmg system query -v
).
Note
New pools created after the extension will automatically use the newly added nodes (if membership is not restricted on the dmg command line). That being said, existing pools won't be automatically extended to use the new servers. Please see the pool operation section for how to extend the pool membership.
Software Update¶
The DAOS v2.0 wire protocol and persistent layout is not compatible with previous DAOS versions and would require a reformat and all client and server nodes to be updated to a 2.x version.
Warning
Attempts to start DAOS v2.0 over a system formatted with a previous DAOS version will trigger a RAS event and cause all the engines to abort. Similarly, a 2.0 DAOS client or engine will refuse to communicate with a peer that runs an incompatible version.
DAOS v2.0 will maintain interoperability for both the wire protocol and persistent layout with any future v2.x versions. That being said, it is required that all engines in the same system run the same DAOS version.
Warning
Rolling update is not supported at this time.