DAOS Hadoop Filesystem¶
Here, we describe the quick steps required to use the DAOS Hadoop filesystem to access DAOS from Hadoop and Spark. The prerequisites are:
- Linux OS
- Java 8
- Hadoop 2.7 or later
- Spark 3.1 or later
- DAOS Readiness
We assume that the DAOS servers and agents have already been deployed in the environment. Otherwise, they can be deployed by following the DAOS Installation Guide.
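As a quick readiness check, you can query the system from an admin node; a sketch, assuming dmg is installed and configured for your setup:
$ dmg system query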
There are two artifacts, daos-java and hadoop-daos, to download from Maven. You can download them with the below commands if you have Maven installed.
mvn dependency:get -DgroupId=io.daos -DartifactId=daos-java -Dversion=<version> -Dclassifier=protobuf3-netty4-shaded -Ddest=./
mvn dependency:get -DgroupId=io.daos -DartifactId=hadoop-daos -Dversion=<version> -Dclassifier=protobuf3-netty4-shaded -Ddest=./
Or search for these artifacts on Maven Central (https://search.maven.org) and download them manually. Just make sure the classifier "protobuf3-netty4-shaded" is selected.
You can also build the artifacts yourself; see Building DAOS Hadoop Filesystem for details.
daos-java-<version>-protobuf3-netty4-shaded.jar and hadoop-daos-<version>-protobuf3-netty4-shaded.jar need to be deployed on every compute node that runs Spark or Hadoop. Place them in a directory that is accessible to all the nodes, $SPARK_HOME/jars for Spark and $HADOOP_HOME/share/hadoop/common/lib for Hadoop, or copy them to every node.
Extract core-site-daos-ref.xml (version >= 2.2.1; earlier versions ship a similar reference file) from hadoop-daos-<version>-protobuf3-netty4-shaded.jar. Then merge it with your Hadoop core-site.xml under $HADOOP_HOME/etc/hadoop. If a Hadoop installation is not present, you can rename the file to core-site.xml and put it in $SPARK_HOME/conf or in the directory to which the HADOOP_CONF_DIR env variable points.
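For example, one way to pull the reference file out of the jar, assuming it sits at the jar root (use jar tf to confirm the entry path):
$ jar xf hadoop-daos-<version>-protobuf3-netty4-shaded.jar core-site-daos-ref.xml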
Export all DAOS-specific env variables in your application, e.g., in spark-env.sh for Spark and hadoop-env.sh for Hadoop. Or you can simply put the env variables in your shell profile, such as ~/.bashrc.
Besides, if your DAOS is not installed from a Linux package, like an RPM, you should make LD_LIBRARY_PATH include the DAOS library path so that Java can link to the DAOS libs, like below.
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<DAOS_INSTALL>/lib64:<DAOS_INSTALL>/lib
In core-site-daos-ref.xml, the DAOS URI defaults to the simplest form, "daos://Pool1/Cont1". For other forms of URIs, please check DAOS More URIs.
If the DAOS pool and container have not been created, you can use the following commands to create them with labels.
$ export DAOS_POOL="mypool"
$ export DAOS_CONT="mycont"
$ dmg pool create --scm-size=<scm size> --nvme-size=<nvme size> --label $DAOS_POOL
$ daos cont create --label $DAOS_CONT --type POSIX $DAOS_POOL
After that, replace the pool label and container label in the DAOS URI in core-site.xml with your labels from above.
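Optionally, verify the pool from an admin node; a quick check, assuming dmg is configured for your system:
$ dmg pool list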
Validating Hadoop Access¶
If everything goes well, you should see the /user directory listed after issuing the below command.
$ hadoop fs -ls /
You can also play around with other Hadoop commands, like -copyFromLocal and -copyToLocal, as sketched below. You can also start YARN and run some MapReduce jobs on it; see Run Map-Reduce in Hadoop.
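For instance, a minimal round trip through DAOS (the file names here are just examples):
$ hadoop fs -copyFromLocal ./people.json /people.json
$ hadoop fs -ls /
$ hadoop fs -copyToLocal /people.json ./people-copy.json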
To access the DAOS Hadoop filesystem in Spark, add the jar files to the classpath of the Spark executor and driver. This can be configured in spark-defaults.conf:
spark.executor.extraClassPath /path/to/daos-java-<version>.jar:/path/to/hadoop-daos-<version>.jar
spark.driver.extraClassPath   /path/to/daos-java-<version>.jar:/path/to/hadoop-daos-<version>.jar
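Alternatively, for ad-hoc runs, you can pass the jars on the command line instead; a sketch, with placeholder paths (--jars adds the jars to both driver and executor classpaths):
$ spark-submit --jars /path/to/daos-java-<version>.jar,/path/to/hadoop-daos-<version>.jar <your app>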
Validating Spark Access¶
All Spark APIs that work with the Hadoop filesystem will work with DAOS. We can use the daos://Pool1/Cont1/ URI to access files stored in DAOS. For example, to read the people.json file from the root directory of the DAOS filesystem (staged, e.g., with hadoop fs -copyFromLocal as above), we can use the following PySpark code:
df = spark.read.json("daos:///people.json")
Building DAOS Hadoop Filesystem¶
Below are the steps to build the jar files for DAOS Java and the DAOS Hadoop filesystem. Spark and Hadoop require these jars on their classpath. You can skip this section if you already have the pre-built jars.
$ git clone https://github.com/daos-stack/daos.git
$ cd daos
$ git checkout <desired branch or commit>
## assume DAOS is built and installed to <daos_install> directory
$ cd src/client/java
## the with-proto3-netty4-deps profile builds jars with protobuf 3 and netty-buffer 4 shaded
## It spares you potential third-party jar compatibility issues.
$ mvn -Pdistribute,with-proto3-netty4-deps clean package -DskipTests -Dgpg.skip -Ddaos.install.path=<daos_install>
After the build, the package daos-java-<version>-assemble.tgz will be available in the build output.
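The jars can then be unpacked from the tarball with standard tar:
$ tar xzf daos-java-<version>-assemble.tgz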
DAOS More URIs¶
DAOS FileSystem binds to the scheme "daos". DAOS URIs are in the format "daos://[authority]//[path]". Both authority and path are optional. There are three types of DAOS URIs, the DAOS UNS path, the DAOS non-UNS path, and the special UUID path, depending on where you want the DAOS FileSystem to get initialized and configured.
DAOS UNS Path¶
The simple form of the URI is "daos:///<your uns path>[/sub path]". "<your uns path>" is an OS file path created with the daos command or the Java DAOS UNS method, DaosUns.create(). The "[sub path]" is optional. You can create the UNS path with the below command.
$ daos cont create --label $DAOS_CONT --path <your_path> --type POSIX $DAOS_POOL
$ java -Dpath="your_path" -Dpool_id=$DAOS_POOL -cp ./daos-java-<version>-shaded.jar io.daos.dfs.DaosUns create
After creation, you can use the below command to see which DAOS properties are set on the path.
$ getfattr -d -m - <your path>
DAOS Non-UNS Path¶
For non-UNS URIs, the pool and container are taken from the Hadoop configuration instead of from UNS path attributes; see Tune More Configurations for the relevant config items.
Special UUID Path¶
DAOS supports a specialized URI with the pool and container UUIDs embedded. The format is "daos://<pool UUID>/<container UUID>". As you can see, we don't need to obtain the UUIDs from either a UNS path or configuration, as with the two URI types above.
You may want to connect to two DAOS servers, or to two DFS instances mounted to different containers in one DAOS server, from the same JVM. In that case, you need to add an authority to your URI to make it unique, since Hadoop caches filesystem instances keyed by "scheme + authority" globally in the JVM. This applies to both URI types described above.
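For illustration, hypothetical authorities "ns1" and "ns2" (the names and UNS paths below are made up) keep two instances distinct within one JVM:
$ hadoop fs -ls daos://ns1//mnt/uns_path1/
$ hadoop fs -ls daos://ns2//mnt/uns_path2/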
Run Map-Reduce in Hadoop¶
Edit $HADOOP_HOME/etc/hadoop/core-site.xml to change fs.defaultFS to daos://Pool1/Cont1/. It is not recommended to set fs.defaultFS to a DAOS UNS path; you may get an error complaining that the pool/container UUIDs cannot be found. This is because Hadoop considers DAOS the default filesystem once you configure a DAOS UNS URI. YARN has some working directories defaulting to local paths without a scheme, like "/tmp/yarn", which is then constructed as "daos:///tmp/yarn". With this URI, Hadoop cannot connect to DAOS since no pool/container UUIDs can be found if daos-site.xml is not provided either.
Then append the below configuration to this file and copy it to the other nodes.
<property>
  <name>fs.AbstractFileSystem.daos.impl</name>
  <value>io.daos.fs.hadoop.DaosAbsFsImpl</value>
</property>
DAOS has no data locality since it is remote storage. You need to add the below configuration to the scheduler configuration file, like capacity-scheduler.xml in YARN.
<property>
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <value>-1</value>
</property>
Remember to copy the modified capacity-scheduler.xml to the other nodes.
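For example, with scp (the hostname is a placeholder):
$ scp $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml <other node>:$HADOOP_HOME/etc/hadoop/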
Tune More Configurations¶
If your DAOS URI is the non-UNS form, you can follow the description of each config item to set your own values in a loadable daos-site.xml.
If your DAOS URI is the UNS path, your configurations in daos-site.xml, except those set at DAOS UNS creation, can still be effective. To make the configuration source consistent, an alternative to the configuration file is to set all configurations on the UNS path itself. You can put the configs on the same UNS path with the below command.
# install the attr package if you get a "command not found" error
$ setfattr -n user.daos.hadoop -v "fs.daos.server.group=daos_server" <your path>
$ java -Dpath="your path" -Dattr=user.daos.hadoop -Dvalue="fs.daos.server.group=daos_server" -cp ./daos-java-<version>-shaded.jar io.daos.dfs.DaosUns setappinfo
For the "value" property, you need to follow pattern, key1=value1:key2=value2.. .. And key* should be from daos-config.txt. If value* contains characters of '=' or ':', you need to escape the value with below command.
$ java -Dop=escape-app-value -Dinput="daos_server:1=2" -cp ./daos-java-<version>-shaded.jar io.daos.dfs.DaosUns util
You'll get the escaped value "daos_server\u003a1\u003d2" for "daos_server:1=2".
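The escaped string is what you then store on the UNS path, e.g., reusing the sample value above (shown for illustration only):
$ setfattr -n user.daos.hadoop -v "fs.daos.server.group=daos_server\u003a1\u003d2" <your path>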
If you configure the same property in both daos-site.xml and the UNS path, the value in daos-site.xml takes priority. If the user sets the Hadoop configuration before initializing the Hadoop DAOS FileSystem, the user's configuration takes precedence.
Libfabric Signal Handling Issue¶
For some libfabric providers, like the (unsupported) PSM2 provider, signal chaining should be enabled to better interoperate with the JVM, since DAOS and its dependencies may install their own signal handlers. Signal chaining ensures that signal calls are intercepted so that they do not actually replace the JVM's signal handlers if the handlers conflict with those already installed by the JVM. Instead, these calls save the new signal handlers, or "chain" them behind the JVM-installed handlers. Later, when any of these signals are raised and found not to be targeted at the JVM, DAOS's handlers are invoked. Enable signal chaining by preloading the JVM's libjsig library:
$ export LD_PRELOAD=<YOUR JDK HOME>/jre/lib/amd64/libjsig.so