diff --git a/io500/sc25/tta/mantastorage/README.md b/io500/sc25/tta/mantastorage/README.md
new file mode 100644
index 0000000..df94ed1
--- /dev/null
+++ b/io500/sc25/tta/mantastorage/README.md
@@ -0,0 +1,146 @@
+# IO500-SC25 - TTA - MantaStorage
+
+Reproducibility information for benchmarks submitted to the IO500-SC25
+[Production](https://io500.org/list/sc25/production) list and
+[Research](https://io500.org/list/sc25/io500) list.
+
+General information on building and running the IO500 benchmarks with DAOS can
+be found in the [DAOS Community Wiki](https://daosio.atlassian.net/wiki/spaces/DC/pages/11167301633/IO-500+SC22)
+page on IO500.
+
+## IO500.org Reproducibility Questionnaire
+
+Answers to the _IO500.org Reproducibility Questionnaire_ are provided in
+the
+[io500-reproducibility.tta-mantastorage.md](io500-reproducibility.tta-mantastorage.md)
+document. Note that this document covers two different _user-selectable_ data protection
+schemes to address the different requirements of the _Production_ and _Research_ lists.
+
+## Institution
+
+[TTA](https://tta.or.kr)
+
+The TTA operates this supercomputer as part of the Republic of Korea's
+national R&D project, "Development of High-Efficiency Parallel Storage
+Software Technology Optimized for AI Computational Accelerators."
+
+## Storage System
+
+The DAOS storage cluster in TTA's MantaStorage system consists of 4x
+Supermicro SYS-222C-TN servers, currently running DAOS version 2.7 (under dev)
+on Rocky Linux 9.6. The system was deployed and is commercially supported by
+[Gluesys](https://www.gluesys.com):
+
+- Server 1 (MantaStorage-01)
+  - CPU: 2x Intel(R) Xeon(R) 6530P
+  - RAM: 8x 64GiB DDR5 Samsung 4800MT/s
+  - NVMe
+    - 10x Samsung MZWLR3T8HCLS-00A07 3.84TB
+    - 2x Dapustor DPRD3108T0T507T6000 5.63TB
+  - NIC
+    - 2x Mellanox ConnectX-6 1-port InfiniBand HDR adapter
+- Server 2 (MantaStorage-02)
+  - CPU: 2x Intel(R) Xeon(R) 6530P
+  - RAM: 8x 64GiB DDR5 Samsung 4800MT/s
+  - NVMe
+    - 8x Seagate XP7680SE70006 7.68TB
+    - 2x Dapustor DPRD3108T0T507T6000 5.63TB
+    - 2x Samsung MZWLR3T8HCLS-00A07 3.84TB
+  - NIC
+    - 2x Mellanox ConnectX-6 1-port InfiniBand HDR adapter
+- Server 3 (MantaStorage-03)
+  - CPU: 2x Intel(R) Xeon(R) 6520P
+  - RAM: 8x 64GiB DDR5 Samsung 4800MT/s
+  - NVMe
+    - 8x Samsung MZQL23T8HCLS-00A07 3.84TB
+    - 2x Seagate XP7680SE70006 7.68TB
+    - 2x Dapustor DPRD3108T0T507T6000 5.63TB
+  - NIC
+    - 2x Mellanox ConnectX-6 1-port InfiniBand HDR adapter
+- Server 4 (MantaStorage-04)
+  - CPU: 2x Intel(R) Xeon(R) 6520P
+  - RAM: 8x 64GiB DDR5 Samsung 4800MT/s
+  - NVMe
+    - 8x Samsung MZQL23T8HCLS-00A07 3.84TB
+    - 2x Seagate XP7680SE70006 7.68TB
+    - 2x Dapustor DPRD3108T0T507T6000 5.63TB
+  - NIC
+    - 2x Mellanox ConnectX-6 1-port InfiniBand HDR adapter
+
+DAOS storage software references:
+
+* DAOS [github repository](https://github.com/daos-stack/daos)
+* DAOS [packages repository](https://packages.daos.io)
+* DAOS [documentation](https://docs.daos.io/)
+* [SC-Asia 2020 paper](https://doi.org/10.1007/978-3-030-48842-0_3)
+  _DAOS: A Scale-Out High Performance Storage Stack for Storage Class Memory_
+* [SC-Asia 2023 paper](https://doi.org/10.1145/3581576.3581577)
+  _Understanding DAOS Storage Performance Scalability_
+
+## Client Nodes
+
+The clients in TTA's MantaStorage system are ASUSTeK ESC8000-E11 servers,
+currently running DAOS version 2.7 (under dev) on Rocky Linux 9.6:
+
+- CPU: 2x Intel(R) Xeon(R) Platinum 8558
+- RAM: 16x 64GiB DDR5 Samsung 4800MT/s
+- NIC: 2x Mellanox ConnectX-5 1-port InfiniBand EDR adapter
+
+## High-Performance Fabric
+
+The HPC Fabric is a fully non-blocking HDR InfiniBand network, using a
+[Mellanox QM8790](https://docs.nvidia.com/networking/display/qm87xx/introduction) switch.
+
+Both servers and clients use two single-port Mellanox ConnectX-6 HDR adapters
+(one per CPU socket). On the servers, each port is managed by a dedicated
+`daos_engine` running on that CPU socket. On the clients, each MPI task
+communicates through the IB interface on the same NUMA node.
+
+## Execution Environment
+
+All servers and clients were installed with the following software stack:
+
+- Rocky Linux 9.6 (Kernel version 5.14.0-570.52.1.el9_6.x86_64)
+- [MLNX_OFED_LINUX-24.10-3.2.5.0](https://docs.nvidia.com/networking/display/mlnxofedv24103250lts/release+notes) on the DAOS servers and clients.
+- [DAOS 2.7 (under development)](https://github.com/daos-stack/daos/tree/808afd521bb41f3b0e08b43b2b5bae521ed00bd2)
+- [MPICH 4.2.3](https://www.mpich.org/downloads/)
+
+The following DAOS server and client configuration files were used.
+
+### DAOS Server configuration
+
+- [/etc/daos/daos_server.yml](servers/MantaStorage-01/etc/daos/daos_server.yml)
+- [/etc/daos/daos_control.yml](servers/MantaStorage-01/etc/daos/daos_control.yml)
+- [/etc/daos/daos_agent.yml](servers/MantaStorage-01/etc/daos/daos_agent.yml)
+- [/etc/sysctl.d/99-daos-net.conf](servers/MantaStorage-01/etc/sysctl.d/99-daos-net.conf)
+
+For the IO500 benchmarks, one storage pool was created that spans all 4
+servers, using the [create-pool.sh](servers/create-pool.sh) script. In that pool, a DAOS POSIX
+container was created using the [create-cont.sh](servers/create-cont.sh) script.
+
+### DAOS Client environment
+
+Our IO500-SC25 benchmark runs were performed in the deployment stage, before
+user operation started. For this reason, the runs were performed with
+interactive `mpirun` invocations, using a hostlist to specify client nodes as
+described above.
+
+The IO500 run scripts are included in the IO500 results tarballs.
+
+The rules for the IO500 Production list require that the storage system has
+no single point of failure, so the
+[config-all-dfs-rf1-tmpl.ini](config-all-dfs-rf1-tmpl.ini) configuration file
+was used for Production runs. It protects against single faults by using
+2-way replication for metadata and IOR-Hard, and 2+1P Erasure Coding for IOR-Easy.
+
+Submissions to the IO500 Research list use an identical storage system
+setup, but since the "no single point of failure" requirement does not apply
+to the Research list, the
+[config-all-dfs-rf0-tmpl.ini](config-all-dfs-rf0-tmpl.ini) configuration file
+was used for Research runs. To maximize achievable performance, this
+configuration uses neither replication nor Erasure Coding.
+
+## IO500 List Entries
+
+- MantaStorage: SC25 Research List #85, submission 770
+- MantaStorage-EC: SC25 Production List #113, submission 776
diff --git a/io500/sc25/tta/mantastorage/clients/Client-2/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/clients/Client-2/etc/daos/daos_agent.yml
new file mode 100644
index 0000000..33e67b0
--- /dev/null
+++ b/io500/sc25/tta/mantastorage/clients/Client-2/etc/daos/daos_agent.yml
@@ -0,0 +1,183 @@
+# DAOS agent configuration file.
+#
+# Location of this configuration file is determined by first checking for the
+# path specified through the -o option of the daos_agent command line.
+# Otherwise, /etc/daos/daos_agent.yml is used.
+#
+# Section describing the daos_agent configuration
+#
+# Although not supported for now, one might want to connect to multiple
+# DAOS installations from the same node in the future.
+#
+# Specify the associated DAOS systems.
+# Name must match name specified in the daos_server.yml file on the server.
+# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. +# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. +# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. +# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. +# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. +#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. 
+# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley +# +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. +# cache_expiration: 1m +# +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: +# # In order to disable transport security, uncomment and set allow_insecure +# # to true. Not recommended for production configurations. + allow_insecure: false + +# # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt +# # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt +# # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +runtime_dir: /var/run/daos_agent + +## Full path and name of the DAOS agent logfile. +## default: print to stderr +log_file: /var/log/daos/daos_agent.log + +## Force specific debug mask for daos_agent (control plane). +## Mask specifies minimum level of message significance to pass to logger. 
+## Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. +# +## default: INFO +control_log_mask: INFO + +## Disable automatic eviction of open pool handles on agent shutdown. By default, +## the agent will evict all open pool handles for local processes on shutdown. +## Note that this implies that stopping or restarting the agent will result +## in interruption of DAOS I/O for any local DAOS client processes that have +## an open pool handle. +## default: false +disable_auto_evict: true + +## If enabled, the agent will evict any open pool handles associated with this machine on agent +## startup. This allows the servers to reclaim resources that may not have been properly cleaned +## up in the event of an agent or machine crash. +## default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). +# +#exclude_fabric_ifaces: ["lo", "eno1", "eno2", "ens4f0", "ens4f1", "usb0"] + +## Conversely, only consider a specific set of fabric interfaces when selecting +## an interface for client applications. (Mutually exclusive with exclude). +# +include_fabric_ifaces: ["ibs255", "ibp200s0"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. 
+# +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs255 + domain: mlx5_0 +- + numa_node: 3 + devices: + - + iface: ibp200s0 + domain: mlx5_1 diff --git a/io500/sc25/tta/mantastorage/clients/Client-2/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/clients/Client-2/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..9f79b3e --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-2/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,39 @@ +############################################################################## +# Multi-rail +############################################################################## +net.ipv4.conf.ibp200s0.accept_local = 1 +net.ipv4.conf.ibp200s0.rp_filter = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 2 +net.ipv4.conf.ibp200s0.arp_announce = 0 +net.ipv4.conf.ibp200s0.arp_filter = 0 + +net.ipv4.conf.ibs255.accept_local = 1 +net.ipv4.conf.ibs255.rp_filter = 2 +net.ipv4.conf.ibs255.arp_ignore = 2 +net.ipv4.conf.ibs255.arp_announce = 0 +net.ipv4.conf.ibs255.arp_filter = 0 + +net.core.netdev_max_backlog = 250000 +#net.core.rmem_max = 16777216 +net.core.rmem_max = 134217728 +#net.core.wmem_max = 16777216 +net.core.wmem_max = 134217728 +net.core.rmem_default = 16777216 +net.core.wmem_default = 16777216 +net.core.optmem_max = 16777216 + +net.ipv4.neigh.default.gc_thresh1 = 1024 +net.ipv4.neigh.default.gc_thresh2 = 4096 +net.ipv4.neigh.default.gc_thresh3 = 16384 + +net.ipv4.tcp_timestamps = 1 +net.ipv4.tcp_sack = 1 +net.ipv4.tcp_mem = 16777216 16777216 16777216 +net.ipv4.tcp_rmem = 4096 16777216 134217728 +net.ipv4.tcp_wmem = 4096 16777216 134217728 +net.ipv4.tcp_low_latency = 1 +net.ipv4.tcp_mtu_probing = 1 + +net.ipv4.tcp_congestion_control=bbr +net.core.default_qdisc=fq +net.ipv4.tcp_slow_start_after_idle=0 diff --git a/io500/sc25/tta/mantastorage/clients/Client-2/etc/sysctl.d/99-daos-vm.conf b/io500/sc25/tta/mantastorage/clients/Client-2/etc/sysctl.d/99-daos-vm.conf new file mode 100644 index 0000000..40356f3 --- 
/dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-2/etc/sysctl.d/99-daos-vm.conf @@ -0,0 +1 @@ +vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/clients/Client-3/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/clients/Client-3/etc/daos/daos_agent.yml new file mode 100644 index 0000000..b0c2f2f --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-3/etc/daos/daos_agent.yml @@ -0,0 +1,183 @@ +# DAOS agent configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the daos_agent command line. +# Otherwise, /etc/daos/daos_agent.yml is used. +# +# Section describing the daos_agent configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. + +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. +# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. +# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. +# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. 
+# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. +#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. +# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley +# +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. +# cache_expiration: 1m + +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: + # In order to disable transport security, uncomment and set allow_insecure + # to true. Not recommended for production configurations. 
+ allow_insecure: false + + # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt + # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt + # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +runtime_dir: /var/run/daos_agent + +# Full path and name of the DAOS agent logfile. +# default: print to stderr +log_file: /var/log/daos/daos_agent.log + +# Force specific debug mask for daos_agent (control plane). +# Mask specifies minimum level of message significance to pass to logger. +# Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. + +# default: INFO +control_log_mask: INFO + +# Disable automatic eviction of open pool handles on agent shutdown. By default, +# the agent will evict all open pool handles for local processes on shutdown. +# Note that this implies that stopping or restarting the agent will result +# in interruption of DAOS I/O for any local DAOS client processes that have +# an open pool handle. +# default: false +disable_auto_evict: true + +# If enabled, the agent will evict any open pool handles associated with this machine on agent +# startup. This allows the servers to reclaim resources that may not have been properly cleaned +# up in the event of an agent or machine crash. +# default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. 
It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). +# +#exclude_fabric_ifaces: ["lo", "eth1"] + +# Conversely, only consider a specific set of fabric interfaces when selecting +# an interface for client applications. (Mutually exclusive with exclude). + +include_fabric_ifaces: ["ibs255", "ibp200s0"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. + +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs255 + domain: mlx5_0 +- + numa_node: 3 + devices: + - + iface: ibp200s0 + domain: mlx5_1 diff --git a/io500/sc25/tta/mantastorage/clients/Client-3/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/clients/Client-3/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..029eff6 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-3/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,39 @@ +############################################################################## +# Multi-rail +############################################################################## +net.ipv4.conf.ibp200s0.accept_local = 1 +net.ipv4.conf.ibp200s0.rp_filter = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 0 +net.ipv4.conf.ibp200s0.arp_filter = 0 + +net.ipv4.conf.ibs255.accept_local = 1 +net.ipv4.conf.ibs255.rp_filter = 2 +net.ipv4.conf.ibs255.arp_ignore = 2 +net.ipv4.conf.ibs255.arp_ignore = 0 +net.ipv4.conf.ibs255.arp_filter = 0 + +net.core.netdev_max_backlog = 250000 +#net.core.rmem_max = 16777216 +net.core.rmem_max = 134217728 +#net.core.wmem_max = 16777216 +net.core.wmem_max = 134217728 +net.core.rmem_default = 16777216 +net.core.wmem_default = 
16777216 +net.core.optmem_max = 16777216 + +net.ipv4.neigh.default.gc_thresh1 = 1024 +net.ipv4.neigh.default.gc_thresh2 = 4096 +net.ipv4.neigh.default.gc_thresh3 = 16384 + +net.ipv4.tcp_timestamps = 1 +net.ipv4.tcp_sack = 1 +net.ipv4.tcp_mem = 16777216 16777216 16777216 +net.ipv4.tcp_rmem = 4096 16777216 134217728 +net.ipv4.tcp_wmem = 4096 16777216 134217728 +net.ipv4.tcp_low_latency = 1 +net.ipv4.tcp_mtu_probing = 1 + +net.ipv4.tcp_congestion_control=bbr +net.core.default_qdisc=fq +net.ipv4.tcp_slow_start_after_idle=0 diff --git a/io500/sc25/tta/mantastorage/clients/Client-3/etc/sysctl.d/99-daos-vm.conf b/io500/sc25/tta/mantastorage/clients/Client-3/etc/sysctl.d/99-daos-vm.conf new file mode 100644 index 0000000..40356f3 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-3/etc/sysctl.d/99-daos-vm.conf @@ -0,0 +1 @@ +vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/clients/Client-4/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/clients/Client-4/etc/daos/daos_agent.yml new file mode 100644 index 0000000..b0c2f2f --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-4/etc/daos/daos_agent.yml @@ -0,0 +1,183 @@ +# DAOS agent configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the daos_agent command line. +# Otherwise, /etc/daos/daos_agent.yml is used. +# +# Section describing the daos_agent configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. + +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. 
+# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. +# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. +# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. +# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. +#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. +# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley +# +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. 
If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. +# cache_expiration: 1m + +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: + # In order to disable transport security, uncomment and set allow_insecure + # to true. Not recommended for production configurations. + allow_insecure: false + + # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt + # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt + # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +runtime_dir: /var/run/daos_agent + +# Full path and name of the DAOS agent logfile. +# default: print to stderr +log_file: /var/log/daos/daos_agent.log + +# Force specific debug mask for daos_agent (control plane). +# Mask specifies minimum level of message significance to pass to logger. +# Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. + +# default: INFO +control_log_mask: INFO + +# Disable automatic eviction of open pool handles on agent shutdown. By default, +# the agent will evict all open pool handles for local processes on shutdown. 
+# Note that this implies that stopping or restarting the agent will result +# in interruption of DAOS I/O for any local DAOS client processes that have +# an open pool handle. +# default: false +disable_auto_evict: true + +# If enabled, the agent will evict any open pool handles associated with this machine on agent +# startup. This allows the servers to reclaim resources that may not have been properly cleaned +# up in the event of an agent or machine crash. +# default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). +# +#exclude_fabric_ifaces: ["lo", "eth1"] + +# Conversely, only consider a specific set of fabric interfaces when selecting +# an interface for client applications. (Mutually exclusive with exclude). + +include_fabric_ifaces: ["ibs255", "ibp200s0"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. 
+ +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs255 + domain: mlx5_0 +- + numa_node: 3 + devices: + - + iface: ibp200s0 + domain: mlx5_1 diff --git a/io500/sc25/tta/mantastorage/clients/Client-4/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/clients/Client-4/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..029eff6 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-4/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,39 @@ +############################################################################## +# Multi-rail +############################################################################## +net.ipv4.conf.ibp200s0.accept_local = 1 +net.ipv4.conf.ibp200s0.rp_filter = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 0 +net.ipv4.conf.ibp200s0.arp_filter = 0 + +net.ipv4.conf.ibs255.accept_local = 1 +net.ipv4.conf.ibs255.rp_filter = 2 +net.ipv4.conf.ibs255.arp_ignore = 2 +net.ipv4.conf.ibs255.arp_ignore = 0 +net.ipv4.conf.ibs255.arp_filter = 0 + +net.core.netdev_max_backlog = 250000 +#net.core.rmem_max = 16777216 +net.core.rmem_max = 134217728 +#net.core.wmem_max = 16777216 +net.core.wmem_max = 134217728 +net.core.rmem_default = 16777216 +net.core.wmem_default = 16777216 +net.core.optmem_max = 16777216 + +net.ipv4.neigh.default.gc_thresh1 = 1024 +net.ipv4.neigh.default.gc_thresh2 = 4096 +net.ipv4.neigh.default.gc_thresh3 = 16384 + +net.ipv4.tcp_timestamps = 1 +net.ipv4.tcp_sack = 1 +net.ipv4.tcp_mem = 16777216 16777216 16777216 +net.ipv4.tcp_rmem = 4096 16777216 134217728 +net.ipv4.tcp_wmem = 4096 16777216 134217728 +net.ipv4.tcp_low_latency = 1 +net.ipv4.tcp_mtu_probing = 1 + +net.ipv4.tcp_congestion_control=bbr +net.core.default_qdisc=fq +net.ipv4.tcp_slow_start_after_idle=0 diff --git a/io500/sc25/tta/mantastorage/clients/Client-4/etc/sysctl.d/99-daos-vm.conf b/io500/sc25/tta/mantastorage/clients/Client-4/etc/sysctl.d/99-daos-vm.conf new file mode 100644 index 0000000..40356f3 --- /dev/null 
+++ b/io500/sc25/tta/mantastorage/clients/Client-4/etc/sysctl.d/99-daos-vm.conf @@ -0,0 +1 @@ +vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/clients/Client-5/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/clients/Client-5/etc/daos/daos_agent.yml new file mode 100644 index 0000000..065fc36 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-5/etc/daos/daos_agent.yml @@ -0,0 +1,183 @@ +# DAOS agent configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the daos_agent command line. +# Otherwise, /etc/daos/daos_agent.yml is used. +# +# Section describing the daos_agent configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. + +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. +# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. +# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. +# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. 
+# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. +#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. +# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley +# +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. +# cache_expiration: 1m + +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: + # In order to disable transport security, uncomment and set allow_insecure + # to true. Not recommended for production configurations. 
+ allow_insecure: false + + # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt + # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt + # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +runtime_dir: /var/run/daos_agent + +# Full path and name of the DAOS agent logfile. +# default: print to stderr +log_file: /var/log/daos/daos_agent.log + +# Force specific debug mask for daos_agent (control plane). +# Mask specifies minimum level of message significance to pass to logger. +# Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. + +# default: INFO +control_log_mask: INFO + +# Disable automatic eviction of open pool handles on agent shutdown. By default, +# the agent will evict all open pool handles for local processes on shutdown. +# Note that this implies that stopping or restarting the agent will result +# in interruption of DAOS I/O for any local DAOS client processes that have +# an open pool handle. +# default: false +disable_auto_evict: true + +# If enabled, the agent will evict any open pool handles associated with this machine on agent +# startup. This allows the servers to reclaim resources that may not have been properly cleaned +# up in the event of an agent or machine crash. +# default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. 
It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). +# +#exclude_fabric_ifaces: ["lo", "eth1"] + +# Conversely, only consider a specific set of fabric interfaces when selecting +# an interface for client applications. (Mutually exclusive with exclude). + +include_fabric_ifaces: ["ibs255f0", "ibp200s0f0"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. + +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs255f0 + domain: mlx5_0 +- + numa_node: 3 + devices: + - + iface: ibp200s0f0 + domain: mlx5_2 diff --git a/io500/sc25/tta/mantastorage/clients/Client-5/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/clients/Client-5/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..cee17f3 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-5/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,51 @@ +############################################################################## +# Multi-rail +############################################################################## +net.ipv4.conf.ibp200s0f0.accept_local = 1 +net.ipv4.conf.ibp200s0f0.rp_filter = 2 +net.ipv4.conf.ibp200s0f0.arp_ignore = 2 +net.ipv4.conf.ibp200s0f0.arp_ignore = 0 +net.ipv4.conf.ibp200s0f0.arp_filter = 0 + +net.ipv4.conf.ibp200s0f1.accept_local = 1 +net.ipv4.conf.ibp200s0f1.rp_filter = 2 +net.ipv4.conf.ibp200s0f1.arp_ignore = 2 +net.ipv4.conf.ibp200s0f1.arp_ignore = 0 +net.ipv4.conf.ibp200s0f1.arp_filter = 0 + +net.ipv4.conf.ibs255f0.accept_local = 1 +net.ipv4.conf.ibs255f0.rp_filter = 2 +net.ipv4.conf.ibs255f0.arp_ignore = 2 +net.ipv4.conf.ibs255f0.arp_ignore = 0 
+net.ipv4.conf.ibs255f0.arp_filter = 0 + +net.ipv4.conf.ibs255f1.accept_local = 1 +net.ipv4.conf.ibs255f1.rp_filter = 2 +net.ipv4.conf.ibs255f1.arp_ignore = 2 +net.ipv4.conf.ibs255f1.arp_ignore = 0 +net.ipv4.conf.ibs255f1.arp_filter = 0 + +net.core.netdev_max_backlog = 250000 +#net.core.rmem_max = 16777216 +net.core.rmem_max = 134217728 +#net.core.wmem_max = 16777216 +net.core.wmem_max = 134217728 +net.core.rmem_default = 16777216 +net.core.wmem_default = 16777216 +net.core.optmem_max = 16777216 + +net.ipv4.neigh.default.gc_thresh1 = 1024 +net.ipv4.neigh.default.gc_thresh2 = 4096 +net.ipv4.neigh.default.gc_thresh3 = 16384 + +net.ipv4.tcp_timestamps = 1 +net.ipv4.tcp_sack = 1 +net.ipv4.tcp_mem = 16777216 16777216 16777216 +net.ipv4.tcp_rmem = 4096 16777216 134217728 +net.ipv4.tcp_wmem = 4096 16777216 134217728 +net.ipv4.tcp_low_latency = 1 +net.ipv4.tcp_mtu_probing = 1 + +net.ipv4.tcp_congestion_control=bbr +net.core.default_qdisc=fq +net.ipv4.tcp_slow_start_after_idle=0 diff --git a/io500/sc25/tta/mantastorage/clients/Client-5/etc/sysctl.d/99-daos.conf b/io500/sc25/tta/mantastorage/clients/Client-5/etc/sysctl.d/99-daos.conf new file mode 100644 index 0000000..40356f3 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-5/etc/sysctl.d/99-daos.conf @@ -0,0 +1 @@ +vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/clients/Client-6/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/clients/Client-6/etc/daos/daos_agent.yml new file mode 100644 index 0000000..bbb5036 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-6/etc/daos/daos_agent.yml @@ -0,0 +1,183 @@ +# DAOS agent configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the daos_agent command line. +# Otherwise, /etc/daos/daos_agent.yml is used. 
+# +# Section describing the daos_agent configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. + +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. +# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. +# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. +# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. +# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. 
+#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. +# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley +# +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. +# cache_expiration: 1m + +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: + # In order to disable transport security, uncomment and set allow_insecure + # to true. Not recommended for production configurations. + allow_insecure: false + + # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt + # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt + # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. 
If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +runtime_dir: /var/run/daos_agent + +# Full path and name of the DAOS agent logfile. +# default: print to stderr +log_file: /var/log/daos/daos_agent.log + +# Force specific debug mask for daos_agent (control plane). +# Mask specifies minimum level of message significance to pass to logger. +# Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. + +# default: INFO +control_log_mask: INFO + +# Disable automatic eviction of open pool handles on agent shutdown. By default, +# the agent will evict all open pool handles for local processes on shutdown. +# Note that this implies that stopping or restarting the agent will result +# in interruption of DAOS I/O for any local DAOS client processes that have +# an open pool handle. +# default: false +disable_auto_evict: true + +# If enabled, the agent will evict any open pool handles associated with this machine on agent +# startup. This allows the servers to reclaim resources that may not have been properly cleaned +# up in the event of an agent or machine crash. +# default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). 
+# +#exclude_fabric_ifaces: ["lo", "eth1"] + +# Conversely, only consider a specific set of fabric interfaces when selecting +# an interface for client applications. (Mutually exclusive with exclude). + +include_fabric_ifaces: ["ibs255f0", "ibp200s0"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. + +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs255f0 + domain: mlx5_0 +- + numa_node: 3 + devices: + - + iface: ibp200s0 + domain: mlx5_2 diff --git a/io500/sc25/tta/mantastorage/clients/Client-6/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/clients/Client-6/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..d3baeaf --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-6/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,48 @@ +vm.nr_hugepages = 16384 +vm.swappiness = 1 + +############################################################################## +# Multi-rail +############################################################################## +net.ipv4.conf.ibs255f0.accept_local = 1 +net.ipv4.conf.ibs255f0.rp_filter = 2 +net.ipv4.conf.ibs255f0.arp_ignore = 2 +net.ipv4.conf.ibs255f0.arp_ignore = 0 +net.ipv4.conf.ibs255f0.arp_filter = 0 + +net.ipv4.conf.ibs255f1.accept_local = 1 +net.ipv4.conf.ibs255f1.rp_filter = 2 +net.ipv4.conf.ibs255f1.arp_ignore = 2 +net.ipv4.conf.ibs255f1.arp_ignore = 0 +net.ipv4.conf.ibs255f1.arp_filter = 0 + +net.ipv4.conf.ibp200s0.accept_local = 1 +net.ipv4.conf.ibp200s0.rp_filter = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 0 +net.ipv4.conf.ibp200s0.arp_filter = 0 + +net.core.netdev_max_backlog = 250000 +#net.core.rmem_max = 16777216 +net.core.rmem_max = 134217728 +#net.core.wmem_max = 16777216 +net.core.wmem_max = 134217728 +net.core.rmem_default = 16777216 
+net.core.wmem_default = 16777216 +net.core.optmem_max = 16777216 + +net.ipv4.neigh.default.gc_thresh1 = 1024 +net.ipv4.neigh.default.gc_thresh2 = 4096 +net.ipv4.neigh.default.gc_thresh3 = 16384 + +net.ipv4.tcp_timestamps = 1 +net.ipv4.tcp_sack = 1 +net.ipv4.tcp_mem = 16777216 16777216 16777216 +net.ipv4.tcp_rmem = 4096 16777216 134217728 +net.ipv4.tcp_wmem = 4096 16777216 134217728 +net.ipv4.tcp_low_latency = 1 +net.ipv4.tcp_mtu_probing = 1 + +net.ipv4.tcp_congestion_control=bbr +net.core.default_qdisc=fq +net.ipv4.tcp_slow_start_after_idle=0 diff --git a/io500/sc25/tta/mantastorage/clients/Client-6/etc/sysctl.d/99-daos-vm.conf b/io500/sc25/tta/mantastorage/clients/Client-6/etc/sysctl.d/99-daos-vm.conf new file mode 100644 index 0000000..40356f3 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-6/etc/sysctl.d/99-daos-vm.conf @@ -0,0 +1 @@ +vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/clients/Client-6/etc/sysctl.d/99-daos.conf b/io500/sc25/tta/mantastorage/clients/Client-6/etc/sysctl.d/99-daos.conf new file mode 100644 index 0000000..feff3ec --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-6/etc/sysctl.d/99-daos.conf @@ -0,0 +1,45 @@ +############################################################################## +# Multi-rail +############################################################################## +net.ipv4.conf.ibs255f0.accept_local = 1 +net.ipv4.conf.ibs255f0.rp_filter = 2 +net.ipv4.conf.ibs255f0.arp_ignore = 2 +net.ipv4.conf.ibs255f0.arp_ignore = 0 +net.ipv4.conf.ibs255f0.arp_filter = 0 + +net.ipv4.conf.ibs255f1.accept_local = 1 +net.ipv4.conf.ibs255f1.rp_filter = 2 +net.ipv4.conf.ibs255f1.arp_ignore = 2 +net.ipv4.conf.ibs255f1.arp_ignore = 0 +net.ipv4.conf.ibs255f1.arp_filter = 0 + +net.ipv4.conf.ibp200s0.accept_local = 1 +net.ipv4.conf.ibp200s0.rp_filter = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 0 +net.ipv4.conf.ibp200s0.arp_filter = 0 + +net.core.netdev_max_backlog = 250000 
+#net.core.rmem_max = 16777216 +net.core.rmem_max = 134217728 +#net.core.wmem_max = 16777216 +net.core.wmem_max = 134217728 +net.core.rmem_default = 16777216 +net.core.wmem_default = 16777216 +net.core.optmem_max = 16777216 + +net.ipv4.neigh.default.gc_thresh1 = 1024 +net.ipv4.neigh.default.gc_thresh2 = 4096 +net.ipv4.neigh.default.gc_thresh3 = 16384 + +net.ipv4.tcp_timestamps = 1 +net.ipv4.tcp_sack = 1 +net.ipv4.tcp_mem = 16777216 16777216 16777216 +net.ipv4.tcp_rmem = 4096 16777216 134217728 +net.ipv4.tcp_wmem = 4096 16777216 134217728 +net.ipv4.tcp_low_latency = 1 +net.ipv4.tcp_mtu_probing = 1 + +net.ipv4.tcp_congestion_control=bbr +net.core.default_qdisc=fq +net.ipv4.tcp_slow_start_after_idle=0 diff --git a/io500/sc25/tta/mantastorage/clients/Client-7/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/clients/Client-7/etc/daos/daos_agent.yml new file mode 100644 index 0000000..bbb5036 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-7/etc/daos/daos_agent.yml @@ -0,0 +1,183 @@ +# DAOS agent configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the daos_agent command line. +# Otherwise, /etc/daos/daos_agent.yml is used. +# +# Section describing the daos_agent configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. + +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. +# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. 
+# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. +# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. +# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. +#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. +# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley +# +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. 
For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. +# cache_expiration: 1m + +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: + # In order to disable transport security, uncomment and set allow_insecure + # to true. Not recommended for production configurations. + allow_insecure: false + + # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt + # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt + # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +runtime_dir: /var/run/daos_agent + +# Full path and name of the DAOS agent logfile. +# default: print to stderr +log_file: /var/log/daos/daos_agent.log + +# Force specific debug mask for daos_agent (control plane). +# Mask specifies minimum level of message significance to pass to logger. +# Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. + +# default: INFO +control_log_mask: INFO + +# Disable automatic eviction of open pool handles on agent shutdown. By default, +# the agent will evict all open pool handles for local processes on shutdown. +# Note that this implies that stopping or restarting the agent will result +# in interruption of DAOS I/O for any local DAOS client processes that have +# an open pool handle. 
+# default: false +disable_auto_evict: true + +# If enabled, the agent will evict any open pool handles associated with this machine on agent +# startup. This allows the servers to reclaim resources that may not have been properly cleaned +# up in the event of an agent or machine crash. +# default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). +# +#exclude_fabric_ifaces: ["lo", "eth1"] + +# Conversely, only consider a specific set of fabric interfaces when selecting +# an interface for client applications. (Mutually exclusive with exclude). + +include_fabric_ifaces: ["ibs255f0", "ibp200s0"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. 
+ +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs255f0 + domain: mlx5_0 +- + numa_node: 3 + devices: + - + iface: ibp200s0 + domain: mlx5_2 diff --git a/io500/sc25/tta/mantastorage/clients/Client-7/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/clients/Client-7/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..feff3ec --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-7/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,45 @@ +############################################################################## +# Multi-rail +############################################################################## +net.ipv4.conf.ibs255f0.accept_local = 1 +net.ipv4.conf.ibs255f0.rp_filter = 2 +net.ipv4.conf.ibs255f0.arp_ignore = 2 +net.ipv4.conf.ibs255f0.arp_ignore = 0 +net.ipv4.conf.ibs255f0.arp_filter = 0 + +net.ipv4.conf.ibs255f1.accept_local = 1 +net.ipv4.conf.ibs255f1.rp_filter = 2 +net.ipv4.conf.ibs255f1.arp_ignore = 2 +net.ipv4.conf.ibs255f1.arp_ignore = 0 +net.ipv4.conf.ibs255f1.arp_filter = 0 + +net.ipv4.conf.ibp200s0.accept_local = 1 +net.ipv4.conf.ibp200s0.rp_filter = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 0 +net.ipv4.conf.ibp200s0.arp_filter = 0 + +net.core.netdev_max_backlog = 250000 +#net.core.rmem_max = 16777216 +net.core.rmem_max = 134217728 +#net.core.wmem_max = 16777216 +net.core.wmem_max = 134217728 +net.core.rmem_default = 16777216 +net.core.wmem_default = 16777216 +net.core.optmem_max = 16777216 + +net.ipv4.neigh.default.gc_thresh1 = 1024 +net.ipv4.neigh.default.gc_thresh2 = 4096 +net.ipv4.neigh.default.gc_thresh3 = 16384 + +net.ipv4.tcp_timestamps = 1 +net.ipv4.tcp_sack = 1 +net.ipv4.tcp_mem = 16777216 16777216 16777216 +net.ipv4.tcp_rmem = 4096 16777216 134217728 +net.ipv4.tcp_wmem = 4096 16777216 134217728 +net.ipv4.tcp_low_latency = 1 +net.ipv4.tcp_mtu_probing = 1 + +net.ipv4.tcp_congestion_control=bbr +net.core.default_qdisc=fq +net.ipv4.tcp_slow_start_after_idle=0 diff --git 
a/io500/sc25/tta/mantastorage/clients/Client-7/etc/sysctl.d/99-daos-vm.conf b/io500/sc25/tta/mantastorage/clients/Client-7/etc/sysctl.d/99-daos-vm.conf new file mode 100644 index 0000000..40356f3 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-7/etc/sysctl.d/99-daos-vm.conf @@ -0,0 +1 @@ +vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/clients/Client-8/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/clients/Client-8/etc/daos/daos_agent.yml new file mode 100644 index 0000000..7c8f627 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-8/etc/daos/daos_agent.yml @@ -0,0 +1,183 @@ +# DAOS agent configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the daos_agent command line. +# Otherwise, /etc/daos/daos_agent.yml is used. +# +# Section describing the daos_agent configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. + +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. +# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. +# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. 
+# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. +# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. +#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. +# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley +# +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. 
+# cache_expiration: 1m + +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: + # In order to disable transport security, uncomment and set allow_insecure + # to true. Not recommended for production configurations. + allow_insecure: false + + # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt + # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt + # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +runtime_dir: /var/run/daos_agent + +# Full path and name of the DAOS agent logfile. +# default: print to stderr +log_file: /var/log/daos/daos_agent.log + +# Force specific debug mask for daos_agent (control plane). +# Mask specifies minimum level of message significance to pass to logger. +# Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. + +# default: INFO +control_log_mask: INFO + +# Disable automatic eviction of open pool handles on agent shutdown. By default, +# the agent will evict all open pool handles for local processes on shutdown. +# Note that this implies that stopping or restarting the agent will result +# in interruption of DAOS I/O for any local DAOS client processes that have +# an open pool handle. +# default: false +disable_auto_evict: true + +# If enabled, the agent will evict any open pool handles associated with this machine on agent +# startup. This allows the servers to reclaim resources that may not have been properly cleaned +# up in the event of an agent or machine crash. 
+# default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). +# +#exclude_fabric_ifaces: ["lo", "eth1"] + +# Conversely, only consider a specific set of fabric interfaces when selecting +# an interface for client applications. (Mutually exclusive with exclude). + +include_fabric_ifaces: ["ibs255f0", "ibp200s0"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. 
+ +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs255f0 + domain: mlx5_0 +- + numa_node: 3 + devices: + - + iface: ibp200s0 + domain: mlx5_2 diff --git a/io500/sc25/tta/mantastorage/clients/Client-8/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/clients/Client-8/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..feff3ec --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-8/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,45 @@ +############################################################################## +# Multi-rail +############################################################################## +net.ipv4.conf.ibs255f0.accept_local = 1 +net.ipv4.conf.ibs255f0.rp_filter = 2 +net.ipv4.conf.ibs255f0.arp_ignore = 2 +net.ipv4.conf.ibs255f0.arp_ignore = 0 +net.ipv4.conf.ibs255f0.arp_filter = 0 + +net.ipv4.conf.ibs255f1.accept_local = 1 +net.ipv4.conf.ibs255f1.rp_filter = 2 +net.ipv4.conf.ibs255f1.arp_ignore = 2 +net.ipv4.conf.ibs255f1.arp_ignore = 0 +net.ipv4.conf.ibs255f1.arp_filter = 0 + +net.ipv4.conf.ibp200s0.accept_local = 1 +net.ipv4.conf.ibp200s0.rp_filter = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 2 +net.ipv4.conf.ibp200s0.arp_ignore = 0 +net.ipv4.conf.ibp200s0.arp_filter = 0 + +net.core.netdev_max_backlog = 250000 +#net.core.rmem_max = 16777216 +net.core.rmem_max = 134217728 +#net.core.wmem_max = 16777216 +net.core.wmem_max = 134217728 +net.core.rmem_default = 16777216 +net.core.wmem_default = 16777216 +net.core.optmem_max = 16777216 + +net.ipv4.neigh.default.gc_thresh1 = 1024 +net.ipv4.neigh.default.gc_thresh2 = 4096 +net.ipv4.neigh.default.gc_thresh3 = 16384 + +net.ipv4.tcp_timestamps = 1 +net.ipv4.tcp_sack = 1 +net.ipv4.tcp_mem = 16777216 16777216 16777216 +net.ipv4.tcp_rmem = 4096 16777216 134217728 +net.ipv4.tcp_wmem = 4096 16777216 134217728 +net.ipv4.tcp_low_latency = 1 +net.ipv4.tcp_mtu_probing = 1 + +net.ipv4.tcp_congestion_control=bbr +net.core.default_qdisc=fq +net.ipv4.tcp_slow_start_after_idle=0 diff --git 
a/io500/sc25/tta/mantastorage/clients/Client-8/etc/sysctl.d/99-daos-vm.conf b/io500/sc25/tta/mantastorage/clients/Client-8/etc/sysctl.d/99-daos-vm.conf new file mode 100644 index 0000000..40356f3 --- /dev/null +++ b/io500/sc25/tta/mantastorage/clients/Client-8/etc/sysctl.d/99-daos-vm.conf @@ -0,0 +1 @@ +vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/config-all-dfs-rf0-tmpl.ini b/io500/sc25/tta/mantastorage/config-all-dfs-rf0-tmpl.ini new file mode 100644 index 0000000..2462f5f --- /dev/null +++ b/io500/sc25/tta/mantastorage/config-all-dfs-rf0-tmpl.ini @@ -0,0 +1,282 @@ +# Supported and current values of the ini file: +[global] +# The directory where the IO500 runs +datadir = / +# The data directory is suffixed by a timestamp. Useful for running several IO500 tests concurrently. +timestamp-datadir = TRUE +# The result directory. +resultdir = ./results/dfs-rf0 +# The result directory is suffixed by a timestamp. Useful for running several IO500 tests concurrently. +timestamp-resultdir = TRUE +# The general API for the tests (to create/delete the datadir, extra options will be passed to IOR/mdtest) +api = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT +# Purge the caches, this is useful for testing and needed for single node runs +drop-caches = FALSE +# Cache purging command, invoked before each I/O phase +drop-caches-cmd = sudo -n bash -c "echo 3 > /proc/sys/vm/drop_caches" +# Allocate the I/O buffers on the GPU +io-buffers-on-gpu = FALSE +# The verbosity level between 1 and 10 +verbosity = 1 +# Use the rules for the Student Cluster Competition +scc = FALSE +# Type of packet that will be created [timestamp|offset|incompressible|random] +dataPacketType = timestamp + +[debug] +# For a valid result, the stonewall timer must be set to the value according to the rules. If smaller INVALIDATES RUN; FOR DEBUGGING. +stonewall-time = 300 +# Pause between phases while in this directory lies a file with the phase name, e.g., easy-create. 
This can be useful for performance testing, e.g., of tiered storage. At the moment it INVALIDATES RUN; FOR DEBUGGING. +pause-dir = + +[ior-easy] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX +# Transfer size +transferSize = 1m +# Block size; must be a multiple of transferSize +blockSize = 992000m +# Create one file per process +filePerProc = TRUE +# Use unique directory per file per process +uniqueDir = FALSE +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[ior-easy-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[ior-rnd4K] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX +# Size of a random block, change only if explicitly allowed +blockSize = 1073741824 +# Run this phase +run = TRUE +# The verbosity level +verbosity = +# Prefill the file with this blocksize in bytes, e.g., 2097152 +randomPrefill = 0 + +[ior-rnd4K-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[mdtest-easy] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Files per proc +n = 2000000 +# Run this phase +run = TRUE + +[mdtest-easy-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[ior-rnd1MB] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX +# Size of a random block, change only if explicitly allowed +blockSize = 1073741824 +# Run this phase +run = TRUE +# The verbosity level +verbosity = +# Prefill the file with this blocksize in bytes, e.g., 2097152 +randomPrefill = 0 + +[ior-rnd1MB-write] +# The API to be used 
+API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[mdworkbench] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Waiting time of an IO operation relative to runtime (1.0 is 100%%) +waitingTime = 0.0 +# Files to precreate per set (always 10 sets), this is normally dynamically determined +precreatePerSet = +# Files to run per iteration and set (always 10 sets), this is normally dynamically determined +filesPerProc = +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[mdworkbench-create] +# Run this phase +run = TRUE + +[timestamp] + +[find-easy] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = @FIND_NPROC +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = @FIND_QLEN +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! 
+pfind-parallelize-single-dir-access-using-hashing = FALSE + +[ior-hard] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX --dfs.chunk_size=470080 +# Number of segments +segmentCount = 10000000 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[ior-hard-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX --dfs.chunk_size=470080 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE + +[mdtest-hard] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Files per proc +n = 500000 +# File limit per directory (MDTest -I flag) to overcome file system limitations INVALIDATES RUN; FOR DEBUGGING. +files-per-dir = +# Run this phase +run = TRUE + +[mdtest-hard-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[find] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = @FIND_NPROC +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = @FIND_QLEN +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! 
+pfind-parallelize-single-dir-access-using-hashing = FALSE + +[ior-rnd4K-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[ior-rnd1MB-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[find-hard] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = @FIND_NPROC +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = @FIND_QLEN +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! +pfind-parallelize-single-dir-access-using-hashing = FALSE + +[mdworkbench-bench] +# Run this phase +run = TRUE + +[ior-easy-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[mdtest-easy-stat] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[ior-hard-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX --dfs.chunk_size=470080 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE + +[mdtest-hard-stat] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[mdworkbench-delete] +# Run this phase +run = TRUE + +[mdtest-easy-delete] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run 
= TRUE + +[mdtest-hard-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[mdtest-hard-delete] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[ior-rnd4K-easy-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE diff --git a/io500/sc25/tta/mantastorage/config-all-dfs-rf1-tmpl.ini b/io500/sc25/tta/mantastorage/config-all-dfs-rf1-tmpl.ini new file mode 100644 index 0000000..53a1adb --- /dev/null +++ b/io500/sc25/tta/mantastorage/config-all-dfs-rf1-tmpl.ini @@ -0,0 +1,282 @@ +# Supported and current values of the ini file: +[global] +# The directory where the IO500 runs +datadir = / +# The data directory is suffixed by a timestamp. Useful for running several IO500 tests concurrently. +timestamp-datadir = TRUE +# The result directory. +resultdir = ./results/dfs-rf1 +# The result directory is suffixed by a timestamp. Useful for running several IO500 tests concurrently. +timestamp-resultdir = TRUE +# The general API for the tests (to create/delete the datadir, extra options will be passed to IOR/mdtest) +api = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT +# Purge the caches, this is useful for testing and needed for single node runs +drop-caches = FALSE +# Cache purging command, invoked before each I/O phase +drop-caches-cmd = sudo -n bash -c "echo 3 > /proc/sys/vm/drop_caches" +# Allocate the I/O buffers on the GPU +io-buffers-on-gpu = FALSE +# The verbosity level between 1 and 10 +verbosity = 1 +# Use the rules for the Student Cluster Competition +scc = FALSE +# Type of packet that will be created [timestamp|offset|incompressible|random] +dataPacketType = timestamp + +[debug] +# For a valid result, the stonewall timer must be set to the value according to the rules. 
If smaller INVALIDATES RUN; FOR DEBUGGING. +stonewall-time = 300 +# Pause between phases while in this directory lies a file with the phase name, e.g., easy-create. This can be useful for performance testing, e.g., of tiered storage. At the moment it INVALIDATES RUN; FOR DEBUGGING. +pause-dir = + +[ior-easy] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Transfer size +transferSize = 1m +# Block size; must be a multiple of transferSize +blockSize = 992000m +# Create one file per process +filePerProc = TRUE +# Use unique directory per file per process +uniqueDir = FALSE +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[ior-easy-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Run this phase +run = TRUE + +[ior-rnd4K] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX +# Size of a random block, change only if explicitly allowed +blockSize = 1073741824 +# Run this phase +run = TRUE +# The verbosity level +verbosity = +# Prefill the file with this blocksize in bytes, e.g., 2097152 +randomPrefill = 0 + +[ior-rnd4K-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX +# Run this phase +run = TRUE + +[mdtest-easy] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Files per proc +n = 1000000 +# Run this phase +run = TRUE + +[mdtest-easy-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[ior-rnd1MB] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Size of a random block, change only if 
explicitly allowed +blockSize = 1073741824 +# Run this phase +run = TRUE +# The verbosity level +verbosity = +# Prefill the file with this blocksize in bytes, e.g., 2097152 +randomPrefill = 0 + +[ior-rnd1MB-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Run this phase +run = TRUE + +[mdworkbench] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=SX --dfs.oclass=S1 +# Waiting time of an IO operation relative to runtime (1.0 is 100%%) +waitingTime = 0.0 +# Files to precreate per set (always 10 sets), this is normally dynamically determined +precreatePerSet = +# Files to run per iteration and set (always 10 sets), this is normally dynamically determined +filesPerProc = +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[mdworkbench-create] +# Run this phase +run = TRUE + +[timestamp] + +[find-easy] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = 448 +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = 20000 +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! 
+pfind-parallelize-single-dir-access-using-hashing = FALSE + +[ior-hard] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX --dfs.chunk_size=470080 +# Number of segments +segmentCount = 10000000 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[ior-hard-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX --dfs.chunk_size=470080 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE + +[mdtest-hard] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Files per proc +n = 250000 +# File limit per directory (MDTest -I flag) to overcome file system limitations INVALIDATES RUN; FOR DEBUGGING. +files-per-dir = +# Run this phase +run = TRUE + +[mdtest-hard-write] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[find] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = 448 +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = 20000 +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! 
+pfind-parallelize-single-dir-access-using-hashing = FALSE + +[ior-rnd4K-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX +# Run this phase +run = TRUE + +[ior-rnd1MB-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Run this phase +run = TRUE + +[find-hard] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = 448 +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = 20000 +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! +pfind-parallelize-single-dir-access-using-hashing = FALSE + +[mdworkbench-bench] +# Run this phase +run = TRUE + +[ior-easy-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Run this phase +run = TRUE + +[mdtest-easy-stat] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[ior-hard-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX --dfs.chunk_size=470080 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE + +[mdtest-hard-stat] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[mdworkbench-delete] +# Run this phase +run = TRUE + +[mdtest-easy-delete] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT 
--dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[mdtest-hard-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[mdtest-hard-delete] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[ior-rnd4K-easy-read] +# The API to be used +API = DFS --dfs.pool=@DAOS_POOL --dfs.cont=@DAOS_CONT --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX +# Run this phase +run = TRUE diff --git a/io500/sc25/tta/mantastorage/io500-dfs-tmpl.sh b/io500/sc25/tta/mantastorage/io500-dfs-tmpl.sh new file mode 100755 index 0000000..7ce5cb4 --- /dev/null +++ b/io500/sc25/tta/mantastorage/io500-dfs-tmpl.sh @@ -0,0 +1,222 @@ +#!/bin/bash +#SBATCH --nodes=10 --ntasks-per-node=6 -p compute -A ku0598 + +# INSTRUCTIONS: +# +# The only parts of the script that may need to be modified are: +# - setup() to configure the binary locations and MPI parameters +# Please visit https://vi4io.org/io500-info-creator/ to help generate the +# "system-information.txt" file, by pasting the output of the info-creator. +# This file contains details of your system hardware for your submission. + +# This script takes its parameters from the same .ini file as io500 binary. 
+io500_ini="$1" # You can set the ini file here +io500_mpirun="mpirun" + +if [[ -z "${DAOS_POOL}" ]]; then + DAOS_POOL="tank" +fi + +if [[ -z "${DAOS_CONT}" ]]; then + DAOS_CONT="posix_container" +fi + +HOSTS_LIST=( + #"Client-1" + "Client-2" + "Client-3" + "Client-4" + "Client-5" + "Client-6" + "Client-7" + "Client-8" +) + +#PROCS_PER_NODE="$(nproc)" + +PROCS_PER_NODE=144 +NPROC="$((PROCS_PER_NODE * ${#HOSTS_LIST[@]}))" + +#FIND_NPROC="$((NPROC / 3))" +#FIND_QLEN="$((FIND_NPROC * 45))" + +FIND_NPROC="448" +FIND_QLEN="20160" + +#io500_mpiargs="-np 1344 -ppn 192" +#io500_mpiargs="-np 1008 -ppn 144" +io500_mpiargs="-np $NPROC -ppn $PROCS_PER_NODE" + +#io500_mpiargs+=" -bind-to numa -map-by numa" +io500_mpiargs+=" -bind-to core -map-by core" + +OLD_IFS=$IFS +IFS=, +HOSTS_STR="${HOSTS_LIST[*]}" +IFS=$OLD_IFS + +io500_mpiargs+=" -hosts $HOSTS_STR" + +io500_mpiargs+=" -genv DAOS_POOL $DAOS_POOL" +io500_mpiargs+=" -genv DAOS_CONT $DAOS_CONT" + +io500_mpiargs+=" -genv CRT_CTX_SHARE_ADDR 1" +io500_mpiargs+=" -genv CRT_CTX_NUM 8" +io500_mpiargs+=" -genv CRT_CREDIT_EP_CTX 0" +io500_mpiargs+=" -genv CRT_MRC_ENABLE 1" +io500_mpiargs+=" -genv CRT_TIMEOUT 600" + +#io500_mpiargs+=" -genv FI_CXI_RX_MATCH_MODE hybrid" +#io500_mpiargs+=" -genv FI_CXI_REQ_BUF_SIZE 8388608" +#io500_mpiargs+=" -genv FI_CXI_OFLOW_BUF_SIZE 8388608" +#io500_mpiargs+=" -genv FI_CXI_REQ_BUF_MIN_POSTED 8" +#io500_mpiargs+=" -genv FI_CXI_DEFAULT_CQ_SIZE 131072" +#io500_mpiargs+=" -genv FI_CXI_CQ_FILL_PERCENT 20" +#io500_mpiargs+=" -genv FI_OFI_RXM_BUFFER_SIZE 131072" +#io500_mpiargs+=" -genv FI_OFI_RXM_SAR_LIMIT 131072" +io500_mpiargs+=" -genv FI_OFI_RXM_USE_SRX 1" +io500_mpiargs+=" -genv FI_UNIVERSE_SIZE 16383" +io500_mpiargs+=" -genv FI_VERBS_INLINE_SIZE 128" +io500_mpiargs+=" -genv FI_VERBS_PREFER_XRC 1" + +#io500_mpiargs+=" -genv HDF5_PARAPREFIX \"daos:\" " +#io500_mpiargs+=" -genv ROMIO_FSTYPE_FORCE \"daos:\"" +#io500_mpiargs+=" -genv MPICH_MPIIO_HINTS 
\"romis_ds_write:disable;romio_cb_write:enable;cc_nodes:8\"" +#io500_mpiargs+=" -genv MPICH_MPIIO_HINTS_DISPLAY 1" + +#io500_mpiargs+=" -genv MPIR_CVAR_BCAST_POSIX_INTRA_ALGORITHM mpir" +#io500_mpiargs+=" -genv MPIR_CVAR_ALLREDUCE_POSIX_INTRA_ALGORITHM mpir" +#io500_mpiargs+=" -genv MPIR_CVAR_BARRIER_POSIX_INTRA_ALGORITHM mpir" +#io500_mpiargs+=" -genv MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM mpir" + +#io500_mpiargs+=" -genv FI_PROVIDER verbs" + +# for default? +#io500_mpiargs+=" -genv MPICH_MPIIO_HINTS \"romio_cb_write:disable;romio_cb_read:disable;romio_no_indep_rw=false\"" + +# for ior-easy +#io500_mpiargs+=" -genv MPICH_MPIIO_HINTS \"romio_ds_write:disable;romio_ds_read:disable;romio_cb_write:disable;romio_cb_read:disable:romio_no_indep_rw=true\"" + +# for ior-hard +#io500_mpiargs+=" -genv MPICH_MPIIO_HINTS \"romio_cb_write:enable;romio_cb_read:enable;cb_buffer_size:67108864;cb_nodes:16\"" + +# for mdtest +#io500_mpiarsg+=" -genv MPICH_MPIIO_HINTS \"romio_cb_write:disable;romio_cb_read:disable;romio_no_indep_rw=false\"" + +if [ -f "${io500_ini//.ini/-tmpl.ini}" ]; then + [ -f "$io500_ini" ] && mv "$io500_ini" "${io500_ini}.bak" + + cp -af "${io500_ini//.ini/-tmpl.ini}" "$io500_ini" + + sed -i -e "s/@DAOS_POOL/$DAOS_POOL/g" "$io500_ini" + sed -i -e "s/@DAOS_CONT/$DAOS_CONT/g" "$io500_ini" + sed -i -e "s/@FIND_NPROC/$FIND_NPROC/g" "$io500_ini" + sed -i -e "s/@FIND_QLEN/$FIND_QLEN/g" "$io500_ini" +fi + +function setup(){ + local workdir="$1" + local resultdir="$2" + mkdir -p $workdir $resultdir + + # Example commands to create output directories for Lustre. Creating + # top-level directories is allowed, but not the whole directory tree. 
+ #if (( $(lfs df $workdir | grep -c MDT) > 1 )); then + # lfs setdirstripe -D -c -1 $workdir + #fi + #lfs setstripe -c 1 $workdir + #mkdir $workdir/ior-easy $workdir/ior-hard + #mkdir $workdir/mdtest-easy $workdir/mdtest-hard + #local osts=$(lfs df $workdir | grep -c OST) + # Try overstriping for ior-hard to improve scaling, or use wide striping + #lfs setstripe -C $((osts * 4)) $workdir/ior-hard || + # lfs setstripe -c -1 $workdir/ior-hard + # Try to use DoM if available, otherwise use default for small files + #lfs setstripe -E 64k -L mdt $workdir/mdtest-easy || true #DoM? + #lfs setstripe -E 64k -L mdt $workdir/mdtest-hard || true #DoM? + #lfs setstripe -E 64k -L mdt $workdir/mdtest-rnd +} + +# ***** YOU SHOULD NOT EDIT ANYTHING BELOW THIS LINE ***** +set -eo pipefail # better error handling + +if [[ -z "$io500_ini" ]]; then + echo "error: ini file must be specified. usage: $0 " + exit 1 +fi +if [[ ! -s "$io500_ini" ]]; then + echo "error: ini file '$io500_ini' not found or empty" + exit 2 +fi + +function get_ini_section_param() { + local section="$1" + local param="$2" + local inside=false + + while read LINE; do + LINE=$(sed -e 's/ *#.*//' -e '1s/ *= */=/' <<<$LINE) + $inside && [[ "$LINE" =~ "[.*]" ]] && inside=false && break + [[ -n "$section" && "$LINE" =~ "[$section]" ]] && inside=true && continue + ! 
$inside && continue + #echo $LINE | awk -F = "/^$param/ { print \$2 }" + if [[ $(echo $LINE | grep "^$param *=" ) != "" ]] ; then + # echo "$section : $param : $inside : $LINE" >> parsed.txt # debugging + echo $LINE | sed -e "s/[^=]*=[ \t]*\(.*\)/\1/" + return + fi + done < $io500_ini + echo "" +} + +function get_ini_global_param() { + local param="$1" + local default="$2" + local val + + val=$(get_ini_section_param global $param | + sed -e 's/[Ff][Aa][Ll][Ss][Ee]/False/' -e 's/[Tt][Rr][Uu][Ee]/True/') + + echo "${val:-$default}" +} + +function run_benchmarks { + $io500_mpirun $io500_mpiargs $PWD/io500 $io500_ini --timestamp $timestamp +} + +create_tarball() { + local sourcedir=$(dirname $io500_resultdir) + local fname=$(basename ${io500_resultdir}) + local tarball=$sourcedir/io500-$HOSTNAME-$fname.tgz + + cp -v $0 $io500_ini $io500_resultdir + tar czf $tarball -C $sourcedir $fname + echo "Created result tarball $tarball" +} + +function main { + # These commands extract the 'datadir' and 'resultdir' from .ini file + timestamp=$(date +%Y.%m.%d-%H.%M.%S) # create a uniquifier + [ $(get_ini_global_param timestamp-datadir True) != "False" ] && + ts="$timestamp" || ts="io500" + # working directory where the test files will be created + export io500_workdir=$(get_ini_global_param datadir $PWD/datafiles)/$ts + [ $(get_ini_global_param timestamp-resultdir True) != "False" ] && + ts="$timestamp" || ts="io500" + # the directory where the output results will be kept + export io500_resultdir=$(get_ini_global_param resultdir $PWD/results)/$ts + + setup $io500_workdir $io500_resultdir + run_benchmarks + + if [[ ! 
-s "system-information.txt" ]]; then + echo "Warning: please create a 'system-information.txt' description by" + echo "copying the information from https://vi4io.org/io500-info-creator/" + else + cp "system-information.txt" $io500_resultdir + fi + + create_tarball +} + +main diff --git a/io500/sc25/tta/mantastorage/io500-reproducibility.tta-mantastorage.md b/io500/sc25/tta/mantastorage/io500-reproducibility.tta-mantastorage.md new file mode 100644 index 0000000..2f70c4f --- /dev/null +++ b/io500/sc25/tta/mantastorage/io500-reproducibility.tta-mantastorage.md @@ -0,0 +1,301 @@ +# IO500 Reproducibility - SC25 - TTA - MantaStorage + +> The goal of these questions are to demonstrate that your IO500 benchmark execution is valid, can be +> reproduced, and provide additional details of your submitted storage system. Along with the other +> submitted items, the answers to these questions are used to calculate your reproducibility score and +> whether the submission is eligible for the Production or Research list. + +## System Purpose + +> Please describe the purpose and general usage of the submitted system. This would include the +> types of typical applications it supports (e.g., defense applications, molecular dynamics, +> benchmarking, system test, systems research), the general use and purpose of the data generated +> by the applications running on it. + +MantaStorage is an AI storage system under research and development as part +of the Republic of Korea's national R&D project titled "Development of +High-Efficiency Parallel Storage Software Technology Optimized for AI +Computational Accelerators." It is located at the TTA +(Telecommunications Technology Association) in Bundang-gu, Seongnam-si. 
+ +All participating researchers from institutions involved in the Republic of +Korea's national R&D project, "Development of High-Efficiency Parallel Storage +Software Technology Optimized for AI Computational Accelerators" have access +to MantaStorage. + +## Availability + +> Please provide the deployment timeframe of the submitted system, or for on-demand cloud +> systems, the general period over which it is deployed and destroyed. +> +> Please describe the availability of the system to users and who are its set of most regular users. + +All participating researchers from institutions involved in the Republic of +Korea's national R&D project, "Development of High-Efficiency Parallel Storage +Software Technology Optimized for AI Computational Accelerators" have access +to MantaStorage. + +## Storage System Software + +> Please describe the purpose and general usage of the submitted system. This would include the +> types of typical applications it supports (e.g., defense applications, molecular dynamics, +> benchmarking, system test, systems research), the general use and purpose of the data generated +> by the applications running on it. + +MantaStorage is the name of an AI storage system that is currently being +researched and developed as part of the Republic of Korea's national R&D project +titled "Development of High-Efficiency Parallel Storage Software Technology +Optimized for AI Computing Accelerators." It is located at the TTA +(Telecommunications Technology Association) in Bundang-gu, near Seongnam-si. + +All participating researchers from institutions involved in the Republic of +Korea's national R&D project, "Development of High-Efficiency Parallel Storage +Software Technology Optimized for AI Computational Accelerators" have access +to MantaStorage. + +> How is this software available? 
(e.g., commercially, open-source, not publicly available) Note that if +> the storage software is not open-source or commercially available, then a general description +> would be requested, but this would limit the submission’s reproducibility score. +> +> Can anyone download/purchase this software? +> +> List either product webpage or open-source repo of the exact software used in IO500. + +The storage software is the Distributed Asynchronous Object Store (DAOS), currently +at version 2.7 (under development). DAOS is completely open-source: + +* DAOS [GitHub repository](https://github.com/daos-stack/daos) + - [DAOS 2.7 (under development)](https://github.com/daos-stack/daos/tree/808afd521bb41f3b0e08b43b2b5bae521ed00bd2) +* DAOS [documentation](https://docs.daos.io/) +* [SC-Asia 2020 paper](https://doi.org/10.1007/978-3-030-48842-0_3) + _DAOS: A Scale-Out High Performance Storage Stack for Storage Class Memory_ +* [SC-Asia 2023 paper](https://doi.org/10.1145/3581576.3581577) + _Understanding DAOS Storage Performance Scalability_ + +Commercial support for DAOS is available from multiple companies. +For MantaStorage, DAOS support is provided by [Gluesys](https://www.gluesys.com). + +> Give any and all additional details of this specific storage deployment: (e.g., type of storage server +> product such IBM ESS or DDN SFA400X2, use of Ext4 or some other file system on each storage +> node, dual connected storage media to storage servers). 
+ +The DAOS storage cluster in TTA's MantaStorage system consists of +4x Supermicro SYS-222C-TN servers, currently running DAOS Version 2.7(under development) on Rocky Linux 9.6: + +- Server 1 + - CPU: 2x Intel(R) Xeon(R) 6530P + - RAM: 8x 64GiB DDR5 Samsung 4800MT/s + - NVMe + - 10x Seagate XP7680SE70006 7.68TB + - 2x Dapustor DPRD3108T0T507T6000 5.63TB + - NIC + - 2x Mellanox ConnectX-6 1-port Infiniband HDR adapter +- Server 2 + - CPU: 2x Intel(R) Xeon(R) 6530P + - RAM: 8x 64GiB DDR5 Samsung 4800MT/s + - NVMe + - 10x Seagate XP7680SE70006 7.68TB + - 2x Dapustor DPRD3108T0T507T6000 5.63TB + - NIC + - 2x Mellanox ConnectX-6 1-port Infiniband HDR adapter +- Server 3 + - CPU: 2x Intel(R) Xeon(R) 6520P + - RAM: 8x 64GiB DDR5 Samsung 4800MT/s + - NVMe + - 8x Samsung MZQL23T8HCLS-00A07 3.84TB + - 2x Seagate XP7680SE70006 7.68TB + - NIC + - 2x Dapustor DPRD3108T0T507T6000 5.63TB +- Server 4 + - CPU: 2x Intel(R) Xeon(R) 6520P + - RAM: 8x 64GiB DDR5 Samsung 4800MT/s + - NVMe + - 8x Samsung MZQL23T8HCLS-00A07 3.84TB + - 2x Seagate XP7680SE70006 7.68TB + - NIC + - 2x Dapustor DPRD3108T0T507T6000 5.63TB + +## Runtime Environment + +> State here that you provided all scripts/documentation that would allow someone else to +> reproduce your environment and attempt to achieve a similar IO500 score as the submitted +> system. +> +> NOTE: provide all files/documentation/scripts that would enable a user to build your environment +> and deploy the custom scripts, software, or config files once they have obtained the appropriate +> storage system hardware and software. These may be included into the io500.tgz or into the +> extraCodes upload on the submission form. 
+> +> Several examples include: +> +> * Commands used to set striping information (either for the entire system or for particular +> directories) +> * File system config and tuning information (or a reason why this is not available due to lack +> of root access, etc) on each node type (e.g., non-default config parameters on all 3 Lustre +> MDS, OSS, and client) +> * Any additional scripts utilized that impact IO500 execution beyond the io500 config file. For +> example, with IBM Spectrum Scale, the output of config, cluster, file system and fileset +> commands (and possibly even a dump of the configuration if possible). Each file system +> probably has similar type of config/tuning information that would need to be shared for a +> user to fully reproduce the environment. + +Full reproducibility documentation including the DAOS server configuration +files as well as client-side setup and run scripts is provided in the +daos-stack github repository: [https://github.com/daos-stack/daos-reproducibility/tree/master/io500/sc25/tta/mantastorage](https://github.com/daos-stack/daos-reproducibility/tree/master/io500/sc25/tta/mantastorage) + +## Fault Tolerance Mechanisms + +> Does your system have a single point of failure? Please describe all mechanisms provide fault +> tolerance for the submitted storage system. Be specific to your submission, not general storage +> system capabilities. +> +> * Power +> * Networking (e.g., dual redundant switches) +> * Inter Storage data and metadata server (e.g., active-active servers, client-directed RAID, +> declustered RAID, erasure-coding, 3-way replication) +> * Intra Storage data and metadata server storage media (e.g., raid5) +> +> Please list any additional information needed to determine whether this system has a single point +> of failure. + +DAOS is a scale-out software-defined storage system. 
+ +Hardware redundancy on the individual storage server level includes dual power supplies +(with redundant facility power feeds at the rack PDU level), and dual InfiniBand network interfaces +(typically attached to different InfiniBand leaf switches). +The servers used at TTA also use memory DIMMs with ECC error detection and correction. +All other redundancy features are provided in software. + +The DAOS management service is redundant, with instances running on multiple DAOS servers. +Client access to the management plane is transparently redirected if individual DAOS servers fail. +There are no dedicated metadata servers. +Data placement is algorithmic, and declustered across all storage targets. + +Contrary to traditional storage systems, a DAOS storage allocation (a DAOS _pool_) does not prescribe +any particular data protection level. It is only a raw storage allocation that is performed by the +storage administrator (or a job prolog). +Data distribution and data protection are handled at the DAOS _container_ level. +A DAOS pool can host multiple containers, which share the storage allocation +but may employ different data distribution and data protection mechanisms. + +Container management is an end user role and does not require storage administrator intervention. +This means that end users are free to pick the appropriate level of data protection +for each of their datasets, on a per-container or even a per-object level. +DAOS supports sharding/striping (redundancy factor rf=0), replication (2-way up to 5-way), +and network erasure coding (4+1P, 4+2P, 8+1P, 8+2P, 16+1P, 16+2P, etc.). + +**Regarding the IO500 rules**: + +* DAOS submissions for the **Production** list use 2-way replication for metadata and for IOR-Hard, + and single-parity EC (here, 2+1P) for IOR-Easy, + in order to satisfy the requirement of “no single point of failure”. 
+ The following DAOS object class settings are used in the io500.ini file for the Production list: + +``` + [ior-easy] + API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX + [ior-hard] + API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX --dfs.chunk_size=470080 + [mdtest-easy] + API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 + [mdtest-hard] + API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +``` + +* DAOS submissions for the **Research** list use sharding/striping (no data protection) for all tests, + in order to achieve maximum performance. + The following DAOS object class settings are used in the io500.ini file for the Research list: + +``` + [ior-easy] + API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX + [ior-hard] + API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX --dfs.chunk_size=470080 + [mdtest-easy] + API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 + [mdtest-hard] + API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +``` + +Note that these choices are completely user-defined (as part of the API definitions in the io500.ini file), +and have been executed on the **exact same** storage system configuration without intervention +of a storage administrator. + +## Execution + +> Please provide a description of how the IO500 benchmark was executed, e.g., via system scheduler +> (e.g., SLURM) to run a job on the compute cluster, which initially ran a setup process to configure +> the client and file system, and then started the full benchmark. + +The IO500-SC25 runs were performed in the deployment stage of the MantaStorage +installation, before user operation started. 
For this reason, the runs were +performed with interactive `mpirun` invocations, using a `-hosts` option +specified by the `io500_mpiargs` environment variable in io500.sh. + +Full reproducibility documentation including the DAOS server configuration files as well as +client-side setup and run scripts is provided in the daos-stack github repository: +[https://github.com/daos-stack/daos-reproducibility/tree/master/io500/sc25/tta/mantastorage](https://github.com/daos-stack/daos-reproducibility/tree/master/io500/sc25/tta/mantastorage) + +> During the IO500 benchmark execution was the system entirely dedicated to running the +> benchmark or were there other jobs running in the same cluster and storage system? + +Compute nodes and storage system were dedicated while running the benchmark. + +## Caching + +> Please describe all caching mechanisms in client/server that were utilized during the IO500 run. +> This could include caching in any storage medium (e.g., SSD, RAM). +> +> A few examples would include: +> +> * Client data/metadata caching (in Linux page cache or in other memory cache) +> * Client side NVMe read-only data cache +> * Storage server metadata/data caching in RAM +> * Storage controller caching +> * RAID card caching + +The DAOS File System (DFS) API that was used for the IO500 runs does not +perform any caching. + +## Data Source + +> Where is the source of truth of the data stored and later read back in the IO500 benchmark? This +> question relates to whether the submitted system is a burst buffer layered on primary storage or +> primary storage itself. + +DAOS is a standalone storage system, +and the DAOS POSIX container used in the IO500 benchmarks is the primary storage. + +## Trust + +> Please describe any steps taken to ensure that the results are trustworthy. +> +> * Did you run the benchmark multiple times and get similar scores? +> * Did you validate the score is below the physical capabilities of the deployed hardware? 
+> * Did you validate that the data was persistently stored? + +All runs have been repeated multiple times to ensure consistency. + +All results have been reviewed for plausibility and are reasonable within the +hardware performance boundaries of the servers, clients, and HDR InfiniBand +network. + +An in-depth performance scaling study of IO500 workloads is also available in the +[SC-Asia 2023 paper](https://doi.org/10.1145/3581576.3581577) +_Understanding DAOS Storage Performance Scalability_. + +## Reproducibility + +> Given the 4 possible reproducibility scores listed in the +> [reproducibility description](https://io500.org/the-lists#reproducibility-scores), +> what score do you believe your submission will be assigned? Please double check the definitions of each +> reproducibility level and ensure you have provided enough information to meet your expected +> score. +> +> * Undefined +> * Limited +> * Proprietary +> * Fully Reproducible + +This submission should be assigned the **Fully Reproducible** score. 
+ diff --git a/io500/sc25/tta/mantastorage/io500/Makefile b/io500/sc25/tta/mantastorage/io500/Makefile new file mode 100644 index 0000000..96aa27f --- /dev/null +++ b/io500/sc25/tta/mantastorage/io500/Makefile @@ -0,0 +1,62 @@ +CC = mpicc +CFLAGS += -std=gnu99 -Wall -Wempty-body -Wstrict-prototypes -Werror=maybe-uninitialized -Warray-bounds + +IORCFLAGS = $(shell grep CFLAGS ./build/ior/src/build.conf | cut -d "=" -f 2-) +CFLAGS += -g3 -lefence -I./include/ -I./src/ -I./build/pfind/src/ -I./build/ior/src/ +IORLIBS = $(shell grep LDFLAGS ./build/ior/src/build.conf | cut -d "=" -f 2-) +LDFLAGS += -lm $(IORCFLAGS) $(IORLIBS) # -lgpfs # may need some additional flags as provided to IOR +LDFLAGS += -ldaos -ldaos_common -ldfs -lgurt -luuid + +VERSION_GIT=$(shell git describe --always --abbrev=12) +VERSION_TREE=$(shell git diff src | wc -l | sed -e 's/ *//g' -e 's/^0//' | sed "s/\([0-9]\)/-\1/") +VERSION=$(VERSION_GIT)$(VERSION_TREE) +CFLAGS += -DVERSION="\"$(VERSION)\"" +PROGRAM = io500 +VERIFIER = io500-verify +SEARCHPATH += src +SEARCHPATH += include +SEARCHPATH += test +vpath %.c $(SEARCHPATH) +vpath %.h $(SEARCHPATH) +.SUFFIXES: + +DEPS += io500-util.h io500-debug.h io500-opt.h +OBJSO += util.o +OBJSO += ini-parse.o phase_dbg.o phase_opt.o phase_timestamp.o +OBJSO += phase_find.o phase_find_easy.o phase_find_hard.o phase_ior_easy.o phase_ior_easy_read.o phase_mdtest.o phase_ior.o phase_ior_easy_write.o phase_ior_hard.o phase_ior_hard_read.o phase_ior_hard_write.o phase_mdtest_easy.o phase_mdtest_easy_delete.o phase_mdtest_easy_stat.o phase_mdtest_easy_write.o phase_mdtest_hard.o phase_mdtest_hard_delete.o phase_mdtest_hard_read.o phase_mdtest_hard_stat.o phase_mdtest_hard_write.o phase_ior_rnd1MB.o phase_ior_rnd4K.o phase_ior_rnd_write4K.o phase_ior_rnd_read4K.o phase_ior_rnd_write1MB.o phase_ior_rnd_read1MB.o phase_mdworkbench.o phase_mdworkbench_create.o phase_mdworkbench_delete.o phase_mdworkbench_bench.o phase_ior_rnd_read4k-easywrite.o + +OBJS = $(patsubst 
%,./build/%,$(OBJSO)) + +TESTS += ini-test +TESTSEXE = $(patsubst %,./build/%.exe,$(TESTS)) + +all: $(VERIFIER) $(PROGRAM) $(TESTSEXE) + +clean: + @echo CLEAN + @$(RM) ./build/*.o ./build/io500.a *.exe $(PROGRAM) + +./build/io500.a: $(OBJS) + @echo AR $@ + ar rcsT $@ $(OBJS) + +$(VERIFIER): ./build/verifier.o ./build/io500.a + @echo LD $@ + $(CC) -o $@ ./build/verifier.o ./build/io500.a $(LDFLAGS) + +$(PROGRAM): ./build/io500.a ./build/main.o + @echo LD $@ + $(CC) -o $@ ./build/main.o $(LDFLAGS) ./build/io500.a ./build/pfind/pfind.a ./build/ior/src/libaiori.a $(LDFLAGS) + +.PHONY: ./build/main.o +./build/main.o: main.c $(DEPS) + @echo CC $@ + $(CC) $(CFLAGS) -c -o $@ $< + +./build/%.o: %.c $(DEPS) + @echo CC $@ + $(CC) $(CFLAGS) -c -o $@ $< + +./build/%.exe: ./build/%.o $(DEPS) ./build/io500.a + @echo LD $@ + $(CC) -o $@ $< $(LDFLAGS) ./build/io500.a diff --git a/io500/sc25/tta/mantastorage/io500/config-all-dfs-rf0.ini b/io500/sc25/tta/mantastorage/io500/config-all-dfs-rf0.ini new file mode 100644 index 0000000..a29f1b2 --- /dev/null +++ b/io500/sc25/tta/mantastorage/io500/config-all-dfs-rf0.ini @@ -0,0 +1,282 @@ +# Supported and current values of the ini file: +[global] +# The directory where the IO500 runs +datadir = / +# The data directory is suffixed by a timestamp. Useful for running several IO500 tests concurrently. +timestamp-datadir = TRUE +# The result directory. +resultdir = ./results/dfs-rf0 +# The result directory is suffixed by a timestamp. Useful for running several IO500 tests concurrently. 
+timestamp-resultdir = TRUE +# The general API for the tests (to create/delete the datadir, extra options will be passed to IOR/mdtest) +api = DFS --dfs.pool=tank --dfs.cont=posix_container +# Purge the caches, this is useful for testing and needed for single node runs +drop-caches = FALSE +# Cache purging command, invoked before each I/O phase +drop-caches-cmd = sudo -n bash -c "echo 3 > /proc/sys/vm/drop_caches" +# Allocate the I/O buffers on the GPU +io-buffers-on-gpu = FALSE +# The verbosity level between 1 and 10 +verbosity = 1 +# Use the rules for the Student Cluster Competition +scc = FALSE +# Type of packet that will be created [timestamp|offset|incompressible|random] +dataPacketType = timestamp + +[debug] +# For a valid result, the stonewall timer must be set to the value according to the rules. If smaller INVALIDATES RUN; FOR DEBUGGING. +stonewall-time = 300 +# Pause between phases while in this directory lies a file with the phase name, e.g., easy-create. This can be useful for performance testing, e.g., of tiered storage. At the moment it INVALIDATES RUN; FOR DEBUGGING. 
+pause-dir = + +[ior-easy] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX +# Transfer size +transferSize = 1m +# Block size; must be a multiple of transferSize +blockSize = 992000m +# Create one file per process +filePerProc = TRUE +# Use unique directory per file per process +uniqueDir = FALSE +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[ior-easy-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[ior-rnd4K] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX +# Size of a random block, change only if explicitly allowed +blockSize = 1073741824 +# Run this phase +run = TRUE +# The verbosity level +verbosity = +# Prefill the file with this blocksize in bytes, e.g., 2097152 +randomPrefill = 0 + +[ior-rnd4K-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[mdtest-easy] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Files per proc +n = 2000000 +# Run this phase +run = TRUE + +[mdtest-easy-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[ior-rnd1MB] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX +# Size of a random block, change only if explicitly allowed +blockSize = 1073741824 +# Run this phase +run = TRUE +# The verbosity level +verbosity = +# Prefill the file with this blocksize in bytes, e.g., 2097152 +randomPrefill = 0 + +[ior-rnd1MB-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + 
+[mdworkbench] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Waiting time of an IO operation relative to runtime (1.0 is 100%%) +waitingTime = 0.0 +# Files to precreate per set (always 10 sets), this is normally dynamically determined +precreatePerSet = +# Files to run per iteration and set (always 10 sets), this is normally dynamically determined +filesPerProc = +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[mdworkbench-create] +# Run this phase +run = TRUE + +[timestamp] + +[find-easy] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = 448 +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = 20160 +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! 
+pfind-parallelize-single-dir-access-using-hashing = FALSE + +[ior-hard] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX --dfs.chunk_size=470080 +# Number of segments +segmentCount = 10000000 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[ior-hard-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX --dfs.chunk_size=470080 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE + +[mdtest-hard] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Files per proc +n = 500000 +# File limit per directory (MDTest -I flag) to overcome file system limitations INVALIDATES RUN; FOR DEBUGGING. +files-per-dir = +# Run this phase +run = TRUE + +[mdtest-hard-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[find] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = 448 +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = 20160 +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! 
+pfind-parallelize-single-dir-access-using-hashing = FALSE + +[ior-rnd4K-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[ior-rnd1MB-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[find-hard] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = 448 +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = 20160 +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! +pfind-parallelize-single-dir-access-using-hashing = FALSE + +[mdworkbench-bench] +# Run this phase +run = TRUE + +[ior-easy-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE + +[mdtest-easy-stat] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[ior-hard-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX --dfs.chunk_size=470080 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE + +[mdtest-hard-stat] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[mdworkbench-delete] +# Run this phase +run = TRUE + +[mdtest-easy-delete] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + 
+[mdtest-hard-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[mdtest-hard-delete] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Run this phase +run = TRUE + +[ior-rnd4K-easy-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=S1 --dfs.oclass=SX +# Run this phase +run = TRUE diff --git a/io500/sc25/tta/mantastorage/io500/config-all-dfs-rf1.ini b/io500/sc25/tta/mantastorage/io500/config-all-dfs-rf1.ini new file mode 100644 index 0000000..a002dbd --- /dev/null +++ b/io500/sc25/tta/mantastorage/io500/config-all-dfs-rf1.ini @@ -0,0 +1,282 @@ +# Supported and current values of the ini file: +[global] +# The directory where the IO500 runs +datadir = / +# The data directory is suffixed by a timestamp. Useful for running several IO500 tests concurrently. +timestamp-datadir = TRUE +# The result directory. +resultdir = ./results/dfs-rf1 +# The result directory is suffixed by a timestamp. Useful for running several IO500 tests concurrently. +timestamp-resultdir = TRUE +# The general API for the tests (to create/delete the datadir, extra options will be passed to IOR/mdtest) +api = DFS --dfs.pool=tank --dfs.cont=posix_container +# Purge the caches, this is useful for testing and needed for single node runs +drop-caches = FALSE +# Cache purging command, invoked before each I/O phase +drop-caches-cmd = sudo -n bash -c "echo 3 > /proc/sys/vm/drop_caches" +# Allocate the I/O buffers on the GPU +io-buffers-on-gpu = FALSE +# The verbosity level between 1 and 10 +verbosity = 1 +# Use the rules for the Student Cluster Competition +scc = FALSE +# Type of packet that will be created [timestamp|offset|incompressible|random] +dataPacketType = timestamp + +[debug] +# For a valid result, the stonewall timer must be set to the value according to the rules. 
If smaller INVALIDATES RUN; FOR DEBUGGING. +stonewall-time = 300 +# Pause between phases while in this directory lies a file with the phase name, e.g., easy-create. This can be useful for performance testing, e.g., of tiered storage. At the moment it INVALIDATES RUN; FOR DEBUGGING. +pause-dir = + +[ior-easy] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Transfer size +transferSize = 1m +# Block size; must be a multiple of transferSize +blockSize = 992000m +# Create one file per process +filePerProc = TRUE +# Use unique directory per file per process +uniqueDir = FALSE +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[ior-easy-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Run this phase +run = TRUE + +[ior-rnd4K] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX +# Size of a random block, change only if explicitly allowed +blockSize = 1073741824 +# Run this phase +run = TRUE +# The verbosity level +verbosity = +# Prefill the file with this blocksize in bytes, e.g., 2097152 +randomPrefill = 0 + +[ior-rnd4K-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX +# Run this phase +run = TRUE + +[mdtest-easy] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Files per proc +n = 1000000 +# Run this phase +run = TRUE + +[mdtest-easy-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[ior-rnd1MB] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Size of a random block, change only if explicitly 
allowed +blockSize = 1073741824 +# Run this phase +run = TRUE +# The verbosity level +verbosity = +# Prefill the file with this blocksize in bytes, e.g., 2097152 +randomPrefill = 0 + +[ior-rnd1MB-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Run this phase +run = TRUE + +[mdworkbench] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=SX --dfs.oclass=S1 +# Waiting time of an IO operation relative to runtime (1.0 is 100%%) +waitingTime = 0.0 +# Files to precreate per set (always 10 sets), this is normally dynamically determined +precreatePerSet = +# Files to run per iteration and set (always 10 sets), this is normally dynamically determined +filesPerProc = +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[mdworkbench-create] +# Run this phase +run = TRUE + +[timestamp] + +[find-easy] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = 448 +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = 20000 +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! 
+pfind-parallelize-single-dir-access-using-hashing = FALSE + +[ior-hard] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX --dfs.chunk_size=470080 +# Number of segments +segmentCount = 10000000 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE +# The verbosity level +verbosity = + +[ior-hard-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX --dfs.chunk_size=470080 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE + +[mdtest-hard] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Files per proc +n = 250000 +# File limit per directory (MDTest -I flag) to overcome file system limitations INVALIDATES RUN; FOR DEBUGGING. +files-per-dir = +# Run this phase +run = TRUE + +[mdtest-hard-write] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[find] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = 448 +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = 20000 +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! 
+pfind-parallelize-single-dir-access-using-hashing = FALSE + +[ior-rnd4K-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX +# Run this phase +run = TRUE + +[ior-rnd1MB-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Run this phase +run = TRUE + +[find-hard] +# Set to an external script to perform the find phase +external-script = +# Startup arguments for external scripts, some MPI's may not support this! +external-mpi-args = +# Extra arguments for the external scripts +external-extra-args = +# Set the number of processes for pfind/the external script +nproc = 448 +# Run this phase +run = TRUE +# Pfind queue length +pfind-queue-length = 20000 +# Pfind Steal from next +pfind-steal-next = FALSE +# Parallelize the readdir by using hashing. Your system must support this! +pfind-parallelize-single-dir-access-using-hashing = FALSE + +[mdworkbench-bench] +# Run this phase +run = TRUE + +[ior-easy-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=EC_2P1GX +# Run this phase +run = TRUE + +[mdtest-easy-stat] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[ior-hard-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX --dfs.chunk_size=470080 +# Collective operation (for supported backends) +collective = +# Run this phase +run = TRUE + +[mdtest-hard-stat] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[mdworkbench-delete] +# Run this phase +run = TRUE + +[mdtest-easy-delete] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX 
--dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[mdtest-hard-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[mdtest-hard-delete] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2GX --dfs.oclass=RP_2G1 +# Run this phase +run = TRUE + +[ior-rnd4K-easy-read] +# The API to be used +API = DFS --dfs.pool=tank --dfs.cont=posix_container --dfs.dir_oclass=RP_2G1 --dfs.oclass=RP_2GX +# Run this phase +run = TRUE diff --git a/io500/sc25/tta/mantastorage/io500/prepare.sh b/io500/sc25/tta/mantastorage/io500/prepare.sh new file mode 100755 index 0000000..ef41a12 --- /dev/null +++ b/io500/sc25/tta/mantastorage/io500/prepare.sh @@ -0,0 +1,119 @@ +#!/bin/bash -e + +set -e + +echo This script downloads the code for the benchmarks +echo It will also attempt to build the benchmarks +echo It will output OK at the end if builds succeed +echo +IOR_HASH=8ab8f69b32b919 +#PFIND_HASH=aaba722a178 +PFIND_HASH=dfs_find + +INSTALL_DIR=$PWD +BIN=$INSTALL_DIR/bin +BUILD=$PWD/build +MAKE="make -j${NPROC:-$(nproc 2> /dev/null || echo 4)}" # handle missing nproc + +function main { + # listed here, easier to spot and run if something fails + setup + + get_schema_tools + get_ior + get_pfind + + build_ior + build_pfind + build_io500 + + echo + echo "OK: All required software packages are now prepared" + ls "$BIN" +} + +function setup { + #rm -rf $BUILD $BIN + mkdir -p "$BUILD" "$BIN" + #cp utilities/find/mmfind.sh $BIN +} + +function git_co { + local repo=$1 + local dir=$2 + local tag=$3 + + pushd "$BUILD" + [ -d "$dir" ] || git clone "$repo" "$dir" + cd "$dir" + git fetch + if [ -n "$tag" ]; then git checkout "$tag"; fi + popd +} + +###### GET FUNCTIONS +function get_ior { + local ior_dir="ior" + if [ -d "$ior_dir" ]; then + echo "IOR already exists. Skipping download." 
+ else + echo "Getting IOR and mdtest" + git_co https://github.com/hpc/ior.git "$ior_dir" $IOR_HASH + fi +} + +function get_pfind { + local pfind_dir="pfind" + if [ -d "$pfind_dir" ]; then + echo "Parallel find already exists. Skipping download." + else + echo "Preparing parallel find" + #git_co https://github.com/VI4IO/pfind.git "$pfind_dir" $PFIND_HASH + git_co https://github.com/mchaarawi/pfind.git "$pfind_dir" $PFIND_HASH + fi +} + +function get_schema_tools { + local schema_build_dir="build/cdcl-schema-tools" + local schema_dir="schema-tools" + if [ -d "$schema_build_dir" ] && [ -d "$schema_dir" ]; then + echo "Schema tools already exist. Skipping download." + else + echo "Downloading supplementary schema tools" + git_co https://github.com/VI4IO/cdcl-schema-tools.git cdcl-schema-tools + [ -d "$schema_dir" ] || ln -sf "$PWD"/build/cdcl-schema-tools schema-tools + fi +} + +###### BUILD FUNCTIONS +function build_ior { + pushd "$BUILD"/ior + ./bootstrap + # Add here extra flags + ./configure --prefix="$INSTALL_DIR" --with-daos --with-mpiio --with-hdf5 + cd src + $MAKE clean + $MAKE install + echo "IOR: OK" + echo + popd +} + +function build_pfind { + pushd "$BUILD"/pfind + ./prepare.sh + ./compile.sh + ln -sf "$BUILD"/pfind/pfind "$BIN"/pfind + echo "Pfind: OK" + echo + popd +} + +function build_io500 { + $MAKE + echo "io500: OK" + echo +} + +###### CALL MAIN +main diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/daos/daos_agent.yml new file mode 100644 index 0000000..4045f91 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/daos/daos_agent.yml @@ -0,0 +1,183 @@ +# DAOS agent configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the daos_agent command line. +# Otherwise, /etc/daos/daos_agent.yml is used.
+# +# Section describing the daos_agent configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. +# +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. +# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. +# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. +# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. +# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. 
+#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. +# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley + +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. +# cache_expiration: 1m + +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: +# # In order to disable transport security, uncomment and set allow_insecure +# # to true. Not recommended for production configurations. + allow_insecure: false + + # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt + # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt + # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. 
If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +runtime_dir: /var/run/daos_agent + +# Full path and name of the DAOS agent logfile. +# default: print to stderr +log_file: /var/log/daos/daos_agent.log + +# Force specific debug mask for daos_agent (control plane). +# Mask specifies minimum level of message significance to pass to logger. +# Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. + +# default: INFO +control_log_mask: DEBUG + +# Disable automatic eviction of open pool handles on agent shutdown. By default, +# the agent will evict all open pool handles for local processes on shutdown. +# Note that this implies that stopping or restarting the agent will result +# in interruption of DAOS I/O for any local DAOS client processes that have +# an open pool handle. +# default: false +disable_auto_evict: true + +# If enabled, the agent will evict any open pool handles associated with this machine on agent +# startup. This allows the servers to reclaim resources that may not have been properly cleaned +# up in the event of an agent or machine crash. +# default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). 
+# +#exclude_fabric_ifaces: ["lo", "eth1"] + +# Conversely, only consider a specific set of fabric interfaces when selecting +# an interface for client applications. (Mutually exclusive with exclude). + +include_fabric_ifaces: ["ibs1", "ibs769"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. +# +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs1 + domain: mlx5_0 +- + numa_node: 1 + devices: + - + iface: ibs769 + domain: mlx5_1 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/daos/daos_control.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/daos/daos_control.yml new file mode 100644 index 0000000..f3f2c79 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/daos/daos_control.yml @@ -0,0 +1,50 @@ +# DAOS manager (dmg) configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the dmg command line. +# Otherwise, /etc/daos/daos_control.yml is used. +# +# Section describing the DAOS manager (dmg) configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. +# +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: Changing the name is not supported yet, it must be daos_server +# +# default: daos_server +#name: daos_server +name: daos_server + +# Default destination port to use when connecting to hosts in the hostlist. +# default: 10001 +port: 10001 + +# Hostlist, a comma separated list of addresses (hostnames or IPv4 addresses). 
+# default: ['localhost'] +hostlist: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +## Transport Credentials Specifying certificates to secure communications + +transport_config: +# # In order to disable transport security, uncomment and set allow_insecure +# # to true. Not recommended for production configurations. +# allow_insecure: false + allow_insecure: false +# +# # Custom CA Root certificate for generated certs +# ca_cert: /etc/daos/certs/daosCA.crt + ca_cert: /etc/daos/certs/daosCA.crt +# # Admin certificate for use in TLS handshakes +# cert: /etc/daos/certs/admin.crt + cert: /etc/daos/certs/admin.crt +# # Key portion of Admin Certificate +# key: /etc/daos/certs/admin.key + key: /etc/daos/certs/admin.key diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/daos/daos_server.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/daos/daos_server.yml new file mode 100644 index 0000000..413ce7f --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/daos/daos_server.yml @@ -0,0 +1,675 @@ +## DAOS server configuration file. +# +## Location of this configuration file is determined by first checking for the +## path specified through the -o option of the daos_server command line. +## Otherwise, /etc/daos/daos_server.yml is used. +# +# +## Name associated with the DAOS system. +## Immutable after running "dmg storage format". +# +## NOTE: Changing the DAOS system name is not supported yet. +## It must not be changed from the default "daos_server". +# +## default: daos_server +name: daos_server + +## MS replicas +## Immutable after running "dmg storage format". +# +## To operate, DAOS requires a quorum of Management Service (MS) replica +## hosts to be available. All servers (replica or otherwise) must have the +## same list of replicas in order for the system to operate correctly. 
Choose +## 3-5 hosts to serve as replicas, preferably not co-located within the same +## fault domains. +## +## Hosts can be specified with or without port. The default port that is set +## up in port: will be used if a port is not specified here. +# +## default: hostname of this node +#mgmt_svc_replicas: ['hostname1', 'hostname2', 'hostname3'] +mgmt_svc_replicas: ["MantaStorage-01", "MantaStorage-02", "MantaStorage-03"] + +## Control plane metadata +## Immutable after running "dmg storage format". +# +## Mandatory if MD-on-SSD bdev device roles have been assigned. +## Define a directory or partition/mountpoint to be used as the storage location for +## control plane metadata. The location specified should be persistent across reboots. +# +control_metadata: +# # Directory to store control plane metadata. +# # If device is also defined, this path will be used as the mountpoint. +# path: /home/daos_server/control_meta + path: /var/daos/control_meta +# # Storage partition to be formatted with an ext4 filesystem and mounted for +# # control plane metadata storage. +# device: /dev/sdb1 + +## Default control plane port +# +## Port number to bind daos_server to. This will also be used when connecting +## to MS replicas, unless a port is specified in mgmt_svc_replicas: +# +## default: 10001 +port: 10001 + +## Transport credentials specifying certificates to secure communications +# +#transport_config: +# # In order to disable transport security, uncomment and set allow_insecure +# # to true. Not recommended for production configurations. 
+# allow_insecure: false +# +# # Location where daos_server will look for Client certificates +# client_cert_dir: /etc/daos/certs/clients +# # Custom CA Root certificate for generated certs +# ca_cert: /etc/daos/certs/daosCA.crt +# # Server certificate for use in TLS handshakes +# cert: /etc/daos/certs/server.crt +# # Key portion of Server Certificate +# key: /etc/daos/certs/server.key + +## Fault domain path +## Immutable after running "dmg storage format". +# +## default: /hostname for a local configuration w/o fault domain +#fault_path: /vcdu0/rack1/hostname + +## Fault domain callback +## Immutable after running "dmg storage format". +# +## Path to executable which will return fault domain string. +# +#fault_cb: ./.daos/fd_callback +#fault_cb: /etc/daos/daos_fd_callback.sh + +## Network provider +# +## Set the network provider to be used by all the engines. +## There is no default - run "daos_server network scan" to list the +## providers that are supported on the fabric interfaces. Examples: +## +## ofi+verbs;ofi_rxm for libfabric with Infiniband/RoCE +## ofi+tcp;ofi_rxm for libfabric with non-RDMA-capable fabrics +## +## (Starting with DAOS 2.2, ofi_rxm will be automatically added to the +## libfabric verbs and tcp providers, if not explicitly specified.) +# +#provider: ofi+verbs;ofi_rxm +provider: ofi+verbs;ofi_rxm + +## CART: global RPC timeout +## parameters shared with client. +# +#crt_timeout: 30 + +## CART: Disable SRX +## parameters shared with client. set it to true if network card +## does not support shared receive context, eg intel E810-C. +# +#disable_srx: false + +## CART: Fabric authorization key +## If the fabric requires an authorization key, set it here to +## be used on the server and clients. +# +#fabric_auth_key: foo:bar + +## Core Dump Filter +## Optional filter to control which mappings are written to the core +## dump in the event of a crash. 
See the following URL for more detail: +## https://man7.org/linux/man-pages/man5/core.5.html +# +#core_dump_filter: 0x13 + +## NVMe SSD exclusion list +## Immutable after running "dmg storage format". +# +## Only use NVMe controllers with specific PCI addresses. +## Excludes drives listed and forces auto-detection to skip those drives. +## default: Use all the NVMe SSDs that don't have active mount points. +# +#bdev_exclude: ["0000:81:00.1"] +bdev_exclude: ["0000:59:00.0"] + +## Disable VFIO Driver +# +## In some circumstances it may be preferable to force SPDK to use the UIO +## driver for NVMe device access even though an IOMMU is available. +## NOTE: Use of the UIO driver requires that daos_server must run as root. +# +## default: false +#disable_vfio: true + +## Disable VMD Usage +# +## VMD (Intel Volume Management Devices) is enabled by default but can be +## optionally disabled in which case VMD backing devices will not be visible. +# +## VMD needs to be available and configured in the system BIOS before it +## can be used. The main use case for VMD is managing NVMe SSD LED activity. +# +## default: false +#disable_vmd: true +disable_vmd: true + +## Disable NVMe SSD Hotplug +# +## NVMe SSD hotplug is enabled by default but can be optionally disabled. +## When enabled, the I/O engine will periodically check for device hot +## plug/remove events, and set up or tear down the device automatically. +# +## default: false +#disable_hotplug: true + +## Use Hyperthreads +# +## When Hyperthreading is enabled and supported on the system, this parameter +## defines whether the DAOS service should try to take advantage of +## hyperthreading to schedule different tasks on each hardware thread. +## Not supported yet. +# +## default: false +hyperthreads: true + +## Use the given directory for creating unix domain sockets +# +## DAOS Agent and DAOS Server both use unix domain sockets for communication +## with other system components.
This setting is the base location to place +## the sockets in. +# +## NOTE: Do not change this when running under systemd control. If it needs to +## be changed, then make sure that it matches the RuntimeDirectory setting +## in /usr/lib/systemd/system/daos_server.service +# +## default: /var/run/daos_server +# +#socket_dir: ./.daos/daos_server + +## Number of hugepages to allocate for DMA buffer memory +# +## Optional parameter that should only be set if overriding the automatically calculated value is +## necessary. Specifies the number (not size) of hugepages to allocate for use by NVMe through +## SPDK. For optimum performance each target requires 1 GiB of hugepage space. The provided value +## should be calculated by dividing the total amount of hugepage memory required for all targets +## across all engines on a host by the system hugepage size. If not set here, the value will be +## automatically calculated based on the number of targets (using the default system hugepage size). +# +## Example: (2 engines * (16 targets/engine * 1GiB)) / 2MiB hugepage size = 16384 +# +## default: 0 +#nr_hugepages: 0 +#nr_hugepages: 25600 + +## Hugepages are mandatory with NVMe SSDs configured and optional without. +## To disable the use of hugepages when no NVMe SSDs are configured, set disable_hugepages to true. +# +## default: false +disable_hugepages: false + +## Hugepages will be applied across NUMA-nodes based on engine affinity. Typical scenarios where an +## equal number of engines exist across a number of NUMA-nodes or all engines on a single NUMA-node +## will be handled but if it's expected that there are a different number of engines on multiple +## NUMA-nodes e.g. two engines on NUMA-0 and one engine on NUMA-1, then explicitly setting this flag +## will enable engine start-up with an imbalanced configuration.
+# +## default: false +#allow_numa_imbalance: false + +## Reserve an amount of RAM for system use when calculating the size of RAM-disks that will be +## created for DAOS I/O engines. The value is in GiB and represents the total RAM that will be +## reserved when calculating RAM-disk sizes for all engines. +# +## Optional parameter that should only be set if the automatically calculated value is unsuitable. +## In situations when a host is running applications alongside DAOS that use a significant amount +## of RAM, resulting in the MemAvailable value being too low to support the calculated RAM-disk size, +## increasing the value will reduce the calculated size. Alternatively, in situations where total +## RAM is low, reducing the value may prevent problems where the calculated RAM-disk size is below the +## minimum of 4 GiB. Increasing the value may help avoid the potential of OOM killer terminating +## engine processes but could also result in stopping DAOS from using available memory resources. +# +## default: 26 +system_ram_reserved: 8 + +## Set specific debug mask for daos_server (control plane). +## The mask specifies minimum level of message significance to pass to logger. +## Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. +# +## default: INFO +#control_log_mask: ERROR + +## Force specific path for daos_server (control plane) logs. +# +## default: print to stderr +#control_log_file: /var/log/daos/daos_server.log +control_log_file: /var/log/daos/daos_server.log + +## Enable daos_server_helper (privileged helper) logging. +# +## default: disabled (errors only to control_log_file) +#helper_log_file: /var/log/daos/daos_server_helper.log +helper_log_file: /var/log/daos/daos_server_helper.log + +## Enable daos_firmware_helper (privileged helper) logging.
+# +## default: disabled (errors only to control_log_file) +#firmware_helper_log_file: /var/log/daos/daos_firmware_helper.log +firmware_helper_log_file: /var/log/daos/daos_firmware_helper.log + +## Enable HTTP endpoint for remote telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9191 +telemetry_port: 9191 + +## If desired, a set of client-side environment variables may be +## defined here. Note that these are intended to be defaults and +## may be overridden by manually-set environment variables when +## the client application is launched. +client_env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_CREDIT_EP_CTX=0 + - CRT_MRC_ENABLE=1 + - CRT_TIMEOUT=600 + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16383 + - FI_VERBS_INLINE_SIZE=128 + - FI_VERBS_PREFER_XRC=1 + +## When per-engine definitions exist, auto-allocation of resources is not +## performed. Without per-engine definitions, node resources will +## automatically be assigned to engines based on NUMA ratings. +## There will be a one-to-one relationship between engines and sockets. +# +engines: +- + # Number of I/O service threads (and network endpoints) per engine. + # Immutable after running "dmg storage format". + # + # Each storage target manages a fraction of the (interleaved) SCM storage space, + # and a fraction of one of the NVMe SSDs that are managed by this engine. + # For optimal balance regarding the NVMe space, the number of targets should be + # an integer multiple of the number of NVMe disks configured in bdev_list: + # To obtain the maximum SCM performance, a certain number of targets is needed. + # This is device- and workload-dependent, but around 16 targets usually work well. + # + # The server should have sufficiently many physical cores to support the + # number of targets, plus the additional service threads. + # + # Besides the number of CPU cores, this target number is limited by the number of + # SSDs listed in the bdev_list as well. 
One NVMe SSD in the bdev_list can be assigned + # to at most 64 targets in pmem mode, or 21 targets in md-on-ssd mode (when the SSD + # has three roles), so the maximum target number will be 'number_of_SSD * 64' in + # pmem mode or 'number_of_SSD * 21' in md-on-ssd mode. + + targets: 16 + + # Number of additional offload service threads per engine. + # Immutable after running "dmg storage format". + # + # Helper threads to accelerate checksum and server-side RPC dispatch. + # When using EC, it is recommended to configure helper threads in + # roughly a 1:4 ratio to the number of target threads. For example, + # when using 16 targets it is recommended to set nr_xs_helpers to 4. + # + # The server should have sufficiently many physical cores to support the + # number of helper threads, plus the number of targets. + # + # default: 0 (using existing target threads for this task) + + nr_xs_helpers: 4 + + # Pin this engine instance to cores and memory that are local to the + # NUMA node ID specified with this value. + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the pinned_numa_node. + # + # Optional parameter; set either this option or first_core, but not both. + + pinned_numa_node: 0 + + # Offset of the first core to be used for I/O service threads (targets). + # Immutable after running "dmg storage format". + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the first_core. + # + # Optional parameter; set either this option or pinned_numa_node but not both. + + #first_core: 2 + + # A boolean that instructs the I/O Engine instance to bypass the NVMe + # health check. This eliminates the check and related log output for those + # systems with NVMe that do not support the device health data query. + + bypass_health_chk: true + + # Specify the fabric network interface used for this engine. 
+ + fabric_iface: ibs1 + + # Specify the fabric network interface port that will be used by this engine. + # The fabric_iface_port must be different for each engine on a DAOS server + # if each engine is assigned to the same fabric_iface. + + fabric_iface_port: 20000 + + # Force specific debug mask for the engine at start up time. + # By default, just use the default debug mask used by DAOS. + # Mask specifies minimum level of message significance to pass to logger. + + # default: ERR + log_mask: INFO + + # Force specific path for DAOS debug logs. + + # default: engine log goes to control_log_file + log_file: /var/log/daos/daos_engine.0.log + + # Pass specific environment variables to the engine process. + # Empty by default. Values should be supplied without encapsulating quotes. + + env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_TIMEOUT=600 + - CRT_CREDIT_EP_CTX=0 + - DAOS_MD_CAP=4096 + - DAOS_DTX_AGG_THD_CNT=16777216 + - DAOS_DTX_AGG_THD_AGE=1200 + - FI_CXI_CQ_FILL_PERCENT=20 + - FI_CXI_OFLOW_BUF_SIZE=8388608 + - FI_CXI_REQ_BUF_MIN_POSTED=8 + - FI_CXI_REQ_BUF_SIZE=8388608 + - FI_CXI_RX_MATCH_MODE=hybrid + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16384 + - FI_VERBS_PREFER_XRC=1 + - UCX_IB_FORK_INIT=n + + storage: + - + # Define a pre-configured mountpoint for storage class memory to be used + # by this engine. + # Path should be unique to engine instance (can use different subdirs). + # Either the specified directory or its parent must be a mount point. + + scm_mount: /mnt/daos/1 + + # Backend SCM device type. Either use PMem (Intel(R) Optane(TM) persistent + # memory) modules configured in interleaved mode or a tmpfs running in RAM. + # Options are: + # - "dcpm" for SCM, scm_size is ignored + # - "ram" to use tmpfs, scm_list is ignored + # Immutable after running "dmg storage format". + + class: ram + + # When class is set to ram, tmpfs will be used instead of dcpm. 
+ # The size of ram is specified by scm_size in GB units and will be automatically calculated + # unless overridden by this optional parameter (units in GiB). + + #scm_size: 0 + scm_size: 160 + + # When class is set to ram, tmpfs will be mounted with hugepage + # support, if the kernel supports it. If this is not desirable, + # the behavior may be disabled here. + scm_hugepages_disabled: false + + - + # Backend block device type. Force a SPDK driver to be used by this engine + # instance. + # Options are: + # - "nvme" for NVMe SSDs (preferred option), bdev_size ignored + # - "file" to emulate a NVMe SSD with a regular file + # - "kdev" to use a kernel block device, bdev_size ignored + # Immutable after running "dmg storage format". + class: nvme + + # Backend block device configuration to be used by this engine instance. + # When class is set to nvme, bdev_list is the list of unique NVMe IDs + # that should be different across different engine instances. + # Immutable after running "dmg storage format". + bdev_list: [ + "0000:5b:00.0", + "0000:5c:00.0", + "0000:6c:00.0", + "0000:6d:00.0", + "0000:6e:00.0", + "0000:6f:00.0", + ] + + # If VMD-enabled NVMe SSDs are used, the bdev_list should consist of the VMD + # PCIe addresses, and not the BDF format transport IDs of the backing NVMe SSDs + # behind the VMD address. Also, 'disable_vmd' needs to be set to false. + #bdev_list: ["0000:5d:05.5"] + + # Optional override, will be automatically generated based on NUMA affinity. + # Filter hot-pluggable devices by PCI bus-ID by specifying a hexadecimal + # range. Hotplug events relating to devices with PCI bus-IDs outside this range + # will not be processed by this engine. Empty or unset range signifies allow all. + #bdev_busid_range: 0x80-0x8f + #bdev_busid_range: 128-143 + + # Optional explicit nvme-class bdev tier role assignments will + # define the roles and responsibilities of this bdev tier. 
+ # If DCPM class is defined for the first tier, + # only one bdev tier is supported and its role must be data. + + # Roles will be derived based on configured bdev + # tiers, if not specified here. You must assign all roles or none. + # Options are: + # - "data" SSDs will be used to store actual data + # - "meta" SSDs will be used to store the VOS metadata + # - "wal" SSDs will be used to store the write-ahead-log + bdev_roles: ['data', 'meta', 'wal'] + + # Set criteria for automatic detection and eviction of faulty NVMe devices. The + # default criteria parameters are `enable: true`, `max_io_errs: 10` and + # `max_csum_errs: ` (essentially eviction due to checksum errors is + # disabled by default). + bdev_auto_faulty: + enable: true + max_io_errs: 100 + max_csum_errs: 200 + +- + # Number of I/O service threads (and network endpoints) per engine. + # Immutable after running "dmg storage format". + # + # Each storage target manages a fraction of the (interleaved) SCM storage space, + # and a fraction of one of the NVMe SSDs that are managed by this engine. + # For optimal balance regarding the NVMe space, the number of targets should be + # an integer multiple of the number of NVMe disks configured in bdev_list: + # To obtain the maximum SCM performance, a certain number of targets is needed. + # This is device- and workload-dependent, but around 16 targets usually work well. + # + # The server should have sufficiently many physical cores to support the + # number of targets, plus the additional service threads. + + targets: 16 + + # Number of additional offload service threads per engine. + # Immutable after running "dmg storage format". + # + # Helper threads to accelerate checksum and server-side RPC dispatch. + # + # The server should have sufficiently many physical cores to support the + # number of helper threads, plus the number of targets. 
+ + nr_xs_helpers: 4 + + # Pin this engine instance to cores and memory that are local to the + # NUMA node ID specified with this value. + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the pinned_numa_node. + # + # Optional parameter; set either this option or first_core, but not both. + + pinned_numa_node: 1 + + # Offset of the first core to be used for I/O service threads (targets). + # Immutable after running "dmg storage format". + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the first_core. + # + # Optional parameter; set either this option or pinned_numa_node but not both. + + #first_core: 64 + + # A boolean that instructs the I/O Engine instance to bypass the NVMe + # health check. This eliminates the check and related log output for those + # systems with NVMe that do not support the device health data query. + + bypass_health_chk: true + + # Specify the fabric network interface used for this engine. + + fabric_iface: ibs769 + + # Specify the fabric network interface port that will be used by this engine. + # The fabric_iface_port must be different for each engine on a DAOS server + # if each engine is assigned to the same fabric_iface. + + fabric_iface_port: 21000 + + # Force specific debug mask for the engine at start up time. + # By default, just use the default debug mask used by DAOS. + # Mask specifies minimum level of message significance to pass to logger. + + # default: ERR + log_mask: INFO + + # Force specific path for DAOS debug logs. + + # default: engine log goes to control_log_file + log_file: /var/log/daos/daos_engine.1.log + + # Pass specific environment variables to the engine process. + # Empty by default. Values should be supplied without encapsulating quotes. 
+ + env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_TIMEOUT=600 + - CRT_CREDIT_EP_CTX=0 + - DAOS_MD_CAP=4096 + - DAOS_DTX_AGG_THD_CNT=16777216 + - DAOS_DTX_AGG_THD_AGE=1200 + - FI_CXI_CQ_FILL_PERCENT=20 + - FI_CXI_OFLOW_BUF_SIZE=8388608 + - FI_CXI_REQ_BUF_MIN_POSTED=8 + - FI_CXI_REQ_BUF_SIZE=8388608 + - FI_CXI_RX_MATCH_MODE=hybrid + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16384 + - FI_VERBS_PREFER_XRC=1 + - UCX_IB_FORK_INIT=n + + storage: + - + # Define a pre-configured mountpoint for storage class memory to be used + # by this engine. + # Path should be unique to engine instance (can use different subdirs). + # Either the specified directory or its parent must be a mount point. + + scm_mount: /mnt/daos/2 + + # Backend SCM device type. Either use PMem (Intel(R) Optane(TM) persistent + # memory) modules configured in interleaved mode or a tmpfs running in RAM. + # Options are: + # - "dcpm" for SCM, scm_size is ignored + # - "ram" to use tmpfs, scm_list is ignored + # Immutable after running "dmg storage format". + + class: ram + + # When class is set to ram, tmpfs will be used instead of dcpm. + # The size of ram is specified by scm_size in GB units and will be automatically calculated + # unless overridden by this optional parameter (units in GiB). + + #scm_size: 0 + scm_size: 160 + + # When class is set to ram, tmpfs will be mounted with hugepage + # support, if the kernel supports it. If this is not desirable, + # the behavior may be disabled here. + scm_hugepages_disabled: false + + # When class is set to dcpm, scm_list is the list of device paths for + # PMem namespaces (currently only one per engine supported). + #class: dcpm + #scm_list: [/dev/pmem1] + + - + # Backend block device type. Force a SPDK driver to be used by this engine + # instance. 
+ # Options are: + # - "nvme" for NVMe SSDs (preferred option), bdev_size ignored + # - "file" to emulate a NVMe SSD with a regular file + # - "kdev" to use a kernel block device, bdev_size ignored + # Immutable after running "dmg storage format". + + # When class is set to file, Linux AIO will be used to emulate NVMe. + # The size of the file that will be created is specified by bdev_size in GB units. + # The location of the files that will be created is specified in bdev_list. + #class: file + #bdev_list: [/tmp/daos-bdev1,/tmp/daos-bdev2] + #bdev_size: 16 + + # When class is set to kdev, bdev_list is the list of unique kernel + # block devices that should be different across different engine instances. + #class: kdev + #bdev_list: [/dev/sdc,/dev/sdd] + + # If Volume Management Devices (VMD) are to be used, then the disable_vmd + # flag needs to be set to false (default). The class will remain the + # default "nvme" type, and bdev_list will include the VMD addresses. + #class: nvme + #bdev_list: ["0000:5d:05.5"] + + class: nvme + bdev_list: [ + "0000:d8:00.0", + "0000:d9:00.0", + "0000:da:00.0", + "0000:db:00.0", + "0000:ea:00.0", + "0000:eb:00.0", + ] + + # Optional override, will be automatically generated based on NUMA affinity. + # Filter hot-pluggable devices by PCI bus-ID by specifying a hexadecimal + # range. Hotplug events relating to devices with PCI bus-IDs outside this range + # will not be processed by this engine. Empty or unset range signifies allow all. + #bdev_busid_range: 0xd0-0xdf + #bdev_busid_range: 208-223 + + # See the bdev_roles description above. + bdev_roles: ['data', 'meta', 'wal'] + + # Set criteria for automatic detection and eviction of faulty NVMe devices. The default + # criteria parameters are `enable: true`, `max_io_errs: 10` and + # `max_csum_errs: ` (essentially eviction due to checksum errors is + # disabled by default).
+ + bdev_auto_faulty: + enable: true + max_io_errs: 100 + max_csum_errs: 200 + +#support_config: +# +## Override the default file transfer mechanism for dmg support collect-log by supplying +## the path to an alternative script or binary that will be used to copy logs off the servers. +## Note that the --transfer-args flag to the collect-log command may be used to supply +## extra runtime arguments used by the copy tool (e.g. a cloud bucket name, etc.) +## file_transfer_exec: /usr/bin/rsync (example) +# +# file_transfer_exec: diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..c407299 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,41 @@ + ############################################################################## + # Multi-rail + ############################################################################## + net.ipv4.conf.ibs1.accept_local=1 + net.ipv4.conf.ibs1.rp_filter=2 + net.ipv4.conf.ibs1.arp_ignore=2 + net.ipv4.conf.ibs1.arp_announce=0 + net.ipv4.conf.ibs1.arp_filter=0 + + net.ipv4.conf.ibs769.accept_local=1 + net.ipv4.conf.ibs769.rp_filter=2 + net.ipv4.conf.ibs769.arp_ignore=2 + net.ipv4.conf.ibs769.arp_announce=0 + net.ipv4.conf.ibs769.arp_filter=0 + + ############################################################################## + # Common tuning + ############################################################################## + net.core.netdev_max_backlog = 250000 + #net.core.rmem_max = 16777216 + net.core.rmem_max = 134217728 + #net.core.wmem_max = 16777216 + net.core.wmem_max = 134217728 + net.core.rmem_default = 16777216 + net.core.wmem_default = 16777216 + net.core.optmem_max = 16777216 + + net.ipv4.neigh.default.gc_thresh1 = 1024 + net.ipv4.neigh.default.gc_thresh2 = 4096 + net.ipv4.neigh.default.gc_thresh3 = 16384 + + 
net.ipv4.tcp_timestamps = 1 + net.ipv4.tcp_sack = 1 + net.ipv4.tcp_mem = 16777216 16777216 16777216 + net.ipv4.tcp_rmem = 4096 1677216 134217728 + net.ipv4.tcp_wmem = 4096 1677216 134217728 + net.ipv4.tcp_low_latency = 1 + net.ipv4.tcp_mtu_probing = 1 + net.ipv4.tcp_congestion_control=bbr + net.core.default_qdisc=fq + net.ipv4.tcp_slow_start_after_idle=0 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/sysctl.d/99-daos-vm.conf b/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/sysctl.d/99-daos-vm.conf new file mode 100644 index 0000000..bc13d41 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-01/etc/sysctl.d/99-daos-vm.conf @@ -0,0 +1 @@ + vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/daos/daos_agent.yml new file mode 100644 index 0000000..89ad165 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/daos/daos_agent.yml @@ -0,0 +1,183 @@ +# DAOS agent configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the daos_agent command line. +# Otherwise, /etc/daos/daos_agent.yml is used. +# +# Section describing the daos_agent configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. +# +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. 
+# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. +# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. +# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. +# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. +#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. +# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley +# +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. 
If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. +# cache_expiration: 1m + +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: + # In order to disable transport security, uncomment and set allow_insecure + # to true. Not recommended for production configurations. + allow_insecure: false + + # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt + # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt + # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +#runtime_dir: /var/run/daos_agent + +# Full path and name of the DAOS agent logfile. +# default: print to stderr +log_file: /var/log/daos/daos_agent.log + +# Force specific debug mask for daos_agent (control plane). +# Mask specifies minimum level of message significance to pass to logger. +# Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. + +# default: INFO +control_log_mask: DEBUG + +# Disable automatic eviction of open pool handles on agent shutdown. By default, +# the agent will evict all open pool handles for local processes on shutdown. 
+# Note that this implies that stopping or restarting the agent will result +# in interruption of DAOS I/O for any local DAOS client processes that have +# an open pool handle. +# default: false +disable_auto_evict: true + +# If enabled, the agent will evict any open pool handles associated with this machine on agent +# startup. This allows the servers to reclaim resources that may not have been properly cleaned +# up in the event of an agent or machine crash. +# default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). +# +#exclude_fabric_ifaces: ["lo", "eth1"] + +# Conversely, only consider a specific set of fabric interfaces when selecting +# an interface for client applications. (Mutually exclusive with exclude). + +#include_fabric_ifaces: ["ibs1", "ibs769"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. 
+# +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs1 + domain: mlx5_0 +- + numa_node: 1 + devices: + - + iface: ibs769 + domain: mlx5_1 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/daos/daos_control.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/daos/daos_control.yml new file mode 100644 index 0000000..f3f2c79 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/daos/daos_control.yml @@ -0,0 +1,50 @@ +# DAOS manager (dmg) configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the dmg command line. +# Otherwise, /etc/daos/daos_control.yml is used. +# +# Section describing the DAOS manager (dmg) configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. +# +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: Changing the name is not supported yet, it must be daos_server +# +# default: daos_server +#name: daos_server +name: daos_server + +# Default destination port to use when connecting to hosts in the hostlist. +# default: 10001 +port: 10001 + +# Hostlist, a comma separated list of addresses (hostnames or IPv4 addresses). +# default: ['localhost'] +hostlist: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +## Transport Credentials Specifying certificates to secure communications + +transport_config: +# # In order to disable transport security, uncomment and set allow_insecure +# # to true. Not recommended for production configurations. 
+# allow_insecure: false + allow_insecure: false +# +# # Custom CA Root certificate for generated certs +# ca_cert: /etc/daos/certs/daosCA.crt + ca_cert: /etc/daos/certs/daosCA.crt +# # Admin certificate for use in TLS handshakes +# cert: /etc/daos/certs/admin.crt + cert: /etc/daos/certs/admin.crt +# # Key portion of Admin Certificate +# key: /etc/daos/certs/admin.key + key: /etc/daos/certs/admin.key diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/daos/daos_server.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/daos/daos_server.yml new file mode 100644 index 0000000..413ce7f --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/daos/daos_server.yml @@ -0,0 +1,675 @@ +## DAOS server configuration file. +# +## Location of this configuration file is determined by first checking for the +## path specified through the -o option of the daos_server command line. +## Otherwise, /etc/daos/daos_server.yml is used. +# +# +## Name associated with the DAOS system. +## Immutable after running "dmg storage format". +# +## NOTE: Changing the DAOS system name is not supported yet. +## It must not be changed from the default "daos_server". +# +## default: daos_server +name: daos_server + +## MS replicas +## Immutable after running "dmg storage format". +# +## To operate, DAOS requires a quorum of Management Service (MS) replica +## hosts to be available. All servers (replica or otherwise) must have the +## same list of replicas in order for the system to operate correctly. Choose +## 3-5 hosts to serve as replicas, preferably not co-located within the same +## fault domains. +## +## Hosts can be specified with or without port. The default port that is set +## up in port: will be used if a port is not specified here. 
+# +## default: hostname of this node +#mgmt_svc_replicas: ['hostname1', 'hostname2', 'hostname3'] +mgmt_svc_replicas: ["MantaStorage-01", "MantaStorage-02", "MantaStorage-03"] + +## Control plane metadata +## Immutable after running "dmg storage format". +# +## Mandatory if MD-on-SSD bdev device roles have been assigned. +## Define a directory or partition/mountpoint to be used as the storage location for +## control plane metadata. The location specified should be persistent across reboots. +# +control_metadata: +# # Directory to store control plane metadata. +# # If device is also defined, this path will be used as the mountpoint. +# path: /home/daos_server/control_meta + path: /var/daos/control_meta +# # Storage partition to be formatted with an ext4 filesystem and mounted for +# # control plane metadata storage. +# device: /dev/sdb1 + +## Default control plane port +# +## Port number to bind daos_server to. This will also be used when connecting +## to MS replicas, unless a port is specified in mgmt_svc_replicas: +# +## default: 10001 +port: 10001 + +## Transport credentials specifying certificates to secure communications +# +#transport_config: +# # In order to disable transport security, uncomment and set allow_insecure +# # to true. Not recommended for production configurations. +# allow_insecure: false +# +# # Location where daos_server will look for Client certificates +# client_cert_dir: /etc/daos/certs/clients +# # Custom CA Root certificate for generated certs +# ca_cert: /etc/daos/certs/daosCA.crt +# # Server certificate for use in TLS handshakes +# cert: /etc/daos/certs/server.crt +# # Key portion of Server Certificate +# key: /etc/daos/certs/server.key + +## Fault domain path +## Immutable after running "dmg storage format". +# +## default: /hostname for a local configuration w/o fault domain +#fault_path: /vcdu0/rack1/hostname + +## Fault domain callback +## Immutable after running "dmg storage format". 
+# +## Path to executable which will return fault domain string. +# +#fault_cb: ./.daos/fd_callback +#fault_cb: /etc/daos/daos_fd_callback.sh + +## Network provider +# +## Set the network provider to be used by all the engines. +## There is no default - run "daos_server network scan" to list the +## providers that are supported on the fabric interfaces. Examples: +## +## ofi+verbs;ofi_rxm for libfabric with Infiniband/RoCE +## ofi+tcp;ofi_rxm for libfabric with non-RDMA-capable fabrics +## +## (Starting with DAOS 2.2, ofi_rxm will be automatically added to the +## libfabric verbs and tcp providers, if not explicitly specified.) +# +#provider: ofi+verbs;ofi_rxm +provider: ofi+verbs;ofi_rxm + +## CART: global RPC timeout +## parameters shared with client. +# +#crt_timeout: 30 + +## CART: Disable SRX +## parameters shared with client. set it to true if network card +## does not support shared receive context, eg intel E810-C. +# +#disable_srx: false + +## CART: Fabric authorization key +## If the fabric requires an authorization key, set it here to +## be used on the server and clients. +# +#fabric_auth_key: foo:bar + +## Core Dump Filter +## Optional filter to control which mappings are written to the core +## dump in the event of a crash. See the following URL for more detail: +## https://man7.org/linux/man-pages/man5/core.5.html +# +#core_dump_filter: 0x13 + +## NVMe SSD exclusion list +## Immutable after running "dmg storage format". +# +## Only use NVMe controllers with specific PCI addresses. +## Excludes drives listed and forces auto-detection to skip those drives. +## default: Use all the NVMe SSDs that don't have active mount points. +# +#bdev_exclude: ["0000:81:00.1"] +bdev_exclude: ["0000:59:00.0"] + +## Disable VFIO Driver +# +## In some circumstances it may be preferable to force SPDK to use the UIO +## driver for NVMe device access even though an IOMMU is available. +## NOTE: Use of the UIO driver requires that daos_server must run as root. 
+# +## default: false +#disable_vfio: true + +## Disable VMD Usage +# +## VMD (Intel Volume Management Devices) is enabled by default but can be +## optionally disabled in which case VMD backing devices will not be visible. +# +## VMD needs to be available and configured in the system BIOS before it +## can be used. The main use case for VMD is managing NVMe SSD LED activity. +# +## default: false +#disable_vmd: true +disable_vmd: true + +## Disable NVMe SSD Hotplug +# +## NVMe SSD hotplug is enabled by default but can be optionally disabled. +## When enabled, the I/O engine periodically checks for device hot +## plug/remove events and sets up or tears down the device automatically. +# +## default: false +#disable_hotplug: true + +## Use Hyperthreads +# +## When Hyperthreading is enabled and supported on the system, this parameter +## defines whether the DAOS service should try to take advantage of +## hyperthreading to schedule different tasks on each hardware thread. +## Not supported yet. +# +## default: false +hyperthreads: true + +## Use the given directory for creating unix domain sockets +# +## DAOS Agent and DAOS Server both use unix domain sockets for communication +## with other system components. This setting is the base location to place +## the sockets in. +# +## NOTE: Do not change this when running under systemd control. If it needs to +## be changed, then make sure that it matches the RuntimeDirectory setting +## in /usr/lib/systemd/system/daos_server.service +# +## default: /var/run/daos_server +# +#socket_dir: ./.daos/daos_server + +## Number of hugepages to allocate for DMA buffer memory +# +## Optional parameter that should only be set if overriding the automatically calculated value is +## necessary. Specifies the number (not size) of hugepages to allocate for use by NVMe through +## SPDK. For optimum performance each target requires 1 GiB of hugepage space.
The provided value +## should be calculated by dividing the total amount of hugepages memory required for all targets +## across all engines on a host by the system hugepage size. If not set here, the value will be +## automatically calculated based on the number of targets (using the default system hugepage size). +# +## Example: (2 engines * (16 targets/engine * 1GiB)) / 2MiB hugepage size = 16384 +# +## default: 0 +#nr_hugepages: 0 +#nr_hugepages: 25600 + +## Hugepages are mandatory with NVMe SSDs configured and optional without. +## To disable the use of hugepages when no NVMe SSDs are configured, set disable_hugepages to true. +# +## default: false +disable_hugepages: false + +## Hugepages will be applied across NUMA-nodes based on engine affinity. Typical scenarios where an +## equal number of engines exist across a number of NUMA-nodes or all engines on a single NUMA-node +## will be handled but if it's expected that there are a different number of engines on multiple +## NUMA-nodes e.g. two engines on NUMA-0 and one engine on NUMA-1, then explicitly setting this flag +## will enable engine start-up with an imbalanced configuration. +# +## default: false +#allow_numa_imbalance: false + +## Reserve an amount of RAM for system use when calculating the size of RAM-disks that will be +## created for DAOS I/O engines. Units are in GiB and represent the total RAM that will be +## reserved when calculating RAM-disk sizes for all engines. +# +## Optional parameter that should only be set if the automatically calculated value is unsuitable. +## In situations when a host is running applications alongside DAOS that use a significant amount +## of RAM, resulting in the MemAvailable value being too low to support the calculated RAM-disk size, +## increasing the value will reduce the calculated size. Alternatively in situations where total +## RAM is low, reducing the value may prevent problems where RAM-disk size calculated is below the +## minimum of 4 GiB.
Increasing the value may help avoid the potential of OOM killer terminating +## engine processes but could also result in stopping DAOS from using available memory resources. +# +## default: 26 +system_ram_reserved: 8 + +## Set specific debug mask for daos_server (control plane). +## The mask specifies minimum level of message significance to pass to logger. +## Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. +# +## default: INFO +#control_log_mask: ERROR + +## Force specific path for daos_server (control plane) logs. +# +## default: print to stderr +#control_log_file: /var/log/daos/daos_server.log +control_log_file: /var/log/daos/daos_server.log + +## Enable daos_server_helper (privileged helper) logging. +# +## default: disabled (errors only to control_log_file) +#helper_log_file: /var/log/daos/daos_server_helper.log +helper_log_file: /var/log/daos/daos_server_helper.log + +## Enable daos_firmware_helper (privileged helper) logging. +# +## default: disabled (errors only to control_log_file) +#firmware_helper_log_file: /var/log/daos/daos_firmware_helper.log +firmware_helper_log_file: /var/log/daos/daos_firmware_helper.log + +## Enable HTTP endpoint for remote telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9191 +telemetry_port: 9191 + +## If desired, a set of client-side environment variables may be +## defined here. Note that these are intended to be defaults and +## may be overridden by manually-set environment variables when +## the client application is launched. +client_env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_CREDIT_EP_CTX=0 + - CRT_MRC_ENABLE=1 + - CRT_TIMEOUT=600 + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16383 + - FI_VERBS_INLINE_SIZE=128 + - FI_VERBS_PREFER_XRC=1 + +## When per-engine definitions exist, auto-allocation of resources is not +## performed. 
Without per-engine definitions, node resources will +## automatically be assigned to engines based on NUMA ratings. +## There will be a one-to-one relationship between engines and sockets. +# +engines: +- + # Number of I/O service threads (and network endpoints) per engine. + # Immutable after running "dmg storage format". + # + # Each storage target manages a fraction of the (interleaved) SCM storage space, + # and a fraction of one of the NVMe SSDs that are managed by this engine. + # For optimal balance regarding the NVMe space, the number of targets should be + # an integer multiple of the number of NVMe disks configured in bdev_list: + # To obtain the maximum SCM performance, a certain number of targets is needed. + # This is device- and workload-dependent, but around 16 targets usually work well. + # + # The server should have sufficiently many physical cores to support the + # number of targets, plus the additional service threads. + # + # Besides the number of CPU cores, this target number is limited by the number of + # SSDs listed in the bdev_list as well. One NVMe SSD in the bdev_list can be assigned + # to at most 64 targets in pmem mode, or 21 targets in md-on-ssd mode (when the SSD + # has three roles), so the maximum target number will be 'number_of_SSD * 64' in + # pmem mode or 'number_of_SSD * 21' in md-on-ssd mode. + + targets: 16 + + # Number of additional offload service threads per engine. + # Immutable after running "dmg storage format". + # + # Helper threads to accelerate checksum and server-side RPC dispatch. + # When using EC, it is recommended to configure helper threads in + # roughly a 1:4 ratio to the number of target threads. For example, + # when using 16 targets it is recommended to set nr_xs_helpers to 4. + # + # The server should have sufficiently many physical cores to support the + # number of helper threads, plus the number of targets. 
+ # + # default: 0 (using existing target threads for this task) + + nr_xs_helpers: 4 + + # Pin this engine instance to cores and memory that are local to the + # NUMA node ID specified with this value. + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the pinned_numa_node. + # + # Optional parameter; set either this option or first_core, but not both. + + pinned_numa_node: 0 + + # Offset of the first core to be used for I/O service threads (targets). + # Immutable after running "dmg storage format". + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the first_core. + # + # Optional parameter; set either this option or pinned_numa_node but not both. + + #first_core: 2 + + # A boolean that instructs the I/O Engine instance to bypass the NVMe + # health check. This eliminates the check and related log output for those + # systems with NVMe that do not support the device health data query. + + bypass_health_chk: true + + # Specify the fabric network interface used for this engine. + + fabric_iface: ibs1 + + # Specify the fabric network interface port that will be used by this engine. + # The fabric_iface_port must be different for each engine on a DAOS server + # if each engine is assigned to the same fabric_iface. + + fabric_iface_port: 20000 + + # Force specific debug mask for the engine at start up time. + # By default, just use the default debug mask used by DAOS. + # Mask specifies minimum level of message significance to pass to logger. + + # default: ERR + log_mask: INFO + + # Force specific path for DAOS debug logs. + + # default: engine log goes to control_log_file + log_file: /var/log/daos/daos_engine.0.log + + # Pass specific environment variables to the engine process. + # Empty by default. Values should be supplied without encapsulating quotes. 
+ + env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_TIMEOUT=600 + - CRT_CREDIT_EP_CTX=0 + - DAOS_MD_CAP=4096 + - DAOS_DTX_AGG_THD_CNT=16777216 + - DAOS_DTX_AGG_THD_AGE=1200 + - FI_CXI_CQ_FILL_PERCENT=20 + - FI_CXI_OFLOW_BUF_SIZE=8388608 + - FI_CXI_REQ_BUF_MIN_POSTED=8 + - FI_CXI_REQ_BUF_SIZE=8388608 + - FI_CXI_RX_MATCH_MODE=hybrid + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16384 + - FI_VERBS_PREFER_XRC=1 + - UCX_IB_FORK_INIT=n + + storage: + - + # Define a pre-configured mountpoint for storage class memory to be used + # by this engine. + # Path should be unique to engine instance (can use different subdirs). + # Either the specified directory or its parent must be a mount point. + + scm_mount: /mnt/daos/1 + + # Backend SCM device type. Either use PMem (Intel(R) Optane(TM) persistent + # memory) modules configured in interleaved mode or a tmpfs running in RAM. + # Options are: + # - "dcpm" for SCM, scm_size is ignored + # - "ram" to use tmpfs, scm_list is ignored + # Immutable after running "dmg storage format". + + class: ram + + # When class is set to ram, tmpfs will be used instead of dcpm. + # The size of ram is specified by scm_size in GB units and will be automatically calculated + # unless overridden by this optional parameter (units in GiB). + + #scm_size: 0 + scm_size: 160 + + # When class is set to ram, tmpfs will be mounted with hugepage + # support, if the kernel supports it. If this is not desirable, + # the behavior may be disabled here. + scm_hugepages_disabled: false + + - + # Backend block device type. Force a SPDK driver to be used by this engine + # instance. + # Options are: + # - "nvme" for NVMe SSDs (preferred option), bdev_size ignored + # - "file" to emulate a NVMe SSD with a regular file + # - "kdev" to use a kernel block device, bdev_size ignored + # Immutable after running "dmg storage format". + class: nvme + + # Backend block device configuration to be used by this engine instance. 
+ # When class is set to nvme, bdev_list is the list of unique NVMe IDs + # that should be different across different engine instances. + # Immutable after running "dmg storage format". + bdev_list: [ + "0000:5b:00.0", + "0000:5c:00.0", + "0000:6c:00.0", + "0000:6d:00.0", + "0000:6e:00.0", + "0000:6f:00.0", + ] + + # If VMD-enabled NVMe SSDs are used, the bdev_list should consist of the VMD + # PCIe addresses, and not the BDF format transport IDs of the backing NVMe SSDs + # behind the VMD address. Also, 'disable_vmd' needs to be set to false. + #bdev_list: ["0000:5d:05.5"] + + # Optional override, will be automatically generated based on NUMA affinity. + # Filter hot-pluggable devices by PCI bus-ID by specifying a hexadecimal + # range. Hotplug events relating to devices with PCI bus-IDs outside this range + # will not be processed by this engine. Empty or unset range signifies allow all. + #bdev_busid_range: 0x80-0x8f + #bdev_busid_range: 128-143 + + # Optional explicit nvme-class bdev tier role assignments will + # define the roles and responsibilities of this bdev tier. + # If DCPM class is defined for the first tier, + # only one bdev tier is supported and its role must be data. + + # Roles will be derived based on configured bdev + # tiers, if not specified here. You must assign all roles or none. + # Options are: + # - "data" SSDs will be used to store actual data + # - "meta" SSDs will be used to store the VOS metadata + # - "wal" SSDs will be used to store the write-ahead-log + bdev_roles: ['data', 'meta', 'wal'] + + # Set criteria for automatic detection and eviction of faulty NVMe devices. The + # default criteria parameters are `enable: true`, `max_io_errs: 10` and + # `max_csum_errs: ` (essentially eviction due to checksum errors is + # disabled by default). + bdev_auto_faulty: + enable: true + max_io_errs: 100 + max_csum_errs: 200 + +- + # Number of I/O service threads (and network endpoints) per engine. 
+ # Immutable after running "dmg storage format". + # + # Each storage target manages a fraction of the (interleaved) SCM storage space, + # and a fraction of one of the NVMe SSDs that are managed by this engine. + # For optimal balance regarding the NVMe space, the number of targets should be + # an integer multiple of the number of NVMe disks configured in bdev_list: + # To obtain the maximum SCM performance, a certain number of targets is needed. + # This is device- and workload-dependent, but around 16 targets usually work well. + # + # The server should have sufficiently many physical cores to support the + # number of targets, plus the additional service threads. + + targets: 16 + + # Number of additional offload service threads per engine. + # Immutable after running "dmg storage format". + # + # Helper threads to accelerate checksum and server-side RPC dispatch. + # + # The server should have sufficiently many physical cores to support the + # number of helper threads, plus the number of targets. + + nr_xs_helpers: 4 + + # Pin this engine instance to cores and memory that are local to the + # NUMA node ID specified with this value. + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the pinned_numa_node. + # + # Optional parameter; set either this option or first_core, but not both. + + pinned_numa_node: 1 + + # Offset of the first core to be used for I/O service threads (targets). + # Immutable after running "dmg storage format". + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the first_core. + # + # Optional parameter; set either this option or pinned_numa_node but not both. + + #first_core: 64 + + # A boolean that instructs the I/O Engine instance to bypass the NVMe + # health check. This eliminates the check and related log output for those + # systems with NVMe that do not support the device health data query. 
+ + bypass_health_chk: true + + # Specify the fabric network interface used for this engine. + + fabric_iface: ibs769 + + # Specify the fabric network interface port that will be used by this engine. + # The fabric_iface_port must be different for each engine on a DAOS server + # if each engine is assigned to the same fabric_iface. + + fabric_iface_port: 21000 + + # Force specific debug mask for the engine at start up time. + # By default, just use the default debug mask used by DAOS. + # Mask specifies minimum level of message significance to pass to logger. + + # default: ERR + log_mask: INFO + + # Force specific path for DAOS debug logs. + + # default: engine log goes to control_log_file + log_file: /var/log/daos/daos_engine.1.log + + # Pass specific environment variables to the engine process. + # Empty by default. Values should be supplied without encapsulating quotes. + + env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_TIMEOUT=600 + - CRT_CREDIT_EP_CTX=0 + - DAOS_MD_CAP=4096 + - DAOS_DTX_AGG_THD_CNT=16777216 + - DAOS_DTX_AGG_THD_AGE=1200 + - FI_CXI_CQ_FILL_PERCENT=20 + - FI_CXI_OFLOW_BUF_SIZE=8388608 + - FI_CXI_REQ_BUF_MIN_POSTED=8 + - FI_CXI_REQ_BUF_SIZE=8388608 + - FI_CXI_RX_MATCH_MODE=hybrid + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16384 + - FI_VERBS_PREFER_XRC=1 + - UCX_IB_FORK_INIT=n + + storage: + - + # Define a pre-configured mountpoint for storage class memory to be used + # by this engine. + # Path should be unique to engine instance (can use different subdirs). + # Either the specified directory or its parent must be a mount point. + + scm_mount: /mnt/daos/2 + + # Backend SCM device type. Either use PMem (Intel(R) Optane(TM) persistent + # memory) modules configured in interleaved mode or a tmpfs running in RAM. + # Options are: + # - "dcpm" for SCM, scm_size is ignored + # - "ram" to use tmpfs, scm_list is ignored + # Immutable after running "dmg storage format". 
+ + class: ram + + # When class is set to ram, tmpfs will be used instead of dcpm. + # The size of ram is specified by scm_size in GB units and will be automatically calculated + # unless overridden by this optional parameter (units in GiB). + + #scm_size: 0 + scm_size: 160 + + # When class is set to ram, tmpfs will be mounted with hugepage + # support, if the kernel supports it. If this is not desirable, + # the behavior may be disabled here. + scm_hugepages_disabled: false + + # When class is set to dcpm, scm_list is the list of device paths for + # PMem namespaces (currently only one per engine supported). + #class: dcpm + #scm_list: [/dev/pmem1] + + - + # Backend block device type. Force a SPDK driver to be used by this engine + # instance. + # Options are: + # - "nvme" for NVMe SSDs (preferred option), bdev_size ignored + # - "file" to emulate a NVMe SSD with a regular file + # - "kdev" to use a kernel block device, bdev_size ignored + # Immutable after running "dmg storage format". + + # When class is set to file, Linux AIO will be used to emulate NVMe. + # The size of file that will be created is specified by bdev_size in GB units. + # The location of the files that will be created is specified in bdev_list. + #class: file + #bdev_list: [/tmp/daos-bdev1,/tmp/daos-bdev2] + #bdev_size: 16 + + # When class is set to kdev, bdev_list is the list of unique kernel + # block devices that should be different across different engine instance. + #class: kdev + #bdev_list: [/dev/sdc,/dev/sdd] + + # If Volume Management Devices (VMD) are to be used, then the disable_vmd + # flag needs to be set to false (default). The class will remain the + # default "nvme" type, and bdev_list will include the VMD addresses. 
+ #class: nvme + #bdev_list: ["0000:5d:05.5"] + + class: nvme + bdev_list: [ + "0000:d8:00.0", + "0000:d9:00.0", + "0000:da:00.0", + "0000:db:00.0", + "0000:ea:00.0", + "0000:eb:00.0", + ] + + # Optional override, will be automatically generated based on NUMA affinity. + # Filter hot-pluggable devices by PCI bus-ID by specifying a hexadecimal + # range. Hotplug events relating to devices with PCI bus-IDs outside this range + # will not be processed by this engine. Empty or unset range signifies allow all. + #bdev_busid_range: 0xd0-0xdf + #bdev_busid_range: 208-223 + + # See the bdev_roles description above. + bdev_roles: ['data', 'meta', 'wal'] + + # Set criteria for automatic detection and eviction of faulty NVMe devices. The + # default criteria parameters are `enable: true`, `max_io_errs: 10` and + # `max_csum_errs: ` (essentially eviction due to checksum errors is + # disabled by default). + + bdev_auto_faulty: + enable: true + max_io_errs: 100 + max_csum_errs: 200 + +#support_config: +# +## Override the default file transfer mechanism for dmg support collect-log by supplying +## the path to an alternative script or binary that will be used to copy logs off the servers. +## Note that the --transfer-args flag to the collect-log command may be used to supply +## extra runtime arguments used by the copy tool (e.g. a cloud bucket name, etc.)
+## file_transfer_exec: /usr/bin/rsync (example) +# +# file_transfer_exec: diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..c407299 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,41 @@ + ############################################################################## + # Multi-rail + ############################################################################## + net.ipv4.conf.ibs1.accept_local=1 + net.ipv4.conf.ibs1.rp_filter=2 + net.ipv4.conf.ibs1.arp_ignore=2 + net.ipv4.conf.ibs1.arp_announce=0 + net.ipv4.conf.ibs1.arp_filter=0 + + net.ipv4.conf.ibs769.accept_local=1 + net.ipv4.conf.ibs769.rp_filter=2 + net.ipv4.conf.ibs769.arp_ignore=2 + net.ipv4.conf.ibs769.arp_announce=0 + net.ipv4.conf.ibs769.arp_filter=0 + + ############################################################################## + # Common tuning + ############################################################################## + net.core.netdev_max_backlog = 250000 + #net.core.rmem_max = 16777216 + net.core.rmem_max = 134217728 + #net.core.wmem_max = 16777216 + net.core.wmem_max = 134217728 + net.core.rmem_default = 16777216 + net.core.wmem_default = 16777216 + net.core.optmem_max = 16777216 + + net.ipv4.neigh.default.gc_thresh1 = 1024 + net.ipv4.neigh.default.gc_thresh2 = 4096 + net.ipv4.neigh.default.gc_thresh3 = 16384 + + net.ipv4.tcp_timestamps = 1 + net.ipv4.tcp_sack = 1 + net.ipv4.tcp_mem = 16777216 16777216 16777216 + net.ipv4.tcp_rmem = 4096 1677216 134217728 + net.ipv4.tcp_wmem = 4096 1677216 134217728 + net.ipv4.tcp_low_latency = 1 + net.ipv4.tcp_mtu_probing = 1 + net.ipv4.tcp_congestion_control=bbr + net.core.default_qdisc=fq + net.ipv4.tcp_slow_start_after_idle=0 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/sysctl.d/99-daos-vm.conf
b/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/sysctl.d/99-daos-vm.conf new file mode 100644 index 0000000..bc13d41 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-02/etc/sysctl.d/99-daos-vm.conf @@ -0,0 +1 @@ + vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/daos/daos_agent.yml new file mode 100644 index 0000000..7b5ca6b --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/daos/daos_agent.yml @@ -0,0 +1,183 @@ +# DAOS agent configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the daos_agent command line. +# Otherwise, /etc/daos/daos_agent.yml is used. +# +# Section describing the daos_agent configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. +# +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. +# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. +# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. 
+# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. +# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. +#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. +# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley +# +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. 
+# cache_expiration: 1m + +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: + # In order to disable transport security, uncomment and set allow_insecure + # to true. Not recommended for production configurations. + allow_insecure: false + + # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt + # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt + # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +#runtime_dir: /var/run/daos_agent + +# Full path and name of the DAOS agent logfile. +# default: print to stderr +log_file: /var/log/daos/daos_agent.log + +# Force specific debug mask for daos_agent (control plane). +# Mask specifies minimum level of message significance to pass to logger. +# Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. + +# default: INFO +control_log_mask: DEBUG + +# Disable automatic eviction of open pool handles on agent shutdown. By default, +# the agent will evict all open pool handles for local processes on shutdown. +# Note that this implies that stopping or restarting the agent will result +# in interruption of DAOS I/O for any local DAOS client processes that have +# an open pool handle. +# default: false +disable_auto_evict: true + +# If enabled, the agent will evict any open pool handles associated with this machine on agent +# startup. This allows the servers to reclaim resources that may not have been properly cleaned +# up in the event of an agent or machine crash. 
+# default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). +# +#exclude_fabric_ifaces: ["lo", "eth1"] + +# Conversely, only consider a specific set of fabric interfaces when selecting +# an interface for client applications. (Mutually exclusive with exclude). + +include_fabric_ifaces: ["ibs1", "ibs1025"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. + +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs1 + domain: mlx5_0 +- + numa_node: 1 + devices: + - + iface: ibs1025 + domain: mlx5_1 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/daos/daos_control.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/daos/daos_control.yml new file mode 100644 index 0000000..f3f2c79 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/daos/daos_control.yml @@ -0,0 +1,50 @@ +# DAOS manager (dmg) configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the dmg command line. +# Otherwise, /etc/daos/daos_control.yml is used. 
+# +# Section describing the DAOS manager (dmg) configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. +# +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: Changing the name is not supported yet, it must be daos_server +# +# default: daos_server +#name: daos_server +name: daos_server + +# Default destination port to use when connecting to hosts in the hostlist. +# default: 10001 +port: 10001 + +# Hostlist, a comma separated list of addresses (hostnames or IPv4 addresses). +# default: ['localhost'] +hostlist: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +## Transport Credentials Specifying certificates to secure communications + +transport_config: +# # In order to disable transport security, uncomment and set allow_insecure +# # to true. Not recommended for production configurations. +# allow_insecure: false + allow_insecure: false +# +# # Custom CA Root certificate for generated certs +# ca_cert: /etc/daos/certs/daosCA.crt + ca_cert: /etc/daos/certs/daosCA.crt +# # Admin certificate for use in TLS handshakes +# cert: /etc/daos/certs/admin.crt + cert: /etc/daos/certs/admin.crt +# # Key portion of Admin Certificate +# key: /etc/daos/certs/admin.key + key: /etc/daos/certs/admin.key diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/daos/daos_server.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/daos/daos_server.yml new file mode 100644 index 0000000..5eceed6 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/daos/daos_server.yml @@ -0,0 +1,675 @@ +## DAOS server configuration file. +# +## Location of this configuration file is determined by first checking for the +## path specified through the -o option of the daos_server command line. +## Otherwise, /etc/daos/daos_server.yml is used. 
+# +# +## Name associated with the DAOS system. +## Immutable after running "dmg storage format". +# +## NOTE: Changing the DAOS system name is not supported yet. +## It must not be changed from the default "daos_server". +# +## default: daos_server +name: daos_server + +## MS replicas +## Immutable after running "dmg storage format". +# +## To operate, DAOS requires a quorum of Management Service (MS) replica +## hosts to be available. All servers (replica or otherwise) must have the +## same list of replicas in order for the system to operate correctly. Choose +## 3-5 hosts to serve as replicas, preferably not co-located within the same +## fault domains. +## +## Hosts can be specified with or without port. The default port that is set +## up in port: will be used if a port is not specified here. +# +## default: hostname of this node +#mgmt_svc_replicas: ['hostname1', 'hostname2', 'hostname3'] +mgmt_svc_replicas: ["MantaStorage-01", "MantaStorage-02", "MantaStorage-03"] + +## Control plane metadata +## Immutable after running "dmg storage format". +# +## Mandatory if MD-on-SSD bdev device roles have been assigned. +## Define a directory or partition/mountpoint to be used as the storage location for +## control plane metadata. The location specified should be persistent across reboots. +# +control_metadata: +# # Directory to store control plane metadata. +# # If device is also defined, this path will be used as the mountpoint. +# path: /home/daos_server/control_meta + path: /var/daos/control_meta +# # Storage partition to be formatted with an ext4 filesystem and mounted for +# # control plane metadata storage. +# device: /dev/sdb1 + +## Default control plane port +# +## Port number to bind daos_server to. 
This will also be used when connecting +## to MS replicas, unless a port is specified in mgmt_svc_replicas: +# +## default: 10001 +port: 10001 + +## Transport credentials specifying certificates to secure communications +# +#transport_config: +# # In order to disable transport security, uncomment and set allow_insecure +# # to true. Not recommended for production configurations. +# allow_insecure: false +# +# # Location where daos_server will look for Client certificates +# client_cert_dir: /etc/daos/certs/clients +# # Custom CA Root certificate for generated certs +# ca_cert: /etc/daos/certs/daosCA.crt +# # Server certificate for use in TLS handshakes +# cert: /etc/daos/certs/server.crt +# # Key portion of Server Certificate +# key: /etc/daos/certs/server.key + +## Fault domain path +## Immutable after running "dmg storage format". +# +## default: /hostname for a local configuration w/o fault domain +#fault_path: /vcdu0/rack1/hostname + +## Fault domain callback +## Immutable after running "dmg storage format". +# +## Path to executable which will return fault domain string. +# +#fault_cb: ./.daos/fd_callback +#fault_cb: /etc/daos/daos_fd_callback.sh + +## Network provider +# +## Set the network provider to be used by all the engines. +## There is no default - run "daos_server network scan" to list the +## providers that are supported on the fabric interfaces. Examples: +## +## ofi+verbs;ofi_rxm for libfabric with Infiniband/RoCE +## ofi+tcp;ofi_rxm for libfabric with non-RDMA-capable fabrics +## +## (Starting with DAOS 2.2, ofi_rxm will be automatically added to the +## libfabric verbs and tcp providers, if not explicitly specified.) +# +#provider: ofi+verbs;ofi_rxm +provider: ofi+verbs;ofi_rxm + +## CART: global RPC timeout +## parameters shared with client. +# +#crt_timeout: 30 + +## CART: Disable SRX +## parameters shared with client. set it to true if network card +## does not support shared receive context, eg intel E810-C. 
+# +#disable_srx: false + +## CART: Fabric authorization key +## If the fabric requires an authorization key, set it here to +## be used on the server and clients. +# +#fabric_auth_key: foo:bar + +## Core Dump Filter +## Optional filter to control which mappings are written to the core +## dump in the event of a crash. See the following URL for more detail: +## https://man7.org/linux/man-pages/man5/core.5.html +# +#core_dump_filter: 0x13 + +## NVMe SSD exclusion list +## Immutable after running "dmg storage format". +# +## Only use NVMe controllers with specific PCI addresses. +## Excludes drives listed and forces auto-detection to skip those drives. +## default: Use all the NVMe SSDs that don't have active mount points. +# +#bdev_exclude: ["0000:81:00.1"] +bdev_exclude: ["0000:59:00.0"] + +## Disable VFIO Driver +# +## In some circumstances it may be preferable to force SPDK to use the UIO +## driver for NVMe device access even though an IOMMU is available. +## NOTE: Use of the UIO driver requires that daos_server must run as root. +# +## default: false +#disable_vfio: true + +## Disable VMD Usage +# +## VMD (Intel Volume Management Devices) is enabled by default but can be +## optionally disabled in which case VMD backing devices will not be visible. +# +## VMD needs to be available and configured in the system BIOS before it +## can be used. The main use case for VMD is managing NVMe SSD LED activity. +# +## default: false +#disable_vmd: true +disable_vmd: true + +## Disable NVMe SSD Hotplug +# +## NVMe SSD hotplug is enabled by default but can be optionally disabled. +## When enabled io engine will periodically check device hot +## plug/remove event, and setup/teardown the device automatically. 
+# +## default: false +#disable_hotplug: true + +## Use Hyperthreads +# +## When Hyperthreading is enabled and supported on the system, this parameter +## defines whether the DAOS service should try to take advantage of +## hyperthreading to schedule different tasks on each hardware thread. +## Not supported yet. +# +## default: false +hyperthreads: true + +## Use the given directory for creating unix domain sockets +# +## DAOS Agent and DAOS Server both use unix domain sockets for communication +## with other system components. This setting is the base location to place +## the sockets in. +# +## NOTE: Do not change this when running under systemd control. If it needs to +## be changed, then make sure that it matches the RuntimeDirectory setting +## in /usr/lib/systemd/system/daos_server.service +# +## default: /var/run/daos_server +# +#socket_dir: ./.daos/daos_server + +## Number of hugepages to allocate for DMA buffer memory +# +## Optional parameter that should only be set if overriding the automatically calculated value is +## necessary. Specifies the number (not size) of hugepages to allocate for use by NVMe through +## SPDK. For optimum performance each target requires 1 GiB of hugepage space. The provided value +## should be calculated by dividing the total amount of hugepage memory required for all targets +## across all engines on a host by the system hugepage size. If not set here, the value will be +## automatically calculated based on the number of targets (using the default system hugepage size). +# +## Example: (2 engines * (16 targets/engine * 1GiB)) / 2MiB hugepage size = 16384 +# +## default: 0 +#nr_hugepages: 0 +#nr_hugepages: 25600 + +## Hugepages are mandatory with NVMe SSDs configured and optional without. +## To disable the use of hugepages when no NVMe SSDs are configured, set disable_hugepages to true. +# +## default: false +disable_hugepages: false + +## Hugepages will be applied across NUMA-nodes based on engine affinity.
Typical scenarios where an +## equal number of engines exist across a number of NUMA-nodes or all engines on a single NUMA-node +## will be handled but if it's expected that there are a different number of engines on multiple +## NUMA-nodes e.g. two engines on NUMA-0 and one engine on NUMA-1, then explicitly setting this flag +## will enable engine start-up with an imbalanced configuration. +# +## default: false +#allow_numa_imbalance: false + +## Reserve an amount of RAM for system use when calculating the size of RAM-disks that will be +## created for DAOS I/O engines. Units are in GiB and represent the total RAM that will be +## reserved when calculating RAM-disk sizes for all engines. +# +## Optional parameter that should only be set if the automatically calculated value is unsuitable. +## In situations when a host is running applications alongside DAOS that use a significant amount +## of RAM resulting in MemAvailable value being too low to support the calculated RAM-disk size +## increasing the value will reduce the calculated size. Alternatively in situations where total +## RAM is low, reducing the value may prevent problems where RAM-disk size calculated is below the +## minimum of 4 GiB. Increasing the value may help avoid the potential of OOM killer terminating +## engine processes but could also result in stopping DAOS from using available memory resources. +# +## default: 26 +system_ram_reserved: 8 + +## Set specific debug mask for daos_server (control plane). +## The mask specifies minimum level of message significance to pass to logger. +## Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. +# +## default: INFO +#control_log_mask: ERROR + +## Force specific path for daos_server (control plane) logs. +# +## default: print to stderr +#control_log_file: /var/log/daos/daos_server.log +control_log_file: /var/log/daos/daos_server.log + +## Enable daos_server_helper (privileged helper) logging.
+# +## default: disabled (errors only to control_log_file) +#helper_log_file: /var/log/daos/daos_server_helper.log +helper_log_file: /var/log/daos/daos_server_helper.log + +## Enable daos_firmware_helper (privileged helper) logging. +# +## default: disabled (errors only to control_log_file) +#firmware_helper_log_file: /var/log/daos/daos_firmware_helper.log +firmware_helper_log_file: /var/log/daos/daos_firmware_helper.log + +## Enable HTTP endpoint for remote telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9191 +telemetry_port: 9191 + +## If desired, a set of client-side environment variables may be +## defined here. Note that these are intended to be defaults and +## may be overridden by manually-set environment variables when +## the client application is launched. +client_env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_CREDIT_EP_CTX=0 + - CRT_MRC_ENABLE=1 + - CRT_TIMEOUT=600 + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16383 + - FI_VERBS_INLINE_SIZE=128 + - FI_VERBS_PREFER_XRC=1 + +## When per-engine definitions exist, auto-allocation of resources is not +## performed. Without per-engine definitions, node resources will +## automatically be assigned to engines based on NUMA ratings. +## There will be a one-to-one relationship between engines and sockets. +# +engines: +- + # Number of I/O service threads (and network endpoints) per engine. + # Immutable after running "dmg storage format". + # + # Each storage target manages a fraction of the (interleaved) SCM storage space, + # and a fraction of one of the NVMe SSDs that are managed by this engine. + # For optimal balance regarding the NVMe space, the number of targets should be + # an integer multiple of the number of NVMe disks configured in bdev_list: + # To obtain the maximum SCM performance, a certain number of targets is needed. + # This is device- and workload-dependent, but around 16 targets usually work well. 
+ # + # The server should have sufficiently many physical cores to support the + # number of targets, plus the additional service threads. + # + # Besides the number of CPU cores, this target number is limited by the number of + # SSDs listed in the bdev_list as well. One NVMe SSD in the bdev_list can be assigned + # to at most 64 targets in pmem mode, or 21 targets in md-on-ssd mode (when the SSD + # has three roles), so the maximum target number will be 'number_of_SSD * 64' in + # pmem mode or 'number_of_SSD * 21' in md-on-ssd mode. + + targets: 16 + + # Number of additional offload service threads per engine. + # Immutable after running "dmg storage format". + # + # Helper threads to accelerate checksum and server-side RPC dispatch. + # When using EC, it is recommended to configure helper threads in + # roughly a 1:4 ratio to the number of target threads. For example, + # when using 16 targets it is recommended to set nr_xs_helpers to 4. + # + # The server should have sufficiently many physical cores to support the + # number of helper threads, plus the number of targets. + # + # default: 0 (using existing target threads for this task) + + nr_xs_helpers: 4 + + # Pin this engine instance to cores and memory that are local to the + # NUMA node ID specified with this value. + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the pinned_numa_node. + # + # Optional parameter; set either this option or first_core, but not both. + + pinned_numa_node: 0 + + # Offset of the first core to be used for I/O service threads (targets). + # Immutable after running "dmg storage format". + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the first_core. + # + # Optional parameter; set either this option or pinned_numa_node but not both. + + #first_core: 2 + + # A boolean that instructs the I/O Engine instance to bypass the NVMe + # health check. 
This eliminates the check and related log output for those + # systems with NVMe that do not support the device health data query. + + bypass_health_chk: true + + # Specify the fabric network interface used for this engine. + + fabric_iface: ibs1 + + # Specify the fabric network interface port that will be used by this engine. + # The fabric_iface_port must be different for each engine on a DAOS server + # if each engine is assigned to the same fabric_iface. + + fabric_iface_port: 20000 + + # Force specific debug mask for the engine at start up time. + # By default, just use the default debug mask used by DAOS. + # Mask specifies minimum level of message significance to pass to logger. + + # default: ERR + log_mask: INFO + + # Force specific path for DAOS debug logs. + + # default: engine log goes to control_log_file + log_file: /var/log/daos/daos_engine.0.log + + # Pass specific environment variables to the engine process. + # Empty by default. Values should be supplied without encapsulating quotes. + + env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_TIMEOUT=600 + - CRT_CREDIT_EP_CTX=0 + - DAOS_MD_CAP=4096 + - DAOS_DTX_AGG_THD_CNT=16777216 + - DAOS_DTX_AGG_THD_AGE=1200 + - FI_CXI_CQ_FILL_PERCENT=20 + - FI_CXI_OFLOW_BUF_SIZE=8388608 + - FI_CXI_REQ_BUF_MIN_POSTED=8 + - FI_CXI_REQ_BUF_SIZE=8388608 + - FI_CXI_RX_MATCH_MODE=hybrid + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16384 + - FI_VERBS_PREFER_XRC=1 + - UCX_IB_FORK_INIT=n + + storage: + - + # Define a pre-configured mountpoint for storage class memory to be used + # by this engine. + # Path should be unique to engine instance (can use different subdirs). + # Either the specified directory or its parent must be a mount point. + + scm_mount: /mnt/daos/1 + + # Backend SCM device type. Either use PMem (Intel(R) Optane(TM) persistent + # memory) modules configured in interleaved mode or a tmpfs running in RAM. 
+ # Options are: + # - "dcpm" for SCM, scm_size is ignored + # - "ram" to use tmpfs, scm_list is ignored + # Immutable after running "dmg storage format". + + class: ram + + # When class is set to ram, tmpfs will be used instead of dcpm. + # The size of ram is specified by scm_size in GB units and will be automatically calculated + # unless overridden by this optional parameter (units in GiB). + + #scm_size: 0 + scm_size: 160 + + # When class is set to ram, tmpfs will be mounted with hugepage + # support, if the kernel supports it. If this is not desirable, + # the behavior may be disabled here. + scm_hugepages_disabled: false + + - + # Backend block device type. Force a SPDK driver to be used by this engine + # instance. + # Options are: + # - "nvme" for NVMe SSDs (preferred option), bdev_size ignored + # - "file" to emulate a NVMe SSD with a regular file + # - "kdev" to use a kernel block device, bdev_size ignored + # Immutable after running "dmg storage format". + class: nvme + + # Backend block device configuration to be used by this engine instance. + # When class is set to nvme, bdev_list is the list of unique NVMe IDs + # that should be different across different engine instances. + # Immutable after running "dmg storage format". + bdev_list: [ + "0000:5b:00.0", + "0000:5c:00.0", + "0000:6c:00.0", + "0000:6d:00.0", + "0000:6e:00.0", + "0000:6f:00.0", + ] + + # If VMD-enabled NVMe SSDs are used, the bdev_list should consist of the VMD + # PCIe addresses, and not the BDF format transport IDs of the backing NVMe SSDs + # behind the VMD address. Also, 'disable_vmd' needs to be set to false. + #bdev_list: ["0000:5d:05.5"] + + # Optional override, will be automatically generated based on NUMA affinity. + # Filter hot-pluggable devices by PCI bus-ID by specifying a hexadecimal + # range. Hotplug events relating to devices with PCI bus-IDs outside this range + # will not be processed by this engine. Empty or unset range signifies allow all. 
+ #bdev_busid_range: 0x80-0x8f + #bdev_busid_range: 128-143 + + # Optional explicit nvme-class bdev tier role assignments will + # define the roles and responsibilities of this bdev tier. + # If DCPM class is defined for the first tier, + # only one bdev tier is supported and its role must be data. + + # Roles will be derived based on configured bdev + # tiers, if not specified here. You must assign all roles or none. + # Options are: + # - "data" SSDs will be used to store actual data + # - "meta" SSDs will be used to store the VOS metadata + # - "wal" SSDs will be used to store the write-ahead-log + bdev_roles: ['data', 'meta', 'wal'] + + # Set criteria for automatic detection and eviction of faulty NVMe devices. The + # default criteria parameters are `enable: true`, `max_io_errs: 10` and + # `max_csum_errs: ` (essentially eviction due to checksum errors is + # disabled by default). + bdev_auto_faulty: + enable: true + max_io_errs: 100 + max_csum_errs: 200 + +- + # Number of I/O service threads (and network endpoints) per engine. + # Immutable after running "dmg storage format". + # + # Each storage target manages a fraction of the (interleaved) SCM storage space, + # and a fraction of one of the NVMe SSDs that are managed by this engine. + # For optimal balance regarding the NVMe space, the number of targets should be + # an integer multiple of the number of NVMe disks configured in bdev_list: + # To obtain the maximum SCM performance, a certain number of targets is needed. + # This is device- and workload-dependent, but around 16 targets usually work well. + # + # The server should have sufficiently many physical cores to support the + # number of targets, plus the additional service threads. + + targets: 16 + + # Number of additional offload service threads per engine. + # Immutable after running "dmg storage format". + # + # Helper threads to accelerate checksum and server-side RPC dispatch. 
+ # + # The server should have sufficiently many physical cores to support the + # number of helper threads, plus the number of targets. + + nr_xs_helpers: 4 + + # Pin this engine instance to cores and memory that are local to the + # NUMA node ID specified with this value. + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the pinned_numa_node. + # + # Optional parameter; set either this option or first_core, but not both. + + pinned_numa_node: 1 + + # Offset of the first core to be used for I/O service threads (targets). + # Immutable after running "dmg storage format". + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the first_core. + # + # Optional parameter; set either this option or pinned_numa_node but not both. + + #first_core: 64 + + # A boolean that instructs the I/O Engine instance to bypass the NVMe + # health check. This eliminates the check and related log output for those + # systems with NVMe that do not support the device health data query. + + bypass_health_chk: true + + # Specify the fabric network interface used for this engine. + + fabric_iface: ibs1025 + + # Specify the fabric network interface port that will be used by this engine. + # The fabric_iface_port must be different for each engine on a DAOS server + # if each engine is assigned to the same fabric_iface. + + fabric_iface_port: 21000 + + # Force specific debug mask for the engine at start up time. + # By default, just use the default debug mask used by DAOS. + # Mask specifies minimum level of message significance to pass to logger. + + # default: ERR + log_mask: INFO + + # Force specific path for DAOS debug logs. + + # default: engine log goes to control_log_file + log_file: /var/log/daos/daos_engine.1.log + + # Pass specific environment variables to the engine process. + # Empty by default. Values should be supplied without encapsulating quotes. 
+ + env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_TIMEOUT=600 + - CRT_CREDIT_EP_CTX=0 + - DAOS_MD_CAP=4096 + - DAOS_DTX_AGG_THD_CNT=16777216 + - DAOS_DTX_AGG_THD_AGE=1200 + - FI_CXI_CQ_FILL_PERCENT=20 + - FI_CXI_OFLOW_BUF_SIZE=8388608 + - FI_CXI_REQ_BUF_MIN_POSTED=8 + - FI_CXI_REQ_BUF_SIZE=8388608 + - FI_CXI_RX_MATCH_MODE=hybrid + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16384 + - FI_VERBS_PREFER_XRC=1 + - UCX_IB_FORK_INIT=n + + storage: + - + # Define a pre-configured mountpoint for storage class memory to be used + # by this engine. + # Path should be unique to engine instance (can use different subdirs). + # Either the specified directory or its parent must be a mount point. + + scm_mount: /mnt/daos/2 + + # Backend SCM device type. Either use PMem (Intel(R) Optane(TM) persistent + # memory) modules configured in interleaved mode or a tmpfs running in RAM. + # Options are: + # - "dcpm" for SCM, scm_size is ignored + # - "ram" to use tmpfs, scm_list is ignored + # Immutable after running "dmg storage format". + + class: ram + + # When class is set to ram, tmpfs will be used instead of dcpm. + # The size of ram is specified by scm_size in GB units and will be automatically calculated + # unless overridden by this optional parameter (units in GiB). + + #scm_size: 0 + scm_size: 160 + + # When class is set to ram, tmpfs will be mounted with hugepage + # support, if the kernel supports it. If this is not desirable, + # the behavior may be disabled here. + scm_hugepages_disabled: false + + # When class is set to dcpm, scm_list is the list of device paths for + # PMem namespaces (currently only one per engine supported). + #class: dcpm + #scm_list: [/dev/pmem1] + + - + # Backend block device type. Force a SPDK driver to be used by this engine + # instance. 
+ # Options are: + # - "nvme" for NVMe SSDs (preferred option), bdev_size ignored + # - "file" to emulate a NVMe SSD with a regular file + # - "kdev" to use a kernel block device, bdev_size ignored + # Immutable after running "dmg storage format". + + # When class is set to file, Linux AIO will be used to emulate NVMe. + # The size of file that will be created is specified by bdev_size in GB units. + # The location of the files that will be created is specified in bdev_list. + #class: file + #bdev_list: [/tmp/daos-bdev1,/tmp/daos-bdev2] + #bdev_size: 16 + + # When class is set to kdev, bdev_list is the list of unique kernel + # block devices that should be different across different engine instance. + #class: kdev + #bdev_list: [/dev/sdc,/dev/sdd] + + # If Volume Management Devices (VMD) are to be used, then the disable_vmd + # flag needs to be set to false (default). The class will remain the + # default "nvme" type, and bdev_list will include the VMD addresses. + #class: nvme + #bdev_list: ["0000:5d:05.5"] + + class: nvme + bdev_list: [ + "0000:d8:00.0", + "0000:d9:00.0", + "0000:da:00.0", + "0000:db:00.0", + "0000:ea:00.0", + "0000:eb:00.0", + ] + + # Optional override, will be automatically generated based on NUMA affinity. + # Filter hot-pluggable devices by PCI bus-ID by specifying a hexadecimal + # range. Hotplug events relating to devices with PCI bus-IDs outside this range + # will not be processed by this engine. Empty or unset range signifies allow all. + #bdev_busid_range: 0xd0-0xdf + #bdev_busid_range: 208-223 + + # See about bdev_roles above. + bdev_roles: ['data', 'meta', 'wal'] + + # Disable automatic detection and eviction of faulty NVMe devices. The default + # criteria parameters are `enable: true`, `max_io_errs: 10` and + # `max_csum_errs: ` (essentially eviction due to checksum errors is + # disabled by default). 
+ + bdev_auto_faulty: + enable: true + max_io_errs: 100 + max_csum_errs: 200 + +#support_config: +# +## Override the default file transfer mechanism for dmg support collect-log by supplying +## the path to an alternative script or binary that will be used to copy logs off the servers. +## Note that the --transfer-args flag to the collect-log command may be used to supply +## extra runtime arguments used by the copy tool (e.g. a cloud bucket name, etc.) +## file_transfer_exec: /usr/bin/rsync (example) +# +# file_transfer_exec: diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..c407299 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,41 @@ + ############################################################################## + # Multi-rail + ############################################################################## + net.ipv4.conf.ibs1.accept_local=1 + net.ipv4.conf.ibs1.rp_filter=2 + net.ipv4.conf.ibs1.arp_ignore=2 + net.ipv4.conf.ibs1.arp_announce=0 + net.ipv4.conf.ibs1.arp_filter=0 + + net.ipv4.conf.ibs769.accept_local=1 + net.ipv4.conf.ibs769.rp_filter=2 + net.ipv4.conf.ibs769.arp_ignore=2 + net.ipv4.conf.ibs769.arp_announce=0 + net.ipv4.conf.ibs769.arp_filter=0 + + ############################################################################## + # Common tuning + ############################################################################## + net.core.netdev_max_backlog = 250000 + #net.core.rmem_max = 16777216 + net.core.rmem_max = 134217728 + #net.core.wmem_max = 16777216 + net.core.wmem_max = 134217728 + net.core.rmem_default = 16777216 + net.core.wmem_default = 16777216 + net.core.optmem_max = 16777216 + + net.ipv4.neigh.default.gc_thresh1 = 1024 + net.ipv4.neigh.default.gc_thresh2 = 4096 + net.ipv4.neigh.default.gc_thresh3 = 16384 + +
net.ipv4.tcp_timestamps = 1 + net.ipv4.tcp_sack = 1 + net.ipv4.tcp_mem = 16777216 16777216 16777216 + net.ipv4.tcp_rmem = 4096 1677216 134217728 + net.ipv4.tcp_wmem = 4096 1677216 134217728 + net.ipv4.tcp_low_latency = 1 + net.ipv4.tcp_mtu_probing = 1 + net.ipv4.tcp_congestion_control=bbr + net.core.default_qdisc=fq + net.ipv4.tcp_slow_start_after_idle=0 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/sysctl.d/99-daos-vm.conf b/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/sysctl.d/99-daos-vm.conf new file mode 100644 index 0000000..bc13d41 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-03/etc/sysctl.d/99-daos-vm.conf @@ -0,0 +1 @@ + vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/daos/daos_agent.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/daos/daos_agent.yml new file mode 100644 index 0000000..7b5ca6b --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/daos/daos_agent.yml @@ -0,0 +1,183 @@ +# DAOS agent configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the daos_agent command line. +# Otherwise, /etc/daos/daos_agent.yml is used. +# +# Section describing the daos_agent configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. +# +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: changing the name is not supported yet, it must be daos_server +# +# default: daos_server +name: daos_server + +# Management server access points +# Must have the same value for all agents and servers in a system. 
+# default: hostname of this node +access_points: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +# Force different port number to connect to access points. +# default: 10001 +port: 10001 + +## Enable HTTP endpoint for remote telemetry collection. +# Note that enabling the endpoint automatically enables +# client telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9192 +#telemetry_port: 9192 + +## Enable client telemetry for all DAOS clients. +# If false, clients will need to optionally enable telemetry by setting +# the D_CLIENT_METRICS_ENABLE environment variable to true. +# +## default: false +#telemetry_enabled: true + +## Retain client telemetry for a period of time after the client +# process exits. +# +## default 0 (do not retain telemetry after client exit) +#telemetry_retain: 1m + +## Enable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_disabled_procs. +# +## default: not set +#telemetry_enabled_procs: ^dfuse$ + +## Disable client telemetry for a subset of DAOS clients matching the +# supplied regular expression pattern. May not be used with +# telemetry_enabled_procs. +# +## default: not set +#telemetry_disabled_procs: ^spambot-.* + +## Configuration for user credential management. +#credential_config: +# # If the agent should be able to resolve unknown client uids and gids +# # (e.g. when running in a container) into ACL principal names, then a +# # client user map may be defined. The optional "default" uid is a special +# # case and applies if no other matches are found. +# client_user_map: +# default: +# user: nobody +# group: nobody +# 1000: +# user: ralph +# group: stanley +# +# # Optionally cache generated credentials with the specified cache +# # lifetime. By default, a credential is generated for every client +# # process that connects to a pool. 
If the credential cache is +# # enabled, then local client processes connecting with stable +# # uid:gid associations may take advantage of the cached credential +# # and reduce some agent overhead. For heavily-loaded client nodes +# # with many frequent (e.g. hundreds per minute) client connections, +# # a lifetime of 1-5 minutes may be a reasonable tradeoff between +# # performance and responsiveness to user/group database updates. +# # If no expiration is set, credential caching is not enabled. +# cache_expiration: 1m + +## Configuration for SSL certificates used to secure management traffic +# and authenticate/authorize management components. +transport_config: + # In order to disable transport security, uncomment and set allow_insecure + # to true. Not recommended for production configurations. + allow_insecure: false + + # Custom CA Root certificate for generated certs + ca_cert: /etc/daos/certs/daosCA.crt + # Agent certificate for use in TLS handshakes + cert: /etc/daos/certs/agent.crt + # Key portion of Agent Certificate + key: /etc/daos/certs/agent.key + +# Use the given directory for creating unix domain sockets +# +# NOTE: Do not change this when running under systemd control. If it needs to +# be changed, then make sure that it matches the RuntimeDirectory setting +# in /usr/lib/systemd/system/daos_agent.service +# +# default: /var/run/daos_agent +#runtime_dir: /var/run/daos_agent + +# Full path and name of the DAOS agent logfile. +# default: print to stderr +log_file: /var/log/daos/daos_agent.log + +# Force specific debug mask for daos_agent (control plane). +# Mask specifies minimum level of message significance to pass to logger. +# Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. + +# default: INFO +control_log_mask: DEBUG + +# Disable automatic eviction of open pool handles on agent shutdown. By default, +# the agent will evict all open pool handles for local processes on shutdown. 
+# Note that this implies that stopping or restarting the agent will result +# in interruption of DAOS I/O for any local DAOS client processes that have +# an open pool handle. +# default: false +disable_auto_evict: true + +# If enabled, the agent will evict any open pool handles associated with this machine on agent +# startup. This allows the servers to reclaim resources that may not have been properly cleaned +# up in the event of an agent or machine crash. +# default: false +enable_evict_on_start: true + +## Disable the agent's internal caches. If set to true, the agent will query the +## server access point and local hardware data every time a client requests +## rank connection information. +# +## default: false +disable_caching: true + +## Automatically expire the agent's remote cache after a period of time defined in +## minutes. It will refresh the data the next time it is requested. +# +## default: 0 (never expires) +#cache_expiration: 30 + +## Ignore a subset of fabric interfaces when selecting an interface for client +## applications. (Mutually exclusive with include). +# +#exclude_fabric_ifaces: ["lo", "eth1"] + +# Conversely, only consider a specific set of fabric interfaces when selecting +# an interface for client applications. (Mutually exclusive with exclude). + +include_fabric_ifaces: ["ibs1", "ibs1025"] + +# Manually define the fabric interfaces and domains to be used by the agent, +# organized by NUMA node. +# If not defined, the agent will automatically detect all fabric interfaces and +# select appropriate ones based on the server preferences. 
+ +fabric_ifaces: +- + numa_node: 0 + devices: + - + iface: ibs1 + domain: mlx5_0 +- + numa_node: 1 + devices: + - + iface: ibs1025 + domain: mlx5_1 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/daos/daos_control.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/daos/daos_control.yml new file mode 100644 index 0000000..f3f2c79 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/daos/daos_control.yml @@ -0,0 +1,50 @@ +# DAOS manager (dmg) configuration file. +# +# Location of this configuration file is determined by first checking for the +# path specified through the -o option of the dmg command line. +# Otherwise, /etc/daos/daos_control.yml is used. +# +# Section describing the DAOS manager (dmg) configuration +# +# Although not supported for now, one might want to connect to multiple +# DAOS installations from the same node in the future. +# +# Specify the associated DAOS systems. +# Name must match name specified in the daos_server.yml file on the server. +# +# NOTE: Changing the name is not supported yet, it must be daos_server +# +# default: daos_server +#name: daos_server +name: daos_server + +# Default destination port to use when connecting to hosts in the hostlist. +# default: 10001 +port: 10001 + +# Hostlist, a comma separated list of addresses (hostnames or IPv4 addresses). +# default: ['localhost'] +hostlist: [ + "MantaStorage-01", + "MantaStorage-02", + "MantaStorage-03", + "MantaStorage-04" +] + +## Transport Credentials Specifying certificates to secure communications + +transport_config: +# # In order to disable transport security, uncomment and set allow_insecure +# # to true. Not recommended for production configurations. 
+# allow_insecure: false + allow_insecure: false +# +# # Custom CA Root certificate for generated certs +# ca_cert: /etc/daos/certs/daosCA.crt + ca_cert: /etc/daos/certs/daosCA.crt +# # Admin certificate for use in TLS handshakes +# cert: /etc/daos/certs/admin.crt + cert: /etc/daos/certs/admin.crt +# # Key portion of Admin Certificate +# key: /etc/daos/certs/admin.key + key: /etc/daos/certs/admin.key diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/daos/daos_server.yml b/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/daos/daos_server.yml new file mode 100644 index 0000000..5eceed6 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/daos/daos_server.yml @@ -0,0 +1,675 @@ +## DAOS server configuration file. +# +## Location of this configuration file is determined by first checking for the +## path specified through the -o option of the daos_server command line. +## Otherwise, /etc/daos/daos_server.yml is used. +# +# +## Name associated with the DAOS system. +## Immutable after running "dmg storage format". +# +## NOTE: Changing the DAOS system name is not supported yet. +## It must not be changed from the default "daos_server". +# +## default: daos_server +name: daos_server + +## MS replicas +## Immutable after running "dmg storage format". +# +## To operate, DAOS requires a quorum of Management Service (MS) replica +## hosts to be available. All servers (replica or otherwise) must have the +## same list of replicas in order for the system to operate correctly. Choose +## 3-5 hosts to serve as replicas, preferably not co-located within the same +## fault domains. +## +## Hosts can be specified with or without port. The default port that is set +## up in port: will be used if a port is not specified here. 
+# +## default: hostname of this node +#mgmt_svc_replicas: ['hostname1', 'hostname2', 'hostname3'] +mgmt_svc_replicas: ["MantaStorage-01", "MantaStorage-02", "MantaStorage-03"] + +## Control plane metadata +## Immutable after running "dmg storage format". +# +## Mandatory if MD-on-SSD bdev device roles have been assigned. +## Define a directory or partition/mountpoint to be used as the storage location for +## control plane metadata. The location specified should be persistent across reboots. +# +control_metadata: +# # Directory to store control plane metadata. +# # If device is also defined, this path will be used as the mountpoint. +# path: /home/daos_server/control_meta + path: /var/daos/control_meta +# # Storage partition to be formatted with an ext4 filesystem and mounted for +# # control plane metadata storage. +# device: /dev/sdb1 + +## Default control plane port +# +## Port number to bind daos_server to. This will also be used when connecting +## to MS replicas, unless a port is specified in mgmt_svc_replicas: +# +## default: 10001 +port: 10001 + +## Transport credentials specifying certificates to secure communications +# +#transport_config: +# # In order to disable transport security, uncomment and set allow_insecure +# # to true. Not recommended for production configurations. +# allow_insecure: false +# +# # Location where daos_server will look for Client certificates +# client_cert_dir: /etc/daos/certs/clients +# # Custom CA Root certificate for generated certs +# ca_cert: /etc/daos/certs/daosCA.crt +# # Server certificate for use in TLS handshakes +# cert: /etc/daos/certs/server.crt +# # Key portion of Server Certificate +# key: /etc/daos/certs/server.key + +## Fault domain path +## Immutable after running "dmg storage format". +# +## default: /hostname for a local configuration w/o fault domain +#fault_path: /vcdu0/rack1/hostname + +## Fault domain callback +## Immutable after running "dmg storage format". 
+# +## Path to executable which will return fault domain string. +# +#fault_cb: ./.daos/fd_callback +#fault_cb: /etc/daos/daos_fd_callback.sh + +## Network provider +# +## Set the network provider to be used by all the engines. +## There is no default - run "daos_server network scan" to list the +## providers that are supported on the fabric interfaces. Examples: +## +## ofi+verbs;ofi_rxm for libfabric with Infiniband/RoCE +## ofi+tcp;ofi_rxm for libfabric with non-RDMA-capable fabrics +## +## (Starting with DAOS 2.2, ofi_rxm will be automatically added to the +## libfabric verbs and tcp providers, if not explicitly specified.) +# +#provider: ofi+verbs;ofi_rxm +provider: ofi+verbs;ofi_rxm + +## CART: global RPC timeout +## parameters shared with client. +# +#crt_timeout: 30 + +## CART: Disable SRX +## parameters shared with client. set it to true if network card +## does not support shared receive context, eg intel E810-C. +# +#disable_srx: false + +## CART: Fabric authorization key +## If the fabric requires an authorization key, set it here to +## be used on the server and clients. +# +#fabric_auth_key: foo:bar + +## Core Dump Filter +## Optional filter to control which mappings are written to the core +## dump in the event of a crash. See the following URL for more detail: +## https://man7.org/linux/man-pages/man5/core.5.html +# +#core_dump_filter: 0x13 + +## NVMe SSD exclusion list +## Immutable after running "dmg storage format". +# +## Only use NVMe controllers with specific PCI addresses. +## Excludes drives listed and forces auto-detection to skip those drives. +## default: Use all the NVMe SSDs that don't have active mount points. +# +#bdev_exclude: ["0000:81:00.1"] +bdev_exclude: ["0000:59:00.0"] + +## Disable VFIO Driver +# +## In some circumstances it may be preferable to force SPDK to use the UIO +## driver for NVMe device access even though an IOMMU is available. +## NOTE: Use of the UIO driver requires that daos_server must run as root. 
+# +## default: false +#disable_vfio: true + +## Disable VMD Usage +# +## VMD (Intel Volume Management Devices) is enabled by default but can be +## optionally disabled, in which case VMD backing devices will not be visible. +# +## VMD needs to be available and configured in the system BIOS before it +## can be used. The main use case for VMD is managing NVMe SSD LED activity. +# +## default: false +#disable_vmd: true +disable_vmd: true + +## Disable NVMe SSD Hotplug +# +## NVMe SSD hotplug is enabled by default but can be optionally disabled. +## When enabled, the I/O engine will periodically check for device hot +## plug/remove events and set up or tear down the device automatically. +# +## default: false +#disable_hotplug: true + +## Use Hyperthreads +# +## When Hyperthreading is enabled and supported on the system, this parameter +## defines whether the DAOS service should try to take advantage of +## hyperthreading to schedule a different task on each hardware thread. +## Not supported yet. +# +## default: false +hyperthreads: true + +## Use the given directory for creating unix domain sockets +# +## DAOS Agent and DAOS Server both use unix domain sockets for communication +## with other system components. This setting is the base location to place +## the sockets in. +# +## NOTE: Do not change this when running under systemd control. If it needs to +## be changed, then make sure that it matches the RuntimeDirectory setting +## in /usr/lib/systemd/system/daos_server.service +# +## default: /var/run/daos_server +# +#socket_dir: ./.daos/daos_server + +## Number of hugepages to allocate for DMA buffer memory +# +## Optional parameter that should only be set if overriding the automatically calculated value is +## necessary. Specifies the number (not size) of hugepages to allocate for use by NVMe through +## SPDK. For optimum performance each target requires 1 GiB of hugepage space.
The provided value +## should be calculated by dividing the total amount of hugepage memory required for all targets +## across all engines on a host by the system hugepage size. If not set here, the value will be +## automatically calculated based on the number of targets (using the default system hugepage size). +# +## Example: (2 engines * (16 targets/engine * 1GiB)) / 2MiB hugepage size = 16384 +# +## default: 0 +#nr_hugepages: 0 +#nr_hugepages: 25600 + +## Hugepages are mandatory with NVMe SSDs configured and optional without. +## To disable the use of hugepages when no NVMe SSDs are configured, set disable_hugepages to true. +# +## default: false +disable_hugepages: false + +## Hugepages will be applied across NUMA-nodes based on engine affinity. Typical scenarios where an +## equal number of engines exist across a number of NUMA-nodes or all engines on a single NUMA-node +## will be handled but if it's expected that there are a different number of engines on multiple +## NUMA-nodes e.g. two engines on NUMA-0 and one engine on NUMA-1, then explicitly setting this flag +## will enable engine start-up with an imbalanced configuration. +# +## default: false +#allow_numa_imbalance: false + +## Reserve an amount of RAM for system use when calculating the size of RAM-disks that will be +## created for DAOS I/O engines. Units are in GiB and represent the total RAM that will be +## reserved when calculating RAM-disk sizes for all engines. +# +## Optional parameter that should only be set if the automatically calculated value is unsuitable. +## In situations when a host is running applications alongside DAOS that use a significant amount +## of RAM resulting in MemAvailable value being too low to support the calculated RAM-disk size, +## increasing the value will reduce the calculated size. Alternatively in situations where total +## RAM is low, reducing the value may prevent problems where RAM-disk size calculated is below the +## minimum of 4 GiB.
Increasing the value may help avoid the potential of OOM killer terminating +## engine processes but could also result in stopping DAOS from using available memory resources. +# +## default: 26 +system_ram_reserved: 8 + +## Set specific debug mask for daos_server (control plane). +## The mask specifies minimum level of message significance to pass to logger. +## Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR. +# +## default: INFO +#control_log_mask: ERROR + +## Force specific path for daos_server (control plane) logs. +# +## default: print to stderr +#control_log_file: /var/log/daos/daos_server.log +control_log_file: /var/log/daos/daos_server.log + +## Enable daos_server_helper (privileged helper) logging. +# +## default: disabled (errors only to control_log_file) +#helper_log_file: /var/log/daos/daos_server_helper.log +helper_log_file: /var/log/daos/daos_server_helper.log + +## Enable daos_firmware_helper (privileged helper) logging. +# +## default: disabled (errors only to control_log_file) +#firmware_helper_log_file: /var/log/daos/daos_firmware_helper.log +firmware_helper_log_file: /var/log/daos/daos_firmware_helper.log + +## Enable HTTP endpoint for remote telemetry collection. +# +## default endpoint state: disabled +## default endpoint port: 9191 +telemetry_port: 9191 + +## If desired, a set of client-side environment variables may be +## defined here. Note that these are intended to be defaults and +## may be overridden by manually-set environment variables when +## the client application is launched. +client_env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_CREDIT_EP_CTX=0 + - CRT_MRC_ENABLE=1 + - CRT_TIMEOUT=600 + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16383 + - FI_VERBS_INLINE_SIZE=128 + - FI_VERBS_PREFER_XRC=1 + +## When per-engine definitions exist, auto-allocation of resources is not +## performed. 
Without per-engine definitions, node resources will +## automatically be assigned to engines based on NUMA ratings. +## There will be a one-to-one relationship between engines and sockets. +# +engines: +- + # Number of I/O service threads (and network endpoints) per engine. + # Immutable after running "dmg storage format". + # + # Each storage target manages a fraction of the (interleaved) SCM storage space, + # and a fraction of one of the NVMe SSDs that are managed by this engine. + # For optimal balance regarding the NVMe space, the number of targets should be + # an integer multiple of the number of NVMe disks configured in bdev_list: + # To obtain the maximum SCM performance, a certain number of targets is needed. + # This is device- and workload-dependent, but around 16 targets usually work well. + # + # The server should have sufficiently many physical cores to support the + # number of targets, plus the additional service threads. + # + # Besides the number of CPU cores, this target number is limited by the number of + # SSDs listed in the bdev_list as well. One NVMe SSD in the bdev_list can be assigned + # to at most 64 targets in pmem mode, or 21 targets in md-on-ssd mode (when the SSD + # has three roles), so the maximum target number will be 'number_of_SSD * 64' in + # pmem mode or 'number_of_SSD * 21' in md-on-ssd mode. + + targets: 16 + + # Number of additional offload service threads per engine. + # Immutable after running "dmg storage format". + # + # Helper threads to accelerate checksum and server-side RPC dispatch. + # When using EC, it is recommended to configure helper threads in + # roughly a 1:4 ratio to the number of target threads. For example, + # when using 16 targets it is recommended to set nr_xs_helpers to 4. + # + # The server should have sufficiently many physical cores to support the + # number of helper threads, plus the number of targets. 
+ # + # default: 0 (using existing target threads for this task) + + nr_xs_helpers: 4 + + # Pin this engine instance to cores and memory that are local to the + # NUMA node ID specified with this value. + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the pinned_numa_node. + # + # Optional parameter; set either this option or first_core, but not both. + + pinned_numa_node: 0 + + # Offset of the first core to be used for I/O service threads (targets). + # Immutable after running "dmg storage format". + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the first_core. + # + # Optional parameter; set either this option or pinned_numa_node but not both. + + #first_core: 2 + + # A boolean that instructs the I/O Engine instance to bypass the NVMe + # health check. This eliminates the check and related log output for those + # systems with NVMe that do not support the device health data query. + + bypass_health_chk: true + + # Specify the fabric network interface used for this engine. + + fabric_iface: ibs1 + + # Specify the fabric network interface port that will be used by this engine. + # The fabric_iface_port must be different for each engine on a DAOS server + # if each engine is assigned to the same fabric_iface. + + fabric_iface_port: 20000 + + # Force specific debug mask for the engine at start up time. + # By default, just use the default debug mask used by DAOS. + # Mask specifies minimum level of message significance to pass to logger. + + # default: ERR + log_mask: INFO + + # Force specific path for DAOS debug logs. + + # default: engine log goes to control_log_file + log_file: /var/log/daos/daos_engine.0.log + + # Pass specific environment variables to the engine process. + # Empty by default. Values should be supplied without encapsulating quotes. 
+ + env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_TIMEOUT=600 + - CRT_CREDIT_EP_CTX=0 + - DAOS_MD_CAP=4096 + - DAOS_DTX_AGG_THD_CNT=16777216 + - DAOS_DTX_AGG_THD_AGE=1200 + - FI_CXI_CQ_FILL_PERCENT=20 + - FI_CXI_OFLOW_BUF_SIZE=8388608 + - FI_CXI_REQ_BUF_MIN_POSTED=8 + - FI_CXI_REQ_BUF_SIZE=8388608 + - FI_CXI_RX_MATCH_MODE=hybrid + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16384 + - FI_VERBS_PREFER_XRC=1 + - UCX_IB_FORK_INIT=n + + storage: + - + # Define a pre-configured mountpoint for storage class memory to be used + # by this engine. + # Path should be unique to engine instance (can use different subdirs). + # Either the specified directory or its parent must be a mount point. + + scm_mount: /mnt/daos/1 + + # Backend SCM device type. Either use PMem (Intel(R) Optane(TM) persistent + # memory) modules configured in interleaved mode or a tmpfs running in RAM. + # Options are: + # - "dcpm" for SCM, scm_size is ignored + # - "ram" to use tmpfs, scm_list is ignored + # Immutable after running "dmg storage format". + + class: ram + + # When class is set to ram, tmpfs will be used instead of dcpm. + # The size of ram is specified by scm_size in GB units and will be automatically calculated + # unless overridden by this optional parameter (units in GiB). + + #scm_size: 0 + scm_size: 160 + + # When class is set to ram, tmpfs will be mounted with hugepage + # support, if the kernel supports it. If this is not desirable, + # the behavior may be disabled here. + scm_hugepages_disabled: false + + - + # Backend block device type. Force a SPDK driver to be used by this engine + # instance. + # Options are: + # - "nvme" for NVMe SSDs (preferred option), bdev_size ignored + # - "file" to emulate a NVMe SSD with a regular file + # - "kdev" to use a kernel block device, bdev_size ignored + # Immutable after running "dmg storage format". + class: nvme + + # Backend block device configuration to be used by this engine instance. 
+ # When class is set to nvme, bdev_list is the list of unique NVMe IDs + # that should be different across different engine instances. + # Immutable after running "dmg storage format". + bdev_list: [ + "0000:5b:00.0", + "0000:5c:00.0", + "0000:6c:00.0", + "0000:6d:00.0", + "0000:6e:00.0", + "0000:6f:00.0", + ] + + # If VMD-enabled NVMe SSDs are used, the bdev_list should consist of the VMD + # PCIe addresses, and not the BDF format transport IDs of the backing NVMe SSDs + # behind the VMD address. Also, 'disable_vmd' needs to be set to false. + #bdev_list: ["0000:5d:05.5"] + + # Optional override, will be automatically generated based on NUMA affinity. + # Filter hot-pluggable devices by PCI bus-ID by specifying a hexadecimal + # range. Hotplug events relating to devices with PCI bus-IDs outside this range + # will not be processed by this engine. Empty or unset range signifies allow all. + #bdev_busid_range: 0x80-0x8f + #bdev_busid_range: 128-143 + + # Optional explicit nvme-class bdev tier role assignments will + # define the roles and responsibilities of this bdev tier. + # If DCPM class is defined for the first tier, + # only one bdev tier is supported and its role must be data. + + # Roles will be derived based on configured bdev + # tiers, if not specified here. You must assign all roles or none. + # Options are: + # - "data" SSDs will be used to store actual data + # - "meta" SSDs will be used to store the VOS metadata + # - "wal" SSDs will be used to store the write-ahead-log + bdev_roles: ['data', 'meta', 'wal'] + + # Set criteria for automatic detection and eviction of faulty NVMe devices. The + # default criteria parameters are `enable: true`, `max_io_errs: 10` and + # `max_csum_errs: ` (essentially eviction due to checksum errors is + # disabled by default). + bdev_auto_faulty: + enable: true + max_io_errs: 100 + max_csum_errs: 200 + +- + # Number of I/O service threads (and network endpoints) per engine. 
+ # Immutable after running "dmg storage format". + # + # Each storage target manages a fraction of the (interleaved) SCM storage space, + # and a fraction of one of the NVMe SSDs that are managed by this engine. + # For optimal balance regarding the NVMe space, the number of targets should be + # an integer multiple of the number of NVMe disks configured in bdev_list: + # To obtain the maximum SCM performance, a certain number of targets is needed. + # This is device- and workload-dependent, but around 16 targets usually work well. + # + # The server should have sufficiently many physical cores to support the + # number of targets, plus the additional service threads. + + targets: 16 + + # Number of additional offload service threads per engine. + # Immutable after running "dmg storage format". + # + # Helper threads to accelerate checksum and server-side RPC dispatch. + # + # The server should have sufficiently many physical cores to support the + # number of helper threads, plus the number of targets. + + nr_xs_helpers: 4 + + # Pin this engine instance to cores and memory that are local to the + # NUMA node ID specified with this value. + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the pinned_numa_node. + # + # Optional parameter; set either this option or first_core, but not both. + + pinned_numa_node: 1 + + # Offset of the first core to be used for I/O service threads (targets). + # Immutable after running "dmg storage format". + # + # For best performance, it is necessary that the fabric_iface of this engine + # resides on the same NUMA node as the first_core. + # + # Optional parameter; set either this option or pinned_numa_node but not both. + + #first_core: 64 + + # A boolean that instructs the I/O Engine instance to bypass the NVMe + # health check. This eliminates the check and related log output for those + # systems with NVMe that do not support the device health data query. 
+ + bypass_health_chk: true + + # Specify the fabric network interface used for this engine. + + fabric_iface: ibs1025 + + # Specify the fabric network interface port that will be used by this engine. + # The fabric_iface_port must be different for each engine on a DAOS server + # if each engine is assigned to the same fabric_iface. + + fabric_iface_port: 21000 + + # Force specific debug mask for the engine at start up time. + # By default, just use the default debug mask used by DAOS. + # Mask specifies minimum level of message significance to pass to logger. + + # default: ERR + log_mask: INFO + + # Force specific path for DAOS debug logs. + + # default: engine log goes to control_log_file + log_file: /var/log/daos/daos_engine.1.log + + # Pass specific environment variables to the engine process. + # Empty by default. Values should be supplied without encapsulating quotes. + + env_vars: + - CRT_CTX_SHARE_ADDR=1 + - CRT_CTX_NUM=8 + - CRT_TIMEOUT=600 + - CRT_CREDIT_EP_CTX=0 + - DAOS_MD_CAP=4096 + - DAOS_DTX_AGG_THD_CNT=16777216 + - DAOS_DTX_AGG_THD_AGE=1200 + - FI_CXI_CQ_FILL_PERCENT=20 + - FI_CXI_OFLOW_BUF_SIZE=8388608 + - FI_CXI_REQ_BUF_MIN_POSTED=8 + - FI_CXI_REQ_BUF_SIZE=8388608 + - FI_CXI_RX_MATCH_MODE=hybrid + - FI_OFI_RXM_USE_SRX=1 + - FI_UNIVERSE_SIZE=16384 + - FI_VERBS_PREFER_XRC=1 + - UCX_IB_FORK_INIT=n + + storage: + - + # Define a pre-configured mountpoint for storage class memory to be used + # by this engine. + # Path should be unique to engine instance (can use different subdirs). + # Either the specified directory or its parent must be a mount point. + + scm_mount: /mnt/daos/2 + + # Backend SCM device type. Either use PMem (Intel(R) Optane(TM) persistent + # memory) modules configured in interleaved mode or a tmpfs running in RAM. + # Options are: + # - "dcpm" for SCM, scm_size is ignored + # - "ram" to use tmpfs, scm_list is ignored + # Immutable after running "dmg storage format". 
+ + class: ram + + # When class is set to ram, tmpfs will be used instead of dcpm. + # The size of ram is specified by scm_size in GB units and will be automatically calculated + # unless overridden by this optional parameter (units in GiB). + + #scm_size: 0 + scm_size: 160 + + # When class is set to ram, tmpfs will be mounted with hugepage + # support, if the kernel supports it. If this is not desirable, + # the behavior may be disabled here. + scm_hugepages_disabled: false + + # When class is set to dcpm, scm_list is the list of device paths for + # PMem namespaces (currently only one per engine supported). + #class: dcpm + #scm_list: [/dev/pmem1] + + - + # Backend block device type. Force a SPDK driver to be used by this engine + # instance. + # Options are: + # - "nvme" for NVMe SSDs (preferred option), bdev_size ignored + # - "file" to emulate a NVMe SSD with a regular file + # - "kdev" to use a kernel block device, bdev_size ignored + # Immutable after running "dmg storage format". + + # When class is set to file, Linux AIO will be used to emulate NVMe. + # The size of file that will be created is specified by bdev_size in GB units. + # The location of the files that will be created is specified in bdev_list. + #class: file + #bdev_list: [/tmp/daos-bdev1,/tmp/daos-bdev2] + #bdev_size: 16 + + # When class is set to kdev, bdev_list is the list of unique kernel + # block devices that should be different across different engine instance. + #class: kdev + #bdev_list: [/dev/sdc,/dev/sdd] + + # If Volume Management Devices (VMD) are to be used, then the disable_vmd + # flag needs to be set to false (default). The class will remain the + # default "nvme" type, and bdev_list will include the VMD addresses. 
+ #class: nvme + #bdev_list: ["0000:5d:05.5"] + + class: nvme + bdev_list: [ + "0000:d8:00.0", + "0000:d9:00.0", + "0000:da:00.0", + "0000:db:00.0", + "0000:ea:00.0", + "0000:eb:00.0", + ] + + # Optional override, will be automatically generated based on NUMA affinity. + # Filter hot-pluggable devices by PCI bus-ID by specifying a hexadecimal + # range. Hotplug events relating to devices with PCI bus-IDs outside this range + # will not be processed by this engine. Empty or unset range signifies allow all. + #bdev_busid_range: 0xd0-0xdf + #bdev_busid_range: 208-223 + + # See the description of bdev_roles above. + bdev_roles: ['data', 'meta', 'wal'] + + # Set criteria for automatic detection and eviction of faulty NVMe devices. The default + # criteria parameters are `enable: true`, `max_io_errs: 10` and + # `max_csum_errs: ` (essentially eviction due to checksum errors is + # disabled by default). + + bdev_auto_faulty: + enable: true + max_io_errs: 100 + max_csum_errs: 200 + +#support_config: +# +## Override the default file transfer mechanism for dmg support collect-log by supplying +## the path to an alternative script or binary that will be used to copy logs off the servers. +## Note that the --transfer-args flag to the collect-log command may be used to supply +## extra runtime arguments used by the copy tool (e.g. a cloud bucket name, etc.)
+## file_transfer_exec: /usr/bin/rsync (example) +# +# file_transfer_exec: diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/sysctl.d/99-daos-net.conf b/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/sysctl.d/99-daos-net.conf new file mode 100644 index 0000000..c407299 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/sysctl.d/99-daos-net.conf @@ -0,0 +1,41 @@ + ############################################################################## + # Multi-rail + ############################################################################## + net.ipv4.conf.ibs1.accept_local=1 + net.ipv4.conf.ibs1.rp_filter=2 + net.ipv4.conf.ibs1.arp_ignore=2 + net.ipv4.conf.ibs1.arp_announce=0 + net.ipv4.conf.ibs1.arp_filter=0 + + net.ipv4.conf.ibs769.accept_local=1 + net.ipv4.conf.ibs769.rp_filter=2 + net.ipv4.conf.ibs769.arp_ignore=2 + net.ipv4.conf.ibs769.arp_announce=0 + net.ipv4.conf.ibs769.arp_filter=0 + + ############################################################################## + # Common tuning + ############################################################################## + net.core.netdev_max_backlog = 250000 + #net.core.rmem_max = 16777216 + net.core.rmem_max = 134217728 + #net.core.wmem_max = 16777216 + net.core.wmem_max = 134217728 + net.core.rmem_default = 16777216 + net.core.wmem_default = 16777216 + net.core.optmem_max = 16777216 + + net.ipv4.neigh.default.gc_thresh1 = 1024 + net.ipv4.neigh.default.gc_thresh2 = 4096 + net.ipv4.neigh.default.gc_thresh3 = 16384 + + net.ipv4.tcp_timestamps = 1 + net.ipv4.tcp_sack = 1 + net.ipv4.tcp_mem = 16777216 16777216 16777216 + net.ipv4.tcp_rmem = 4096 1677216 134217728 + net.ipv4.tcp_wmem = 4096 1677216 134217728 + net.ipv4.tcp_low_latency = 1 + net.ipv4.tcp_mtu_probing = 1 + net.ipv4.tcp_congestion_control=bbr + net.core.default_qdisc=fq + net.ipv4.tcp_slow_start_after_idle=0 diff --git a/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/sysctl.d/99-daos-vm.conf
b/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/sysctl.d/99-daos-vm.conf new file mode 100644 index 0000000..bc13d41 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/MantaStorage-04/etc/sysctl.d/99-daos-vm.conf @@ -0,0 +1 @@ + vm.swappiness = 1 diff --git a/io500/sc25/tta/mantastorage/servers/create-cont.sh b/io500/sc25/tta/mantastorage/servers/create-cont.sh new file mode 100644 index 0000000..d203216 --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/create-cont.sh @@ -0,0 +1,19 @@ +#!/bin/bash + +D_POOL="$1" +D_CONT="$2" +D_RF="$3" + +echo "daos container destroy $D_POOL $D_CONT --force" +daos container destroy "$D_POOL" "$D_CONT" --force + +for i in $(seq 1 4); +do + NODE_NAME="$(printf "MantaStorage-%02d" "${i}")" + + ssh "$NODE_NAME" "systemctl stop daos_agent"; + ssh "$NODE_NAME" "systemctl start daos_agent"; +done + +echo "daos container create --type=POSIX $D_POOL --properties=rd_fac:${D_RF},compression:off,dedup:off $D_CONT" +daos container create --type=POSIX "$D_POOL" --properties="rd_fac:${D_RF},compression:off,dedup:off" "$D_CONT" diff --git a/io500/sc25/tta/mantastorage/servers/create-pool.sh b/io500/sc25/tta/mantastorage/servers/create-pool.sh new file mode 100644 index 0000000..3d7278c --- /dev/null +++ b/io500/sc25/tta/mantastorage/servers/create-pool.sh @@ -0,0 +1,23 @@ +#!/bin/bash + +D_POOL="$1" +D_SIZE="$2" +D_RATIO="$3" + +for i in $(seq 1 4); +do + NODE_NAME="$(printf "MantaStorage-%02d" "${i}")" + + ssh "$NODE_NAME" "systemctl stop daos_server"; + ssh "$NODE_NAME" "rm -rf /var/daos/control_meta/daos_control"; + ssh "$NODE_NAME" "rm -rf /mnt/daos/*/*"; + ssh "$NODE_NAME" "umount /mnt/daos/*"; + ssh "$NODE_NAME" "systemctl start daos_server"; +done + +dmg storage format + +sleep 60; + +echo "dmg pool create --size=$D_SIZE --mem-ratio=$D_RATIO $D_POOL" +dmg pool create --size="$D_SIZE" --mem-ratio="$D_RATIO" "$D_POOL" diff --git
a/io500/sc25/tta/mantastorage/site.json b/io500/sc25/tta/mantastorage/site.json new file mode 100644 index 0000000..e794070 --- /dev/null +++ b/io500/sc25/tta/mantastorage/site.json @@ -0,0 +1 @@ +{"DATA":{"type":"Site","att":{"abbreviation":"TTA","institution":"Telecommunications Technology Association","nationality":"KOR"},"childs":[{"type":"IO500","att":{"exclusive_cluster_usage":"yes","number_clientNodes":"7","procsPerNode":"144","number_serverNodes":"4","storage type":"Flash","storage net capacity":["164","TiB"],"clients network bandwidth":["175","GiB/s"],"servers network bandwidth":["400","GiB/s"],"type of filesystem":"DAOS"},"childs":[]},{"type":"StorageSystem","att":{"name":"MantaStorage","model":"MT8000","usage":"research/test","overlapping_compute_storage":"no","overlapping_metadata_dataServers":"yes","vendor":"Gluesys","type":"object storage","software":"DAOS","frameworks":"SPDK","client storage access":"library linked to IO500","client mount options":"N/A","version":"DAOS-2.7.101-16.11505.g808afd52","net capacity":["163","TiB"],"Servers":"4"},"childs":[{"type":"DAOS","att":{"Version":"2.7"},"childs":[{"type":"Servers","att":{"name":"MantaStorage-01","model":"SYS-222C-TN","vendor":"Supermicro","count":"1","distribution":"Rocky Linux","distribution version":"9.6","kernel version":"5.14.0-570.52.1.el9_6.x86_64","Cooling":""},"childs":[{"type":"Processor","att":{"sockets":"2","model":"INTEL XEON 6530P","architecture":"x86_64","vendor":"Intel","microarchitecture":"Granite Rapids","cores per socket":"32","threads per core":"2","frequency":["2.3","GHz"]},"childs":[]},{"type":"StorageMedia","att":{"model":"Samsung MZWLR3T8HCLS-00A07","type":"Flash","count":"10","interface":"NVMe","file system":"none","vendor":"Samsung","net capacity":["3.84","TB"]},"childs":[]},{"type":"StorageMedia","att":{"model":"Dapustor","type":"Flash","count":"2","interface":"NVMe","file system":"none","vendor":"Dapustor","net 
capacity":["5.63","TB"]},"childs":[]},{"type":"Memory","att":{"type":"","net capacity":["512","GB"],"protection":""},"childs":[]},{"type":"Interconnect","att":{"model":"Mellanox ConnectX-6","type":"Infiniband","speed":"200 GBit","features":"RDMA","vendor":"Mellanox","links":"1"},"childs":[]},{"type":"Interconnect","att":{"model":"Mellanox ConnectX-6","type":"Infiniband","speed":"200 GBit","features":"RDMA","vendor":"Mellanox","links":"1"},"childs":[]}]},{"type":"Servers","att":{"name":"MantaStorage-02","model":"SYS-222C-TN","vendor":"Supermicro","count":"1","distribution":"Rocky Linux","distribution version":"9.6","kernel version":"5.14.0-570.52.1.el9_6.x86_64","Cooling":""},"childs":[{"type":"Processor","att":{"sockets":"2","model":"INTEL XEON 6530P","architecture":"x86_64","vendor":"Intel","microarchitecture":"Granite Rapids","cores per socket":"32","threads per core":"2","frequency":["2.3","GHz"]},"childs":[]},{"type":"StorageMedia","att":{"model":"Seagate XP7680SE70006","type":"Flash","count":"8","interface":"NVMe","file system":"none","vendor":"Seagate","net capacity":["7.68","TB"]},"childs":[]},{"type":"StorageMedia","att":{"model":"Dapustor DPRD3108T0T507T6000","type":"Flash","count":"2","interface":"NVMe","file system":"none","vendor":"Dapustor","net capacity":["5.63","TB"]},"childs":[]},{"type":"StorageMedia","att":{"model":"Samsung MZWLR3T8HCLS-00A07","type":"Flash","count":"2","interface":"NVMe","file system":"none","vendor":"Samsung","net capacity":["3.84","TB"]},"childs":[]},{"type":"Memory","att":{"type":"","net capacity":["512","GB"],"protection":""},"childs":[]},{"type":"Interconnect","att":{"model":"Mellanox ConnectX-6","type":"Infiniband","speed":"200 GBit","features":"RDMA","vendor":"Mellanox","links":"1"},"childs":[]},{"type":"Interconnect","att":{"model":"Mellanox ConnectX-6","type":"Infiniband","speed":"200
GBit","features":"RDMA","vendor":"Mellanox","links":"1"},"childs":[]}]},{"type":"Servers","att":{"name":"MantaStorage-03","model":"SYS-222C-TN","vendor":"Supermicro","count":"1","distribution":"Rocky Linux","distribution version":"9.6","kernel version":"5.14.0-570.52.1.el9_6.x86_64","Cooling":""},"childs":[{"type":"Processor","att":{"sockets":"2","model":"INTEL XEON 6520P","architecture":"x86_64","vendor":"Intel","microarchitecture":"Granite Rapids","cores per socket":"24","threads per core":"2","frequency":["2.4","GHz"]},"childs":[]},{"type":"StorageMedia","att":{"model":"SAMSUNG MZQL23T8HCLS-00A07","type":"Flash","count":"8","interface":"NVMe","file system":"none","vendor":"Samsung","net capacity":["3.84","TB"]},"childs":[]},{"type":"StorageMedia","att":{"model":"Seagate XP7680SE70006","type":"Flash","count":"2","interface":"NVMe","file system":"none","vendor":"Seagate","net capacity":["7.68","TB"]},"childs":[]},{"type":"StorageMedia","att":{"model":"Dapustor DPRD3108T0T507T6000","type":"Flash","count":"2","interface":"NVMe","file system":"none","vendor":"Dapustor","net capacity":["5.63","TB"]},"childs":[]},{"type":"Memory","att":{"type":"","net capacity":["512","GB"],"protection":""},"childs":[]},{"type":"Interconnect","att":{"model":"Mellanox ConnectX-6","type":"Infiniband","speed":"200 GBit","features":"RDMA","vendor":"Mellanox","links":"1"},"childs":[]},{"type":"Interconnect","att":{"model":"Mellanox ConnectX-6","type":"Infiniband","speed":"200 GBit","features":"RDMA","vendor":"Mellanox","links":"1"},"childs":[]}]},{"type":"Servers","att":{"name":"MantaStorage-04","model":"SYS-222C-TN","vendor":"Supermicro","count":"1","distribution":"Rocky Linux","distribution version":"9.6","kernel version":"5.14.0-570.52.1.el9_6.x86_64","Cooling":""},"childs":[{"type":"Processor","att":{"sockets":"2","model":"INTEL XEON 6520P","architecture":"x86_64","vendor":"Intel","microarchitecture":"Granite Rapids","cores per socket":"24","threads per 
core":"2","frequency":["2.4","GHz"]},"childs":[]},{"type":"StorageMedia","att":{"model":"Samsung MZQL23T8HCLS-00A07","type":"Flash","count":"8","interface":"NVMe","file system":"none","vendor":"Samsung","net capacity":["3.84","TB"]},"childs":[]},{"type":"StorageMedia","att":{"model":"Seagate XP7680SE70006","type":"Flash","count":"2","interface":"NVMe","file system":"none","vendor":"Seagate","net capacity":["7.68","TB"]},"childs":[]},{"type":"StorageMedia","att":{"model":"Dapustor DPRD3108T0T507T6000","type":"Flash","count":"2","interface":"NVMe","file system":"none","vendor":"Dapustor","net capacity":["5.63","TB"]},"childs":[]},{"type":"Memory","att":{"type":"","net capacity":["512","GB"],"protection":""},"childs":[]},{"type":"Interconnect","att":{"model":"Mellanox ConnectX-6","type":"Infiniband","speed":"200 GBit","features":"RDMA","vendor":"Mellanox","links":"1"},"childs":[]},{"type":"Interconnect","att":{"model":"Mellanox ConnectX-6","type":"Infiniband","speed":"200 GBit","features":"RDMA","vendor":"Mellanox","links":"1"},"childs":[]}]}]}]},{"type":"Supercomputer","att":{"name":"ESC8000-E11","total nodes":"7","total cores":"672","memory capacity":["7","TiB"]},"childs":[{"type":"Nodes","att":{"name":"ESC8000-E11","vendor":"ASUSTeK COMPUTER INC.","count":"7","distribution":"Rocky Linux","distribution version":"9.6","kernel version":"5.14.0-570.52.1.el9_6.x86_64","Cooling":""},"childs":[{"type":"Processor","att":{"sockets":"2","model":"INTEL XEON PLATINUM 8558","architecture":"x86_64","vendor":"","microarchitecture":"Emerald Rapids","cores per socket":"48","threads per core":"96","frequency":["2.1","GHz"]},"childs":[]},{"type":"Memory","att":{"type":"","net capacity":["1","TiB"],"protection":""},"childs":[]},{"type":"Interconnect","att":{"model":"Mellanox ConnectX-5","type":"Infiniband","speed":"100 GBit","features":"RDMA","vendor":"Mellanox","links":"1"},"childs":[]},{"type":"Interconnect","att":{"model":"Mellanox ConnectX-5","type":"Infiniband","speed":"100 
GBit","features":"RDMA","vendor":"Mellanox","links":"1"},"childs":[]}]}]}]},"GRAPH":{"Network":{"pos":{},"edges":[]},"Building":{"pos":{},"edges":[]}}} \ No newline at end of file