146 changes: 146 additions & 0 deletions io500/sc25/tta/mantastorage/README.md
@@ -0,0 +1,146 @@
# IO500-SC25 - TTA - MantaStorage

Reproducibility information for benchmarks submitted to the IO500-SC25
[Production](https://io500.org/list/sc25/production) list and
[Research](https://io500.org/list/sc25/io500) list.

General information on building and running the IO500 benchmarks with DAOS can
be found in the [DAOS Community Wiki](https://daosio.atlassian.net/wiki/spaces/DC/pages/11167301633/IO-500+SC22)
page on IO500.

## IO500.org Reproducibility Questionnaire

Answers to the _IO500.org Reproducibility Questionnaire_ are provided in
the
[io500-reproducibility.tta-mantastorage.md](io500-reproducibility.tta-mantastorage.md)
document. Note that this document covers two different _user-selectable_ data protection
schemes to address the different requirements of the _Production_ and _Research_ lists.

## Institution

[TTA](https://tta.or.kr)

TTA operates this supercomputer as part of the Republic of Korea's
national R&D project, "Development of High-Efficiency Parallel Storage
Software Technology Optimized for AI Computational Accelerators."

## Storage System

The DAOS storage cluster in TTA's MantaStorage system consists of four
Supermicro SYS-222C-TN servers, currently running DAOS version 2.7 (under
development) on Rocky Linux 9.6. The system was deployed and is commercially
supported by [Gluesys](https://www.gluesys.com). The servers are configured as
follows:

- Server 1 (MantaStorage-01)
  - CPU: 2x Intel(R) Xeon(R) 6530P
  - RAM: 8x 64GiB DDR5 Samsung 4800MT/s
  - NVMe
    - 10x Samsung MZWLR3T8HCLS-00A07 3.84TB
    - 2x Dapustor DPRD3108T0T507T6000 5.63TB
  - NIC
    - 2x Mellanox ConnectX-6 1-port Infiniband HDR adapter
- Server 2 (MantaStorage-02)
  - CPU: 2x Intel(R) Xeon(R) 6530P
  - RAM: 8x 64GiB DDR5 Samsung 4800MT/s
  - NVMe
    - 8x Seagate XP7680SE70006 7.68TB
    - 2x Dapustor DPRD3108T0T507T6000 5.63TB
    - 2x Samsung MZWLR3T8HCLS-00A07 3.84TB
  - NIC
    - 2x Mellanox ConnectX-6 1-port Infiniband HDR adapter
- Server 3 (MantaStorage-03)
  - CPU: 2x Intel(R) Xeon(R) 6520P
  - RAM: 8x 64GiB DDR5 Samsung 4800MT/s
  - NVMe
    - 8x Samsung MZQL23T8HCLS-00A07 3.84TB
    - 2x Seagate XP7680SE70006 7.68TB
    - 2x Dapustor DPRD3108T0T507T6000 5.63TB
  - NIC
    - 2x Mellanox ConnectX-6 1-port Infiniband HDR adapter
- Server 4 (MantaStorage-04)
  - CPU: 2x Intel(R) Xeon(R) 6520P
  - RAM: 8x 64GiB DDR5 Samsung 4800MT/s
  - NVMe
    - 8x Samsung MZQL23T8HCLS-00A07 3.84TB
    - 2x Seagate XP7680SE70006 7.68TB
    - 2x Dapustor DPRD3108T0T507T6000 5.63TB
  - NIC
    - 2x Mellanox ConnectX-6 1-port Infiniband HDR adapter
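
Because the four servers carry different drive mixes, the raw NVMe capacity differs per server. The following is a minimal sketch that adds up the drive counts and sizes exactly as printed in the list above. It reports raw capacity only and makes no assumption about how drives are assigned to DAOS roles or how much of that capacity is usable in the pool. The names `NVME_TB` and `raw_capacity_tb` are illustrative and not part of the submission.

```python
# Raw NVMe capacity per server, transcribed from the hardware list above.
# Each entry is (drive count, capacity in TB as printed in the list).
NVME_TB = {
    "MantaStorage-01": [(10, 3.84), (2, 5.63)],
    "MantaStorage-02": [(8, 7.68), (2, 5.63), (2, 3.84)],
    "MantaStorage-03": [(8, 3.84), (2, 7.68), (2, 5.63)],
    "MantaStorage-04": [(8, 3.84), (2, 7.68), (2, 5.63)],
}


def raw_capacity_tb(drives):
    """Sum count * size over one server's drive groups."""
    return sum(count * size for count, size in drives)


# Round to two decimals to hide floating-point noise in the sums.
per_server = {name: round(raw_capacity_tb(d), 2) for name, d in NVME_TB.items()}
total_tb = round(sum(per_server.values()), 2)

for name, tb in per_server.items():
    print(f"{name}: {tb:.2f} TB raw")
print(f"total: {total_tb:.2f} TB raw")
```

With the listed drives this gives 49.66 TB, 80.38 TB, 57.34 TB and 57.34 TB for servers 1 to 4, or 244.72 TB raw in total.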

DAOS storage software references:

* DAOS [github repository](https://github.com/daos-stack/daos)
* DAOS [packages repository](https://packages.daos.io)
* DAOS [documentation](https://docs.daos.io/)
* [SC-Asia 2020 paper](https://doi.org/10.1007/978-3-030-48842-0_3)
_DAOS: A Scale-Out High Performance Storage Stack for Storage Class Memory_
* [SC-Asia 2023 paper](https://doi.org/10.1145/3581576.3581577)
_Understanding DAOS Storage Performance Scalability_

## Client Nodes

The clients in TTA's MantaStorage system are ASUSTeK ESC8000-E11 servers,
currently running DAOS version 2.7 (under development) on Rocky Linux 9.6:

- CPU: 2x Intel(R) Xeon(R) Platinum 8558
- RAM: 16x 64GiB DDR5 Samsung 4800MT/s
- NIC: 2x Mellanox ConnectX-5 1-port Infiniband EDR adapter

## High-Performance Fabric

The HPC fabric is a fully non-blocking HDR InfiniBand network, using a
[Mellanox QM8790](https://docs.nvidia.com/networking/display/qm87xx/introduction) switch.
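
As a rough sanity check, the theoretical peak network bandwidth of the storage servers can be estimated from the hardware list above. The sketch below assumes that each server port runs at the nominal HDR line rate of 200 Gb/s and ignores encoding and protocol overhead, so the result is an upper bound and not a measured value. The helper name `aggregate_gbps` is illustrative.

```python
# Nominal InfiniBand HDR line rate per port, in Gb/s (assumption: full HDR,
# not HDR100, as the server hardware list states "HDR").
HDR_GBPS = 200

SERVERS = 4           # from the Storage System section
PORTS_PER_SERVER = 2  # two single-port ConnectX-6 HDR adapters per server


def aggregate_gbps(nodes, ports_per_node, port_gbps):
    """Theoretical aggregate line rate across all ports, in Gb/s."""
    return nodes * ports_per_node * port_gbps


server_gbps = aggregate_gbps(SERVERS, PORTS_PER_SERVER, HDR_GBPS)
server_gbytes_per_s = server_gbps / 8  # 8 bits per byte

print(f"server side: {server_gbps} Gb/s ({server_gbytes_per_s:.0f} GB/s) theoretical peak")
```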

Both servers and clients use two single-port Mellanox ConnectX InfiniBand
adapters (one per CPU socket); the adapter models are listed in the Storage
System and Client Nodes sections above. On the servers, each port is managed by
a dedicated `daos_engine` running on that CPU socket. On the clients, each MPI
task communicates through the IB interface on the same NUMA node.
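
The NUMA-to-interface association on the clients can be illustrated with a small lookup. The sketch below mirrors the `fabric_ifaces` section of the Client-2 `daos_agent.yml` included in this submission. It is not how the DAOS agent itself is implemented: `build_numa_map` and `local_iface` are hypothetical helpers, and the behavior for NUMA nodes without a listed interface is not described in this document, so the sketch simply returns `None` in that case.

```python
# Mirrors the fabric_ifaces section of the Client-2 daos_agent.yml in this
# submission: one InfiniBand interface per listed NUMA node.
FABRIC_IFACES = [
    {"numa_node": 0, "devices": [{"iface": "ibs255", "domain": "mlx5_0"}]},
    {"numa_node": 3, "devices": [{"iface": "ibp200s0", "domain": "mlx5_1"}]},
]


def build_numa_map(fabric_ifaces):
    """Return {numa_node: [(iface, domain), ...]} from the config structure."""
    return {
        entry["numa_node"]: [(d["iface"], d["domain"]) for d in entry["devices"]]
        for entry in fabric_ifaces
    }


def local_iface(numa_map, numa_node):
    """Hypothetical helper: first interface on numa_node, or None if none is listed."""
    devices = numa_map.get(numa_node)
    return devices[0] if devices else None


numa_map = build_numa_map(FABRIC_IFACES)
for node in (0, 3):
    print(node, local_iface(numa_map, node))
```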

## Execution Environment

All servers and clients were installed with the following software stack:

- Rocky Linux 9.6 (Kernel version 5.14.0-570.52.1.el9_6.x86_64)
- [MLNX_OFED_LINUX-24.10-3.2.5.0](https://docs.nvidia.com/networking/display/mlnxofedv24103250lts/release+notes) on the DAOS servers and clients.
- [DAOS 2.7 (under development)](https://github.com/daos-stack/daos/tree/808afd521bb41f3b0e08b43b2b5bae521ed00bd2)
- [MPICH 4.2.3](https://www.mpich.org/downloads/)

The following DAOS server and client configuration files were used.

### DAOS Server configuration

- [/etc/daos/daos_server.yml](servers/MantaStorage-01/etc/daos/daos_server.yml)
- [/etc/daos/daos_control.yml](servers/MantaStorage-01/etc/daos/daos_control.yml)
- [/etc/daos/daos_agent.yml](servers/MantaStorage-01/etc/daos/daos_agent.yml)
- [/etc/sysctl.d/99-daos-net.conf](servers/MantaStorage-01/etc/sysctl.d/99-daos-net.conf)

For the IO500 benchmarks, one storage pool was created that spans all 4
servers, using the [create-pool.sh](servers/create-pool.sh) script. In that pool, a DAOS POSIX
container was created using the [create-cont.sh](servers/create-cont.sh) script.

### DAOS Client environment

Our IO500-SC25 benchmark runs were performed during the deployment stage,
before user operation started. For this reason, the runs were launched with
interactive `mpirun` invocations, using a hostlist to specify the client nodes
described above.

The IO500 run scripts are included in the IO500 results tarballs.

The rules for the IO500 Production list require that the storage system has
no single point of failure. The
[config-all-dfs-rf1-tmpl.ini](config-all-dfs-rf1-tmpl.ini) configuration file
was therefore used for Production runs. It protects against single faults by using
2-way replication for metadata and IOR-Hard, and 2+1P erasure coding for IOR-Easy.
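
Both protection schemes tolerate a single fault, but they differ in how much raw capacity ends up holding user data. The following sketch works out that fraction from the scheme definitions alone (two full copies, or two data cells plus one parity cell). It says nothing about the measured performance of either scheme, and the function names are illustrative.

```python
from fractions import Fraction


def replication_efficiency(copies):
    """Fraction of raw capacity holding unique data with N-way replication."""
    return Fraction(1, copies)


def ec_efficiency(data_cells, parity_cells):
    """Fraction of raw capacity holding data with k+p erasure coding."""
    return Fraction(data_cells, data_cells + parity_cells)


schemes = {
    "2-way replication": replication_efficiency(2),
    "2+1P erasure coding": ec_efficiency(2, 1),
    "unprotected": Fraction(1),
}

for name, eff in schemes.items():
    print(f"{name}: {eff} of raw capacity ({float(eff):.1%})")
```

Under these definitions, 2-way replication stores user data in one half of the raw capacity it consumes, while 2+1P erasure coding stores it in two thirds.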

Submissions to the IO500 Research list use an identical storage system
setup. Because the "no single point of failure" requirement does not apply
to the Research list, the
[config-all-dfs-rf0-tmpl.ini](config-all-dfs-rf0-tmpl.ini) configuration file
was used for Research runs. To maximize achievable performance, this
configuration uses neither replication nor erasure coding.

## IO500 List Entries

- MantaStorage: SC25 Research List #85, submission 770
- MantaStorage-EC: SC25 Production List #113, submission 776
183 changes: 183 additions & 0 deletions io500/sc25/tta/mantastorage/clients/Client-2/etc/daos/daos_agent.yml
@@ -0,0 +1,183 @@
# DAOS agent configuration file.
#
# Location of this configuration file is determined by first checking for the
# path specified through the -o option of the daos_agent command line.
# Otherwise, /etc/daos/daos_agent.yml is used.
#
# Section describing the daos_agent configuration
#
# Although not supported for now, one might want to connect to multiple
# DAOS installations from the same node in the future.
#
# Specify the associated DAOS systems.
# Name must match name specified in the daos_server.yml file on the server.
#
# NOTE: changing the name is not supported yet, it must be daos_server
#
# default: daos_server
name: daos_server

# Management server access points
# Must have the same value for all agents and servers in a system.
# default: hostname of this node
access_points: [
  "MantaStorage-01",
  "MantaStorage-02",
  "MantaStorage-03",
  "MantaStorage-04"
]

# Force different port number to connect to access points.
# default: 10001
port: 10001

## Enable HTTP endpoint for remote telemetry collection.
# Note that enabling the endpoint automatically enables
# client telemetry collection.
#
## default endpoint state: disabled
## default endpoint port: 9192
#telemetry_port: 9192

## Enable client telemetry for all DAOS clients.
# If false, clients will need to optionally enable telemetry by setting
# the D_CLIENT_METRICS_ENABLE environment variable to true.
#
## default: false
#telemetry_enabled: true

## Retain client telemetry for a period of time after the client
# process exits.
#
## default 0 (do not retain telemetry after client exit)
#telemetry_retain: 1m

## Enable client telemetry for a subset of DAOS clients matching the
# supplied regular expression pattern. May not be used with
# telemetry_disabled_procs.
#
## default: not set
#telemetry_enabled_procs: ^dfuse$

## Disable client telemetry for a subset of DAOS clients matching the
# supplied regular expression pattern. May not be used with
# telemetry_enabled_procs.
#
## default: not set
#telemetry_disabled_procs: ^spambot-.*

## Configuration for user credential management.
#credential_config:
#  # If the agent should be able to resolve unknown client uids and gids
#  # (e.g. when running in a container) into ACL principal names, then a
#  # client user map may be defined. The optional "default" uid is a special
#  # case and applies if no other matches are found.
#  client_user_map:
#    default:
#      user: nobody
#      group: nobody
#    1000:
#      user: ralph
#      group: stanley
#
#  # Optionally cache generated credentials with the specified cache
#  # lifetime. By default, a credential is generated for every client
#  # process that connects to a pool. If the credential cache is
#  # enabled, then local client processes connecting with stable
#  # uid:gid associations may take advantage of the cached credential
#  # and reduce some agent overhead. For heavily-loaded client nodes
#  # with many frequent (e.g. hundreds per minute) client connections,
#  # a lifetime of 1-5 minutes may be a reasonable tradeoff between
#  # performance and responsiveness to user/group database updates.
#  # If no expiration is set, credential caching is not enabled.
#  cache_expiration: 1m
#
## Configuration for SSL certificates used to secure management traffic
# and authenticate/authorize management components.
transport_config:
# # In order to disable transport security, uncomment and set allow_insecure
# # to true. Not recommended for production configurations.
  allow_insecure: false

# # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/certs/daosCA.crt
# # Agent certificate for use in TLS handshakes
  cert: /etc/daos/certs/agent.crt
# # Key portion of Agent Certificate
  key: /etc/daos/certs/agent.key

# Use the given directory for creating unix domain sockets
#
# NOTE: Do not change this when running under systemd control. If it needs to
# be changed, then make sure that it matches the RuntimeDirectory setting
# in /usr/lib/systemd/system/daos_agent.service
#
# default: /var/run/daos_agent
runtime_dir: /var/run/daos_agent

## Full path and name of the DAOS agent logfile.
## default: print to stderr
log_file: /var/log/daos/daos_agent.log

## Force specific debug mask for daos_agent (control plane).
## Mask specifies minimum level of message significance to pass to logger.
## Currently supported values are DISABLED, TRACE, DEBUG, INFO, NOTICE and ERROR.
#
## default: INFO
control_log_mask: INFO

## Disable automatic eviction of open pool handles on agent shutdown. By default,
## the agent will evict all open pool handles for local processes on shutdown.
## Note that this implies that stopping or restarting the agent will result
## in interruption of DAOS I/O for any local DAOS client processes that have
## an open pool handle.
## default: false
disable_auto_evict: true

## If enabled, the agent will evict any open pool handles associated with this machine on agent
## startup. This allows the servers to reclaim resources that may not have been properly cleaned
## up in the event of an agent or machine crash.
## default: false
enable_evict_on_start: true

## Disable the agent's internal caches. If set to true, the agent will query the
## server access point and local hardware data every time a client requests
## rank connection information.
#
## default: false
disable_caching: true

## Automatically expire the agent's remote cache after a period of time defined in
## minutes. It will refresh the data the next time it is requested.
#
## default: 0 (never expires)
#cache_expiration: 30

## Ignore a subset of fabric interfaces when selecting an interface for client
## applications. (Mutually exclusive with include).
#
#exclude_fabric_ifaces: ["lo", "eno1", "eno2", "ens4f0", "ens4f1", "usb0"]

## Conversely, only consider a specific set of fabric interfaces when selecting
## an interface for client applications. (Mutually exclusive with exclude).
#
include_fabric_ifaces: ["ibs255", "ibp200s0"]

# Manually define the fabric interfaces and domains to be used by the agent,
# organized by NUMA node.
# If not defined, the agent will automatically detect all fabric interfaces and
# select appropriate ones based on the server preferences.
#
fabric_ifaces:
-
  numa_node: 0
  devices:
  -
    iface: ibs255
    domain: mlx5_0
-
  numa_node: 3
  devices:
  -
    iface: ibp200s0
    domain: mlx5_1
@@ -0,0 +1,39 @@
##############################################################################
# Multi-rail
##############################################################################
net.ipv4.conf.ibp200s0.accept_local = 1
net.ipv4.conf.ibp200s0.rp_filter = 2
net.ipv4.conf.ibp200s0.arp_ignore = 2
net.ipv4.conf.ibp200s0.arp_announce = 0
net.ipv4.conf.ibp200s0.arp_filter = 0

net.ipv4.conf.ibs255.accept_local = 1
net.ipv4.conf.ibs255.rp_filter = 2
net.ipv4.conf.ibs255.arp_ignore = 2
net.ipv4.conf.ibs255.arp_announce = 0
net.ipv4.conf.ibs255.arp_filter = 0

net.core.netdev_max_backlog = 250000
#net.core.rmem_max = 16777216
net.core.rmem_max = 134217728
#net.core.wmem_max = 16777216
net.core.wmem_max = 134217728
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216

net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 16384

net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 4096 16777216 134217728
net.ipv4.tcp_wmem = 4096 16777216 134217728
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_mtu_probing = 1

net.ipv4.tcp_congestion_control=bbr
net.core.default_qdisc=fq
net.ipv4.tcp_slow_start_after_idle=0
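
The socket buffer limits in this file are plain byte counts, which are easier to review in MiB. The following is a minimal sketch, assuming the usual `key = value` sysctl file syntax with `#` or `;` comment lines. `parse_sysctl` and `to_mib` are hypothetical helpers, and `SAMPLE` reproduces only a few of the lines above.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and '#' or ';' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def to_mib(value):
    """Convert a byte count given as a string to MiB."""
    return int(value) / (1024 * 1024)


# A few lines copied from the file above, including one commented-out entry.
SAMPLE = """\
net.core.rmem_max = 134217728
#net.core.wmem_max = 16777216
net.core.wmem_max = 134217728
net.core.rmem_default = 16777216
net.ipv4.tcp_rmem = 4096 16777216 134217728
"""

settings = parse_sysctl(SAMPLE)
for key in ("net.core.rmem_max", "net.core.wmem_max", "net.core.rmem_default"):
    print(f"{key}: {to_mib(settings[key]):.0f} MiB")

# tcp_rmem holds three values: minimum, default and maximum, in bytes.
tcp_rmem_mib = [to_mib(v) for v in settings["net.ipv4.tcp_rmem"].split()]
print("net.ipv4.tcp_rmem (MiB):", tcp_rmem_mib)
```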
@@ -0,0 +1 @@
vm.swappiness = 1