Description
ThreadSanitizer detects data races in TopicPayloadPool during DDS intra-process communication. This affects ROS 2 applications using spin_some on multiple threads with the same DDS domain.
Reported from ROS 2: ros2/rclcpp#2941
TSan Report Summary
The race is between:
- Write by publisher thread:
fastcdr::Cdr::serialize_array → writing to payload buffer
- Read by subscriber thread:
fastcdr::Cdr::deserialize_array → reading from same payload buffer
Both access the same memory block allocated via TopicPayloadPool::do_get_payload.
Root Causes
1. Mutex released before reference increment in do_get_payload()
At TopicPayloadPool.cpp:83, the pool mutex is released before payload_node->reference() is called. A concurrent release_payload() could see ref_count==0 in this window and recycle the node.
2. PayloadNode::reference() uses memory_order_relaxed
The sharing path (get_payload(const SerializedPayload_t&)) increments the reference count with memory_order_relaxed, which provides no happens-before guarantee. The subscriber may not see the publisher's writes to the payload buffer.
Reproducer
From https://gist.github.com/muji-4ok/6b02f0de0a23a6a5cbf7ef07f4ead4fc — a ROS 2 program with one publisher and two subscribers on separate threads, built with -fsanitize=thread.
Proposed Fix
PR #6339 addresses both issues:
- Holds the lock through
reference() and metadata reads in do_get_payload()
- Strengthens memory ordering from
relaxed to acq_rel in PayloadNode::reference()
Description
ThreadSanitizer detects data races in
TopicPayloadPoolduring DDS intra-process communication. This affects ROS 2 applications usingspin_someon multiple threads with the same DDS domain.Reported from ROS 2: ros2/rclcpp#2941
TSan Report Summary
The race is between:
fastcdr::Cdr::serialize_array→ writing to payload bufferfastcdr::Cdr::deserialize_array→ reading from same payload bufferBoth access the same memory block allocated via
TopicPayloadPool::do_get_payload.Root Causes
1. Mutex released before reference increment in
do_get_payload()At
TopicPayloadPool.cpp:83, the pool mutex is released beforepayload_node->reference()is called. A concurrentrelease_payload()could seeref_count==0in this window and recycle the node.2.
PayloadNode::reference()usesmemory_order_relaxedThe sharing path (
get_payload(const SerializedPayload_t&)) increments the reference count withmemory_order_relaxed, which provides no happens-before guarantee. The subscriber may not see the publisher's writes to the payload buffer.Reproducer
From https://gist.github.com/muji-4ok/6b02f0de0a23a6a5cbf7ef07f4ead4fc — a ROS 2 program with one publisher and two subscribers on separate threads, built with
-fsanitize=thread.Proposed Fix
PR #6339 addresses both issues:
reference()and metadata reads indo_get_payload()relaxedtoacq_relinPayloadNode::reference()