Stress testing causes PDBGroupMonitor to segfault #64

@akete

Description

Background: this was first observed on a custom-built IOC using our custom record type support, with Phoebus CSS as the client. Later, I was able to reproduce the problem entirely in softIocPVA with multiple pvget -m clients.

Problem

With QSRV grouping multiple PVs into an NTTable (complete setup attached), and either a CSS screen displaying all components in an X/Y Plot or multiple pvget -m clients (*), the IOC crashes with:

Core was generated by `../../softIocPVA -D /test/softIoc/dbd/softIocPVA.dbd st.cmd'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  epicsMutex::lock (this=this@entry=0x48) at ../osi/epicsMutex.cpp:276
276	../osi/epicsMutex.cpp: No such file or directory.
[Current thread is 1 (Thread 0x7f3fe75fa700 (LWP 23040))]
(gdb) bt
#0  epicsMutex::lock (this=this@entry=0x48) at ../osi/epicsMutex.cpp:276
#1  0x000055e51b191a62 in epicsGuard<epicsMutex>::epicsGuard (mutexIn=..., this=<synthetic pointer>) at /build/3rdparty/epics/base-7/include/epicsGuard.h:143
#2  PDBGroupMonitor::requestUpdate (this=0x7f3fc800d630) at ../pdbgroup.cpp:472
#3  0x000055e51b18c6d9 in BaseMonitor::release (this=<optimized out>, elem=...) at ../../common/pvahelper.h:338
#4  0x000055e51b26e999 in epics::pvAccess::MonitorElement::Ref::reset (this=0x7f3fe75f9ab0) at ../../src/client/pv/monitor.h:186
#5  epics::pvAccess::ServerMonitorRequesterImpl::send (this=0x7f3fc800d450, buffer=<optimized out>, control=<optimized out>) at ../../src/server/responseHandlers.cpp:2039
#6  0x000055e51b1f1db0 in epics::pvAccess::detail::AbstractCodec::processSender (this=this@entry=0x7f3fd0000b20, sender=std::shared_ptr<epics::pvAccess::TransportSender> (use count 3, weak count 2) = {...}) at ../../src/remote/codec.cpp:884
#7  0x000055e51b1f31ec in epics::pvAccess::detail::AbstractCodec::processSendQueue (this=0x7f3fd0000b20) at ../../src/remote/codec.cpp:844
#8  0x000055e51b1f37e5 in epics::pvAccess::detail::AbstractCodec::processWrite (this=this@entry=0x7f3fd0000b20) at ../../src/remote/codec.cpp:754
#9  0x000055e51b1f6438 in epics::pvAccess::detail::BlockingTCPTransportCodec::sendThread (this=0x7f3fd0000b20) at ../../src/remote/codec.cpp:1157
#10 0x000055e51b3d9899 in epicsThreadCallEntryPoint (pPvt=0x7f3fd0000ca0) at ../osi/epicsThread.cpp:95
#11 0x000055e51b3dea7a in start_routine (arg=0x7f3fd001a9b0) at ../osi/os/posix/osdThread.c:442
#12 0x00007f40310666db in start_thread (arg=0x7f3fe75fa700) at pthread_create.c:463
#13 0x00007f402fdfb61f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Steps to reproduce

  • start softIocPVA serving the structured PV,
  • start triggering PV processing as fast as possible (e.g. while true; do pvput ....; done) and keep it running until the IOC crashes,
  • start multiple (7+) monitoring clients or open the GUI (see image below or the attached .bob file),
  • wait for a minute,
  • kill the clients or close the GUI.
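The client-side part of the steps above can be sketched as a small script. Note this is only an orchestration sketch: `sleep 60` stands in for the real monitoring client command (`pvget -m myioc:hist:pva` in my setup, with the hammering `pvput` loop running in another terminal), so that the start/kill pattern itself can be run anywhere.

```shell
#!/bin/sh
# Placeholder client: in the real reproduction this would be
# "pvget -m myioc:hist:pva" against the running softIocPVA.
CLIENT="sleep 60"

# Step: start multiple (7+) monitoring clients in parallel.
pids=""
for i in 1 2 3 4 5 6 7; do
    $CLIENT &
    pids="$pids $!"
done

# Step: let them run for a while (shortened here), then kill them
# all at the same moment -- the action that precedes the crash.
sleep 1
kill -9 $pids
wait

# Count any clients still alive (should be none after wait).
left=0
for p in $pids; do
    kill -0 "$p" 2>/dev/null && left=$((left + 1))
done
echo "clients left: $left"
```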

Test setup

NTTable_example.zip

softIocPVA with PV structure:

$ pvget -v  myioc:hist:pva | cut -c 1-50
myioc:hist:pva epics:nt/NTTable:1.0 
    structure record
        structure _options
            uint queueSize 0
            boolean atomic true
    alarm_t alarm 
        int severity 0
        int status 0
        string message NO_ALARM
    structure timeStamp
        long secondsPastEpoch 1737632421
        int nanoseconds 721458675
        int userTag 0
    structure value
        uint[] a [1,0,1,1,1,1,0,1,3,0,0,2,0,1,1,0,
        uint[] b [0,0,2,1,0,0,1,0,0,0,0,1,0,2,4,0,
        uint[] c [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        uint[] d [0,1,0,0,0,2,2,2,0,1,1,0,0,0,1,0,
        uint[] e [1,0,0,1,3,0,0,1,0,0,3,1,1,0,0,0,
        uint[] f [0,1,2,0,0,1,1,0,0,1,1,0,2,1,1,0,
        uint[] g [0,0,2,1,0,0,1,0,0,0,0,1,0,2,4,0,

CSS GUI: (screenshot attached; X/Y Plot displaying all components)

(*) I was able to reproduce this using 7 pvget -m clients and killing them (kill -9) at the same time after a while.

Initial analysis

By adding extra traces to modules/pva2pva/pdbApp/pdbgroup.cpp I was able to confirm that PDBGroupMonitor::release() is being called after PDBGroupMonitor::destroy() has already run. The faulting address in epicsMutex::lock (this=0x48) looks like a member offset applied to a null base pointer, i.e. the mutex is reached through an object that no longer exists. This happens once the IOC becomes overwhelmed and late calls to PDBGroupMonitor::release() start occurring.
