From edab71b8236d8445ab56a59377fc2c75b2daa933 Mon Sep 17 00:00:00 2001
From: Hayodea Hekol
Date: Sun, 23 Nov 2025 23:13:23 -0400
Subject: [PATCH] Todo: update

---
 todo | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/todo b/todo
index 0363a85..f578ff6 100644
--- a/todo
+++ b/todo
@@ -27,3 +27,82 @@
   whenever we wish to release a desc without closing the
   underlying fd. Because we've discovered that release() doesn't
   fully cleanup internal metadata.
+* There's a bug where deferred production timeslices can result in
+  freezing. Explore this and figure out why. When we examined it,
+  it didn't appear to be a spinlock-deadlock.
+  It seems to be reliably reproducible when we use the NVIDIA GTX
+  card as our OpenCL ComputeDevice, since the GTX card doesn't
+  have unified memory with the host CPU complex. This causes the
+  kernels to overrun their timeslices and triggers repeated
+  timeslice deferrals.
+
+    PcloudStimProducer::stop=>start() sequence:
+    IoUringAssemblyEngine::finalize():
+I'm worried that calling PcloudStimProducer::stop() will leave
+in-flight sequences running which will remain alive even after
+the PcloudStimProducer object itself has been destroyed. This may
+be possible for IoUringAssmEngn because it has a running timer
+which may well just time out.
+* There's no reason to think that an in-flight IoUringAssmEngn
+  assembly operation won't actually run until it times out. In
+  fact, that's the standard case if you configure
+  nDgramsPerFrame to be large enough.
+* This means that when we call IoUringAssmEngn::finalize(), an
+  in-flight assembly could be going on, which isn't receiving
+  any CQE notifications on the eventFd. Thus, that in-flight
+  assembly op could plausibly time out and resume execution
+  after IoUringAssmEngn::finalize() has completed.
+* We ought to do a bridged async timeout for the std::max()
+  of all timeouts used by IoUringAssmEngn.
+
+    OpenClCollatingAndMeshingEngine::finalize():
+I'm also worried, though less so, about the OClCollMeshEngn: it's
+a lot less likely to have an in-flight op run past the point where
+the OClCollMeshEngn object has expired.
+* But there's still a chance that a long-running OCl kernel could
+  cause an in-flight async continuation to resume executing after
+  its OClCollMeshEngn has expired.
+* We should do a bridged async wait for the std::max() of all
+  timeouts used by OClCollMeshEngn to pass before leaving
+  PcloudStimProducer::stop().
+
+    Attaching and detaching StimBuffs from StimProducers:
+We've written code recently to attach and detach stimBuffs from a
+stimProducer. The code is quite nice, but one worry hangs over
+it: we put no thought into ensuring that detachment
+doesn't cause an in-flight async production op to
+access invalid data.
+
+The in-flight async production ops use the SpMcRingbuffs that
+inhabit the stimBuffs. If we don't ensure that all in-flight
+async ops are retired before we detach a stimBuff from a
+producer, we could end up with the producer writing data into
+memory which has been reclaimed and repurposed.
+Similarly, if we're not careful about the order in which we
+assign the stimBuff pointers during attachment, we could
+potentially cause producers to see a partially initialized
+StimBuff object.
+
+I think this can be solved without locking/synchronization
+by being very careful to ensure that, by the time
+StimProducer::stop() exits, all in-flight production
+operations are reasonably sure to be halted. If all
+in-flight operations are halted, and if production ops
+cannot be launched while a StimBuff is being attached/
+detached, then we don't have to worry about accesses
+to stale StimBuff instance state, or accesses to partially
+initialized StimBuff instance state.
+
+So this problem is solved by dealing with the in-flight
+cancelation problem described above, concerning
+[IoUringAssmEngn|OClCollMeshEngn]::start/stop(), and
+StimulusBuffer::start/stop(), and ensuring that after
+stop() has returned, we can be reasonably sure that all
+in-flight ops have exited.
+
+    Making sh_ptr atomic for mem barriers:
+We could also complete our implementation's correctness by converting
+the sh_ptrs to StimulusBuffer inside of the PCloudStimulusProducer
+into std::atomic<std::shared_ptr<StimulusBuffer>>, and using
+std::memory_order_release/memory_order_acquire when writing and
+reading them respectively.
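
Note on the last todo item: a minimal C++ sketch of the release/acquire
idea, offered as an illustration under stated assumptions rather than as
the project's code. StimulusBuffer is reduced to a hypothetical one-field
struct, and StimProducerSketch with its attach()/detach()/acquire()
methods is invented for this note. The sketch uses the C++11 free-function
overloads std::atomic_store_explicit, std::atomic_load_explicit and
std::atomic_exchange_explicit for std::shared_ptr, which build on
pre-C++20 toolchains but are deprecated from C++20 on. With C++20 library
support the slot would instead be declared as
std::atomic<std::shared_ptr<StimulusBuffer>> and use store()/load()/
exchange() with the same memory orders.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the real StimulusBuffer.
struct StimulusBuffer {
    int id;
};

// Hypothetical, reduced producer: only the buffer slot is modelled.
class StimProducerSketch {
public:
    // Control path: publish a fully constructed buffer. The release store
    // makes every write done while building `buf` visible to any thread
    // that later observes the pointer through an acquire load.
    void attach(std::shared_ptr<StimulusBuffer> buf) {
        std::atomic_store_explicit(&slot_, std::move(buf),
                                   std::memory_order_release);
    }

    // Control path: empty the slot and hand the old buffer back to the
    // caller. New production ops see nullptr from this point on.
    std::shared_ptr<StimulusBuffer> detach() {
        return std::atomic_exchange_explicit(
            &slot_, std::shared_ptr<StimulusBuffer>{},
            std::memory_order_acq_rel);
    }

    // Production path: take a private reference for the duration of one
    // op. Returns nullptr when no buffer is attached.
    std::shared_ptr<StimulusBuffer> acquire() const {
        return std::atomic_load_explicit(&slot_, std::memory_order_acquire);
    }

private:
    // Only ever touched through the atomic_* overloads above.
    std::shared_ptr<StimulusBuffer> slot_;
};
```

An op that called acquire() before detach() holds its own reference, so
the buffer's memory is not reclaimed underneath it. That op can still
write into a buffer that is no longer attached, so this complements, and
does not replace, retiring in-flight ops before stop() returns.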