salmanoff/todo
hayodea 313454c426 OClCollMeshEngn: Add bridged delay in finalize()
See the diff of the todo file within this commit for more details.

In short, we do this to prevent the possibility of an in-flight async
contin accessing metadata that we've already destroyed after finalize()
has been called.
2025-11-27 22:26:50 -04:00


* Check through all managed objects and properly refcount them
using shared_ptr.
* Ensure that we comb through the current code and enforce the distinction
between user errors and program exceptions.
* Investigate using UMONITOR/UMWAIT for spinlocks to reduce busy-waiting
stress/power consumption. Look for a parallel on ARM.
* Investigate WFE/SEV to reduce busy-waiting in spinlocks on ARM.
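A rough sketch of what the two spinlock items above could look like, assuming GCC or Clang. The class name WaitingSpinLock and the constant kMaxWaitTsc are hypothetical, not existing code. The UMONITOR/UMWAIT path is only compiled with -mwaitpkg and would still need a runtime CPUID check before real use; without it, x86 falls back to PAUSE.

```cpp
#include <atomic>
#include <cassert>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

// Hypothetical spinlock whose waiters park the core instead of hot-spinning.
class WaitingSpinLock {
public:
    bool tryLock() {
        // exchange() returns the previous value; false means we took the lock.
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!tryLock()) {
            // Wait on plain loads so the cache line isn't hammered with writes.
            while (locked_.load(std::memory_order_relaxed)) {
                waitForChange();
            }
        }
    }

    void unlock() {
        assert(locked_.load(std::memory_order_relaxed) && "unlock of a free lock");
        locked_.store(false, std::memory_order_release);
#if defined(__aarch64__)
        // Make the store visible, then wake any cores parked in WFE.
        __asm__ volatile("dsb ish\n\tsev" ::: "memory");
#endif
    }

private:
    void waitForChange() {
#if defined(__WAITPKG__)
        // Arm the monitor, re-check to avoid a lost wakeup, then wait until
        // the line is written or the TSC deadline passes.
        _umonitor(&locked_);
        if (locked_.load(std::memory_order_relaxed)) {
            _umwait(0, __rdtsc() + kMaxWaitTsc);
        }
#elif defined(__x86_64__) || defined(__i386__)
        _mm_pause();
#elif defined(__aarch64__)
        // Returns on SEV from unlock(), or on an interrupt/event-stream tick.
        __asm__ volatile("wfe" ::: "memory");
#endif
    }

    // Assumed upper bound on one UMWAIT sleep, in TSC ticks; needs tuning.
    static constexpr unsigned long long kMaxWaitTsc = 100000;
    std::atomic<bool> locked_{false};
};
```

The re-check between _umonitor() and _umwait() matters: if the lock is released before the monitor is armed, we'd otherwise sleep until the deadline.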
* The input arg `requiredLocks` to LockSet::LockSet() should be
passed by reference, not by value. Propagate this upward into
SerializedAsyncContin and into all derived classes'
constructors.
* In livoxProto1/device.cpp, migrate the registerUdpCommandHandler() calls
from using the inProgress collection to the per-device collections.
* In cases where we use boost deadline_timers and pass in an async
contin to preserve context across the delay, but the timers aren't
part of a branch pattern, we may still need to call cancel() on the
timers after they expire, just in case boost doesn't clean up the
internal callable that we passed it. Otherwise we'll have circular
shared_ptr references in our continuations.
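To illustrate the cycle this item worries about, here's a std-only sketch. FakeTimer is a hypothetical stand-in that deliberately keeps its handler after firing; it is NOT boost's deadline_timer, and whether boost really retains the callable is exactly the open question. Contin and continuationFreed() are likewise made-up names.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical worst-case timer: the stored handler survives expiry.
struct FakeTimer {
    std::function<void()> handler;
    void asyncWait(std::function<void()> h) { handler = std::move(h); }
    void fire() { if (handler) handler(); }  // handler is retained afterwards
    void cancel() { handler = nullptr; }     // drops the handler and its captures
};

// Continuation that owns its timer and hands the timer a shared_ptr to itself.
struct Contin : std::enable_shared_from_this<Contin> {
    FakeTimer timer;
    bool fired = false;

    void arm() {
        assert(!timer.handler && "timer already armed");
        // Contin -> timer -> handler -> shared_ptr<Contin>: a reference cycle.
        timer.asyncWait([self = shared_from_this()] { self->fired = true; });
    }
};

// Returns true if the continuation is destroyed once we drop our reference.
bool continuationFreed(bool cancelAfterExpiry) {
    auto contin = std::make_shared<Contin>();
    std::weak_ptr<Contin> observer = contin;

    contin->arm();
    contin->timer.fire();
    if (cancelAfterExpiry) {
        contin->timer.cancel();  // breaks the cycle
    }
    contin.reset();

    const bool freed = observer.expired();
    // Clean up the deliberately leaked object so the sketch itself doesn't leak.
    if (auto leaked = observer.lock()) {
        leaked->timer.cancel();
    }
    return freed;
}
```

With this stand-in, continuationFreed(true) reports the object as destroyed and continuationFreed(false) reports it as still alive, which is the leak we'd get if the real timer behaved the same way.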
* UdpCommandDemuxer::registerUdpCommandHandler should accept a pointer
to the io_context of the thread it should post its callbacks to, and
then post callbacks to those io_contexts when UDP cmd responses
come in.
* Consider using huge pages (MAP_HUGETLB with mmap(), or MADV_HUGEPAGE
with madvise()) for both PcloudStimBuff::StagingBuffer and the
PcloudStimulusBuffer's ringbuff.
* We should probably call stream_descriptor::reset() after release()
whenever we wish to release a descriptor without closing the
underlying fd, because we've discovered that release() doesn't fully
clean up internal metadata.
* There's a bug where deferred production timeslices can result in
freezing. Explore this and figure out why. When we examined it,
it didn't appear to be a spinlock-deadlock.
It seems to be reliably reproducible when we use the NVIDIA GTX
card as our OpenCL ComputeDevice, since the GTX card doesn't
have unified memory with the host CPU complex. This causes the
kernels to overrun their timeslices and triggers repeated
timeslice deferrals.
PcloudStimProducer::stop=>start() sequence:
Attaching and detaching StimBuffs from StimProducers:
We've recently written code to attach and detach stimBuffs from a
stimProducer. The code is quite nice, but there's a cloud hanging
over it: we put no thought into ensuring that detachment doesn't
cause an in-flight async production op to access invalid data.
The in-flight async production ops use the SpMcRingbuffs that
inhabit the stimbuffs. If we don't ensure that all in-flight
async ops are retired before we detach a stimbuff from a
producer, we could end up with the producer writing data into
memory which has been reclaimed and repurposed.
Similarly, if we're not careful about the order in which we
assign the stimBuff pointers during attachment, we could
potentially cause producers to see a partially initialized
StimBuff object.
I think this can be solved without locking/synchronization
by being very careful to ensure that by the time that
StimProducer::stop() exits, all in-flight production
operations are reasonably sure to be halted. If all
in-flight operations are halted, and if production ops
cannot be launched while a StimBuff is being attached or
detached, then we don't have to worry about access to
stale StimBuff instance state or to partially initialized
StimBuff instance state.
So this problem is solved by dealing with the in-flight
cancelation problem described above, concerning
[IoUringAssmEngn|OClCollMeshEngn]::start/stop(), and
StimulusBuffer::start/stop(), and ensuring that after
stop() has returned, we can be reasonably sure that all
in-flight ops have exited.
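
One possible way to make "reasonably sure" concrete is to count in-flight ops and have stop() wait for the count to reach zero. This is only a sketch: InFlightGate and its methods are hypothetical names, and it does use two atomics, so it's an alternative to a purely delay-based approach rather than a description of what the code does today.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical gate: ops register before touching a StimBuff, stop() drains them.
class InFlightGate {
public:
    // Called at the start of each production op. Returns false if we're stopping,
    // in which case the op must not touch any StimBuff state.
    bool tryBegin() {
        // Increment first, then check the flag. With seq_cst ordering, either
        // stop() sees our increment and waits, or we see stopping_ and back out.
        inFlight_.fetch_add(1);
        if (stopping_.load()) {
            inFlight_.fetch_sub(1);
            return false;
        }
        return true;
    }

    // Called when a production op that passed tryBegin() has finished.
    void end() {
        const int prev = inFlight_.fetch_sub(1);
        assert(prev > 0 && "end() without matching tryBegin()");
        (void)prev;
    }

    // Refuse new ops, then wait until every registered op has called end().
    void stop() {
        stopping_.store(true);
        while (inFlight_.load() != 0) {
            std::this_thread::yield();
        }
    }

    // Re-open the gate, e.g. after attach/detach has completed.
    void start() { stopping_.store(false); }

    int inFlight() const { return inFlight_.load(); }

private:
    std::atomic<int> inFlight_{0};
    std::atomic<bool> stopping_{false};
};
```

Attach/detach would then only run between stop() returning and start() being called, so no op can observe a stale or partially initialized StimBuff. The increment-before-check order in tryBegin() is what closes the window where an op slips in just as stop() reads a zero count.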