Todo: update
@@ -27,3 +27,82 @@
whenever we wish to release a desc without closing the underlying
fd, because we've discovered that release() doesn't fully clean up
internal metadata.
* There's a bug where deferred production timeslices can result in
  freezing. Explore this and figure out why. When we examined it,
  it didn't appear to be a spinlock deadlock.
  It seems to be reliably reproducible when we use the NVIDIA GTX
  card as our OpenCL ComputeDevice, since the GTX card doesn't
  have unified memory with the host CPU complex. This causes the
  kernels to overrun their timeslices and triggers repeated
  timeslice deferrals.

PcloudStimProducer::stop=>start() sequence:
IoUringAssemblyEngine::finalize():
I'm worried that calling PcloudStimProducer::stop() will leave
in-flight sequences running that remain alive even after
the PcloudStimProducer object itself has been destroyed. This may
be possible for IoUringAssmEngn because it has a running timer
which may well just time out.
* There's no reason to think that an in-flight IoUringAssmEngn
  assembly operation won't actually run until it times out. In
  fact, that's the standard case if you configure
  nDgramsPerFrame to be large enough.
* This means that when we call IoUringAssmEngn::finalize(), an
  in-flight assembly could be going on, which isn't receiving
  any CQE notifications on the eventFd. Thus, that in-flight
  assembly op could plausibly time out and resume execution
  after IoUringAssmEngn::finalize() has completed.
* We ought to do a bridged async timeout for the std::max()
  of all timeouts used by IoUringAssmEngn.
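
A rough sketch of what waiting out the std::max() of the timeouts could look like. Everything here is hypothetical: `worstCaseTimeout`, `InFlightTracker` and its methods are illustrative names, not existing code, and the sketch blocks on a condition variable where the real thing would presumably go through our async bridging machinery.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <initializer_list>
#include <mutex>

using Millis = std::chrono::milliseconds;

// Hypothetical helper: the longest of the timeouts an engine arms
// internally. The caller must pass at least one timeout.
inline Millis worstCaseTimeout(std::initializer_list<Millis> timeouts)
{
    assert(timeouts.size() > 0);
    return std::max(timeouts);
}

// Hypothetical tracker that finalize() could consult before returning.
class InFlightTracker
{
public:
    // Called when an assembly op is launched.
    void opStarted()
    {
        std::lock_guard<std::mutex> lock(mtx_);
        ++nInFlight_;
    }

    // Called when an op completes or times out.
    void opRetired()
    {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            assert(nInFlight_ > 0);
            --nInFlight_;
        }
        cv_.notify_all();
    }

    // Blocks until every op has retired or `limit` elapses.
    // Returns true if nothing is in flight when it returns.
    bool waitForDrain(Millis limit)
    {
        std::unique_lock<std::mutex> lock(mtx_);
        return cv_.wait_for(lock, limit, [this] { return nInFlight_ == 0; });
    }

private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::size_t nInFlight_ = 0;
};
```

The idea would be for finalize() to call `waitForDrain(worstCaseTimeout({...}))` with every timeout the engine uses, so that any op still in flight has either completed or timed out by the time finalize() returns.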

OpenClCollatingAndMeshingEngine::finalize():
I'm also worried, though less so, about the OClCollMeshEngn: it's
a lot less likely to have an in-flight op run past the point where
the OClCollMeshEngn object has expired.
* But there's still a chance that a long-running OCl kernel could
  cause an in-flight async continuation to resume executing after
  its OClCollMeshEngn has expired.
* We should do a bridged async wait for the std::max() of all
  timeouts used by OClCollMeshEngn to pass before leaving
  PcloudStimProducer::stop().
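
As a possible complement to the wait (not a replacement for it, and not something we've decided on), a continuation could also check that its owner is still alive before touching it. A minimal sketch, assuming the state a continuation touches is held by shared_ptr; `EngineState` and `makeGuardedContinuation` are made-up names for illustration only.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for the per-engine state a continuation touches.
struct EngineState
{
    int framesCompleted = 0;
};

// Builds a continuation that only touches the state if it is still alive.
// The continuation returns true if it did its work, false if the owner
// had already expired and the work was dropped.
inline std::function<bool()> makeGuardedContinuation(
    const std::shared_ptr<EngineState>& state)
{
    assert(state != nullptr);
    std::weak_ptr<EngineState> weak = state;
    return [weak]() {
        if (auto alive = weak.lock()) {
            ++alive->framesCompleted;
            return true;
        }
        return false;
    };
}
```

A late continuation then degrades into a no-op instead of touching an expired engine. It doesn't remove the need to wait in stop(), since we still want to know that the ops have actually exited.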

Attaching and detaching StimBuffs from StimProducers:
We've written code recently to attach and detach stimBuffs from a
stimProducer. The code is quite nice, but there's this omen
hanging over it: we put no thought into ensuring that
detachment doesn't cause an in-flight async production op to
access invalid data.

The in-flight async production ops use the SpMcRingbuffs that
inhabit the stimBuffs. If we don't ensure that all in-flight
async ops are retired before we detach a stimBuff from a
producer, we could end up with the producer writing data into
memory which has been reclaimed and repurposed.
Similarly, if we're not careful about the order in which we
assign the stimBuff pointers during attachment, we could
potentially cause producers to see a partially initialized
StimBuff object.

I think this can be solved without locking/synchronization
by being very careful to ensure that by the time
StimProducer::stop() exits, all in-flight production
operations are reasonably sure to be halted. If all
in-flight operations are halted, and if production ops
cannot be launched while a StimBuff is being attached/detached,
then we don't have to worry about accesses to stale StimBuff
instance state or to partially initialized StimBuff instance
state.

So this problem is solved by dealing with the in-flight
cancellation problem described above, concerning
[IoUringAssmEngn|OClCollMeshEngn]::start/stop() and
StimulusBuffer::start/stop(), and ensuring that after
stop() has returned, we can be reasonably sure that all
in-flight ops have exited.
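
To make the invariant concrete, here is a minimal sketch of the gating: attach/detach only succeed while the producer is stopped and nothing is in flight, and ops can only be launched while it is running. `ProducerSketch`, `StimBuffSketch` and all method names are hypothetical, and the sketch assumes start/stop/attach/detach are only ever called from a single control thread. It uses two atomics rather than a lock; the real stop() would additionally wait for the in-flight ops to drain.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a StimBuff.
struct StimBuffSketch
{
    int id = 0;
};

class ProducerSketch
{
public:
    void start() { running_.store(true); }

    // The real stop() would also wait here for in-flight ops to drain.
    void stop() { running_.store(false); }

    // Registers the op first, then backs out if the producer is stopped,
    // so a concurrent attach/detach never misses a live op.
    bool tryLaunchOp()
    {
        nInFlight_.fetch_add(1);
        if (!running_.load()) {
            nInFlight_.fetch_sub(1);
            return false;
        }
        return true;
    }

    void retireOp()
    {
        assert(nInFlight_.load() > 0);
        nInFlight_.fetch_sub(1);
    }

    // Attachment is refused unless the producer is quiescent.
    bool attach(std::shared_ptr<StimBuffSketch> buff)
    {
        if (!quiescent()) { return false; }
        buff_ = std::move(buff);
        return true;
    }

    // Detachment is refused unless the producer is quiescent.
    bool detach()
    {
        if (!quiescent()) { return false; }
        buff_.reset();
        return true;
    }

private:
    bool quiescent() const
    {
        return !running_.load() && nInFlight_.load() == 0;
    }

    std::atomic<bool> running_{false};
    std::atomic<int> nInFlight_{0};
    std::shared_ptr<StimBuffSketch> buff_;
};
```

With this shape, a refused detach() is the signal that stop() returned before the in-flight ops had exited, which is exactly the case the notes above are worried about.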

Making sh_ptr<StimulusBuffer> atomic for mem barriers:
We could also complete our implementation's correctness by converting
the sh_ptrs to StimulusBuffer inside of the PCloudStimulusProducer
into std::atomic<std::shared_ptr<StimulusBuffer>>, and using
std::memory_order_release/memory_order_acquire when writing and
reading them respectively.