Todo: update
@@ -27,3 +27,82 @@
  whenever we wish to release a desc without closing the underlying
  fd, because we've discovered that release() doesn't fully clean up
  internal metadata.
* There's a bug where deferred production timeslices can result in
  freezing. Explore this and figure out why. When we examined it,
  it didn't appear to be a spinlock deadlock.
  It seems to be reliably reproducible when we use the NVIDIA GTX
  card as our OpenCL ComputeDevice, since the GTX card doesn't
  have unified memory with the host CPU complex. This causes the
  kernels to overrun their timeslices and triggers repeated
  timeslice deferrals.

PcloudStimProducer::stop=>start() sequence:
  IoUringAssemblyEngine::finalize():
    I'm worried that calling PcloudStimProducer::stop() will leave
    in-flight sequences running, which will remain alive even after
    the PcloudStimProducer object itself has been destroyed. This may
    be possible for IoUringAssmEngn because it has a running timer
    which may well just time out.
    * There's no reason to think that an in-flight IoUringAssmEngn
      assembly operation won't actually run until it times out. In
      fact, that's the standard case if you configure
      nDgramsPerFrame to be large enough.
    * This means that when we call IoUringAssmEngn::finalize(), an
      in-flight assembly could be going on which isn't receiving
      any CQE notifications on the eventFd. Thus, that in-flight
      assembly op could plausibly time out and resume execution
      after IoUringAssmEngn::finalize() has completed.
    * We ought to do a bridged async timeout for the std::max()
      of all timeouts used by IoUringAssmEngn.
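A minimal sketch of the last bullet, assuming the engine's timeouts can
be listed up front. EngineTimeouts, its two fields,
worstCaseDrainDelay(), waitForDrain() and the margin parameter are all
hypothetical names, not taken from the codebase, and the blocking sleep
is only a stand-in for whatever the bridged async wait ends up being:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <thread>

using Millis = std::chrono::milliseconds;

// Hypothetical: the note does not list the engine's actual timeouts.
struct EngineTimeouts {
    Millis assemblyTimeout;  // how long an assembly op waits for its dgrams
    Millis cqeWaitTimeout;   // how long a single CQE wait may block
};

// Longest time an in-flight op can stay alive once finalize() is called:
// the std::max() of every timeout, plus an assumed safety margin so the
// timed-out continuation has time to unwind.
inline Millis worstCaseDrainDelay(const EngineTimeouts& t, Millis margin) {
    assert(margin >= Millis::zero());
    return std::max({t.assemblyTimeout, t.cqeWaitTimeout}) + margin;
}

// Stand-in for the bridged wait: blocks the calling thread until no op
// that was in flight at the time of the call can still be pending.
inline void waitForDrain(const EngineTimeouts& t, Millis margin) {
    std::this_thread::sleep_for(worstCaseDrainDelay(t, margin));
}
```

finalize() would call something like waitForDrain() as its last step,
so that returning from it implies every armed timer has had the chance
to fire.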

  OpenClCollatingAndMeshingEngine::finalize():
    I'm also worried, though less so, about the OClCollMeshEngn: it's
    a lot less likely to have an in-flight op run past the point where
    the OClCollMeshEngn object has expired.
    * But there's still a chance that a long-running OCl kernel could
      cause an in-flight async continuation to resume executing after
      its OClCollMeshEngn has expired.
    * We should do a bridged async wait for the std::max() of all
      timeouts used by OClCollMeshEngn to pass before leaving
      PcloudStimProducer::stop().

Attaching and detaching StimBuffs from StimProducers:
  We've recently written code to attach and detach StimBuffs from a
  StimProducer. The code is quite nice, but a worry hangs over it:
  we put no thought into ensuring that detachment doesn't cause an
  in-flight async production op to access invalid data.

  The in-flight async production ops use the SpMcRingbuffs that
  inhabit the StimBuffs. If we don't ensure that all in-flight
  async ops are retired before we detach a StimBuff from a
  producer, we could end up with the producer writing data into
  memory which has been reclaimed and repurposed.
  Similarly, if we're not careful about the order in which we
  assign the StimBuff pointers during attachment, we could
  potentially cause producers to see a partially initialized
  StimBuff object.

  I think this can be solved without locking/synchronization
  by being very careful to ensure that by the time
  StimProducer::stop() exits, all in-flight production
  operations are reasonably sure to be halted. If all
  in-flight operations are halted, and if production ops
  cannot be launched while a StimBuff is being attached/
  detached, then we don't have to worry about accesses
  to stale StimBuff instance state, or accesses to partially
  initialized StimBuff instance state.

  So this problem is solved by dealing with the in-flight
  cancellation problem described above, concerning
  [IoUringAssmEngn|OClCollMeshEngn]::start/stop() and
  StimulusBuffer::start/stop(), and ensuring that after
  stop() has returned, we can be reasonably sure that all
  in-flight ops have exited.
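One way to test the "reasonably sure" claim is a debug-only in-flight
counter that the detach path asserts on. This is a sketch under
assumptions: InFlightCounter, Token, enter(), quiescent() and
assertSafeToDetach() are hypothetical names, and the counter is a
checking aid that adds an atomic of its own, not a replacement for the
lock-free design described above:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical debug aid: counts production ops that have been launched
// but have not yet retired.
class InFlightCounter {
public:
    // RAII token: an op holds one for its whole lifetime, so the count
    // drops even if the op exits early via a timeout path.
    class Token {
    public:
        explicit Token(InFlightCounter& c) : counter_(&c) {
            counter_->count_.fetch_add(1, std::memory_order_acq_rel);
        }
        Token(Token&& other) noexcept : counter_(other.counter_) {
            other.counter_ = nullptr;  // moved-from token no longer counts
        }
        ~Token() {
            if (counter_) {
                counter_->count_.fetch_sub(1, std::memory_order_acq_rel);
            }
        }
        Token(const Token&) = delete;
        Token& operator=(const Token&) = delete;
        Token& operator=(Token&&) = delete;

    private:
        InFlightCounter* counter_;
    };

    Token enter() { return Token(*this); }

    std::size_t inFlight() const {
        return count_.load(std::memory_order_acquire);
    }
    bool quiescent() const { return inFlight() == 0; }

private:
    std::atomic<std::size_t> count_{0};
};

// Called at the top of a (hypothetical) detach path: fires in debug
// builds if stop() returned while an op was still running.
inline void assertSafeToDetach(const InFlightCounter& c) {
    assert(c.quiescent() && "detach with production ops still in flight");
}
```

Each production op would take a Token when it launches, and the detach
code would call assertSafeToDetach() after stop() returns. If the assert
ever fires, the wait in stop() is too short.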

Making sh_ptr<StimulusBuffer> atomic for mem barriers:
  We could also complete our implementation's correctness by
  converting the shared_ptrs to StimulusBuffer inside of the
  PCloudStimulusProducer into
  std::atomic<std::shared_ptr<StimulusBuffer>>, and using
  std::memory_order_release/memory_order_acquire when writing and
  reading them respectively.
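A rough sketch of what that conversion could look like. The
StimulusBuffer struct here is an empty stand-in, and BufferSlot,
attach(), detach() and current() are hypothetical names rather than the
producer's real interface. std::atomic<std::shared_ptr<T>> needs C++20
library support, so the sketch falls back to the older
std::atomic_load_explicit/std::atomic_store_explicit overloads for
shared_ptr where the library lacks it:

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <utility>

struct StimulusBuffer { int id = 0; };  // stand-in for the real class

using BufPtr = std::shared_ptr<StimulusBuffer>;

#ifdef __cpp_lib_atomic_shared_ptr
using Slot = std::atomic<BufPtr>;
inline void storeRelease(Slot& s, BufPtr p) {
    s.store(std::move(p), std::memory_order_release);
}
inline BufPtr loadAcquire(const Slot& s) {
    return s.load(std::memory_order_acquire);
}
#else  // pre-C++20 library: same ordering via the free functions
using Slot = BufPtr;
inline void storeRelease(Slot& s, BufPtr p) {
    std::atomic_store_explicit(&s, std::move(p), std::memory_order_release);
}
inline BufPtr loadAcquire(const Slot& s) {
    return std::atomic_load_explicit(&s, std::memory_order_acquire);
}
#endif

// Hypothetical holder for one buffer pointer inside the producer.
class BufferSlot {
public:
    // The buffer must be fully initialized before this call: the release
    // store publishes every write made to it beforehand.
    void attach(BufPtr buf) {
        assert(buf != nullptr);
        storeRelease(slot_, std::move(buf));
    }
    void detach() { storeRelease(slot_, nullptr); }

    // The acquire load pairs with the release store in attach(), so a
    // reader that sees the pointer also sees the initialized buffer.
    BufPtr current() const { return loadAcquire(slot_); }

private:
    Slot slot_;
};
```

A production op that copies the pointer out through current() keeps the
buffer alive for as long as it holds that copy, even if detach() runs in
the meantime.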