salmanoff/docs/design/qutexes.md at d10217f3f57706850792773b1418dcb32989ed89

Files

T

hayodea d10217f3f5 Docs: update qutex algo plan

2025-09-19 18:52:27 -04:00

19 KiB

Raw Blame History

I just realized that my spinqueueing mechanism is highly power inefficient if a lock needs to be held across a "true async wait"—where the async sequence actually waits on a hardware bottleneck. In this case, the thread acquires the spinlock, then goes to sleep in the kernel schedq until some hardware event occurs, and is then awakened—all while still holding the spinlock.

Meanwhile, other sequences running on other threads and contending for that lock will be Qspinning. This is acceptable if all I care about is maximum throughput: the Qspinning just re-posts the sequences back into the Q, and eventually they'll acquire the LockSet and proceed.

Importantly, since the thread itself isn't slept in the kschedQ, it will be deQing and processing other sequences that aren't bottlenecked on the lock held by the sequence waiting for the hardware response. Throughput is indeed maximized.

However, I just realized that if kernel mutexes expose FD events, I can apply this same logic to sleeplocks: I can wait on the sleeplocks asynchronously instead of synchronously. If I can make my asio::io_service wait on all the mutex FDs requested by sequences on the current thread, then in theory I can put the thread to sleep and know that when the mutex becomes available, I'll be awakened again.

Hence, I can get the best of both worlds: maximum throughput and power saving. Instead of spinqueueing, we just add the lock FDs to an FD set to be waited on by asio. If any of those locks become available, the kernel scheduler will awaken our asioQ thread, and we can then awaken and retry the lock.

Boost asio queue-based sleep locking:

Instead of using FDs, we can also try to use a fifo Q based mechanism: each lock is a spinlock and a fifo queue.

Acquire:

lock(spinlock);
q.push_back(self);
head = q.peek_front();
if (head == self) {
	// We acquired the lock.
	unlock(spinlock);
	return;
}

unlock(spinlock);

release:

lock(spinlock);
// Should get back ourself.
q.pop_front();
// Wake up the next request in the q.
head = q.peek_front();
if (head == NULL) {
	// Nobody was waiting.
	unlock(spinlock);
	return;
}

head.thread_to_wake.getIoService().post([]{
	// This lambda causes thread_to_wake to check this lock's
	// Q and then proceed to execute since it now owns the lock.
});
unlock(spinlock);

Something like this: it causes the entire thing to be, at least ostensibly, in userspace -- though idk how Boost handles its queues internally.

Priortizing LockSets:

One problem we have with a FIFO-based sleeping system is that it makes it very unlikely that LockSets will ever acquire all of their locks, if there are contenders for those same locks who only need to acquire one of the locks in that LockSet.

We could theoretically give locksets an advantage by not making them backout if they fail to acquire all locks in their set. I.e: if they get 2/3, then they hold those 2 and then wait for the 3rd. This is problematic because it leaves room open for deadlocks in the form of T1 and T2 needing both LockA and LockB, but they acquire them in reverse order. I.e: T1 takes LockA and now waits for LockB; and T2 takes LockB and now waits for LockA. This will now happen among the LockSets if we don't impose backing out. It may be possible to avoid this using very careful lock ordering and dependency analysis but this project is asynchronous the locking is done in the async sequences and not in the sync accessor functions. So this kind of analysis is almost impossible to do.

We need to think of a way to make the FIFOs biased toward LockSets so that they have an advantage over single-lock acquirers. Or else LockSet sequences will be starved.

Timed backoff:

We could have Locksets be greedy and try to hold on to the locks they've acquired (say, 2/3 and then wait for the 3rd) but then be forced to backoff after a timeout.

This introduces async event complexity and also the timeout we choose is almost guaranteed to be arbitrary.

Fractionally inserted FIFOs:

We insert sequences with a LockSet.size() of 1, at the back. We insert all other sequences (>1) into first 1/LockSet.size()th position in the Queue. So a Lockset of size 2 will be inserted at the end of the first half of the items in the queue. A Lockset of size 3 will be inserted at the end of the first 33% of items. A lockset of size 4 will be inserted at the end of the first 25% of items. And so on.

This ensures that higher LockSet.size()s will be prioritized ever higher, and at the same time they don't completely hog everything. Those single-lock sequences that have already naturally progressed past the fraction-mark of a given LockSet size will continue making progress toward the front.

For queueing sequences with Locksets>1, we can enQ them on the FIFO of the first lock in their set. They'll back off each time anyway, so they'll always be re-trying from the first lock in their set each time.

Impl details:

We'd like to use std::unordered_set because insertion will require lots of moving items around, but we'll have to use std::vector because we need direct access to insert at arbitrary fractional indexes. It's unlikely the number of items in any lock's Q will ever be large enough to require lots of displacement, but welp there's no reason not to plan for scaling. Although if we end up needing scaling that's a symptom of a bigger problem...with scaling itself lol. There shouldn't be enough items blocked on a lock that we have to design the lock's queue to be scalable.

Inverted Fractionally acquired locksets:

The previous ideas of fractionally inserted lockQs was okay, but the acquisition algo required that the async seq be at the front of a locks queue to successfully acquire that lock. That makes it almost impossible for Locksets>1 to ever acquire all of their locks. If we add backoff to that, it basically means no lockset will ever acquire all of its locks.

Instead what we now do is always insert at the rear (push_back()) and then when acquiring, we check to see if the sequence is in the first 1/(1/(LockSet.size())), and if so, it successfully acquires the lock. I.e: if the sequence item isn't in the LAST 1/(LockSet.size()) items, then it succeeds.

For a lockset of size=1: It must be at the front of the queue.
Lockset.size=2: it must be in the first 50% of items.
Lockset.size=3: it must be in the first 66% of items.
Lockset.size=4: It must be in the first 75% of items.

So this way larger LockSets are favoured, but 1-size locksets make progress.

For performance:

We obv can just scan the smaller tail percentage for the item instead of scanning the larger front percentage.
If we use a doubly-linked list, we can prolly keep the insertion iterator and this way we won't have to actually find the item in the lockQ when we wish to eventually remove it from the lockQ when releasing the lock.

Total overall design:

Asio queues and Lockvokers:

Lockvokers are initially enqueued on a CompThread's queue. When the lockvoker first runs, it checks a flag to see if it has been "registered" into the queues for all locks in its set. If not, then it "registers" itself in each lock's ticketQ and then attempts to acquire each lock. Registration and acquisition are logically separate operations; and locks will often attempt acquisition many times after first registering, without needing to register again. Ideally we can implement a LockSet::registerAndTryAcquireAll() method, but that's for us to think about later.

/* We'll need to rename current class LockSpec to LockSet. */
class LockSet
{
	/* Add this either inside of LockSet or outside of it -- depends on whether
	 * it's we can get it to compile because I'm seeing some potential circular
	 * definition dependencies.
	 */
	typedef std::pair<Qutex, LockerAndInvokerList::iterator>
		LockUsageDesc;

	/* Find a LockUsageDesc -- useful below */
	LockUsageDesc &getLockUsageDesc(Qutex &criterionLock)
	{
		for (auto &reqLock: requiredLocks) {
			if (reqLock.first == &criterionLock) { return reqLock; }
		}

		// Should never happen.
		throw;
	}
};

LockSet::register(LockerAndInvoker &lockvoker)
{
	for (auto &lock: lockset.locks) {
		// Register the Lockvoker object in each lock's ticketQ.
		lock.second = lock.first.register(lockvoker);
	}
	registered = true;
}

bool LockSet::tryAcquire(LockerAndInvoker &lockvoker)
{
	if (!registered) {
		// Should never happen.
		throw ...;
	}
	int nLocksAcquired=0,
		nLocksInSet = lockset.size();
	for (auto &lock: lockset.locks) {
		if (!lock.first.tryAcquire(nLocksInSet)) {
			break;
		}

		nLocksAcquired++;
	}

	if (nLocksAcquired == nLocksInSet) {
		// Success
		return true;
	}

	for (int i=0; i<nLocksAcquired; i++) {
		// Backoff does different stuff from release();
		locks[i].first.backoff(lockvoker);
	}
}

LockSet::release()
{
	for (auto &lock: requiredLocks) {
		lock.release();
	}
}

Now, the Qutex class is what we'll use for synchronization. It's just a combination of 2 atomic and a std::list.

class SpinLock
{
	/* Modify to add methods acquire() and release() which busy-wait.
	 */
	void acquire();
	void release();
};

class LockSet
{
	/* Modify the std::vector of SpinLock to instead be:
	 *	std::vector<LockUsageDesc> locks;
	 */
	std::vector<LockUsageDesc> locks;
}

bool LockerAndInvoker::operator==(const LockerAndInvoker &other)
{
	/* Compare by the address of the continuation objects. Why?
	 * Because there's no guarantee that the lockvoker object that was
	 * passed in by the io_service invocation is the same object as that
	 * which is in the qutexQs. Especially because we make_shared() a
	 * copy when registerInQutexQueues()ing.
	 *
	 * Generally when we "wake" a lockvoker by enqueuing it, boost's
	 * io_service::post will copy the lockvoker object.
	 */
	return &this->serializedContinuation == &other.serializedContinuation;
}

bool LockerAndInvoker::operator !=(const LockerAndInvoker &other)
{
	return &this->serializedContinuation != &other.serializedContinuation;
}

class Qutex
{
public:
	typedef std::list<LockerAndInvoker>		LockerAndInvokerList;

	LockerAndInvokerList::iterator register(const LockerAndInvoker &lockvoker)
	{
		/** EXPLANATION:
		 * Just insert the lockvoker into the rear of the list.
		 *
		 * Then, since we want to store the 
		 */
		LockerAndInvokerList::iterator it;

		lock.acquire();
		queue.push_back(lockvoker);
		it = queue.end();
		--it;
		lock.release();

		return it;
	}

	void unregister(LockerAndInvokerList::iterator it, bool shouldLock=1)
	{
		if (shouldLock)
		{
			lock.acquire();
			queue.erase(it);
			lock.release();
		}
		else{
			queue.erase(it);
		}
	}

	bool tryAcquire(LockerAndInvoker &tryingLockvoker)
	{
		const nRequiredLocks = tryingLockvoker.serializedContinuation
			.requiredLocks.size();

		lock.acquire();

		const qNItems = queue.size();

		if (qNItems < 1) {
			lock.release();

			/**	EXPLANATION:
			 * requiredLocks before ever trying to tryAcquire() them, so if
			 * tryAcquire is being called, that must mean that queue.size() > 0.
			 *
			 * Ergo this should never happen.
			 */
			throw;
		}

		if (!!currentOwner) {
			lock.release();
			return false;
		}

		/**	EXPLANATION:
		 * From here:
		 * if qNItems == 1 the we are the only one in the ticketQ and we have
		 *	successfully acquired the lock.
		 * If qNitems / nRequiredLocks == 0, then we acquire by default since
		 *	the number of items in the ticketQ guarantees that we are in the top
		 *	X% for that nRequiredLocks.
		 * If qNItems / nRequiredLocks >= 1, then we must do the normal algo:
		 *	Check the last (qNItems/nRequiredLocks) items, and if the item isn't
		 *	in those items, then it must be in the earlier ones (obviously).
		 *	Hence this Lockvoker acquisition should be considered successful.
		 *
		 *	EXPLANATION 2:
		 * You'll notice that we don't do actual percentages but rather we just
		 * do discrete fractions -- this makes the algo more deterministic
		 * and much easier to reason about. I.e:
		 *	If nRequiredLocks is 6 and qNItems==3:
		 *		we don't actually calculate that the Lockvoker item must be in
		 *		the top (100-17%), and then try to calculate whether we ought to
		 *		consider the 3rd item to be in the last 17-percentile. We just
		 *		do a fractional count and assume complete discreteness.
		 */
		const int nRearItemsToScan = qNItems / nRequiredLocks;

		if (qNItems == 1 || nRearItemsToScan < 1) {
			currOwner = tryingLockvoker;
			lock.release();
			return true;
		}

		/**	EXPLANATION:
		 * For lockvokers that only have 1 requiredLock, they must be at the
		 * front of the queue to successfully acquire.
		 */
		if (nRequiredLocks == 1)
		{
			if (tryingLockvoker == &queue.front())
			{
				currOwner = tryingLockvoker;
				lock.release();
				return true;
			}
		}

		auto rIt = queue.rbegin();
		auto rEndIt = queue.rend();
		bool foundInRear = false;
		for (int i=0; i<nRearItemsToScan && rIt != rEndIt; rIt++, i++)
		{
			if (*rIt != tryingLockvoker) { continue; }

			foundInRear == true;
			break;
		}

		if (foundInRear) {
			lock.release();
			return false;
		}

		/* Not found in rear: this means the item is in the top X%. That means
		 * it should be allowed to claim the lock.
		 */
		currOwner = tryingLockvoker;
		lock.release();
		return true;
	}

	backoff(LockerAndInvoker &failedAcquirer)
	{
		lock.acquire();

		// Rotate queue members if failedAcquirer is at front of queue.
		LockerAndInvoker &currFront = queue.front();
		if (currFront == failedAcquirer)
		{
			/**	EXPLANATION:
			 * Rotate the top LockSet.size() items in the queue by moving
			 * the failedAcquirer to the last position in the top
			 * LockSet.size() items within the queue.
			 *
			 * I.e: if queue.size()==20, and lockSet.size()==5, then move
			 * failedAcquirer from the front the 5th position in the queue,
			 * which should push the other 4 items forward.
			 * If queue.size==3 and LockSet.size()==5, then just
			 * push_back(failedAcquirer).
			 *
			 * It is impossible for a Qutex queue to have only one
			 * item in it, yet for that Lockvoker item to have failed to
			 * acquire the Qutex. Being the only item in the ticketQ
			 * means that you must succeed at acquiring the Qutex.
			 */
			int swapPosition = min(
				queue.size(),
				failedAcquirer.serializedContinuation.requiredLocks.size());

			/*	EXPLANATION:
			 * Swap them here.
			 *
			 * The reason why we do this swap is to avoid a particular kind of
			 * deadlock wherein a grid of async requests is perfectly configured
			 * so as to guarantee that none of them can make any forward
			 * progress unless they get reordered.
			 *
			 * Consider 2 different locks with 2 different items in them
			 * each, both of which come from 2 particular requests:
			 *	Qutex1: Lockvoker1, Lv2
			 *	Qutex2: Lv2, Lv1
			 *
			 * Moreover, both of these lockvokers have requiredLocks.size()==2,
			 * and the particular 2 locks that each one requires are indeed
			 * Qutex1 and Qutex2.
			 *
			 * This particular setup basically means that in TL1's queue, Lv1
			 * will wakeup since it's at the front of TL1. It'll successfully
			 * acquire TL1 (since it's at the front), and then it'll try to
			 * acquire TL2. But since Lv1 isn't in the top 50% of items in TL2's
			 * queue, Lv1 will fail to acquire TL2.
			 *
			 * Then similarly, in TL2's queue, Lv2 will wakeup since it's at
			 * the front. Again, it'll successfully acquire TL2 since it's at
			 * the front of TL2's queue. But then it'll try to acquire TL1.
			 * Since it's not in the top 50% of TL1's enqueued items, it'll fail
			 * to acquire TL1.
			 *
			 * N.B: This type of perfectly ordered deadlock can occur in any
			 * kind of NxN situation where ticketQ.size()==requiredLocks.size().
			 * That could be 4x4, 5x5, 6x6, etc. It doesn't happen in 1x1
			 * because a Lockvoker that only requires one lock will always just
			 * succeed if it's at the front of its queue.
			 *
			 * This state of affairs is stable and will persist unless these
			 * queues are reordered in some way. Hence: that's why we rotate the
			 * items in a QutexQ after backing off of it. Backing off means
			 * Not necessarily that the calling LockVoker failed to acquire
			 * THIS PARTICULAR Qutex, but rather than it failed to acquire
			 * ALL of its required locks.
			 *
			 * Hence, if we are backing out, we should also rotate the items
			 * in the queue if the current front item is the failed acquirer.
			 * So that's why we do this rotation here.
			 */
		}

		currOwner.release();

		LockerAndInvoker &newFront = queue.front();

		lock.release();

		wakeUp(newFront);
	}

	void release()
	{
		lock.acquire();

#ifndef CONFIG_LOCKVOKER_AGGRESSIVE_WAKEUPS
		LockerAndInvoker	&oldFront = queue.front();
#endif

		/* Get the saved iterator and use it to unregister.
		 * Don't acquire lock because we already acquired it in this function.
		 */
		unregister(currOwner->serializedContinuation.requiredLocks
			.getLockUsageDesc(*this).second, false);

		currOwner.release();

		/** NOTE:
		 * I am not sure whether we should only wake up the front item if
		 * the prev owner was the previous front item; or whether we should
		 * always wake up the new front item.
		 *
		 * Recall that because a sequence can acquire a Qutex without being
		 * at the front of the queue (because it could merely be in the top X%
		 * of items instead), this means that during a call to release(), the
		 * owning async sequence may not be at the front -- it needs only be in
		 * the top X% of items.
		 *
		 * When the owning sequence is the front, then we should definitely wake
		 * the new front item after removing the previous owner.
		 *
		 * But when the front item is not the prev owner, then should we wake it
		 * up? It should have been awakened previously, so if it's still at the
		 * front, that implies that it failed to make forward progress last time
		 * it awoke. You could argue that during the interim while this owner
		 * did its thing, it's possible for the current front item to have
		 * become capable of acquiring all of its requiredLocks, due to changes
		 * in the other locks in its requiredLocks set. This is a fair argument.
		 *
		 * You could also argue that we can just wait until its registered
		 * items in its other locks' ticketQs reach the front of those other
		 * ticketQs, and then when that happens, those locks will wake it up
		 * when release() is called.
		 *
		 * The latter eases contention since one could argue that we have a
		 * surer chance of it successfully acquiring all of its requiredLocks if
		 * we wait until it bubbles to the front of another ticketQ. Whereas,
		 * if we aggressively/greedily awaken it to try just because an item in
		 * another ticketQ has been removed, we're just introducing contention.
		 * Both cases have good arguments. The aggressive approach enables us to
		 * potentially retire more requests, and thus increase throughput.
		 *
		 * This current pseudocode assumes the latter and waits for the other
		 * locks' ticketQs to wake it up when it reaches the top in their
		 * Qs.
		 */
		LockerAndInvoker &newFront = queue.front();

		lock.release();

#ifndef CONFIG_LOCKVOKER_AGGRESSIVE_WAKEUPS
		if (&newFront != &oldFront)
#endif
			{ wakeUp(newFront); }
	}

public:
	SpinLock					lock;
	std::shared_ptr<LockerAndInvoker> currOwner;
	LockerAndInvokerList		queue;
};

19 KiB Raw Blame History