docs/design/qutexes.md

I just realized that my spinqueueing mechanism is highly power inefficient if a
lock needs to be held across a "true async wait"—where the async sequence
actually waits on a hardware bottleneck. In this case, the thread acquires the
spinlock, then goes to sleep in the kernel schedq until some hardware event
occurs, and is then awakened—all while still holding the spinlock.

Meanwhile, other sequences running on other threads and contending for that lock
will be Qspinning. This is acceptable if all I care about is maximum throughput:
the Qspinning just re-posts the sequences back into the Q, and eventually
they'll acquire the LockSet and proceed.

Importantly, since the thread itself isn't slept in the kschedQ, it will be
deQing and processing other sequences that aren't bottlenecked on the lock held
by the sequence waiting for the hardware response. Throughput is indeed
maximized.

However, I just realized that if kernel mutexes expose FD events, I can apply
this same logic to sleeplocks: I can wait on the sleeplocks asynchronously
instead of synchronously. If I can make my asio::io_service wait on all the
mutex FDs requested by sequences on the current thread, then in theory I can put
the thread to sleep and know that when the mutex becomes available, I'll be
awakened again.

Hence, I can get the best of both worlds: maximum throughput and power saving.
Instead of spinqueueing, we just add the lock FDs to an FD set to be waited on
by asio. If any of those locks become available, the kernel scheduler will
awaken our asioQ thread, and we can then awaken and retry the lock.

## Boost asio queue-based sleep locking:

Instead of using FDs, we can also try to use a fifo Q based mechanism: each lock
is a spinlock and a fifo queue.

Acquire:
```
lock(spinlock);
q.push_back(self);
head = q.peek_front();
if (head == self) {
	// We acquired the lock.
	unlock(spinlock);
	return;
}

unlock(spinlock);
```

release:
```
lock(spinlock);
// Should get back ourself.
q.pop_front();
// Wake up the next request in the q.
head = q.peek_front();
if (head == NULL) {
	// Nobody was waiting.
	unlock(spinlock);
	return;
}

head.thread_to_wake.getIoService().post([]{
	// This lambda causes thread_to_wake to check this lock's
	// Q and then proceed to execute since it now owns the lock.
});
unlock(spinlock);
```

Something like this: it causes the entire thing to be, at least ostensibly,
in userspace -- though idk how Boost handles its queues internally.

## Priortizing LockSets:

One problem we have with a FIFO-based sleeping system is that it makes it very
unlikely that LockSets will ever acquire all of their locks, if there are
contenders for those same locks who only need to acquire one of the locks in
that LockSet.

We could theoretically give locksets an advantage by not making them backout
if they fail to acquire all locks in their set. I.e: if they get 2/3, then they
hold those 2 and then wait for the 3rd. This is problematic because it leaves
room open for deadlocks in the form of T1 and T2 needing both LockA and LockB,
but they acquire them in reverse order. I.e: T1 takes LockA and now waits for
LockB; and T2 takes LockB and now waits for LockA. This will now happen among
the LockSets if we don't impose backing out. It may be possible to avoid this
using very careful lock ordering and dependency analysis but this project is
asynchronous the locking is done in the async sequences and not in the sync
accessor functions. So this kind of analysis is almost impossible to do.


We need to think of a way to make the FIFOs biased toward LockSets so that they
have an advantage over single-lock acquirers. Or else LockSet sequences will be
starved.

### Timed backoff:

We could have Locksets be greedy and try to hold on to the locks they've
acquired (say, 2/3 and then wait for the 3rd) but then be forced to backoff
after a timeout.

This introduces async event complexity and also the timeout we choose is almost
guaranteed to be arbitrary.

### Fractionally inserted FIFOs:

We insert sequences with a LockSet.size() of 1, at the back.
We insert all other sequences (>1) into first 1/LockSet.size()th position in the
Queue.
So a Lockset of size 2 will be inserted at the end of the first half of the
items in the queue.
A Lockset of size 3 will be inserted at the end of the first 33% of items.
A lockset of size 4 will be inserted at the end of the first 25% of items.
And so on.

This ensures that higher LockSet.size()s will be prioritized ever higher, and
at the same time they don't completely hog everything. Those single-lock
sequences that have already naturally progressed past the fraction-mark of a
given LockSet size will continue making progress toward the front.

For queueing sequences with Locksets>1, we can enQ them on the FIFO of the first
lock in their set. They'll back off each time anyway, so they'll always be
re-trying from the first lock in their set each time.

#### Impl details:

We'd like to use std::unordered_set because insertion will require lots of
moving items around, but we'll have to use std::vector because we need direct
access to insert at arbitrary fractional indexes. It's unlikely the number of
items in any lock's Q will ever be large enough to require lots of displacement,
but welp there's no reason not to plan for scaling. Although if we end up
needing scaling that's a symptom of a bigger problem...with scaling itself lol.
There shouldn't be enough items blocked on a lock that we have to design the
lock's queue to be scalable.

### Inverted Fractionally acquired locksets:

The previous ideas of fractionally inserted lockQs was okay, but the acquisition
algo required that the async seq be at the front of a locks queue to
successfully acquire that lock. That makes it almost impossible for Locksets>1
to ever acquire all of their locks. If we add backoff to that, it basically
means no lockset will ever acquire all of its locks.

Instead what we now do is always insert at the rear (push_back()) and then when
acquiring, we check to see if the sequence is in the first
1/(1/(LockSet.size())), and if so, it successfully acquires the lock. I.e: if
the sequence item isn't in the LAST 1/(LockSet.size()) items, then it succeeds.
* For a lockset of size=1: It must be at the front of the queue.
* Lockset.size=2: it must be in the first 50% of items.
* Lockset.size=3: it must be in the first 66% of items.
* Lockset.size=4: It must be in the first 75% of items.

So this way larger LockSets are favoured, but 1-size locksets make progress.

For performance:
* We obv can just scan the smaller tail percentage for the item instead of
  scanning the larger front percentage.
* If we use a doubly-linked list, we can prolly keep the insertion iterator
  and this way we won't have to actually find the item in the lockQ when we wish
  to eventually remove it from the lockQ when releasing the lock.

## Total overall design:

### Asio queues and Lockvokers:

Lockvokers are initially enqueued on a CompThread's queue. When the lockvoker
first runs, it checks a flag to see if it has been "registered" into the queues
for all locks in its set. If not, then it "registers" itself in each lock's
ticketQ and then attempts to acquire each lock. Registration and acquisition
are logically separate operations; and locks will often attempt acquisition
many times after first registering, without needing to register again. Ideally
we can implement a LockSet::registerAndTryAcquireAll() method, but that's for
us to think about later.

```
/* We'll need to rename current class LockSpec to LockSet. */
class LockSet
{
	/* Add this either inside of LockSet or outside of it -- depends on whether
	 * it's we can get it to compile because I'm seeing some potential circular
	 * definition dependencies.
	 */
	typedef std::pair<Qutex, LockerAndInvokerList::iterator>
		LockUsageDesc;

	/* Find a LockUsageDesc -- useful below */
	LockUsageDesc &getLockUsageDesc(Qutex &criterionLock)
	{
		for (auto &reqLock: requiredLocks) {
			if (reqLock.first == &criterionLock) { return reqLock; }
		}

		// Should never happen.
		throw;
	}
};

LockSet::register(LockerAndInvoker &lockvoker)
{
	for (auto &lock: lockset.locks) {
		// Register the Lockvoker object in each lock's ticketQ.
		lock.second = lock.first.register(lockvoker);
	}
	registered = true;
}

bool LockSet::tryAcquire(LockerAndInvoker &lockvoker)
{
	if (!registered) {
		// Should never happen.
		throw ...;
	}
	int nLocksAcquired=0,
		nLocksInSet = lockset.size();
	for (auto &lock: lockset.locks) {
		if (!lock.first.tryAcquire(nLocksInSet)) {
			break;
		}

		nLocksAcquired++;
	}

	if (nLocksAcquired == nLocksInSet) {
		// Success
		return true;
	}

	for (int i=0; i<nLocksAcquired; i++) {
		// Backoff does different stuff from release();
		locks[i].first.backoff(lockvoker);
	}
}

LockSet::release()
{
	for (auto &lock: requiredLocks) {
		lock.release();
	}
}
```

Now, the Qutex class is what we'll use for synchronization. It's just a
combination of a SpinLock, a sh_ptr<LockerAndInvoker> and a std::list.

```
class SpinLock
{
	/* Modify to add methods acquire() and release() which busy-wait.
	 */
	void acquire();
	void release();
};

class LockSet
{
	/* Modify the std::vector of SpinLock to instead be:
	 *	std::vector<LockUsageDesc> locks;
	 */
	std::vector<LockUsageDesc> locks;
}

bool LockerAndInvoker::operator==(const LockerAndInvoker &other)
{
	/* Compare by the address of the continuation objects. Why?
	 * Because there's no guarantee that the lockvoker object that was
	 * passed in by the io_service invocation is the same object as that
	 * which is in the qutexQs. Especially because we make_shared() a
	 * copy when registerInQutexQueues()ing.
	 *
	 * Generally when we "wake" a lockvoker by enqueuing it, boost's
	 * io_service::post will copy the lockvoker object.
	 */
	return &this->serializedContinuation == &other.serializedContinuation;
}

bool LockerAndInvoker::operator !=(const LockerAndInvoker &other)
{
	return &this->serializedContinuation != &other.serializedContinuation;
}

class Qutex
{
public:
	typedef std::list<LockerAndInvoker>		LockerAndInvokerList;

	LockerAndInvokerList::iterator register(const LockerAndInvoker &lockvoker)
	{
		/** EXPLANATION:
		 * Just insert the lockvoker into the rear of the list.
		 *
		 * Then, since we want to store the 
		 */
		LockerAndInvokerList::iterator it;

		lock.acquire();
		queue.push_back(lockvoker);
		it = queue.end();
		--it;
		lock.release();

		return it;
	}

	void unregister(LockerAndInvokerList::iterator it, bool shouldLock=1)
	{
		if (shouldLock)
		{
			lock.acquire();
			queue.erase(it);
			lock.release();
		}
		else{
			queue.erase(it);
		}
	}

	bool tryAcquire(LockerAndInvoker &tryingLockvoker)
	{
		const nRequiredLocks = tryingLockvoker.serializedContinuation
			.requiredLocks.size();

		lock.acquire();

		const qNItems = queue.size();

		if (qNItems < 1) {
			lock.release();

			/**	EXPLANATION:
			 * requiredLocks before ever trying to tryAcquire() them, so if
			 * tryAcquire is being called, that must mean that queue.size() > 0.
			 *
			 * Ergo this should never happen.
			 */
			throw;
		}

		if (!!currentOwner) {
			lock.release();
			return false;
		}

		/**	EXPLANATION:
		 * From here:
		 * if qNItems == 1 the we are the only one in the ticketQ and we have
		 *	successfully acquired the lock.
		 * If qNitems / nRequiredLocks == 0, then we acquire by default since
		 *	the number of items in the ticketQ guarantees that we are in the top
		 *	X% for that nRequiredLocks.
		 * If qNItems / nRequiredLocks >= 1, then we must do the normal algo:
		 *	Check the last (qNItems/nRequiredLocks) items, and if the item isn't
		 *	in those items, then it must be in the earlier ones (obviously).
		 *	Hence this Lockvoker acquisition should be considered successful.
		 *
		 *	EXPLANATION 2:
		 * You'll notice that we don't do actual percentages but rather we just
		 * do discrete fractions -- this makes the algo more deterministic
		 * and much easier to reason about. I.e:
		 *	If nRequiredLocks is 6 and qNItems==3:
		 *		we don't actually calculate that the Lockvoker item must be in
		 *		the top (100-17%), and then try to calculate whether we ought to
		 *		consider the 3rd item to be in the last 17-percentile. We just
		 *		do a fractional count and assume complete discreteness.
		 */
		const int nRearItemsToScan = qNItems / nRequiredLocks;

		if (qNItems == 1 || nRearItemsToScan < 1) {
			currOwner = tryingLockvoker;
			lock.release();
			return true;
		}

		/**	EXPLANATION:
		 * For lockvokers that only have 1 requiredLock, they must be at the
		 * front of the queue to successfully acquire.
		 */
		if (nRequiredLocks == 1)
		{
			bool ret=false;

			if (tryingLockvoker == &queue.front())
			{
				currOwner = tryingLockvoker;
				ret = true;
			}

			ret = false;
			lock.release();
			return ret;
		}

		auto rIt = queue.rbegin();
		auto rEndIt = queue.rend();
		bool foundInRear = false;
		for (int i=0; i<nRearItemsToScan && rIt != rEndIt; rIt++, i++)
		{
			if (*rIt != tryingLockvoker) { continue; }

			foundInRear = true;
			break;
		}

		if (foundInRear) {
			lock.release();
			return false;
		}

		/* Not found in rear: this means the item is in the top X%. That means
		 * it should be allowed to claim the lock.
		 */
		currOwner = tryingLockvoker;
		lock.release();
		return true;
	}

	backoff(LockerAndInvoker &failedAcquirer)
	{
		lock.acquire();

		const int nQItems = queue.size();
		// Rotate queue members if failedAcquirer is at front of queue.
		LockerAndInvoker &currFront = queue.front();
		if (currFront == failedAcquirer && nQItems > 1)
		{
			/**	EXPLANATION:
			 * Rotate the top LockSet.size() items in the queue by moving
			 * the failedAcquirer to the last position in the top
			 * LockSet.size() items within the queue.
			 *
			 * I.e: if queue.size()==20, and lockSet.size()==5, then move
			 * failedAcquirer from the front the 5th position in the queue,
			 * which should push the other 4 items forward.
			 * If queue.size==3 and LockSet.size()==5, then just
			 * push_back(failedAcquirer).
			 *
			 * It is impossible for a Qutex queue to have only one
			 * item in it, yet for that Lockvoker item to have failed to
			 * acquire the Qutex. Being the only item in the ticketQ
			 * means that you must succeed at acquiring the Qutex.
			 */
			int indexOfItemToInsertCurrFrontBehind = min(
				nQItems - 1,
				failedAcquirer.serializedContinuation.requiredLocks.size() - 1);

			/*	EXPLANATION:
			 * Rotate them here.
			 *
			 * The reason why we do this rotation is to avoid a particular kind
			 * of deadlock wherein a grid of async requests is perfectly
			 * configured so as to guarantee that none of them can make any
			 * forward progress unless they get reordered.
			 *
			 * Consider 2 different locks with 2 different items in them
			 * each, both of which come from 2 particular requests:
			 *	Qutex1: Lockvoker1, Lv2
			 *	Qutex2: Lv2, Lv1
			 *
			 * Moreover, both of these lockvokers have requiredLocks.size()==2,
			 * and the particular 2 locks that each one requires are indeed
			 * Qutex1 and Qutex2.
			 *
			 * This particular setup basically means that in TL1's queue, Lv1
			 * will wakeup since it's at the front of TL1. It'll successfully
			 * acquire TL1 (since it's at the front), and then it'll try to
			 * acquire TL2. But since Lv1 isn't in the top 50% of items in TL2's
			 * queue, Lv1 will fail to acquire TL2.
			 *
			 * Then similarly, in TL2's queue, Lv2 will wakeup since it's at
			 * the front. Again, it'll successfully acquire TL2 since it's at
			 * the front of TL2's queue. But then it'll try to acquire TL1.
			 * Since it's not in the top 50% of TL1's enqueued items, it'll fail
			 * to acquire TL1.
			 *
			 * N.B: This type of perfectly ordered deadlock can occur in any
			 * kind of NxN situation where ticketQ.size()==requiredLocks.size().
			 * That could be 4x4, 5x5, 6x6, etc. It doesn't happen in 1x1
			 * because a Lockvoker that only requires one lock will always just
			 * succeed if it's at the front of its queue.
			 *
			 * This state of affairs is stable and will persist unless these
			 * queues are reordered in some way. Hence: that's why we rotate the
			 * items in a QutexQ after backing off of it. Backing off means
			 * Not necessarily that the calling LockVoker failed to acquire
			 * THIS PARTICULAR Qutex, but rather than it failed to acquire
			 * ALL of its required locks.
			 *
			 * Hence, if we are backing out, we should also rotate the items
			 * in the queue if the current front item is the failed acquirer.
			 * So that's why we do this rotation here.
			 */
			// The first arg (the iterator) is a ref in case it must be updated.
			rotate(
				currFront.serializedContinuation.requiredLocks.getLockDesc(
					*this).second,
				indexOfItemToInsertCurrFrontBehind);
		}

		currOwner.release();

		LockerAndInvoker &newFront = queue.front();

		lock.release();

		/**	EXPLANATION:
		 * Why should this never happen? Well, if we were at the front of the queue
		 * and we failed to acquire the lock, we should have been rotated away from
		 * the front. On the other hand, if we were not at the front of the queue
		 * and we failed to acquire the lock, then we weren't at the front of the
		 * queue to begin with.
		 * The exception is if the queue has only one item in it.
		 *
		 * Hence there ought to be no way for the failedAcquirer to be at the front
		 * of the queue at this point UNLESS the queue has only one item in it.
		 */
		if (newFront == failedAcquirer && nQItems > 1)
		{
			throw;
		}

		/**	EXPLANATION:
		 * We should always awaken whoever is at the front of the queue, even if
		 * we didn't rotate. Why? Consider this scenario:
		 *
		 *	Lv1 has LockSet.size==1. Lv2 has LockSet.size==3.
		 *	Lv1's required lock overlaps with Lv2's set of 3 required locks.
		 *	Lv1 registers itself in its 1 qutex's queue.
		 *	Lv2 registers itself in all 3 of its qutexes' queues.
		 *	Lv2 acquires the lock that it needs in common with Lv1.
		 *		(Assume that Lv2 was not at the front of the common qutex's
		 *		internal queue -- it only needed to be in the top 66%.)
		 *	Lv1 tries to acquire the common lock and fails. It gets taken off of
		 *		its io_service. It's now asleep until it gets
		 *		re-added into an io_service.
		 *	Lv2 fails to acquire the other 2 locks it needs and backoff()s from
		 *		the common lock it shares with Lv1.
		 *
		 *	If Lv2 does NOT awaken the item at the front of the common lock's
		 *	queue (aka: Lv1), then Lv1 is doomed to never wake up again.
		 *
		 * Hence: backout() callers should always wake up the lockvoker at the
		 * front of their queue before leaving.
		 *
		 * The exception is if the item at the front is the backout() caller
		 * itself. This can happen if, for example a multi-locking lockvoker
		 * is backing off of a qutex within which it's the only waiter.
		 */
		if (nQItems > 1) {
			wakeUp(newFront);
		}
	}

	void release()
	{
		lock.acquire();

		/* Get the saved iterator and use it to unregister.
		 * Don't acquire lock because we already acquired it in this function.
		 */
		unregister(currOwner->serializedContinuation.requiredLocks
			.getLockUsageDesc(*this).second, false);

		currOwner.release();

		/** EXPLANATION:
		 * It would be nice to be able to optimize by only awakening if the
		 * release()ing lockvoker was at the front of the qutexQ, but if we
		 * don't unconditionally wakeup() the front item, we could get lost
		 * wakeups. Consider:
		 *
		 *	Lv1 only has 1 requiredLock.
		 *	Lv2 has 3 requiredLocks. One of its requiredLocks overlaps with
		 *		Lv1's single requiredLock. So they both share a common lock.
		 *	Lv3's currently owns Lv1 & Lv2's common requiredLock.
		 *	Lv3 release()s that common lock.
		 *	Lv1 happens to be next in queue after Lv3 unregisters itself.
		 *	Lv3 wakes up Lv1.
		 *	Just before Lv1 can acquire the common lock, Lv2 acquires it now,
		 *		because it only needs to be in the top 66% to succeed.
		 *	Lv1 checks the currOwner and sees that it's owned. Lv1 is now
		 *		dequeued from its io_service. It won't be awakened until someone
		 *		awakens it.
		 *	Lv2 finishes its critical section and releas()es the common lock.
		 *	Lv2 was not at the front of the qutexQ, so it does NOT awaken the
		 *		current item at the front.
		 *
		 * Thus, Lv1 never gets awakened again. The end.
		 * This also means that no LockSet.size()==1 lockvoker will ever be able
		 * to run again since they can only run if they are at the front of the
		 * qutexQ.
		 *
		 * Therefore we must always awaken the front item when releas()ing.
		 */
		LockerAndInvoker &front = queue.front();

		lock.release();

		wakeUp(front);
	}

public:
	SpinLock					lock;
	std::shared_ptr<LockerAndInvoker> currOwner;
	LockerAndInvokerList		queue;
};
```