I just realized that my spinqueueing mechanism is highly power inefficient if a lock needs to be held across a "true async wait"—where the async sequence actually waits on a hardware bottleneck. In this case, the thread acquires the spinlock, then goes to sleep in the kernel schedq until some hardware event occurs, and is then awakened—all while still holding the spinlock. Meanwhile, other sequences running on other threads and contending for that lock will be Qspinning. This is acceptable if all I care about is maximum throughput: the Qspinning just re-posts the sequences back into the Q, and eventually they'll acquire the LockSet and proceed. Importantly, since the thread itself isn't slept in the kschedQ, it will be deQing and processing other sequences that aren't bottlenecked on the lock held by the sequence waiting for the hardware response. Throughput is indeed maximized. However, I just realized that if kernel mutexes expose FD events, I can apply this same logic to sleeplocks: I can wait on the sleeplocks asynchronously instead of synchronously. If I can make my asio::io_service wait on all the mutex FDs requested by sequences on the current thread, then in theory I can put the thread to sleep and know that when the mutex becomes available, I'll be awakened again. Hence, I can get the best of both worlds: maximum throughput and power saving. Instead of spinqueueing, we just add the lock FDs to an FD set to be waited on by asio. If any of those locks become available, the kernel scheduler will awaken our asioQ thread, and we can then awaken and retry the lock. ## Boost asio queue-based sleep locking: Instead of using FDs, we can also try to use a fifo Q based mechanism: each lock is a spinlock and a fifo queue. Acquire: ``` lock(spinlock); q.push_back(self); head = q.peek_front(); if (head == self) { // We acquired the lock. unlock(spinlock); return; } unlock(spinlock); ``` release: ``` lock(spinlock); // Should get back ourself. q.pop_front(); // Wake up the next request in the q. head = q.peek_front(); if (head == NULL) { // Nobody was waiting. unlock(spinlock); return; } head.thread_to_wake.getIoService().post([]{ // This lambda causes thread_to_wake to check this lock's // Q and then proceed to execute since it now owns the lock. }); unlock(spinlock); ``` Something like this: it causes the entire thing to be, at least ostensibly, in userspace -- though idk how Boost handles its queues internally. ## Priortizing LockSets: One problem we have with a FIFO-based sleeping system is that it makes it very unlikely that LockSets will ever acquire all of their locks, if there are contenders for those same locks who only need to acquire one of the locks in that LockSet. We could theoretically give locksets an advantage by not making them backout if they fail to acquire all locks in their set. I.e: if they get 2/3, then they hold those 2 and then wait for the 3rd. This is problematic because it leaves room open for deadlocks in the form of T1 and T2 needing both LockA and LockB, but they acquire them in reverse order. I.e: T1 takes LockA and now waits for LockB; and T2 takes LockB and now waits for LockA. This will now happen among the LockSets if we don't impose backing out. It may be possible to avoid this using very careful lock ordering and dependency analysis but this project is asynchronous the locking is done in the async sequences and not in the sync accessor functions. So this kind of analysis is almost impossible to do. We need to think of a way to make the FIFOs biased toward LockSets so that they have an advantage over single-lock acquirers. Or else LockSet sequences will be starved. ### Timed backoff: We could have Locksets be greedy and try to hold on to the locks they've acquired (say, 2/3 and then wait for the 3rd) but then be forced to backoff after a timeout. This introduces async event complexity and also the timeout we choose is almost guaranteed to be arbitrary. ### Fractionally inserted FIFOs: We insert sequences with a LockSet.size() of 1, at the back. We insert all other sequences (>1) into first 1/LockSet.size()th position in the Queue. So a Lockset of size 2 will be inserted at the end of the first half of the items in the queue. A Lockset of size 3 will be inserted at the end of the first 33% of items. A lockset of size 4 will be inserted at the end of the first 25% of items. And so on. This ensures that higher LockSet.size()s will be prioritized ever higher, and at the same time they don't completely hog everything. Those single-lock sequences that have already naturally progressed past the fraction-mark of a given LockSet size will continue making progress toward the front. For queueing sequences with Locksets>1, we can enQ them on the FIFO of the first lock in their set. They'll back off each time anyway, so they'll always be re-trying from the first lock in their set each time. #### Impl details: We'd like to use std::unordered_set because insertion will require lots of moving items around, but we'll have to use std::vector because we need direct access to insert at arbitrary fractional indexes. It's unlikely the number of items in any lock's Q will ever be large enough to require lots of displacement, but welp there's no reason not to plan for scaling. Although if we end up needing scaling that's a symptom of a bigger problem...with scaling itself lol. There shouldn't be enough items blocked on a lock that we have to design the lock's queue to be scalable. ### Inverted Fractionally acquired locksets: The previous ideas of fractionally inserted lockQs was okay, but the acquisition algo required that the async seq be at the front of a locks queue to successfully acquire that lock. That makes it almost impossible for Locksets>1 to ever acquire all of their locks. If we add backoff to that, it basically means no lockset will ever acquire all of its locks. Instead what we now do is always insert at the rear (push_back()) and then when acquiring, we check to see if the sequence is in the first 1/(1/(LockSet.size())), and if so, it successfully acquires the lock. I.e: if the sequence item isn't in the LAST 1/(LockSet.size()) items, then it succeeds. * For a lockset of size=1: It must be at the front of the queue. * Lockset.size=2: it must be in the first 50% of items. * Lockset.size=3: it must be in the first 66% of items. * Lockset.size=4: It must be in the first 75% of items. So this way larger LockSets are favoured, but 1-size locksets make progress. For performance: * We obv can just scan the smaller tail percentage for the item instead of scanning the larger front percentage. * If we use a doubly-linked list, we can prolly keep the insertion iterator and this way we won't have to actually find the item in the lockQ when we wish to eventually remove it from the lockQ when releasing the lock. ## Total overall design: ### Asio queues and Lockvokers: Lockvokers are initially enqueued on a CompThread's queue. When the lockvoker first runs, it checks a flag to see if it has been "registered" into the queues for all locks in its set. If not, then it "registers" itself in each lock's ticketQ and then attempts to acquire each lock. Registration and acquisition are logically separate operations; and locks will often attempt acquisition many times after first registering, without needing to register again. Ideally we can implement a LockSet::registerAndTryAcquireAll() method, but that's for us to think about later. ``` /* We'll need to rename current class LockSpec to LockSet. */ class LockSet { /* Add this either inside of LockSet or outside of it -- depends on whether * it's we can get it to compile because I'm seeing some potential circular * definition dependencies. */ typedef std::pair LockUsageDesc; /* Find a LockUsageDesc -- useful below */ LockUsageDesc &getLockUsageDesc(Qutex &criterionLock) { for (auto &reqLock: requiredLocks) { if (reqLock.first == &criterionLock) { return reqLock; } } // Should never happen. throw; } }; LockSet::register(LockerAndInvoker &lockvoker) { for (auto &lock: lockset.locks) { // Register the Lockvoker object in each lock's ticketQ. lock.second = lock.first.register(lockvoker); } registered = true; } bool LockSet::tryAcquire(LockerAndInvoker &lockvoker) { if (!registered) { // Should never happen. throw ...; } int nLocksAcquired=0, nLocksInSet = lockset.size(); for (auto &lock: lockset.locks) { if (!lock.first.tryAcquire(nLocksInSet)) { break; } nLocksAcquired++; } if (nLocksAcquired == nLocksInSet) { // Success return true; } for (int i=0; i and a std::list. ``` class SpinLock { /* Modify to add methods acquire() and release() which busy-wait. */ void acquire(); void release(); }; class LockSet { /* Modify the std::vector of SpinLock to instead be: * std::vector locks; */ std::vector locks; } class Qutex { public: typedef std::list LockerAndInvokerList; LockerAndInvokerList::iterator register(const LockerAndInvoker &lockvoker) { /** EXPLANATION: * Just insert the lockvoker into the rear of the list. * * Then, since we want to store the */ LockerAndInvokerList::iterator it; lock.acquire(); queue.push_back(lockvoker); it = queue.end(); --it; lock.release(); return it; } void unregister(LockerAndInvokerList::iterator it) { lock.acquire(); queue.erase(it); lock.release(); } bool tryAcquire(LockerAndInvoker &tryingLockvoker, int nRequiredLocks) { lock.acquire(); qNItems = queue.size(); if (qNItems < 1) { lock.release(); /** EXPLANATION: * requiredLocks before ever trying to tryAcquire() them, so if * tryAcquire is being called, that must mean that queue.size() > 0. * * Ergo this should never happen. */ throw; } if (isOwned) { lock.release(); return false; } if (nRequiredLocks == 1) { isOwned.store(true); lock.release(); return true; } /** EXPLANATION: * From here: * if qNItems == 1 the we are the only one in the ticketQ and we have * successfully acquired the lock. * If qNitems / nRequiredLocks == 0, then we acquire by default since * the number of items in the ticketQ guarantees that we are in the top * X% for that nRequiredLocks. * If qNItems / nRequiredLocks >= 1, then we must do the normal algo: * Check the last (qNItems/nRequiredLocks) items, and if the item isn't * in those items, then it must be in the earlier ones (obviously). * Hence this Lockvoker acquisition should be considered successful. * * EXPLANATION 2: * You'll notice that we don't do actual percentages but rather we just * do discrete fractions -- this makes the algo more deterministic * and much easier to reason about. I.e: * If nRequiredLocks is 6 and qNItems==3: * we don't actually calculate that the Lockvoker item must be in * the top (100-17%), and then try to calculate whether we ought to * consider the 3rd item to be in the last 17-percentile. We just * do a fractional count and assume complete discreteness. */ const int nRearItemsToScan = qNItems / nRequiredLocks; if (qNItems == 1 || nRearItemsToScan < 1) { isOwned.store(true); lock.release(); return true; } auto rIt = queue.rbegin(); auto rEndIt = queue.rend(); bool foundInRear = false; for (int i=0; i isOwned; LockerAndInvokerList queue; }; ```