Monday, 24 October 2016

Linked Array Queues, part 1: SPSC

When considering concurrent queues people often go for either:
  1. An array backed queue (circular array/ring buffer etc.)
  2. A linked list queue
The trade off in the Java world seems to be that array backed queues offer better throughput, but are always bounded and allocated upfront, and linked queues offer smaller footprint when empty, but worse throughput and higher overhead per element when full. Linked queues are also often unbounded.
In JCTools we have the above duality, but in later versions introduced a hybrid queue which attempts to offer a middle ground between the 2 extremes above. This hybrid is the linked array queue:
  1. Queue is backed by one or more arrays of references.
  2. As long as the queue can fit in a single array no further arrays are allocated.
  3. When empty the queue can naturally shrink to a single backing array.
For the SPSC case this has already been done in C++ Fast Flow with their uSPSC queues. In Java there are no other implementations that I know of (please let me know if I missed anything).
In implementing these queues I have relied heavily on the feedback and support of @akarnokd@pcholakov and others who contributed fixes, filed bugs, and so forth. Thanks guys!
3 variations on linked array queues have been explored in JCTools:
  1. Chunked: Each backing array is a fixed chunk size. The queue is bounded.
  2. Unbounded: Each backing array is a fixed chunk size. The queue is unbounded.
  3. Growable: Chunk size doubles every time until the full blown backing array is used. The queue is bounded.

Hot Path offer/poll

The queues all share the same polling logic and on the fast path share the same offer logic as well:
If you've not read JCTools code before, or maybe you've forgotten, here's the convention:
  • sv/lvXXX: Store/Load Volatile, same as a volatile field write/read
  • sp/lpXXX: Store/Load Plain, same as a normal field write/read
  • soXXX: Store Ordered, same as an AtomicXXX.lazySet
This code will need reconsidering in a post-JDK9 world, some other time perhaps.
So what's the deal:
  • As long as we are not passed the producerLimit, keep writing.
    • If we have passed it go to slow path (where the money be)
  • As long as there's stuff in the queue, read it.
    • Unless it's JUMP, in which case read through to next array.
The general idea here being that the common case is small and simple. This should have the following effects:
  1. offer/poll code is small enough to inline when hot.
  2. offerColdPath/newBufferPoll are set up to either not inline or, when inlined be out of band code blocks. This should keep size on the hot path small and help the compiler focus on more important things.
  3. offer/poll should perform similar to the SpscArrayQueue in the common case.
NOTE: The producer publishes element then index, but the consumer does the reverse. This ensures a consistent view on the consumer side where the following assertion must hold:
  !queue.isEmpty() => queue.poll() != null

NOTE: In some early versions the new array was entered instead of the JUMP constant marker. This required an instanceof check for each element loaded and a compromise to either not allow Object[] to be an element of the queue or introduce some wrapper class. Comparing to a constant turned out to be much simpler and faster.

Cold Path poll

The consumer has hit a JUMP, which indicates the producer has linked a new array to this one. The new array is the last element of the current array. We go to newBufferPoll:

The consumer therefore has to:
  1. Load new array from the last element of old array.
  2. Null out reference from old to new (prevent nepotism).
  3. Adjust consumer view on buffer and mask.
  4. Poll (remove element and progress counter) from new buffer.
Note that peek similarly refreshes view of buffer on hitting the JUMP marker. This goes against the standard spirit on peek which is a view only operation.

Cold Path offer: Unbounded

This method is implemented differently for each of the use cases, unbounded is the simplest:
In the unbounded case we care only about our ability to make progress inside the current buffer:

  1. Probe inside buffer to see if we have 'many' elements to go before full. If buffer is mostly empty (this is the case for most applications most of the time), than we have successfully saved ourselves loading elements from the queue before writing in. A successful probe will update the producer limit and write to the queue.
  2. Probe failed, we check if we can write a single element. Remember we always need one free slot to JUMP with, so we look at the slot following the one we are on. If the next slot is empty we write to the current one, but we don't update the limit.
  3. A single slot remains in the buffer. We allocate a new buffer and link to it.
This is not a large departure from the SpscArrayQueue, I leave the comparison as an exercise to the delightful reader.

Cold Path offer: Chunked

With chunked linked array queues we have a fixed chunk size, but an overall bound on the size. This complicates matters because on top of the buffer level limits put on the producer we must also consider the queue level limitation. In particular there might be space available in the current producer buffer, while the queue is in fact full. Here's the implementation:
Similar to the above but the difference lies in negotiating the buffer vs. queue limit.

Cold Path offer: Growable

The growable queue is similar in spirit to an ArrayList as it doubles it's backing array capacity when a buffer is full. This adds an interesting bit of information to the game, since:
  1. If we are not on the last buffer, we've not hit the queue limit,
  2. If we're on the last buffer, and so is the consumer, we can revert to just checking for null in the buffer.
  3. If we're on the last buffer, and the consumer isn't, we need to hang tight and let it pass. It's a temporary state.
The code for handling this is rather ugly and long, but since you've put up with the rest:
The lookAheadStep is dynamically adjusted as the buffer grows, and also acts as an indicator for the transition period which the producer is on the last buffer and the consumer is trailing. It's a mess to look at, but sadly performs better than a neater alternative which builds on the Chunked variant... General idea:

  1. lookAheadStep is positive => we are either not on last buffer, or on  it for both consumer and producer => it is enough to consider the elements in the producer buffer to determine if the queue is full. In particular if the buffer is full then we must resize unless we are on the last buffer in which case we are full. Note that we allow using the last buffer to the last element, since we don't need a JUMP anymore.
  2. lookAheadStep is negative => we are waiting for consumer to get to the last buffer. We use the lookAheadStep to indicate how far behind the consumer is.
It's not complex, just messy, and if you got an elegant representation please ping me with your suggestions.

Performance?

GODDAMN it! this is the time consuming part! I've benchmarked on a few setups, but not kept track or clear records. I'll need to do it all again, might as well be another post since nobody reads this far into these things anyhow. La la la la la, performance is great, la la la la la, if I look now I might be disappointed, la la la la la, this is better than benchmarking, la la la la la.

TO BE CONTINUED...

1 comment:

  1. Nitsan, Please stop singing, and show me the mon... benchamrks ;-)

    ReplyDelete