Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by David Jobe » Sun, 30 May 2010 10:48:29


Hello,

this is a question for a very specific case.

- 1 writer thread which is very time sensitive (reading UDP packets as fast
as it can)
- 1 or more reader threads that sometime need to retrieve the work of the
writer thread

The writer thread manages an array of structures.
Each structure is composed only of integer fields (plain POD).
For example :
struct FieldStruct
{
int field1;
int field2;
// ...
int fieldN;
};

From time to time, one of the reader threads wants to copy the content of a
structure into its own memory space. To keep things simple, the reader thread
already knows the address of the structure updated by the writer thread.

So basically, the reader thread only has to do
FieldStruct myPrivateCopy;
memcpy(&myPrivateCopy, writerStructPtr, sizeof(FieldStruct));

Now what I want to do is guarantee that the reader will always see a
complete update of this structure, i.e., guarantee that the writer
is not in the middle of an update operation while the reader is copying.

In the old days, I would just put a lock on the reader/writer operation.
But I don't want to have the writer wait. I prefer that the reader wait.
Anyway, the chances of contention are low, so if a reader has to wait, it
can wait.

So the idea I have come up with is to add 2 counters to the structure :
struct FieldStruct
{
int counterBegin; // initialized to 0
int field1;
int field2;
// ...
int fieldN;
int counterEnd; // initialized to 0
};

Now the writer does something like :
- increment counterBegin
- update fields
- increment counterEnd

And for the reader to be guaranteed to see a full update, it just has to:
1- memcpy
2- compare counterBegin to counterEnd. If they differ, loop back to 1.
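
In code, the reader loop would be something like this (just a sketch, using
the FieldStruct with the two counters above):

FieldStruct myPrivateCopy;
do {
    // Take a snapshot; it may be torn if the writer was mid-update.
    memcpy(&myPrivateCopy, writerStructPtr, sizeof(FieldStruct));
    // Mismatching counters mean the snapshot is torn: try again.
} while (myPrivateCopy.counterBegin != myPrivateCopy.counterEnd);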

For this to work correctly, I need to add fences so that the very first
operation of an update is guaranteed to be the update to counterBegin, and the
very last operation of an update is guaranteed to be the update to counterEnd.

the writer does something like :
- increment counterBegin
- mfence
- update fields
- increment counterEnd
- mfence

My reasoning is that the fences enforce a "happens-before". I don't think I
need a "synchronizes-with" operation in the reader thread since I just want
to guarantee an "atomic read". It's OK to do a partial read as long as I'm
able to detect it and retry.

And finally the question: do you think it works?
Or perhaps you have a better solution?

Tx

David
 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by KjellKo » Sun, 30 May 2010 17:52:41

> the writer thread manages an array of structure.

David,

I think your strategy could work, as long as the fences are working as
they should ;-)

Another way of doing it is to use your array as a circular buffer;
especially in streaming-media (audio) contexts this seems to be
popular. There are several implementations of lock-free circular
buffers (read: arrays). The limitation of circular (fixed-size)
buffers is of course that you need a strategy to handle a completely
full buffer.

 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by Marcel Mle » Sun, 30 May 2010 19:17:48


The last mfence does not ensure that counterEnd is updated last.
You need an mfence before the increment of counterEnd. And if you do so, a
simple in-service flag might do the job as well.

In theory you may drop the last mfence, but that might cause the
reader to spin for an undefined period.


Yes, with all the restrictions that spinlocks might cause.


With the above correction, yes.


Use an MD5 hash as a checksum. While in theory this ensures nothing, in
practice it would work even without any mfence in the writer. However, a
single mfence after everything could avoid unnecessary spins in the readers.


Marcel
 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by David Jobe » Mon, 31 May 2010 03:09:35

> The last mfence does not ensure that counterEnd is updated at last.
Whoops, yes:

Corrected writer :
- increment counterBegin
- mfence
- update fields
- mfence
- increment counterEnd

Like in :
- inServiceFlag = true
- mfence
- update fields
- mfence
- inServiceFlag = false

Yes, that would work too!
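
For reference, the corrected two-counter writer would look something like this
sketch, with _mm_mfence() standing in for the mfence instruction (function name
made up; single writer assumed; strictly speaking the plain int accesses are
data races at the language level, this is just the x86-level pattern we are
discussing):

#include <emmintrin.h> // _mm_mfence()

void writerUpdate(FieldStruct *s, int v1, int v2)
{
    s->counterBegin = s->counterBegin + 1;
    _mm_mfence(); // the fields must not become visible before counterBegin
    s->field1 = v1;
    s->field2 = v2;
    // ...
    _mm_mfence(); // counterEnd must not become visible before the fields
    s->counterEnd = s->counterEnd + 1;
}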


For this one, I don't follow you. How would an extra final mfence
prevent the reader from spinning?
My understanding is that we're just preventing the CPU from reordering the
loads/stores; we're not forcing a flush of the write buffer, right?

If I want to force a flush, should I use an RMW instruction like an xchg on a
dummy variable? That would force a "synchronizes-with" and would prevent my
client from spinning, right?

I'll consider implementing a checksum too. I'm just afraid it's going to be
too much for the writer to keep up with.
I wish there were a way in UDP to retrieve several buffers at once (recvmmsg
in the latest Linux kernel) or to have the OS dump the buffers asynchronously
for me (like an asio on steroids that would also work with UDP).

David
 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by David Jobe » Mon, 31 May 2010 03:33:48

> Another way of doing it is to use your array as a circular buffer,

In fact I have several independent structures in the array. Each of them
has its own life.
I guess I could have an array of circular buffers. I don't need a lot
of elements in each circular buffer, just enough that the reader can
memcpy without the writer doing a full loop. Because this is not something I
can guarantee, I still need a way to detect a race, but it would lower the
chance of contention for my reader.

Something like
typedef vector<CircularBuffer<FieldStruct, 10> > FieldStructs;
FieldStructs array;

writer (for FieldStruct i):
CircularBuffer<FieldStruct, 10> &circularBuffer = array[i];
FieldStruct &lastVersion = circularBuffer.current();
FieldStruct &newVersion = circularBuffer.reserveNext(); // the element just past current() in the circular buffer
updateFields(newVersion);
circularBuffer.commit(); // make newVersion the current version (advance the current index)

reader:
FieldStruct myCopy = array[i].current();
// This copy is not atomic, so I need to detect a potential "loop" of the writer.
// Detecting a loop (roll-over?) would indicate that the writer has rewritten this
// particular element of the circular buffer a second time while I was copying it.

I guess I can just change the interface of the circular buffer to also
return the current index.
Let's say the circular buffer owns an index which is always incremented and
never reset to 0 (well, except when it overflows).

current() could be implemented as:
T &CircularBuffer<T, N>::current() {
    return internalBuffer[currentIndex % N];
}

now if I call this new function:
T &CircularBuffer<T, N>::getCurrentAndIndex(int &index) {
    index = currentIndex;
    return internalBuffer[index % N];
}

And I have this new function:
bool CircularBuffer<T, N>::rolled(int index) {
    // The writer may already be rewriting slot (index % N) once currentIndex has
    // advanced N - 1 past index, hence the N - 1 bound. (The case of currentIndex
    // overflowing and rolling back to 0 still needs handling.)
    return currentIndex - index >= N - 1;
}

the reader becomes:
int index;
FieldStruct myCopy;
do {
    myCopy = array[i].getCurrentAndIndex(index);
} while (array[i].rolled(index));

Now all I need is for CircularBuffer::currentIndex to be a volatile variable
updated with an atomic increment...
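
Roughly, with C++11 atomics instead of volatile, the index handling could look
like this (just a sketch; the method names follow the pseudo-code above, and
index becomes unsigned so the wrap-around subtraction is well defined):

#include <atomic>

template <typename T, unsigned N>
class CircularBuffer
{
    T internalBuffer[N];
    std::atomic<unsigned> currentIndex{0}; // only ever incremented; wraps naturally

public:
    T &getCurrentAndIndex(unsigned &index)
    {
        index = currentIndex.load(std::memory_order_acquire);
        return internalBuffer[index % N];
    }

    bool rolled(unsigned index) const
    {
        // Unsigned arithmetic keeps the subtraction valid across wrap-around.
        return currentIndex.load(std::memory_order_acquire) - index >= N - 1;
    }

    T &reserveNext()
    {
        return internalBuffer[(currentIndex.load(std::memory_order_relaxed) + 1) % N];
    }

    void commit()
    {
        currentIndex.fetch_add(1, std::memory_order_release); // the atomic increment
    }
};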

Sorry for this long post. I hope this makes some sense ;-)

David
 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by sfuers » Mon, 31 May 2010 09:35:44

On May 29, 11:09, David Jobet < XXXX@XXXXX.COM > wrote:

The above algorithm will not work.

A reader can start reading while inServiceFlag is false, and then be
scheduled out. The writer can then update the structure, toggling the
flag to true, and then back to false. When the reader continues, it
reads the updated data, and sees a false flag. It will get a
combination of data from two (or more) updates. A similar problem
affects the counter version, but there the reader needs to miss so
many updates that the counter wraps.

How about using the xchg() operation to publish the data? Note that
the xchg instruction will nicely include the memory barrier you need.
Since replacing mfence + a memory operation with xchg is a suggested
optimization, this algorithm should be quite a bit faster than the
two-mfence version.

The shared state consists of a pointer-sized piece of memory. If the
lower bit is not set, then the memory is a pointer. If it is set,
then it corresponds in some way to a reader.
The writer can then allocate some memory, and store the publishable
information in that block of memory. When it is done, it uses the xchg
operation to swap in a pointer to the new information, publishing it.
If the writer gets a value with a cleared lower bit back from the xchg, it
can reuse that memory block.

A reader then uses xchg() to set the pointer-sized piece of memory to
some other value with a low bit set. If the reader gets something
without the low bit set, then it has a pointer to the latest piece of
published information. If it doesn't, it gets to know which other
reader has that information. (Or which reader knows which reader has
that information, or similar.) The readers can then use locks to
arbitrate amongst themselves without affecting the wait-free writer.

An example lock-based algorithm for the readers, with a writer-
preferring read-write lock per reader, might be:
Write-lock the rwlock corresponding to my thread.
Try to get data from the writer via xchg.
If I succeed, copy the data into my publication area.
Unlock the rwlock corresponding to my thread. Done.

If I fail, store the location of the reader who has the latest
information in my publication area.
Unlock my rwlock.
read-lock the lock in the reader thread indicated by the output from
xchg.
If they don't have the data, read who they think has it. Unlock their
read lock, and try that new reader recursively until you find the
data.

Copy the data from the reader, and then unlock their read lock.
Write-lock my rwlock.
Copy the data into my publication area.
Unlock my rwlock. Done.
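
In rough code, the writer half could look like the following sketch, with a
C++11 atomic exchange standing in for the raw xchg (Block and the low-bit
tagging are illustrative; the reader-side lock chaining is left out):

#include <atomic>
#include <cstdint>

struct Block { int fields[16]; }; // placeholder for the published data

// Low bit clear: pointer to the latest published block.
// Low bit set: identifies a reader that is waiting for / holding the data.
static std::atomic<std::uintptr_t> g_shared{0};

// 'next' is a block the writer owns and has already filled with the new data.
// Returns a block the writer may reuse next time, or 0 if a reader claimed
// the previously published block (the writer then takes a fresh block).
Block *writerPublish(Block *next)
{
    std::uintptr_t old =
        g_shared.exchange(reinterpret_cast<std::uintptr_t>(next),
                          std::memory_order_acq_rel); // the xchg; a full barrier on x86

    if (old == 0 || (old & 1)) // nothing published yet, or a reader's tag
        return 0;
    return reinterpret_cast<Block *>(old); // previous block came back unclaimed
}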

 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by Chris M. T » Mon, 31 May 2010 12:57:09

 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by David Jobe » Mon, 31 May 2010 23:23:28

> The above algorithm will not work.
Indeed

Yes, for that one, I'm willing to take the risk, considering that anyway
contention is supposed to be low.


- xchg(counterBegin, counterBegin + 1)
- update fields
- xchg(counterEnd, counterEnd + 1)

??


I see. I don't want to do any allocations inside the writer.


Sounds complicated now.


So you establish a chain of readers. Sounds a lot like the CLH lock (queue
based spinlock).
That's clever.
I'll remember it.

However, I don't think the added complexity is worth it, since I think the
contention will be low, plus I now have to manage memory (deallocation).

The use case is something like:
- 1500 structures updated in the writer in a very tight UDP reading loop
- 50 in use per second in the readers. The distribution varies.

David
 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by David Jobe » Mon, 31 May 2010 23:29:55

> SeqLock:

Tx!
This page lists this LWN page: http://www.yqcomputer.com/
And the patch gives a bit of the implementation.

So it sounds a lot like what I proposed, with the addition of:
- a read memory barrier in the readers
- a spinlock in the writer

For the read memory barrier, I guess I don't need it on x86, right?
For the spinlock, what is the purpose? In case there are multiple writers?

I like the API a lot; I'm going to try to change my code so it looks like
this.

Tx !

David
 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by David Jobe » Mon, 31 May 2010 23:33:17

> How about using the xchg() operation to publish the data? Note that

Is that true?

Quicker in terms of
(1) the CPU executing the instructions quicker, or
(2) the memory being visible quicker?

I'd agree with (2) but not with (1), since an RMW operation involves a lock of
the bus, right?

If I want to speed up the visibility of memory, can I just add an xchg on a
dummy variable?

Tx

David
 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by sfuers » Tue, 01 Jun 2010 01:28:25

<snip>


um.... more like:
xchg(&counter, counter+1)
update fields
xchg(&counter, counter+1)

You only need one counter.

A reader:
read counter, repeat if low bit is set.
read data
read counter again; repeat everything if the counter is different from the
original read.
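
In code, that reader would be roughly this (a sketch; counter is the single
version counter the writer bumps before and after each update, and data is the
shared POD, e.g. the FieldStruct from the original post without the embedded
counters):

#include <string.h>

FieldStruct readSnapshot(const volatile unsigned &counter, const FieldStruct &data)
{
    FieldStruct copy;
    for (;;) {
        unsigned v1 = counter;             // read counter
        if (v1 & 1)                        // low bit set: the writer is mid-update
            continue;
        memcpy(&copy, &data, sizeof copy); // read data
        if (counter == v1)                 // read counter again
            return copy;                   // same even value: consistent snapshot
    }
}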


Actually, the number of memory blocks you need is bounded by the total
number of threads. A simple array of "inuse" bytes will work. Since
only the writer allocates, it changes a byte from 0->1. A reader
deallocates by setting a byte from 1->0. No locking is required.
Most of the time, no memory barriers are required either. Only if the
writer falsely sees that all blocks are in use does it need to execute
an mfence to synchronize.
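
A sketch of that bookkeeping with C++11 atomic flags (sizes and names are made
up; Block stands for whatever gets handed to the readers):

#include <atomic>
#include <cstddef>

struct Block { int fields[16]; };                   // placeholder payload

static const std::size_t NBLOCKS = 16;              // >= total number of threads
static Block g_blocks[NBLOCKS];
static std::atomic<unsigned char> g_inuse[NBLOCKS]; // 0 = free, 1 = handed out

// Writer side: only the writer ever flips a flag from 0 to 1.
Block *writerAlloc()
{
    for (;;) {
        for (std::size_t i = 0; i < NBLOCKS; ++i) {
            if (g_inuse[i].load(std::memory_order_relaxed) == 0) {
                g_inuse[i].store(1, std::memory_order_relaxed);
                return &g_blocks[i];
            }
        }
        // Rare slow path: everything looked busy; synchronize and rescan.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }
}

// Reader side: give a block back by flipping its flag from 1 to 0.
void readerFree(Block *b)
{
    g_inuse[b - g_blocks].store(0, std::memory_order_release);
}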


Yeah... very complex. However, the writer mostly does one atomic
operation per iteration. The readers are also guaranteed to make
progress, instead of perhaps perpetually spinning while the writer
updates. Livelock can be annoying.
>>>> An example lock-based algorithm for the readers, with a writer-
>>>> preferring read-write lock per reader, might be:
>>>> Write-lock the rwlock corresponding to my thread.
>>>> Try to get data from the writer via xchg.
>>>> If I succeed, copy the data into my publication area.
>>>> Unlock the rwlock corresponding to my thread. Done.
>>
>>>> If I fail, store the location of the reader who has the latest
>>>> information in my publication area.
>>>> Unlock my rwlock.
>>>> read-lock the lock in the reader thread indicated by the output from
>>>> xchg.
>>>> If they don't have the data, read who they think has it. Unlock their
>>>> read lock, and try that new reader recursively until you find the
>>>> data.
>>
>>>> Copy the data from the reader, and then unlock their read lock.
>>>> Write-lock my rwlock.
>>>> Copy the data into my publication area.
>>>> Unlock my rwlock. Done.
>>
>> So you establish a chain of readers. Sounds a lot like the CLH lock (queue
>> based spinlock).
>> That's clever.
>> I'll remember it.

It's CLH with queue-jumping because multiple readers can read the same
data at the same time. :-)
>> However, I don't think the added complexity is worth it, since I think the
>> contention will be low, plus I now have to manage memory (deallocation).
>>
>> The use case is something like:
>> - 1500 structures updated in the writer in a very tight UDP reading loop
>> - 50 in use per second in the readers. The distribution varies.
>>
>> David

Yeah, it is a rather complex algorithm. I'd try the seqlock first,
and see whether livelock is a problem. It was just an exercise
to see whether an xchg-based "publication" algorithm would work.

Steven
 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by sfuers » Tue, 01 Jun 2010 01:41:40


Apparently, nearly all the time you use an mfence instruction, there
will be a store to memory immediately before or after the mfence.
(Otherwise, why would you use an mfence at that point?)

You can replace the sequences:

store to location x
mfence

and

mfence
store to location x

with y = xchg(&x, value for x), and then ignore y.

This is one instruction shorter, and supposedly faster on x86. I
haven't actually benchmarked this though.
I'm guessing you can also replace memory-altering instructions
with their locked versions to elide the mfence in a
similar optimization:

i.e.
mfence
x = x | 2

could change into:

atomic_or(&x, 2)

which is "lock; or $2, x(%rip)" in gnu asm.

Steven
 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by Chris M. T » Mon, 07 Jun 2010 16:06:08


FWIW, there is a more "scalable" version of a seqlock wrt readers. IIRC,
Dmitriy Vyukov presented it. Hmm, let's see if I can remember it...
Here is some quick pseudo-code, memory barriers aside for a moment:
________________________________________________________________________
template<typename T, size_t T_depth>
class seqlock
{
    struct cell
    {
        T m_object; // = T()
        atomic<unsigned> m_version; // = 0
    };


    cell m_buffer[T_depth];
    atomic<cell*> m_current; // = &m_buffer[0]
    mutex m_mutex;


public:
    T read()
    {
        T object;

        for (;;)
        {
            cell* c = m_current.load();
            unsigned version1 = c->m_version.load();

            if (! (version1 % 2))
            {
                object = c->m_object;
                unsigned version2 = c->m_version.load();
                if (version1 == version2) break;
            }

            backoff();
        }

        return object;
    }


    template<typename T_copy_functor>
    void write(T_copy_functor copy_fun)
    {
        lock_guard<mutex> lock(m_mutex);
        cell* current = m_current.load();
        cell* next = m_buffer + (((current - m_buffer) + 1) % T_depth);
        unsigned version = next->m_version.load();
        next->m_version.store(version + 1); // odd: readers back off from this cell
        copy_fun(current->m_object, next->m_object); // copy current to next
        next->m_version.store(version + 2); // even again: cell is consistent
        m_current.store(next);
    }
};
________________________________________________________________________




You could use it like:
________________________________________________________________________
#include <cassert>


struct data
{
    int m_1; // = 0
    int m_2; // = 0
};


static seqlock<data, 8> g_data;


void reader_threads()
{
    data ldata;

    for (;;)
    {
        // wait for frequent read requests.

        // perform the read.
        ldata = g_data.read();

        // check for consistency.
        assert(ldata.m_1 == ldata.m_2);
    }
}


struct writer_functor
{
    void operator ()(data const& current, data& next)
    {
        next.m_1 = current.m_1 + 1;
        next.m_2 = current.m_2 + 1;
    }
};


void writer_threads()
{
    for (;;)
    {
        // wait for infrequent write requests.

        // perform the write.
        g_data.write(writer_functor());
    }
}
________________________________________________________________________




That should do it! :^)
 
 
 

Wait free writer/lock free reader on POD struct. Guaranteeing integrity ?

Post by David Jobe » Tue, 15 Jun 2010 00:21:19

That's a nice one too!
It's a combination of the seqlock with a ring buffer, so as to reduce the
probability of a reader spinning.

Tx very much !

David

Chris M. Thomasson wrote: