On Thu, Sep 26, 2019 at 05:06:51PM +0200, Yann Sionneau wrote:
Hello,
I would like to know if there is consensus on the fact that this
program hanging on a deadlock is a uClibc-ng bug:
https://pastebin.com/11qLsTW5
I've asked people to try this on several arch/libc combination, it
works well (no hang) on:
* x86_64/glibc
* x86_64/musl
* armv7l/musl
Also, I've run the puts.c test (linked above) on 3 archs with
uClibc-ng, it fails (hangs) on:
* or1k/uClibc-ng
* mips32r6/uClibc-ng
* k1c/uClibc-ng (port not completely published yet)
See failure logs + strace -f of what happens when hanging:
https://mypads.framapad.org/mypads/?/mypads/group/uclibc-ng-10be37ap/pad/vi…
My understanding of the issue is that:
puts, and possibly other libc functions, are taking a lock (
https://elixir.bootlin.com/uclibc-ng/latest/source/libc/stdio/puts.c#L17
) and end up calling write() which is a cancellation point. (
https://elixir.bootlin.com/uclibc-ng/latest/source/libc/stdio/_stdio.h#L150
)
So, if a thread is canceled, is asynchronous mode (which is the
default one), and the cancelation is triggered by the write() inside
the puts(), then the thread will unwind and exit without unlocking
the puts lock.
Then, any other thread calling puts() will hang indefinetely (and
hang other threads if it hangs with locks held...).
My understanding of what can be done to fix this issue:
1/ Either make puts a non cancelation point (see man 7 pthreads,
puts is not listed in mandatory cancelation point, only in "may").
For instance it is not a cancelation point in glibc.
2/ Or keep puts as a cancelation point and fix the puts code so that
it releases the lock upon cancelation (using
pthread_cleanup_push/pop for instance)
In case people think this is indeed a bug, here are examples of code
fixes that I have in mind, please don't hesitate to comment or/and
propose something else:
1/
https://pastebin.com/ePsWJzdi
2/
https://pastebin.com/5EA4RedS
One problem of those fixes is that we need to identify all libc
functions that take a lock and call a cancellable function and apply
such kind of fixes... This is not easy and a bit painful.
I think your analysis is correct here. On top of that, though, uclibc
has the broken, inherently-racy cancellation implementation inherited
from glibc. See:
-
https://sourceware.org/bugzilla/show_bug.cgi?id=12683
-
https://ewontfix.com/2/
-
https://ewontfix.com/4/
-
https://ewontfix.com/16/
As such, even if you fix the above bug, it will be unsafe to use it,
and critically unsafe unless you block cancellation around at least
resource-freeing operations like close. I'd go so far as to say that,
if uclibc can't fix this, it should ignore cancellation points which
are resource-freeing operations (close, maybe a few others).
Resource-allocating ones are also problematic but "just" resource
leaks at worst; maybe cancellation should be ignored for them too.
Rich