Hello,
I would like to know whether there is consensus that this program hanging on a deadlock is a uClibc-ng bug: https://pastebin.com/11qLsTW5
I've asked people to try this on several arch/libc combinations; it works well (no hang) on:
* x86_64/glibc
* x86_64/musl
* armv7l/musl
Also, I've run the puts.c test (linked above) on 3 archs with uClibc-ng, it fails (hangs) on:
* or1k/uClibc-ng
* mips32r6/uClibc-ng
* k1c/uClibc-ng (port not completely published yet)
See failure logs + strace -f of what happens when hanging: https://mypads.framapad.org/mypads/?/mypads/group/uclibc-ng-10be37ap/pad/vie...
My understanding of the issue is that:
puts, and possibly other libc functions, take a lock ( https://elixir.bootlin.com/uclibc-ng/latest/source/libc/stdio/puts.c#L17 ) and end up calling write(), which is a cancellation point ( https://elixir.bootlin.com/uclibc-ng/latest/source/libc/stdio/_stdio.h#L150 ).
So, if a thread is canceled in deferred mode (which is the default one), and the cancellation is acted upon by the write() inside puts(), then the thread will unwind and exit without unlocking the puts lock.
Then, any other thread calling puts() will hang indefinitely (and will in turn hang other threads if it hangs with locks held...).
My understanding of what can be done to fix this issue:
1/ Either make puts a non-cancellation point (see man 7 pthreads: puts is not listed among the mandatory cancellation points, only among the "may" ones). For instance it is not a cancellation point in glibc.
2/ Or keep puts as a cancellation point and fix the puts code so that it releases the lock upon cancellation (using pthread_cleanup_push/pop for instance).
In case people think this is indeed a bug, here are examples of code fixes that I have in mind; please don't hesitate to comment and/or propose something else:
1/ https://pastebin.com/ePsWJzdi
2/ https://pastebin.com/5EA4RedS
One problem with those fixes is that we need to identify all libc functions that take a lock and call a cancellable function, and apply this kind of fix to each of them... This is not easy and a bit painful.
Thanks!
Regards,
On Thu, Sep 26, 2019 at 05:06:51PM +0200, Yann Sionneau wrote:
I think your analysis is correct here. On top of that, though, uclibc has the broken, inherently-racy cancellation implementation inherited from glibc. See:
- https://sourceware.org/bugzilla/show_bug.cgi?id=12683
- https://ewontfix.com/2/
- https://ewontfix.com/4/
- https://ewontfix.com/16/
As such, even if you fix the above bug, it will be unsafe to use it, and critically unsafe unless you block cancellation around at least resource-freeing operations like close. I'd go so far as to say that, if uclibc can't fix this, it should ignore cancellation points which are resource-freeing operations (close, maybe a few others). Resource-allocating ones are also problematic but "just" resource leaks at worst; maybe cancellation should be ignored for them too.
Rich