On 2023/06/07 23:57, Yann Sionneau wrote:
Hello uClibc-ng hackers, Damien, Rich,
I am sending this email to discuss the possibility to revert https://cgit.uclibc-ng.org/cgi/cgit/uclibc-ng.git/commit/?id=08d46f1ce21e4ec...
I think this change is wrong, I have been tracking an issue for a few days and I am pretty sure the issue comes from this commit.
Basically what happens is that with this commit, tst-cancel18 test is failing.
The test is available over there: https://github.com/wbx-github/uclibc-ng-test/blob/master/test/nptl/tst-cance...
The reason for this fail is that clock_nanosleep code from libc is badly compiled: https://elixir.bootlin.com/uclibc-ng/latest/source/librt/clock_nanosleep.c#L...
Because SINGLE_THREAD_P is actually defined, in case your libc build supports NPTL threads (which is, I think, the majority of configs, except very old systems using LINUXTHREADS or NOMMU maybe).
In my case (kvx arch):
#define SINGLE_THREAD_P __builtin_expect (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0, 1)
See: https://elixir.bootlin.com/uclibc-ng/latest/source/libpthread/nptl/sysdeps/u...
This results to clock_nanosleep code being compiled to a very simple version that directly calls the syscall and does not change at all the thread cancellability to asynchronous before calling the syscall.
Therefore, when the test main thread calls pthread_cancel(), no signal is sent (see https://elixir.bootlin.com/uclibc-ng/latest/source/libpthread/nptl/pthread_c...) and therefore the nanosleep syscall is never interrupted and the test fails.
clock_nanosleep is defined to be a mandatory cancellation point by POSIX, it must be interruptible in all cases by pthread_cancel, whether in synchronous or asynchronous mode.
I think the fix for the NOMMU case should be different, maybe checking for if LIBC_CANCEL_ASYNC is defined ?
Damien do you have time to work on a new fix for the NOMMU case? Are you OK with the revert?
Not OK with a revert as that will bring back the compilation errors for RISC-V NOMMU case.
Instead, please work on a fix for your case on top of the current code. If I recall correctly, the code around this is rather messy with lots of ifdefs, which may be why this regression was introduced. Cleaning up the code first to avoid that may help avoiding further issues in the future.
Note that to test patches/regressions for risc-v NOMMU builds, you can simply use buildroot and run:
make sipeed_maix_bit_defconfig make
To check that it builds.
Can someone confirm my analysis here?
Thanks!
Best regards,
On Thu, Jun 08, 2023 at 12:03:25AM +0000, Damien Le Moal wrote:
On 2023/06/07 23:57, Yann Sionneau wrote:
Hello uClibc-ng hackers, Damien, Rich,
I am sending this email to discuss the possibility to revert https://cgit.uclibc-ng.org/cgi/cgit/uclibc-ng.git/commit/?id=08d46f1ce21e4ec...
I think this change is wrong, I have been tracking an issue for a few days and I am pretty sure the issue comes from this commit.
Basically what happens is that with this commit, tst-cancel18 test is failing.
The test is available over there: https://github.com/wbx-github/uclibc-ng-test/blob/master/test/nptl/tst-cance...
The reason for this fail is that clock_nanosleep code from libc is badly compiled: https://elixir.bootlin.com/uclibc-ng/latest/source/librt/clock_nanosleep.c#L...
Because SINGLE_THREAD_P is actually defined, in case your libc build supports NPTL threads (which is, I think, the majority of configs, except very old systems using LINUXTHREADS or NOMMU maybe).
In my case (kvx arch):
#define SINGLE_THREAD_P __builtin_expect (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0, 1)
See: https://elixir.bootlin.com/uclibc-ng/latest/source/libpthread/nptl/sysdeps/u...
This results to clock_nanosleep code being compiled to a very simple version that directly calls the syscall and does not change at all the thread cancellability to asynchronous before calling the syscall.
Therefore, when the test main thread calls pthread_cancel(), no signal is sent (see https://elixir.bootlin.com/uclibc-ng/latest/source/libpthread/nptl/pthread_c...) and therefore the nanosleep syscall is never interrupted and the test fails.
clock_nanosleep is defined to be a mandatory cancellation point by POSIX, it must be interruptible in all cases by pthread_cancel, whether in synchronous or asynchronous mode.
I think the fix for the NOMMU case should be different, maybe checking for if LIBC_CANCEL_ASYNC is defined ?
Damien do you have time to work on a new fix for the NOMMU case? Are you OK with the revert?
Not OK with a revert as that will bring back the compilation errors for RISC-V NOMMU case.
Instead, please work on a fix for your case on top of the current code. If I recall correctly, the code around this is rather messy with lots of ifdefs, which may be why this regression was introduced. Cleaning up the code first to avoid that may help avoiding further issues in the future.
Note that to test patches/regressions for risc-v NOMMU builds, you can simply use buildroot and run:
make sipeed_maix_bit_defconfig make
To check that it builds.
As I understand it, the change was wrong. It's conditionally compiling the cancellation support only if SINGLE_THREAD_P is undefined, i.e. only if there is no threads support.
Instead, the cancellation support should probably be compiled only if the cancellation macros are defined. But regardless, the code as it is now is wrong.
Rich
On 2023/06/08 9:18, dalias@libc.org wrote:
On Thu, Jun 08, 2023 at 12:03:25AM +0000, Damien Le Moal wrote:
On 2023/06/07 23:57, Yann Sionneau wrote:
Hello uClibc-ng hackers, Damien, Rich,
I am sending this email to discuss the possibility to revert https://cgit.uclibc-ng.org/cgi/cgit/uclibc-ng.git/commit/?id=08d46f1ce21e4ec...
I think this change is wrong, I have been tracking an issue for a few days and I am pretty sure the issue comes from this commit.
Basically what happens is that with this commit, tst-cancel18 test is failing.
The test is available over there: https://github.com/wbx-github/uclibc-ng-test/blob/master/test/nptl/tst-cance...
The reason for this fail is that clock_nanosleep code from libc is badly compiled: https://elixir.bootlin.com/uclibc-ng/latest/source/librt/clock_nanosleep.c#L...
Because SINGLE_THREAD_P is actually defined, in case your libc build supports NPTL threads (which is, I think, the majority of configs, except very old systems using LINUXTHREADS or NOMMU maybe).
In my case (kvx arch):
#define SINGLE_THREAD_P __builtin_expect (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0, 1)
See: https://elixir.bootlin.com/uclibc-ng/latest/source/libpthread/nptl/sysdeps/u...
This results to clock_nanosleep code being compiled to a very simple version that directly calls the syscall and does not change at all the thread cancellability to asynchronous before calling the syscall.
Therefore, when the test main thread calls pthread_cancel(), no signal is sent (see https://elixir.bootlin.com/uclibc-ng/latest/source/libpthread/nptl/pthread_c...) and therefore the nanosleep syscall is never interrupted and the test fails.
clock_nanosleep is defined to be a mandatory cancellation point by POSIX, it must be interruptible in all cases by pthread_cancel, whether in synchronous or asynchronous mode.
I think the fix for the NOMMU case should be different, maybe checking for if LIBC_CANCEL_ASYNC is defined ?
Damien do you have time to work on a new fix for the NOMMU case? Are you OK with the revert?
Not OK with a revert as that will bring back the compilation errors for RISC-V NOMMU case.
Instead, please work on a fix for your case on top of the current code. If I recall correctly, the code around this is rather messy with lots of ifdefs, which may be why this regression was introduced. Cleaning up the code first to avoid that may help avoiding further issues in the future.
Note that to test patches/regressions for risc-v NOMMU builds, you can simply use buildroot and run:
make sipeed_maix_bit_defconfig make
To check that it builds.
As I understand it, the change was wrong. It's conditionally compiling the cancellation support only if SINGLE_THREAD_P is undefined, i.e. only if there is no threads support.
Instead, the cancellation support should probably be compiled only if the cancellation macros are defined. But regardless, the code as it is now is wrong.
OK. Then let's send a fix on top of that then. Reverting will only change this problem into another. Let's address both with a fix on top instead of reverting.
Rich
June 8, 2023 2:05 AM, "Damien Le Moal" Damien.LeMoal@wdc.com wrote:
Note that to test patches/regressions for risc-v NOMMU builds, you can simply use buildroot and run:
make sipeed_maix_bit_defconfig make
To check that it builds.
Hello Damien,
With buildroot master branch, I can build sipeed_maix_bit_defconfig config with the reverted patch with no build error.
How can I reproduce the issue?
Regards,
Le 08/06/2023 à 10:39, Yann a écrit :
June 8, 2023 2:05 AM, "Damien Le Moal" Damien.LeMoal@wdc.com wrote:
Note that to test patches/regressions for risc-v NOMMU builds, you can simply use buildroot and run:
make sipeed_maix_bit_defconfig make
To check that it builds.
Hello Damien,
With buildroot master branch, I can build sipeed_maix_bit_defconfig config with the reverted patch with no build error.
How can I reproduce the issue?
At most I can see those warnings but it still builds fine:
librt/clock_nanosleep.c: In function 'clock_nanosleep': librt/clock_nanosleep.c:43:22: warning: implicit declaration of function 'LIBC_CANCEL_ASYNC'; did you mean 'LIBC_CANCEL_HANDLED'? [-Wimplicit-function-declaration] 43 | int oldstate = LIBC_CANCEL_ASYNC (); | ^~~~~~~~~~~~~~~~~ | LIBC_CANCEL_HANDLED librt/clock_nanosleep.c:48:7: warning: implicit declaration of function 'LIBC_CANCEL_RESET'; did you mean 'LIBC_CANCEL_HANDLED'? [-Wimplicit-function-declaration] 48 | LIBC_CANCEL_RESET (oldstate); | ^~~~~~~~~~~~~~~~~ | LIBC_CANCEL_HANDLED
Also, I tried to build with and without your patch and then I compared the 2 resulting librt/clock_nanosleep.os object files: they show up exactly the same.
I am guessing that now it works because of a newer compiler that does a better job at optimizing because if I build clock_nanosleep.os with -save-temps I can see the output of pre-processor being:
if (1) r = ({ long _sys_result; { register long int _a7 __asm__ ("a7"); register long _a3 __asm__ ("a3"); long _a3tmp; register long _a2 __asm__ ("a2"); long _a2tmp; register long _a1 __asm__ ("a1"); long _a1tmp; long _a0tmp; register long _a0 __asm__ ("a0"); _a0tmp = (long) (clock_id); _a0 = _a0tmp; _a1tmp = (long) (flags); _a1 = _a1tmp; _a2tmp = (long) (req); _a2 = _a2tmp; _a3tmp = (long) (rem); _a3 = _a3tmp; _a7 = (115); __asm__ volatile ( "scall\n\t" : "=r" (_a0) : "r"(_a7) , "r" (_a0), "r" (_a1), "r" (_a2), "r" (_a3) : "memory"); _sys_result = _a0; } _sys_result; }); else { int oldstate = LIBC_CANCEL_ASYNC ();
r = ({ long _sys_result; { register long int _a7 __asm__ ("a7"); register long _a3 __asm__ ("a3"); long _a3tmp; register long _a2 __asm__ ("a2"); long _a2tmp; register long _a1 __asm__ ("a1"); long _a1tmp; long _a0tmp; register long _a0 __asm__ ("a0"); _a0tmp = (long) (clock_id); _a0 = _a0tmp; _a1tmp = (long) (flags); _a1 = _a1tmp; _a2tmp = (long) (req); _a2 = _a2tmp; _a3tmp = (long) (rem); _a3 = _a3tmp; _a7 = (115); __asm__ volatile ( "scall\n\t" : "=r" (_a0) : "r"(_a7) , "r" (_a0), "r" (_a1), "r" (_a2), "r" (_a3) : "memory"); _sys_result = _a0; } _sys_result; }) ;
LIBC_CANCEL_RESET (oldstate); }
So basically if (1) { } else { call to undefined function }.
I guess the `else` part is being optimized out and it silents the issue.
Regards,
Le 12/06/2023 à 09:17, Yann Sionneau a écrit :
Le 08/06/2023 à 10:39, Yann a écrit :
June 8, 2023 2:05 AM, "Damien Le Moal" Damien.LeMoal@wdc.com wrote:
Note that to test patches/regressions for risc-v NOMMU builds, you can simply use buildroot and run:
make sipeed_maix_bit_defconfig make
To check that it builds.
Hello Damien,
With buildroot master branch, I can build sipeed_maix_bit_defconfig config with the reverted patch with no build error.
How can I reproduce the issue?
At most I can see those warnings but it still builds fine:
librt/clock_nanosleep.c: In function 'clock_nanosleep': librt/clock_nanosleep.c:43:22: warning: implicit declaration of function 'LIBC_CANCEL_ASYNC'; did you mean 'LIBC_CANCEL_HANDLED'? [-Wimplicit-function-declaration] 43 | int oldstate = LIBC_CANCEL_ASYNC (); | ^~~~~~~~~~~~~~~~~ | LIBC_CANCEL_HANDLED librt/clock_nanosleep.c:48:7: warning: implicit declaration of function 'LIBC_CANCEL_RESET'; did you mean 'LIBC_CANCEL_HANDLED'? [-Wimplicit-function-declaration] 48 | LIBC_CANCEL_RESET (oldstate); | ^~~~~~~~~~~~~~~~~ | LIBC_CANCEL_HANDLED
Also, I tried to build with and without your patch and then I compared the 2 resulting librt/clock_nanosleep.os object files: they show up exactly the same.
I am guessing that now it works because of a newer compiler that does a better job at optimizing because if I build clock_nanosleep.os with -save-temps I can see the output of pre-processor being:
if (1) r = ({ long _sys_result; { register long int _a7 __asm__ ("a7"); register long _a3 __asm__ ("a3"); long _a3tmp; register long _a2 __asm__ ("a2"); long _a2tmp; register long _a1 __asm__ ("a1"); long _a1tmp; long _a0tmp; register long _a0 __asm__ ("a0"); _a0tmp = (long) (clock_id); _a0 = _a0tmp; _a1tmp = (long) (flags); _a1 = _a1tmp; _a2tmp = (long) (req); _a2 = _a2tmp; _a3tmp = (long) (rem); _a3 = _a3tmp; _a7 = (115); __asm__ volatile ( "scall\n\t" : "=r" (_a0) : "r"(_a7) , "r" (_a0), "r" (_a1), "r" (_a2), "r" (_a3) : "memory"); _sys_result = _a0; } _sys_result; }); else { int oldstate = LIBC_CANCEL_ASYNC ();
r = ({ long _sys_result; { register long int _a7 __asm__ ("a7"); register long _a3 __asm__ ("a3"); long _a3tmp; register long _a2 __asm__ ("a2"); long _a2tmp; register long _a1 __asm__ ("a1"); long _a1tmp; long _a0tmp; register long _a0 __asm__ ("a0"); _a0tmp = (long) (clock_id); _a0 = _a0tmp; _a1tmp = (long) (flags); _a1 = _a1tmp; _a2tmp = (long) (req); _a2 = _a2tmp; _a3tmp = (long) (rem); _a3 = _a3tmp; _a7 = (115); __asm__ volatile ( "scall\n\t" : "=r" (_a0) : "r"(_a7) , "r" (_a0), "r" (_a1), "r" (_a2), "r" (_a3) : "memory"); _sys_result = _a0; } _sys_result; }) ;
LIBC_CANCEL_RESET (oldstate); }
So basically if (1) { } else { call to undefined function }.
I guess the `else` part is being optimized out and it silents the issue.
Regards,
OK I'll just send a fix for that that applies on top of a revert of 08d46f1ce21e4ec51b2b1626beeaea6cbe7fdc6b
Regards,