Discussion:
Another potential awk or xargs bug?
Jordan Geoghegan
2021-04-15 05:19:10 UTC
Hello,

I've found some very interesting behaviour when subjecting various awk implementations to some rather specific circumstances.

I'm basically looking for a sanity check here to confirm if I'm just wildly flailing, or if I am indeed onto something here.

Here's my situation:

When parsing some RIR data in parallel using awk with xargs, I seem to have found a way to reliably lose and/or mangle output with parallel xargs. My google-fu seems to be failing me. I understand that xargs does not buffer output and that lines may arrive out of order, but in this case I am reliably and reproducibly losing data and receiving mangled output. But wait, it gets stranger.

I don't want to lose you guys here with a long-winded explanation, so I'm going to show you a diff that demonstrates the reproducibly mangled output when using xargs in parallel mode:

--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
@@ -1,5 +1,3 @@
-267386
-A264890
 AS262399
 AS262400
 AS262401
@@ -1774,6 +1772,7 @@
 AS264887
 AS264888
 AS264889
+AS264890
 AS264891
 AS264892
 AS264893
@@ -3552,6 +3551,7 @@
 AS267383
 AS267384
 AS267385
+AS267386
 AS267387
 AS267388
 AS267389
@@ -4220,6 +4220,7 @@
 AS268318
 AS268319
 AS268320
+AS268320
 AS268321
 AS268321
 AS268323
@@ -7785,6 +7786,7 @@
 AS270633
 AS270633
 AS270634
+AS270634
 AS270635
 AS270635
 AS270636
@@ -10277,5 +10279,3 @@
 AS46210
 AS46280
 AS46280
-ASAS268320
-ASS270634

The only thing that changed between these runs was me using either xargs -P 1 or -P 2.

To allow folks to follow along with me at home, I've included the two files (gzipped for politeness) I used to trigger this behaviour.

Once you've extracted the attached text files into your working directory, here's a snippet that should reproduce my issue:

$ printf 'BR\nCA\n' > cc.txt

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- awk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt

What does this one-liner do? It slurps the country codes specified in cc.txt into an array, then checks the first field of each row of the RIR data against that array. If the first field matches a country code in the array and the second field indicates that the row is an ASN record, then we print the third field prefixed with 'AS'. As you can see, if you grep the output of the above command for the strings "ASAS", "ASS" or "A2", you should see some mangled ASNs. If you change "-P 2" to "-P 1", this mangling will not occur.
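
For anyone who'd rather not squint at the one-liner, here's the same
awk program laid out readably (functionally identical):

    # First file (cc.txt): remember each country code in the array A
    NR == FNR { A[$1] = 1 ; next }

    # Remaining input (the RIR data): if the country code is in A and
    # the record is an ASN entry, print field 3 prefixed with "AS"
    $1 in A && $2 == "asn" { printf("AS%s\n", $3) }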

Here's where things get very weird. While parsing this data (as part of a larger dataset comprising an aggregation of all the registrar delegation statistics) I've been using this snippet for a while to quickly fetch ASN records. It is not until I have BOTH the BR and CA country codes in the array that I can trigger this bug. I can have any number of country codes in the array, but if Brazil AND Canada happen to be specified in the array, then I get mangled output, but ONLY if executed with parallel xargs. This reproducibly happens when using awk, gawk or mawk. To further melt your brain, this behaviour has NOT been observed when using goawk, a POSIX compliant awk implementation written in go.

Just to prove my point, here's me comparing the hashed outputs of various awk implementations using the above one-liner:

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- awk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    2a20f44ce6a23d5c49b05b9f2689ef93

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- awk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    9ab3dbfbff5746f059cdb35221ff73b1
---
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- mawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    2a20f44ce6a23d5c49b05b9f2689ef93

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- mawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    9ab3dbfbff5746f059cdb35221ff73b1
---
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- ~/go/bin/goawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    9ab3dbfbff5746f059cdb35221ff73b1

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- ~/go/bin/goawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    9ab3dbfbff5746f059cdb35221ff73b1

I've racked my brain and scoured the internet for hours; I've tested and toiled, and I'm left thoroughly perplexed. I now humbly ask the fine folks here in OpenBSD Land for guidance, insight or suggestions.

As always, is this a bug, or am I holding it wrong?

Regards,

Jordan
Christian Weisgerber
2021-04-15 14:29:17 UTC
--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
I'll note that no characters have been lost between the two files.
Only the order is different.
The only thing that changed between these runs was me using either xargs -P 1 or -P 2.
What do you expect? You run two processes in parallel that write
to the same file. Obviously their output will be interspersed in
unpredictable order.

You seem to imagine that awk's output is line-buffered. But when
it writes to a pipe or file, its output is block-buffered. This
is default stdio behavior. Output is written in block-size increments
(16 kB in practice) without regard to lines. So, yes, you can end
up with a fragment from a line written by process #1, followed by
lines from process #2, followed by the remainder of the line from
#1, etc.
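
You can watch this happen in isolation. Rough sketch (the exact
count varies per run, since it depends on buffer sizes and
scheduling): two awk processes write short lines into the same
pipe, and we count the torn lines containing fragments from both
writers:

$ ( awk 'BEGIN { for (i = 0; i < 200000; i++) print "aaa" i }' &
    awk 'BEGIN { for (i = 0; i < 200000; i++) print "bbb" i }' &
    wait ) | grep -Ec 'aaa[0-9]+bbb|bbb[0-9]+aaa'

Any non-zero count means a buffer flush from one process landed in
the middle of a line from the other.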
--
Christian "naddy" Weisgerber ***@mips.inka.de
Otto Moerbeek
2021-04-15 14:49:14 UTC
Post by Christian Weisgerber
--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
I'll note that no characters have been lost between the two files.
Only the order is different.
The only thing that changed between these runs was me using either xargs -P 1 or -P 2.
What do you expect? You run two processes in parallel that write
to the same file. Obviously their output will be interspersed in
unpredictable order.
You seem to imagine that awk's output is line-buffered. But when
it writes to a pipe or file, its output is block-buffered. This
is default stdio behavior. Output is written in block-size increments
(16 kB in practice) without regard to lines. So, yes, you can end
up with a fragment from a line written by process #1, followed by
lines from process #2, followed by the remainder of the line from
#1, etc.
--
Right, a fflush() call after the printf makes the issue go away, but
only since awk is being nice and issues a single write call for that
single printf. Since awk afaik does not give such a guarantee, it is
better to have each parallel invocation write to a separate file and
then cat them together after all the awk runs are done.
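
For completeness, the fflush() variant of the one-liner would look
like this (again, only safe as long as each flushed line goes out
in a single write):

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- \
    awk -F '|' 'NR==FNR { A[$1]=1 ; next }
    $1 in A && $2 == "asn" { printf("AS%s\n", $3) ; fflush() }' cc.txt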

-Otto
Jordan Geoghegan
2021-04-16 09:26:27 UTC
Post by Otto Moerbeek
Post by Christian Weisgerber
--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
I'll note that no characters have been lost between the two files.
Only the order is different.
The only thing that changed between these runs was me using either xargs -P 1 or -P 2.
What do you expect? You run two processes in parallel that write
to the same file. Obviously their output will be interspersed in
unpredictable order.
You seem to imagine that awk's output is line-buffered. But when
it writes to a pipe or file, its output is block-buffered. This
is default stdio behavior. Output is written in block-size increments
(16 kB in practice) without regard to lines. So, yes, you can end
up with a fragment from a line written by process #1, followed by
lines from process #2, followed by the remainder of the line from
#1, etc.
--
Right, a fflush() call after the printf makes the issue go away, but
only since awk is being nice and issues a single write call for that
single printf. Since awk afaik does not give such a guarantee, it is
better to have each parallel invocation write to a separate file and
then cat them together after all the awk runs are done.
-Otto
Hello Christian and Otto,

Thank you for setting me straight. The block vs line buffering issue should have been obvious to me. What got me confused was that this solution worked well for a long time - until it didn't. One would assume that it would consistently mangle output...

While fflush does seem to fix the issue, I wanted to explore your suggestion, Otto, of writing to a temporary file from within awk.

Is something like the following a sane approach to safely generating temporary files from within awk?

BEGIN{ cmd = "mktemp -q /tmp/workdir/tmp.XXXXXXX" ; if( ( cmd | getline result ) > 0 ) TMPFILE = result ; else exit 1 }

Unless I'm missing something obvious, it seems there is no way to capture both the stdout and the return code of an external command from within awk. My workaround for error-checking the call to mktemp here is to abort if mktemp returns no data. Is this sane?
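
For context, here's the full sketch of what I have in mind: the
mktemp BEGIN block from above plus redirecting the output to the
temp file (the redirection part is untested on my end):

    BEGIN {
        cmd = "mktemp -q /tmp/workdir/tmp.XXXXXXX"
        if ((cmd | getline result) > 0)
            TMPFILE = result
        else
            exit 1
        close(cmd)
    }
    NR == FNR { A[$1] = 1 ; next }
    $1 in A && $2 == "asn" { printf("AS%s\n", $3) > TMPFILE }
    END { close(TMPFILE) }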

Regards,

Jordan
Otto Moerbeek
2021-04-16 09:55:56 UTC
Post by Jordan Geoghegan
Post by Otto Moerbeek
Post by Christian Weisgerber
--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
I'll note that no characters have been lost between the two files.
Only the order is different.
The only thing that changed between these runs was me using either xargs -P 1 or -P 2.
What do you expect? You run two processes in parallel that write
to the same file. Obviously their output will be interspersed in
unpredictable order.
You seem to imagine that awk's output is line-buffered. But when
it writes to a pipe or file, its output is block-buffered. This
is default stdio behavior. Output is written in block-size increments
(16 kB in practice) without regard to lines. So, yes, you can end
up with a fragment from a line written by process #1, followed by
lines from process #2, followed by the remainder of the line from
#1, etc.
--
Right, a fflush() call after the printf makes the issue go away, but
only since awk is being nice and issues a single write call for that
single printf. Since awk afaik does not give such a guarantee, it is
better to have each parallel invocation write to a separate file and
then cat them together after all the awk runs are done.
-Otto
Hello Christian and Otto,
Thank you for setting me straight. The block vs line buffering issue should have been obvious to me. What got me confused was that this solution worked well for a long time - until it didn't. One would assume that it would consistently mangle output...
Buffering issues depend on (the size of) the data being written. I
think it is pretty consistent: if the bug appears, it always does so
in the same way.
Post by Jordan Geoghegan
While fflush does seem to fix the issue, I wanted to explore your suggestion, Otto, of writing to a temporary file from within awk.
BEGIN{ cmd = "mktemp -q /tmp/workdir/tmp.XXXXXXX" ; if( ( cmd | getline result ) > 0 ) TMPFILE = result ; else exit 1 }
Unless I'm missing something obvious, it seems there is no way to capture both the stdout and the return code of an external command from within awk. My workaround for error-checking the call to mktemp here is to abort if mktemp returns no data. Is this sane?
Regards,
Jordan
I think that would work, but maybe it is nicer to wrap the code in a
shell script that generates the tmp file names, passes the names to
awk, and then does the catting of the result files in the shell
script? To run the cat command you need to know the names of the
files anyway.
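
Something along these lines (untested sketch; I use a plain for
loop with & instead of xargs -P, and the file names are just
placeholders):

    #!/bin/sh
    # The shell owns all the temp file names, so it can cat the
    # results together once every parallel awk run has finished.
    workdir=$(mktemp -d /tmp/workdir.XXXXXXX) || exit 1

    for f in ./[12].txt; do
        out=$(mktemp -q "$workdir/out.XXXXXXX") || exit 1
        awk -F '|' 'NR==FNR { A[$1]=1 ; next }
            $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
            cc.txt "$f" > "$out" &
    done
    wait

    cat "$workdir"/out.*
    rm -r "$workdir"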

-Otto
