Discussion:
Another potential awk or xargs bug?
Jordan Geoghegan
2021-04-15 05:19:10 UTC
Hello,

I've found some very interesting behaviour when subjecting various awk implementations to some rather specific circumstances.

I'm basically looking for a sanity check here to confirm if I'm just wildly flailing, or if I am indeed onto something here.

Here's my situation:

When parsing some RIR data in parallel using awk with xargs, I seem to have found a way to reliably lose and/or mangle output with parallel xargs. My google-fu seems to be failing me. I understand that xargs does not buffer output and that lines may arrive out of order, but in this case I am reliably and reproducibly losing data and receiving mangled output. But wait, it gets stranger.

I don't want to lose you guys here with a long-winded explanation, so I'm going to show you a diff that demonstrates the reproducibly mangled output when using xargs in parallel mode:

--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
@@ -1,5 +1,3 @@
-267386
-A264890
 AS262399
 AS262400
 AS262401
@@ -1774,6 +1772,7 @@
 AS264887
 AS264888
 AS264889
+AS264890
 AS264891
 AS264892
 AS264893
@@ -3552,6 +3551,7 @@
 AS267383
 AS267384
 AS267385
+AS267386
 AS267387
 AS267388
 AS267389
@@ -4220,6 +4220,7 @@
 AS268318
 AS268319
 AS268320
+AS268320
 AS268321
 AS268321
 AS268323
@@ -7785,6 +7786,7 @@
 AS270633
 AS270633
 AS270634
+AS270634
 AS270635
 AS270635
 AS270636
@@ -10277,5 +10279,3 @@
 AS46210
 AS46280
 AS46280
-ASAS268320
-ASS270634

The only thing that changed between these runs was me using either xargs -P 1 or -P 2.

To allow folks to follow along with me at home, I've included the two files (gzipped for politeness) I used to trigger this behaviour.

Once you've extracted the attached text files into your working directory, here's a snippet that should reproduce my issue:

$ printf 'BR\nCA\n' > cc.txt

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- awk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt

What does this one-liner do? It slurps the country codes specified in cc.txt into an array, then checks the first field of each row of the RIR data against that array. If the first field matches a country code in the array and the second field indicates that the row is an ASN record, then we print the third field prefixed with 'AS'. As you can see, if you grep the output of the above command for the strings "ASAS", "ASS" or "A2", you should see some mangled ASNs. If you change "-P 2" to "-P 1", this mangling will not occur.
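
For anyone who'd rather not squint at the one-liner, here's the same
awk program laid out readably (functionally identical):

    # First file (cc.txt): remember each country code in the array A
    NR == FNR { A[$1] = 1 ; next }

    # Remaining input (the RIR data): if the country code is in A and
    # the record is an ASN entry, print field 3 prefixed with "AS"
    $1 in A && $2 == "asn" { printf("AS%s\n", $3) }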

Here's where things get very weird. While parsing this data (as part of a larger dataset comprising an aggregation of all the registrar delegation statistics) I've been using this snippet for a while to quickly fetch ASN records. It is not until I have BOTH the BR and CA country codes in the array that I can trigger this bug. I can have any number of country codes in the array, but if Brazil AND Canada happen to be specified in the array, then I get mangled output, but ONLY if executed with parallel xargs. This reproducibly happens when using awk, gawk or mawk. To further melt your brain, this behaviour has NOT been observed when using goawk, a POSIX compliant awk implementation written in go.

Just to prove my point, here's me comparing the hashed outputs of various awk implementations using the above one-liner:

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- awk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    2a20f44ce6a23d5c49b05b9f2689ef93

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- awk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    9ab3dbfbff5746f059cdb35221ff73b1
---
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- mawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    2a20f44ce6a23d5c49b05b9f2689ef93

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- mawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    9ab3dbfbff5746f059cdb35221ff73b1
---
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- ~/go/bin/goawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    9ab3dbfbff5746f059cdb35221ff73b1

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- ~/go/bin/goawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
    9ab3dbfbff5746f059cdb35221ff73b1

I've racked my brain and scoured the internet for hours; I've tested and toiled, and I'm left thoroughly perplexed. I now humbly ask the fine folks here in OpenBSD Land for guidance, insight or suggestions.

As always, is this a bug, or am I holding it wrong?

Regards,

Jordan
Christian Weisgerber
2021-04-15 14:29:17 UTC
--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
I'll note that no characters have been lost between the two files.
Only the order is different.
The only thing that changed between these runs was me using either xargs -P 1 or -P 2.
What do you expect? You run two processes in parallel that write
to the same file. Obviously their output will be interspersed in
unpredictable order.

You seem to imagine that awk's output is line-buffered. But when
it writes to a pipe or file, its output is block-buffered. This
is default stdio behavior. Output is written in block-size increments
(16 kB in practice) without regard to lines. So, yes, you can end
up with a fragment from a line written by process #1, followed by
lines from process #2, followed by the remainder of the line from
#1, etc.
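
You can watch this happen in isolation. Rough sketch (the exact
count varies per run, since it depends on buffer sizes and
scheduling): two awk processes write short lines into the same
pipe, and we count the torn lines containing fragments from both
writers:

$ ( awk 'BEGIN { for (i = 0; i < 200000; i++) print "aaa" i }' &
    awk 'BEGIN { for (i = 0; i < 200000; i++) print "bbb" i }' &
    wait ) | grep -Ec 'aaa[0-9]+bbb|bbb[0-9]+aaa'

Any non-zero count means a buffer flush from one process landed in
the middle of a line from the other.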
--
Christian "naddy" Weisgerber ***@mips.inka.de
Otto Moerbeek
2021-04-15 14:49:14 UTC
Post by Christian Weisgerber
--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
I'll note that no characters have been lost between the two files.
Only the order is different.
The only thing that changed between these runs was me using either xargs -P 1 or -P 2.
What do you expect? You run two processes in parallel that write
to the same file. Obviously their output will be interspersed in
unpredictable order.
You seem to imagine that awk's output is line-buffered. But when
it writes to a pipe or file, its output is block-buffered. This
is default stdio behavior. Output is written in block-size increments
(16 kB in practice) without regard to lines. So, yes, you can end
up with a fragment from a line written by process #1, followed by
lines from process #2, followed by the remainder of the line from
#1, etc.
--
Right, a fflush() call after the printf makes the issue go away, but
only since awk is being nice and issues a single write call for that
single printf. Since awk afaik does not give such a guarantee, it is
better to have each parallel invocation write to a separate file and
then cat them together after all the awk runs are done.
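
For completeness, the fflush() variant of the one-liner would look
like this (again, only safe as long as each flushed line goes out
in a single write):

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- \
    awk -F '|' 'NR==FNR { A[$1]=1 ; next }
    $1 in A && $2 == "asn" { printf("AS%s\n", $3) ; fflush() }' cc.txt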

-Otto
Jordan Geoghegan
2021-04-16 09:26:27 UTC
Post by Otto Moerbeek
Post by Christian Weisgerber
--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
I'll note that no characters have been lost between the two files.
Only the order is different.
The only thing that changed between these runs was me using either xargs -P 1 or -P 2.
What do you expect? You run two processes in parallel that write
to the same file. Obviously their output will be interspersed in
unpredictable order.
You seem to imagine that awk's output is line-buffered. But when
it writes to a pipe or file, its output is block-buffered. This
is default stdio behavior. Output is written in block-size increments
(16 kB in practice) without regard to lines. So, yes, you can end
up with a fragment from a line written by process #1, followed by
lines from process #2, followed by the remainder of the line from
#1, etc.
--
Right, a fflush() call after the printf makes the issue go away, but
only since awk is being nice and issues a single write call for that
single printf. Since awk afaik does not give such a guarantee, it is
better to have each parallel invocation write to a separate file and
then cat them together after all the awk runs are done.
-Otto
Hello Christian and Otto,

Thank you for setting me straight. The block vs line buffering issue should have been obvious to me. What got me confused was that this solution worked well for a long time - until it didn't. One would assume that it would consistently mangle output...

While fflush does seem to fix the issue, I wanted to explore your suggestion, Otto, of writing to a temporary file from within awk.

Is something like the following a sane approach to safely generating temporary files from within awk?

BEGIN{ cmd = "mktemp -q /tmp/workdir/tmp.XXXXXXX" ; if( ( cmd | getline result ) > 0 ) TMPFILE = result ; else exit 1 }

Unless I'm missing something obvious, it seems there is no way to capture both the stdout and the return code of an external command from within awk. My workaround for error-checking the call to mktemp here is to abort if mktemp returns no data. Is this sane?
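
For context, here's the full sketch of what I have in mind: the
mktemp BEGIN block from above plus redirecting the output to the
temp file (the redirection part is untested on my end):

    BEGIN {
        cmd = "mktemp -q /tmp/workdir/tmp.XXXXXXX"
        if ((cmd | getline result) > 0)
            TMPFILE = result
        else
            exit 1
        close(cmd)
    }
    NR == FNR { A[$1] = 1 ; next }
    $1 in A && $2 == "asn" { printf("AS%s\n", $3) > TMPFILE }
    END { close(TMPFILE) }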

Regards,

Jordan
Otto Moerbeek
2021-04-16 09:55:56 UTC
Post by Jordan Geoghegan
Post by Otto Moerbeek
Post by Christian Weisgerber
--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
I'll note that no characters have been lost between the two files.
Only the order is different.
The only thing that changed between these runs was me using either xargs -P 1 or -P 2.
What do you expect? You run two processes in parallel that write
to the same file. Obviously their output will be interspersed in
unpredictable order.
You seem to imagine that awk's output is line-buffered. But when
it writes to a pipe or file, its output is block-buffered. This
is default stdio behavior. Output is written in block-size increments
(16 kB in practice) without regard to lines. So, yes, you can end
up with a fragment from a line written by process #1, followed by
lines from process #2, followed by the remainder of the line from
#1, etc.
--
Right, a fflush() call after the printf makes the issue go away, but
only since awk is being nice and issues a single write call for that
single printf. Since awk afaik does not give such a guarantee, it is
better to have each parallel invocation write to a separate file and
then cat them together after all the awk runs are done.
-Otto
Hello Christian and Otto,
Thank you for setting me straight. The block vs line buffering issue should have been obvious to me. What got me confused was that this solution worked well for a long time - until it didn't. One would assume that it would consistently mangle output...
Buffering issues depend on (the size of) the data being written. I
think it is pretty consistent: if the bug appears, it always does so
in the same way.
Post by Jordan Geoghegan
While fflush does seem to fix the issue, I wanted to explore your suggestion, Otto, of writing to a temporary file from within awk.
BEGIN{ cmd = "mktemp -q /tmp/workdir/tmp.XXXXXXX" ; if( ( cmd | getline result ) > 0 ) TMPFILE = result ; else exit 1 }
Unless I'm missing something obvious, it seems there is no way to capture both the stdout and the return code of an external command from within awk. My workaround for error-checking the call to mktemp here is to abort if mktemp returns no data. Is this sane?
Regards,
Jordan
I think that would work, but maybe it is nicer to wrap the code in a
shell script that generates the tmp file names, passes the names to
awk, and then does the catting of the result files in the shell
script? To run the cat command you need to know the names of the
files anyway.
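
Something along these lines (untested sketch; I use a plain for
loop with & instead of xargs -P, and the file names are just
placeholders):

    #!/bin/sh
    # The shell owns all the temp file names, so it can cat the
    # results together once every parallel awk run has finished.
    workdir=$(mktemp -d /tmp/workdir.XXXXXXX) || exit 1

    for f in ./[12].txt; do
        out=$(mktemp -q "$workdir/out.XXXXXXX") || exit 1
        awk -F '|' 'NR==FNR { A[$1]=1 ; next }
            $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
            cc.txt "$f" > "$out" &
    done
    wait

    cat "$workdir"/out.*
    rm -r "$workdir"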

-Otto
