1brc in AWK (< 6 minutes 😁 ) #171
Replies: 11 comments 20 replies
-
I also wrote a slightly slower version that should be more robust against overflow for pathological input. The difference is that it calculates a running mean rather than keeping a sum of all the measurements. Here it is:
time mawk -F';' '
{
if (!c[$1]) {
c[$1] = 1;
min[$1] = mean[$1] = Max[$1] = $2;
} else {
c[$1]++;
mean[$1] += ($2 - mean[$1]) / c[$1];
if ($2 < min[$1]) min[$1] = $2;
else if ($2 > Max[$1]) Max[$1] = $2;
}
}
END {
for (i in c)
printf "%s = %.1f/%.1f/%.1f\n", i, min[i], mean[i], Max[i] | "sort"
}
' measurements.txt
This adds about a minute to the run-time:
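As a quick sanity check, the incremental-mean update can be compared against the naive sum-then-divide approach on a tiny made-up sample (station name and values are hypothetical):

```shell
# Sketch: verify the running-mean update against a plain sum on toy data.
printf 'Oslo;10.0\nOslo;20.0\nOslo;30.0\n' |
awk -F';' '
{
    c[$1]++
    mean[$1] += ($2 - mean[$1]) / c[$1]   # incremental mean, no big sum kept
    sum[$1]  += $2                        # naive sum, for comparison only
}
END {
    for (i in c)
        printf "%s running=%.1f naive=%.1f\n", i, mean[i], sum[i] / c[i]
}'
# prints: Oslo running=20.0 naive=20.0
```

The incremental form keeps mean[$1] near the magnitude of the data, so it avoids the huge intermediate sums that could lose precision (or overflow in less forgiving languages).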
-
I also like using AWK for data wrangling. I also wrote a 1BRC solution in it 2 days ago, but:
On the other hand, making more people aware of AWK is always a good thing, so thank you for doing it! It seems your solution does not strictly conform to the format requested in the original blog post, seemingly to be able to leverage an external sorting utility (sort). So it's not truly a standalone AWK solution, but it's still a POSIX-y solution, so that's good. My solution:
#!/usr/bin/awk -f
BEGIN { FS=";"; }
{
tot[$1] += $2;
if (!cnt[$1] || min[$1] > $2)
min[$1] = $2;
if (!cnt[$1] || max[$1] < $2)
max[$1] = $2;
cnt[$1]++;
}
END {
n = asorti(cnt, key);
printf("{");
i=1;
printf("%s=%.1f/%.1f/%.1f", key[i], min[key[i]], tot[key[i]]/cnt[key[i]], max[key[i]]);
for (i=2; i<=n; i++) {
printf(", %s=%.1f/%.1f/%.1f", key[i], min[key[i]], tot[key[i]]/cnt[key[i]], max[key[i]]);
}
printf("}\n");
}
Tested on MSYS2 under Windows 11.
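Since asorti is a gawk extension, a portable variant can sort the keys inside END with a small insertion sort instead. A sketch on toy data (station names are made up):

```shell
printf 'Reykjavik;1.5\nAthens;24.0\nLima;18.0\n' |
awk -F';' '
{ last[$1] = $2 }
END {
    # POSIX awk has no asorti: collect the keys, then insertion-sort them
    n = 0
    for (k in last) keys[++n] = k
    for (i = 2; i <= n; i++) {
        v = keys[i]
        for (j = i - 1; j >= 1 && keys[j] > v; j--) keys[j+1] = keys[j]
        keys[j+1] = v
    }
    for (i = 1; i <= n; i++) printf "%s=%s\n", keys[i], last[keys[i]]
}'
# prints Athens=24.0, then Lima=18.0, then Reykjavik=1.5
```

That keeps the whole pipeline in one awk process, at the cost of a few extra lines.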
@stig, can you share the HW and SW setup used for testing your script?
-
Yeah, I went for an idiomatic AWK solution rather than contorting myself to use only AWK. The latter would not be much closer to conforming to the original challenge, since that was to use Java. 😄
-
Testing results for frawk, gawk, and the baseline Java program on my MacBook Pro 14 (Apple M1 Pro, 16 GB). Source:
BEGIN {
FS = ";"
}
{
value = 0.0 + $2
if ($1 in Count) {
Count[$1] += 1;
Total[$1] += value;
if (value < Min[$1]) Min[$1] = value;
else if (value > Max[$1]) Max[$1] = value;
} else {
Count[$1] = 1;
Total[$1] = value;
Min[$1] = value;
Max[$1] = value;
}
}
END {
for (i in Count) {
printf "%s = %.1f/%.1f/%.1f\n", i, Min[i], (Total[i]/Count[i]), Max[i] | "sort"
}
}
baseline Java:
frawk:
gawk:
-
It's either 6 minutes with 20 lines of AWK written in five minutes, or 6 seconds with 300 lines of Java written in five days. Classic "xkcd: Is It Worth the Time?".
-
My version is even shorter (still optimized for readability):
!min[$1] || $2 < min[$1] { min[$1] = $2 }
!max[$1] || $2 > max[$1] { max[$1] = $2 }
{ sum[$1] += $2; count[$1]++ }
END {
for (city in min)
printf("%30s %5.1f / %5.1f / %5.1f\n", city, min[city], sum[city] / count[city], max[city])
}
Sample output:
Results on my Intel i7-14700K:
-
Awk is somewhat at a disadvantage performance-wise because there's no builtin minimum function. x86 and AArch64 both have a single instruction that takes the minimum of two floats.
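For illustration, the closest awk gets is a user-defined helper, which costs an interpreted function call per record rather than one hardware instruction (a sketch; the function names are mine):

```shell
printf '3.0\n-1.5\n2.0\n' |
awk '
function min(a, b) { return a < b ? a : b }
function max(a, b) { return a > b ? a : b }
NR == 1 { lo = hi = $1; next }   # seed both extremes with the first value
{ lo = min(lo, $1); hi = max(hi, $1) }
END { printf "%.1f %.1f\n", lo, hi }'
# prints: -1.5 3.0
```

This is why most solutions in the thread inline the comparisons instead of factoring them into functions.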
-
@stig and @gunnarmorling - Howdy. With respect, the proposed mawk solution is incomplete:
While it's fast, it's incorrect. Here's a proposed alternative in gawk that's only 14% slower but addresses all the challenge requirements. A POSIX awk/mawk implementation with sorting would take a lot longer to write and execute. This implementation took only a few minutes to code (and a lot of minutes testing). Thoughts?
#!/usr/local/bin/gawk -f
BEGIN {
FS = ";";
}
{
temp[$1] += $2;
if (!count[$1]++)
min[$1] = max[$1] = $2;
else {
if ($2 < min[$1]) min[$1] = $2;
if ($2 > max[$1]) max[$1] = $2;
}
}
END {
PROCINFO["sorted_in"] = "@ind_str_asc";
printf("{");
for (station in temp)
printf("%s=%3.1f/%3.1f/%3.1f,", station, min[station], temp[station]/count[station], max[station]);
printf("\b}");
}
The main processing block is the only place where meaningful optimizations can happen. I tried various buffering strategies (e.g. …). Thoughts? Have an awesome weekend.
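One note on the printf("\b}") trick above: the backspace only erases the trailing comma on a terminal; when output is redirected to a file, the \b byte is written literally. A separator variable sidesteps this. A minimal sketch with made-up data:

```shell
printf 'b;2.0\na;1.0\n' |
awk -F';' '
{ t[$1] = $2 }
END {
    sep = ""
    printf "{"
    for (k in t) {                 # iteration order is unspecified here
        printf "%s%s=%s", sep, k, t[k]
        sep = ", "                 # separator only appears before later items
    }
    printf "}\n"
}'
```

This is the same trick the sort-wrapper solution further down the thread uses with its `s` variable.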
-
This thing is embarrassingly parallel, so I looked to see if I could use GNU parallel as well. I'm not really interested in raw performance but more in the delta of performance we can grab with very little change to the code. I did a good old map/reduce/re-reduce:
#!/usr/bin/awk -f
BEGIN {FS=";"}
!min[$1] || $2 < min[$1] { min[$1] = $2 }
!max[$1] || $2 > max[$1] { max[$1] = $2 }
{ sum[$1] += $2; count[$1]++ }
END {
for (city in min)
printf("%30s;%5.1f;%5.1f;%5d;%5.1f\n", city, min[city], sum[city], count[city], max[city])
}
(Thanks @frapa !)
#!/usr/bin/awk -f
BEGIN {FS=";"}
!min[$1] || $2 < min[$1] { min[$1] = $2 }
!max[$1] || $5 > max[$1] { max[$1] = $5 }
{ sum[$1] += $3; count[$1] += $4 }   # merge the partial sums and counts from the map output
END {
for (city in min)
printf("%30s / %5.1f / %5.1f / %5.1f\n", city, min[city], sum[city]/count[city], max[city])
}
The glue:
parallel --pipe-part -j 8 --round-robin --arg-file $1 ./map.awk | ./reduce.awk
The results on my i5-1145G7 @ 2.60GHz, completely unscientific, measured while doing other things on the same machine:
Initial awk:
Parallelized awk:
A 2x speed improvement is really interesting, but to me the more interesting thing is that there is far more time spent in
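To see why the reduce step must merge the partial sums and counts (fields 3 and 4 of the map output) rather than re-accumulate raw values, here is the merge applied to two hypothetical per-chunk lines for one station:

```shell
# Two map-output lines for the same station, format: city;min;sum;count;max
printf 'Oslo;1.0;30.0;3;15.0\nOslo;-2.0;10.0;2;9.0\n' |
awk -F';' '
{
    if (!cnt[$1] || $2 < min[$1]) min[$1] = $2
    if (!cnt[$1] || $5 > max[$1]) max[$1] = $5
    sum[$1] += $3        # merge partial sums
    cnt[$1] += $4        # merge partial counts
}
END {
    for (c in cnt)
        printf "%s %.1f/%.1f/%.1f\n", c, min[c], sum[c]/cnt[c], max[c]
}'
# prints: Oslo -2.0/8.0/15.0
```

The mean comes out as (30 + 10) / (3 + 2) = 8.0; averaging the two chunk means directly would only be correct if both chunks held the same number of records.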
-
Mine is pretty much the same:
BEGIN {
FS = ";"
CONVFMT = "%.1f"
}
{
sum[$1] += $2
if (++c[$1] == 1)
min[$1] = max[$1] = $2
else if ($2 < min[$1])
min[$1] = $2
else if ($2 > max[$1])
max[$1] = $2
}
END {
fmt = "awk 'BEGIN{printf \"{\"}{printf \"%s%s\",s,$0;s=\", \"}END{print \"}\"}'"
for (i in c)
print i "=" min[i] "/" sum[i]/c[i] "/" max[i] | "sort | " fmt
}
Although it prints the output in this format
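The CONVFMT = "%.1f" line is what rounds the values here: it controls awk's number-to-string conversion, which kicks in when the numbers are concatenated in the print statement. A minimal sketch:

```shell
# Concatenating a number with a string converts it via CONVFMT.
awk 'BEGIN { CONVFMT = "%.1f"; x = 10 / 3; print "mean=" x }'
# prints: mean=3.3
```

(Printing x on its own would use OFMT instead; it is the concatenation that triggers CONVFMT.)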
-
Hi!
The original script on a quite old quad-core i5-6500:
This:
PS: You don't need the extra || check for whether the array entry exists or not. If it doesn't, the first number will always be larger than a non-existent field.
-
AWK is my go-to data munging tool, so I was curious to see how a naive solution in AWK would fare. Pretty well I would say!
Source:
Running against the full 1 billion row data set on this machine (2021 M1 Max, 32 GB RAM) took:
For comparison, the baseline Java program on my machine took:
PS: I tried awk, gawk, and mawk: the last was the clear winner, with gawk second.