Performance comparison of a fork-heavy bash script on WSL vs native Linux

I was doing a simple conversion between data formats in the quickest way I could: some sed, tr, awk and good ol’ bash scripting.

While doing the conversion I was baffled: the nigh one-liner was taking forever on my local machine, running in a Windows Subsystem for Linux (WSL) bash prompt.

I mean the data itself is not that fat…

Data I am working with

$ wc -l rawdata

	2634 rawdata

$ head -n2 rawdata finaldata.csv

	==> rawdata <==
	2017,01,custAAA,white:272,green:36,
	2017,01,custABC,white:61,green:5,yellow:2,

	==> finaldata.csv <==
	2017,01,custAAA,272,36,0,0
	2017,01,custABC,61,5,2,0

Data conversion script: convert.sh

#!/bin/bash

# Read each raw CSV line from stdin and emit year,month,customer followed
# by the count for each colour, defaulting to 0 when a colour is missing.
while read line; do
	# $(...) forks a subshell (plus cut) just to grab the first three fields
	new_line=$(echo -n $line | cut -f1-3 -d',')
	echo -n $new_line
	for prio in white green yellow red; do
		# More forks per colour: a subshell and a grep that extracts the
		# count after "colour:", overwriting $prio with the number
		# (or 0 when the colour is absent from the line)
		prio=$(echo $line | grep -Po "$prio:\K(\d+)" || echo 0)
		echo -n ",$prio"
	done
	echo
done

My gut told me that although this might not be the optimal, best or even a good way to do the conversion, it should still be sufficient to get the job done given the small size of the data.

So something was amiss. From past experience I knew that excessive fork()ing - i.e. every $(stuff) here spawning a new subprocess - can hurt performance significantly, especially when done inside nested loops.
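To put that in perspective: each input line forks a subshell for cut and then a subshell plus a grep for each of the four colours, so on the order of a dozen processes per line, times 2634 lines. Just as an illustration, here is a minimal sketch of a fork-free rewrite (my own, not the script being benchmarked) that relies only on bash builtins - read, [[ =~ ]] and printf - and assumes the field layout shown above:

#!/bin/bash

# Fork-free sketch: bash's builtin regex match replaces cut and grep,
# so no subshells or external processes are spawned per input line.
while IFS=, read -r year month cust rest; do
	printf '%s,%s,%s' "$year" "$month" "$cust"
	for color in white green yellow red; do
		if [[ $rest =~ $color:([0-9]+) ]]; then
			printf ',%s' "${BASH_REMATCH[1]}"
		else
			printf ',0'
		fi
	done
	printf '\n'
done

The same pipe-into-the-script invocation works, and since everything stays inside the one bash process, the per-line fork overhead - the very thing under suspicion here - disappears.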

I also had a lingering doubt about how well fork() had been implemented in WSL. Earlier compatibility layers like Cygwin had no real fork() to call; they had to emulate it on top of Windows process creation, and were thus far less performant than native forks on a native Linux kernel.

So like any proper geek, instead of focusing on the task at hand - the reason I was converting the data in the first place - I got derailed down this rabbit hole, which obviously required me to do some benchmarking of WSL vs native Linux.
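A quick way to isolate the fork cost from everything else the script does is to time a few thousand bare subshells, since ( : ) does nothing but fork. Something along these lines (my suggestion, not a command from the original test) should expose the difference directly:

$ time for i in {1..1000}; do ( : ); done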

Native Linux:

I uploaded my script and data to a nearby Linux server. The server is no insane number-cruncher with massive amounts of cores or memory, just a regular utility drone with roughly the same hardware specs as my laptop.

$ time cat rawdata | bash convert.sh > finaldata.csv

	real    0m19.365s
	user    0m1.300s
	sys     0m16.850s

This is right in the ballpark I was picturing given the size of the data and the (lack of) complexity of the conversion. Note that almost all of the time is sys time: the kernel is busy servicing all those forks rather than running the script’s own logic. So why did it freeze up on WSL?

WSL:

As it turns out, it didn’t freeze. I was just too trigger-happy with my Ctrl-C interrupt…

$ time cat rawdata | bash convert.sh > finaldata.csv

	real    6m17.960s
	user    0m18.172s
	sys     7m6.125s

Holy Terra be blessed, that is long o_____o

That is roughly a 19x slowdown compared to the native Linux run - and once again the sys time dominates, pointing squarely at the cost of the forks.

Updated: