Reading files line by line in Bash – performance

I’m processing stored HL7 message files with a bash script, and have been getting pretty poor performance (around one message per second for simple parsing, processing and output to CSV format). I thought I might be able to improve performance by using a different method of loading the text file.

The two methods of reading a file line by line that I compared:

bashreadtest1.sh:
#!/bin/bash
while read line ; do
    echo "$line"
done < "$1"

bashreadtest2.sh:
#!/bin/bash
IFS=$'\n'
LINES=(`cat $1`)
for i in ${LINES[@]} ; do
    echo "$i"
done

The test:

phil@mig1:~/$ time ./bashreadtest1.sh 20120821_prjBsqrPacsIn.dat | wc -l
12706

real 0m2.708s
user 0m2.492s
sys 0m0.096s

phil@mig1:~/$ time ./bashreadtest2.sh 20120821_prjBsqrPacsIn.dat | wc -l
12706

real 0m1.096s
user 0m0.920s
sys 0m0.116s

The verdict: reading the file into an array and iterating through the array is roughly 2.5 times as fast (about 1.1 s versus 2.7 s of real time for 12,706 lines).

(I ran the test multiple times and the results were fairly consistent. My script was already using the second, faster method, so I’ll need to look elsewhere for performance improvements.)
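
One caveat about both test scripts: read without -r treats backslashes as escapes (and HL7 headers normally contain one in the MSH encoding characters), while the backtick array drops empty lines and expands any glob characters in the data. A minimal sketch of stricter variants follows, assuming bash 4 or later for mapfile. The function names read_loop and read_array are made up for illustration, and I have not timed these against the two scripts above.

```bash
#!/bin/bash
# Sketch only: stricter variants of the two reading methods.
# read_loop and read_array are hypothetical names, not part of the tested scripts.

# Method 1 variant: IFS= keeps leading/trailing whitespace, -r keeps backslashes
read_loop () {
    local line
    while IFS= read -r line ; do
        printf '%s\n' "$line"
    done < "$1"
}

# Method 2 variant: mapfile (bash 4+) loads the file into an array,
# one element per line, without word splitting or glob expansion
read_array () {
    local -a lines
    local line
    # -t strips the trailing newline from each element
    mapfile -t lines < "$1"
    for line in "${lines[@]}" ; do
        printf '%s\n' "$line"
    done
}

# Run the array variant only when a file name is supplied
if [ -n "${1:-}" ] ; then
    read_array "$1"
fi
```

Both functions should reproduce the input exactly, including blank lines, which makes them safer to compare with time than the originals if the data ever contains backslashes or wildcard characters.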

Using find in a bash script

Here’s something that will hopefully save someone the frustration that I experienced getting this working. I am writing a script that needs to traverse a directory structure, then operate on the files it finds (read DICOM tags, modify/add tags). The reading and modifying was simple enough, but I could not get find to work inside the script, so I basically resorted to running a command like:

$ find . -type f -exec fileops.sh {} \;

That worked fine, but I really wanted to have it all integrated into a single command. The problem I had was that the output of a find command couldn’t be used to iterate in a for loop – as a scalar (FILES=`find . -type f`) it was a single string, and as an array (FILES=(`find . -type f`)) elements would be split on spaces and newlines, which breaks if your paths have spaces in them.
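
To make the splitting problem concrete, here is a small demo with the default IFS. The temporary directory and the file name "scan one.dcm" are invented for illustration.

```bash
#!/bin/bash
# Demo: with the default IFS, one path containing a space becomes two elements.
# The directory and file name here are made up for the example.
demo_dir=$(mktemp -d)
touch "$demo_dir/scan one.dcm"

# find runs inside the temp directory so the only space is the one in the file name
FILES=(`cd "$demo_dir" && find . -type f`)

echo "${#FILES[@]}"            # prints 2, although only one file exists
printf '[%s]\n' "${FILES[@]}"  # prints [./scan] then [one.dcm]

rm -rf "$demo_dir"
```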

The trick here is to change the IFS variable, which tells bash what to split strings on (by default it is space, tab or newline). In my case, the output of find is separated by newline characters, so setting IFS to a newline only makes the array split on path boundaries and nothing else. Now it works as expected. Yay! Here’s the code:

#!/bin/bash
# IFS controls how bash splits strings (by default: space, tab and newline).
# Change it to newline only.
OLD_IFS=${IFS}
IFS=$'\n'
# dummy function for demo
fileops () {
    echo "$1"
}
FILES=(`find "$1" -type f`)
for i in "${FILES[@]}" ; do
    fileops "$i"
done
# restore IFS
IFS=${OLD_IFS}
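
A possible alternative, sketched here under the assumption that your find supports -print0 (GNU and BSD find both do), is to have find separate paths with NUL bytes and read them one at a time. This leaves the global IFS untouched and also copes with newlines in file names. The wrapper name process_tree is hypothetical, and fileops is the same dummy function as above.

```bash
#!/bin/bash
# Sketch only: iterate over find results without changing IFS globally.
# process_tree is a hypothetical wrapper name; fileops is a dummy for the demo.
fileops () {
    echo "$1"
}

process_tree () {
    local f
    # -print0 ends each path with a NUL byte and read -d '' splits on it,
    # so spaces and newlines inside paths survive intact.
    # IFS= applies to this read only; -r keeps backslashes literal.
    while IFS= read -r -d '' f ; do
        fileops "$f"
    done < <(find "$1" -type f -print0)
}

# Run only when a directory is supplied
if [ -n "${1:-}" ] ; then
    process_tree "$1"
fi
```

The process substitution (the < <(...) part) keeps the loop in the current shell, so any variables set inside the loop are still visible after it finishes, which would not be the case with find ... | while read.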