Data science and unix command line

· 4 min read

Note: This article applies only to those who code. I have seen many people struggling with MS Excel when dealing with large CSV files. I don’t blame them, because most people ignore standard Unix command-line tools simply because they are unfamiliar with them.

MS Excel for large dataset

When the data is large (anything above 0.5 GB) and we are trying to do something as simple as reading the column names of a CSV, MS Excel will get stuck with a “Not Responding” dialog. And if we script using Python/R/Perl, we spend time writing the script and more time waiting for it to finish.

Assume that we have a CSV “car.csv” as shown below:

"","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4

Let’s assume that we have a large CSV with many columns and we are interested in the sum of the 12th column (carb).

cat car.csv | awk -F "," '{ sum += $12 } END { printf "%.2f\n", sum}'
# 29.00

Here we use the standard command-line tools cat and awk without loading the entire data into memory.
The above line says:

  1. Using cat, stream the contents into stdout.
  2. Pipe the stdout from cat to the next command awk.
  3. With Awk:
    • -F parameter takes the delimiter used in the CSV.
    • Once you specify the -F delimiter, all fields are available as $1, $2, ....
    • sum += $12 increments the variable sum with the 12th column of each line.
    • Use printf to format the result.

Other useful commands:

Head and tail
To get a sample of a large CSV, use head or tail. Neither loads the entire file into memory; they read line by line.

# To get the first 3 lines of your csv 
head car.csv -n 3
# "","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4

# Similarly for the last 5 lines 
tail car.csv -n 3 
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1

You can even save the sample to a separate file to view in an editor or run a script over it. Using the redirect operator > in bash, you can redirect the output to a file.

# first 100 lines into a new file first100.txt
head -n 100 big.csv > first100.txt

Word count
wc gives the number of lines, words, and bytes in your file:

# number of lines, words and bytes
wc car.csv
#  11  25 602 car.csv

# to only output number of lines 
wc -l car.csv
# 11

Grep
To search for lines containing some text like Datsun, use grep. Grep supports regular expressions and has a lot of options.

# Search for text Datsun 
grep Datsun car.csv
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1

# subset data with only lines containing Datsun
grep Datsun car.csv > datsun.csv

Cut
Cut splits a line into fields according to a given delimiter, number of characters, or pattern.

# To cut your csv and show only first and third column
cut -d "," -f 1,3 car.csv
# "","cyl"
# "Mazda RX4",6
# "Mazda RX4 Wag",6
# "Datsun 710",4
# "Hornet 4 Drive",6
# "Hornet Sportabout",8
# "Mazda RX4",6
# "Mazda RX4 Wag",6
# "Mazda RX4",6
# "Mazda RX4 Wag",6
# "Datsun 710",4

# You can cut by specifying start/end/number of characters 
# To print only first 3 characters 
cut -c-3 car.csv
# "","m
# "Mazd
# "Mazd
# "Dats
# "Horn
# "Horn
# "Mazd
# "Mazd
# "Mazd
# "Mazd
# "Dats

Sed and Awk
These two commands are more of a programming language than simple commands. Sed and AWK, combined with regular expressions, let you do most text file manipulations.

# Replace all occurrences of "Mazda" with "Maa" in your data.
sed s/Mazda/Maa/g car.csv
# "","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
# "Maa RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Maa RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# "Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
# "Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
# "Maa RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Maa RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Maa RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Maa RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1

Sed can be used to clean your dataset. Say we want to delete all lines containing Mazda:

# Delete all the lines containing Mazda 
sed /Mazda/d car.csv > noMazda.csv

Read more about Sed and Awk.

Sort and uniq
The sort command sorts the CSV treating each line as a string. You can also sort by column, in numeric or reverse order. The uniq command returns only unique rows, or just the duplicated ones.
Say you want to sort a CSV by column 12 (carb). The -k (key) flag specifies the column to sort by, -nr sorts in reverse numeric order, and -t specifies the column delimiter.

# if you want to sort the csv according to carb(Column 12)
sort -k 12 -nr -t "," car.csv
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
# "Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# "","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"

Use uniq to output only unique rows. The -c flag prepends the occurrence count to each row.

sort car.csv | uniq -c
# 2 "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# 1 "Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
# 1 "Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
# 3 "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# 3 "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
# 1 "","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"

To see only the duplicate lines, pass a -d flag.

sort car.csv | uniq -d
# "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
# "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
# "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4

To get the total number of unique lines, pipe uniq output to wc.

sort car.csv | uniq | wc -l
6

GNU split
Sometimes you just want to split the file into n smaller parts and run your script over each part.

#Split the csv into part files of 100 lines
split big.csv --lines 100

The file will be split into part files (xaa, xab, …) in the same folder.

GNU Plot
GNU Plot is an awesome tool for plotting. Read more.

Working with Unix command-line tools is easier than you might think; all you have to do is try them. There are many custom tools to ease your workflow, like jsonToCsv, GNU Plot, etc. High-level scripting languages can do a lot, and I’m a big fan of Python, but Unix command-line tools are often much easier and more reliable. R is a pretty good tool too, but when the data gets large, use the command line to trim and tidy it for R.
And you don’t have to remember all these parameters - just use the man page!

Links:


unix command linetext manipulationdata sciencedata science large fileslarge dataset