Wednesday, September 03, 2014

Useful PowerShell Commands for Data Science

A while back, I ran across an article titled Useful Unix commands for data science.  The article was quite interesting, and as I read it, I grew curious about performing the same operations with Windows PowerShell.  The idea here was not to prove that PowerShell is somehow better.  I just wanted to see whether I could use it to achieve similar results and add another tool to my toolbox of options for working with data.  What I found is that PowerShell is a very capable language with nearly limitless uses, from one-line commands to full scripts that can do just about anything you want (on a Windows machine). 

What follows is a list of “exercises” derived from the commands outlined in the article cited above, along with PowerShell solutions to each.  (Note: I’m going to assume that the reader has at least some familiarity with PowerShell, so I won’t go into a huge amount of detail on the ins and outs of every cmdlet.)  Here they are: 

Exercise 1: Import a delimited file and retrieve the sum of a single column.

In the original article, the author read in a pipe-delimited file and produced the sum of one of its columns.  To do the same thing with PowerShell, use this command:

Import-Csv "C:\Temp2\dataMed.txt" -Delimiter "|" | Measure-Object -Property col4 -Sum

The above command uses the Import-Csv cmdlet, identifying the file and delimiter.  It then pipes the results of this import to the Measure-Object cmdlet, which sums the identified property (column) in the data file.  In this case, the column name is “col4”.  The results look like this:

Count    : 100000
Average  :
Sum      : 5042083
Maximum  :
Minimum  :
Property : col4
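Since Measure-Object returns an object rather than plain text, you can also capture the result in a variable and pull out just the number you need (using the same sample file and column names as above):

```powershell
# Capture the measurement object, then read only its Sum property
$result = Import-Csv "C:\Temp2\dataMed.txt" -Delimiter "|" |
    Measure-Object -Property col4 -Sum
$result.Sum
```

This is handy when the sum feeds into a later calculation instead of being read off the screen.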

You’ll notice that there are several other aggregations we could take advantage of beyond the sum.  In addition, we could perform this same operation on multiple columns at once by using wildcards to identify the properties to act on.  Here is an example that provides the sum, maximum, minimum, and average for every column in the file:

Import-Csv "C:\Temp2\dataMed.txt" -Delimiter "|" | Measure-Object -Property * -Sum -Maximum -Minimum -Average

The results look like this:

Count    : 100000
Average  : 50.68611
Sum      : 5068611
Maximum  : 100
Minimum  : 1
Property : col1

Count    : 100000
Average  : 50.29766
Sum      : 5029766
Maximum  : 100
Minimum  : 1
Property : col2

Count    : 100000
Average  : 50.61111
Sum      : 5061111
Maximum  : 100
Minimum  : 1
Property : col3

Count    : 100000
Average  : 50.42083
Sum      : 5042083
Maximum  : 100
Minimum  : 1
Property : col4

Count    : 100000
Average  : 50.56574
Sum      : 5056574
Maximum  : 100
Minimum  : 1
Property : col5


Exercise 2: Get the first N lines or the last N lines of a given file.

Suppose you’d just like to look at a few lines of a file.  To do that in PowerShell, use this command:

Get-Content "C:\Temp2\dataTiny.txt" | Select-Object -First 3

First, we use Get-Content to read in the file contents (rather than importing and parsing the file, we just get the content as one big chunk of text), then pipe the result to the Select-Object cmdlet.  Here, we select the first 3 rows in the file.  To select the last N (in this case 3) rows, just change the command slightly to:

Get-Content "C:\Temp2\dataTiny.txt" | Select-Object -Last 3
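If you’re on PowerShell 3.0 or later, Get-Content also has -TotalCount and -Tail parameters, which avoid piping every line of the file through Select-Object (handy for large files):

```powershell
# -TotalCount stops reading after the first N lines;
# -Tail (PowerShell 3.0+) reads only the last N lines
Get-Content "C:\Temp2\dataTiny.txt" -TotalCount 3
Get-Content "C:\Temp2\dataTiny.txt" -Tail 3
```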

Exercise 3: Count lines, words, and bytes of a given file.

Here is a command that will count the lines, words, and characters in a given file (you can exclude whitespace from the character count using the -IgnoreWhiteSpace flag):

Get-Content "C:\Temp2\dataMed.txt" | Measure-Object -Line -Word -Character

It will give you a result like this:

Lines             Words          Characters Property          
-----             -----          ---------- --------          
100001            100001         12169841    
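To see the effect of the whitespace flag mentioned above, just add it to the same command; the character count will shrink by the number of whitespace characters in the file:

```powershell
# Same counts as before, but whitespace is excluded from Characters
Get-Content "C:\Temp2\dataMed.txt" | Measure-Object -Line -Word -Character -IgnoreWhiteSpace
```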

To get the number of bytes in a file, use this command:

Get-ChildItem "C:\Temp2\dataMed.txt" | Measure-Object -Property Length -Minimum

Note that we’re measuring the ‘minimum’ size (the Length property).  We could also measure the average, maximum, or sum, which would all be identical for a single file.  However, if we pointed Get-ChildItem at a wildcard or a folder rather than a single file, these aggregates would let us measure multiple files at once. 
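For example, pointing Get-ChildItem at a wildcard aggregates sizes across every matching file at once (the path here is just the sample folder used throughout):

```powershell
# Count gives the number of files; Sum gives their combined size in bytes
Get-ChildItem "C:\Temp2\*.txt" | Measure-Object -Property Length -Sum -Average -Maximum -Minimum
```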

Exercise 4: Search for content within a file.

Searching for content within a file is a basic and well-known operation.  The PowerShell method is equivalent to grep in Unix and findstr at the “old” Windows/DOS command line.  It looks like this (searching for occurrences of “20” within the identified file):

Select-String -Path "C:\Temp2\dataTiny.txt" -Pattern '20'

The result will be each line in the file where the pattern occurs, identified by line number (the output format is FileName:LineNumber:LineContent).

You can search multiple files by using wildcards or identifying folders for the path in the command.
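For instance, to search every .txt file in the sample folder at once:

```powershell
# The wildcard in -Path makes Select-String scan every matching file
Select-String -Path "C:\Temp2\*.txt" -Pattern '20'
```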

Exercise 5: Find and replace items within a file.

Find and replace is an incredibly useful function, and it’s even better at the command line, where you don’t have to open the file:

Get-ChildItem "C:\Temp2\dataTest.txt" | ForEach {
  (Get-Content $_ | ForEach {$_ -replace "20", "Twenty"}) | Set-Content $_
}

This one is a little more involved, so let’s unpack it:

  1. Get-ChildItem "C:\Temp2\dataTest.txt": Get all files in a given location (so you could find and replace across multiple files), and pipe the output to a loop that works through each of them.
  2. ForEach {  (Get-Content $_: For each file, get the content of the file, then pipe that to an inner loop over its lines.  (Note that the $_ variable is similar to “this” in other programming languages: it refers to the current item in the pipeline.)
  3. ForEach {$_ -replace "20", "Twenty"}): For each line, replace every instance of your search string (“20”) with your target string (“Twenty”). 
  4. Set-Content $_: Save your changes back to the file you’re working on (note that the closing brace (}) for the outer loop comes after this command). 
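One detail worth knowing: the -replace operator treats its pattern as a regular expression and is case-insensitive by default; use -creplace when case matters.  A quick illustration:

```powershell
# -replace matches regardless of case, so both words change here
"Twenty TWENTY" -replace "twenty", "20"
# -creplace is the case-sensitive variant, so only the exact-case match changes
"Twenty TWENTY" -creplace "Twenty", "20"
```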

As noted, the above command changes the existing file(s) in place and saves them.  If you’d rather write the changes to a different file, just adjust the command slightly:

Get-ChildItem "C:\Temp2\dataTest.txt" | ForEach {
  (Get-Content $_ | ForEach {$_ -replace "20", "Twenty"})
} | Out-File "C:\Temp2\dataNew.txt"

Instead of saving over the original, we write the changed content to a new file using the Out-File cmdlet.  Note that if you were operating on multiple files that you wanted to save as separate new files, you would need to rearrange this and introduce some variables and logic to build the new filenames (meaning you’d probably want to write a small script to do it right).  The important takeaway here is that you can easily accomplish this.
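As a rough sketch of that multi-file case (the "_new" filename suffix is just an arbitrary choice for illustration):

```powershell
# For each matching file, run the replacement and save the result
# under a new name in the same folder (e.g., dataTest.txt -> dataTest_new.txt)
Get-ChildItem "C:\Temp2\*.txt" | ForEach {
    $newName = Join-Path $_.DirectoryName ($_.BaseName + "_new" + $_.Extension)
    (Get-Content $_ | ForEach {$_ -replace "20", "Twenty"}) | Set-Content $newName
}
```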

Exercise 6: Sort data within a file.

To sort the data in a file, just use this command:

Get-Content "C:\Temp2\dataTiny.txt" | Sort-Object -Descending

To write the newly sorted data to a new file, adjust the command like this:

Get-Content "C:\Temp2\dataTiny.txt" | Sort-Object -Descending | Out-File "C:\Temp2\dataDesc.txt"
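If you want to sort on a single column rather than on whole lines, you can parse the file first and sort on a property.  A sketch, assuming the same pipe-delimited sample data and column names used earlier (the [int] cast is there so "10" sorts after "9" rather than alphabetically):

```powershell
# Parse the file, sort numerically on col2, and write it back out pipe-delimited
Import-Csv "C:\Temp2\dataTiny.txt" -Delimiter "|" |
    Sort-Object -Property { [int]$_.col2 } |
    Export-Csv "C:\Temp2\dataSorted.txt" -Delimiter "|" -NoTypeInformation
```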

Exercise 7: Find or remove duplicate records within a file.

For the final exercise, we want to find or remove duplicate records.  First, we’ll remove them:

Get-Content "C:\Temp2\dataDups.txt" | Sort-Object -Unique

That is pretty easy (note that Sort-Object -Unique sorts the data as it removes the duplicates), and you could pipe the results to a new file with Out-File just as in previous examples.  If you’d like to identify the duplicates instead of removing them, you can use this method:

Get-Content "C:\Temp2\dataDups.txt" | Group-Object | Where-Object { $_.count -gt 1 } | Format-Table Count,Name -AutoSize

In this example, we’re starting again by getting the content of the file.  We then group the file by row (similar to a SQL group by).  We then pipe these results to the Where-Object cmdlet, specifying that we only want groups that appear more than once (the group’s count is greater than 1, using the -gt operator).  Finally, we pipe this result to Format-Table, selecting the Count and Name columns.  This will give you output like the following, identifying the duplicate rows and how many times each appeared in your dataset:

Count Name                   
----- ----                   
    2 Dup1|1|1|1|1|1|1|1|1|1|1
    2 Dup0|0|0|0|0|0|0|0|0|0|0
    2 Dup8|8|8|8|8|8|8|8|8|8|8
    2 Dup4|4|4|4|4|4|4|4|4|4|4
    2 Dup3|3|3|3|3|3|3|3|3|3|3
    2 Dup2|2|2|2|2|2|2|2|2|2|2
    2 Dup5|5|5|5|5|5|5|5|5|5|5
    2 Dup7|7|7|7|7|7|7|7|7|7|7
    2 Dup6|6|6|6|6|6|6|6|6|6|6
    2 Dup9|9|9|9|9|9|9|9|9|9|9

Again, these are just examples of what PowerShell can do.  I’m sure that both PowerShell and Unix offer many additional functions you could use for many other data manipulations.  Perhaps one day I’ll write more about it, but in the meantime, I leave that exploration as a project for you, the reader.