Wednesday, September 03, 2014

Useful Powershell Commands for Data Science

A while back, I ran across an article titled Useful Unix commands for data science.  The article was quite interesting, and as I read it, I began to be curious to try doing the same operations using Windows Powershell.  The idea here was not to prove that Powershell was somehow better.  I just wanted to see if I could use it to achieve similar results and add another tool to my toolbox of options for working with data.  What I found was that Powershell is a very capable language that allows nearly limitless uses, from single-line commands to full scripts that can do just about anything you want (on a Windows machine). 

What follows is a list of “exercises” derived from the commands outlined in the article cited above, and the appropriate Powershell solutions to those exercises (Note: I’m going to assume that the reader has at least some familiarity with Powershell and not go into a huge amount of detail on the ins and outs of every command(let)).  Here they are: 

Exercise 1: Import a delimited file and retrieve the sum of a single column.

In the original article, the author read in a pipe-delimited file and produced the sum of a column within that file.  To do the same thing with Powershell requires using this command:

Import-Csv "C:\Temp2\dataMed.txt" -Delimiter "|" | Measure-Object -Property col4 -Sum

The above command uses the Import-Csv commandlet, identifying the file and delimiter.  It then pipes the results of this import to the Measure-Object commandlet, which sums the identified property (column) in the data file.  In this case, the column name is “col4”.  The results look like this:

Count    : 100000
Average  :
Sum      : 5042083
Maximum  :
Minimum  :
Property : col4

You’ll notice that there are several other columns of aggregation that we could take advantage of beyond sum.  In addition, we could do this same operation on multiple columns at once by using wildcards to identify the desired property to act on.  Here is an example that provides the sum, maximum, minimum, and average for every column in the file:

Import-Csv "C:\Temp2\dataMed.txt" -Delimiter "|" | Measure-Object -Property * -Sum -Maximum -Minimum -Average

The results look like this:

Count    : 100000
Average  : 50.68611
Sum      : 5068611
Maximum  : 100
Minimum  : 1
Property : col1

Count    : 100000
Average  : 50.29766
Sum      : 5029766
Maximum  : 100
Minimum  : 1
Property : col2

Count    : 100000
Average  : 50.61111
Sum      : 5061111
Maximum  : 100
Minimum  : 1
Property : col3

Count    : 100000
Average  : 50.42083
Sum      : 5042083
Maximum  : 100
Minimum  : 1
Property : col4

Count    : 100000
Average  : 50.56574
Sum      : 5056574
Maximum  : 100
Minimum  : 1
Property : col5


Exercise 2: Get the first N lines or the last N lines of a given file.

Suppose you’d just like to look at a few lines of the file you have.  To do that in Powershell, use this command:

Get-Content "C:\Temp2\dataTiny.txt" | Select-Object -First 3

First, we use Get-Content to read in the file contents (rather than importing and parsing the file, we just get the file content as one big chunk), then pipe the result to the Select-Object commandlet.  Here, we select the first 3 rows in the file.  To select the last N (in this case 3) rows, just change the command slightly to:

Get-Content "C:\Temp2\dataTiny.txt" | Select-Object -Last 3

Exercise 3: Count lines, words, and bytes of a given file.

Here is a command that will count the lines, words, and characters in a given file (you can ignore whitespace in the character count using the –IgnoreWhiteSpace flag):

Get-Content "C:\Temp2\dataMed.txt" | Measure-Object -Line -Word -Character

It will give you a result like this:

Lines             Words          Characters Property          
-----             -----          ---------- --------          
100001            100001         12169841    

To get the number of bytes in a file, use this command:

Get-ChildItem "C:\Temp2\dataMed.txt" | Measure-Object -Property Length -Minimum

Note that we’re measuring the ‘minimum’ size (length).  We could also measure average, maximum, or sum, which would all be identical for a single file.  However, if we used a wildcard or selected from a folder rather than a file for the Get-ChildItem command, we would be able to measure multiple files at once using these aggregates. 

Exercise 4: Search for content within a file.

Searching for content within a file is a pretty basic and well-known function.  The Powershell method is equivalent to grep in Unix and findstr in the “old” Windows/DOS command line.  It looks like this (looking for occurrences of “20” within the identified file):

Select-String -Path "C:\Temp2\dataTiny.txt" -Pattern '20'

The result will be each line in the file (identified by number) where the result exists (the syntax is FileName:LineNumber:RowContent):


You can search multiple files by using wildcards or identifying folders for the path in the command.

Exercise 5: Find and replace items within a file.

Find and replace is an incredibly useful function, and it’s even better at the command line without having to open the file:

Get-ChildItem "C:\Temp2\dataTest.txt" | ForEach {
  (Get-Content $_ | ForEach {$_ -replace "20", "Twenty"}) | Set-Content $_

This one is a little more involved, so let’s unpack it:

  1. Get-ChildItem "C:\Temp2\dataTest.txt": Get all files in a given location (so you could find and replace on multiple files).  Pipe this output to a for loop that works through each of these items.
  2. ForEach {  (Get-Content $_: For each file, get the content of the file, then pipe that to a for loop that finds and replaces each instance of your search string (note the $_ variable is like “this” in other programming languages.  It refers to the current working set).
  3. ForEach {$_ -replace "20", "Twenty"}): For each instance of your search string (“20”), replace with your target string (“Twenty”). 
  4.  Set-Content $_: Save your changes to the current file you’re working on (note that the end bracket (}) for the outer for loop is outside this command). 

 As noted, the above command changes the existing file(s) and saves them.  If you’d rather save changes to a different file, you can just change the command slightly to do this:

Get-ChildItem "C:\Temp2\dataTest.txt" | ForEach {
  (Get-Content $_ | ForEach {$_ -replace "20", "Twenty"})
} | Out-File "C:\Temp2\dataNew.txt"

Instead of saving the file, we write the changed file content to a new file using the Out-File command.  Note that if you were operating on multiple files that you wanted to save to new individual files, you would need to re-arrange this and probably introduce some variables and logic to create your new filenames (meaning you’d probably need to write more of a small script to do it right).  The important takeaway here is that you can easily accomplish this.

Exercise 6: Sort data within a file.

To sort the data in a file, just use this command:

Get-Content "C:\Temp2\dataTiny.txt" | Sort-Object -Descending

To write the newly sorted data to a new file, adjust the command like this:

Get-Content "C:\Temp2\dataTiny.txt" | Sort-Object -Descending | Out-File "C:\Temp2\dataDesc.txt"

Exercise 7: Find or remove duplicate records within a file.

For the final exercise, we want to find or remove duplicate records.  First, we’ll remove them:

Get-Content "C:\Temp2\dataDups.txt" | Sort-Object -Unique

That is pretty easy, and you could pipe the results to a new file (Out-File) in the same way as in previous examples.  If you’d like to identify the duplicates instead of removing them, you can use this method:

Get-Content "C:\Temp2\dataDups.txt" | Group-Object | Where-Object { $_.count -gt 1 } | Format-Table Count,Name -AutoSize

In this example, we’re starting again by getting the content of the file.  We then group the file (similar to a SQL group by) by row.  We then pipe these results to the Where-Object command, specifying that we want to show rows that appear more than once (the group count has greater than [gt] 1 item).  Finally, we pipe this result to Format-Table, selecting the Count and Name columns.  This will give you output like this to identify the duplicate rows and how many times they appeared in your dataset:

Count Name                   
----- ----                   
    2 Dup1|1|1|1|1|1|1|1|1|1|1
    2 Dup0|0|0|0|0|0|0|0|0|0|0
    2 Dup8|8|8|8|8|8|8|8|8|8|8
    2 Dup4|4|4|4|4|4|4|4|4|4|4
    2 Dup3|3|3|3|3|3|3|3|3|3|3
    2 Dup2|2|2|2|2|2|2|2|2|2|2
    2 Dup5|5|5|5|5|5|5|5|5|5|5
    2 Dup7|7|7|7|7|7|7|7|7|7|7
    2 Dup6|6|6|6|6|6|6|6|6|6|6
    2 Dup9|9|9|9|9|9|9|9|9|9|9

Again, these are just examples of what Powershell can do.  I’m sure that both Powershell and Unix have many additional functions that you could use to do many other data manipulations.  Perhaps one day I’ll write more about it, but in the meantime, I leave that exploration as a project for you, the reader.

Thursday, May 01, 2014

Firefox 29 and the HTTP Response Timeout

I’ve been using Selenium with Python for some time to automate some work in the browser that I absolutely can’t accomplish any other way (and believe me, I wish I could).  So when Firefox 29 came out, I upgraded, ran a quick test, and everything seemed fine.  However, my script that ran overnight bombed during the browser phase.

Before continuing, I should describe a little bit of what I expect the browser to do during this script.  Here is the basic sequence of events:

1. Open a vendor’s website and log in.
2. Initiate a database backup by either clicking a link or directly running the Javascript the link runs (either way works).
3. Wait (60 minutes).
4. Initiate a download of the backed up database by either clicking a link or directly running the Javascript the link runs (again, either way works).
5. Wait (20 minutes).
6. Close the browser and pass control back to the main script.

The script was throwing an unknown exception at step 4.  On investigation, I found out the reason for this was the browser generating a timeout message a few minutes into the wait period at step 3, which made step 4 impossible.  The message was something like this:

Naturally, I started scouting around for all the timeout settings in Firefox, trying to figure out which one needed to change.  There are several timeout settings in Firefox (see here and here, for example), but none of these seemed to solve the problem.  While searching, I stumbled on this document showing changes between Firefox 28 and 29, and it turned out to be gold.  In particular, notice this change:

So, the network.http.response.timeout went from not being set in version 28 to a default limit of 300 (5 minutes) in version 29.  You probably won’t notice this, unless (like me) you work with a page that has a long-running process that you have to wait more than 5 minutes for a response from.  To solve this, you can update this setting manually in about:config.  Better yet, you can do it programmatically through Selenium with this code:

ffprofile = webdriver.FirefoxProfile()
ffprofile.set_preference("network.http.response.timeout", 30000) #replace 30000 with the number of seconds you want the max delay to be
browser = webdriver.Firefox(firefox_profile=ffprofile)

Problem solved.

Monday, March 10, 2014

Un-zipping / Copying to Network Location

I had a lovely little batch script that, among other things, unzipped an incoming file (using 7-zip at the command line) to a network location using this command: 

"%ProgFiles%\7-Zip\7z.exe" e "C:\Temp\" -o"x: " file.mdb -y -r

In this case, the network location is mapped to the “X” drive.  The command will extract file.mdb wherever it sits within (-r recurses through all directories) and take all actions (namely overwriting existing files) without prompting (-y switch). 

While the script did not fail all the time, it seemed to fail fairly regularly somewhere in the middle of unzipping the file.  When the script failed, the only error message was that “The system cannot open the device or file specified.”  What file the system couldn't find/open wasn't entirely clear here, but I highly suspected that it was the destination file on the network drive.  The network location didn’t appear unstable (in that it never disappeared when I navigated to it or copied files manually), but my guess was that there was some kind of instability in the connection. 

To minimize this, I changed the zip command to unzip to the local directory and then copy the resulting file to the X drive:

"%ProgFiles%\7-Zip\7z.exe" e "C:\Temp\" file.mdb -y –r
COPY file.mdb X:\file.mdb /Y

I thought this would minimize the opportunity for the process to fail.  Indeed, it no longer failed during the zip process, but it then started failing during the copy for the same reasons.  In doing some research, I found two switches associated with the XCOPY command that seem to have resolved the issue.  Here is the copy command I now use after unzipping to a local directory:

XCOPY file.mdb X:\* /J /Y /Z

The J switch removes buffering for large files (my file is nearly 1.5 GB).  The Z switch allows for restarting the copy over a network connection.  The COPY command has an equivalent Z switch, but not the J switch.

As an aside, I copy the file to X:\* to avoid the XCOPY command always asking whether the target is a file or directory ("Does X:\file.mdb specify a file name or directory name on the target (F = file, D = directory)?"). 

Since I switched to this method, the script has run like a champ with no failures.