Wednesday, September 03, 2014

Useful PowerShell Commands for Data Science

A while back, I ran across an article titled Useful Unix commands for data science.  The article was quite interesting, and as I read it, I grew curious to try the same operations in Windows PowerShell.  The idea was not to prove that PowerShell is somehow better; I just wanted to see whether I could achieve similar results and add another tool to my toolbox for working with data.  What I found is that PowerShell is a very capable language with nearly limitless uses, from one-line commands to full scripts that can do just about anything you want (on a Windows machine). 

What follows is a list of “exercises” derived from the commands outlined in the article cited above, along with PowerShell solutions to each.  (Note: I’m going to assume the reader has at least some familiarity with PowerShell, so I won’t go into great detail on the ins and outs of every cmdlet.)  Here they are: 

Exercise 1: Import a delimited file and retrieve the sum of a single column.

In the original article, the author read in a pipe-delimited file and produced the sum of a column within that file.  Doing the same thing in PowerShell requires this command:

Import-Csv "C:\Temp2\dataMed.txt" -Delimiter "|" | Measure-Object -Property col4 -Sum

The above command uses the Import-Csv cmdlet, identifying the file and delimiter.  It then pipes the results of this import to the Measure-Object cmdlet, which sums the identified property (column) in the data file.  In this case, the column name is “col4”.  The results look like this:

Count    : 100000
Average  :
Sum      : 5042083
Maximum  :
Minimum  :
Property : col4

You’ll notice that there are several other aggregates we could take advantage of beyond the sum.  In addition, we could run this same operation on multiple columns at once by using wildcards to identify the properties to act on.  Here is an example that produces the sum, maximum, minimum, and average for every column in the file:

Import-Csv "C:\Temp2\dataMed.txt" -Delimiter "|" | Measure-Object -Property * -Sum -Maximum -Minimum -Average

The results look like this:

Count    : 100000
Average  : 50.68611
Sum      : 5068611
Maximum  : 100
Minimum  : 1
Property : col1

Count    : 100000
Average  : 50.29766
Sum      : 5029766
Maximum  : 100
Minimum  : 1
Property : col2

Count    : 100000
Average  : 50.61111
Sum      : 5061111
Maximum  : 100
Minimum  : 1
Property : col3

Count    : 100000
Average  : 50.42083
Sum      : 5042083
Maximum  : 100
Minimum  : 1
Property : col4

Count    : 100000
Average  : 50.56574
Sum      : 5056574
Maximum  : 100
Minimum  : 1
Property : col5

…etc.

Exercise 2: Get the first N lines or the last N lines of a given file.

Suppose you’d just like to look at a few lines of the file you have.  To do that in PowerShell, use this command:

Get-Content "C:\Temp2\dataTiny.txt" | Select-Object -First 3

First, we use Get-Content to read in the file contents (rather than importing and parsing the file, we just get the content as one big chunk of text), then pipe the result to the Select-Object cmdlet.  Here, we select the first 3 rows in the file.  To select the last N rows (in this case 3), just change the command slightly to:

Get-Content "C:\Temp2\dataTiny.txt" | Select-Object -Last 3

Exercise 3: Count lines, words, and bytes of a given file.

Here is a command that will count the lines, words, and characters in a given file (you can ignore whitespace in the character count by adding the -IgnoreWhiteSpace flag):

Get-Content "C:\Temp2\dataMed.txt" | Measure-Object -Line -Word -Character

It will give you a result like this:

Lines             Words          Characters Property          
-----             -----          ---------- --------          
100001            100001         12169841    
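
For example, to exclude whitespace from the character count, just add the flag mentioned above:

Get-Content "C:\Temp2\dataMed.txt" | Measure-Object -Character -IgnoreWhiteSpace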

To get the number of bytes in a file, use this command:

Get-ChildItem "C:\Temp2\dataMed.txt" | Measure-Object -Property Length -Minimum

Note that we’re measuring the ‘minimum’ size (the Length property).  We could also measure the average, maximum, or sum, which would all be identical for a single file.  However, if we used a wildcard or pointed Get-ChildItem at a folder rather than a file, we could measure multiple files at once using these aggregates. 
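
For example, here is a sketch that measures every .txt file in the folder at once (the wildcard path is illustrative):

Get-ChildItem "C:\Temp2\*.txt" | Measure-Object -Property Length -Sum -Average -Maximum -Minimum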

Exercise 4: Search for content within a file.

Searching for content within a file is a pretty basic and well-known function.  The PowerShell method is equivalent to grep in Unix and findstr in the “old” Windows/DOS command line.  It looks like this (here, searching for occurrences of “20” within the identified file):

Select-String -Path "C:\Temp2\dataTiny.txt" -Pattern '20'

The result will be each line in the file where the pattern occurs, identified by line number (the syntax is FileName:LineNumber:LineContent):

C:\Temp2\dataTiny.txt:6:5|93|1|46|32|4|7|81|20|25|41
C:\Temp2\dataTiny.txt:21:20|80|44|31|96|32|59|29|51|37|28
C:\Temp2\dataTiny.txt:26:25|32|72|50|100|20|54|45|78|30|7
C:\Temp2\dataTiny.txt:40:39|22|20|95|93|29|55|94|44|70|14
C:\Temp2\dataTiny.txt:42:41|48|9|75|85|22|90|69|63|10|20
C:\Temp2\dataTiny.txt:51:50|89|19|58|97|52|20|94|70|23|85
C:\Temp2\dataTiny.txt:52:51|93|20|89|100|33|93|39|61|69|9

You can search multiple files by using wildcards or identifying folders for the path in the command.
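
For example, to search every .txt file in the folder at once (again, the wildcard path is just for illustration):

Select-String -Path "C:\Temp2\*.txt" -Pattern '20'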

Exercise 5: Find and replace items within a file.

Find and replace is an incredibly useful function, and it’s even better at the command line without having to open the file:

Get-ChildItem "C:\Temp2\dataTest.txt" | ForEach {
  (Get-Content $_ | ForEach {$_ -replace "20", "Twenty"}) | Set-Content $_
}

This one is a little more involved, so let’s unpack it:

  1. Get-ChildItem "C:\Temp2\dataTest.txt": Get all files in a given location (so you could find and replace across multiple files), then pipe the output to a loop that works through each of those items.
  2. ForEach {  (Get-Content $_: For each file, get the file’s content, then pipe that to a loop that finds and replaces each instance of your search string (note that the $_ variable is like “this” in other programming languages; it refers to the current object in the pipeline).
  3. ForEach {$_ -replace "20", "Twenty"}): For each line of that content, replace every instance of your search string (“20”) with your target string (“Twenty”). 
  4. Set-Content $_: Save your changes back to the file you’re currently working on (note that the closing bracket (}) of the outer loop comes after this command). 

As noted, the above command changes the existing file(s) and saves them in place.  If you’d rather save the changes to a different file, you can change the command slightly to do this:

Get-ChildItem "C:\Temp2\dataTest.txt" | ForEach {
  (Get-Content $_ | ForEach {$_ -replace "20", "Twenty"})
} | Out-File "C:\Temp2\dataNew.txt"

Instead of saving over the original file, this writes the changed content to a new file using the Out-File cmdlet.  Note that if you were operating on multiple files and wanted to save each to its own new file, you would need to rearrange this and introduce some variables and logic to create the new filenames (meaning you’d probably need to write a small script to do it right).  The important takeaway is that you can easily accomplish this.
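
For instance, a minimal sketch of such a script might look like this (the wildcard path and the "_new" suffix for the output names are my own assumptions):

Get-ChildItem "C:\Temp2\*.txt" | ForEach {
  # build an output name alongside the original, e.g. dataTest_new.txt
  $newName = Join-Path $_.DirectoryName ($_.BaseName + "_new" + $_.Extension)
  (Get-Content $_.FullName | ForEach {$_ -replace "20", "Twenty"}) | Set-Content $newName
}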

Exercise 6: Sort data within a file.

To sort the data in a file, just use this command:

Get-Content "C:\Temp2\dataTiny.txt" | Sort-Object -Descending

To write the newly sorted data to a new file, adjust the command like this:

Get-Content "C:\Temp2\dataTiny.txt" | Sort-Object -Descending | Out-File "C:\Temp2\dataDesc.txt"

Exercise 7: Find or remove duplicate records within a file.

For the final exercise, we want to find or remove duplicate records.  First, we’ll remove them:

Get-Content "C:\Temp2\dataDups.txt" | Sort-Object -Unique

If you’d like to identify the duplicates instead of removing them, you can use this method:

Get-Content "C:\Temp2\dataDups.txt" | Group-Object | Where-Object { $_.count -gt 1 } | Format-Table Count,Name -AutoSize

In this example, we start again by getting the content of the file.  We then group the rows (similar to a SQL GROUP BY).  Next, we pipe these results to the Where-Object cmdlet, keeping only the groups that appear more than once (the group’s Count is greater than 1, using the -gt operator).  Finally, we pipe this result to Format-Table, selecting the Count and Name columns.  This gives you output like the following, identifying the duplicate rows and how many times each appears in your dataset:

Count Name                   
----- ----                   
    2 Dup1|1|1|1|1|1|1|1|1|1|1
    2 Dup0|0|0|0|0|0|0|0|0|0|0
    2 Dup8|8|8|8|8|8|8|8|8|8|8
    2 Dup4|4|4|4|4|4|4|4|4|4|4
    2 Dup3|3|3|3|3|3|3|3|3|3|3
    2 Dup2|2|2|2|2|2|2|2|2|2|2
    2 Dup5|5|5|5|5|5|5|5|5|5|5
    2 Dup7|7|7|7|7|7|7|7|7|7|7
    2 Dup6|6|6|6|6|6|6|6|6|6|6
    2 Dup9|9|9|9|9|9|9|9|9|9|9

Again, these are just examples of what PowerShell can do.  I’m sure that both PowerShell and Unix have many additional functions you could use for many other data manipulations.  Perhaps one day I’ll write more about it, but in the meantime, I leave that exploration as a project for you, the reader.

Thursday, May 01, 2014

Firefox 29 and the HTTP Response Timeout

I’ve been using Selenium with Python for some time to automate some work in the browser that I absolutely can’t accomplish any other way (and believe me, I wish I could).  So when Firefox 29 came out, I upgraded, ran a quick test, and everything seemed fine.  However, my script that ran overnight bombed during the browser phase.

Before continuing, I should describe a little bit of what I expect the browser to do during this script.  Here is the basic sequence of events:

1. Open a vendor’s website and log in.
2. Initiate a database backup by either clicking a link or directly running the JavaScript the link runs (either way works).
3. Wait (60 minutes).
4. Initiate a download of the backed-up database by either clicking a link or directly running the JavaScript the link runs (again, either way works).
5. Wait (20 minutes).
6. Close the browser and pass control back to the main script.

The script was throwing an unknown exception at step 4.  On investigation, I found that the browser was generating a timeout message a few minutes into the wait period at step 3, which made step 4 impossible.  The message was something like this:


Naturally, I started scouting around for all the timeout settings in Firefox, trying to figure out which one needed to change.  There are several timeout settings in Firefox (see here and here, for example), but none of these seemed to solve the problem.  While searching, I stumbled on this document showing changes between Firefox 28 and 29, and it turned out to be gold.  In particular, notice this change:


So, the network.http.response.timeout setting went from being unset in version 28 to a default of 300 seconds (5 minutes) in version 29.  You probably won’t notice this unless (like me) you work with a page that has a long-running process that you have to wait more than 5 minutes for a response from.  To solve this, you can update the setting manually in about:config.  Better yet, you can do it programmatically through Selenium with this code:

ffprofile = webdriver.FirefoxProfile()
# the value is in seconds; replace 30000 with the maximum delay you want to allow
ffprofile.set_preference("network.http.response.timeout", 30000)
browser = webdriver.Firefox(firefox_profile=ffprofile)

Problem solved.

Monday, March 10, 2014

Un-zipping / Copying to Network Location

I had a lovely little batch script that, among other things, unzipped an incoming file (using 7-Zip at the command line) to a network location using this command: 

"%ProgFiles%\7-Zip\7z.exe" e "C:\Temp\zipfile.zip" -o"x: " file.mdb -y -r

In this case, the network location is mapped to the X: drive.  The command extracts file.mdb from wherever it sits within zipfile.zip (the -r switch recurses through all directories) and takes all actions (namely, overwriting existing files) without prompting (the -y switch). 

While the script did not fail every time, it failed fairly regularly somewhere in the middle of unzipping the file.  When it failed, the only error message was “The system cannot open the device or file specified.”  Exactly which file the system couldn’t find or open wasn’t clear, but I strongly suspected it was the destination file on the network drive.  The network location didn’t appear unstable (it never disappeared when I navigated to it or copied files manually), but my guess was that there was some kind of instability in the connection. 

To minimize this, I changed the zip command to unzip to the local directory and then copy the resulting file to the X drive:

"%ProgFiles%\7-Zip\7z.exe" e "C:\Temp\zipfile.zip" file.mdb -y –r
COPY file.mdb X:\file.mdb /Y

I thought this would minimize the opportunity for the process to fail.  Indeed, it no longer failed during the unzip, but it then started failing during the copy for the same reason.  In doing some research, I found two switches on the XCOPY command that seem to have resolved the issue.  Here is the copy command I now use after unzipping to a local directory:

XCOPY file.mdb X:\* /J /Y /Z

The /J switch copies without buffering, which is recommended for very large files (mine is nearly 1.5 GB).  The /Z switch allows the copy to restart if the network connection drops.  The COPY command has an equivalent /Z switch, but no /J.

As an aside, I copy the file to X:\* to keep XCOPY from asking whether the target is a file or a directory ("Does X:\file.mdb specify a file name or directory name on the target (F = file, D = directory)?"). 

Since I switched to this method, the script has run like a champ with no failures.
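
As a side note, if you ever need to do this step from PowerShell, Copy-Item has no restartable mode like /Z, but a small retry loop gets you similar resilience.  Here is a sketch, with an assumed limit of five attempts:

$tries = 0
do {
    $tries++
    Copy-Item "C:\Temp\file.mdb" "X:\file.mdb" -Force -ErrorAction SilentlyContinue
} until ($? -or $tries -ge 5)    # stop on success or after five attempts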

Wednesday, October 02, 2013

Filtering Data in Excel

Spreadsheet software has many valuable functions, but the one I find most useful (and use most frequently) is the data filter (aka “auto filter”).  This post covers how to use the data filter in Microsoft Excel (the function is similar in most other spreadsheet programs).  For each example, I’ll be using a sample data set of medals won by Olympic athletes, available here.

First, you have to turn on the filter.  To do this, highlight the row containing column names for your data set, then select Filter from the Data tab in the ribbon:


For Excel 2003 and prior, click the Data menu, then point to Filter, and click AutoFilter.

Once you’ve done this, you’ll see a button to the right of each column heading.  Clicking this button will give you a menu of options for sorting and filtering your list based on data in that column:


As an example, we’ll look for only records of gold medals in archery, ordered by year (ascending).  First, we’ll filter on the Sport column (uncheck Select All, then check Archery to do this quickly):


Next, filter the Gold Medals column for values greater than zero (there are multiple ways to do this, but I’m demonstrating the quickest for this particular data set):


Finally, sort the Year column ascending (smallest to largest):


This gets us to our result.  To remove the filter on any column, select the menu for that column and then select Clear Filter From <Column Name>:


As you can see, this is a simple but very powerful tool.  Enjoy!

Monday, September 16, 2013

Remove All Formatting in Excel

Let’s say you have a spreadsheet that someone built for printing or for looks.  It might have gridlines, background colors, special fonts, and the like:


All you want out of this is some nice rows and columns for analysis, and the formatting is not helpful for that purpose.  How do you quickly simplify this data without updating every single column or cell?  First, select the entire worksheet (or whatever area you want to update).  Then, on the Home tab (in the Editing group at the far right), open the Clear menu and select Clear Formats:


Once you’ve done that, your data will be ready for use:


There’s only one drawback to this approach that I’ve found so far: if you have date fields in your data, you’ll have to re-apply the date format to the affected columns/fields.  Otherwise, those columns will display date serial numbers, because they no longer have a format assigned.  Other than that, this method works great for quickly removing formatting so that you can focus on the data.

Monday, July 01, 2013

Registry Hack - Add "Edit With..." Shortcut to Context Menu (Using Notepad++)

I've had to do this several times now, so it's time to write it down.  Here are the issues:

1. The existing 'Edit with Notepad++' context menu link generates an error on execution
2. Worse, it doesn't display at all through Multi-Commander

No like!

So here's how to fix it.  First, get rid of the existing context menu entry, which you'll find under HKEY_CLASSES_ROOT\*\shellex\ANotepad++64.  Now you're ready to add the new entry:

1. Go to HKEY_CLASSES_ROOT\*\shell.
2. Right-click shell and add a new key called "Edit with Notepad++" (or whatever you want to call it).
3. Now right-click the key you just created, and add a new key beneath it called "command".  The end result should look like this:



4. Within the command key, modify the default string value to the full path of the Notepad++ executable in quotes, followed by "%1" (quoting %1 as well ensures file paths with spaces still work).  For example:



5. Go back to your original key (Edit with Notepad++ or similar) and add a new string value called "Icon".  Set its value to the full path of the Notepad++ executable in quotes, followed by ,0.  For example:



The changes will take effect immediately, and you should now have the item in your right-click context menu:
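
If you'd rather script these steps than click through the Registry Editor, here's a minimal PowerShell sketch that does the same thing through the .NET registry API (the Notepad++ path below is an assumption; adjust it to your install, and run this from an elevated PowerShell prompt):

# Assumption: default Notepad++ install path; requires administrative rights
$exe = 'C:\Program Files\Notepad++\notepad++.exe'
$key = [Microsoft.Win32.Registry]::ClassesRoot.CreateSubKey('*\shell\Edit with Notepad++')
$key.SetValue('Icon', ('"{0}",0' -f $exe))       # step 5: the Icon value
$cmd = $key.CreateSubKey('command')
$cmd.SetValue('', ('"{0}" "%1"' -f $exe))        # step 4: the (Default) command string
$cmd.Close()
$key.Close()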





SQL Server 2012 Linked Server for MS Access

Here's the scoop: I've installed SQL Server, but I want to use Management Studio (SSMS) to query an Access database.  I followed the directions and found out along the way that the Jet OLEDB 4.0 provider is only available in 32-bit.  But now I'm getting this error:


Text:
An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)
Cannot initialize the data source object of OLE DB provider "Microsoft.Jet.OLEDB.4.0" for linked server "[server name]".
OLE DB provider "Microsoft.Jet.OLEDB.4.0" for linked server "[server name]" returned message "Unspecified error". (Microsoft SQL Server, Error: 7303)

There are some solutions to similar errors out there, but none of them were working.  What now?  Try running SSMS as an administrator.  In fact, make sure it always runs with administrative rights by right-clicking the shortcut/executable, selecting Properties, and checking the "Run this program as an administrator" box:


Once you do this, SSMS will have full rights to connect to your MS Access files (and any others you'd like to query).


Remove All Images From Excel Worksheet

So I have a spreadsheet I've extracted, and it has two columns of images in it (which I don't want):


Ok, so I'll just delete the columns like I would with any other data, right?


Hm.  That didn't work.  Are you really telling me that I have to select and delete these one-by-one? NO WAY!

Thankfully, there's a quick macro you can run to get rid of all images in a worksheet like this:

Dim s As Shape
' loop through every shape on the active sheet and delete it
For Each s In ActiveSheet.Shapes
    s.Delete
Next

Running this will remove all these pesky images in one shot.  No fuss, no muss.

Happy deleting!

Quickly Line Up Rows/Columns in Excel

Have you ever pasted a table into Excel and gotten results like this?


No like!

Wouldn't it be nice to line up and autofit all the rows and columns in one click?  Here's a quick macro that will allow you to do just that:

Cells.Select
Selection.ColumnWidth = 255    'make the columns very wide
Cells.EntireRow.AutoFit        'fit the rows
Selection.ColumnWidth = 9      'return the columns to a standard width
Cells.EntireColumn.AutoFit     'fit the columns

Basically, it selects all the cells, makes the columns extremely wide, then autofits the rows and columns.  The result is more like this:


Much better.  Enjoy!

Friday, August 10, 2012

Rob Peter to Pay Paul

The idiom "Rob Peter to pay Paul" is fairly well known and most people understand it to mean solving one problem by causing another.  In many cases, people use it to mean paying a debt by incurring another debt.  Here are some examples:

...to discharge one debt by incurring another (http://www.phrases.org.uk/meanings/rob-peter-to-pay-paul.html)

Why borrow money to pay your bills? That's just robbing Peter to pay Paul. (http://idioms.thefreedictionary.com/rob+Peter+to+pay+Paul)


Jacob: I think I'm going to apply for another credit card so I can pay off some of my bills. 
David: Robbing Peter to pay Paul, eh?! Just be careful not to get into debt. (http://www.urbandictionary.com/define.php?term=Robbing%20Peter%20to%20pay%20Paul)

For example, They took out a second mortgage on their house so they could buy a condo in Florida--they're robbing Peter to pay Paul. (http://www.answers.com/topic/rob-peter-to-pay-paul)

We all get what these examples are trying to say, but they are not a completely accurate use of the idiom.  Imagine robbing a bank to get money to pay your bills.  If you get away with it, you're out of debt and you've solved your financial problem.  If you don't get away with it, you go to jail.  You could say that you've incurred a social debt, or you could imagine monetary fines being assessed, but neither of those is what the above examples were talking about.

If we were to use a completely accurate metaphor, we would say in these cases that we are "Borrowing from Peter to pay Paul".

How important is this?  Not very.  As I said, everyone understands what the idiom represents, even if the language is not precise.  However, I think it is an interesting mental exercise to challenge assumptions, including habitual use of words that may or may not be exactly correct.