PowerShell Performance Part 2, Reading Text Files

This is part 2 of my informal blog series on PowerShell performance.  In part 1 I discussed some strategies for measuring performance.  In part 2 I’ll be covering file read performance and related techniques and use cases.  Because of the volume of information, I’ll cover writing file data in part 3.

Working with text files is fundamental.  Tasks like reading and parsing log files are exceedingly common in both interactive and programmatic scenarios.  It’s no surprise a lot has already been written about PowerShell performance in this area.  My goal here is to conduct a comprehensive study of file read techniques to determine the best options in different situations.  As the title implies I’m particularly interested in performance but code readability and memory utilization will also be considered.

PowerShell’s primary tool for reading text files is the Get-Content (GC) cmdlet.  Like many native cmdlets, GC offers broad capabilities. For example, it can easily read different encodings including non-text data.  No surprise, GC’s flexibility comes with a performance penalty; it’s earned a reputation for being quite slow.  As such, a number of alternate techniques have gained popularity, especially those that directly leverage .Net classes.

Study Methodology:

As described in Part 1, I don’t want to rely on a single measurement.  So, I ran each technique through a 10 iteration loop.  Those techniques that generate a single string were re-run through another 2 loops.  The first, using the -split operator and the other using the .Split() method.  Get-Content can return both types, but defaults to an array, so I wanted to ensure comparison of like return types while including the typical expectation.  The data should be sufficient to pick the fastest approach for the desired output type.

Note/Warning: .Split() will split on every character in its argument.  Therefore, splitting on the default Windows line ending results in unintended empty elements.  To compare fairly with the -split I included the [System.StringSplitOptions]::RemoveEmptyEntries argument in the tests.  However, that will also remove naturally occurring blank lines; a potential problem if you are expecting and/or need them. I included the .Split() variations because it still works well where blanks aren’t an issue, which is often the case with text logs. 

Test files were created by copying data from an IIS log file into 100KB, 2.5MB, 25MB, 50MB, 100MB and 1GB files. I maintained ASCII encoding throughout.

I ran each test in a fresh PowerShell console window.  Seeing as there’s overlap between command permutations and/or .Net classes I didn’t want any of the caching functionality mentioned in Part 1 to skew the results.

To evaluate the impact on memory, I monitored the \Process\Private Bytes counter for each run.

Note: All tests were performed with PowerShell 5.1.


Here’s are the techniques I tested and their respective test code:

  • Get-Content
1..10 | ForEach{ (Measure-Command { Get-Content $file }).TotalMilliseconds }
  • Get-Content -Raw

    Returns a single string including line ending characters.  As mentioned the -Raw parameter will be retested with the additional splits.
1..10 | ForEach{ (Measure-Command { Get-Content  $file -Raw }).TotalMilliseconds }
1..10 | ForEach{ (Measure-Command { (Get-Content  $file -Raw) -split "`r`n" } ).TotalMilliseconds }
1..10 | ForEach{ (Measure-Command { (Get-Content  $file -raw).split("`r`n", [StringSplitOptions]::RemoveEmptyEntries)}).TotalMilliseconds}
  • Get-Content -ReadCount 0

The -ReadCount parameter determines how many lines are passed down the pipe at a time.  -ReadLine 0 will pass all lines down the pipe at once.  This generally precludes cleanly placing | ForEach-Object{} directly after the Get-Content cmdlet, because $_ will actually be an array consisting of whatever number of objects were specified with -ReadCount. This method is fine if you need to store the data in a variable.

1..10 | foreach{ (Measure-Command { Get-Content 'C:\temp\TestFiles\Test100MB.txt' -ReadCount 0}).TotalMilliseconds }
  • [System.IO.File]::ReadAllLines()

Reference: MS Documentation

The System.IO.File class offers functionality for working with files.  The ReadAllLines static method is particularly useful and has been my go-to alternative for quite a while.  It returns a string array ([String[]]) which operationally equivalent to Get-Content‘s [Object[]] return. So, withstanding the break from verb-noun syntax it’s an easy drop-in alternative.

1..10 | ForEach{ (Measure-Command { Get-Content $file -ReadCount 0}).TotalMilliseconds }

Note: Shorthand below may refer to this as [IO.File]::ReadAllLines() or just ::ReadAllLines()

  • [System.IO.File]::ReadAllText()
    like GC -Raw this will read the entire file into memory as a single string, including the line break characters.  So, it too will be tested with the additional splits.
1..10 | ForEach{ (Measure-Command { [System.IO.File]::ReadAllText( $file ) }).TotalMilliseconds }
1..10 | ForEach{ (Measure-Command { [System.IO.File]::ReadAllText( $file ) -split "`r`n" }).TotalMilliseconds }
1..10 | ForEach{ (Measure-Command { [System.IO.File]::ReadAllText( $file ).Split("`r`n",[StringSplitOptions]::RemoveEmptyEntries) }).TotalMilliseconds }

Note: Shorthand below may refer to this as [IO.File]::ReadAllText() or just ::ReadAllText()

  • System.IO.StreamReader object using the .ReadLine() method

Reference: MS Documentation

StreamReader reads a stream of bytes as text. Usually, it’s more verbose than other techniques.  It’s not as neat as ::ReadAllLines() but it’s a common and well-advertised alternative to Get-Content.  Using StreamReader generally follows a loop pattern common to many languages.  Once the file is open, read and processing commands are placed in a loop stepping through each line until the EndOfStream value evaluates to true and executing the .Close() method immediately after.

1..10 | ForEach{ (Measure-Command {
$Stream = [System.IO.StreamReader]::new( $file )
While( !$Stream.EndOfStream ) { 
	$Stream.ReadLine()
	# Do some other stuff with the data…
}
$Stream.Close() } ).TotalMilliSeconds }

This pattern doesn’t return an array and cannot be piped. Of course that makes it a little more difficult to work with incoming lines. In practice, you’d probably assign the incoming line to a variable to work with it further. You can easily store the output in a variable to facilitate piping, but I’d only do so if it was already a requirement. It’s slower and more memory intense so if it’s just for piping you’re better off doing the work in the existing loop.

Note: Shorthand below may refer to this as $Stream.ReadToEnd() or just .ReadLine()

  • System.IO.StreamReader object using the .ReadToEnd() method
1..10 | ForEach{ (Measure-Command {
$Stream = [System.IO.StreamReader]::new( $file )
$Stream.ReadToEnd()
$Stream.Close() } ).TotalMilliSeconds }

1..10 | ForEach{ (Measure-Command {
$Stream = [System.IO.StreamReader]::new( $file )
$Stream.ReadToEnd() -split "`r`n"
$Stream.Close() } ).TotalMilliSeconds }

1..10 | ForEach{ (Measure-Command {
$Stream = [System.IO.StreamReader]::new( $file )
$Stream.ReadToEnd().Split("`r`n", [StringSplitOptions]::RemoveEmptyEntries )
$Stream.Close() } ).TotalMilliSeconds }

Note: Shorthand below may refer to this as $Stream.ReadToEnd() or just .ReadToEnd()


Observations:

The study confirms Get-Content quite a bit slower than other methods but there are some other very interesting observations. Below, I graphed some data from the 100MB file tests:

Note: I choose to display 100MB results because the graph seems a better representation. With the smaller files, relatively small differences were over-represented.

Note: Above, green are techniques that return an array, blue are single string returns and red are single string returns split after the fact.

Of those techniques that return an array, Get-Content is by far the slowest, taking 1271ms. .ReadLine() & ::ReadAllLines() averaged 585 & 675ms. That’s a significant difference that could really add up when processing many files. Get-Content -ReadCount 0 performed better but was still way behind both the .Net approaches which were respectively ~200/100ms faster.

I was surprised by the difference between the 2 .Net approaches above. I’ve always favored ::ReadAllLines() because it’s so easy to use in typical PowerShell code.  Whenever I’ve read about StreamReader I’d do a quick test and ::ReadAllLines() was always faster.  Now, looking at my results across file sizes it seems [IO.File]::ReadAllLines() is faster for smaller files, but $Stream.ReadLine() method is faster for “larger” files. Take a look at the below table.

FileSize[System.IO.File]::ReadAllLines()StreamReader’s .ReadLine() method
100KB1.332.05
2.5MB14.1416.60
25MB169.74153.60
50MB329.51290.87
100MB675585

This is an interesting find because it offers some logic on which technique to use when. If you’re processing many small files ::ReadAllLines() may perform better. If you’re dealing with larger files you may want to accept slightly more complex code to implement the StreamReader. Either way, both approaches are valid and perform far better than Get-Content.

Of course, I don’t know how these observations would play out in a larger program. $Stream.ReadLine() requires a loop. Assuming you pack further operations into the same loop the only additional overhead is from those operations. Any additional overhead needed to loop with[IO.File]::ReadAllLine() is not accounted for in these tests.

Given the admittedly arbitrary file sizes, more testing is necessary to determine where the performance advantage flips. Moreover, I’d like to see how this plays out in more realistic scripts. I’ll post a follow-up with that information as soon as I can pull it together.

The .Net methods that return a single string are the fastest overall. They perform similarly to one another. ::ReadAllText() outperformed .ReadToEnd() by a mere 15ms (407 Vs. 422ms) . Both .Net methods are very good alternatives to Get-Content -Raw which clocked in at 1049ms – ~2.5x slower!

Not surprising, but splitting the string after the fact added significant overhead. If you need an array ::ReadAllText() & .ReadToEnd() aren’t the best options. Unbelievably, and despite the extra overhead, when using the .Split() method both .Net methods were still faster than Get-Content alone.

Another revelation from these tests; .Split() consistently outperformed the -split Operator. This was true across all tested sizes but the differences were modest on smaller files and exaggerated larger ones. This seems to indicate splitting larger strings is faster using .Split(), but this too calls for a follow-up post. I’d like to re-test the 2 split techniques independent of file read operations. Some use cases may allow splitting on a single newline character so I also want to see how .Split() performs without removing the empties.

Memory Considerations:

Memory is a concern, particularly when processing many large files. obviously, the techniques that return a single string used the most memory, but there were still some surprises.

Note: These are peak measurements taken from perfmon during each test.

All the sessions started out using ~72MB. Get-Content & $Stream.ReadLine() had no detectable impact on memory! I was surprised to see that [IO.File]::ReadAllLines() used about 525MB.

I expected the techniques that return single string to use the most memory. Indeed Get-Content -Raw consumed 1.3GB even before splitting. However, $Stream.ReadToEnd() & [IO.File]::ReadAllText() were more modest at ~525MB. Get-Content -ReadCount 0 used ~600MB most likely because it has to pass all the file’s lines down the pipeline.

Memory is generally not a concern. PowerShell relies on .Net to manage memory through background garbage collection which frees unused memory either when needed or on a schedule. Different underlying collection behaviors may explain some of these disparities, particularly between .ReadLine() & ::ReadAllLines(). However, the larger the file the greater the risk of memory exhaustion.

All the methods that return a single string ran out of memory trying to read a 1GB file. This was true even when >3GB was available. Secondary testing showed storing the output in RAM required 3-4x the file size. Thankfully, even if you had a use case for single strings, you could certainly adapt one of the more memory friendly methods.


Conclusion:

The most glaring and unfortunate conclusion is that Get-Content is still unacceptable slow. Comparatively, Get-Content under performed in all use cases and permutations. However, PowerShell’s ability to utilize .Net classes offers a rich set of alternatives that cover pretty much any file read scenario.

It’s healthy to revisit old assumptions once in a while. Obviously I knew a bit about this topic beforehand, but going through a formal experiment uncovered some new information and questions. I’ll be writing an addendum soon to address the following points:

  1. [IO.File]::ReadAllLines() & $Stream.ReadLine(). Is the former faster for smaller files and the latter faster for larger ones. And if so, at what point does it flip? In other words, define large & small in this context.
  2. Determine if garbage collection impacting the performance differentials between [IO.File]::ReadAllLines() & $Stream.ReadLine() .
  3. Additional StreamReader examples & code patterns, merits & demerits of different approaches.
  4. Separate experiment to determine the performance difference between .Split() than -Split. Evaluate the additional impact of [System.StringSplitOptions]::RemoveEmptyEntries .

As always, I’d love to get some feedback.  Comment, click follow or grab the RSS feed to get notifications of future posts.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s