Uncovering a Memory Leak using WinDbg

The Crash…

Recently, we’ve been having a Windows service hosting a WCF endpoint crashing increasingly frequently due to out of memory errors. Although the usage of the service had increased, it didn’t seem likely that the service would actually consume its entire address space in the course of legitimate operation, so I naturally assumed a memory leak.

First, I perused all static collections and looked for any possible leaks. Only one small leak was found through inspection, which I strongly doubted could be the actual cause. Another staff recommended that the SQL server might be consuming all available memory, and that an upper limit should be set. We did indeed set the limit. However, these two changes did not resolve the issue. One of my coworkers suggested investigating the crash memory dump using WinDbg.

Using WinDbg

I hadn’t used WinDbg before and getting started was, frankly, daunting. After several hours, however, I was able to get an effective flow. The first mistake I made was loading the wrong bitness of WinDbg. The application in question is 32-bit (due to its dependency on several legacy DLLs), but I was trying to use the 64-bit WinDbg. No dice (but some strange and confusing errors that slowed me down). Finally I got the right WinDbg and the right memory dump and I was ready to go.

After opening the dump, the first step is to load the reference symbols for Microsoft DLLs:

.sympath srv*http://msdl.microsoft.com/download/symbols
.reload

Next, in order to enable debugging of .NET memory dumps, a specific .NET plugin, psscor, needs to be loaded. Since this is a .NET 4 app, version 4 of psscor is needed.
I heavily leaned on the article Intro to WinDbg for .NET Developers. Note in the subsequent examples references to “k:\windbg” represent the working directory where I accumulated the dump and helper files.

.load k:\windbg\psscor4.dll
.symfix c:mylocalsymcache 
.cordll -ve -u -l

These commands will be sufficient for basic .NET debugging. But as I thrashed around for several hours exploring the dump, I found another plugin that extends psscor with additional capability. The SOSEX extension specifically includes reference tracking, which I found very helpful.

.load k:\windbg\sosex.dll
!bhi

A helpful cheat sheet clearly describes all available commands at this point.

Another short article on debugging memory leaks got me on the right track. I also referenced a source describing how to access object values.

An overall view of memory usage, sorted from least total usage to most, is the first point of analysis:

!dumpheap -stat
Statistics:
      MT    Count    TotalSize Class Name
...[many small entries omitted]
71148b08    10520      1346560 System.Data.DataColumn
735ceb9c    17895      1431600 System.Runtime.Remoting.ServerIdentity
735f0764    13218      1506528 System.Int32[]
735f2388    10785      4252704 System.Collections.Hashtable+bucket[]
735d6a0c     3313      7084876 System.Decimal[]
711493a0       52     17129584 System.Data.DataRow[]
735ef46c    32056     21923408 System.String[]
735eed0c    12115     36322560 System.Object[]
735f35fc    59441     54614724 System.Byte[]
735ee918  3079311     64983264 System.String
71149898      932     90479536 System.Data.RBTree`1+Node[[System.Data.DataRow, System.Data]][]
735eba00  5610857    134660568 System.Guid
71148a30  2821656    191872608 System.Data.DataRow

I immediately noticed that the top entry was DataRow, and several other of the top entries seemed to be related (RBTree of DataRow, DataRow[], and DataColumn). Guid is also related, since all the rows in the database (and, thus, all DataRow in memory) are keyed on Guids. The application in fact uses DataTable heavily to bridge SQL and WCF. In no case, however, should the application be holding several million rows in active memory. According to this dump, more than 2 million DataRows are accumulated in process memory.

I suspected a leak where DataRows were being retained. However, no instance of DataTable, DataColumn, nor DataRow are being retained in any obvious or explicit cache or static table.

Next step is to investigate the details. Since DataRow is the largest, I selected it to investigate further. Note the “71148a30” refers to the MT (method table) from the previous output.

!dumpheap -mt 71148a30

This run outputs thousands of entries, but visually as they flow by, you can see that they are “clustered”, suggesting a relatively small number of collections with many many rows each. Supporting this visual analysis is the “52” instances of DataRow[] in the first listing.

Address       MT     Size
015540a8 71148a30       68     
0155412c 71148a30       68     
015541b0 71148a30       68     
01554234 71148a30       68     
015542b8 71148a30       68     
...[thousands of rows omitted]
3ad0c110 71148a30       68     
3ad0c17c 71148a30       68     
3ad0fd78 71148a30       68 

So we’ve confirmed that we have a variety (probably 52) DataTables in memory each with many, many rows. This latter part is unusual, since there are few places in the application which expect many, many rows. Most DataTables have 1 to 100 rows, not thousands+.

It might be useful, instead, to follow the chain of the 52 DataRow[] entries. Note the MT reference for them from the first table: 711493a0

!dumpheap -mt 711493a0
 Address       MT     Size
015ee00c 711493a0       12     
015f16e4 711493a0      524     
01a6506c 711493a0       12     
01c60bb4 711493a0      524     
...[couple dozen rows omitted, all size = 524]   
2c276c28 711493a0      524     
2c6a5278 711493a0      524     
3996c578 711493a0    65548     
40a98ed8 711493a0   262156     
53071010 711493a0 16777228     

Whoa what’s this. A single DataRow[] (that is, a single table) is taking a HUGE amount of memory. Naturally that’s a target of interest. However, the many smaller entries could be interesting as well. We can investigate both.

The first leak – many smaller objects

First, we’ll investigate the smaller, many instance.

We can use the gcroot command to find the path (or paths) from the object to the active, live object at the root of the application.

!gcroot 2c276c28 
HandleTable:
    00c613ec (pinned handle)
    -> 024c34c8 System.Object[]
    -> 01bff620 System.Collections.Generic.Dictionary`2[[System.Object, mscorlib],[System.Collections.Generic.List`1[[Microsoft.Win32.SystemEvents+SystemEventInvokeInfo, System]], mscorlib]]
    -> 01bff968 System.Collections.Generic.Dictionary`2+Entry[[System.Object, mscorlib],[System.Collections.Generic.List`1[[Microsoft.Win32.SystemEvents+SystemEventInvokeInfo, System]], mscorlib]][]
    -> 01c0d9f0 System.Collections.Generic.List`1[[Microsoft.Win32.SystemEvents+SystemEventInvokeInfo, System]]
    -> 0dee8adc Microsoft.Win32.SystemEvents+SystemEventInvokeInfo[]
    -> 2c2214ac Microsoft.Win32.SystemEvents+SystemEventInvokeInfo
    -> 2c22148c Microsoft.Win32.UserPreferenceChangedEventHandler
    -> 2c27b3ac System.Windows.Forms.ContextMenuStrip
    -> 2c27b650 System.ComponentModel.EventHandlerList
    -> 2c27c650 System.ComponentModel.EventHandlerList+ListEntry
    -> 2c27c388 System.ComponentModel.EventHandlerList+ListEntry
    -> 2c27c354 System.ComponentModel.EventHandlerList+ListEntry
    -> 2c27c320 System.ComponentModel.EventHandlerList+ListEntry
    -> 2c27c300 System.EventHandler
    -> 2c27b178 Microsoft.Reporting.WinForms.ReportPanel
    -> 2c27acc8 Microsoft.Reporting.WinForms.WinRSviewer
    -> 2c279e64 System.Windows.Forms.Panel
    -> 2c279af0 System.Windows.Forms.TableLayoutPanel
    -> 2c2798a0 Microsoft.Reporting.WinForms.RVSplitContainer
    -> 2c277a3c System.Windows.Forms.Panel
    -> 2c2776d4 System.Windows.Forms.TableLayoutPanel
    -> 2c277484 Microsoft.Reporting.WinForms.RVSplitContainer
    -> 2c276e84 Microsoft.Reporting.WinForms.ReportViewer
    -> 2c277114 Microsoft.Reporting.WinForms.ReportHierarchy
    -> 2c277124 System.Collections.Generic.Stack`1[[Microsoft.Reporting.WinForms.ReportInfo, Microsoft.ReportViewer.WinForms]]
    -> 2c21f0ac Microsoft.Reporting.WinForms.ReportInfo[]
    -> 2c21efec Microsoft.Reporting.WinForms.ReportInfo
    -> 2c220fc4 Microsoft.Reporting.WinForms.LocalReport
    -> 2c28de40 Microsoft.Reporting.WinForms.SubreportProcessingEventHandler
    -> 2c218b40 <error>
    -> 2c22ce00 System.Data.DataSet
    -> 2c22ce6c System.Data.DataTableCollection
    -> 2c22ce90 System.Collections.ArrayList
    -> 2c2398f8 System.Object[]
    -> 2c22f884 System.Data.DataTable
    -> 2c22fa0c System.Data.RecordManager
    -> 2c276c28 System.Data.DataRow[]

That’s a lot to take in. The list goes from the bottom (the instance in question) to the top. On the way, it tells us this table is owned by a WinForm SSRS reporting component which in turn is held alive by an event handler.

You might recall this application is not a Desktop app; it is a headless Windows service. Nevertheless, we embed the Winforms SSRS component to be able to convert RDL into PDF as part of an automated process. This happens occasionally, and once the PDF is done no further work is done on the data. There is absolutely NO REASON multiple alive references should exist to this component. It’s a leak!

But why?

Looking into the code exposes the reason:

var sia = new WsbisReports.BridgeInventory.SI.SIA();
sia.PrepareReport(result, p);
// ... find destFilePath
sia.CreatePDF(destFilePath);  

Here SIA inherits from a class which encapsulates WinForms LocalReport. What’s missing? Dispose. LocalReport is a disposable class, but my consumers never call Dispose on it. Corrective action? Make the SIA (more correctly, the base RDLReport class in the application) implement IDisposable, have it Dispose the LocalReport object, and wrap all instances in a using statement to ensure disposal.

A discussion online indicates that, in additional to disposal, several other actions might be helpful:

  • report.ReleaseSandboxAppDomain();
  • Unsubscribe to any SubreportProcessing events
  • Clear Data Sources
  • Dispose the report

The first three actions can be completed in the new Dispose method once the encapsulating class implements IDisposable.

A few large objects

There are three larger DataTable objects in this dump. We’ll look at each one. We’ll start with the smallest of the three.

!gcroot 3996c578 
Thread 9e78:
    0b97ef3c 7072cf80 System.ServiceModel.Channels.ServerReliableChannelBinder`1+DuplexServerReliableChannelBinder`1[[System.__Canon, mscorlib],[System.__Canon, mscorlib]].OnSend(System.__Canon, System.ServiceModel.Channels.Message, System.TimeSpan)
        ebp-10: 0b97ef4c
            ->  39c11654 System.ServiceModel.Security.SecurityAppliedMessage
            ->  39c11158 System.ServiceModel.Dispatcher.OperationFormatter+OperationFormatterMessage
            ->  39c11178 System.ServiceModel.Dispatcher.OperationFormatter+OperationFormatterMessage+OperationFormatterBodyWriter
            ->  390f6220 System.Data.DataSet
            ->  390f628c System.Data.DataTableCollection
            ->  390f62b0 System.Collections.ArrayList
            ->  3922fa2c System.Object[]
            ->  3922e600 System.Data.DataTable
            ->  3922e788 System.Data.RecordManager
            ->  3996c578 System.Data.DataRow[]

Conclusion: This is an “in-flow” object being processed by WCF. Not a leak.

How about the medium-sized object?

!do 40a98ed8 
Name:        System.Data.DataRow[]
MethodTable: 711493a0
EEClass:     7322f4e4
Size:        262156(0x4000c) bytes
Array:       Rank 1, Number of elements 65536, Type CLASS

First glance, it appears that the table has allocated an array for up to 65,536 rows. That’s a lot of rows compared to the in-memory tables normally processed. We can trace the objects life through the garbage collector.

!gcroot 40a98ed8 
Thread 2668:
    1531daf0 07887608 WsbisWcfService.dll!Unknown
        ebp+e8: 1531dbdc
            ->  3a09b9d0 System.Data.DataTable
            ->  3a09bb58 System.Data.RecordManager
            ->  40a98ed8 System.Data.DataRow[]

This object is alive, with a reference being held by our service.

We can first use the owning DataTable to find out the actual number of rows (65,536 is just the allocated array size).

!mdt 3a09b9d0 
3a09b9d0 (System.Data.DataTable)
...[omitted rows]
    nextRowID:0xad77 (System.Int64)
...[omitted rows]

0xad77 = 44,407 in decimal. So the table has about 44 thousand rows. Still a lot of rows for what the app normally does, but not inconceivably huge.

What’s the location of the hold? Since I’m new to WinDbg, I wasn’t able to figure out how to get the method name from the address, but I was able to decompile the method:

!u 07887608
Normal JIT generated code
WsbisWcfService.dll!Unknown
Begin 07886ef8, size 9bf
07886ef8 55              push    ebp
07886ef9 8bec            mov     ebp,esp
...[omitted rows]
07886f95 ff35b4dd6002    push    dword ptr ds:[260DDB4h] (" request received from ")
...[omitted rows]
07887603 e8f89149f9      call    00d20800 (WsbisClient.dll!Unknown, mdToken: 0600004d)
>>> 07887608 e8704ec86c      call    clr!IL_Rethrow (7450c47d)
0788760d c745e400000000  mov     dword ptr [ebp-1Ch],0
...[omitted rows]
0788769d 8b15dcdd6002    mov     edx,dword ptr ds:[260DDDCh] ("SYS_CHANGE_OPERATION")
...[omitted rows]

So the key thing I picked up in the giant code dump is the string constants, which let me identify the method in question. Further, I note that the instruction is a “rethrow”, so I tracked down the try-catch with a rethrow between the two string constants in the method. This lets me precisely identify where the execution pointer was in method was when the dump was created.

The method is called “GetChangeTable” and is used by the client to request a table of rows that have changed for local database synchronization purposes. However, 44,000 rows in a single request is unusually high for synchronization, which happens rather continually. Even when the client is disconnected for several days, a single batch of 44,000 is more than expected.

Let’s find out more before moving on.

Finding the data in a DataTable

In normal .NET code, we can access the data in a DataTable via an indexer on the Rows property, e.g. t.Rows[0][“column”]

However, looking at the memory dump for the DataTable, it is more difficult to find the actual data. I ended up referencing the actual .NET source code for DataRow to find out where the data is stored:

/// <devdoc>
///    <para>Gets
///       or sets all of the values for this row through an array.</para>
/// </devdoc>
public object[] ItemArray {
  get {
    int record = GetDefaultRecord();
    _table.recordManager.VerifyRecord(record, this);
    object[] values = new object[_columns.Count];
    for (int i = 0; i < values.Length; i++) {
      DataColumn column = _columns[i];
      VerifyValueFromStorage(column, DataRowVersion.Default, column[record]);
      values[i] = column[record];
    }
    return values;
  }

So the row's item array isn't real at all! The data is actually stored on the column.

!mdt 3a09b9d0 
3a09b9d0 (System.Data.DataTable)
...[omitted rows]
    columnCollection:3a09bb90 (System.Data.DataColumnCollection)
...[omitted rows]

!mdt 3a09bb90
3a09bb90 (System.Data.DataColumnCollection)
...[omitted rows]
_list:3a09bbc8 (System.Collections.ArrayList)
...[omitted rows]

!mdt -e:2 3a09bbc8
3a09bbc8 (System.Collections.ArrayList)
[0] 3a0dc77c (System.Data.DataColumn)
...[omitted rows]
    _columnName:3a0dc6ec (System.String) Length=20, String="SYS_CHANGE_OPERATION"
...[omitted rows]
    _storage:3a0dc908 (System.Data.Common.StringStorage)
...[omitted rows]

!mdt 3a0dc908
3a0dc908 (System.Data.Common.StringStorage)
    Column:3a0dc77c (System.Data.DataColumn)
...[omitted rows]
    values:40a18e98 (System.String[], Elements: 65536)

Let's stop here for a minute. We've found that the first column is named "SYS_CHANGE_OPERATION", which confirms that this table is a GetChangeRows synchronizer table (because that's the hard-coded column name in the SQL procedure that generates the change table result). We've further found the backing array of up to 65,536 values for the column. This column indicates a command code which describes what kind of operation will be done (update, insert, delete, or re-sync).

!mdt -e:2 -count:5 40a18e98
40a18e98 (System.String[], Elements: 65536)
[0] 3a0dc8e0 "R"
[1] 3a0dc9a8 "R"
[2] 3a0dca14 "R"
[3] 3a0dca80 "R"
[4] 3a0dcaec "R"

The elements are command code "R", which is used when the client needs to resync the entire table. That explains why there are so many rows! The server already has a mechanism that allows it to tell the client to request a subset of rows. Maybe to save memory, the server should ALWAYS require the client to request a subset of rows when a resync is performed? That will mean smaller DataTable allocations.

Looking at the source code reveals a rather silly "bug": The server has a maximum row limit. If the row limit is exceeded, it will refuse to send the table to the client and send a message telling the client to request fewer rows. However, the server still loads the ENTIRE table from the SQL procedure, and then counts the rows using Rows.Count. This means, at least temporarily, an unnecessary very large table will exist in memory! The SQL procedure should be modified to take the max rows as a parameter, and simply return a message to the service if there are too many rows (rather than returning the whole table, having the service instantiate it in memory, and then count the rows).

The "final" twist

We've uncovered a lot so far, but there's still that monster giant DataTable to investigate.

!do 53071010 
Name:        System.Data.DataRow[]
MethodTable: 711493a0
EEClass:     7322f4e4
Size:        16777228(0x100000c) bytes
Array:       Rank 1, Number of elements 4194304, Type CLASS

Wow! That's a huge table! Note that the number of rows FAR exceeds the maximum row limit (described previously) that should ever be sent! We can trace the objects life through the garbage collector.

!gcroot 53071010 
Finalizer Queue:
    19d71aac
    -> 19d71aac System.Data.DataColumn
    -> 19d6c49c System.Data.DataTable
    -> 19d6c624 System.Data.RecordManager
    -> 53071010 System.Data.DataRow[]

Warning: These roots are from finalizable objects that are not yet ready for finalization.

Wait, what? I had always assumed objects had two states: alive (active reference) or ready for garbage collection. It turns out there is a third state: in the finalizer queue. Objects which have a finalizer are placed on the finalizer queue. The object must be finalized before it is eligible for garbage collection, even if no references exist. The finalizer, in turn, runs through the queue on a single thread in the background. Discussion on SO.

Generally, finalization queue is avoided by having the Dispose method call SuppressFinalize, exempting the object from the finalizer queue.

But disposing is only used when an object incurs unmanaged resources. Why is it happening here, for a DataTable?

According to one discussion:

The Dispose method in DataSet exists ONLY because of side effect of inheritance-- in other words, it doesn't actually do anything useful in the finalization. The class DataSet inherits from "System.ComponentModel.MarshalByValueCompenent" which implements the IDisposable interface because it is a component.

Great. So DataTable's have a finalizer, which means they get stuck in the finalization queue for a while before the garbage collector can dispose of them. But wait! When we return a DataTable via WCF, WCF will call dispose on both parameters that are passed in and returned parameters. So how can this be explained?

Looking at the service code, we already learned that when a too-large table is created, it is instantiated in memory and the rows counted. Then, if there are too many rows, the table is NOT returned via WCF to the client; instead, a message telling the client to "request fewer rows" is sent. What happens to the DataTable (the one with TONS of rows)? It simply goes out of scope. No Dispose is ever called. That's why the large table is waiting on the finalizer queue, blocking lots of memory.

This problem could be avoided by implementing the SQL-side row counting (as described above), and also by manually calling Dispose() if the service decides it will not be returning the table to the client.

Conclusions

Garbage collection isn't magic. Some actions should be taken to ensure long-running applications and services don't leak memory. WinDbg is an awesome tool for uncovering memory leaks.

Planned changes for this particular application. Some of these might seem obvious in hindsight, but at the time, I just assumed the overhead wouldn't be a problem.

  • In RDL handling
    • report.ReleaseSandboxAppDomain();
    • Unsubscribe to any SubreportProcessing events
    • Clear Data Sources
    • Dispose the report
  • In GetChangeTable
    • The server should ALWAYS require the client to request a subset of rows when a resync is performed
    • The SQL procedure should be modified to take the max rows as a parameter, and simply return a message to the service if there are too many rows (rather than returning the whole table, having the service instantiate it in memory, and then count the rows)
    • If the service generates a DataTable that it decides not to return to the client, Dispose() should be called
    • When DataTable is placed into a larger object for WCF serialization, that object should be marked with IDisposable and should, on Dispose, dispose the DataTable

.NET Decimal Precision

I was confronted by a very unusual claim by a coworker: two .NET Decimals, containing the same value, running through the same block of code, were rendering differently (one as, say, “120” and one as “120.00”). One of the big advantages of Decimal is that, unlike Float or Double, a Decimal represents an exact base-10 value whereas Float and Double are base-2, and so can only approximate many common fractional base-10 values. Therefore, I don’t normally expect problems of this kind with Decimal.

I initially assumed that the coworker was confused; some other input must be coming in, or a different section of code executing, or a different string formatting code used. However, the problem clearly reproduced. Further, it occurred even when no fractional component existed. But stripping away the extraneous, the essence of the code was fairly straightforward:

decimal v = ...;
return v.ToString();

Detailed inspection revealed that in one case, the decimal value was loaded from the user input, and in another case, it was loaded from a database table. In order to test the significance of this, I made a test case which performed both side-by-side. At first, everything seems routine:
decimals

What happened next was anything but routine:
decimals2

What would cause one decimal to render with 2 decimal places and another, the exact same value, to render without them? It turns out that the .NET Decimal implementation intentionally includes the concept of significant trailing zeros, which have no impact on calculations but can carry information about the value’s precision.

Although there is no direct method to expose the precision, the details of the significant trailing zeros arrangement are discussed in the Decimal.GetBits method documentation. In this case, it is clear that the same logical value can be represented with different exponents. In the case above, we can have a value of 120 with an exponent of 0, and a value of 12000 with an exponent of 2 (the exponent “indicates the power of 10 to divide the integer number”), so 12000 * 10-2 = 120.00.

This is indeed confirmed by analysis. The first three bytes contain the value, while the exponent is defined as “Bits 16 to 23” of the fourth byte.

Decimal.GetBits(vFromUser) = [120, 0, 0, 0]
Decimal.GetBits(vFromDb) = [12000, 0, 0, 131072]
131072 >> 16 = 2

This confirms the vFromDb value is represented as 12000 * 10-2 while vFromUser is represented as 120 * 100. Although these values are logically equal, the default implementation of Decimal.ToString() outputs the value with all significant zeros, including trailing zeros.

Although the Decimal class does not expose properties for the precision nor the scale, it is possible to take advantage of the helper SqlDecimal class to access these values. In this context, precision means the total number of digits and scale means the number of digits to the right of the decimal place.

var sFromDb = new System.Data.SqlTypes.SqlDecimal(vFromDb);
Console.WriteLine(sFromDb.Precision);
Console.WriteLine(sFromDb.Scale);

For vFromDb, this outputs a precision of 5 and a scale of 2; while for vFromUser this outputs a precision of 3 and a scale of 0.

Arbitrary Clustering in SQL: Tumbling down the rabbit hole

Subtitle: “Things are not as they appear, nor are they otherwise.”

For a demo (in other words, nothing matters), I wanted to take a set of rows of sample data and cluster them into a few groups approximately the same size. I figured “I’ll use a random number”. My first flaw, of course, was to decided on a solution before I fully considered the meat of the problem. But anyways…

So I fool around with RAND() in SQL and I’m not getting good results. I look online and find this Microsoft article: Selecting Rows Randomly from a Large Table. Sounds like it might have something good. They advise a syntax like this for random [see aside, at bottom]:

SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(NEWID()) *
  RAND()) as int)) % 100)

Of course replacing 100 with whatever upper bound is desired.

I’ll try it out with my table:

SELECT CustomerID, 
  (ABS(CAST((BINARY_CHECKSUM(NEWID())) as int)) % 3) 
  FROM dbo.Customers

Looks good! Eyeballing indicates an approximately equal distribute of 0, 1, and 2. No other values.

Now I want to substitute a named value for each possible random value, so I put this expression into a case statement, like:

SELECT CustomerID, 
  CASE (ABS(CAST((BINARY_CHECKSUM(NEWID())) as int)) % 3) 
    WHEN 0 THEN 'Alice'
    WHEN 1 THEN 'Bob'
    WHEN 2 THEN 'Charlie'	
   END Value,
 FROM dbo.Customers

Now things get interesting. Some of the values returned are NULL! How is that possible. Furthermore, if I investigate the distribution, e.g. with:

SELECT TBL.Value, COUNT(*) FROM
(
  SELECT CustomerID, 
    CASE (ABS(CAST((BINARY_CHECKSUM(NEWID())) as int)) % 3) 
      WHEN 0 THEN 'Alice'
      WHEN 1 THEN 'Bob'
      WHEN 2 THEN 'Charlie'	
    END Value	
  FROM dbo.Customers
) TBL
GROUP BY TBL.Value

I get a very non-uniform distribution. I tried it again on a table with more rows, to get a better feel for the distribution, and found an experimental result of:

Value Count Approx % of Rows
Alice 696 32%
NULL 621 29%
Bob 527 24%
Charlie 311 14%

Repeated runs returned similar results. Removing the case statement and just having the inner expression resulting in an approximately uniform distribution of 0, 1, and 2.

How can this be?

I thought about it for a while and concluded that the expression must be being re-evaluated for each “when” of the case statement. This is the only thing I could come up with that would allow for the NULL possibility. But would it generate the distribution being observed? If so, that would confirm the effect.

Since we observed the generator alone was approximately normal, we’ll call it f() and say that the range of f() is {0, 1, 2} with uniform probability.

If f() is being re-evaluated at each WHEN, we end up with a probability distribution as follow:
Probability Distribution

Grouping and summing the probabilities gives us:

Value Theoretical Probability Observed Approximate Probability
Alice (0) 1/3 = 33.3% 32%
NULL 8/27 = 29.6% 29%
Bob (1) 2/9 = 22.2% 24%
Charlie (2) 4/27 = 14.8% 14%

Theory confirms observation: f() is being re-evaluated at each WHEN.

But you already know what I’m going to tell you

Remember the whole intention of this, originally, was to arbitrarily (not necessarily randomly!) partition a set of records into several roughly equal sets. There is a much more straightforward approach if randomness is not required: simply use ROW_NUMBER().

SELECT CustomerID,
  CASE (CAST(ROW_NUMBER() OVER(ORDER BY CustomerID) AS int) % 3) 
    WHEN 0 THEN 'Alice'
    WHEN 1 THEN 'Bob'
    WHEN 2 THEN 'Charlie'	
  END Value
FROM dbo.Customers

Slight Aside

I went back and reviewed the original article and found that NEWID() is actually only advised in conjunction with other named columns. Otherwise, BINARY_CHECKSUM(*) is advised. If BINARY_CHECKSUM(*) is used, everything seems to work out fine! So, to be more specific, NEWID() (and thus, BINARY_CHECKSUM) is being re-evaluated at each row, but RAND() is not. Further, the article notes that RAND() is not needed when NEWID() or a column is specified. So, this whole problem originated in my skimming the article too quickly, and conflating two examples into one!

Posted in SQL

Slow parameterized queries

I was working on a stored procedure when suddenly test runs started to timeout (>30 seconds). I extracted the procedure, replaced the parameters with the same constants, and ran it. Finished instantly. I spent a lot of time focusing on the “stored procedure” aspect of it (trying things such as WITH RECOMPILE and shadow variables), but the problem turned out to revolve around the “parameterized” aspect of it.

We ended up reducing the query to this small query, where each table/view has at most 10,000 rows:

SELECT *
  FROM tbl6ControlEntity CE 
  LEFT JOIN dbo.tbl7Accounts AS PM 
  ON PM.record_gid = CE.program_manager_gid
WHERE 																					
  (@Insp = 0 OR CE.control_entity_gid IN
    (SELECT XIP.control_entity_gid FROM 
    viewWSBIS_GetControlEntities_Inspections XIP WHERE 
    XIP.master_cd_flag = 1 AND XIP.cert_no = 
      (SELECT AC.inspector_cert_no FROM tbl7Accounts AC 
       WHERE AC.record_gid = @InspGid)))
ORDER BY CE.sort_br_no	

When we put in constants for the parameters @Insp and @InspGid, the query ran in less than 1 second. When we declared and set them in the query window, it took indefinite amounts of time. Using a temp table instead of the view helped, but wasn’t a complete solution. We tried replacing the inner-most uncorrelated subquery with a join. No impact.

I tried most of the common suggestions, such as OPTIMIZE FOR @Insp = 1 and OPTIMIZE FOR UNKNOWN. No impact.

Finally, I added OPTION (RECOMPILE) to the query, which fixed the problem. I’m surprised the WITH RECOMPILE on the stored procedure did not, however.

In general, the need for OPTION (RECOMPILE) indicates a need to rebuild statistics, however, in the cases where flags are used to control execution of certain subconditions, I’m not sure the statistics can cope.

See also: OPTION (RECOMPILE) is Always Faster; Why?

Posted in SQL

Safer SQL

When writing adhoc queries, I find myself writing UPDATEs and DELETEs where I know how many rows should be affected, or, in a stored procedure, maybe I have a variable that I can reference that tells me that. With LINQ, we can use .Single() to throw an exception if there is more than one result, for example. It would be helpful if SQL had an extension that let us specify exactly how many rows should be affected, and if not, the whole command is canceled.

For example, imagine we could

UPDATE EXACTLY 1 table SET field = value WHERE key = whatever

or

DELETE EXACTLY 16 FROM table WHERE field = whatever

And those commands would fail and do nothing if they would affect more or less than 1 and 16 rows respectively.

Of course, SQL doesn’t support this syntax. A workaround is to use the output clause and a transaction. Alternatively, @@ROWCOUNT and a transaction can also be used. However, writing these transaction wrappers around a statement is a lot of text and space for a small benefit. Could these statements be automated somehow?

We can take advantage of implicit transactions in triggers to control success of the overall command. By executing a rollback within the trigger, the update or delete command itself will be rolled back, without an explicit begin transaction.

So, in use, this looks like:

EXEC RowCountExactly 'tblBridges', 2
UPDATE tblBridges SET fips_code = 12345;

In this case, more than 2 rows would have been updated, so the command fails:

Unexpected number of rows affected; 156507 vs 2 expected
Msg 3609, Level 16, State 1, Line 6
The transaction ended in the trigger. The batch has been aborted.

The implementation is a stored proc “RowCountExactly” that creates a trigger on the specified table. The trigger checks if the rowcount differs from the specified value. If so, it rolls back the batch. Either way (succeed or fail), the trigger also removes itself, so that it only runs a single time.

IF OBJECT_ID ( 'dbo.RowCountExactly', 'P' ) IS NOT NULL 
    DROP PROCEDURE dbo.RowCountExactly;
GO
CREATE PROCEDURE dbo.RowCountExactly
(
	@TableName nvarchar(MAX),
	@RowCount int
)
AS
BEGIN
	IF OBJECT_ID ('dbo.RowCountCheck','TR') IS NOT NULL
		EXEC('DROP TRIGGER dbo.RowCountCheck');
	
	DECLARE @trigger varchar(max);

	SET @trigger = 'CREATE TRIGGER dbo.RowCountCheck ON ' + 
	  CONVERT(varchar(max), @TableName) + 
	' AFTER INSERT, UPDATE, DELETE
	AS	
	BEGIN
		DECLARE @actual int;
		SET @actual = @@ROWCOUNT;		

		IF @actual <> ' + CONVERT(varchar(max), @RowCount) + '
		BEGIN
			ROLLBACK;
			IF OBJECT_ID (''dbo.RowCountCheck'',''TR'') IS NOT NULL
				EXEC(''DROP TRIGGER dbo.RowCountCheck'');

			RAISERROR (N''Unexpected number of rows affected; %d vs %d expected'',
				10, 0, 
				@actual, ' + CONVERT(varchar(max), @RowCount) + ');
		END
		ELSE
			IF OBJECT_ID (''dbo.RowCountCheck'',''TR'') IS NOT NULL
				EXEC(''DROP TRIGGER dbo.RowCountCheck'');
	END		
	';

	EXEC(@trigger);
	

END;
GO
Posted in SQL

Using TRY_CONVERT in SQL Server 2012+ to migrate from varchar fields to strongly-typed fields

We have a SQL 2008 database which has some varchar fields used to store times. I wanted to migrate them to use actual time fields, but the format entered into the varchar field had been inconsistent (as previously, there was no validation on the field). In addition to properly formatted times, some entries used . instead : (for example, 12.38), some entries had am/pm (or just a/p) on the end, some didn’t. A simple convert(time, …) would not suffice.

One way to approach is to TRY_CONVERT (new in SQL 2012: converts if able, returns null otherwise) and chain them with COALESCE in order to try multiple formats for a single field.

Consider the following example, assuming field is the source varchar field and convertedField is the new time-typed field.

UPDATE tableName SET
  convertedField = COALESCE(
   TRY_CONVERT(time, field),
   TRY_CONVERT(time, REPLACE(field, '.', ':')),
   TRY_CONVERT(time, REPLACE(field, 'a', ''),
   ...
)	  

Two other things to consider with a conversion of this type:

  1. Users may mix 12 and 24 hour time standards. In this case, we had entries like 13:30 (clear) and 1:30 that were used to reflect the same time. Some additional logic and testing is needed to attempt to “smart migrate” afternoon times that are not marked with pm and are in 12-hour format.
  2. If the database is being imported from SQL 2008, you need to upgrade the compatibility level to gain access to 2012 features:
    ALTER DATABASE DatabaseName
    SET COMPATIBILITY_LEVEL = 110;
    

Quickly Tell When You’re Connected to Production in SQL Management Studio

This is one of those things that “everybody knows” but I don’t find out about until years later (like pushing Windows-L to lock the workstation).

If you have access to various environments in SQL (dev, multiple test/QA machines, and production) sometimes there can be fear: am I connected to the correct server? I often stop and re-read the connection information to be sure I’m on the right server before I execute a command that could be particularly troublesome. I intentionally minimize my permissions in production to help reduce this, but there are still some permissions that exist; and some folks have more production permissions because they change it frequently.

So, here’s an easy way to help reduce anxiety. In SQL Management Studio 2008 and above, you can set custom status bar colors depending on which server you are connected to. Very easy.

When establishing the connection, simply press the “Options” button on the bottom right:
sql1

Then, you can check the box to show a custom color. SQL Management studio will save this value, and for all new query windows that are opened, will highlight the bottom status bar if that server is connected. An easy way to “warn” yourself if connected to a shared environment, and a quick confidence to know when you’re connected to a local dev environment only.

sql2

Posted in SQL