Diagnosing SQL Server: Identifying Queries Causing TempDB Is Full Errors with Extended Events

dba challenges newsletter sql server Feb 10, 2024

Edition: Saturday, February 10th, 2024

Sign up for the DBA Challenges Newsletter here --> DBA Challenges - Newsletter (automatesql.com).

It's hard to believe we're already in the 2nd week of February 2024.  

Today, I will focus on how to leverage Extended Events to identify and troubleshoot queries that contribute to tempdb filling up.  I just recently encountered this problem and attempted to use sp_whoIsActive and sp_BlitzFirst.  Unfortunately, both of these have a dependency on tempdb and encounter the dreaded "Could not allocate a new page for database 'tempdb' because the 'PRIMARY' filegroup is full....." error.

Tempdb is often referred to as the "workhorse" of SQL Server, given its role in temporary object storage, query sorting, and much more.  However, this also makes it susceptible to rapid space consumption, which can lead to system slowdowns or even crashes.  Understanding how to pinpoint the culprits behind tempdb usage spikes can save hours of troubleshooting and prevent performance degradation.

Extended Events are underutilized by many DBAs (including me) due to their perceived complexity and the initial learning curve involved.  There's a common misconception that XEvents are cumbersome to set up and interpret, leading many to rely on older, less efficient methods like SQL Profiler for diagnostics.

Learning Extended Events is not just about keeping TempDB in check; it's about embracing the full diagnostic capabilities of SQL Server to become a more effective and proactive DBA.

There are a lot of great monitoring tools available that help track tempdb issues.  However, all come with a cost.  SQL Server has a fantastic set of built-in capabilities that we, as DBAs, can use.  Here's what I'll cover today:

  • Create an Extended Event Session - This session will capture queries that encounter a 1101 or 1105 error.  We'll cover these errors below.
  • Review results - We'll look at two methods you can use to review events captured by this session.

 



Error descriptions:


The session we'll create is designed to monitor and capture specific error events related to page allocation issues within the tempdb database.  These errors can be encountered due to the disk filling up or tempdb growing to the maximum allowed size.  We're specifically filtering on event IDs of 1101 and 1105 within tempdb (database_id 2).

  • Error 1101 - Could not allocate a new page for database 'tempdb' because the 'PRIMARY' filegroup is full due to lack of storage space or database files reaching the maximum allowed size. Note that UNLIMITED files are still limited to 16TB. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
  • Error 1105 - Could not allocate space for object '%.*ls'%.*ls in database 'tempdb' because the 'PRIMARY' filegroup is full due to lack of storage space or database files reaching the maximum allowed size. Note that UNLIMITED files are still limited to 16TB. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

These errors are very similar in meaning and require the same solution: either add disk space to allow tempdb to grow, increase the max size limit, or modify the query(s) causing excessive tempdb growth.  

Create the XEvent Session:


The code below can be used to create the Extended Event session.  This will capture all queries that encounter either the 1101 or 1105 error. 

 

This session is designed to monitor and capture specific error events related to page allocation issues within the TempDB database (which has a database_id of 2). Here's a breakdown of what this script does:

  • CREATE EVENT SESSION: This initiates the creation of a new Extended Events session named page_allocation_error on the server.

  • ADD EVENT sqlserver.error_reported: Specifies the type of event to capture. In this case, it's capturing error events reported by SQL Server. The sqlserver.error_reported event occurs when an error is reported by the SQL Server database engine.

  • ACTION(...): This part of the event definition specifies additional data to capture when the event occurs, including the database_id, session_id, sql_text (the SQL query text that caused the error), tsql_stack (a representation of the T-SQL call stack at the time of the error), session_nt_username (the Windows username of the connected session), and plan_handle (which can be used to retrieve the query execution plan).

  • WHERE: This clause filters the events to be captured based on specific conditions:

    • The error must occur in the TempDB database (database_id = 2).

    • The session_id must be greater than 50, excluding system sessions that typically have lower session IDs.

    • The error number must be 1101 or 1105, specific to space allocation issues within SQL Server databases.

  • ADD TARGET package0.event_file and ADD TARGET package0.ring_buffer: These lines define two targets for the event session data. The event_file target writes the captured data to a file named page_allocation_error.xel, with a maximum file size of 100 MB and up to 10 rollover files. The ring_buffer target stores the event data in memory with a limit of 5000 events and 4096 KB of memory.

  • WITH (MAX_MEMORY=4096 KB,...): This part specifies various session options, such as the maximum memory the session can use, event retention mode, maximum dispatch latency for processing events, and whether to track causality among events.  The startup_sate of ON ensures the session is started whenever SQL Server is restarted.

After creating the session, you can find it in SSMS under Management, Sessions.

Reviewing Results:


The data contained within the *.xel files is XML.  I'll show you two methods for reviewing the results once either error is encountered.

The first method is to use SSMS.  By right-clicking on "page_allocation_error" and then selecting "Watch Live Data," you can see the events as they happen.

 

The second method, my preferred way, is to use a query to parse through the XML.

Using this query, you can review the events which have happened.  Here, I select the top 20 and order by the timeStamp descending.  This will return all queries that have encountered the error, not just the culprit.  However, I've found by looking at the query that encountered the error first is typically the culprit, and I start my analysis there.

Conclusion:


Troubleshooting tempdb space issues can be a real pain if you don't have a method for tracking what query caused the issue.  Hopefully, you can use XEvents to highlight that and make the analysis a bit easier.  I'd recommend trying this out in your sandbox and generating the 1101 and 1105 errors.  Nothing beats hands-on experience when learning new concepts.

By the way, I'm curious about any challenges you're currently struggling with in the SQL Server or automation space.  Drop a comment below.  I'd love to hear about these, and maybe we'll highlight them in an upcoming issue or live event.  If you haven't done so, I invite you to join the DBA Challenges Newsletter.  You can find it at -->DBA Challenges - Newsletter (automatesql.com)

Thanks for subscribing, and I look forward to hearing from you! 

Whenever you're ready, there is one way I can help you gain hands-on experience:Ā 

Automated Sandbox Fundamentals: I teach how to build a virtual lab using automation in this course. Learn how toĀ create golden images, using both Windows and Linux, to easily spin up and add additional machines to your sandbox.Ā  It's packed with 8 modules and the scripts you'll need to build your environment.Ā  Start small, andĀ scale as needed by easily changing the configuration file included with the course.

Give Me the Details