How to troubleshoot high CPU utilization issue in SQL Server?

So you have received a high CPU utilization alert from your monitoring system and wondering what to do next. If yes, then this blog post might be helpful for you.
You can follow the below steps to isolate the problem and mitigate it.

  1. Open the task manager and identify if the SQL Server process is causing the CPU spike or some other process. If any process other than the SQL server is consuming a high CPU, get in touch with the respective team. Now a day, with advanced monitoring tools /custom-built scripts, we can easily detect which process is consuming a high amount of CPU and direct the alert to the respective teams. ( I will discuss this in another article)

  2. If the SQL Server process is taking high CPU bandwidth, we can further segregate them using the below-mentioned perfmon counters.

    Process (sqlservr):

    % Processor Time

    % Privileged Time

    % User Time

    Processor:

    % Processor Time

    % Privileged Time

    % User Time

    SQL Server: SQL Statistics:

    Batch Requests/sec

    SQL Compilations/sec

    SQL Recompilations/sec

    Note: CPU is consumed in two different modes:

    a) User Mode

    b) Kernel Mode

  3. If you see % Privileged Time under the processor node is greater than 25 -30 %, which indicates that the processor is spending more time executing kernel mode operations. There’s the possibility that a faulty system driver or any antivirus software that is running currently is causing this. You should get in touch with the server admin/ windows server support folks at this point to investigate this.

  4. Now, let’s check if SQL Server processes are consuming most of the CPU by taking a look at the % Privileged Time counter value at the Process (sqlservr) node. It measures the percentage of elapsed time that the SQL process threads spent executing code in privileged Kernel mode. Verify if (Process (sqlservr)% Privileged time/No of logical Processor) is >30%. If yes, chances are there that some large I/O operations are being performed at that time (or other kernel mode operations executing). Again check with the server admin/ windows server support folks if the underlying server infra(storage performance, IOPS etc..) is performing fine.

  5. Finally, roll up your sleeves if you see a high % User Time under Process (sqlservr) node. It measures the percentage of elapsed time that the process threads spent executing code in user mode. You have to perform some checks as mentioned below to isolate the problem.

a) Identify the top CPU-consuming processes using the below query. If you are experiencing a severe performance issue that is paralyzing the system, you should contact the person/ team running the query ASAP and terminate it after taking their consent.

SELECT s.session_id, 
r.status, 
r.blocking_session_id,
r.wait_type, 
wait_resource,
r.wait_time / (1000 * 60) 'wait_time(Min)',
r.cpu_time,
r.logical_reads,
r.reads,
r.writes,
r.total_elapsed_time / (1000 * 60) 'total_elapsed_time(Min)',
Substring(st.TEXT,(r.statement_start_offset / 2) + 1,
((CASE r.statement_end_offset
WHEN -1
THEN Datalength(st.TEXT)
ELSE r.statement_end_offset
END - r.statement_start_offset) / 2) + 1) AS statement_text,
Coalesce(Quotename(Db_name(st.dbid)) + N'.' + Quotename(Object_schema_name(st.objectid, st.dbid)) + N'.' +
Quotename(Object_name(st.objectid, st.dbid)), '') AS command_text,
r.command,
qp.query_plan,
s.login_name,
s.host_name,
s.program_name,
s.last_request_end_time,
s.login_time,
r.open_transaction_count
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_requests AS r
ON r.session_id = s.session_id
CROSS APPLY sys.Dm_exec_sql_text(r.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
WHERE r.session_id != @@SPID
ORDER BY r.cpu_time desc

Now that the CPU-intensive queries are terminated and the CPU is stable now, you need to take a deep dive into the query plan (look at the query_plan column from the output of the above query) of the problematic query and make necessary changes to fine-tune the queries to restrict the queries from creating a mess again.

I will discuss what to look for in a query plan and how to tune the problematic query in a separate blog post.

b) Along with it, check if there are any SQL background threads like Ghost Cleanup or Lazy Writer threads running which generally consumes a significant amount of CPU resource.

select * from sys.sysprocesses where cmd like 'LAZY WRITER' or cmd like '%Ghost%'

c) You should also take a look at the Compilations & Re-Compilations in SQL Server through perfmon counter values.

If the SQL Compilation/sec perfmon counter value is greater than 15 -20% of Batch Requests/sec or SQL Recompilation/sec value is greater than 10% of SQL Compilation/sec value, then consider checking the below points:

  • If the statistics are up to date? If not, then go ahead and update the stats.

  • Check with the dev/application team about any recent Schema changes in the databases.

  • Try to see if there is any stored procedure created with the OPTION (RECOMPILE) query hint.

  • Parameterize your stored procedures and queries which will help in query plan reuse that is essential for optimal performance in high OLTP systems.

  • Enable Optimize for ad-hoc workloads Server Configuration Option which is used to improve the efficiency of the plan cache for workloads that contain many single-use ad hoc batches.

d) Finally, go the extra mile like a Rock Star DBA, and take a look at the SQL Server configurations. Though I don’t think in the age of Desired State Configuration (DSC), there would be any anomaly in SQL Server configuration settings.

Look at the below settings to check if they are set as per the Microsoft recommendations:

  • Max degree of parallelism (MAXDOP)

  • Cost Threshold for Parallelism

  • Max Worker Thread

These are the first few things that I generally look for while troubleshooting CPU issues. There are tons of additional checks that you may have to perform based on the complicacy and intensity of the problems. I will update this space with the additional details as and when I encounter anything new.

Note: If your system is already unresponsive 😰 and not accepting any local connections, try connecting through DAC and run the diagnostic queries.