Jephe Wu - http://linuxtechres.blogspot.com
Environment: MS SQL server 2008 R2, Windows server 2008
Objective: SQL server performance monitoring and tuning, database backup and restore procedures
Steps:
Configuration Changes:
According to "How to determine proper SQL Server configuration settings" - http://support.microsoft.com/kb/319942:
SQL Server can obtain a very high level of performance with relatively little configuration tuning. You can obtain high levels of performance by using good application and database design, and not by extensive configuration tuning. See the "References" section of this article for information about how to troubleshoot various SQL Server performance issues.
When you address a performance problem, the degree of improvement that is available from configuration tuning is typically modest unless you do not currently have the system properly configured. In SQL Server version 7.0 and later, SQL Server uses automatic configuration tuning and it is extremely rare that configuration settings (especially advanced settings) need any changes. Generally, do not make a SQL Server configuration change without overwhelming reason and not without careful methodical testing to verify the need for the configuration change. You must establish a baseline before the configuration change so that you can measure the benefit after the change.
If you do not have SQL Server properly configured, some settings might destabilize the server or might make SQL Server behave erratically. Years of support experience with many different environments indicate that non-default configuration settings might have results that range from neutral to highly negative.
If you do make a configuration change, you must perform rigorous methodical performance testing both before and after the change to assess the degree of improvement.
Based on actual support scenarios, SQL Server version 7.0 and later can achieve an extremely high level of performance without any manual configuration tuning.
In SQL Server version 7.0 and later, do not make any configuration changes to user connections, locks, and open objects because, by default, SQL Server dynamically tunes these settings.
Part I: Performance monitoring:
Performance Counter threshold value:
http://www.grumpyolddba.co.uk/monitoring/Performance%20Counter%20Guidance%20-%20SQL%20Server.htm
http://www.grumpyolddba.co.uk/monitoring/Performance%20Counter%20Guidance%20-%20Windows%20Server.htm
http://www.grumpyolddba.co.uk/sql2000/KB702_SQL%20Performance%20Tuning%20using%20Waits%20and%20Queues.mht
http://msdn.microsoft.com/en-us/library/ms189628.aspx - performance objects explanation
1. Overview
Check CPU and memory first, then disk I/O, and lastly the network, because a memory problem can trigger disk I/O problems.
Then check database internal performance counters, the server minimum/maximum memory configuration, data/log files, tempdb data/log files, and paging files.
2. CPU bottleneck check
a. First check Task Manager to see whether some other application, rather than SQL Server, is using most of the CPU. If it is SQL Server, check further.
b. run 'perfmon', add performance counter
- Processor Information - % Processor Time - the percentage of elapsed time that all process threads used the processor to execute instructions; should be < 80%. A process is the object created when you run a program; a thread is an object spawned by a process to execute instructions.
- Processor - % User Time - should be > 70%. SQL Server runs in user mode, so most processor time on a dedicated SQL Server box should be user time.
- Processor - % Privileged Time - should be < 20%. The operating system switches application threads to privileged mode to access operating system services. Privileged mode is designed for operating system components and allows direct access to hardware and all memory.
- System - Processor Queue Length - the number of threads waiting to be scheduled for CPU time. This is a single queue for all CPUs on a Windows system; if it is greater than two per CPU, it may indicate a CPU bottleneck.
Some causes of a high processor queue length:
1. unnecessary compilation and recompilation
2. memory pressure
3. lack of proper indexing
- System - Context Switches/sec - if it is greater than 20,000 per second, that's a lot.
Check sys.dm_os_wait_stats for signal waits and total waits; signal waits > 25% of total waits indicates a CPU bottleneck.
Note: to count the number of CPUs visible to SQL Server:
select cpu_count from sys.dm_os_sys_info
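The 25% signal-wait rule above is simple arithmetic over sums taken from sys.dm_os_wait_stats. A minimal illustrative sketch (the sample numbers are made up, not real measurements):

```python
def signal_wait_pct(signal_wait_ms, total_wait_ms):
    """Percentage of total wait time spent as signal (CPU-runnable) waits."""
    if total_wait_ms == 0:
        return 0.0
    return 100.0 * signal_wait_ms / total_wait_ms

# e.g. SUM(signal_wait_time_ms) and SUM(wait_time_ms) from sys.dm_os_wait_stats
pct = signal_wait_pct(signal_wait_ms=180_000, total_wait_ms=600_000)
print(f"signal waits = {pct:.0f}% of total waits")
print("possible CPU bottleneck" if pct > 25 else "CPU looks OK")
```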
- SQL Server: SQL Statistics: Batch Requests/sec - the number of batch requests SQL Server receives per second; indicates how busy your SQL Server is. This is a relative number, depending on your CPU power and how busy your database is. As a rough rule of thumb, a 100M Ethernet link can handle about 1000 batch requests/sec.
- SQL Server: SQL Statistics: SQL Compilations/sec - should be as low as possible (< 100, perhaps)
- SQL Server: SQL Statistics: SQL Recompilations/sec - better to be nearly 0
Measurement:
SQL Recompilations/sec : SQL Compilations/sec should be < 10%.
SQL Compilations/sec : Batch Requests/sec should be < 10%.
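The two 10% rules above are just ratios of raw counter values; a small illustrative helper (the counter values below are assumptions for the example, not measurements):

```python
def ratio_pct(numerator, denominator):
    """Counter ratio as a percentage; 0 when the denominator is idle."""
    return 100.0 * numerator / denominator if denominator else 0.0

batch_requests = 2000.0   # Batch Requests/sec
compilations = 150.0      # SQL Compilations/sec
recompilations = 10.0     # SQL Recompilations/sec

# Both ratios should stay under 10%
print(f"compilations/batch requests: {ratio_pct(compilations, batch_requests):.1f}%")
print(f"recompilations/compilations: {ratio_pct(recompilations, compilations):.1f}%")
```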
3. Memory bottleneck check
The default setting for min server memory is 0 MB, and the default setting for max server memory is 2147483647 MB in MS SQL 2008.
run 'perfmon', check the following:
- SQL Server: Buffer Manager - Page Life Expectancy - should not be less than 300 (5 mins). It measures how long a page can stay in memory before being flushed out to disk; too low a value indicates a lack of memory or a missing index (full table scans).
- SQL Server: Memory Manager - Memory Grants Pending - tasks waiting for a memory grant before they can execute; should be near 0.
- Memory: Paging File % Usage - should be as low as possible, < 70% based on KB 889654
- Memory: Paging File % Usage Peak - should be as low as possible, < 70% based on KB 889654
- Memory: Available MBytes - should not be less than 200
- Memory: Page Faults/sec - includes both hard faults (those that require disk access) and soft faults (where the faulted page is found elsewhere in physical memory). Most processors can handle large numbers of soft faults without significant consequence; hard faults, however, require disk access and can cause significant delays.
- Memory: Pages/sec - an indicator of the amount of paging that Windows is performing.
- Memory: Page Reads/sec - should be <= 5. This counter is probably the best indicator of a memory shortage because it indicates how often the system is reading from disk because of hard page faults. The system always uses the pagefile even when there is enough RAM to support all of the applications, so some page reads will always be encountered; however, a sustained value over 5 Page Reads/sec is often a strong indicator of a memory shortage.
If Memory: Pages/sec is greater than zero or Memory: Page Reads/sec is greater than five, Windows is using disk to resolve memory references (hard page faults). This costs both disk I/O and CPU resources.
Note:
a. If you have too little memory, disk I/O will increase; the two are related, so consider both sides if you suspect your system has a performance issue.
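The memory counters above each come with a threshold; a tiny illustrative checker that flags violations (the threshold table and sample values are drawn from this section, not from any SQL Server API):

```python
# Thresholds as stated in the section above
MEMORY_RULES = {
    "Page Life Expectancy": lambda v: v >= 300,   # seconds
    "Paging File % Usage":  lambda v: v < 70.0,
    "Available MBytes":     lambda v: v >= 200,
    "Page Reads/sec":       lambda v: v <= 5,
}

def check_memory(samples):
    """Return the list of counters that violate their rule."""
    return [name for name, ok in MEMORY_RULES.items()
            if name in samples and not ok(samples[name])]

# Sample perfmon readings (assumed values for illustration)
flagged = check_memory({"Page Life Expectancy": 120,
                        "Page Reads/sec": 12,
                        "Available MBytes": 900})
print("counters indicating memory pressure:", flagged)
```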
4. Disk I/O bottleneck
Concentrate on disk queuing rather than on the actual I/O for each disk, since the actual disk I/O speed cannot be adjusted.
In perfmon:
- PhysicalDisk - Avg. Disk Queue Length - if it exceeds twice the number of spindles, you are likely developing a bottleneck. You have to know how many disks sit behind the volume; on a SAN that can be difficult to find out, and even the SAN administrator might not know exactly.
Note: disk queuing can also be caused by a shortage of memory, so check the Pages/sec value as well.
- PhysicalDisk - Avg. Disk sec/Read - the average time, in seconds, of a read of data from the disk (4-8 ms is ideal; > 15 ms may indicate a disk bottleneck).
Refer to http://technet.microsoft.com/en-US/library/cc917690 for the I/O bottleneck part:
Less than 10 ms - very good
Between 10 - 20 ms - okay
Between 20 - 50 ms - slow, needs attention
Greater than 50 ms - serious I/O bottleneck
According to http://download.microsoft.com/download/F/1/0/F10BC023-9396-4D67-BAC9-7C43AE954BF4/SQLServer2000_SQL_Performance_Tuning_using_Waits_and_Queues.pdf
If Disk sec/Read > normal read time (ask the vendor for the normal read time), you can consider the following options:
1. Resolve the I/O bottleneck by adding more drives, spreading I/O across the new drives if possible, e.g. move files such as the database, transaction log, and other application files that are being written to or read from.
2. Check for memory pressure - see the memory component.
3. Check for proper indexing of SQL tables. Proper indexing can save I/O. Check SQL query plans, looking for scans, sorts, etc.; Showplan identifies sorting steps.
4. Run SQL Profiler to identify TSQL statements doing scans. In Profiler, select the Scans event class and the Scan:Stopped event, go to the data column tab and add Object ID, then run the trace. Save the profiler trace to a trace table, and then search for the Scans event. Alternatively, you can search for high duration, reads, and writes.
- PhysicalDisk - Avg. Disk sec/Write - should be < 8 ms (non-cached) and <= 1 ms for a high-end SAN. The throughput of high-volume OLTP applications depends on fast sequential transaction log writes.
Non-cached writes:
Excellent: < 8 msec (.008 seconds)
Good: < 12 msec (.012 seconds)
Fair: < 20 msec (.020 seconds)
Poor: > 20 msec (.020 seconds)
Cached writes only:
Excellent: < 1 msec (.001 seconds)
Good: < 2 msec (.002 seconds)
Fair: < 4 msec (.004 seconds)
Poor: > 4 msec (.004 seconds)
- PhysicalDisk: % Disk Time - should be < 50% - the percentage of elapsed time that the selected disk drive was busy servicing read or write requests.
- PhysicalDisk - Disk Reads/sec and PhysicalDisk - Disk Writes/sec
Note: these two values indicate how busy your database server is; they are mainly useful for trending, e.g. comparing against three months or a year ago. Keep both values below 85% of the disks' capacity.
Example:
When you use these counters, you may need to adjust the values for RAID configurations using the following formulas:
- RAID 0 -- I/Os per disk = (reads + writes) / number of disks
- RAID 1 -- I/Os per disk = [reads + (2 * writes)] / 2
- RAID 5 -- I/Os per disk = [reads + (4 * writes)] / number of disks
- RAID 10 -- I/Os per disk = [reads + (2 * writes)] / number of disks
For example, you might have a RAID-1 system with two physical disks and the following counter values:
Disk Reads/sec: 80
Disk Writes/sec: 70
Avg. Disk Queue Length: 5
To sustain that load, each RAID configuration would need:
RAID 0 - 75 I/Os per disk
RAID 1 - 110 I/Os per disk
RAID 5 - 180 I/Os per disk
RAID 10 - 110 I/Os per disk
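The formulas above can be checked mechanically; a small sketch reproducing the worked example (two disks, 80 reads/sec, 70 writes/sec):

```python
def ios_per_disk(level, reads, writes, disks):
    """Per-disk I/O load implied by the RAID write-penalty formulas above."""
    if level == 0:
        return (reads + writes) / disks
    if level == 1:
        return (reads + 2 * writes) / 2
    if level == 5:
        return (reads + 4 * writes) / disks
    if level == 10:
        return (reads + 2 * writes) / disks
    raise ValueError(f"unsupported RAID level: {level}")

for level in (0, 1, 5, 10):
    load = ios_per_disk(level, reads=80, writes=70, disks=2)
    print(f"RAID {level:>2}: {load:.0f} I/Os per disk")
```

Running this prints 75, 110, 180, and 110 I/Os per disk, matching the table above.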
The way to watch for disk queuing on SQL Server log files differs from SQL Server database files. For log files, use:
SQL Server: Databases: Log Flush Wait Time
SQL Server: Databases: Log Flush Waits/sec
c. RAID level - use RAID 10 if possible.
Each SQL Server write to a mirrorset results in two disk I/O operations.
When dynamically adding new disks to a RAID array, the RAID controller will move some existing SQL Server data to the new drives so data is evenly distributed across all drives in the array.
d. Sequential I/O
Sequential I/O is faster than non-sequential I/O; most hard disks perform roughly twice as well for sequential as for non-sequential I/O operations.
Note: logs are generally not a major concern here because transaction log data is always written sequentially to the log file, in sizes ranging up to 32 KB.
5. SQL Server
See the SQL Server, Buffer Manager Object documentation - http://msdn.microsoft.com/en-us/library/ms189628.aspx - for explanations of the various objects.
- SQL Server: Buffer Manager: Buffer Cache Hit Ratio - should be at least 90%.
Note: a major part of any database server environment is the management of the memory buffer cache; less than 90% may indicate memory pressure or a missing index.
See SQL Profiler: Stored Procedures: CacheHit, CacheMiss, and CacheInsert to see which stored procedure query plans are already in cache (Hit) vs. those not in cache (Miss, Insert).
- SQL Server: Cache Manager: Cache Hit Ratio - the percentage of time that procedure plan pages are already in cache (procedure cache hits), i.e. how often a compiled procedure is found in the procedure cache (thus avoiding the need to recompile).
- SQL Server: Buffer Manager: Page Life Expectancy -
A sudden big drop in page life expectancy: DW applications (e.g. big transactions) can experience big drops in page life expectancy due to a cache flush from a big read.
An average page life expectancy below 300 (5 minutes) on an OLTP database indicates memory pressure, missing indexes, or a cache flush.
- SQL Server: Access Methods: Forwarded Records/sec - the number of records fetched through forwarded record pointers.
This applies to tables with no clustered index. If you start with a short row and an update makes the row wider, the row may no longer fit on the data page; a pointer is left in its place and the row is forwarded to another page. Look at the code to determine where a short row is inserted and then updated.
This can be avoided by:
1. using default values (so that an update will not result in a longer row, which is the root cause of forwarded records);
2. using char instead of varchar (a fixed length means an update will not result in a longer row).
- SQL Server: Access Methods: Full Scans/sec - the number of unrestricted full scans, either base table or full index scans. Should be as low as possible; sustained values > 1 may indicate a problem.
SQL Profiler can be used to identify which TSQL statements do scans: select the Scans event class with the Scan:Started and Scan:Completed events, and include the Object ID data column. Save the profiler trace to a trace table and then search for the Scans event. The Scan:Completed event also provides the associated I/O, so you can additionally search for high reads, writes, and duration.
- MS SQL Server: Access Methods: Index Searches/sec - ideally the ratio Index Searches/sec : Full Scans/sec should be > 1000.
Index searches are used to start range scans, to fetch single index records, and to reposition within an index. Compare with Full Scans/sec; you want to see high values for index searches.
- MS SQL Server: Access Methods: Page Splits/sec - the number of page splits occurring as a result of index pages overflowing. Normally associated with the leaf pages of clustered and non-clustered indexes.
Page splits are extra I/O overhead resulting from random inserts. When there is no room on a data page and a row must be inserted on that page (due to index order), SQL splits the page, moving half of the rows to a new page, and then inserts the new row.
Correlate with PhysicalDisk: Avg. Disk sec/Write. If this is very high, you can reorganize the index(es) on the table(s) causing the page splits to reduce splits temporarily. A fill factor leaves a certain amount of space available for inserts; see http://msdn.microsoft.com/en-us/library/aa933139(v=sql.80).aspx for what fill factor is.
- MS SQL Server: Buffer Manager: Checkpoint Pages/sec - pages written to disk during the checkpoint process, freeing up SQL cache. Indicates the number of pages flushed to disk per second by a checkpoint or another operation that requires all dirty pages to be flushed. Memory pressure is indicated if this counter is high along with high Lazy Writes/sec and low page life expectancy (< 300 seconds).
- MS SQL Server: Buffer Manager: Lazy Writes/sec - pages written to disk by the lazywriter, freeing up SQL cache. Indicates the number of buffers written per second by the buffer manager's lazy writer, a system process that flushes out batches of dirty, aged buffers (buffers containing changes that must be written back to disk before the buffer can be reused for a different page) and makes them available to user processes. The lazy writer eliminates the need to perform frequent checkpoints in order to create available buffers. Memory pressure is indicated if this counter is high along with high Checkpoint Pages/sec and low page life expectancy (< 300 seconds).
- MS SQL Server: Buffer Manager: Readahead Pages/sec - indicates the number of pages read per second in anticipation of use. Under memory shortages, a cold cache, or low hit rates, SQL may use worker threads to read ahead (bring in pages ahead of time) to raise hit rates. By itself readahead is not a problem unless users are consistently flushing each other's pages. Check for proper indexing and bad query plans (scans in Profiler).
- SQL Server: Databases: Log Flush Wait Time (time waiting for a transaction log write request to complete, in ms), SQL Server: Databases: Log Flush Waits/sec (transaction log writes per second), and SQL Server: Databases: Log Growths (SQL Server automatically grows the transaction log to accommodate insert, update, and delete activity).
In general, a growth of the transaction log temporarily freezes writes to the transaction log while Windows grows the transaction log file. Check that the growth increment is large enough; if not, performance will suffer because log growths will occur more often.
- SQL Server: Databases: log file size and used log file size
- SQL Server: Databases: database file size
- SQL Server: Databases: tempdb file size
- SQL Server: Databases: Transactions/sec - SQL Server transactions per second
- MS SQL Server: Transactions - Longest Transaction Running Time.
To check long-running transactions, query the view sys.dm_tran_database_transactions - columns of particular interest include the time of the first log record (database_transaction_begin_time), the current state of the transaction (database_transaction_state), and the log sequence number (LSN) of the begin record in the transaction log (database_transaction_begin_lsn).
In perfmon, also check:
- General Statistics: Logins/sec (number of logins per second) and Logouts/sec
- General Statistics: User Connections - not very accurate when an application uses connection pooling, since there can be many users behind each pooled connection
- Latches: Average Latch Wait Time (ms) - latches are short-term, lightweight synchronization objects. Latches are not held for the duration of a transaction; typical latching operations occur during row transfers to memory, when controlling modifications to the row offset table, etc.
- Latches: Latch Waits/sec
- Locks: Average Wait Time (ms) - transactions should be as short as possible to limit the blocking of other users
- Locks: Lock Waits/sec
- Locks: Number of Deadlocks/sec
sys.dm_os_wait_stats - check the top wait statistics
sys.dm_db_index_operational_stats - check lock and latch waits
8. Network bottlenecks
In perfmon, choose the following:
a. Network Interface - Current Bandwidth
b. Network Interface - Packets/sec
Note: actual bandwidth usage can be roughly estimated as packets/sec * 1500 * 8 / 1000000 Mbps (assuming full 1500-byte Ethernet frames).
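That rule of thumb in code form (the 1500-byte frame size is an assumption; real packet sizes vary, so treat the result as an upper-bound estimate):

```python
def approx_mbps(packets_per_sec, frame_bytes=1500):
    """Rough link usage in Mbps, assuming every packet is a full frame."""
    return packets_per_sec * frame_bytes * 8 / 1_000_000

# e.g. 5000 packets/sec observed in perfmon on a 100 Mbit link
usage = approx_mbps(5000)
print(f"~{usage:.0f} Mbps")   # 5000 * 1500 * 8 / 1e6 = 60 Mbps
```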
9. tempdb
Also check the paging file, separate the data file and log file drives, and put the tempdb data/log files on a separate drive.
Monitor the following Performance Monitor counters for any unusual increase in temporary object allocation/deallocation activity:
- SQL Server:Access Methods\Workfiles Created/sec
- SQL Server:Access Methods\Worktables Created/sec
- SQL Server:Access Methods\Mixed Page Allocations/sec
- SQL Server:General Statistics\Temp Tables Created/sec
- SQL Server:General Statistics\Temp Tables for Destruction
Solution: increase the number of tempdb data files, sized equally, to distribute the workload across all of the disks and files. Ideally, you want to have as many files as there are CPUs (taking the affinity into account).
Move the tempdb data and log files to a new location:
alter database tempdb modify file
(name='tempdev', filename='e:\mssql7\tempnew_location.mdf')
alter database tempdb modify file
(name='templog', filename='c:\temp\tempnew_loglocation.ldf')
Part II - performance tuning
1. CPU/memory bottlenecks - upgrade the server or add more RAM.
Note: AWE - check it on 32-bit servers with > 4 GB of memory, and leave it unchecked the rest of the time.
2. Reduce disk I/O:
- buy faster disks (15k rpm)
- use database partitioning to reduce I/O
- separate the data and log file disks
- increase the tempdb initial size and put it on a separate disk from the data/log files
- configure a suitable maximum server memory in SQL Server to leave some for the OS
- use a suitable RAID level: RAID 0 or RAID 0+1 (RAID 1/0 or RAID 10), mirrored stripes; stripe first, then mirror (RAID 0 first). RAID 10 is the best.
3. MS SQL Server
- Create and maintain good indexes.
Recently used indexes appear in sys.dm_db_index_usage_stats; any defined index not included in this DMV has not been used since the last restart of SQL Server.
- Too many indexes needing frequent updates, or unused indexes that need frequent maintenance (insert/select/update), create overhead without benefiting any users.
- Create appropriate indexes to avoid excessive sorting operations.
- Watch for big I/Os such as table and range scans caused by missing indexes.
- Reduce table joins in frequent queries; overuse of joins results in long-running queries and wasted resources.
- Partition large data sets and indexes.
- Tune applications and queries. It is very important for application developers to understand SQL Server architectural basics and how to take full advantage of SQL Server indexes to minimize I/O.
- Lack of useful statistics:
http://msdn.microsoft.com/en-us/library/ms187348(v=sql.105).aspx
http://msdn.microsoft.com/en-us/library/ms190397(v=sql.105).aspx#UpdateStatistics
The following command calls sp_updatestats to update all statistics for the database:
EXEC sp_updatestats
Other usage:
UPDATE STATISTICS Sales.SalesOrderDetail;
DBCC SHOW_STATISTICS
Related database options: AUTO_CREATE_STATISTICS, AUTO_UPDATE_STATISTICS, AUTO_UPDATE_STATISTICS_ASYNC
Master database files:
The master, msdb, and model databases are not used much during production compared to user databases, so it is typically not necessary to consider them in I/O performance tuning. The master database is usually used only when adding new logins, databases, devices, and other system objects.
Part III - References:
1. http://sqlcat.com/sqlcat/b/top10lists/archive/2007/11/21/top-sql-server-2005-performance-issues-for-oltp-applications.aspx
2. SQL Server performance tuning guide for data warehouses - http://technet.microsoft.com/library/Cc966420
3. http://msdn.microsoft.com/en-us/library/ms178067.aspx
4. http://sqlcat.com/sqlcat/b/top10lists/archive/2007/11/21/top-10-sql-server-2005-performance-issues-for-data-warehouse-and-reporting-applications.aspx
5. SQL long transaction view - http://msdn.microsoft.com/en-us/library/ms186957(v=sql.105).aspx
6. Update statistics - http://msdn.microsoft.com/en-us/library/ms187348(v=sql.105).aspx
7. Performance Monitor counters - http://technet.microsoft.com/en-us/library/cc768048.aspx
8. Performance and tuning how-to - http://msdn.microsoft.com/en-us/library/ms191511(v=sql.105).aspx
9. http://technet.microsoft.com/en-us/library/ms190619(v=sql.105)
10. Establishing a performance baseline - http://technet.microsoft.com/en-us/library/cc781394(v=WS.10).aspx
11. Storage best practices - http://technet.microsoft.com/library/Cc966534
12. Troubleshooting performance problems in SQL Server 2008 - http://msdn.microsoft.com/en-us/library/dd672789(v=SQL.100).aspx#phrss - excellent doc
13. Best practices for SQL Server maintenance - http://technet.microsoft.com/en-US/library/cc966447
14. SAN storage best practices for SQL Server - http://www.brentozar.com/sql/sql-server-san-best-practices/
15. SQL Server 2008 I/O performance - http://blogs.technet.com/b/josebda/archive/2009/03/31/sql-server-2008-i-o-performance.aspx
16. Diagnosing and resolving latch contention for SQL Server 2008 R2 - http://www.microsoft.com/en-us/download/details.aspx?id=26665
17. How to: Configure SQL Server to Use Soft-NUMA - http://msdn.microsoft.com/en-us/library/ms178144(v=sql.105).aspx
Part IV - useful DMVs
1. DMV list:
http://msdn.microsoft.com/en-us/library/ms179984.aspx
2. commands:
select * from sys.dm_os_wait_stats order by wait_time_ms
select * from sys.dm_os_latch_stats;
select wait_type, waiting_tasks_count, wait_time_ms
from sys.dm_os_wait_stats
where wait_type like 'PAGEIOLATCH%'
order by wait_type
I/O waits at this moment:
select t1.database_id, t1.file_id, t1.io_stall,
t2.io_pending_ms_ticks, t2.scheduler_address
from sys.dm_io_virtual_file_stats(NULL, NULL) as t1,
sys.dm_io_pending_io_requests as t2
where t1.file_handle = t2.io_handle
Top 10 wait types:
select top 10 *
from sys.dm_os_wait_stats
order by wait_time_ms desc
Part V - best practices
1. Isolate the log from the data at the physical disk level:
- separate the database data and log file disks; pre-size the data and log files.
According to http://technet.microsoft.com/library/Cc966534, it is recommended to have 0.25 to 1 data files (per filegroup) for each CPU on the host server.
- tempdb data/log files should be on different disk spindles.
Create one tempdb data file per CPU (a dual core counts as 2 CPUs; logical processors from hyperthreading do not). Depending on tempdb usage, use RAID 10 for its storage if possible.
- backups should go to a separate disk.
- If you have real-time anti-virus monitoring, it is recommended that you exclude the SQL Server database files (including data files, transaction log files, tempdb, and other system database files) from real-time monitoring.
- Update statistics on the largest tables weekly or monthly.
- Rebuild or defragment the most important indexes.
2. The storage disks for data files should not be shared with other applications; dedicate them to the data files with enough spindles.
3. When running in a virtual environment, the vCPU and memory amounts should match the physical machine baseline.
4. Use RAID 1/0 or RAID 1 for the log file disk if possible. RAID 1+0 provides better write performance than any other RAID level that provides data protection, including RAID 5.
Part VI - testing tools
1. SQL Nexus
2. SQLdiag
3. SQLIO (alternatively, use IOMeter)
Download SQLIO at http://www.microsoft.com/en-us/download/details.aspx?id=20163
http://www.brentozar.com/archive/2008/09/finding-your-san-bottlenecks-with-sqlio/
SAN performance tuning with SQLIO - http://sqlserverpedia.com/wiki/SAN_Performance_Tuning_with_SQLIO
4. perfmon and SQL Profiler
http://www.simple-talk.com/sql/database-administration/correlating-sql-server-profiler-with-performance-monitor/ - a good article on how to correlate Profiler and perfmon data
5. latch waits
In a non-production environment only, you can clear the sys.dm_os_wait_stats DMV with the following command:
dbcc SQLPERF ('sys.dm_os_wait_stats', 'CLEAR')
A similar command clears the sys.dm_os_latch_stats DMV:
dbcc SQLPERF ('sys.dm_os_latch_stats', 'CLEAR')
Part VII - backup and restore
1. Restore the master database in single-user mode - http://technet.microsoft.com/en-us/library/ms190679.aspx
2. Performing a complete database restore (full recovery model):
http://msdn.microsoft.com/en-us/library/ms187495(v=sql.105).aspx
http://msdn.microsoft.com/en-us/library/ms175199(v=sql.105).aspx
Part VIII - FAQ
1. If the performance counters don't load or are corrupted, you can run this to rebuild them:
cd \windows\system32
lodctr /R