Use Oracle Rman to backup and restore database


Jephe Wu - http://linuxtechres.blogspot.com 

Objective: understanding Oracle rman backup and best practice for doing it
Environment: Oracle database 11.2 64bit

Configuration:

Firstly, check control_file_record_keep_time paramteter, make sure it's greater than your retention policy, otherwise, the rman backup information will be removed from controlfile. it's 7 days by default.

to check parameter value:
sql> show parameter control_file_record_keep_time;

use 'show all' in rman to list current configuration
and use something like 'CONFIGURE CHANNEL DEVICE TYPE DISK clear' to reset to default settings.

Rman backup script best practise:

$ cat rmanbackup.sh
# export variables
ORACLE_BASE=/u01/app/oracle
export ORACLE_BASE
ORACLE_HOME=/u01/app/oracle/product/10.2.0/db_1
export ORACLE_HOME
ORACLE_SID=LIVEDB1
export ORACLE_SID
PATH=$ORACLE_HOME/bin:$PATH
export PATH

#or

export PATH=/usr/bin:/usr/local/bin:/usr/ccs/bin
export NLS_DATE_FORMAT='DD-MON-YYYY HH24:MI:SS'
export ORACLE_SID=PROD

ORAENV_ASK=NO
. oraenv
unset ORAENV_ASK

# other variables
week=`date +%w`
LOGFILE=/u01/app/oracle/admin/LIVEDB/backup/backup.log
export TIMESTAMP=`date +%Y-%m-%d-%T`
export WEEK=`date +%w`
export TRACE=/u01/app/backup/controlfile.backup.${week}.sql
export PFILE=/u01/app/backup/spfile.backup
export EMAIL="jephe@domain.com"


# create lock file

PRG=`basename $0`

LOCKFILE=/tmp/$PRG.lock
if [ -f $LOCKFILE ]; then
echo "lock file exists, exiting..."
exit 1
else
echo "DO NOT REMOVE, $LOCKFILE" > $LOCKFILE
 fi 


rman target / <<ERMAN > $LOGFILE

set echo on;
# controlfile autobackup
CONFIGURE CONTROLFILE AUTOBACKUP ON;
or
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '/dir/cf%F';

# The %F element of the format string combines the DBID, day, month, year, and sequence number to # generate a unique filename. %F must be included in any control file autobackup format.

#retention policy
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;
or
CONFIGURE RETENTION POLICY TO REDUNDANCY 2;


show all;
report unrecoverable;

# actual backup
backup database plus archivelog;

refer to http://docs.oracle.com/cd/B19306_01/backup.102/b14192/bkup003.htm below

4.3.7.2 Backing Up Logs with BACKUP ... PLUS ARCHIVELOG

You can add archived redo logs to a backup of other files by using the BACKUP ... PLUS ARCHIVELOG clause. Adding BACKUP ... PLUS ARCHIVELOG causes RMAN to do the following:
  1. Runs the ALTER SYSTEM ARCHIVE LOG CURRENT command.
  2. Runs BACKUP ARCHIVELOG ALL. Note that if backup optimization is enabled, then RMAN skips logs that it has already backed up to the specified device.
  3. Backs up the rest of the files specified in BACKUP command.
  4. Runs the ALTER SYSTEM ARCHIVE LOG CURRENT command.
  5. Backs up any remaining archived logs generated during the backup.

This guarantees that datafile backups taken during the command are recoverable to a consistent state.





# delete obsolete;
# report obsolete;
delete force noprompt obsolete;

# controlfile/pfile backup
sql 'ALTER DATABASE BACKUP CONTROLFILE TO TRACE';
sql "ALTER DATABASE BACKUP CONTROLFILE TO TRACE AS ''/u01/app/backup/controlfile.backup.${week}.sql'' reuse";
sql "CREATE PFILE=''/u01/app/backup/spfile.backup'' FROM SPFILE";

# crosscheck
crosscheck archivelog all;
delete noprompt expired archivelog all;
crosscheck backup;
delete noprompt expired backup;
crosscheck copy;
delete noprompt expired copy;

# list
# list backup by file;
# list backup of database by backup;
# list expired backup;
# list expired copy;
list backup;
list backup summary;
list backup recoverable;

# report
# report need backup database;
# report obsolete;
# report schema;
report unrecoverable;

#  test restore
restore database validate;
restore spfile validate;
restore archivelog from time 'sysdate-1' validate [check logical]
restore controlfile to '/path/to/' validate;

# restore tablespace users validate;

# restore archivelog from time ‘sysdate-1′ validate;

Note: The RESTORE DATABASE VALIDATE command will check for the last level 0 or FULL tape or disk based backup, but the RESTORE ARCHIVELOG ALL command will check for all the archivelog files catalogued based on the retention policy



exit
ERMAN

mailx -s "controlfile trace backup for PROD"  $EMAIL <  $TRACE
mailx -s "pfile backup for PROD"  $EMAIL <  $PFILE


use RMAN to check unrecoverable 
# check which datafile has unrecoverable objects:

SELECT df.name data_file_name, df.unrecoverable_time FROM v$datafile df, v$backup bk WHERE (df.file# = bk.file#) and df.unrecoverable_change# != 0 and df.unrecoverable_time >  (select max(end_time) from v$rman_backup_job_details where INPUT_TYPE in ('DB FULL' ,'DB INCR'));

# check objects inside datafile

select distinct dbo.owner,dbo.object_name, dbo.object_type, dfs.tablespace_name, dbt.logging table_level_logging, ts.logging 


tablespace_level_logging from v$segstat ss, dba_tablespaces ts, dba_objects dbo, dba_tables dbt, v$datafile df, dba_data_files dfs, v$tablespace 
vts where ss.statistic_name ='physical writes direct' and dbo.object_id = ss.obj# and vts.ts# = ss.ts# and ts.tablespace_name = vts.name and 
ss.value != 0 and df.unrecoverable_change# != 0 and dfs.file_name = df.name and ts.tablespace_name = dfs.tablespace_name and dbt.owner = dbo.owner 
and dbt.table_name = dbo.object_name;


use RMAN to check block corruption if required.


RMAN> configure device type disk parallelism 4;
RMAN> backup validate check logical database;
or backup validate check logical datafile 3;

SQL> select * from v$database_block_corruption;

V$DATABASE_BLOCK_CORRUPTION displays information about database blocks that were corrupted after the last backup.
RMAN> blockrecover corruption list;


Starting recover at 05-AUG-12
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=34 device type=DISK

starting media recovery
media recovery complete, elapsed time: 00:00:01

Finished recover at 05-AUG-12


RMAN>run {
      allocate channel d1 type disk;
      blockrecover corruption list;
      release channel d1;
     }

V$BACKUP_CORRUPTION displays information about corrupt block ranges in data file backups
                                                  from the control file.
V$DATABASE_BLOCK_CORRUPTION displays information about database blocks
                                                                     that were corrupted after the last backup.



SQL>  select * from v$backup_corruption;
no rows selected

The V$BACKUP_CORRUPTION view shows corrupted blocks discovered during an RMAN backup. But once the blocks have been fixed this view is not updated.



SQL> select * from v$copy_corruption;
no rows selected


References:


How to identify all the Corrupted Objects in the Database with RMAN [ID 472231.1]
New Rman Blockrecover command in 11g (Recover corruption list) [ID 1390759.1]




Rman database restore: - using current control file

rman target /
Rman> startup force mount;
Rman> restore database;
Rman> recover database;
Rman> alter database open;


Rman database restore: - using backup control file




rman target /
RMAN> startup nomount;
RMAN> restore controlfile from autobackup;
RMAN> alter database mount;
RMAN> restore database;
RMAN> recover database;
RMAN> alter database open resetlogs;


Rman backup in non-archivelog mode:

rman target /
RMAN> SHUTDOWN IMMEDIATE;
RMAN> STARTUP FORCE DBA;
RMAN> SHUTDOWN IMMEDIATE;
RMAN> STARTUP MOUNT;

RMAN> BACKUP DATABASE;
or
RMAN> BACKUP AS COPY DATABASE;

RMAN> ALTER DATABASE OPEN;



Other rman backup: (individual file or tablespace)

RMAN> BACKUP ARCHIVELOG from time 'sysdate-2';
RMAN> BACKUP TABLESPACE system, users, tools;
RMAN> BACKUP AS BACKUPSET DATAFILE
'ORACLE_HOME/oradata/trgt/users01.dbf',
'ORACLE_HOME/oradata/trg/tools01.dbf';

RMAN> BACKUP DATAFILE 1,3,5;
RMAN> BACKUP CURRENT CONTROLFILE TO '/backup/curr_cf.copy';
RMAN> BACKUP SPFILE;
RMAN> BACKUP BACKUPSET ALL;

rman> Backup archivelog;
rman> Backup archivelog all;

You can also specify a range of archived redo logs by time, SCN, or log sequence number, as in the following example:
rman> BACKUP ARCHIVELOG
      FROM TIME 'SYSDATE-30' UNTIL TIME 'SYSDATE-7';

Rman recover tablespace and data block:

Tablespace:
   RMAN> SQL 'ALTER TABLESPACE users OFFLINE';
    rman>  RESTORE TABLESPACE users;
     rman> RECOVER TABLESPACE users;
      rman> SQL 'ALTER TABLESPACE users ONLINE;

Individual data block:
rman> BLOCKRECOVER CORRUPTION LIST;
rman> BLOCKRECOVER DATAFILE 7 BLOCK 233, 235 DATAFILE 4 BLOCK 101;


Managing rman repository:

RMAN repository metadata is always recorded in the control file of the target database.

CONTROL_FILE_RECORD_KEEP_TIME = 7 is the default setting

You can configure a retention policy to be used by RMAN to determine which backups are considered obsolete. This policy can be based on a recovery window (the maximum number of days into the past for which you can recover) or redundancy (how many copies of each backed-up file to keep).

e.g.
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS
CONFIGURE RETENTION POLICY TO REDUNDANCY 2;

RMAN does not automatically delete backups rendered obsolete by the retention policy. RMAN deletes obsolete files if you run the DELETE OBSOLETE command.

report obsolete;
delete obsolete;
or
delete [force] [noprompt] obsolete;

if encounter problem like this:
ORA-27027: sbtremove2 returned error
ORA-19511: Error received from media manager layer, error text:


Rman crosscheck and delete backups:
Crosscheck is needed when an archivelog file or backup is manually removed, i.e., not deleted by RMAN. This command ensures that data about backups in the recovery catalog or control file is synchronized with corresponding data on disk or in the media management catalog.  

The CROSSCHECK command operates only on files that are recorded in the recovery catalog or the control file. 

The CROSSCHECK command does not delete any files that it is unable to find, but updates their repository records to EXPIRED. Then, you can run DELETE EXPIRED to remove the repository records for all expired files as well as any existing physical files whose records show the status EXPIRED. 

If some backup pieces or copies were erroneously marked as EXPIRED, for example, because the media manager was misconfigured, then after ensuring that the files really do exist in the media manager, run the CROSSCHECK BACKUP command again to restore those files to AVAILABLE status. 

crosscheck archivelog all;
delete noprompt expired archivelog all;
crosscheck backup;
delete noprompt expired backup;


Rman FAQ:
1. Move backup to new location:
You can move backup to their original location, then catalog them again
rman> catalog archivelog '/path/to/xxxz.arc';
or
rman> catalog star with '/path/to/';

MOS: How to backup archivelog files moved to different location [ID 293495.1]

2. remove archivelog backed up once:
Rman target /
Rman> DELETE ARCHIVELOG ALL BACKED UP 1 TIMES TO Disk;
or to free up disk space quickly

rm -f *.arc
rman> crosscheck archivelog all;

note: if not crosschecked, you will encounter backup issue below:

RMAN-06059  (RMAN-6059)
Text:  expected archived log not found, lost of archived log compromises
 recoverability

MOS: RMAN-6059 expected archived log not found, lost of archived log compromi [ID 291415.1]
MOS: Rman backup retention policy [ID 462978.1]

3. check tablespace usage
Tablespace usage
select
   tablespace_name,
   used_percent
from
   dba_tablespace_usage_metrics
where
   used_percent > 50;

4. you cannot delete a obsolete file in rman, uncatalog it

rman> report obsolete;
rman> change copy of controlfile uncatalog;
rman> report obsolete;
now file is not listed

rman> delete obsolete

References:

1. http://docs.oracle.com/cd/E11882_01/backup.112/e10642.pdf  # rman 11.2 doc
2. Oracle10g / 11g - Getting Started with Recovery Manager (RMAN) [ID 360416.1]
3. http://gavinsoorma.com/2009/11/rman-restore-validate-examples/




How to install Toad DBA Suite for Oracle on Windows 7 64bit

Jephe Wu - http://linuxtechres.blogspot.com

Objective: install Toad DBA Suite 11.5.1.2 trial version on Windows 8 64bit
Environment: Windows 7 enterprise edition 64bit, 8G RAM


Steps:

1. Install Toad DBA Suite for Oracle with default options, without Oracle instant client.

2. download Oracle instant client for Windows 32bit
install 32bit client only, even you are using Windows 7 64bit according to

http://www.quest.com/Quest_Site_Assets/Documents/ToadforOracle_115_DBA_Suite_Installation_Guide
pdf at page 6 as follows:

Windows 7 (32-bit and 64-bit)
Note: When running on a 64-bit operating system, you must use the 32-bit Oracle Client.

or
you can install Oracle client for Windows instead of using instant client at
http://www.oracle.com/technetwork/database/enterprise-edition/downloads/112010-win32soft-098987.html


3. install instant client and configure Windows PATH and variables
download instantclient-basic-nt-11.2.0.zip then unzip to c:\instantclient_11_2, the most important file is oci.dll

right client 'my computer', to add the following in the front of the PATH environment variables:

c:\instantclient_11_2

Also add an additional variable as follows:
variable name: NLS_LANG
vriable value: AMERICAN.AMERICA.IW8ISO8859P8


3. Configuring Toad DBA Suite for Oracle for database connection.

refer to http://www.quest.com/tv/All-Videos/1044396072001/Toad-for-Oracle---Using-Oracle-Instant-Client/Video/

choose the newly installed instant client as connect using:, use 'direct' tab to configure database host, port and service name, type in user and password, connect.

4. References:
http://yahawana.wordpress.com/2012/02/12/step-by-step-install-toad-for-oracle-on-a-windows-7-64-bits-oci-error/



How to install/backup Mysql 5.5 on CentOS 6 for zabbix database


Jephe Wu - http://linuxtechres.blogspot.com


Objective: upgrade Mysql to version 5.5 on CentOS 6 for zabbix installation
Environment: CentOS 6.2 64bit


Steps:

1. install epel and remi repository from http://blog.famillecollet.com/pages/Config-en

# wget http://dl.fedoraproject.org/pub/epel/6/i386/epel-release-6-7.noarch.rpm
# wget http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
# rpm -Uvh remi-release-6*.rpm epel-release-6*.rpm

2. install mysql and mysql-server
# yum --enablerepo=remi,remi-test install mysql mysql-server



3. make it automatic start on reboot
# service mysqld start
# chkconfig --list mysqld
# chkconfig mysqld on

4. Secure mysql 
# mysql_secure_installation
answer yes to disable root login remotely, drop test database, remove anonymous users etc

5. change mysql admin password
# mysqladmin -u root password password
note: assume your use 'password' as your root password.

6. create zabbix user and database
# mysql [-h localhost] -u root -ppassword
# mysql> create database zabbix;
# mysql> create user 'zabbix'[@'1.2.3.4'] identified by 'zabbix';
# mysql> grant all on zabbix.* to zabbix[@'1.2.3.4'];
# mysql> flush privileges;
# mysql> exit;

Note: [..] is optional

7. update firewall rules for iptables
update iptables firewall rules if necessary.

8. other options
use Percona mysql 5.5 at http://www.percona.com/software/

9. backup mysql - backup database 'jephe'

#!/bin/sh

WEEK=`date +%w`
mysqldump jephe --add-drop-table --add-locks --extended-insert --single-transaction --quick -u jephe -ppassword | bzip2 > /data/backup/jephe_mysql.sql.$WEEK.bz2


How to monitor and tune Microsoft SQL server 2008 performance


Jephe Wu - http://linuxtechres.blogspot.com 

Environment: MS SQL server 2008 R2, Windows server 2008
Objective: SQL server performance monitoring and tuning, database backup and restore procedures


Steps:


Configuration Changes:
According to - How to determine proper SQL Server configuration settings http://support.microsoft.com/kb/319942

SQL Server can obtain a very high level of performance with relatively little configuration tuning. You can obtain high levels of performance by using good application and database design, and not by extensive configuration tuning. See the "References" section of this article for information about how to troubleshoot various SQL Server performance issues. 

When you address a performance problem, the degree of improvement that is available from configuration tuning is typically modest unless you do not currently have the system properly configured. In SQL Server version 7.0 and later, SQL Server uses automatic configuration tuning and it is extremely rare that configuration settings (especially advanced settings) need any changes. Generally, do not make a SQL Server configuration change without overwhelming reason and not without careful methodical testing to verify the need for the configuration change. You must establish a baseline before the configuration change so that you can measure the benefit after the change. 

If you do not have SQL Server properly configured, some settings might de-stabilize the server or might make SQL Server behave erratically. Years of support experience with many different environments indicate that non-default configuration settings might have results that range from neutral to highly negative. 

If you do make a configuration change, you must perform rigorous methodical performance testing both before and after the change to assess the degree of improvement.

Based on actual support scenarios, SQL Server version 7.0 and later can achieve an extremely high level of performance without any manual configuration tuning.

In SQL Server version 7.0 and later, do not make any configuration changes to user connectionslocks, and open objects because, by default, SQL Server dynamically tunes these settings.


Part I: Performance monitoring:

Performance Counter threshold value:
http://www.grumpyolddba.co.uk/monitoring/Performance%20Counter%20Guidance%20-%20SQL%20Server.htm

http://www.grumpyolddba.co.uk/monitoring/Performance%20Counter%20Guidance%20-%20Windows%20Server.htm

http://www.grumpyolddba.co.uk/sql2000/KB702_SQL%20Performance%20Tuning%20using%20Waits%20and%20Queues.mht

http://msdn.microsoft.com/en-us/library/ms189628.aspx - performance objects explanation

1. overview
check CPU and memory first, then disk I/O, lastly network.
because memory problem can trigger disk I/O problem.

then check database internal performance counters, server minimum/maximum memory configuration, data/log files, tempdb data/log files, paging files.

2. CPU bottleneck check
a. firstly check task manager to see if any other application is using most of cpu, rather than sql server. If it's sqlserver, then check further
b. run 'perfmon', add performance counter

  • processor information - % Processor time the percentage of elapsed time that all of process threads used the processor to execution instructions  should <80%. Process is the object created when you run a program; thread is object that spawned by progress to be used to execute instruction.
  • Processor - % user time  should be > 70%
  • Processor - % privileges time -  The operating system switches application threads to privileged mode to access operating system services. should be < 20% 
SQL Server runs in User mode. Privileged mode, is designed for operating system components and allows direct access to hardware and all memory.
  • System - Processor Queue Length Number of threads waiting to be scheduled for CPU time
System:Processor Queue Length is the processor queue for all CPUs on a Windows system. If System: Processor Queue Length is greater than two for each CPU, it may indicates a CPU bottleneck


Some caused for high processor queue length:
1. Unnecessary compilation and recompilation.

2. memory pressure
3. lack of proper indexing

  • System - Context Switches/sec - if it's greater than 20,000 per second, that's a lot.

sys.dm_os_wait_stats for Signal waits and Total waits, if Signal waits > 25% of total waits indicates cpu bottleneck

Note: to counter the number of cpu in sql server:
select cpu_count from sys.dm_os_sys_info

·         SQL Server: SQL Statistics: Batch Requests/sec  the number of batch requests that SQL Server receives per second, indicates how busy you sql server are. This is relative number, depending your CPU power and how busy your database is.
   a 100M ethernet can roughly handle 1000 batch request/sec
·         SQL Server: SQL Statistics: SQL Compilations/sec  (should be as lower as possible < 100 maybe)
·         SQL Server: SQL Statistics: SQL Recompilations/sec  (better to be nearly 0)
Measurement:
 sql recompilation/sec : sql compilation/sec should be < 10%.
 sql compilation/sec : batch request/sec  should be < 10%



3. Memory bottleneck check
The default setting for min server memory is 0, and the default setting for max server memory is 2147483647 MB in MS SQL 2008 


run 'perfmon', check the following:

  • SQL server: Buffer Manager - Average Page Life Expectancy Counter - should not less than 300 (5mins) 

How long can  a page stay in the memory before being flushed out to disk. too lower indicates lack of memory or missing index.(full table scan)

  • SQL server: Memory Manager - memory grant pending  - task waiting for RAM to execute.
  • Memory: - paging file % usage - should be as lower as possible, < 70% based on KB 889654
  • Memory: - paging file % usage peak - should be as lower as possible, < 70% based on KB 889654
  • Memory: - Available MBytes - should not be less than 200
  • Memory: - Page fault /s 
Page fault /s includes both hard faults (those that require disk access) and soft faults (where the faulted page is found elsewhere in physical memory.) Most processors can handle large numbers of soft faults without significant consequence. However, hard faults, which require disk access, can cause significant delays 

  • Memory: - pages/sec - indicator of the amount of paging that Windows is performing
  • Memory: - Page Reads/sec (should <=5)
 This counter is probably the best indicator of a memory shortage because it indicates how often the system is reading from disk because of hard page faults. The system is always using the pagefile even if there is enough RAM to support all of the applications. Thus, some number of page reads will always be encountered. However, a sustained value over 5 Page Reads/sec is often a strong indicator of a memory shortage

If Memory: Pages/sec is greater than zero or Memory: Page Reads/sec is greater than five, Windows is using disk to resolve memory references (hard page fault). This costs disk I/O + CPU resources. 

Note: 
a. If you have less memory, the disk I/O will increase, they are related, you should consider both sides if you suspect your system has performance issue.  

4. Disk I/O bottleneck
We should concentrate on disk queuing instead of on the actual I/O for each disk as actual disk I/O speed cannot be adjusted.
In perfmon, 

  •  PhysicalDisk - avg. Disk Queue Length  (if exceeds twice the number of spindles, then you are likely developing a bottleneck, you have to know how many disk behind the system, if it's SAN system, it might be difficult to get that information, even SAN administrator might not know exactly)

Note: Disk queuing might be caused by short of memory, check pages/sec value also.

  • Physical disk - avg. disk seconds/read - indicates the average time, in seconds, of a read of data from the disk.  (4-8ms is ideal, if > 15ms might indicate disk  bottlenecks

refer to http://technet.microsoft.com/en-US/library/cc917690 for IO bottleneck part



Less than 10 ms - very good
Between 10 - 20 ms - okay
Between 20 - 50 ms - slow, needs attention
Greater than 50 ms – Serious I/O bottleneck

According to http://download.microsoft.com/download/F/1/0/F10BC023-9396-4D67-BAC9-7C43AE954BF4/SQLServer2000_SQL_Performance_Tuning_using_Waits_and_Queues.pdf

If disk sec/read > normal read time (ask vendor for normal read time) you can consider the following options:
1. Resolve IO bottleneck by adding more drives; spreading IO across new drives if possible e.g. move files such as database, transaction log, other application files that are being written to or read from.
2. Check for memory pressure – see memory component.
3. Check for proper indexing of SQL tables. Proper indexing can save IO. Check SQL query plans looking for scans and sorts, etc. Showplan identifies sorting steps.
4. Run SQL Profiler to identify TSQL statements doing scans. In Profiler, select the scans event class & scan stopped event. Go to the data column tab and add object Id. Run the trace. Save the profiler trace to a trace table, and then search for the scans event. Alternately, you can search for high duration, reads, and writes.




  • Physical disk - avg. disk seconds/write (< 8ms (non cached)  and <=1ms for highend SAN) The throughput for high volume OLTP applications is dependent on fast sequential transaction log writes. 


Non cached Writes
Excellent < 08 Msec ( .008 seconds )
Good < 12 Msec ( .012 seconds )
Fair < 20 Msec ( .020 seconds )
Poor > 20 Msec ( .020 seconds )

Cached Writes Only
Excellent < 01 Msec ( .001 seconds )
Good < 02 Msec ( .002 seconds )
Fair < 04 Msec ( .004 seconds )
Poor > 04 Msec ( .004 seconds





  • Physical disk: % disk time - should be < 50%percentage of elapsed time that the selected disk drive was busy servicing read or write requests.




  • Physical disk - avg. disk reads/s  and Physical disk - avg. disk write/s 

Note: this 2 values indicate how busy is your database server, just for measuring purpose to know how busy your system is , compare to 3 month or a year ago


Should maintain above 2 values < 85% disk capacity


Example:
When you use these counters, you may need to adjust the values for RAID configurations using the following formulas:
·         Raid 0 -- I/Os per disk = (reads + writes) / number of disks
·         Raid 1 -- I/Os per disk = [reads + (2 * writes)] / 2
·         Raid 5 -- I/Os per disk = [reads + (4 * writes)] / number of disks
·         Raid 10 -- I/Os per disk = [reads + (2 * writes)] / number of disks
For example, you might have a RAID-1 system with two physical disks with the following values of the counters.

Disk Reads/sec            80
Disk Writes/sec           70
Avg. Disk Queue Length    5



Then to maintain a raid configuration below, you need:
Raid 0 - need 75 IO per disk
Raid 1 - need 110 IO per disk
Raid 5 - need 180 IO per disk
Raid 10 - need 110 IO per disk

The method of watching out for disk queuing associated with SQL Server log files is different from SQL Server database files. For log file, we use 


SQL Server: database: log flush wait time 
SQL server: database: log flush waits/sec

c. raid level - use raid 10 if possible

Each SQL Server write to the mirrorset results in two disk I/O operations

When dynamiclly adding new disk into raid array, The RAID controller will move some existing SQL Server data to these new drives so data is evenly distributed across all drives in the RAID array

d. sequential I/O
sequentional I/O is faster than non-sequential one, most of hard disk perform 2 times better for sequential than non-sequential I/O operations

Note: Logs generally are not a major concern because transaction log data is always written sequentially to the log file in sizes ranging up to 32 KB.

5. SQL Server

See SQL Server, Buffer Manager Object -  http://msdn.microsoft.com/en-us/library/ms189628.aspx for various objects explanation

  • SQL Server: Buffer Manager: Buffer Cache Hit Ratio, should be at least 90%

Note: a major part of any database server environment is the management of memory buffer cache. less than 90% may indicate memory pressure or missing index.

See SQL Profiler: Stored Procedure: CacheHit, CacheMiss, and CacheInsert to see what stored procedure query plans are already in cache (Hit), vs. those not in cache (Miss,Insert)

  • SQL server: cache manager: Cache Hit Ratio - Percentage of time the procedure plan pages are already in cache e.g. procedure cache hits. I.e. how often a compiled procedure is found in the procedure cache (thus avoiding the need to recompile).


  • SQL server: Buffer Manager: Page Life expectancy -

Sudden big drop in page life expectancy - DW applications (e.g. big transactions) could experience big drops in page life expectancy. This is due to a cache flush from a big read
or
average page life expectancy less than 300 which is 5 minutes - OLTP database - indicate memory pressure, missing indexes, or a cache flush.



  • SQL server: access method : Forwarded Records/sec


Number of records fetched through forwarded record pointers.
Tables with NO clustered index. If you start out with a short row, and update the row creating a wider row, the row may no longer fit on the data page. A pointer will be put in its place and the row will be forwarded to another page.

Look at code to determine where the short row is inserted followed by an update.
Can be avoided by:
1. Using Default values (so that an update will not result in a longer row that is the root cause of forwarded records).
2. Using Char instead of varchar (fixes length so that an update will not result in a longer row

  • SQL Server: access method: Full Scan/sec The number of unrestricted full scans. These can either be base table or full index scans.  -  should be as lower as possible > 1 might indicate problem

  SQL Profiler can be used to identify which TSQL statements do scan. Select the scans event class & events scan:started and scan:completed. Include the object Id data column. Save the profiler trace to a trace table, and then search for the scans event.
The scan:completed event will provide associated IO so you can also search for high reads, writes, and duration.

  • MS SQL Server: Access method: index search/s  (ideally index search/s : full scan /s should > 1000)
Index searches are used to start range scans, single index record fetches, and to reposition within an index. Compare to Full Scan/sec. You want to see high values for index searches.

  • MS SQL Server: Access method: Page Splits/secNumber of page splits occurring as the result of index pages overflowing. Normally associated with leaf pages of clustered indexes and non-clustered indexes.

Page splits are extra IO overhead that results from random inserts.
When there is no room on a data page, and the row must be inserted on the page (due to index order), SQL will split the page moving half the rows to a new page, and then insert the new row.
Correlate to Disk: page sec/write. If this is very high, you may reorg the index(es) on the table(s) causing the page splits, to reduce page splits temporarily. Fillfactor will leave a certain amount of space available for inserts. see http://msdn.microsoft.com/en-us/library/aa933139(v=sql.80).aspx for what's fillfactor.

  • MS SQL Server: Buffer Manager: Checkpoint pages/secPages written to disk during the checkpoint process, freeing up SQL cache
Indicates the number of pages flushed to disk per second by a checkpoint or other operation that require all dirty pages to be flushed.

Memory pressure is indicated if this counters is high along with high lazy writes/sec and low page life expectancy (<300 seconds)

  • MS SQL Server: Buffer Manager: Lazy writes/sec Pages written to disk by the lazywriter, freeing up SQL cache
Indicates the number of buffers written per second by the buffer manager's lazy writer. The lazy writer is a system process that flushes out batches of dirty, aged buffers (buffers that contain changes that must be written back to disk before the buffer can be reused for a different page) and makes them available to user processes. The lazy writer eliminates the need to perform frequent checkpoints in order to create available buffers.


Memory pressure is indicated if above this counters is high along with high lazy writes/sec and low page life expectancy (<300 seconds)

  • MS SQL Server: Buffer Manager: Readahead pages/sec -  indicates the number of pages read per second in anticipation of use.

If memory shortages, cold cache, or low hit rates, SQL may use worker threads to readahead (bring in pages ahead of time) to raise hit rates. By itself readahead is not a problem unless users are flushing each other’s pages consistently.
Check for proper indexing and bad query plans (scans in profiler)


  • SQL Server: database: log flush wait time (waiting for transaction log write request to complete <ms>)  and SQL server: database: log flush waits/sec (Tranlog writes per second) and
and  SQL server: database: log growths (Windows will automatically grow transaction log to accommodate insert, update, and delete activity.)

In general, growths of the transaction log will temporarily freeze writes to the transaction log while Windows grows the transaction log file. Check to see that the growth increment is large enough. If not, performance will suffer as log growths will occur more often.
  • SQL Server: log file size and used log file size
  • SQL Server: database file size
  • SQL server: tempdb file size
  • SQL server: database: Transactions /sec SQL Server transactions per second

  • MS SQL server: Transactions - Longest Transaction Running time.


check long transaction records view
sys.dm_tran_database_transactions - columns of particular interest include the time of the first log record (database_transaction_begin_time), the current state of the transaction (database_transaction_state), and the log sequence number (LSN) of the begin record in the transaction log (database_transaction_begin_lsn).
In perfmon, check

  • General statistics: login/s (number of logins per second), logout/s 
  • General statistics:  User Connections : not very accurate since some application uses connection pool, there are many users behind connection pool
  • Latches : Average Latch Wait Time(ms)
Latches are short term light weight synchronization object. Latches are not held for the duration of a transaction. Typical latching operations during row transfers to memory, controlling modifications to row offset table, etc

  • Latches: Latch Waits/sec
  • Lock - Average Wait Time(ms)  Transactions should be as short as possible to limit the blocking of other users.
  • Lock - Lock Waits/sec
  • Lock - number of deadlocks/sec 



sys.dm_os_wait_stats - check top wait statistics
sys.dm_db_index_operational_stats - check lock and latch wait

8. Network bottlenecks
in perfmon, choose the following:
a. Network Interface - Current Bandwidth.
b. packets/sec 

Note: actual bandwidth is computed as packets/sec * 1500 * 8 /1000000 Mbps.

9. tempdb

Also check paging file, separate data file and log file drive, tempdb data/log should be at separate drive. 


Monitor the following Performance Monitor counters for any unusual increase in the temporary objects allocation/deal location activity:
·         SQL Server:Access Methods\Workfiles Created /Sec
·         SQL Server:Access Methods\Worktables Created /Sec
·         SQL Server:Access Methods\Mixed Page Allocations /Sec
·         SQL Server:General Statistics\Temp Tables Created /Sec
·         SQL Server:General Statistics\Temp Tables for destruction

Solution:  Increase the tempdb data files by an equal amount to distribute the workload across all of the disks and files. Ideally, you want to have as many files as there are CPUs (taking into account the affinity).

Move tempdb data and logfile to new location:

alter database tempdb modify file 
     (name='tempdev',filename= 'e:\mssql7\tempnew_location.mDF')
alter database tempdb modify file 
     (name='templog',filename= 'c:\temp\tempnew_loglocation.mDF')


Part II - performance tuning
1. CPU/Memory bottlenecks - upgrade server or adding more RAM

Note:
AWE: check it on 32-bit servers with >4GB of memory, and unchecked the rest of the time.

2. reduce disk I/O 

  • buy faster disk (15k rpm)
  • using database partition to reduce I/O
  • separate data and log file disk
  • increase temp tablespace initial size and put to separate disk other than data/log files
  • configuring suitable maximum server memory in sql server to leave some for OS
  • suitable raid level:raid 0 or raid 0+1(raid 1/0 or raid10) , mirrored stripes (ms) , stripe first, mirror later, raid 0 first. Raid 10 is the bestfiles.


3. MS SQL server

  • create and maintain good index

  recently used index is in sys.dm_db_index_usage_stats, any defined index not included in this DMV has not been used since the last re-start of SQL Server.

  •   Too many indexes need frequent updates or unused indexes that need frequently maintenance

(insert/select/update) will create overhead, but without benefiting any users.

  •   create appropriate indexing to avoid excessive sorting operations
  •   Big IOs such as table and range scans due to missing indexes


Reduce table join for frequent queries. join overuse will result in long run query and waste resources.


  • Partition large data sets and indexes



  • Tune applications and queries.

  It is very important for application developers to understand SQL Server architectural basics and how to take full advantage of SQL Server indexes to minimize I/O.


  • Lack of useful statistics:

  http://msdn.microsoft.com/en-us/library/ms187348(v=sql.105).aspx
  http://msdn.microsoft.com/en-us/library/ms190397(v=sql.105).aspx#UpdateStatistics
the following command calls sp_updatestats to update all statistics for the database.

EXEC sp_updatestats

other usage:
UPDATE STATISTICS Sales.SalesOrderDetail;
DBCC SHOW_STATISTICS

AUTO_CREATE_STATISTICS
AUTO_UPDATE_STATISTICS
AUTO_UPDATE_STATISTICS_ASYNC


Master database files:

The master database, msdb, and model databases are not used much during production compared to user databases, so it is typically not necessary to consider them in I/O performance tuning considerations. The master database is usually used only for adding new logins, databases, devices, and other system objects.


Part III - References:
1. http://sqlcat.com/sqlcat/b/top10lists/archive/2007/11/21/top-sql-server-2005-performance-issues-for-oltp-applications.aspx
2. SQL server performance tuning guide for dataware house -  http://technet.microsoft.com/library/Cc966420
3. http://msdn.microsoft.com/en-us/library/ms178067.aspx
4. http://sqlcat.com/sqlcat/b/top10lists/archive/2007/11/21/top-10-sql-server-2005-performance-issues-for-data-warehouse-and-reporting-applications.aspx
5. sql long transaction view - http://msdn.microsoft.com/en-us/library/ms186957(v=sql.105).aspx
6. update statistics - http://msdn.microsoft.com/en-us/library/ms187348(v=sql.105).aspx
7. Performance Monitor Counters -  http://technet.microsoft.com/en-us/library/cc768048.aspx
8. performance and tuning howto - http://msdn.microsoft.com/en-us/library/ms191511(v=sql.105).aspx
9. http://technet.microsoft.com/en-us/library/ms190619(v=sql.105)
10. establishing a performance baseline - http://technet.microsoft.com/en-us/library/cc781394(v=WS.10).aspx
11. storage best practise - http://technet.microsoft.com/library/Cc966534

12. Troubleshooting performance problem in sql server 2008 - http://msdn.microsoft.com/en-us/library/dd672789(v=SQL.100).aspx#phrss   - excellent doc


13. best practice for sql sever maintenance - http://technet.microsoft.com/en-US/library/cc966447
14. SAN storage best practise for sql server - http://www.brentozar.com/sql/sql-server-san-best-practices/
15. sql server 2008 I/O performance - http://blogs.technet.com/b/josebda/archive/2009/03/31/sql-server-2008-i-o-performance.aspx
16. diagnosing and resolving Latch Content for SQL server 2008 R2 - http://www.microsoft.com/en-us/download/details.aspx?id=26665

17. How to: Configure SQL Server to Use Soft-NUMA- http://msdn.microsoft.com/en-us/library/ms178144(v=sql.105).aspx

Part IV - useful DMVs

1. DMV list: http://msdn.microsoft.com/en-us/library/ms179984.aspx
2. commands:
select * from sys.dm_os_wait_stats order by wait_time_ms
select * from sys.dm_os_latch_stats;


Select  wait_type,
        waiting_tasks_count,
        wait_time_ms
from  sys.dm_os_wait_stats 
where wait_type like 'PAGEIOLATCH%' 
order by wait_type

I/O wait at this moment

select
    database_id,
    file_id,
    io_stall,
    io_pending_ms_ticks,
    scheduler_address
from  sys.dm_io_virtual_file_stats(NULL, NULL)t1,
        sys.dm_io_pending_io_requests as t2
where t1.file_handle = t2.io_handle



top 10 wait type

select top 10 *
from sys.dm_os_wait_stats
order by wait_time_ms desc



Part V - best practise
1. Isolate log from data at the physical disk level - 

  •     separate database data and log file disk,  Pre-size data and log files.

          According to http://technet.microsoft.com/library/Cc966534It is recommended to have .25 to 1 data files (per filegroup) for each CPU on the host server. 

  •     tempdb data/logfile should on different disk spindle

       create one data file for tempdb per cpu  (Dual core counts as 2 CPUs; logical procs (hyperthreading) do not)
       depending on tempdb usage, use raid10 for storage if possible

  • backup should go to separate disk
  • If you have real time anti-virus monitoring, it is recommended that you exclude the SQL Server database files (including data files, transaction log files, tempdb and other system database files) from real time monitoring
  •  Update statistics on the largest tables weekly or monthly
  • Rebuild or defrag the most important indexes

2.  storage disk for data file should not be shared with other application etc, dedicated for data file with enough spindle
3.  vcpu and memory amount should be same as physical machine baseline if running in virtual environment.
4. If raid 1/0 or raid 1 for logfile disk if possible.  RAID 1+0 provides better write performance than any other RAID level providing data protection, including RAID 5.


Part VI - testing tools
1. sql nexus
2. sql diag
3. sqlIO , alternatively , use IOMETER
download sqlio at http://www.microsoft.com/en-us/download/details.aspx?id=20163
http://www.brentozar.com/archive/2008/09/finding-your-san-bottlenecks-with-sqlio/

SAN performance tuning with sqlio - http://sqlserverpedia.com/wiki/SAN_Performance_Tuning_with_SQLIO

4. perfmon and sql profiler
http://www.simple-talk.com/sql/database-administration/correlating-sql-server-profiler-with-performance-monitor/  good article talking about how to corelate profiler and perfmon data

5. latch wait


For a non-production environment only, clear the sys.dm_os_wait_stats DMV with
the following command:
dbcc SQLPERF ('sys.dm_os_wait_stats', 'CLEAR')
A similar command can be run to clear the sys.dm_os_latch_stats DMV:
dbcc SQLPERF ('sys.dm_os_latch_stats', 'CLEAR')


Part VII - backup and restore


1. restore master datbase in single user mode -  http://technet.microsoft.com/en-us/library/ms190679.aspx
2. Performing a Complete Database Restore (Full Recovery Model)
http://msdn.microsoft.com/en-us/library/ms187495(v=sql.105).aspx

http://msdn.microsoft.com/en-us/library/ms175199(v=sql.105).aspx

Part VIII - FAQ
1. if performance counter doesn't load or corrupted, you can run this to fix:


cd \windows\system32
lodctr /R