HOWTO Setup S.M.A.R.T. hard-drive Monitoring

Latest revision as of 16:39, 15 April 2010

Typically, we use a 3Ware controller with anywhere from 2 to 16 individual hard drives attached. Although we already use 3Ware's tw_cli tool in both a daily cron job and a Nagios monitor, it's also a good idea to get a daily logwatch email stanza with drive statistics and alerts. Belts and suspenders, you know.

In addition to (passive) monitoring, we actively invoke Long Self-Tests on each drive. These tests are scheduled to hit low-usage times and to avoid our tape-backup times.
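smartd schedules self-tests through the -s directive, whose argument is a regular expression matched against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week, hour). The day of week runs from 1 (Monday) to 7 (Sunday). The sketch below is one way to check a pattern before putting it into smartd.conf. The function name schedule_matches is hypothetical, and anchoring the pattern at both ends is an assumption made here for strictness.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a smartd -s pattern matches a timestamp
# string of the form T/MM/DD/d/HH (d: 1=Monday .. 7=Sunday).
schedule_matches() {
    pattern=$1
    stamp=$2
    # Anchor both ends so that only a full match counts (an assumption).
    printf '%s\n' "$stamp" | grep -Eq "^(${pattern})\$"
}

# Build the string for a Long test at the current hour and try a pattern.
now=$(date +'L/%m/%d/%u/%H')
if schedule_matches 'L/../../7/01' "$now"; then
    echo "pattern fires now ($now)"
else
    echo "pattern does not fire now ($now)"
fi
```

Because date's %u also counts Monday as 1 and Sunday as 7, the generated string lines up with smartd's weekday numbering.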

Install the (Gentoo) package:

hostname # emerge -av smartmontools

Start the smartd monitoring daemon automatically in the default runlevel:

hostname # rc-update add smartd default

Representative configuration file:

hostname # emacs -nw /etc/smartd.conf
# This file:  /etc/smartd.conf
# created April 14, 2010, Gordon Pritchard

# Monitor 16 SATA disks connected to a 3ware 9650 controller which
# uses the (built-into-kernel) 3w-9xxx driver.

# The first two drives are WD 150GB Raptor, configured as RAID-1
#  long selftest takes 72min
# The remaining 14 drives are Seagate 500GB Barracuda, configured as RAID-5
#  long selftest takes 120min


/dev/twa0 \
     -s L/../../7/01 \    # Long Self-tests Sundays between 1-2:30am (day-of-week is 1=Monday .. 7=Sunday)
     -d 3ware,0 \         # First physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../7/03 \    # Long Self-tests Sundays between 3-4:30am (day-of-week is 1=Monday .. 7=Sunday)
     -d 3ware,1 \         # Second physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../7/22 \    # Long Self-tests Sundays between 10pm-midnight (day-of-week is 1=Monday .. 7=Sunday)
     -d 3ware,2 \         # Third physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../1/01 \    # Long Self-tests Mondays between 1-3am
     -d 3ware,3 \         # Fourth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../1/22 \    # Long Self-tests Mondays between 10pm-midnight
     -d 3ware,4 \         # Fifth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../2/01 \    # Long Self-tests Tuesdays between 1-3am
     -d 3ware,5 \         # Sixth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../2/22 \    # Long Self-tests Tuesdays between 10pm-midnight
     -d 3ware,6 \         # Seventh physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../3/01 \    # Long Self-tests Wednesdays between 1-3am
     -d 3ware,7 \         # Eighth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../3/22 \    # Long Self-tests Wednesdays between 10pm-midnight
     -d 3ware,8 \         # Ninth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../4/01 \    # Long Self-tests Thursdays between 1-3am
     -d 3ware,9 \         # Tenth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../4/22 \    # Long Self-tests Thursdays between 10pm-midnight
     -d 3ware,10 \        # Eleventh physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../5/01 \    # Long Self-tests Fridays between 1-3am
     -d 3ware,11 \        # Twelfth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../5/22 \    # Long Self-tests Fridays between 10pm-midnight
     -d 3ware,12 \        # Thirteenth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../6/01 \    # Long Self-tests Saturdays between 1-3am
     -d 3ware,13 \        # Fourteenth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../6/03 \    # Long Self-tests Saturdays between 3-5am
     -d 3ware,14 \        # Fifteenth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)

/dev/twa0 \
     -s L/../../6/22 \    # Long Self-tests Saturdays between 10pm-midnight
     -d 3ware,15 \        # Sixteenth physical drive, behind 3Ware controller
     -I 194 \             # Ignore temperature
     -I 190 \             # Ignore Air Temperature  
     -I 9 \               # Ignore power-on hours
     -a                   # Report on:
                          #  SMART health status (-H)
                          #  usage failures (-f)
                          #  changes in Prefailure, Usage (-t)
                          #  increases in error (-l error)
                          #  increases in self-test errors (-l selftest)
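Since the sixteen stanzas differ only in port number, day and hour, they could also be generated rather than typed by hand. The following is a minimal sketch, not the method used to produce the file above: the function name stanza is hypothetical, the explanatory comments are left out, and it emits -I 190 for every drive even though the first two stanzas above omit it.

```shell
#!/bin/sh
# Hypothetical generator: print one smartd.conf stanza for a drive behind
# the 3ware controller at /dev/twa0.
# Arguments: port (0-15), day of week (1=Monday .. 7=Sunday), hour (00-23)
stanza() {
    port=$1
    day=$2
    hour=$3
    printf '/dev/twa0 \\\n'
    printf '     -s L/../../%s/%s \\\n' "$day" "$hour"
    printf '     -d 3ware,%s \\\n' "$port"
    printf '     -I 194 \\\n'
    printf '     -I 190 \\\n'   # assumption: applied to every drive
    printf '     -I 9 \\\n'
    printf '     -a\n\n'
}

# Example: fourth physical drive, Mondays, starting at 01:00.
stanza 3 1 01
```

Review the generated text before appending it to /etc/smartd.conf.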

Available SMART parameters

To find which parameters are available for monitoring or ignoring, run the smartctl command:

hostname # smartctl --all /dev/twa0 -d 3ware,0   (of course, change the ,0 to any other 3Ware-connected drive you're interested in)

How Long Does a (Long) Selftest Take

To learn how long you should allow for a Long Self-Test, query the drive's capabilities (typical values are 72min for a 150GB WD Raptor, and 150min for a 500GB WD RE2):

hostname # smartctl -c /dev/twa0 -d 3ware,0   (of course, change the ,0 to any other 3Ware-connected drive you're interested in)
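With sixteen drives, reading each capabilities report by hand gets tedious. The sketch below pulls out just the duration, assuming the report prints the line "Extended self-test routine" followed by a line containing the recommended polling time in minutes. That layout may differ between smartmontools versions, and the function name extended_minutes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "smartctl -c" output on stdin and print the
# recommended polling time (in minutes) for the extended (long) self-test.
# Assumes the number appears on the line right after the line
# "Extended self-test routine".
extended_minutes() {
    awk '/Extended self-test routine/ {
             getline
             if (match($0, /[0-9]+/))
                 print substr($0, RSTART, RLENGTH)
         }'
}

# Intended use on the real machine (needs root and the 3ware controller):
#   port=0
#   while [ "$port" -le 15 ]; do
#       printf 'port %s: ' "$port"
#       smartctl -c /dev/twa0 -d "3ware,$port" | extended_minutes
#       port=$((port + 1))
#   done
```

The printed numbers can then be compared with the time windows in the -s comments of /etc/smartd.conf.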