Dakusan's Domain Forum

Main Site Discussion => Posts => Topic started by: Dakusan on September 28, 2009, 05:31:02 AM

Title: Always Confirm Potentially Hazardous Actions
Post by: Dakusan on September 28, 2009, 05:31:02 AM
Original post for Always Confirm Potentially Hazardous Actions can be found at https://www.castledragmire.com/Posts/Always_Confirm_Potentially_Hazardous_Actions.
Originally posted on: 04/23/08

So I have been having major speed issues with one of our servers. After countless hours of diagnosis, I determined the bottleneck was always I/O (input/output, accessing the hard drive). For example, when running an MD5 hash on a 600MB file, load would jump up to 31 with 4 logical CPUs, and it would take 5-10 minutes to complete. When performing the same test on the same machine on a second drive, it finished within seconds.
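
For anyone wanting to repeat that kind of comparison, here is a minimal sketch of timing an MD5 hash of a file. It is not the exact command I ran; the function name `timed_md5` and the 1MB chunk size are just assumptions for illustration.

```python
import hashlib
import time


def timed_md5(path, chunk_size=1024 * 1024):
    """Return (hex digest, elapsed seconds) for an MD5 hash of the file at path."""
    md5 = hashlib.md5()
    start = time.perf_counter()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so a large file never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest(), time.perf_counter() - start
```

Running it against a copy of the same file on each drive gives two timings to compare. Keep in mind that a repeat run on the same file may be served from the operating system's cache and look much faster than the disk really is.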

Replacing the hard drive itself is a last resort for a live production server, and a friend suggested the drive controller could be the problem. So I confirmed that the drive controller for our server was not on-board (it is on its own card), and I attempted to convince the company hosting our server of the problem so they would replace the drive controller. I ran my own tests first with an iostat check while doing a read of the main hard drive (cat /dev/sda > /dev/null). This produced steadily worsening results the longer the test went on, and always much worse than our secondary drive. I passed these results on to the hosting company, and they replied that a “badblocks -vv” produced results that showed things looked fine.
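
A rough way to see the same "worse the longer it runs" pattern without watching iostat is to read sequentially and record throughput per window. This is only a sketch under my own assumptions: `read_throughput`, the window size, and the `max_bytes` cap are hypothetical names and defaults, and reading a raw device such as /dev/sda would need root. It only reads, so it does not modify the drive.

```python
import time


def read_throughput(path, chunk_size=1024 * 1024, window_chunks=64, max_bytes=None):
    """Read path sequentially and return a list of MB/s figures, one per full window."""
    rates = []
    total = 0
    window_bytes = 0
    chunks_in_window = 0
    window_start = time.perf_counter()
    # buffering=0 avoids Python's own read buffer; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        while max_bytes is None or total < max_bytes:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            window_bytes += len(chunk)
            chunks_in_window += 1
            if chunks_in_window == window_chunks:
                elapsed = time.perf_counter() - window_start
                # Guard against a zero interval on very fast reads.
                rates.append(window_bytes / (1024 * 1024) / max(elapsed, 1e-9))
                window_bytes = 0
                chunks_in_window = 0
                window_start = time.perf_counter()
    # A trailing partial window is dropped so every figure covers the same amount of data.
    return rates
```

A healthy drive should give a fairly flat list of numbers; figures that keep falling from one window to the next would match what I saw on the main drive.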

So I was about to go run the admin’s test to confirm his findings, but decided to check the parameters first, as I always like to do before running new Linux commands. Thank Thor I did. The admin had meant to write “badblocks -v” (verbose) and typoed with a double keystroke. The two v’s looked like a w due to the font, and had I run a “badblocks -w” (write-mode test), I would have wiped out the entire hard drive.
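
The habit of checking flags first can also be partly automated. The sketch below is purely hypothetical and not a tool I actually used: `HAZARDOUS_FLAGS` is a hand-maintained table I made up for illustration (it only knows that badblocks -w is destructive), and `hazardous_flags` and `confirm` are assumed helper names. It is no substitute for reading the man page.

```python
import os
import shlex

# Hypothetical, hand-maintained table: command name -> short flags that destroy data.
HAZARDOUS_FLAGS = {"badblocks": {"w"}}


def hazardous_flags(command_line):
    """Return the set of known-destructive short flags present in command_line."""
    argv = shlex.split(command_line)
    if not argv:
        return set()
    # Accept both "badblocks" and a full path such as "/sbin/badblocks".
    risky = HAZARDOUS_FLAGS.get(os.path.basename(argv[0]), set())
    found = set()
    for arg in argv[1:]:
        # Only look at short-option clusters such as -w or -sv, not --long options.
        if arg.startswith("-") and not arg.startswith("--"):
            found |= risky & set(arg[1:])
    return found


def confirm(command_line, ask=input):
    """Return True if no known-destructive flag is present or the user types 'yes'."""
    found = hazardous_flags(command_line)
    if not found:
        return True
    flags = ", ".join("-" + f for f in sorted(found))
    answer = ask(f"{command_line!r} uses destructive flag(s) {flags}. Type 'yes' to continue: ")
    return answer.strip().lower() == "yes"
```

With this, `confirm("badblocks -vv /dev/sda")` passes straight through, while `confirm("badblocks -w /dev/sda")` stops and asks before anything gets run.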

Anyways, the test output the same basic results as my iostat test, with throughput very quickly decreasing from a remotely acceptable level to almost nil. Of course, the admin only took the best results of the test, ignoring the rest.

I had them swap out the drive controller anyways, and it hasn’t fixed things, so a hard drive replacement will probably be needed soon. This kind of problem would be trivial if I had access to the server and could just test the hardware myself, but that is the price to pay for proper security at a server farm.