Home Page
  • March 28, 2024, 08:18:52 pm *
  • Welcome, Guest
Please login or register.

Login with username, password and session length
Advanced search  

News:

Official site launch very soon, hurrah!


Author Topic: Always Confirm Potentially Hazardous Actions  (Read 10072 times)

Dakusan

  • Programmer Person
  • Administrator
  • Hero Member
  • *****
  • Posts: 535
    • View Profile
    • Dakusan's Domain
Always Confirm Potentially Hazardous Actions
« on: September 28, 2009, 05:31:02 am »


So I have been having major speed issues with one of our servers. After countless hours of diagnoses, I determined the bottle neck was always I/O (input/output, accessing the hard drive).  For example, when running an MD5 hash on a 600MB file load would jump up to 31 with 4 logical CPUs and it would take 5-10 minutes to complete. When performing the same test on the same machine on a second drive it finished within seconds.

Replacing the hard drive itself is a last resort for a live production server, and a friend suggested the drive controller could be the problem, so I confirmed that the drive controller for our server was not on-board (on its own card), and I attempted to convince the company hosting our server of the problem so they would replace the drive controller. I ran my own tests first with an iostat check while doing a read of the main hard drive (cat /etc/sda > /dev/null). This produced steadily worsening results the longer the test went on, and always much worse than our secondary drive. I passed these results on to the hosting company, and they replied that a “badblocks –vv” produced results that showed things looked fine.

So I was about to go run his test to confirm his findings, but decided to check parameters first, as I always like to do before running new Linux commands.  Thank Thor I did. The admin had meant to write “badblocks –v” (verbose) and typoed with a double key stroke. The two v’s looked like a w due to the font, and had I ran a “badblocks –w” (write-mode test), I would have wiped out the entire hard drive.

Anyways, the test outputted the same basic results as my iostat test with throughput results very quickly decreasing from a remotely acceptable level to almost nil.  Of course, the admin only took the best results of the test, ignoring the rest.

I had them swap out the drive controller anyways, and it hasn’t fixed things, so a hard drive replace will probably be needed soon.  This kind of problem would be trivial if I had access to the server and could just test the hardware myself, but that is a price to pay for proper security at a server farm.

Logged