About the author.

Welcome to The blog of whall

Come on in and stay a while… laugh a little. Maybe even think.

Hi, This is Wayne. This is my site, my stuff, my blog, blahblahblah. The site itself is powered by WordPress and the Scary Little theme. I thought it was cool, and I still do.

September 11, 2008, 2:41 pm

ALERT! ALERT! ALERT! ALERT! ALERT! ALERT!

ALERT! ALERT! ALERT! ALERT! ALERT! ALERT!

This is another one of those semi-technical posts about my blog. Proceed at your own peril.

Question: Would you complain?

I mean, let’s say you have a blog. It’s not a business, you don’t rake in donations for the homeless, and people on breathing machines don’t depend on your site for their next life-giving gasp of air. It’s. Just. A. Blog.

Despite the monotonous blogtitude of your blog’s existence (it’s a blog every day, day in, day out, it’s never anything but a blog), it still has a yearning to, well, exist. It (the blog) really doesn’t like it when it, um, isn’t. Isn’ting isn’t a very cool thing to be. IS is where it’s at.

The Blog Is. I Am. We Are (the world). That’s what she said.

Anyway, where was I? Oh yeah, Is.

For a long long time, my blog was. It was up. It was around. It was on. Then, a couple of times recently, it was broken. So what do you do? Do you call your ISP and complain to them “hey, like, hey man, like my blog. Something’s wrong with it” and then they say “it looks fine to me!” and then I’m all like “whoa, man, like, you’re RIGHT. It is cool. That’s a trip.”

And so on it goes.

However, I have a little thing in my back pocket called MONITORING. I work for a nationwide ISP that prides itself on uptime, customer service, and actually proving whether things are up or not. One of the products we used to have is called NetCool ISM, and we still have it around, although it’s a very old version and probably not supported any more.

The rest of my troubleshooting, replete with pretty eye candy, graphs, technical jargon, and unix stuff, is in the extended entry below.

Using this tool, I would check out the “uptime” of my website(s). I have it checking every 5 minutes for http://whall.org/blog and then letting me know different things like how long it took to load, whether it loaded or not, if any errors happened or what.
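
For the curious, a single check like that boils down to something like the following. This is a minimal Python sketch and not how NetCool ISM works internally; the name `check_site`, the 10-second timeout, and the shape of the result are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def check_site(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch url once; report whether it loaded, how long it took, and any error.

    `opener` is only a parameter so the function can be exercised
    without touching the network.
    """
    start = time.monotonic()
    status, error = None, None
    try:
        with opener(url, timeout=timeout) as response:
            response.read()  # pull the whole page so the timing covers the full load
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error code such as 500.
        status, error = exc.code, str(exc)
    except OSError as exc:
        # No usable answer at all: DNS failure, refused connection, timeout.
        error = str(exc)
    return {
        "up": error is None and status is not None and 200 <= status < 400,
        "status": status,
        "seconds": time.monotonic() - start,
        "error": error,
    }
```

Calling `check_site("http://whall.org/blog")` once every 300 seconds and logging each result would give roughly the raw data behind the reports described here.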

Here’s what a typical day used to look like (this is from a week ago, 9/3/2008)

Looks good, eh? 100% uptime is pretty good in my book!

A couple of days before that, there was a minor outage near midnight, but I’m fine with a little maintenance now and again.

99.33% still isn’t that bad, and since the issues happened in the 10pm hour, I figure it’s no biggie.

But take a look at this week so far:

Each day, the uptime has gotten worse and worse. If I click into Wednesday to see the detailed hourly breakdown, we see a wildly different set of statistics than the typical day from last week:

That is abysmal! 50% downtime in two “prime time” hours (1pm, 2pm)  (side note: thanx to Avi for letting me know my site was down – this is what prompted this post in the first place).
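
An hourly uptime number like that is just passed checks divided by total checks for the hour. Here is a small Python sketch of the arithmetic, assuming one check every 5 minutes (12 per hour). The function name `hourly_uptime` and the sample data are made up for illustration; they are not the real numbers from the monitoring tool.

```python
from collections import defaultdict


def hourly_uptime(checks):
    """checks: iterable of (hour, passed) pairs. Returns {hour: percent_up}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for hour, ok in checks:
        total[hour] += 1
        if ok:
            passed[hour] += 1
    return {hour: 100.0 * passed[hour] / total[hour] for hour in sorted(total)}


# Made-up sample: in the 1pm hour every other check fails, the 3pm hour is clean.
sample = [(13, i % 2 == 0) for i in range(12)] + [(15, True)] * 12
print(hourly_uptime(sample))  # {13: 50.0, 15: 100.0}
```

With checks that far apart, one failed check costs a bit over 8 points of uptime for its hour, so 50% means six misses out of twelve.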

So the next thing, from a purely testing point of view, is to determine if the monitoring system itself has issues. I checked other sites we’re monitoring, and they all show nothing but green. So that’s not it.

Next to test is Apache, the web server, vs PHP, the script interpreter. So I create a test html file and a test php script that just does phpinfo() to see if they have any differences. I’ll be watching that in the next day or so. If I see that the html file test is green and the php script isn’t, then at least we’ve narrowed down something. If both of them are green, but the blog isn’t, then the next test will be the MySQL server. Maybe that’s what’s been having the problems.
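
That process of elimination can be written down as a tiny decision function. This is a sketch only: the name `suspect` and its three arguments are hypothetical stand-ins for the three monitoring profiles (static html file, phpinfo() script, the blog itself), and the first branch is an assumption the post does not spell out.

```python
def suspect(html_ok, php_ok, blog_ok):
    """Guess which layer to look at next from three check results."""
    if not html_ok:
        # Assumption, not stated in the post: if even a static file fails,
        # the problem sits below PHP.
        return "web server or the machine itself"
    if not php_ok:
        return "PHP"
    if not blog_ok:
        return "MySQL"
    return "nothing, all green"


print(suspect(html_ok=True, php_ok=True, blog_ok=False))  # MySQL
```

Ordering the tests from the simplest layer up means the first red result points at the lowest broken layer.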

HEH – so I go in to add the two new monitoring profiles and wait the obligatory 5-10 minutes for stats to gather. And it shows DOWN. Wha??? So I check the other monitoring and they also indicate down. So I check my blog, and guess what – it’s down also.

SHEESH

I log in via ssh and check the status of the server, and its load average is way off the charts:

top - 14:22:41 up 42 days, 15:51, 8 users, load average: 28.89, 22.23, 16.05
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 4.8%sy, 0.0%ni, 0.0%id, 95.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 4148308k total, 1718856k used, 2429452k free, 472k buffers
Swap: 265064k total, 1636k used, 263428k free, 862504k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3531 u3530304 22 2 5516 1984 1352 S 0 0.0 0:00.34 bash
14632 u3530304 22 2 2112 996 812 R 0 0.0 0:00.26 top

So I call in, describe the problem and they say they’re aware of the problem. I try telling them to inform the admin of a load average of 28+ and a 95% wait state, but I realize it’s a lost cause – the folks on the phone really don’t understand what’s going on.

For those of you who know not of which I speak, here’s the quick breakdown:

Load Average:
(what I’ve heard and lived for 10+ years): You want this at 1x #cpu or less. Any higher than 1x #cpu and your system is throttled or bottlenecked by something. I doubt the server I’m on has 28 CPUs. If the load average were high and CPU utilization were high with “us” (user) or “sy” (system), then it would need a faster CPU (or more of them). But that’s not the case here.
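
The "1x #cpu" rule of thumb is easy to turn into a check. This is a minimal Python sketch; the function names and the threshold of 1.0 are illustrative choices, and the CPU count of 4 in the example is a pure guess, since the real count of the shared server is unknown.

```python
import os


def load_ratio(load_avg, cpus):
    """Load average per CPU; anything above 1.0 means work is queuing up."""
    return load_avg / cpus


def is_bottlenecked(load_avg, cpus, threshold=1.0):
    """Apply the rule of thumb: load should stay at or below 1x the CPU count."""
    return load_ratio(load_avg, cpus) > threshold


def current_load_ratio():
    """Same ratio for the machine this runs on (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_ratio(one_minute, os.cpu_count() or 1)


# 28.89 is the 1-minute load from the top output; 4 CPUs is a guess.
print(is_bottlenecked(28.89, cpus=4))  # True
```

Even with a generous guess at the CPU count, a load of 28.89 lands far above the line.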

Wait State:
This is the percentage of time the CPU spent waiting. That’s right, waiting. Not doing anything, just sitting idle until some I/O request comes back. What would the CPU wait on, you ask? Well, it might wait on a slow disk (NFS mounted, for example), or a network card. Processes stuck waiting like this don’t burn CPU, but on Linux they still count toward the load average, which is how you get a load of 28 on a box whose CPU is doing almost no real work. If Wait State is high, then something is fixable, and it’s usually I/O (input/output) from a disk array or network card.
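
To pull the wait-state number out of top without squinting, the Cpu(s) line can be parsed directly. This is a Python sketch that assumes the older top format shown in this post (values glued to their labels, like `95.0%wa`); newer versions of top print the line differently. The names `parse_cpu_line` and `looks_io_bound`, and the 50% threshold, are assumptions for illustration.

```python
import re


def parse_cpu_line(line):
    """Turn 'Cpu(s): 0.1%us, 4.8%sy, ...' into {'us': 0.1, 'sy': 4.8, ...}."""
    return {name: float(value)
            for value, name in re.findall(r"([\d.]+)%(\w+)", line)}


def looks_io_bound(cpu, wa_threshold=50.0):
    """Arbitrary cutoff: flag the box when most CPU time is spent in I/O wait."""
    return cpu.get("wa", 0.0) >= wa_threshold


line = "Cpu(s): 0.1%us, 4.8%sy, 0.0%ni, 0.0%id, 95.0%wa, 0.0%hi, 0.1%si, 0.0%st"
cpu = parse_cpu_line(line)
print(cpu["wa"], looks_io_bound(cpu))  # 95.0 True
```

A high "wa" next to low "us" and "sy" is the pattern that points at I/O rather than at the CPU itself.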

So I’ve done my job at creating a support case #, and escalated it to the admins. Should I ask for more, given how much I can prove the downtime?

What would you do? Demand a refund? Would you just switch to another host? Would you complain on your own blog but nowhere else?

UPDATE: I just checked again, and the load average is close to 70:

top - 15:05:50 up 42 days, 16:34, 8 users, load average: 68.60, 49.07, 35.94
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.0%us, 6.9%sy, 0.0%ni, 0.0%id, 90.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 4148308k total, 1395496k used, 2752812k free, 596k buffers
Swap: 265064k total, 1560k used, 263504k free, 534780k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3531 u3530304 22 2 5516 1984 1352 S 0 0.0 0:00.44 bash
28091 u3530304 22 2 2112 1000 812 R 0 0.0 0:00.88 top

 
UPDATE #2: Well, my blog has been down quite a bit today. I wrote this and wanted to post it, but the blog was, um, DOWN. So I just wrote a note to support and not 5 minutes after I click Send, what do I see?

top - 15:37:08 up 42 days, 17:06, 8 users, load average: 4.95, 20.94, 34.66
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 27.4%us, 4.2%sy, 0.0%ni, 61.8%id, 5.6%wa, 0.0%hi, 1.0%si, 0.0%st
Mem: 4148308k total, 1665568k used, 2482740k free, 2836k buffers
Swap: 265064k total, 1464k used, 263600k free, 741928k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3531 u3530304 22 2 5516 1984 1352 S 0 0.0 0:00.90 bash
8851 u3530304 22 2 2112 1000 812 R 0 0.0 0:00.04 top

Woohoo! The load average is going down! My blog is back up!  Now I can blog about how it was down.

Hmmm. I wonder if my email to support had anything to do with the “miraculous rise” of the server’s functionality.

And lo, the people did comment thus:

12 Comments

  1. Shiny says:

    This should be an easy one: Did you ask your ISP to turn off your server and then turn it on again? That usually works. 🙂

    Every few weeks or so people come to me when their SparcStations seem to get very sluggish — and I find the same iowait percentage way up. More often than not (at least in these cases) I have to kill every single java process to get things working at an acceptable pace again. Fun…

  2. Shiny says:

    Oh — and… you can SSH into your ISP’s hosting box? Are you hosted on a dedicated server?

    (Don’t you already feel the non-nerds slowly backing away from their computers?)

  3. Sybil Law says:

    *slowly backing away from you and shiny and my computer*
    Um, so I am glad the problem is fixed? Hopefully fixed?!
    I caught some of that stuff but overall, it just confused me.
    Either way, it’s up for me! (Now THAT’S what she said!)
    Haha
    😀

  4. Robin says:

    Ok, I’m just going to go back and hide under the covers and just wait for someone to fix my laptop…

  5. metalmom says:

    How can you be having problems? I thought you were the ‘god of the blogosphere’ ? Are you dashing my image of you?

  6. Ren says:

    One thing to check for when you are seeing high %wa is the mysqld process. There could be a DB corruption or something causing it to get into a weird condition where it keeps accessing the DB over and over.

  7. Bucky says:

    At least it’s fixed for now.

    Maybe they went and bought a couple more hamsters to help turn the wheel.

  8. whall says:

    Shiny, heh, I’ve been tempted to tell them “just reboot it for me wouldja?” because sometimes they’re so clueless. And yes, I can ssh in – I’m on a shared server, which is part of the problem. If it was a dedicated server, I wouldn’t have much of a problem because then I could troubleshoot mysql, php, iostat, vmstat, truss-like utilities, etc.

    Sybil Law, I hope it’s fixed too. Even right now, the load average is > 20 which isn’t good. But the blog is still up. I just wish they knew what they were doing or would let me do more.

    Robin, laptop got the sniffles?

    metalmom, I post this just to look helpless so people won’t be so threatened by my existence.

    Ren, wish I could. I’m not allowed to see the other processes on the system due to privileges.
    (uiserver):u35303046:~ > ps -elf
    Error, do this: mount -t proc none /proc
    (uiserver):u35303046:~ > ps
    PID TTY TIME CMD
    16066 pts/2 00:00:00 bash
    29362 pts/2 00:00:00 ps

    not helpful.

    Bucky, I’d gladly donate a hamster or two.

  9. martymankins says:

    Since I am with 1and1 as well, I’ve experienced my own downtimes with them recently. They must be doing some sort of upgrades on all of the servers. The last tech person I spoke to mentioned something about this.

    I tend to not call in each and every time, but when the outage is noticeable for extended periods of time, I do call in and tell the person on the phone my site is down. They seem like they want to help, but they really don’t know much.

    As I’ve mentioned before, I am looking at switching hosting companies. I’ll be moving chillywilly.org first, then banalleakage.com sometime next year (I’m on a one year with 1and1 for that site).

    I’d stay with them longer, but they have no interest in supporting external access to the MySQL databases (using a tool like Navicat), and then there are their recent issues. I know other hosting companies have their own issues, but for as long as I’ve been with 1and1, it would be nice for them to have a better track record.

  10. kapgar says:

    Definitely complain.

    kapgar’s last blog post..Don’t forget to remember me…

  11. Poppy says:

    That load average is BANANAS!!!!!

    Poppy’s last blog post..Beauty Queens, Queens, Subways, and Tower Power

  12. marilyn says:

    I think they are jealous because you know more than they do and so they’re messing with you. But I may be wrong since they know more than I do. It’s hard to tell from here, but the first part of this post was a lot of fun to read anyway.

    I’d switch if I were you but I don’t actually pay for anything to do with my blog so I can’t switch. On the other hand my blog is almost never down. I am stuck with that pesky blogspot domain thingy but it’s just a blog. Eventually I’ll need a real domain for my doll business though and I wouldn’t want it to be down if I was paying for hosting.

    marilyn’s last blog post..Ruby Tuesday: Crab apples
