MSNBot Behaves Poorly
As I was going through my old emails at work today (I’m still at TechMission, and will be there for at least another month), I came across a write-up that I composed and sent to the other three members of our tech team (we manage the technical aspects of TechMission’s websites, and we maintain the web server). I wrote this last fall and had meant to post it onto my blog, but forgot about it.
This is some research that I conducted, along with my recommendations for addressing a high server load problem that we were having at the time. Note that my entire time at TechMission has been in the role of an AmeriCorps intern, and everything that I have done in this role, including the work described in this blog post, has been completely self-taught in the recent past.
There was a problem…
Last fall (2009), TechMission’s servers were fairly unstable in terms of performance. Our websites were slow, server load would routinely be above 5.0 on the 5- and 15-minute averages, and we constantly had to restart Apache.
After I did some research into why we were having so many problems, I found that our website was being hammered by robot crawlers that were not respecting all of our robots.txt directives. One of these robots, surprisingly, was the crawler used by MSN.
MSNbot caused TechMission’s server load to rise to very high levels. In the 3rd week of September, we had two days where top reported our average server load during business hours hovering between 10 and 20. For a normal server, the ideal server load would be under 1.0 per CPU core. In effect, we were experiencing denial-of-service symptoms.
When our load first increased to high levels, we did not know what the cause was. While we were searching for it, one of my coworkers checked our WHM Apache logs and suggested that I do the same. As I scanned the document, I noticed that several IP addresses in the same range were showing up multiple times throughout the status log. I immediately became suspicious, because this log is a snapshot of the current activity on the server: processes that are literally happening at the time the log is loaded.
I went to www.projecthoneypot.org and searched for several of these specific IP addresses. All of these addresses were associated with the same “user”: msnbot. I then went into one of my open PuTTY sessions and issued
netstat | grep msn
which showed several current connections to the server.
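Spotting a cluster of addresses from the same range by eye works, but it can also be done with a few lines of script. The following is only a minimal sketch, not what I actually ran at the time: it assumes a log where each line starts with the client's IPv4 address (as in Apache's common access log format), and the function name count_by_prefix and the sample lines are made up for illustration.

```python
from collections import Counter


def count_by_prefix(log_lines, prefix_octets=3):
    """Count requests per IPv4 prefix (by default the first three octets)."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        octets = fields[0].split(".")
        if len(octets) != 4 or not all(o.isdigit() for o in octets):
            continue  # skip lines that do not start with an IPv4 address
        counts[".".join(octets[:prefix_octets])] += 1
    return counts


# Made-up sample lines using documentation-reserved addresses.
sample = [
    '192.0.2.10 - - [06/Oct/2009:10:00:00 -0400] "GET / HTTP/1.1" 200 512',
    '192.0.2.11 - - [06/Oct/2009:10:00:01 -0400] "GET /a HTTP/1.1" 200 512',
    '192.0.2.12 - - [06/Oct/2009:10:00:02 -0400] "GET /b HTTP/1.1" 200 512',
    '198.51.100.7 - - [06/Oct/2009:10:00:03 -0400] "GET / HTTP/1.1" 200 512',
]

# Print the busiest address ranges first.
for prefix, hits in count_by_prefix(sample).most_common():
    print(prefix, hits)
```

Any range that dominates the output is a good candidate to look up on a site such as www.projecthoneypot.org.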
We found a solution…
I decided to test my theory that MSNbot was the cause of our high server load. After getting approval from my coworkers (I was only an intern at TechMission), I went into WHM and added these IP addresses to our blacklist. Server load dropped like a rock, from 20 down to under 10 within a 2-3 minute time period.
Other people have experienced similar issues
According to several sources, msnbot is widely known to behave poorly. On April 16th, 2009, a blog post was published offering proof that msnbot used the wrong robots.txt file when indexing a website. Instead of using the right robots.txt, it has been known to follow the instructions of a completely different (unknown) website. In February, other people complained of this same problem.
The phenomenon of the msnbot slowing servers down is not new. In 2006, an article was published with a detailed report on how several webmasters and server administrators had experienced denial-of-service (DoS) symptoms as a result of the bot.
Approximately 76% of our traffic for www.urbanministry.org comes from search engines. Out of this, 68% comes from Google, and 4% comes from Yahoo. From July 1st to August 31st of this year, Bing provided 2,695 visitors to our site, and ranked as the 3rd contributing search engine (behind Google and Yahoo). From October 6, 2008 through today, October 6, 2009, Bing ranks 5th among search engines, and provided 5,435 visitors to our site. Out of these visitors, we had a 59% bounce rate.
Based on the research cited above, I have a couple of ideas. First of all, we need to do more research to find out whether blocking msnbot will eventually affect our traffic from Yahoo, since Microsoft and Yahoo have begun partnering together. Their partnership was announced publicly on July 29th, 2009.
Since we now have more aggressive robots.txt instructions, perhaps we could begin to unblock a few of the MSN IP addresses (not all of them) and see what happens. I think it would be interesting to create a log of all MSN connections on our server and find out what the bot does. We do know for a fact that not all of its IP addresses are currently blocked, as I have occasionally seen the bot show up under different IP addresses than the ones we have blocked.
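One way to build such a log of MSN connections would be to pull the bot's requests out of the Apache access log by user agent. This is a rough sketch under some assumptions: the log is in Apache's combined format (which ends with the referer and user agent in quotes), the bot identifies itself with "msnbot" in its user agent string, and the names LINE_RE and msnbot_requests, along with the sample lines, are invented for the example.

```python
import re

# Matches one line of Apache's combined log format (an assumption about our setup).
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def msnbot_requests(log_lines, needle="msnbot"):
    """Return (ip, time, path) for every request whose user agent contains needle."""
    hits = []
    for line in log_lines:
        match = LINE_RE.match(line)
        if match and needle in match.group("agent").lower():
            hits.append((match.group("ip"), match.group("time"), match.group("path")))
    return hits


# Made-up sample lines using documentation-reserved addresses.
sample_log = [
    '192.0.2.10 - - [06/Oct/2009:10:00:00 -0400] "GET /private/page HTTP/1.1" '
    '200 512 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '198.51.100.7 - - [06/Oct/2009:10:00:03 -0400] "GET / HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0"',
]

for ip, when, path in msnbot_requests(sample_log):
    print(ip, when, path)
```

Matching on the user agent rather than on IP addresses would also catch the bot when it shows up under addresses we have not seen before, although a user agent string can be faked, so it is worth cross-checking the addresses.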
Based on the data that we obtain by unblocking a few more MSN IP addresses and logging all of the MSN connections, I think that we could come back in approximately another month or two and determine whether our robots.txt instructions are being followed.
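Checking whether the logged requests respect robots.txt could be scripted as well. The sketch below uses Python's standard urllib.robotparser module to test each requested path against a set of rules. The rules and paths shown are hypothetical stand-ins, not our actual robots.txt, and disallowed_paths is a name I made up for the example.

```python
from urllib.robotparser import RobotFileParser


def disallowed_paths(robots_lines, requested_paths, agent="msnbot"):
    """Return the requested paths that robots.txt forbids for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from a list of lines, no network access
    return [path for path in requested_paths if not parser.can_fetch(agent, path)]


# Hypothetical rules and hypothetical paths taken from a log of bot requests.
rules = [
    "User-agent: msnbot",
    "Disallow: /private/",
]
paths = ["/", "/private/page", "/about"]

for path in disallowed_paths(rules, paths):
    print("requested despite robots.txt:", path)
```

If this list stays empty over a month or two of logging, that would suggest the bot is following our instructions; any entries would be concrete evidence that it is not.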
 TechMission’s Google Analytics