
MSN Bot Behaving Poorly

MSNBot Behaves Poorly

As I was going through my old emails at work today (I’m still at TechMission, and will be there for at least another month), I came across a write-up that I composed and sent to the other three members of our tech team (we manage the technical aspects of TechMission’s websites, and we maintain the web server). I wrote this last fall and had meant to post it onto my blog, but forgot about it.

This is some research that I conducted, along with my recommendations for addressing a high server load problem that we were having at the time. Note that my entire time at TechMission has been in the role of an AmeriCorps intern, and everything that I have done in this role, including the work described in this blog post, has been completely self-taught in the recent past.

There was a problem…
Last fall (2009), TechMission’s servers were fairly unstable in terms of performance. Our websites were slow, server load would routinely be above 5.0 on both the 5-minute and 15-minute averages, and we constantly had to restart Apache.

After I did some research into why we were having so many problems, I found that our website was being hammered by robot crawlers that were not respecting all of our robots.txt directives. One of these robots, surprisingly, was the crawler used by MSN.

MSNbot caused TechMission’s server load to rise to very high levels. In the 3rd week of September, we had two days where top reported our average server load during business hours as hovering between 10 and 20. For a normal server, the ideal server load would be under 1.0. In effect, we were experiencing DDoS symptoms.
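
As a rough illustration of how to read those numbers, here is a minimal Python sketch that divides the load average by the number of CPU cores. Reading the 1.0 guideline as "per core" is an assumption on my part (the figure above does not say how many cores our server had), and the function names are made up for this example.

```python
import os


def load_per_core(load_average, cpu_count):
    """Return the load average divided by the number of CPU cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def is_overloaded(load_average, cpu_count, threshold=1.0):
    """True when the per-core load is above the threshold (assumed 1.0)."""
    return load_per_core(load_average, cpu_count) > threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute averages (Unix only).
    one, five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"5-minute load per core: {load_per_core(five, cores):.2f}")
    print("overloaded" if is_overloaded(five, cores) else "ok")
```

By this measure, a load of 20 on a single-core machine is twenty times the guideline, which matches how unresponsive the server felt.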

When our load first increased to high levels, we did not know what the cause was. While we were looking for it, one of my coworkers checked the Apache status log in WHM and suggested that I do the same. As I scanned the log, I noticed that several IP addresses in the same range were showing up multiple times. I immediately became suspicious, because this log is a snapshot of the current activity on the server – processes that are literally happening at the time the log is loaded.

I went to www.projecthoneypot.org and searched for several of these specific IP addresses. All of these addresses were associated with the same “user”: msnbot. I then went into one of my open PuTTY sessions, ran netstat | grep msn, and found several current connections to the server.
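
The same kind of check can be run against an Apache access log instead of eyeballing a status snapshot. The sketch below is only an illustration: it assumes the log uses Apache's combined log format, the sample lines are invented (the IP addresses come from ranges reserved for documentation, not from our real logs), and count_hits_by_ip is a hypothetical helper name.

```python
import re
from collections import Counter

# Combined log format:
# IP identity user [time] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

MSNBOT_AGENT = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    f'192.0.2.10 - - [21/Sep/2009:10:15:32 -0400] "GET /a HTTP/1.1" 200 5120 "-" "{MSNBOT_AGENT}"',
    f'192.0.2.10 - - [21/Sep/2009:10:15:33 -0400] "GET /b HTTP/1.1" 200 731 "-" "{MSNBOT_AGENT}"',
    f'192.0.2.11 - - [21/Sep/2009:10:15:33 -0400] "GET /c HTTP/1.1" 200 - "-" "{MSNBOT_AGENT}"',
    '198.51.100.7 - - [21/Sep/2009:10:15:34 -0400] "GET / HTTP/1.1" 200 1024 "http://example.com/" "Mozilla/5.0"',
]


def count_hits_by_ip(lines, agent_substring="msnbot"):
    """Count requests per IP whose user agent contains agent_substring."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and agent_substring in match.group("agent").lower():
            counts[match.group("ip")] += 1
    return counts


if __name__ == "__main__":
    for ip, hits in count_hits_by_ip(SAMPLE_LINES).most_common():
        print(ip, hits)
```

Keep in mind that a user agent string can be spoofed, so looking up the busiest addresses (as I did on Project Honey Pot) is still worthwhile before blocking anything.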

We found a solution…
I decided to test my theory that MSNbot was the cause of our high server load. After getting approval from my coworkers (I was only an intern at TechMission), I went into WHM and added these IP addresses to our blacklist. Server load dropped like a rock, from 20 down to under 10 within a 2-3 minute time period.

Other people have experienced similar issues
According to several sources, msnbot is widely known to behave poorly. On April 16th, 2009, a blog post was published that gave proof that msnbot used the wrong robots.txt file when indexing a website: instead of using the site’s own robots.txt, it has been known to follow the instructions of a completely different (unknown) website.[1] In February, other people complained of this same problem.[2]

The phenomenon of msnbot slowing servers down is not new. In 2006, an article was published with a detailed report on how several webmasters and server administrators had experienced denial-of-service (DoS) symptoms as a result of the bot.[3]

Traffic Sources
Approximately 76% of our traffic for www.urbanministry.org comes from search engines. Out of this, 68% comes from Google, and 4% comes from Yahoo.[4] From July 1st to August 31st of this year, Bing provided 2,695 visitors to our site, and ranked as the 3rd contributing search engine (behind Google and Yahoo). From October 6, 2008 through today, October 6, 2009, Bing ranks 5th among search engines, and provided 5,435 visitors to our site. Out of these visitors, we had a 59% bounce rate.

Recommendations
Based on the research cited above, I have a couple of ideas. First of all, we need to do more research to find out whether blocking msnbot will eventually affect our traffic from Yahoo, since Microsoft and Yahoo have begun partnering together. This partnership was announced publicly on July 29th, 2009.[5]

Since we have more aggressive robots.txt instructions, perhaps we could begin to unblock a few of the MSN IP addresses (not all of them) and see what happens. I think it would be interesting to log all MSN connections to our server and find out what the bot does. We do know for a fact that not all of its IP addresses are currently blocked, as I have occasionally seen the bot show up under different IP addresses than the ones we have blocked.

Based on the data that we obtain by unblocking a few more MSN IP addresses and logging all of the MSN connections, I think that we could come back in approximately another month or two and determine whether our robots.txt instructions are being followed.
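
Once such a log exists, checking it against robots.txt could look roughly like the Python sketch below, which uses the standard library's urllib.robotparser. The robots.txt rules and the requested paths shown here are hypothetical stand-ins (our real file is not reproduced in this write-up), and find_violations is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /search/
Disallow: /private/

User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been loaded as a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def find_violations(parser, user_agent, requested_paths):
    """Return the logged paths that robots.txt forbids for this user agent."""
    return [path for path in requested_paths
            if not parser.can_fetch(user_agent, path)]


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    logged_paths = ["/articles/1", "/search/results?q=x", "/private/report"]
    print("Requested crawl delay:", parser.crawl_delay("msnbot"))
    for path in find_violations(parser, "msnbot/2.0b", logged_paths):
        print("Disallowed path was requested:", path)
```

Any path printed by the last loop would be direct evidence that the bot ignored a Disallow rule. Comparing the timestamps of consecutive requests against the crawl delay would be a natural next step.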

Sources
[1] http://www.chewie.co.uk/seosem/msnbot-20b-is-ignoring-robotstxt-and-no-index-meta-tags/

[2] http://www.webmasterworld.com/search_engine_spiders/3839742.htm

[3] http://www.masternewmedia.org/news/2006/07/05/server_slowdown_problems_possible_causes.htm

[4] TechMission’s Google Analytics

[5] http://www.searchmarketing.com/searchmarketing/2009/07/microsoft-yahoo-partnership-part-1-of-3.html


Apple’s HTTP Live Streaming: A Nightmare?

This blog post is hopefully #1 of 2 posts on the same topic: Getting HTTP Live Streaming configured on a Linux server running CentOS 5.4. If you are curious about what I have been up to for over a week, or if you are looking for ways to stream your audio or video to the iPhone, then this is for you.

A week ago, I had high hopes that I would be able to author a write-up on how I got Apple’s HTTP Live Streaming Protocol to work on our CentOS server at work (TechMission). Recently, I was asked to implement this new technology in order that we might be able to “stream” MP3s from our website to iPhones using the 3G network. I figured it would be a 1 to 2 day job. Little did I know….

It has now been over a week since the request was made to put this project as my highest priority. Besides working on this project, the only things that I have done during my time at work the last week are running database backups, performing our weekly rsync of the home directories, and upgrading our Moodle site.

If you are operating an Apple server, then from what I can tell, your job is going to be a whole lot easier. For the rest of us… well, it’s trial, error and research. Let’s get started.

  1. Research
    The first thing that I did was research what exactly HTTP Live Streaming is. HTTP Live Streaming is a proposal for a new internet protocol that runs on top of ordinary HTTP. The Internet-Draft submitted to the Internet Engineering Task Force (IETF) states that it “describes a protocol for transmitting unbounded streams of multimedia data. It specifies the data format of the files and the actions to be taken by the server (sender) and the clients (receivers) of the streams.”

    More specifically, according to Apple, “HTTP Live Streaming allows you to send live or prerecorded audio and video to iPhone or other devices, such as iPod touch or desktop computers, using an ordinary Web server.”

    The concept is simple: with HTTP Live Streaming, media (either from a live video camera or equivalent feed, or pre-recorded content such as MP3 files) is “encoded” by the server into an acceptable format, and the resulting file is broken up into small chunks that are then sent to the client. So if you look closely, you will realize that this is not a true stream, nor does it have to be live. Why it is called HTTP Live Streaming, therefore, is beyond me. The name is a little misleading, in my opinion.
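
    To make the "small chunks" idea concrete: the client finds the chunks through a plain-text index (playlist) file that lists them in order. Below is a minimal Python sketch that writes such an index, using the basic tags from the Internet-Draft as I understand them. The segment file names, the 10-second duration, and the build_playlist helper are assumptions made up for this example.

```python
def build_playlist(segment_uris, seconds_per_segment=10, finished=True):
    """Build the text of a simple index file for equally long segments."""
    lines = [
        "#EXTM3U",
        # Upper bound on the duration of any single segment, in seconds.
        f"#EXT-X-TARGETDURATION:{seconds_per_segment}",
        # Sequence number of the first segment listed below.
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for uri in segment_uris:
        lines.append(f"#EXTINF:{seconds_per_segment},")
        lines.append(uri)
    if finished:
        # Tells the client that no more segments will be added.
        lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical segment names for a pre-recorded audio file.
    print(build_playlist(["sermon-0.ts", "sermon-1.ts", "sermon-2.ts"]))
```

    For a live feed the end marker is left out, and the server keeps rewriting the index as new segments appear, which is presumably where the "live" in the name comes from.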

  2. Server Implementation
    So how does one actually implement this? Good question. I’m still trying to solve the issue on our server! However, here is what I have figured out and tried. Hopefully this will be useful to someone.

    There are two parts to a server’s configuration for HTTP Live Streaming:

    1. Media Encoder
    2. File Segmenter

    Media Encoder
    The media encoder is what takes the signal from a live broadcast feed, or a file in some other incompatible format, and turns it into a format that the iPhone (or iPod touch, or even QuickTime for that matter) can understand.

    Before reaching the segmenter, video files must be in an MPEG-2 transport stream, and audio-only files can be either in an MPEG-2 transport stream, or in AAC (with appropriate headers) or MP3 format. Obviously, if one is trying to stream pre-recorded audio content, the media encoder is NOT always a necessary step, assuming the audio was already saved in the correct format.

    File Segmenter
    The second component of the server before the file chunks are sent to the client is the file segmenter. Unless you have a lot of server space, are not streaming anything live, and do not have many files to stream, I highly recommend that this step be performed when the client requests it. The other option would be to segment the files and save the segments in a permanent location on the server.

    This is the most critical part of HTTP Live Streaming and is required to make it work. Before deploying it to a production environment, I recommend testing in your test environment (if you don’t have a dedicated testing server, you could install your server’s operating system into a virtual machine).
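
    For readers who want to see what the segmenting step amounts to, here is a hedged Python sketch that builds a single FFmpeg command doing both the encoding and the segmenting. It assumes a recent FFmpeg build that includes the built-in hls output format, which is not the setup described in this post. The file names, the 64k bitrate, and the function names are placeholders.

```python
import subprocess


def build_segment_command(source, playlist, segment_seconds=10):
    """Build an ffmpeg command that encodes audio and writes HLS segments."""
    return [
        "ffmpeg",
        "-i", source,                       # input file, e.g. an MP3
        "-vn",                              # ignore any video or cover art
        "-c:a", "aac",                      # encode the audio as AAC
        "-b:a", "64k",                      # assumed bitrate for 3G use
        "-f", "hls",                        # built-in segmenter (recent builds)
        "-hls_time", str(segment_seconds),  # target length of each segment
        "-hls_list_size", "0",              # keep every segment in the index
        playlist,                           # index file to write
    ]


def segment(source, playlist, segment_seconds=10):
    """Run the command; requires ffmpeg to be installed and on the PATH."""
    subprocess.run(build_segment_command(source, playlist, segment_seconds),
                   check=True)


if __name__ == "__main__":
    # Only prints the command; call segment() to actually run it.
    print(" ".join(build_segment_command("talk.mp3", "talk.m3u8")))
```

    Running this once per file and keeping the output corresponds to the "segment in advance and store permanently" option mentioned above, so it trades disk space for simplicity.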

    My first attempt was to use FFmpeg with a segmenter written in C by Chase Douglas. As one who is not very familiar with C and also not familiar with Mac OS X, it took me a while to realize that the segmenter was written to run on Mac servers, and not on Linux. I thought about trying to port the code, but decided to try some other things first.

    In my further research, I found that somebody (Carson McDonald) HAD ported Chase Douglas’ segmenter to Linux. The only catch is that it has a Ruby wrapper script (I have never worked with Ruby, gems, Ruby on Rails, or the Ruby server before). Nevertheless, I decided to give it a try.

    After spending 2 weeks on the project now, I have been unsuccessful getting Ruby and RubyGems to work properly. I first installed Ruby and RubyGems through yum. When that didn’t really work, I uninstalled it all through yum and then installed the latest versions of both manually (putting the source, of course, in /usr/local/src).

    In his instructions, Carson says that the gems net-scp and right_aws are required to work with the Ruby script he wrote. When I run gem install net-scp or gem install right_aws, I keep getting the following error message:

    ERROR: could not find net-scp locally or in a repository

    I am working on solving this issue, and may decide to try either of the following 2 options:
    a) Porting the original C code written by Chase Douglas into a stand-alone program that works on Linux
    b) Porting Carson McDonald’s Ruby scripts to shell scripts, which can be called with PHP.

    Right now, my boss has put me onto a new project, and I’m not quite sure when I’ll come back to this project. But once I find a solution, I will post about it here.

Hopefully this has been useful for some of you to see what I have done. Or, if you are reading this, and you think you have a solution for me, feel free to respond.


More Resources
Still feeling a little confused? Check out these great resources, which I have found very useful (in addition to the ones I linked to earlier):

By far the most popular set of instructions for getting this to work is Carson McDonald’s pair of blog articles, iPhone HTTP Streaming with FFMpeg and an Open Source Segmenter and HTTP Live Video Stream Segmenter and Distributor.

However, the best documentation provided by Apple that I could find is their HTTP Live Streaming Overview.
