Wednesday, September 17, 2014

Bots in Web

"Bot" is short for robot, and on the web bots are applications that traverse web pages automatically. Search engines employ bots to crawl web pages and build their search index. Google's spiders are one kind of bot: they crawl pages and build an index against keywords. If you want to see how a spider bot crawls a page, do the following:
  • Go to Google Webmaster Tools.
  • In the Crawl menu on the left-hand side, click 'Fetch as Google'.
  • Click the link displayed in the table under the 'Path' column to see the page as Googlebot sees it.
What you see is basically an HTML page, and that is what the Google spider sees. It parses that page and tries to make a judgement about its content. Based on the content, it associates keywords with the web page, which are later used to display links in response to a search.
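
As a rough illustration of the "parse the page and associate keywords" step, here is a minimal sketch in Python. It is not how Google's spider works; it simply strips the markup and counts the most frequent words. The names TextExtractor, top_keywords and the STOP_WORDS list are made up for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "are", "was"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def top_keywords(html, n=5):
    """Return the n most frequent words in the visible text of the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    # Ignore very short words and stop words; count the rest.
    counts = Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]
```

A real search engine weighs many more signals (titles, links, page structure and so on); word frequency is only the simplest possible stand-in.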

Different kinds of Bots

Bots are used for many things on the web, both good and bad. Some of the uses of bots on the internet are:
  • You often see spam comments on your blog posts or web pages whose sole purpose is to create backlinks. The content of these comments has no relation to your post. They are often the work of bots run by link-spamming agencies.
  • Businesses use bots to fetch competitors' prices so that they can keep their own prices in a competitive range. Many sites that run comparison services use bots extensively for this, for example to compare mobile phone prices across different e-commerce sites.
  • Bots are also used to scrape content from a web page and create near-identical pages at different web addresses. This is done to create content for search engines to crawl and index.
  • Bots are also used to collect personal information, including email addresses, so that this data can be sold to marketing agencies. It is usually considered good practice not to put your email address on web pages, as it can then become a target for spam. If you have no choice but to display your address, one option is to replace some of its characters and instruct human readers to correct it before sending mail.
  • Bots are employed to carry out attacks such as denial-of-service attacks, which overwhelm servers. This can hurt businesses if genuine users are unable to access their applications.
  • Spammers use bots to artificially boost page views. The increase in page views may create the perception that a web site is very popular. It can also affect ad payouts, which depend on the number of page views; inflating ad metrics in this way is known as click fraud. Because it directly impacts their business model, search engines like Google and Bing have devised a number of techniques to differentiate between real users and bots.
  • Bots are used in stock markets for algorithmic trading. These bots automatically buy and sell stocks based on certain rules. When the bots trade at speeds far beyond what humans can manage, this is known as high-frequency trading.
  • Many web sites provide APIs (web services) for fetching their content, for example Amazon, Twitter, Facebook and Google. Airlines and hotels also publish their inventory via web services, which travel portals consume to build comprehensive travel plans. The programs consuming these services are also a kind of bot. The difference is that they consume the content at a semantically higher level.
A study by Incapsula reports that bots made up around 61.5% of total traffic in 2013, which is well over half of Internet traffic.
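
The email tip in the list above (replace some characters and let humans correct them) can be sketched in a few lines of Python. The function names and the " [at] " / " [dot] " replacements are just one possible convention chosen for this example, and NAIVE_EMAIL_PATTERN is an assumed, simplified stand-in for what a basic harvesting bot might search for.

```python
import re

# Assumed, simplified pattern a basic harvesting bot might use.
NAIVE_EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def obfuscate_email(address):
    """Spell out '@' and '.' so the address no longer looks like an email."""
    return address.replace("@", " [at] ").replace(".", " [dot] ")


def restore_email(obfuscated):
    """What a human reader does by hand before sending mail."""
    return obfuscated.replace(" [at] ", "@").replace(" [dot] ", ".")


def harvest(text):
    """Return everything in the text that the naive pattern recognises."""
    return NAIVE_EMAIL_PATTERN.findall(text)
```

This only defeats simple pattern matching. A determined harvester can reverse well-known substitutions, so treat it as a way to reduce spam, not eliminate it.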

How to save yourself from malicious bots

Malicious bots are bad. One place where you can stop them is your comment area, where they write spam. You can do this either by adding a CAPTCHA or by only allowing comments from users who have authenticated themselves in some way. Some bots are smart enough to perform character recognition on a CAPTCHA, which may require using a more complex CAPTCHA. It's a constant game of catch-up.

However, you cannot entirely stop bots from crawling your pages and scraping your content for malicious purposes. You can only hope that the search engines will treat you as the original source of the information.

Writing a Bot

Writing a bot is not very difficult. You basically have to write a program that opens an HTTP connection and downloads the content of the target web page. Once the content is downloaded, you can parse it and do your own analysis, or build your own search index. Please do this for good purposes and not to create spam. This is a powerful mechanism that can be used for many useful things, such as trend analysis.

Command-line utilities like wget can help retrieve the content of any web page.
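
The two steps described above (download the page over HTTP, then parse it) might look roughly like this in Python, using only the standard library. This is a minimal sketch, not a production crawler: the USER_AGENT string and the function names fetch and extract_links are invented for this example, and a real bot should also throttle its requests and respect the site's crawling rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Hypothetical identifier; a real bot should name itself honestly.
USER_AGENT = "example-bot/0.1"


def fetch(url, timeout=10):
    """Open an HTTP connection and return the page body as text."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Parse the downloaded HTML and return the links found in it."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

To try it, call fetch on a page you are allowed to crawl and pass the result, together with the same URL, to extract_links. Following the returned links and repeating the process is the basic loop of a crawler.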

