Search engines and user-agents list (2006)
Search engines, web crawlers and user-agents
I've compiled a list of active web crawlers and web query/crawl tools, captured from the logs of one of my web sites. I started logging a few months ago, and the results are pretty amazing: it seems that a lot of companies and institutes are doing web research through distributed web crawlers. Please note that:
- I might be wrong about some of the signatures below being crawlers. They might only be automated download programs (such as DAP or NetAnts), or spoofed browser identities.
- The IPs of the crawlers are just extra information; distributed crawlers can have thousands of IPs.
- I will update and correct this list, adding more features, news and information about the crawlers.
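The kind of logging described above can be sketched roughly as follows. This is a minimal Python example, not the setup actually used for this list: it assumes the web server writes the Apache "combined" log format, and the function name `tally_user_agents`, the timestamps and the request paths in the sample lines are made up for illustration (the signatures and IPs are taken from the table below).

```python
import re
from collections import defaultdict

# Apache "combined" log format (an assumption about the server setup):
# IP ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def tally_user_agents(lines):
    """Map each user-agent string to the set of client IPs that sent it."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the assumed format
        seen[match.group("agent")].add(match.group("ip"))
    return dict(seen)

# Hypothetical sample lines; only the signatures and IPs come from the list.
SAMPLE = [
    '66.234.139.198 - - [01/Jan/2006:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Snapbot/1.0"',
    '38.98.19.83 - - [01/Jan/2006:00:00:02 +0000] "GET /a.html HTTP/1.1" 200 734 "-" "Snapbot/1.0"',
    '66.249.64.54 - - [01/Jan/2006:00:00:03 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

result = tally_user_agents(SAMPLE)
for agent, ips in sorted(result.items()):
    print(agent, sorted(ips))
```

Keeping a set of IPs per signature is what makes a distributed crawler show up as one signature with several addresses. A spoofed browser identity would still be counted under whatever string it sends.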
| Crawler signature | IPs | Notes |
A | ||
| Anonymous/0.0 (Anonymous; http://www.anonymous.com; noreply@anonymous.com) | 63.133.162.98 | |
| aipbot/1.0 (aipbot; http://www.aipbot.com; aipbot@aipbot.com) | 24.177.134.6 | |
| Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://sp.ask.com/docs/about/tech_crawling.html) | 65.214.44.39 | |
| asked/Nutch-0.8 (web crawler; http://asked.jp; epicurus at gmail dot com) | 131.112.125.105 | |
B | ||
| Blogslive (info@blogslive.com) | 64.158.138.84 | |
| Baiduspider+(+http://www.baidu.com/search/spider.htm) | 202.108.11.234 | |
C | ||
| Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com) | 220.130.191.235 | Nutch based |
| ccubee/4.0 | 194.213.194.207 | |
| ccubee/3.5 | 194.213.194.201 | |
| ConveraCrawler/0.9d (+http://www.authoritativeweb.com/crawl) | 63.241.61.7 | About Convera Crawler |
| Crawler/1.0+http://elibron.com | 83.149.215.35 | |
| CPG RSS Module File Reader | 82.208.190.46 | |
| csci_b659/0.13 | 156.56.103.14 | |
D | ||
| Data Searcher/0.1 libwww-perl/5.65 | 80.97.105.98 | |
| Mozilla/5.0 (compatible; DNS-Digger/1.0; +http://www.dnsdigger.com <a href='http://www.dnsdigger.com/'>DNS-Digger</a>) | 83.227.119.189 | (HEAD) |
| Mozilla/5.0 (compatible; DNS-Digger/1.0; +http://www.dnsdigger.com) | 212.214.165.218 | |
E | ||
| envolk/1.7 (+http://www.envolk.com/envolkspiderinfo.php) | 70.169.191.4 | |
| Exabot/2.0 | 193.47.80.39 | About Exabot |
F | ||
| Feedster Crawler/1.0; Feedster, Inc. | 64.95.116.1 | |
| findlinks/1.1.1-a1 (+http://wortschatz.uni-leipzig.de/findlinks/) | 139.18.2.216 | About FindLinks |
| findlinks/1.0.9 (+http://wortschatz.uni-leipzig.de/findlinks/) | 139.18.2.209 | |
| findlinks/1.1.1-a5 (+http://wortschatz.uni-leipzig.de/findlinks/) | 139.18.2.81 | |
| findlinks/1.1.3-beta2 (+http://wortschatz.uni-leipzig.de/findlinks/) | 139.18.13.204 | |
| findlinks/1.1.3-beta6 (+http://wortschatz.uni-leipzig.de/findlinks/) | 139.18.13.201 | |
| findlinks/1.1.3-beta8 (+http://wortschatz.uni-leipzig.de/findlinks/) | 139.18.13.203 | |
G | ||
| Gigabot/2.0; http://www.gigablast.com/spider.html | 66.154.103.99 | About Gigabot |
| Gigabot/2.0/gigablast.com/spider.html | 66.154.103.158 | |
| Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0; Girafabot; girafabot at girafa dot com; http://www.girafa.com) | 64.210.196.197 | About Girafabot |
| Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 66.249.72.37 | About Googlebot |
| Googlebot/2.1 (+http://www.google.com/bot.html) | 66.249.64.54 | |
| Feedfetcher-Google; (+http://www.google.com/feedfetcher.html) | 72.14.199.69 | |
| Googlebot-Image/1.0 | 66.249.65.80 | |
| Generic Mobile Phone (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html) | 66.249.65.139 | |
| Gregarius/0.5.4 (+http://devlog.gregarius.net/docs/ua) | 86.35.0.162 | |
| GOFORITBOT ( http://www.goforit.com/about/ ) | 216.69.177.55 | |
H | ||
| HooWWWer/2.1.3 (debugging run) (+http://cosco.hiit.fi/search/hoowwwer/ | mailto:crawler-info<at>hiit.fi) | 128.214.112.85 | |
| HouxouCrawler/Nutch-0.9-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com | 193.203.240.135 | Nutch based |
I | ||
| ia_archiver | 209.237.238.235 | |
| | 209.237.238.226 | |
| ichiro/2.0 (http://help.goo.ne.jp/door/crawler.html) | 210.173.180.151 | |
| IRLbot/2.0 (+http://irl.cs.tamu.edu/crawler) | 35.9.45.19 | |
| | 128.194.135.81 | |
| imds_monitor/0.1 | 211.37.79.20 | (HEAD) |
| ilial/Nutch-0.9-dev | 164.67.195.67 | |
| IlTrovatore/1.2 (IlTrovatore; http://www.iltrovatore.it/bot.html; bot@iltrovatore.it) | 213.215.201.223 | |
J | ||
| Jakarta Commons-HttpClient/3.0-rc2 | 206.188.0.22 | |
| Jyxobot/1 | 195.113.214.206 | |
L | ||
| Linkie Winkie Crawler (http://www.linkiewinkie.com/) | 212.227.22.5 | |
| LWP::Simple/5.803 | 64.34.164.36 | About LWP |
| Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) | 68.142.251.86 | About Y! Slurp |
| Misterbot-Nutch/0.7.1 (Misterbot-Nutch; http://www.misterbot.fr; admin@misterbot.fr) | 213.251.133.12 | |
| Mozilla/5.0 (Windows;) NimbleCrawler 1.13 obeys UserAgent NimbleCrawler For problems contact: crawler@healthline.com | 72.5.115.44 | |
| sproose/0.1-alpha (sproose crawler; http://www.sproose.com/bot.html; crawler@sproose.com) | 38.100.225.6 | |
| sproose/0.1 (sproose bot; http://www.sproose.com/bot.html; crawler@sproose.com) | 38.100.225.7 | |
| | 38.100.225.12 | |
| My WinHTTP Connection | 81.18.79.174 | |
| TurnitinBot/2.0 http://www.turnitin.com/robot/crawlerinfo.html | 64.140.49.69 | |
M | ||
| MJ12bot/v1.0.7 (http://majestic12.co.uk/bot.php?+) | 81.178.102.15 | |
| Microsoft URL Control - 6.00.8862 | 194.102.182.71 | |
| Microsoft-WebDAV-MiniRedir/5.1.2600 | 86.34.227.121 | (OPTIONS) |
| | 85.166.207.145 | |
| msnbot/1.0 (+http://search.msn.com/msnbot.htm) | 65.54.188.146 | |
| msnbot-media/1.0 (+http://search.msn.com/msnbot.htm) | 65.55.213.86 | |
| MOT-RAZRV3xv/85.83.E1P MIB/BER2.2 Profile/MIDP-2.0 Configuration/CLDC-1.1 | 193.230.161.122 | |
| Microsoft Data Access Internet Publishing Provider Protocol Discovery | 86.105.61.38 | (OPTIONS) |
N | ||
| noyona_0_1 | 72.9.228.79 | |
| NetResearchServer/4.0(loopimprovements.com/robot.html) | 67.180.149.252 | |
| NetSprint -- 2.0 | 212.77.102.121 | (pl, lt) |
| NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com) | 216.182.237.22 | |
O | ||
| OmniExplorer_Bot/6.68 (+http://www.omni-explorer.com) WorldIndexer | 65.19.150.213 | |
| Oracle Ultra Search | 141.146.4.11 | |
| Mozilla/5.0 (compatible; OnetSzukaj/5.0; +http://szukaj.onet.pl) | 213.180.128.154 | Language: pl, en;q=0.5, *;q=0.2 |
P | ||
| PeerFactor Crawler | 88.191.11.81 | |
| Python-urllib/1.16 | 208.223.208.181 | |
| POE-Component-Client-HTTP/0.65 (perl; N; POE; en; rv:0.650000) | 64.239.7.216 | |
| ping.blo.gs/2.0 | 66.218.65.40 | Referer: http://blo.gs/ping.php |
| psycheclone | 208.66.195.11 | |
| psbot/0.1 (+http://www.picsearch.com/bot.html) | 217.212.224.159 | |
| | 217.212.224.165 | |
R | ||
| Robozilla/1.0 | 207.200.81.166 | (Referer: http://directory.mozilla.org) |
| RAMPyBot - www.giveRAMP.com/1.0 (RAMPyBot - www.giveRAMP.com; http://www.giveramp.com/bot.html; support@giveRAMP.com) | 64.27.2.18 | |
| REBOL View 1.2.48.3.1 | 84.163.170.119 | |
S | ||
| Syntryx ANT Scout Chassis Pheromone | 64.92.202.124 | This crawler sends no signature, but sends its name through the referer. |
| Syntryx ANT Scout Chassis Pheromone; Mozilla/4.0 compatible crawler | 216.7.179.20 | Here the crawler does send a signature. |
| Snapbot/1.0 | 66.234.139.198 | |
| | 38.98.19.83 | |
| Snappy/1.1 ( http://www.urltrends.com/ ) | 205.138.199.126 | |
| Sphere Scout&v4.0 (beta) - scout at sphere dot com | 64.40.115.54 | |
| | 64.40.115.55 | |
| SuperBot/4.6.0.69 (Windows XP) | 194.105.24.56 | (has referer information) |
| Shim-Crawler(Mozilla-compatible; http://www.logos.ic.i.u-tokyo.ac.jp/crawler/; crawl@logos.ic.i.u-tokyo.ac.jp) | 157.82.254.2 | |
| Szukacz/1.5 (robot; www.szukacz.pl/html/jak_dziala_robot.html; info@szukacz.pl) | 193.218.115.7 | (always uses the same IP? not a distributed crawl?) |
| NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu) | 128.208.6.227 | |
T | ||
| TulipChain/6.03 (http://ostermiller.org/tulipchain/) Java/1.5.0 (http://java.sun.com/) Linux/2.6.15.5-x1 RPT-HTTPClient/0.3-3 | 85.186.168.8 | Used in DMOZ.org; you can also log the referer |
| TurnitinBot/2.1 (http://www.turnitin.com/robot/crawlerinfo.html) | 64.140.49.69 | |
U | ||
| UP.Browser/6.1.0.1.140 (Google CHTML Proxy/1.0) | 64.233.178.136 | |
V | ||
| VSE/1.0 (testcrawler@hotmail.com) | 24.3.56.88 | |
Y | ||
| Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html) | 202.160.180.124 | |
| Yahoo-Blogs/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html ) | 209.191.83.2 | |
| YahooFeedSeeker/2.0 (compatible; Mozilla 4.0; MSIE 5.5; http://publisher.yahoo.com/rssguide; users 0; views 0) | 66.218.65.25 | |
W | ||
| WebRankSpider/1.37 (+http://ulm191.server4you.de/crawler/) | 62.75.202.126 | |
Z | ||
| Zeusbot/0.8.1 (Ulysseek's web-crawling robot; http://www.zeusbot.com; agent@zeusbot.com) | 217.113.244.119 | |
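
Many of the signatures above (the Nutch-based ones in particular) follow a `name/version (description; URL; contact)` shape. As a rough sketch, and only under that assumption, such strings can be split into fields with a few lines of Python. The function name `parse_signature` and the field names are my own labels, not part of any standard; signatures that don't fit the shape simply return `None`.

```python
import re

# Assumed shape: name/version (description; url; contact)
SIGNATURE = re.compile(
    r'^(?P<name>[^/\s]+)/(?P<version>\S+) \((?P<details>.*)\)$'
)

def parse_signature(signature):
    """Split a 'name/version (description; url; contact)' signature.

    Returns a dict of fields, or None if the string has another shape.
    """
    match = SIGNATURE.match(signature)
    if match is None:
        return None
    parts = [part.strip() for part in match.group("details").split(";")]
    if len(parts) != 3:
        return None  # e.g. Googlebot has only a URL inside the parentheses
    description, url, contact = parts
    return {
        "name": match.group("name"),
        "version": match.group("version"),
        "description": description,
        "url": url,
        "contact": contact,
    }

parsed = parse_signature(
    "aipbot/1.0 (aipbot; http://www.aipbot.com; aipbot@aipbot.com)"
)
print(parsed)
print(parse_signature("Googlebot/2.1 (+http://www.google.com/bot.html)"))
print(parse_signature("ia_archiver"))
```

A parser like this only reports what the crawler claims about itself, so the spoofing caveat from the top of the page still applies.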