全文预览

毕业论文外文翻译-网络爬虫

上传者:梦溪 |  格式:docx  |  页数:27 |  大小:50KB

文档介绍
y imposed by disk latency. Adding a cache to the Host Splitter makes it possible to discard ing duplicate URLs instead of sending them to the peer node, thereby reducing the amount work traffic. This reduction is particularly important ina scenario where the individual crawling machines are not connected via a high-speed LAN (as they were inour experiment), but are instead globally distributed. In such a setting, each crawler would be responsible for web servers “ close to it”. Mercator performs an approximation ofa breadth-first search traversal of the web graph. Each of the (typically 500) threads in each process operates in parallel, which introduces a certain amount of non-determinism to the traversal. More importantly, the scheduling of downloads is moderated by Mercator ’s politeness

收藏

分享

举报
下载此文档