全文预览

山东建筑大学计算机网络课程设计报告《基于.Python的网络爬虫设计》

上传者:qnrdwb |  格式:doc  |  页数:12 |  大小:201KB

文档介绍
_dataР def _get_new_urls(self, page_url, soup):Р new_urls = set()Р # http://baike./item/%E8%87%AA%E7%94%B1%E8%BD%AF%E4%BB%B6Р # http://baike./item/%E6%BA%90%E4%BB%A3%E7%A0%81/3969Р links = soup.find_all('a', href=pile(r"/item/\%"))Р for link in links:Р new_url = link['href']Р # http://baike./item/Python?sefr=psР new_full_url = urlparse.urljoin("http://baike./",new_url)Р new_urls.add(new_full_url)Р return new_urlsР def _get_new_date(self, page_url, soup):Р res_data = {}Р # urlР res_data['url'] = page_urlР # <dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1>Р title_node = soup.find('dd',class_="lemmaWgt-lemmaTitle-title").find('h1')Р res_data['title'] = title_node.get_text()Р # <div class="lemma-summary" label-module="lemmaSummary">Р summary_node = soup.find('div', class_="lemma-summary")

收藏

分享

举报
下载此文档