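Scraping with a Python crawler

Scraping cnblogs (博客園) with Python: a follow-up to the duplicate-data problem in the producer-consumer crawler from 螞蟻學Python, part P5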

First, look at the request URL
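- The address shown in the browser is https://www.cnblogs.com/#p2, but the address actually requested is https://www.cnblogs.com, so something else must be going on. Since we are asking for a specific page number, you would expect a POST request, so I looked through the next few requests.
- Looking at the payload, this POST request turns out to be the one we want: PageIndex is the page number we need to set.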
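
Why the two addresses differ can be sketched with the standard library: everything after `#` is a URL fragment, which stays on the client and is not part of the request sent to the server. The helper names `request_target` and `page_number` below are made up for this illustration and are not part of the original script.

```python
from urllib.parse import urldefrag, urlsplit


def request_target(page_url: str) -> str:
    """Return the URL with its fragment removed (hypothetical helper)."""
    # urldefrag splits off everything after '#'; the fragment is not sent to the server.
    url_without_fragment, _fragment = urldefrag(page_url)
    return url_without_fragment


def page_number(page_url: str) -> int:
    """Read the page number from a fragment such as 'p2' (hypothetical helper)."""
    fragment = urlsplit(page_url).fragment  # 'p2' for '.../#p2'
    return int(fragment.lstrip("p"))


if __name__ == "__main__":
    for page in (1, 2, 3):
        u = f"https://www.cnblogs.com/#p{page}"
        # Every page maps to the same request target; only the fragment differs.
        print(request_target(u), page_number(u))
```

Because every `#p{page}` address collapses to the same request target, a plain GET on each of them would most likely return the same front page every time, which is consistent with the duplicate data described in the title.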

Let's write the code

# Author: Lovyya
# File : blog_spider
import requests
import json
from bs4 import BeautifulSoup
import re
# Matches the number in each entry of urls, to stay consistent with the teacher's urls list
rule = re.compile(r"\d+")

urls = [f'https://www.cnblogs.com/#p{page}' for page in range(1, 31)]

# URL for the POST request
url = "https://www.cnblogs.com/AggSite/AggSitePostList"
headers = {
	"content-type": "application/json",
	"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30"
}

def craw(page_url):
	# idx is the num in 'xxx.xxxx.xxx/#p{num}'; reading it from the URL means the later producer-consumer code needs no changes
	idx = int(rule.findall(page_url)[0])
	# payload parameters; only idx needs to change
	payload = {
		"CategoryType": "SiteHome",
		"ParentCategoryId": 0,
		"CategoryId": 808,
		"PageIndex": idx,
		"TotalPostCount": 4000,
		"ItemListActionName": "AggSitePostList"
	}
	r = requests.post(url, data=json.dumps(payload), headers=headers)
	return r.text

def parse(html):
	# each post link is an <a> tag with the class "post-item-title"
	soup = BeautifulSoup(html, "html.parser")
	links = soup.find_all("a", class_="post-item-title")
	return [(link["href"], link.get_text()) for link in links]

if __name__ == '__main__':
	for res in parse(craw(urls[2])):
		print(res)
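
For context, here is a minimal sketch of how a fetch function like `craw` and a parse function like `parse` could be plugged into a producer-consumer crawler. It is not the code from the course: `run_pipeline`, `fake_fetch` and `fake_parse` are hypothetical names, and the fake functions stand in for the real ones so the sketch runs without network access.

```python
import queue
import threading


def run_pipeline(page_urls, fetch, parse, n_producers=3, n_consumers=2):
    """Fetch pages in producer threads and parse them in consumer threads."""
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    for u in page_urls:
        url_queue.put(u)

    def producer():
        # Take URLs until none are left, pushing each fetched page downstream.
        while True:
            try:
                u = url_queue.get_nowait()
            except queue.Empty:
                return
            html_queue.put(fetch(u))

    def consumer():
        while True:
            html = html_queue.get()
            if html is None:  # sentinel: no more pages will arrive
                return
            items = parse(html)
            with results_lock:
                results.extend(items)

    producers = [threading.Thread(target=producer) for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        html_queue.put(None)  # one sentinel per consumer
    for t in consumers:
        t.join()
    return results


# Hypothetical stand-ins for craw() and parse(); they only echo the page number.
def fake_fetch(page_url):
    return page_url.rsplit("p", 1)[-1]


def fake_parse(html):
    return [(f"/post/{html}", f"title {html}")]


if __name__ == "__main__":
    demo_urls = [f"https://www.cnblogs.com/#p{page}" for page in range(1, 6)]
    items = run_pipeline(demo_urls, fake_fetch, fake_parse)
    # If the two counts differ, some pages returned the same posts.
    print(len(items), len(set(items)))
```

With the real functions the call would be `run_pipeline(urls, craw, parse)`. Comparing `len(items)` with `len(set(items))` is a quick way to check whether different pages are returning the same posts.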
