potato papa: [Hack] Web scraping when the data is visible on the page but the API cannot be worked out. [python][selenium][webdriver][chrome]

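Notes (using CentOS 7 as the example):

Sometimes the data you want to scrape is visible on the page, yet you cannot ~~hack~~ work out the API behind it. The API may require a key for access, or its response may be encoded and only decoded by JavaScript in the browser, which then inserts the result into the HTML. In that situation, fetching the page directly with Python requests, wget, or curl does not work, because none of them runs the page's JavaScript. Instead, a program can drive a real browser, simulating a user's actions, and read the data from the rendered page.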
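
A quick way to confirm you are in this situation is to fetch the raw HTML without a browser and check whether a value you can see on screen is actually in it. The following is only a minimal sketch using the standard library; the helper names `fetch_raw_html` and `is_rendered_client_side` are made up for illustration, and the User-Agent header is an assumption that some sites may not accept.

```python
import urllib.request


def fetch_raw_html(url, timeout=10):
    """Fetch a page the way requests/wget/curl would: no JavaScript is executed."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def is_rendered_client_side(raw_html, expected_text):
    """Return True when text visible in the browser is absent from the raw HTML."""
    return expected_text not in raw_html


# Hypothetical usage: copy a value you can see in the browser and test for it.
# html = fetch_raw_html("{target_url}")
# print(is_rendered_client_side(html, "some value visible on the page"))
```

If the check returns True, the value is most likely filled in by JavaScript after the page loads, and the browser-driven approach in these notes is worth the setup.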

1. Install Chrome (google-chrome-stable)

a. Create google-chrome.repo under /etc/yum.repos.d/ with the following content:
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/$basearch
enabled=1
gpgcheck=1
gpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub

b. Install google-chrome-stable
yum -y install google-chrome-stable

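2. Install chromedriver

a. The libX11 package must be installed first
yum -y install libX11

b. Download chromedriver from the official site and extract it into a directory that is on PATH. Pick the release whose version matches the installed Chrome.
Download link:
https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/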
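
After steps 1 and 2 it helps to check that both binaries are reachable on PATH and that their major versions agree, since a mismatched chromedriver refuses to start a session. This is a rough sketch, not an official tool: the helper names are invented here, and it assumes the binaries are called `google-chrome` and `chromedriver` and that both print their version with `--version`.

```python
import re
import shutil
import subprocess


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 79.0.3945.36'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def installed_major(binary):
    """Run `<binary> --version`; return the major version, or None if not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    return major_version(result.stdout)


if __name__ == "__main__":
    # Binary names are assumptions; adjust them to match your installation.
    chrome = installed_major("google-chrome")
    driver = installed_major("chromedriver")
    print("chrome:", chrome, "chromedriver:", driver)
    if chrome is None or driver is None:
        print("at least one binary is not on PATH")
    elif chrome != driver:
        print("major versions differ; download a matching chromedriver")
```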

3. Install the selenium Python package
a. pip install selenium

4. Using the selenium webdriver API

# Python 3.7.6
from selenium import webdriver
from selenium.webdriver.common.by import By

# Placeholders: fill in your intranet proxy and the page to scrape
proxy = '{intranet_proxy}:{port}'
my_url = '{target_url}'

# Chrome startup arguments
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=http://' + proxy)
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')

# Start Chrome through the chromedriver found on PATH
chrome = webdriver.Chrome(options=chrome_options)

# Open the URL
chrome.get(my_url)

# Click a tab first. Impressive: a program can simulate what a user does on the page!
tab = chrome.find_element(By.ID, 'tab_element')
tab.click()

# Then click a button
btn = chrome.find_element(By.ID, 'ui-id-2')
btn.click()

# Save the current browser view to debug whether the clicks above really fired
chrome.save_screenshot("screenshot.png")

# Extract data from the page via XPath
ele_list = chrome.find_elements(By.XPATH, "//div[1]/div[1]/table/tbody[2]/tr[1]/td[1]")

for ele in ele_list:
    print(ele.text)

# Shut down the browser and the chromedriver process
chrome.quit()
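
One thing to watch in the script above: the clicks trigger JavaScript, and the table may not be filled in yet when the XPath query runs, in which case the list comes back empty. Selenium ships `WebDriverWait` for exactly this purpose. The sketch below shows the same polling idea in plain Python so the logic is visible; `wait_until` is a hypothetical helper written for these notes, not part of Selenium, and the timeout values are arbitrary.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical usage with the browser object from the script above
# ("xpath" is the locator strategy string that By.XPATH stands for):
# rows = wait_until(lambda: chrome.find_elements(
#     "xpath", "//div[1]/div[1]/table/tbody[2]/tr[1]/td[1]"))
```

An empty list is falsy, so the helper keeps polling until at least one element matches or the timeout expires.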

Original references: http://keejo.coding.me/CentOS%E4%B8%8B%E9%83%A8%E7%BD%B2selenium%E7%8E%AF%E5%A2%83.html
https://sites.google.com/a/chromium.org/chromedriver/home

Wednesday, January 22, 2020
