---
title: "Crawling 104"
author: "Jilung Hsieh"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup-load-pkgs, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(httr)      # GET(), add_headers(), content()
library(jsonlite)  # fromJSON()
# options(stringsAsFactors = F) # FALSE by default since R 4.0
```
# Scraper Overview
## The concept of JSON
1. Every page we see in a browser is rendered from HTML, but an HTML layout, such as a news site's front page, usually divides into data blocks and functional blocks. For a scraping learner, the news headlines and their hyperlinks are the data; the rest, such as category menus and ads, belongs to the functional blocks.
2. The first request usually fetches the whole HTML page, content and all. But when only part of the content needs updating, for example, loading the second page of news, re-sending the entire layout would be wasteful, so sites commonly use AJAX instead: a follow-up request asks the server for data in JSON format. Once the browser receives the response to this second request, JavaScript processes the data and dynamically updates the relevant part of the page without reloading the whole page. A parsed JSON payload is sketched below.
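A minimal sketch of what such a JSON response looks like once parsed in R. The document below is made up for illustration (the field names `totalPage`, `list`, `jobName`, and `custName` merely echo the shape we will meet later); `jsonlite::fromJSON()` converts it straight into R data structures:
```{r}
# A tiny, made-up JSON document of the kind an AJAX request returns:
# it carries only data, no page layout
json_text <- '{"data": {"totalPage": 2, "list": [
  {"jobName": "Data Scientist", "custName": "ACME"},
  {"jobName": "Data Engineer",  "custName": "Initech"}]}}'
parsed <- fromJSON(json_text)
parsed$data$list   # fromJSON() simplifies the array into a data.frame
```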
## Crawler decision flow
1. First, determine whether you can crawl JSON or must parse HTML: watch the files loaded one after another under **Network > Fetch/XHR** in the browser's developer tools and check whether any JSON-formatted content appears.
    1. Clicking a file loaded under **Network > Fetch/XHR** reveals tabs such as Headers, Payload, Preview, and Response. To check whether the content is JSON, look at **Preview**; to copy that file's link, look at **Headers**.
    2. A small trick: trigger the second page to load (by clicking, or by scrolling down) and watch whether a new JSON file appears. If no page-related JSON data ever shows up under Fetch/XHR, you will most likely have to parse the HTML instead.
2. JSON-style crawling: once you have confirmed a JSON file is the one you want, copy its link and paste it into the browser's address bar as a test. If the JSON content loads directly, it is easy to fetch and you can move on to writing code; if you get an access error, you probably need to inspect the Referer header or Cookies to retrieve the content (a quick programmatic check is sketched after this list).
3. Figure out how to find the last page (the stopping condition).
4. Crawl the pages one by one.
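A minimal sketch of the test in step 2, assuming the packages loaded in the setup chunk; `test_url` is a placeholder for the link you copied from the **Headers** tab:
```{r eval=FALSE}
# Placeholder URL: replace with the link copied from the Headers tab
test_url <- "https://example.com/api/list?page=2"
resp <- GET(test_url)
http_type(resp)    # "application/json" -> crawl the JSON directly
                   # "text/html"        -> parse the HTML instead
status_code(resp)  # 4xx often hints that a Referer or Cookie is required
```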
# Get the first page
## (try to) Load the 2nd page
```{r}
# Copy and assign the 2nd page url to url2
url2 <- "https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=%E8%B3%87%E6%96%99%E7%A7%91%E5%AD%B8&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=2&mode=s&jobsource=2018indexpoc&langFlag=0&langStatus=0&recommendJob=1&hotJob=1"
# Copy and assign the 3rd page url to url3: the same query with page=3
url3 <- str_replace(url2, "page=2", "page=3")
# Send GET() request to get the page of url2,
# parse it,
# and assign it to res2
res2 <- GET(url2) %>%
  content("text") %>%
  fromJSON()
# Tracing variable res2 and finding the data frame, assigning it to df2.
# Without a Referer header, this typically holds an error message rather
# than the job list; the next section fixes that
df2 <- res2$data$list
```
## Add "Referer" argument to request page data correctly
```{r}
response <- GET(url2, config = add_headers("Referer" = "https://www.104.com.tw/"))
res <- response %>%
  content("text") %>%
  fromJSON()
# With the Referer header the job list parses correctly; keep it as df2
df2 <- res$data$list
df2 %>% View()
```
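Many AJAX endpoints check the `Referer` header to verify that the request appears to come from their own pages; sending the site's home page as the Referer is usually enough to pass that check, which is why this request succeeds where the bare `GET()` above did not.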
## Get the first page by modifying url
```{r}
# Guessing the 1st page data url to url1, assuming that changing
# page=2 to page=1 in the query string returns the 1st page
url1 <- str_replace(url2, "page=2", "page=1")
# Getting back the 1st page data
res1 <- GET(url1, add_headers("Referer" = "https://www.104.com.tw/")) %>%
  content("text") %>%
  fromJSON()
df1 <- res1$data$list
```
# Combine data frames by row
## (try to) Combine two data frames (having exactly the same variables)
```{r}
# all.df <- bind_rows(df1, df2) # will raise an error because some
# variables (tags, link) are lists or data.frames nested inside the
# data.frame; the message below comes from an older dplyr version:
# Error in bind_rows_(x, .id) :
#   Argument 31 can't be a list containing data frames
```
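To spot which variables are the hierarchical ones, a quick base-R check: print each column's class and look for `list` or `data.frame` entries.
```{r eval=FALSE}
# Columns of class list or data.frame (here, tags and link)
# are the ones that break bind_rows()
sapply(df1, function(x) class(x)[1])
```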
## Dropping hierarchical variables by assigning NULL
- Preserve the numeric and character variables; drop the list and data.frame variables by assigning `NULL` to them.
```{r}
# Drop list and data.frame inside the data.frame
df1$tags <- NULL; df1$link <- NULL
df2$tags <- NULL; df2$link <- NULL
# Re-binding two data.frames df1 and df2 by rows
all.df <- bind_rows(df1, df2)
```
## Dropping hierarchical variables the dplyr way
```{r}
# Getting the 1st page data and dropping variables tags and link,
# assigning to df1
df1 <- res1$data$list %>% select(-tags, -link)
# Getting the 2nd page data and dropping variables tags and link,
# assigning to df2
df2 <- res$data$list %>% select(-tags, -link)
# Binding df1 and df2
all.df <- bind_rows(df1, df2)
```
# Finding out the last page number
```{r}
# Tracing the number of pages in res1: assuming the API exposes
# data$totalPage (confirm the field name in the Preview pane)
last_page <- res1$data$totalPage
# Examining if the last page data is available by re-composing the URL
url_last <- str_replace(url2, "page=2", str_c("page=", last_page))
# Getting back and parsing the last page data
res_last <- GET(url_last, add_headers("Referer" = "https://www.104.com.tw/")) %>%
  content("text") %>%
  fromJSON()
```
# Use for-loop to get all pages
```{r eval=FALSE}
# A sketch: loop from page 1 to last_page, requesting and parsing each
# page with the Referer header; the results are combined in the next section
for (p in 1:last_page) {
  res <- GET(str_replace(url2, "page=2", str_c("page=", p)),
             add_headers("Referer" = "https://www.104.com.tw/")) %>%
    content("text") %>%
    fromJSON()
  print(p)
}
```
# Combine all data.frame
```{r}
# The 1st url of the query: url2 with page=1
url1 <- str_replace(url2, "page=2", "page=1")
# Getting back the 1st page data
res1 <- GET(url1, add_headers("Referer" = "https://www.104.com.tw/")) %>%
  content("text") %>% fromJSON()
# Tracing and getting total number of pages (assuming data$totalPage)
last_page <- res1$data$totalPage
# Truncating hierarchical variables: link and tags
all.df <- res1$data$list %>% select(-tags, -link)
# for-loop to get back the remaining pages and join them by rows
for (p in 2:last_page) {
  res <- GET(str_replace(url1, "page=1", str_c("page=", p)),
             add_headers("Referer" = "https://www.104.com.tw/")) %>%
    content("text") %>% fromJSON()
  all.df <- bind_rows(all.df, res$data$list %>% select(-tags, -link))
}
```
# Complete Code
```{r}
all.df <- tibble()
refer_url <- "https://www.104.com.tw"
for (p in 1:10) {   # 10 pages hard-coded; see the note after this chunk
  url <- str_c('https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=%E8%B3%87%E6%96%99%E7%A7%91%E5%AD%B8&order=12&asc=0&page=',
               p,
               "&mode=s&jobsource=2018indexpoc")
  print(p)
  res <- GET(url, add_headers("referer" = refer_url)) %>%
    content("text") %>%
    fromJSON()
  # Drop hierarchical variables before binding by rows
  res$data$list$tags <- NULL
  res$data$list$link <- NULL
  all.df <- bind_rows(all.df, res$data$list)
}
# Sanity check: count distinct job postings collected
all.df$jobNo %>% unique() %>% length()
```
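Note that `1:10` fixes the number of pages in advance. To let the data supply the stopping condition, fetch page 1 first and loop up to the total page count found in its response (the `data$totalPage` field, assuming the name observed in the Preview pane), as developed in the last-page section above.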