-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTODO.txt
41 lines (28 loc) · 1.2 KB
/
TODO.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
TODO:
- add parsing of article publish dates to timestamps
- refactor article hashes to concat publish date (or scrape date month) and title
- add email report to cron (email a report once a day)
- add 'active' column to headline_words table
- fix headline processing to only strip beggining and end of words
-
Sample query for
- select * from headline_words where scrape_ts between 'ts_int' and 'ts_int';
- decide what we want to store and how
::Each News Item::
- available props: url, author, title, description, publish time
- shared props: url, title, publish time, (description, maybe)
::Each Headline::
- give timestamp
- split into individual words
- remove undesired words
- format as dictionary word count
- database and schema
article table schema:
- id, url, author, title, description, scrape_ts, publish_ts, md5hash
headline_word schema:
- id, word (not unique), article_id, scrape_ts
notes about scraping/programmatic web requests:
X throttle rate of requests
- rotate IPs (look into VPNs, shared proxies and TOR)
- get multiple newsapi tokens and rotate
X user-agent string spoofing