- Your contribution here
- #13: Add csv to the gemspec / remove unused `create_browser` method - n-at-han-k
- Allow passing `data:` to `crawl!` - glaucocustodio
- #4: Fix keyword args on `crawl!` - milk1000cc
- Add support for Ruby 3 - glaucocustodio
- Add `response_type` to `in_parallel` - glaucocustodio
- First release as Tanakai - glaucocustodio
- Add support for Apparition - glaucocustodio
- Add support for Cuprite - glaucocustodio
- Add `encoding` config option (see All available config options)
- Validate url before processing a request (`Base#request_to`)
- Fix console command bug (see issue 21)
- In the project template, set Ruby version as >= 2.5 (previously hard-coded to 2.5.1)
- Remove .ruby-version file (was hard-coded to 2.5.1) from the project template
- Fixed bug in Base#save_to
- Remove persistence database feature (because it's slow and makes things complicated)
- Add `--include` and `--exclude` options to `CLI#runner`
- Add `Base#create_browser` method to easily create additional browser instances
- Add `Capybara::Session#scroll_to_bottom`
- Add skip_on_failure feature to `retry_request_errors` config option
- Add info about `add_event` method to the README
- Improve Runner
- Fix time helper in schedule.rb
- Add proxy validation to browser builders
- Allow passing different arguments to the `Base.parse` method
- Add possibility to add array of values to the storage (`Base::Storage#add`)
- Add `exception_on_fail` option to `Base.crawl!`
- Add possibility to pass request hash to the `start_urls` (You can use array of hashes as well, like: `@start_urls = [{ url: "https://example.com/cat?id=1", data: { category: "First Category" } }]`)
- Implement `skip_request_errors` config feature. Added Handle request errors chapter to the README.
- Add option to choose response type for `Session#current_response` (`:html` default, or `:json`)
- Add option to provide custom chrome and chromedriver paths
- Refactor `Runner`
- Fix `Base#Saver` (automatically create file if it doesn't exist in case of persistence database)
- Do not deep merge config's `headers:` option
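
The request-hash form of `start_urls` mentioned in the list above can be pictured with a small sketch. This is not the gem's implementation: `normalize_start_urls` is a hypothetical helper, and the assumption that a plain string is equivalent to a hash with empty `data:` is made here for illustration only.

```ruby
# Hypothetical helper, NOT part of the gem: turns a mixed list of plain URL
# strings and request hashes into a uniform list of { url:, data: } hashes.
def normalize_start_urls(start_urls)
  start_urls.map do |entry|
    case entry
    when String
      # Assumption: a bare URL carries no extra data.
      { url: entry, data: {} }
    when Hash
      # :url is required; :data is optional and defaults to an empty hash.
      { url: entry.fetch(:url), data: entry.fetch(:data, {}) }
    else
      raise ArgumentError, "unsupported start_urls entry: #{entry.inspect}"
    end
  end
end

start_urls = [
  "https://example.com/",
  { url: "https://example.com/cat?id=1", data: { category: "First Category" } }
]

normalized = normalize_start_urls(start_urls)
normalized.each { |request| puts request[:url] }
```

Each normalized entry keeps its `data:` hash next to its URL, which is the point of the request-hash form: per-URL context travels with the request.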

`browser` config option deprecated. Now all sub-options inside `browser` should be placed right into the `@config` hash, without the `browser` parent key. Example:

    # Was:
    @config = {
      browser: {
        retry_request_errors: [Net::ReadTimeout],
        restart_if: {
          memory_limit: 350_000,
          requests_limit: 100
        },
        before_request: {
          change_proxy: true,
          change_user_agent: true,
          clear_cookies: true,
          clear_and_set_cookies: true,
          delay: 1..3
        }
      }
    }

    # Now:
    @config = {
      retry_request_errors: [Net::ReadTimeout],
      restart_if: {
        memory_limit: 350_000,
        requests_limit: 100
      },
      before_request: {
        change_proxy: true,
        change_user_agent: true,
        clear_cookies: true,
        clear_and_set_cookies: true,
        delay: 1..3
      }
    }

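
For projects with many spiders, the "Was" to "Now" change above could be applied mechanically. The following is a minimal sketch, not something the gem provides: `flatten_browser_config` is a hypothetical helper, and the rule that an option already present at the top level wins over the same option under `browser:` is an assumption of this sketch.

```ruby
# Hypothetical migration helper, NOT provided by the gem: lifts every
# sub-option found under the deprecated :browser key to the top level.
def flatten_browser_config(config)
  browser_options = config.fetch(:browser, {})
  rest = config.reject { |key, _value| key == :browser }
  # Assumption: options already set at the top level take precedence.
  browser_options.merge(rest)
end

old_config = {
  user_agent: "Example Agent", # placeholder value for illustration
  browser: {
    restart_if: { memory_limit: 350_000, requests_limit: 100 },
    before_request: { clear_cookies: true, delay: 1..3 }
  }
}

new_config = flatten_browser_config(old_config)
puts new_config.keys.inspect
```

The helper returns a new hash and leaves `old_config` untouched, so the result can be compared against the old configuration before it replaces it.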
- Add `storage` object with additional methods and persistence database feature
- Add events feature to `run_info`
- Add `skip_duplicate_requests` config option to automatically skip already visited urls when using `request_to`
- Add `extensions` config option to allow injecting JS code into the browser (supported only by the poltergeist_phantomjs engine)
- Add `Capybara::Session#within_new_window_by` method
- Add the last backtrace line to pipeline output when item was dropped
- Do not destroy driver if it doesn't exist (for Base.parse! method)
- Handle possible Net::ReadTimeout error while trying to #quit driver
- Fix Mechanize::Driver#proxy (there was a bug while using proxy for mechanize engine without authorization)
- Fix requests retries logic
- Add missing `logger` method to pipeline
- Fix `set_proxy` in Mechanize and Poltergeist builders
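
The idea behind the `skip_duplicate_requests` option listed above can be sketched with a set of already seen URLs. This is an illustration under stated assumptions, not the gem's code: the `VisitedUrls` class and its `unique?` method are hypothetical names, and URLs are compared as exact strings with no normalization.

```ruby
require "set"

# Hypothetical stand-in, NOT the gem's storage: remembers which URLs have
# already been offered so that repeats can be skipped.
class VisitedUrls
  def initialize
    @seen = Set.new
  end

  # Returns true the first time a URL is offered and false on every repeat.
  # Set#add? returns nil when the element is already present.
  def unique?(url)
    !@seen.add?(url).nil?
  end
end

visited = VisitedUrls.new
urls = [
  "https://example.com/a",
  "https://example.com/b",
  "https://example.com/a"
]

# Only URLs that have not been seen before would be requested.
to_request = urls.select { |url| visited.unique?(url) }
puts to_request
```

Because the comparison is on exact strings, two URLs that differ only in a trailing slash or query order would both be requested in this sketch.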