After getting a ping from a friend of mine, I updated my web scraper and modified it a bit to fit his needs. He wanted to catch 404s, so I sent him my old web scraper. The problem was that his test machine had limited memory, so the script was crashing out before completing the test. My old code leveraged Anemone to crawl the site and put each URL into an array, then iterated over that array via a browser automation framework (i.e. Watir).
The reason for the browser iteration is that you can leverage something like Watir to validate specific elements on each page. In my case I had errors Anemone wasn't catching, so Watir was a great help. But because his machine was crashing on every run, I modified the code a bit:
require 'anemone'
require 'watir-webdriver'

@browser = Watir::Browser.new :firefox
base_url = "http://yoursite"

puts 'Crawling site'
Anemone.crawl(base_url) do |anemone|
  anemone.on_every_page do |page|
    next unless page.html?
    url = page.url.to_s
    # Check each page in the browser as it is discovered,
    # instead of accumulating every URL in an array first.
    @browser.goto(url)
    if @browser.text.include? "404"
      puts url + " FAILED: page is broken"
    else
      puts "This page is good: " + url
    end
  end
end
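The per-page check above only looks for the text "404", but the element validation I mentioned can be factored out into a small helper. Here's a minimal sketch: REQUIRED_MARKERS and validate_page are made-up names, and the marker strings are assumptions; swap in the strings or elements your own pages must contain. Keeping the check as a pure function of the page text also makes it easy to test without a browser.

```ruby
# Made-up list of strings every healthy page should contain --
# replace with whatever matters on your site.
REQUIRED_MARKERS = ['<title>', 'footer'].freeze

# Returns an array of problem messages; an empty array means the page passed.
def validate_page(url, page_text)
  problems = []
  problems << "#{url} FAILED: page is broken" if page_text.include?('404')
  REQUIRED_MARKERS.each do |marker|
    problems << "#{url} missing expected content: #{marker}" unless page_text.include?(marker)
  end
  problems
end

# Inside the crawl loop you would call it as:
#   @browser.goto(url)
#   validate_page(url, @browser.text).each { |msg| puts msg }
```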
That’s what fixed it. So what we’re doing here is checking each link in the browser as Anemone finds it during the crawl, rather than storing every URL first. I put in a simple browser check, but more could be added. Failures get output to the screen (but could be saved to a CSV file). Passes are also printed to the screen, to give an idea of the full breadth of testing.
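Saving failures to a CSV is a small addition using Ruby's standard library. A sketch, with an assumed filename (broken_pages.csv) and column layout -- adjust both to taste:

```ruby
require 'csv'

# Write [url, reason] pairs out to a CSV file instead of
# (or alongside) printing them to the screen.
def write_failures(failures, path = 'broken_pages.csv')
  CSV.open(path, 'w') do |csv|
    csv << ['url', 'reason']
    failures.each { |url, reason| csv << [url, reason] }
  end
end

# In the crawl loop, collect failures into an array as they are found,
# then dump them at the end:
failures = [['http://yoursite/missing', 'page is broken']]
write_failures(failures)
```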