Web Scraper with Browser Support

At my job I needed to test microsites that are being spun up. These sites have hundreds of pages with hundreds of links between them.

I’ve written a few different styles of scrapers. Some grab links to images to verify each image loads; other scripts load a browser and hit each URL to verify it loads, and use Watir to perform custom validations on fields (i.e., did a specific div load, is specific text on each page, etc.).

The goal here isn’t to simulate click-throughs on links. Rather, I want to ensure each link on the microsite is captured and loaded, sort of like a unit test for the page. The pages are templates, which lends itself to a variety of assertions that could be made. In this case, I’m just checking for 404s, but I could do more, like verify that a specific div loaded or that images loaded fine.
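
To give a sense of what those extra assertions could look like, here is a rough Watir sketch (the div ID and the checks are hypothetical, and it assumes a Watir browser instance like the one created further down):

# Sketch only: extra per-page checks with Watir. The div ID is made up for illustration.
page_ok   = @browser.div(:id => "main-content").exists?  # did the template's main div render?
images_ok = @browser.images.all? { |img| img.loaded? }   # did every image actually load?
p "#{@browser.url} is missing its main div" unless page_ok
p "#{@browser.url} has broken images" unless images_ok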

The script itself makes use of some nifty packages out there (like Anemone and Watir WebDriver).
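
Both are available as gems; assuming a standard Ruby setup with Bundler, pulling them in would look roughly like this:

# Gemfile (a sketch of the assumed setup); run `bundle install` afterwards
source 'https://rubygems.org'
gem 'anemone'          # crawls the site and collects every link
gem 'watir-webdriver'  # drives a real Firefox browser to load each page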

My script looks like this:

require 'anemone'
require 'watir-webdriver'

@profile = Selenium::WebDriver::Firefox::Profile.from_name "default"
@profile.assume_untrusted_certificate_issuer = false  # Specific to me: I needed to pass the SSL check in my environment
@profile.secure_ssl = true
@browser = Watir::Browser.new :firefox, :profile => @profile  # Telling Watir to use a Firefox profile with the SSL cert exception

base_url = "http://hard_coded_microsite.org"  # I'm hard-coding the site; it could be passed in instead.
puts 'Crawling site'
URLS = []  # Setting up the array that will hold every URL Anemone finds
Anemone.crawl(base_url) do |a|
  a.on_every_page do |p|
    if p.html?
      URLS << p.url.to_s
      p URLS  # Dumping the links found to the screen. Could be written out or saved to a db.
    end
  end
end

randomNum = rand(1000..6000)
filenameOut = "scraper_fails_#{randomNum}_.txt"
File.new(filenameOut, "w")  # Creating my log file

URLS.each do |u|  # For each URL Anemone found, open it in the browser to check it loads
  @browser.goto(u)
  if @browser.text.include? "404"  # In the test site, data is loaded per page; if it fails, a 404 is written into the HTML body. I'm checking for that.
    p u + " FAILED: page is broken"
    begin  # Writing out failure URLs to the flat file
      file = File.open(filenameOut, "a")
      file.write("#{u} FAILED and is a broken page... \n")
    rescue IOError => e
      p "FAILED TO WRITE"
    ensure
      file.close unless file.nil?
    end
  end
end
@browser.close
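
As the comment in the script notes, the base URL is hard-coded. A small tweak, sketched here rather than part of the script above, could take it from the command line instead:

# Sketch: read the site to crawl from the command line instead of hard-coding it
# (scraper.rb is just a placeholder filename)
base_url = ARGV[0]
abort("Usage: ruby scraper.rb <base_url>") if base_url.nil?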

So it does its job. It’s simple, and it certainly isn’t the core of testing. This isn’t clicking on links; it’s scraping links. If I were to create web automation scripts to go through a flow, I might miss some unknown link, or not know that there’s some hidden link on the page. This ensures everything in an “a” tag on every page is accounted for, loaded, and checked with some sort of validation.

To see how this looks in operation, I did a screencast of it running on my local machine:
