
Automating SEO, Link and Page Validation

Note #1: Make sure you have the right to test or crawl the site whose links, titles, and metadata you are checking. Firing off a test like this against a random site could get your IP banned by the admin, and it could also be deemed an attack by some.

Note #2: If you have permission and the site you are testing is hosted by GoDaddy, you will have to work with Tier 2 support. By default, they block the IP running a crawl script like this after about the first 10 seconds.

SEO-focused sites often have many pages and deep structure, and they require extensive testing to ensure that all of the cross-linking works. There is also a need to validate that each page has a title, a meta description and meta keywords.

Below are two different scripts.  

The FIRST will export a list of all the URLs in your site; we then walk through setting up JMeter to iterate over each URL, verifying that no 404s come back.

The SECOND script will run and export a CSV file containing your URLs, title tag values and meta information. That can be useful for cross-comparison with other versions of the code, or for spot-checking in Excel for any marketing SEO issues.

Checking for 404s on Each Build

We can make a little Anemone script that simply finds all the URLs on the site. We can output those to a CSV and then use something like JMeter to run through the list of URLs on each build of the code. This hits every page linked in our site and reports any page that doesn't respond with a 200 OK.


Be sure that you have installed the gems for Anemone and Nokogiri.

To do this, let’s write the crawler:

require 'anemone'
require 'nokogiri'
require 'open-uri'
require 'csv'
num = rand(1000..4000)
fname = "#{num}_meta_results.csv"
File.open("#{fname}", 'w') { |file| file.write("URL\n") } # column header
base_url = "" # the site you have permission to crawl
puts "Crawling #{base_url}"
Anemone.crawl(base_url) do |a|
  a.on_every_page do |page|
    puts page.url.to_s # console output to show what URL we are at
    File.open(fname, 'a') { |f| f.puts "#{page.url.to_s}" }
  end
end
p "Finished."

In the above script I'm first creating a filename prefixed with a random number between 1000 and 4000 (fname). After that I open/create the file and set the first line (the column header) to "URL". You can of course omit that, but it might be useful if you intend to import this CSV into Excel.

The website where we are going to start our crawl is defined in the base_url variable.

Next, we create an Anemone instance by calling the crawl method with the base_url value. In that block we iterate over every page (page). I like to see what's going on in the console, so I'm putting the value of page.url.to_s (each URL found, page by page).

Finally, inside the block, we open the CSV file we created earlier in append mode ('a') and write each line to it with

File.open(fname, 'a') { |f| f.puts "#{page.url.to_s}" }

In Excel you can create a new workbook, then click File | Import and choose to import a CSV file. Point to this output file and then choose:

  • Import it as a delimited file
  • Follow the remaining prompts (defaults)
  • Click Finish

If you want to reuse this list to test whether any pages are in error, you can have JMeter run through the list of URLs at any time.
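If you'd rather not fire up JMeter for a quick check, the same pass/fail rule can be sketched in plain Ruby. This is just a sketch: the CSV file name in the commented-out loop is hypothetical, and running it requires network access.

```ruby
require 'csv'
require 'net/http'
require 'uri'

# Mirror the JMeter assertion: anything other than 200 OK counts as a failure.
def failure?(status_code)
  status_code.to_i != 200
end

# Hypothetical usage against the exported list (requires network access):
# CSV.foreach('1234_meta_results.csv') do |(url)|
#   code = Net::HTTP.get_response(URI(url)).code
#   puts "#{url} -> #{code}" if failure?(code)
# end
```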

JMeter Integration

Edit the CSV file in Excel or in an editor and remove the first line (the URL header).

Open JMeter and create a new Thread Group.

Right click the Thread Group and choose Add > Config Element > CSV Data Set Config



Once you add it, you'll see the CSV Data Set Config screen.


In the Filename field, put the path to your csv file.

In the Variable Names field, add a variable (this is used to pass each line of the CSV into an HTTP Request). I used the name URL for this variable.

Right click the Thread Group again and choose Add > Sampler > HTTP Request.

Next you will see the HTTP Request screen, which turns each URL in your CSV file into an HTTP call. Just pass the variable into the Path field with the ${URL} syntax.

We want to know if a URL has an error or not, so let's add an assertion for each request. To do this, right click the HTTP Request in your JMeter Thread Group, then pick Add > Assertions > Response Assertion.




In the Response Assertion screen you can make a rule asserting that each URL returns a 200 response code. Pick the Response Code radio button, then in the Patterns to Test section click Add and enter 200. Now, each time JMeter makes an HTTP request, any response that is NOT a 200 (404, 500, etc.) will be counted as an error.

We can add a report of our results by right clicking the Thread Group and choosing Add > Listener > View Results Tree.


You might not want JMeter to run this as fast as it can; that might be too much throughput for your internal website. You can throttle it down to whatever throughput you want (in requests per minute) by right clicking the Thread Group and choosing Add > Timer > Constant Throughput Timer.


Change the "Target Throughput" value to whatever you wish. Remember this is in requests per minute, so if the desired rate is 1 request per second, you would use the value 60.0.
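The conversion is simple seconds-to-minutes arithmetic; here is a quick sketch:

```ruby
# Convert a desired rate in requests per second to JMeter's
# Target Throughput unit (requests per minute).
def target_throughput(requests_per_second)
  requests_per_second * 60.0
end

puts target_throughput(1)    # 1 request/sec -> 60.0 requests/min
puts target_throughput(0.5)  # 1 request every 2 sec -> 30.0 requests/min
```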

One way to run this list of URLs is to tell JMeter to run 1 thread (user) at a time and loop once per row in your CSV file. So if your CSV had 1856 rows (unique URLs), you would right click your Thread Group again, set the Number of Threads to 1 and set the Loop Count to 1856.


This setup tells JMeter to run your CSV list in one thread and iterate over all 1856 lines in the CSV.

Click Run, then open your View Results Tree report to see all the passes and/or failures. You can rerun this test each time there's a new build, etc. You can also run it programmatically with the command-line (non-GUI) version of JMeter.
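For example, assuming your test plan was saved as url_check.jmx (a hypothetical name), a non-GUI run looks like this:

```shell
# -n = non-GUI mode, -t = test plan file, -l = results log file
jmeter -n -t url_check.jmx -l results.jtl
```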

Scraping Links, Titles and Meta Information

Marketing teams may have their own tools to accomplish this… or they may not. You can modify the Anemone script above to run through your site (preproduction, production, etc.) and output a CSV file with each URL, that page's HTML title, its meta description and its meta keywords.

This can be useful to see if there are any pages on your site that are missing title tags, META tags, etc.

The resulting output might look like: , 'Homepage', 'This is a site about my stuff.', 'my stuff, great site, photography, books, movies.'

You can collect whatever data you want from each page. We'll be using Nokogiri to capture/parse the HTML on each link found by Anemone. Nokogiri lets us grab any element that is common to the pages.

Here’s an example of doing this in code:

require 'anemone'
require 'nokogiri'
require 'open-uri'
require 'csv'
num = rand(1000..4000)
fname = "#{num}_meta_results.csv"
File.open("#{fname}", 'w') { |file| file.write("URL,TITLE,META-KEYWORDS,META-DESCRIPTION\n") }
base_url = "" # the site you have permission to crawl
puts "Crawling #{base_url}"
Anemone.crawl(base_url) do |a|
  a.on_every_page do |p|
    puts p.url.to_s
    page_source = Nokogiri::HTML(open(p.url.to_s))
    description = page_source.at('meta[name="description"]')['content'] rescue nil
    keywords = page_source.at('meta[name="keywords"]')['content'] rescue nil
    title = p.doc.at('title').inner_html rescue nil
    puts "title: #{title} keywords: #{keywords} description: #{description}"
    File.open(fname, 'a') { |f| f.puts "'#{p.url.to_s}','#{title}','#{keywords}','#{description}'" }
  end
end
p "Finished."

This is very similar to the first script; the main change is a new variable, page_source, that instantiates Nokogiri as Nokogiri::HTML(open(p.url.to_s)), where p here is not "puts" but rather the iterated page from Anemone.

Next I created some variables that reference the specific items I want to collect on each page… description, for example, is set to page_source.at('meta[name="description"]')['content']. That is literally Nokogiri pulling out the meta tag named description and returning its content.

I do that for the other things I want… for the title I used Anemone's built-in doc method (p.doc)… perhaps I could have used that for the others too… not sure. Try and see.

Finally I append to the CSV file with

f.puts "'#{p.url.to_s}','#{title}','#{keywords}','#{description}'"

The reason for the single quotes around the values is that some websites use commas in the title, keywords or description. That is problematic if you are making a CSV. You don't want a value like keywords coming in as: movies, media, content, streaming. 🙂
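An alternative to hand-rolled single quotes is Ruby's standard CSV library, which quotes any field containing a comma for you (the field values below are made up):

```ruby
require 'csv'

# CSV.generate_line escapes fields automatically, so embedded commas are safe.
line = CSV.generate_line(['Homepage', 'movies, media, content, streaming', 'A site about film.'])
puts line
# => Homepage,"movies, media, content, streaming",A site about film.
```

Output produced this way uses standard double-quote escaping, so Excel imports it with the default text qualifier and no extra setup.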

This requires a little change when viewing the data in Excel. To import this into Excel, choose "Import CSV data", then specify that the data is delimited (it should default to comma). You must now also set the text qualifier to the single quote character.



It should import all your data, and the data with commas will still remain in one field.

What do you think?


Written by Admin

I work for a Telecom company writing and testing software. My passion for writing code is expressed through this blog. It's my hope that it gives hope to any and all who are self-taught.

