Crawling all Links with Selenium and NUnit

The simplest test to perform on a web application is to follow every link the application presents.

To start the web application test, we have to log in first, and we need a way to detect .NET server errors.
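The crawl method in this post calls a helper named DetectServerError, whose body is not shown. A minimal sketch of what such a helper might look like follows. It assumes the application shows the default ASP.NET error page, whose title begins with "Server Error in '". The class name ServerErrorDetector and the marker string are assumptions made for illustration, not the original implementation.

```csharp
using System;

public static class ServerErrorDetector
{
    // Assumption: the default ASP.NET error page is titled
    // "Server Error in '/YourApp' Application.". Adjust the marker if the
    // application uses custom error pages.
    private const string ErrorMarker = "Server Error in '";

    // Returns true if the page source contains the error marker.
    public static bool DetectServerError(string html)
    {
        if (string.IsNullOrEmpty(html))
            return false;

        return html.IndexOf(ErrorMarker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

A plain substring check is cheap enough to run on every crawled page. An application with custom error pages would need a different marker.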

Our crawling test needs two collections: the URLs we have already processed, and a queue of URLs still to be processed.

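// URLs already processed or deliberately excluded (key: URL, value: status note).
private System.Collections.Hashtable ht = new System.Collections.Hashtable();
// URLs still waiting to be crawled.
private System.Collections.Generic.Queue<string> quUrls = new System.Collections.Generic.Queue<string>();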
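How the two collections work together can be tried without a browser. The sketch below is a simplified stand-in, not the test code of this post: it walks a hypothetical in-memory link graph instead of real pages, uses a HashSet<string> in place of the Hashtable, and marks a URL as seen when it is queued rather than when it is processed. CrawlSketch, Crawl and the sample paths are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // "site" is a hypothetical stand-in for the application:
    // each page maps to the links it contains.
    public static List<string> Crawl(
        IDictionary<string, string[]> site, string start, IEnumerable<string> excluded)
    {
        var seen = new HashSet<string>(excluded);  // processed, queued, or excluded URLs
        var todo = new Queue<string>();            // URLs still to be processed
        var order = new List<string>();            // visit order, for inspection

        if (seen.Add(start))
            todo.Enqueue(start);

        while (todo.Count > 0)
        {
            string url = todo.Dequeue();
            order.Add(url);

            string[] links;
            if (!site.TryGetValue(url, out links))
                continue;                          // page has no outgoing links

            foreach (string link in links)
            {
                // HashSet.Add returns false for URLs that were already seen.
                if (seen.Add(link))
                    todo.Enqueue(link);
            }
        }
        return order;
    }
}
```

With a graph in which /home links to /a, /b and /logout, /a links back to /home and to /b, and /b links to /c, calling Crawl with start "/home" and "/logout" excluded visits /home, /a, /b, /c in that order and never touches /logout.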

The first entry in the URL queue is the page that follows the login. The hashtable can be pre-filled with URLs that must not be followed (such as the logout link):

public void OpenAllLinks()
{
    Configuration.Login(selenium, uMandIndex);
    quUrls.Enqueue(selenium.GetLocation());

    int uDone = 0;
    string sUrl;
    // Pre-register the logout link so the crawler never follows it.
    sUrl = selenium.GetEval("window.document.getElementById('ctl00_hlLogout').href");
    ht.Add(sUrl, "dont follow");

    while (quUrls.Count > 0)
    {
        sUrl = quUrls.Dequeue();
        // The indexer overwrites an existing entry; Add would throw if the URL were already present.
        ht[sUrl] = "done";
        uDone++;
        NUnitLog.Trace("processing " + sUrl + " scanned " + ht.Count.ToString() +
            " todo " + quUrls.Count.ToString() + " done " + uDone.ToString());

        selenium.Open(sUrl);
        selenium.WaitForPageToLoad("30000");

        string sHtml = selenium.GetHtmlSource();
        if (DetectServerError(sHtml))
        {
            HandleServerError(sUrl, sHtml);
            continue;
        }

        string sCount = selenium.GetEval(
            "window.document.getElementsByTagName('a').length");
        NUnitLog.Trace(sCount + " links");

After retrieving the page, we collect all href attributes. Because selenium.GetAllLinks returns the id of each link, and an empty string for links that have none, we first assign an id to every link without one. For performance reasons, this is done in a single JavaScript call:

        NUnitLog.Trace(selenium.GetEval(@"
var links = window.document.getElementsByTagName('a'), updated = 0;
for (var i = 0; i < links.length; i++) {
    if (links[i].id == '') {
        links[i].id = 'hl_' + i;
        updated++;
    }
}
updated;") + " links updated");

        string[] rgsLinks = selenium.GetAllLinks();

        foreach (string sLink in rgsLinks)
        {
            if (!string.IsNullOrEmpty(sLink) &&
                sLink != "ctl00_hlHelp")
            {
                string sUrlLink = selenium.GetEval(
                    "window.document.getElementById('" + sLink + "').href");
                if (!string.IsNullOrEmpty(sUrlLink) &&
                    sUrlLink.StartsWith(Configuration.Host) &&
                    !sUrlLink.Contains(".ashx"))
                {
                    string sUrlLinkBase = sUrlLink;
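                    if (sUrlLinkBase.Contains("?"))
                    {
                        // Strip every value that is followed by another parameter ...
                        sUrlLinkBase = Regex.Replace(sUrlLinkBase, "=.+?&", "&");
                        // ... then strip the value of the last parameter.
                        sUrlLinkBase = Regex.Replace(sUrlLinkBase, "=.+", "");
                    }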

                    if (!ht.ContainsKey(sUrlLinkBase) &&
                        !quUrls.Contains(sUrlLinkBase))
                    {
                        NUnitLog.Trace("queuing " + sUrlLink);
                        quUrls.Enqueue(sUrlLink);

                        if (sUrlLinkBase != sUrlLink)
                            ht.Add(sUrlLinkBase, "pseudo");
                    }
                }
            }
        }
    }
}

The sUrlLinkBase variable is computed to avoid requesting the same .aspx page repeatedly with different parameter values. Two regular expressions strip all parameter values and leave only the parameter names in the URL (Regex requires a using directive for System.Text.RegularExpressions). If this reduced URL is neither in the hashtable of processed URLs nor in the queue, the original link is queued. This reduction is optional; disable it if you want to crawl each and every generated page.
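To see what the reduction does to a concrete URL, the two replacements can be wrapped in a small helper and tried on their own. This is only a sketch: the class name UrlSignature, the method name Reduce and the sample URLs are invented for illustration, while the two regular expressions are the ones used in the crawl loop.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlSignature
{
    // Reduces a URL to its address plus parameter names,
    // using the same two replacements as the crawl loop.
    public static string Reduce(string url)
    {
        if (!url.Contains("?"))
            return url;                        // no query string, nothing to strip

        // Strip every value that is followed by another parameter.
        string reduced = Regex.Replace(url, "=.+?&", "&");
        // Strip the value of the last parameter.
        return Regex.Replace(reduced, "=.+", "");
    }
}
```

Both http://host/List.aspx?id=5&sort=name and http://host/List.aspx?id=7&sort=date reduce to http://host/List.aspx?id&sort, so only the first of them is crawled. Note that the patterns expect every parameter to have a non-empty value: a query such as a=&b=2 is reduced to just a, because the second expression then removes everything after the first equals sign.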

3 Responses to Crawling all Links with Selenium and NUnit

  1. [...] NUnit crawler speed-up I improved the speed of my Selenium link crawling algorithm by directly extracting the href URLs of all hyperlinks, instead of retrieving the hyperlinks by ID [...]

  2. [...] crawler can optionally reduce a page’s URL to a kind of signature consisting of the address and the names of its parameters. For example, the [...]

  3. henkemike says:

    Ok, this is cool. How would I use this with the Selenium IDE?
