Crawling all Links with Selenium and NUnit

The simplest test to perform on a web application is to follow every link the application presents.

To run this test against the web application, we have to log in first, and we should be able to detect .NET server errors.

Our crawling test needs two collections: a hashtable of URLs we have already processed, and a queue of URLs still to be processed.

private System.Collections.Hashtable ht;                    // URLs already processed (or excluded)
private System.Collections.Generic.Queue<string> quUrls;    // URLs still to be processed
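
Both collections have to be created before the test runs; a minimal sketch, assuming an NUnit [SetUp] method (the method name is only illustrative):

[SetUp]
public void SetUpCrawler()
{
    // start every test run with fresh collections
    ht = new System.Collections.Hashtable();
    quUrls = new System.Collections.Generic.Queue<string>();
}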

The first entry in the URL queue is the page displayed after login. The hashtable can be pre-filled with URLs that should not be followed (such as the logout link):

public void OpenAllLinks()
{
    Configuration.Login(selenium, uMandIndex);
    quUrls.Enqueue(selenium.GetLocation());

    int uDone = 0;
    string sUrl;
    sUrl = selenium.GetEval("window.document.getElementById('ctl00_hlLogout').href");
    ht.Add(sUrl, "dont follow");

    while (quUrls.Count > 0)
    {
        sUrl = quUrls.Dequeue();
        ht.Add(sUrl, "done");
        uDone++;
        NUnitLog.Trace("processing " + sUrl + " scanned " + ht.Count.ToString() +
            " todo " + quUrls.Count.ToString() + " done " + uDone.ToString());

        selenium.Open(sUrl);
        selenium.WaitForPageToLoad("30000");

        string sHtml = selenium.GetHtmlSource();
        if (DetectServerError(sHtml))
        {
            HandleServerError(sUrl, sHtml);
            continue;
        }

        string sCount = selenium.GetEval(
            "window.document.getElementsByTagName('a').length");
        NUnitLog.Trace(sCount + " links");

After retrieving the page, we collect all href attributes. Since selenium.GetAllLinks returns link IDs (unnamed links appear as empty strings), we first have to assign an id attribute to every unnamed link. For performance reasons, this is done in a single JavaScript call:

        NUnitLog.Trace(selenium.GetEval(@"
            var i = 0, ii = 0;
            for (i = 0; i < window.document.getElementsByTagName('a').length; i++) {
                if (window.document.getElementsByTagName('a')[i].id == '') {
                    window.document.getElementsByTagName('a')[i].id = 'hl_' + i;
                    ii++;
                }
            }
            ii;") + " links updated");

        string[] rgsLinks = selenium.GetAllLinks();

        foreach (string sLink in rgsLinks)
        {
            // skip links without an id and the help link
            if (!string.IsNullOrEmpty(sLink) &&
                sLink != "ctl00_hlHelp")
            {
                string sUrlLink = selenium.GetEval(
                    "window.document.getElementById('" + sLink + "').href");
                // only follow links into our own application, and skip .ashx handlers
                if (!string.IsNullOrEmpty(sUrlLink) &&
                    sUrlLink.StartsWith(Configuration.Host) &&
                    !sUrlLink.Contains(".ashx"))
                {
                    // strip parameter values so each page is crawled only once,
                    // regardless of its parameters
                    string sUrlLinkBase = sUrlLink;
                    if (sUrlLinkBase.Contains("?"))
                    {
                        sUrlLinkBase = Regex.Replace(sUrlLinkBase, "=.+?&", "&");
                        sUrlLinkBase = Regex.Replace(sUrlLinkBase, "=.+", "");
                    }

                    if (!ht.ContainsKey(sUrlLinkBase) &&
                        !quUrls.Contains(sUrlLinkBase))
                    {
                        NUnitLog.Trace("queuing " + sUrlLink);
                        quUrls.Enqueue(sUrlLink);

                        if (sUrlLinkBase != sUrlLink)
                            ht.Add(sUrlLinkBase, "pseudo");
                    }
                }
            }
        }
    }
}
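
The methods DetectServerError and HandleServerError called above are not part of Selenium; they are application-specific. A minimal sketch of what they might look like, assuming the standard ASP.NET error page text ("Server Error in ... Application" or "Runtime Error") is visible in the HTML source:

private bool DetectServerError(string sHtml)
{
    // the default ASP.NET error page ("yellow screen of death") contains one of these titles;
    // adjust the strings if the application uses a custom error page
    return sHtml.Contains("Server Error in") ||
           sHtml.Contains("Runtime Error");
}

private void HandleServerError(string sUrl, string sHtml)
{
    // record the failing URL and keep crawling; the calling loop continues with the next URL
    NUnitLog.Trace("SERVER ERROR on " + sUrl);
}

Collecting the failing URLs and asserting only at the end of the test keeps a single broken page from aborting the whole crawl.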

The sUrlLinkBase variable is calculated to avoid calling the same .aspx page over and over with different parameters: two regular expressions strip all parameter values from the URL, leaving only the parameter names. If this normalized URL is not yet in the hashtable of processed URLs, the original link is queued. The normalization is optional; disable it if you want to crawl each and every generated page.
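
As an illustration of the normalization (the URL is made up):

string sUrlLink = "http://localhost/MyApp/Edit.aspx?id=17&mode=edit";   // hypothetical link
string sUrlLinkBase = Regex.Replace(sUrlLink, "=.+?&", "&");  // http://localhost/MyApp/Edit.aspx?id&mode=edit
sUrlLinkBase = Regex.Replace(sUrlLinkBase, "=.+", "");        // http://localhost/MyApp/Edit.aspx?id&mode

Both Edit.aspx?id=17&mode=edit and Edit.aspx?id=42&mode=view map to the same base URL Edit.aspx?id&mode, so only the first of them is actually crawled.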
