Friday, July 27, 2012

How to set HtmlAgilityPack Timeout

HtmlAgilityPack is a great HTML parser library that I often use for scraping. It makes web requests on your behalf via the HtmlWeb.Load methods, but doesn’t directly expose the HttpWebRequest.Timeout property. I see a lot of people recommending using HttpWebRequest or WebClient to download the page and then HtmlAgilityPack to query the DOM, but there’s an easier way.

If you look at the source for HtmlWeb, you’ll see that it exposes a PreRequest delegate:

/// <summary>
/// Represents the method that will handle the PreRequest event.
/// </summary>
public delegate bool PreRequestHandler(HttpWebRequest request);


And HtmlWeb calls that delegate right before making the GetResponse call (returning false from the delegate cancels the request):

if (PreRequest != null)
{
    // allow our user to change the request at will
    if (!PreRequest(req))
    {
        return HttpStatusCode.ResetContent;
    }
}

HttpWebResponse resp;

try
{
    resp = req.GetResponse() as HttpWebResponse;
}


So all you have to do is assign a delegate to PreRequest and set your timeout within that delegate:

var web = new HtmlWeb();
web.PreRequest = delegate(HttpWebRequest webRequest)
{
    // HttpWebRequest.Timeout is in milliseconds, so 4000 = 4 seconds
    webRequest.Timeout = 4000;
    return true;
};
var doc = web.Load("http://www.msn.com/");

Yep, it’s that easy.
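
If you want to reuse this and also handle the case where the timeout actually fires, here is a rough sketch. The TimeoutExample class and the CreateWeb helper are names I made up for illustration, not part of HtmlAgilityPack. I'm also assuming the timeout surfaces from Load as a WebException with Status set to WebExceptionStatus.Timeout, which may vary between library versions.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

public static class TimeoutExample
{
    // Hypothetical helper (not part of HtmlAgilityPack): builds an HtmlWeb
    // whose requests use the given timeout.
    public static HtmlWeb CreateWeb(int timeoutMilliseconds)
    {
        var web = new HtmlWeb();
        web.PreRequest = delegate(HttpWebRequest webRequest)
        {
            // HttpWebRequest.Timeout is expressed in milliseconds
            webRequest.Timeout = timeoutMilliseconds;
            return true; // returning false would cancel the request
        };
        return web;
    }

    public static void Main()
    {
        var web = CreateWeb(4000); // 4 seconds
        try
        {
            var doc = web.Load("http://www.msn.com/");
            Console.WriteLine("Downloaded " + doc.DocumentNode.OuterHtml.Length + " characters");
        }
        catch (WebException ex)
        {
            // Assumption: a timeout is reported as WebExceptionStatus.Timeout
            if (ex.Status != WebExceptionStatus.Timeout)
            {
                throw;
            }
            Console.WriteLine("Request timed out: " + ex.Message);
        }
    }
}
```

Pulling the delegate into a helper keeps the timeout value in one place, so every HtmlWeb you create for a scraping job gets the same setting.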

Jon