Forcing Google Search Appliance to re-crawl EPiServer published pages
Oct 25, 2011
Google Mini and Google Search Appliance both offer a powerful search for your EPiServer site. In the simplest implementation you point the box at your site's URL and get Google-quality results back.
However, the Google boxes crawl the site at their own pace (although this is configurable). This isn't always ideal, as site editors like to see their content indexed quickly rather than wait for Google to get around to re-crawling.
Using the Google Enterprise GData API it's possible to tell the Google box that a page in EPiServer has changed and add it to the re-crawl list (API documentation). Note: the re-crawl still isn't immediate. The call simply adds the URL to the list of URLs to be re-crawled, so the results should appear sooner than if you waited for the Google boxes to get round to re-indexing.
The example code uses an initialisation module to hook into the published event and tell Google to re-crawl a URL:
using System;
using System.Text;
using System.Web;
using EPiServer;
using EPiServer.Core;
using EPiServer.Framework;
using EPiServer.Framework.Initialization;
using EPiServer.Web;
using Google.GData.Gsa;
using log4net;

/// <summary>
/// Initialisation module to wire up EPiServer DataFactory handlers to functionality that will ask the Google Search Appliance
/// to re-crawl a newly published page
/// </summary>
[InitializableModule]
[ModuleDependency(typeof(EPiServer.Web.InitializationModule))]
public class GoogleSearchCrawlerInit : IInitializableModule
{
    private static readonly ILog Logger = LogManager.GetLogger(typeof(GoogleSearchCrawlerInit));

    #region Implementation of IInitializableModule

    public void Initialize(InitializationEngine context)
    {
        // Hook up handler to page published event
        DataFactory.Instance.PublishedPage += PagePublished;
    }

    public void Uninitialize(InitializationEngine context)
    {
        // Detach the handler so it is not left registered after shutdown
        DataFactory.Instance.PublishedPage -= PagePublished;
    }

    public void Preload(string[] parameters)
    {
    }

    #endregion

    /// <summary>
    /// Event handler that asks the Google Search Appliance to re-crawl the published page that raised this event
    /// </summary>
    /// <param name="sender">sender</param>
    /// <param name="e">event args</param>
    private static void PagePublished(object sender, PageEventArgs e)
    {
        // If the page is publicly visible (exists below the site start page and is not a container page),
        // ask the Google Search Appliance to add its URL to the re-crawl list
        if (e.Page.IsVisibleOnSite)
        {
            var config = GoogleSearchConfiguration.GetConfig();
            try
            {
                // Instantiate an object representing the GSA
                var myService = new GsaService(config.AdminUrl, config.Port, config.UserId, config.Password);
                var updateEntry = new GsaEntry();

                // Add command to re-crawl the URL
                var pageUrl = GetPageUrl(e.Page);
                updateEntry.AddGsaContent("recrawlURLs", pageUrl);

                // Send the request to the GSA
                myService.UpdateEntry("command", "recrawlNow", updateEntry);
            }
            catch (Exception ex)
            {
                // Log the error and continue so a GSA failure never blocks publishing
                Logger.Error(ex.Message, ex);
            }
        }
    }

    /// <summary>
    /// Gets the friendly URL to crawl
    /// </summary>
    /// <param name="page">The PageData to extract the URL from</param>
    /// <returns>The page URL, using the current request's scheme and host when an HTTP context is available</returns>
    private static string GetPageUrl(PageData page)
    {
        var url = new UrlBuilder(page.LinkURL);

        // Convert the internal link to its friendly form when friendly URLs are enabled
        if (UrlRewriteProvider.IsFurlEnabled)
        {
            Global.UrlRewriteProvider.ConvertToExternal(url, page.PageLink, Encoding.UTF8);
        }

        // Take host and scheme from the current request, if there is one
        if (HttpContext.Current != null)
        {
            url.Host = HttpContext.Current.Request.Url.Host;
            url.Scheme = HttpContext.Current.Request.Url.Scheme;
        }

        return url.Uri.ToString();
    }
}
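The listing above calls GoogleSearchConfiguration.GetConfig(), which isn't shown. As a rough sketch only, here is one way such a class could look, assuming the settings live in a custom web.config section. The section name "googleSearch", the attribute names and the property types are my assumptions for illustration. Only GetConfig(), AdminUrl, Port, UserId and Password come from the code above.

```csharp
using System.Configuration;

/// <summary>
/// Hypothetical configuration section holding the GSA connection settings.
/// The section and attribute names below are assumptions, not part of the original code.
/// </summary>
public class GoogleSearchConfiguration : ConfigurationSection
{
    // Assumed name of the custom section in web.config
    private const string SectionName = "googleSearch";

    /// <summary>
    /// Reads the section from the application's configuration file
    /// </summary>
    public static GoogleSearchConfiguration GetConfig()
    {
        var section = ConfigurationManager.GetSection(SectionName) as GoogleSearchConfiguration;
        if (section == null)
        {
            throw new ConfigurationErrorsException(
                "Missing <" + SectionName + "> section in the configuration file.");
        }
        return section;
    }

    // Host name or address of the appliance's admin interface
    [ConfigurationProperty("adminUrl", IsRequired = true)]
    public string AdminUrl
    {
        get { return (string)this["adminUrl"]; }
    }

    // Admin port of the appliance (assumed here to be an integer)
    [ConfigurationProperty("port", IsRequired = true)]
    public int Port
    {
        get { return (int)this["port"]; }
    }

    // Admin account used to issue the re-crawl command
    [ConfigurationProperty("userId", IsRequired = true)]
    public string UserId
    {
        get { return (string)this["userId"]; }
    }

    [ConfigurationProperty("password", IsRequired = true)]
    public string Password
    {
        get { return (string)this["password"]; }
    }
}
```

With this approach the section would also need registering under configSections in web.config. Note that the event handler calls GetConfig() before its try block, so with this sketch a missing section would surface as an exception on publish. Moving the call inside the try block would keep a misconfiguration from interrupting editors.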
Hope this is useful!
Feedback
I'd be happy to hear any feedback in the comments below or via @davidknipe