To the Cold Eyes of a Machine, All Single Page Applications Look the Same

I was recently working on a project and bumped into trouble getting Facebook's bot to parse my site. It is a single page application, and the content is rendered client side using KnockoutJS. As might be expected, the bot saw identical markup for every URL, so I wasn't able to test the Open Graph stories I was creating. My team and I knew this would be an issue for SEO and had originally planned to address it closer to launch, but now that it was hindering feature development, the priority shot up. A little Googling turned up some helpful posts pointing in the right direction.

These were great ideas; however, none of them felt quite right for my situation. First, they all rely on hashbangs, which the crawler converts into query strings. For example,

http://mixtmeta.com#!myPage

when requested by a crawler, will look like:

http://mixtmeta.com?_escaped_fragment_=myPage

The web server is supposed to detect the '_escaped_fragment_' parameter, infer that the request is from a bot, and serve up some special content. Clever, except my site is a .NET MVC application and does not use hashbangs, opting instead for routing in the format of:

http://mixtmeta.com/{controller}/{action}/{id}
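
For context, the routing is just the stock convention-based setup that ships with MVC, roughly as sketched below (the Home/Index defaults are an assumption; your project's RouteConfig will differ):

using System.Web.Mvc;
using System.Web.Routing;

public class RouteConfig
{
    public static void RegisterRoutes(RouteCollection routes)
    {
        routes.IgnoreRoute("{resource}.axd/{*pathInfo}");

        // Plain {controller}/{action}/{id} routes; no hashbangs anywhere,
        // so the escaped-fragment convention has nothing to hook into.
        routes.MapRoute(
            name: "Default",
            url: "{controller}/{action}/{id}",
            defaults: new { controller = "Home", action = "Index", id = UrlParameter.Optional }
        );
    }
}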

Second, they all use headless browsing with a PhantomJS web server, usually running on Node, to serve content to bots on the fly. That is not practical in my situation. The site is on Azure, so theoretically I could stand up a Node app, but once bots start indexing your site, they can slam it with thousands of requests a second. Since this is a shared environment, I do not want to risk overrunning the CPU allotment by rendering at runtime, which could take the site down.

My solution to these problems was to piggyback on the previous suggestions and create a PhantomJS script, but have it run offline on a schedule (hourly, daily, weekly, whatever you want). The script parses the sitemap.xml for my website and saves a rendered HTML page for each link. I also added an ActionFilter to detect bot requests (in this case, the Facebook crawler) and serve these pre-cached pages directly, instead of rendering them as the requests come in. The PhantomJS script and the ActionFilter both MD5 hash the requested URL to name and look up the HTML files; MD5() in the code below is just a string extension to hide away that logic.
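
That extension isn't shown in the ActionFilter snippet, but here is a minimal sketch of what it might look like (the class name is arbitrary; the only real requirement is that it produces the same hex digest the PhantomJS script uses when naming the cached files):

using System.Text;

public static class StringExtensions
{
    // Hashes a string with MD5 and returns the lowercase hex digest, which is
    // used to turn a requested URL into a stable cache file name.
    public static string MD5(this string input)
    {
        // Fully qualified to avoid a name clash with this extension method.
        using (var md5 = System.Security.Cryptography.MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
            var sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                sb.Append(b.ToString("x2"));
            }
            return sb.ToString();
        }
    }
}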

The PhantomJS script is on GitHub: mixtmeta/phantomjs-pagecacher, and the ActionFilter is below.

CrawlerActionFilter.cs

using System.Text.RegularExpressions;
using System.Web.Mvc;

public class CrawlerActionFilter : ActionFilterAttribute
{
    // Matches the Facebook crawler's user agent, e.g.
    // "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
    private static readonly Regex FacebookBot = new Regex("^facebookexternalhit");

    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        var request = filterContext.HttpContext.Request;

        // Treat the request as a crawler if ASP.NET's browser detection flags it,
        // or if the user agent matches the Facebook crawler. UserAgent can be
        // null, so guard against that before running the regex.
        bool isCrawler = request.Browser.Crawler
            || (request.UserAgent != null && FacebookBot.IsMatch(request.UserAgent));

        if (isCrawler)
        {
            // Short-circuit the action and serve the page the PhantomJS script
            // pre-rendered, named by the MD5 hash of the requested URL.
            filterContext.Result = new FilePathResult("~/pageCache/"
                        + request.Url.AbsoluteUri.ToLower().MD5()
                        + ".html", "text/html");
        }
    }
}
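
To put the filter into effect, it can be applied to individual controllers with [CrawlerActionFilter], or registered globally so every route is covered. Here is a minimal sketch of global registration, assuming the stock FilterConfig that an MVC project template generates:

using System.Web.Mvc;

public class FilterConfig
{
    public static void RegisterGlobalFilters(GlobalFilterCollection filters)
    {
        // Run the crawler check on every action so any URL a bot requests
        // is answered from the pre-rendered page cache.
        filters.Add(new CrawlerActionFilter());
        filters.Add(new HandleErrorAttribute());
    }
}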

Hopefully, consolidating this information into one place will help some people down the road. If you run into this problem while working on a Knockout, Angular, or Backbone single page application, let me know what you think of my solution!


