jazeee:spiderable-longer-timeout

Extended spiderable package: SSL, caching, longer timeout, no stdin issues, publish flag


Keywords
crawlable, meteor-package, meteor-spiderable, meteorjs, phantomjs, seo
Install
meteor add jazeee:spiderable-longer-timeout@=1.2.13

Documentation

spiderable-longer-timeout

About

This is a branch of the standard Meteor spiderable package, with some code merged from the ongoworks:spiderable package. Primarily, it lengthens the timeout to 30 seconds and raises the size limit to 10MB. All results are cached in a Mongo collection, by default for 3 hours (180 minutes).

This package ignores all SSL errors so that pages can still be fetched.

This package supports "real response codes" and "real headers": if your route returns a 301 response code with some headers, the package will return the same response code and headers. This package also supports JavaScript redirects.

This package has been tested with iron-router and flow-router, with and without the following packages:

This package has a built-in caching mechanism. By default it stores results for 3 hours; to change the storage period, set Spiderable.cacheLifetimeInMinutes to another value, in minutes.

Installation

meteor add jazeee:spiderable-longer-timeout

Setup:

isReadyForSpiderable {Boolean}

On both server and client, this tells Spiderable that everything is ready. Spiderable will wait for Meteor.isReadyForSpiderable to be true, which gives finer control over when content is ready to be published.

Router.onAfterAction ->
  if @ready()
    Meteor.isReadyForSpiderable = true

Options

userAgentRegExps {[RegExp]}

Array of regular expressions matching the user agents of bots that we want to serve statically but that do not obey the _escaped_fragment_ protocol. You can optionally set or extend the Spiderable.userAgentRegExps list.

Spiderable.userAgentRegExps.push /^vkShare/i

Default Bots:

  • /^facebookExternalHit/i
  • /^linkedinBot/i
  • /^twitterBot/i
  • /^googleBot/i
  • /^bingBot/i
  • /^yandex/i
  • /^google-structured-data-testing-tool/i
  • /^yahoo/i
  • /^MJ12Bot/i
  • /^tweetmemeBot/i
  • /^baiduSpider/i
  • /^Mail\.RU_Bot/i
  • /^ahrefsBot/i
  • /^SiteLockSpider/
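To make the matching concrete, here is a minimal sketch of how a request's User-Agent header could be tested against such a list. The helper name isStaticBot and the shortened sample list are hypothetical, and the package's actual matching code may differ.

```javascript
// Hypothetical subset of the defaults above; the real list is Spiderable.userAgentRegExps.
const userAgentRegExps = [/^facebookExternalHit/i, /^twitterBot/i, /^googleBot/i];

// Returns true when any pattern matches the User-Agent string.
// The patterns are anchored with ^, so they match only at the start of the string.
function isStaticBot(userAgent, regExps) {
  return regExps.some((re) => re.test(userAgent || ''));
}

console.log(isStaticBot('facebookexternalhit/1.1', userAgentRegExps));         // true
console.log(isStaticBot('googlebot', userAgentRegExps));                       // true
console.log(isStaticBot('Mozilla/5.0 (X11; Linux x86_64)', userAgentRegExps)); // false
```

Because every pattern is anchored, a user agent that only contains a bot name later in the string is not matched by this check.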
cacheLifetimeInMinutes (Cache TTL) {Number}

How long cached Spiderable results should be stored (in minutes). Note:

  • Should be set before Meteor.startup
  • Value should be a {Number}, in minutes
  • To set a new cache lifetime, you need to drop the createdAt_1 index first
  • Default value: 180 (3 hours)

Spiderable.cacheLifetimeInMinutes = 60 # 1 hour in minutes

If you want to change your cache lifetime, first drop the cache index. To do so, run one of the following in the Mongo console:

db.SpiderableCacheCollection.dropIndex('createdAt_1');
/* or */
db.SpiderableCacheCollection.dropIndexes();
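The name createdAt_1 is the default name MongoDB gives to an ascending index on a createdAt field. Assuming the cache expiry is implemented as a MongoDB TTL index on that field (an assumption; check the package source), the lifetime in minutes would map to the index's expireAfterSeconds option roughly as sketched below. The helper name ttlIndexFor is hypothetical.

```javascript
// Hypothetical helper: builds the keys and options of a TTL index
// from a cache lifetime given in minutes.
function ttlIndexFor(cacheLifetimeInMinutes) {
  if (!Number.isFinite(cacheLifetimeInMinutes) || cacheLifetimeInMinutes <= 0) {
    throw new TypeError('cacheLifetimeInMinutes must be a positive number');
  }
  return {
    keys: { createdAt: 1 }, // default index name for these keys: createdAt_1
    options: { expireAfterSeconds: cacheLifetimeInMinutes * 60 },
  };
}

console.log(ttlIndexFor(180).options.expireAfterSeconds); // 10800 (3 hours)
console.log(ttlIndexFor(60).options.expireAfterSeconds);  // 3600 (1 hour)
```

Under that assumption, an existing index keeps its old expiry until it is dropped and recreated, which would explain why the index has to be dropped before a new lifetime takes effect.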
ignoredRoutes {[String]}

Spiderable.ignoredRoutes is an array of strings: routes that we want to serve statically but that do not obey the _escaped_fragment_ protocol. For more info see this thread.

Spiderable.ignoredRoutes.push '/cdn/storage/Files/'
customQuery {Boolean|String}

Spiderable.customQuery is an additional GET query parameter that will be appended to the HTTP request. This option can help you build different client logic for requests from PhantomJS and from normal users.

  • If true, Spiderable will append ___isRunningPhantomJS___=true to the query
  • If a String, Spiderable will append that string, followed by =true, to the query

Spiderable.customQuery = true
# or
Spiderable.customQuery = '_fromPhantom_'

# Usage:
Router.onAfterAction ->
  if Meteor.isClient and _.has @params.query, '___isRunningPhantomJS___'
    Session.set '___isRunningPhantomJS___', true
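As an illustration of the two customQuery modes, the sketch below shows what the resulting request URL could look like. The helper name appendCustomQuery is hypothetical and this is not the package's own code; it only mirrors the behavior described in the list above.

```javascript
// Hypothetical helper illustrating the two customQuery modes.
function appendCustomQuery(urlString, customQuery) {
  const url = new URL(urlString);
  if (customQuery === true) {
    // Boolean true: use the default parameter name
    url.searchParams.set('___isRunningPhantomJS___', 'true');
  } else if (typeof customQuery === 'string' && customQuery.length > 0) {
    // String: use the string itself as the parameter name
    url.searchParams.set(customQuery, 'true');
  }
  return url.toString();
}

console.log(appendCustomQuery('http://localhost:3000/blogs?test=1', true));
// http://localhost:3000/blogs?test=1&___isRunningPhantomJS___=true
console.log(appendCustomQuery('http://localhost:3000/blogs', '_fromPhantom_'));
// http://localhost:3000/blogs?_fromPhantom_=true
```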
debug {Boolean}

Set Spiderable.debug to true to show the server's console messages.

  • Default value: false

Spiderable.debug = true

Response statuses

You can send any response status from PhantomJS. This behavior is controlled via a special HTML/Jade comment:

  • 201 - <!-- response:status-code=201 -->
  • 401 - <!-- response:status-code=401 -->
  • 403 - <!-- response:status-code=403 -->
  • 500 - <!-- response:status-code=500 -->

This directive accepts any 3-digit value, so you may return any standard or custom response code.
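A minimal sketch of how such a directive could be read from rendered HTML follows. The function name parseStatusCode and the exact regular expression are assumptions for illustration; the package's own parsing may differ.

```javascript
// Hypothetical parser for the status directive. Returns the code as a number,
// or null when the rendered HTML contains no directive.
function parseStatusCode(html) {
  // Accepts the comment with or without spaces around the directive
  const match = /<!--\s*response:status-code=(\d{3})\s*-->/.exec(html);
  return match ? Number(match[1]) : null;
}

console.log(parseStatusCode('<!-- response:status-code=404 --><h1>404</h1>')); // 404
console.log(parseStatusCode('<!--response:status-code=201-->'));               // 201
console.log(parseStatusCode('<h1>Hello</h1>'));                                // null
```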

Enable default 404 response if you're using Iron-Router
  • Create the template that you want to return when a page is not found
  • Set iron router's notFoundTemplate
  • Include the comment <!-- response:status-code=404 --> in your template. This ensures that spiderable sends a 404 status code in the response headers
  • Enable iron router's dataNotFound plugin. See below or read more about iron-router plugins

Router.configure
  notFoundTemplate: '_404'

Router.plugin 'dataNotFound',
  notFoundTemplate: Router.options.notFoundTemplate

template(name="_404")
  // response:status-code=404
  h1 404
  h3 Oops, page not found
  p Sorry, the page you requested does not exist or was deleted

<template name="_404">
  <!--response:status-code=404-->
  <h1>404</h1>
  <h3>Oops, page not found</h3>
  <p>Sorry, the page you requested does not exist or was deleted</p>
</template>

Enable default 404 response if you're using Flow-Router
  • Create the template that you want to return when a page is not found
  • Include the comment <!-- response:status-code=404 --> in your template. This ensures that spiderable sends a 404 status code in the response headers
  • Set flow router's notFound property. See below or read more about flow-router not found routes

# With layout
FlowRouter.notFound = action: -> BlazeLayout.render '_layout', content: '_404'

# Without layout
FlowRouter.notFound = action: -> BlazeLayout.render '_404'

template(name="_404")
  // response:status-code=404
  h1 404
  h3 Oops, page not found
  p Sorry, the page you requested does not exist or was deleted

<template name="_404">
  <!--response:status-code=404-->
  <h1>404</h1>
  <h3>Oops, page not found</h3>
  <p>Sorry, the page you requested does not exist or was deleted</p>
</template>

Supported redirects

window.location.href = 'http://example.com/another/page'
window.location.replace 'http://example.com/another/page'

Router.go '/another/page'
Router.current().redirect '/another/page'
Router.route '/one', ->
  @redirect '/another/page'

Important

Set Meteor.isReadyForSpiderable to true when your route has finished, so that the page can be published. Meteor.isRouteComplete=true is deprecated, but it will keep working until at least 2015-12-31, after which I'll remove it. See the code for details.

Install PhantomJS on your server

If you deploy your application with meteor bundle, you must install phantomjs (http://phantomjs.org) somewhere in your $PATH. If you use Meteor Up, then meteor deploy can do this for you.

Spiderable.originalRequest is also set to the http request. See issue 1.

Testing

Test your site by appending the _escaped_fragment_ query to your URLs, in the form URL?_escaped_fragment_=, as in http://your.site.com/path?_escaped_fragment_=
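If you build test URLs programmatically, a small helper like the following sketch can add the parameter while keeping any existing query intact. The name toEscapedFragmentUrl is hypothetical; note that the parameter is joined with ? when the URL has no query and with & when it already has one.

```javascript
// Hypothetical helper: adds an empty _escaped_fragment_ parameter to a URL,
// keeping any query parameters that are already present.
function toEscapedFragmentUrl(urlString) {
  const url = new URL(urlString);
  url.searchParams.append('_escaped_fragment_', '');
  return url.toString();
}

console.log(toEscapedFragmentUrl('http://your.site.com/path'));
// http://your.site.com/path?_escaped_fragment_=
console.log(toEscapedFragmentUrl('http://localhost:3000/blogs?test=1'));
// http://localhost:3000/blogs?test=1&_escaped_fragment_=
```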

curl

Run curl against localhost, or against your host name if you are in production, like:

curl http://localhost:3000/?_escaped_fragment_=
curl http://localhost:3000/ -A googlebot

Google Tools: Fetch as Google

Use Fetch as Google tools to scan your site. Tips:

  • Observe your server logs using tail -f or mup logs -f
  • Fetch as Google and observe that it takes 3-5 minutes before displaying results.
    • Use an uncommon URL to help you identify your request in the logs. Consider adding an extra URL query parameter. For example:

# Simple test with test=1 query
curl "http://localhost:3002/blogs?_escaped_fragment_=&test=1"
# Set the date in the query, which will show up in Meteor logs, with a unique date. (Turn on `Spiderable.debug=true`)
TEST=`date "+%Y%m%d-%H%M%S"`; echo $TEST; curl "http://localhost:3000/blogs?_escaped_fragment_=&test=${TEST}"

Interpreting Fetch as Google results:

  • The tool will not actually hit your server right away.
  • It appears to provide a simple scan result without the extra ?_escaped_fragment_= component.
  • Wait several minutes more. Google appears to request the page, which will show up in your logs as Spiderable successfully completed.
  • Search on Google using site:your.site.com
  • Make sure Google lists all relevant pages.
  • Look at Google's cached version of the pages, to make sure it is fully rendered.
  • Make sure that Google sees the pages with all data subscriptions complete.

Testing PhantomJS

PhantomJS can be temperamental, and can be a challenge to work with.

If PhantomJS is failing on your server, you can try running it directly to help debug what is broken.

On the server console, try running phantomjs --version

Also, you can run this package's PhantomJS script. In order to do so, you'd need to find the phantom_script.js file.

# Find phantom_script.js
PHANTOM_SCRIPT=$(find /opt/YOUR_WEB_APP/app/ -name phantom_script.js)
# Verify that you found just one
echo ${PHANTOM_SCRIPT}
# Try running phantomjs with that script
phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false "${PHANTOM_SCRIPT}" http://localhost
# Verify that it succeeded (should return 0)
echo $?

From Meteor's original Spiderable documentation. See notes specific to this branch (above).

spiderable is part of Webapp. It's one possible way to allow web search engines to index a Meteor application. It uses the AJAX Crawling specification published by Google to serve HTML to compatible spiders (Google, Bing, Yandex, and more).

When a spider requests an HTML snapshot of a page the Meteor server runs the client half of the application inside phantomjs, a headless browser, and returns the full HTML generated by the client code.

In order to have links between multiple pages on a site visible to spiders, apps must use real links (e.g. <a href="/about">) rather than simply re-rendering portions of the page when an element is clicked. Apps should render their content based on the URL of the page and can use HTML5 pushState to alter the URL on the client without triggering a page reload. See the Todos example for a demonstration.

When running your page, spiderable will wait for all publications to be ready. Make sure that all of your publish functions either return a cursor (or an array of cursors), or eventually call this.ready(). Otherwise, the phantomjs executions will fail.