FEATURE: Send a 'noindex' header in non-canonical responses (PR #15026)

While the Discourse SPA has URLs for every single post in a topic, we don’t want crawlers to waste time indexing those. In order to generalize the header application, we are replying with it every time a canonical is generated an is different from the requested URL.

GitHub

I think we should put this behind a hidden site setting to start with. Then we can try it on a few sites before rolling out by default

Setting a response header as a side-effect of the canonical_link_tag helper seems pretty surprising. I think it might be better to make this a separate method within CanonicalUrl::Helpers.set_x_robots_tag, and call it from a before/after-action in the ApplicationController. We already have an add_noindex_header before_action which could be adapted.

Having it as a before_action also means that Routes are still free to customize the header if they want to (e.g. some actions in the Users controller do this)

Actually, maybe it needs to be an after_action, so that routes can set the canonical URL first? In which case, we could use ||= to make sure any explicit values are given priority.

Setting a response header as a side-effect of the canonical_link_tag helper seems pretty surprising.

Yeah, it felt weird.

However I only want to set this when we calculate the canonical, which isn’t something we do on every response right?

The alternative would be setting the canonical value in a session variable and then checking for it on the after_action and setting the header. How that sounds?

However I only want to set this when we calculate the canonical, which isn’t something we do on every response right?

I think we do it on every response. canonical_link_tag is called from _head.html.erb, which is included in every application.html.erb render.

The alternative would be setting the canonical value in a session variable and then checking for it on the after_action and setting the header. How that sounds?

We could cache the result of default_canonical in an instance variable, yeah. We wouldn’t want to put it in session (which is stored in a cookie)

I think we do it on every response

Every response that hits the HTML render right? Most ajax requests won’t.

We could memoize the result of default_canonical in an instance variable, yeah. That way, we don’t have to calculate it twice. We wouldn’t want to put it in session (which is stored in a cookie)

Yes, I want to minimize re-calculating (and possible re-calculating it wrongly). Maybe I can inject the calculated value in the response or request object?

Every response that hits the HTML render right? Most ajax requests won’t.

Ah yes, I see what you mean now :+1:. I guess we just add something like

return if !request.get? || (request.format && request.format.json?) || request.xhr?

to an after_action, so it doesn’t run for XHR/JSON requests?

Maybe I can inject the calculated value in the response or request object?

If we move the default_canonical method into CanonicalURL::ControllerExtensions then we should simply be able to do @default_canonical ||= begin ... end, and it will be cached for the duration of the request inside the controller instance.

I had a go at sketching out what I was thinking of locally. Mostly untested, but the caching thing does seem to work:

diff --git a/app/controllers/application_controller.rb b/app/controllers/application_controller.rb
index be5fd3b01f..71867b8eee 100644
--- a/app/controllers/application_controller.rb
+++ b/app/controllers/application_controller.rb
@@ -41,13 +41,13 @@ class ApplicationController < ActionController::Base
   before_action :redirect_to_login_if_required
   before_action :block_if_requires_login
   before_action :preload_json
-  before_action :add_noindex_header, if: -> { is_feed_request? || !SiteSetting.allow_index_in_robots_txt }
   before_action :check_xhr
   after_action  :add_readonly_header
   after_action  :perform_refresh_session
   after_action  :dont_cache_page
   after_action  :conditionally_allow_site_embedding
   after_action  :ensure_vary_header
+  after_action  :add_noindex_header
 
   HONEYPOT_KEY ||= 'HONEYPOT_KEY'
   CHALLENGE_KEY ||= 'CHALLENGE_KEY'
@@ -895,13 +895,17 @@ class ApplicationController < ActionController::Base
   end
 
   def add_noindex_header
-    if request.get?
-      if SiteSetting.allow_index_in_robots_txt
-        response.headers['X-Robots-Tag'] = 'noindex'
-      else
-        response.headers['X-Robots-Tag'] = 'noindex, nofollow'
-      end
+    return if !request.get? || (request.format && request.format.json?) || request.xhr?
+
+    value = if !SiteSetting.allow_index_in_robots_txt
+      'noindex'
+    elsif is_feed_request?
+      'noindex, nofollow'
+    elsif (@canonical_url || default_canonical) != request.url # and hidden site setting is enabled
+      'noindex'
     end
+
+    response.headers["X-Robots-Tag"] ||= value
   end
 
   protected
diff --git a/lib/canonical_url.rb b/lib/canonical_url.rb
index d318bd56de..df66c37a6c 100644
--- a/lib/canonical_url.rb
+++ b/lib/canonical_url.rb
@@ -2,6 +2,8 @@
 
 module CanonicalURL
   module ControllerExtensions
+    ALLOWED_CANONICAL_PARAMS = %w(page)
+
     def canonical_url(url_for_options = {})
       case url_for_options
       when Hash
@@ -10,22 +12,27 @@ module CanonicalURL
         @canonical_url = url_for_options
       end
     end
+
+    def default_canonical
+      @default_canonical ||= begin
+        canonical = +"#{Discourse.base_url_no_prefix}#{request.path}"
+        allowed_params = params.select { |key| ALLOWED_CANONICAL_PARAMS.include?(key) }
+        if allowed_params.present?
+          canonical << "?#{allowed_params.keys.zip(allowed_params.values).map { |key, value| "#{key}=#{value}" }.join("&")}"
+        end
+        canonical
+      end
+    end
+
+    def self.included(base)
+      base.helper_method :default_canonical
+    end
   end
 
   module Helpers
-    ALLOWED_CANONICAL_PARAMS = %w(page)
     def canonical_link_tag(url = nil)
       tag('link', rel: 'canonical', href: url || @canonical_url || default_canonical)
     end
-
-    def default_canonical
-      canonical = +"#{Discourse.base_url_no_prefix}#{request.path}"
-      allowed_params = params.select { |key| ALLOWED_CANONICAL_PARAMS.include?(key) }
-      if allowed_params.present?
-        canonical << "?#{allowed_params.keys.zip(allowed_params.values).map { |key, value| "#{key}=#{value}" }.join("&")}"
-      end
-      canonical
-    end
   end
 end

What do you think?

Looks great.

Merged it in with a few fixes

This should be removed

This should be removed

      response.headers['X-Robots-Tag'] ||= 'noindex'

(or add !response.headers['X-Robots-Tag'] to the if statement)