FEATURE: Allow customization of robots.txt (PR #7884)

This allows admins to customize/override the content of the robots.txt file at /admin/customize/robots. The page is not linked anywhere in the UI; admins have to type the URL manually to access it.

Meta topic: https://meta.discourse.org/t/needing-to-edit-robots-txt-file-where-is-it/93879?u=osama

Screenshots: [two screenshots omitted]

@eviltrout does it make sense to prepend a comment to robots.txt that says something along the lines of “this robots.txt file has been customized at /admin/customize/robots” if the file is customized? It might help with figuring out why certain things are in the file and how to remove/change them?


You’ve signed the CLA, OsamaSayegh. Thank you! This pull request is ready for review.

This pull request has been mentioned on Discourse Meta. There might be relevant details there:

https://meta.discourse.org/t/needing-to-edit-robots-txt-file-where-is-it/93879/40

How does this interact with the handful of existing site settings that modify robots.txt, such as whitelisted crawler user agents, slow down crawler user agents, and allow index in robots txt?

@coding-horror this allows overriding the whole robots.txt file. So if an admin goes to this page and makes some changes, then that’s what’s going to be served. Changes to the site settings that you mentioned won’t apply if there is an overriding copy in the database (admins can remove the overriding copy and restore the default robots.txt). Is this how you think it should work?
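For what it’s worth, the serving logic can be sketched roughly like this on the Rails side (the setting and helper names below are assumptions for illustration, not the PR’s actual code):

```ruby
# Illustrative sketch only; setting and helper names are assumptions,
# not necessarily what the PR uses.
class RobotsTxtController < ApplicationController
  def index
    overridden = SiteSetting.overridden_robots_txt # hypothetical storage for the admin's copy

    if overridden.present?
      # An overriding copy exists in the database: serve it verbatim and
      # ignore the crawler-related site settings entirely.
      render plain: overridden
    else
      # No override: keep building the default file from the site settings.
      render plain: build_default_robots_txt # hypothetical helper
    end
  end
end
```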

It’s fine, we just need to make that clear somewhere. Also, is there a “restore” button to revert robots.txt to the default?

Yes, there is a “revert changes” button. I will add some copy to the page to make that clear.

Overall it’s looking good but I’d like you to make some changes.

This looks like a good candidate for using the buffered-property mixin, which will give you a buffer with helper methods.

We have a computed property, propertyEqual, that could save a couple of lines here.

I am not a fan of using an API here to update to “” as a way of going back to the default. I would prefer if we used another RESTful action for this. Maybe DELETE to the endpoint?
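As a rough sketch of that idea (the route, controller, and helper names below are illustrative, not necessarily what the PR ends up with):

```ruby
# Sketch of a dedicated revert action; names are illustrative.

# config/routes.rb
delete "admin/customize/robots" => "admin/robots_txt#reset"

# app/controllers/admin/robots_txt_controller.rb
class Admin::RobotsTxtController < Admin::AdminController
  def reset
    SiteSetting.overridden_robots_txt = "" # hypothetical storage; clearing it restores the default
    render json: { robots_txt: build_default_robots_txt } # hypothetical helper
  end
end
```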

I’d prefer to use CSS rather than the rows attribute here. It makes it easier to be responsive.

The contents of this JSON should be the overridden version, because it is used by services to generate robots.txt at the root of subfolder sites; otherwise those services would have to be updated to know about the new overridden property.

This method now does not return the robots info; it returns the default robots info (other code is supposed to know about this and return an override if it exists). I would suggest renaming it to fetch_default_robots_info.
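In other words, something along these lines; apart from fetch_default_robots_info, the names here are hypothetical:

```ruby
# Sketch; apart from fetch_default_robots_info, the names are hypothetical.
def fetch_default_robots_info
  # Only knows how to build the default structure from the site settings.
  { header: default_robots_header, agents: agents_from_site_settings }
end

def robots_txt_content
  # Callers check for an overriding copy first and fall back to the default.
  overridden = SiteSetting.overridden_robots_txt # hypothetical storage
  overridden.present? ? overridden : render_robots_txt(fetch_default_robots_info)
end
```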

    it "can't perform #update" do
    it "returns default content if there are no overrides" do

does it make sense to prepend a comment to robots.txt that says something along the lines of “this robots.txt file has been customized at /admin/customize/robots” if the file is customized?

The problem is that this information is only really useful to staff. You could add logic that only includes that message when the request is made via a staff account, but some people use staff API keys to generate the robots.txt for the root of a subfolder site.
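For context, such a script might look roughly like this; the endpoint path and auth headers are assumptions, and the JSON shape is the one quoted further down:

```ruby
# Rough sketch of the kind of script a subfolder site might run to rebuild
# the root robots.txt; endpoint path and auth headers are assumptions.
require "net/http"
require "json"
require "uri"

uri = URI("https://forum.example.com/robots-builder.json") # hypothetical endpoint
req = Net::HTTP::Get.new(uri)
req["Api-Key"] = ENV["DISCOURSE_API_KEY"]
req["Api-Username"] = "system"

res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
info = JSON.parse(res.body)

# Assemble a plain robots.txt from the structured response quoted below.
lines = [info["header"], ""]
info["agents"].each do |agent|
  lines << "User-agent: #{agent["name"]}"
  agent["disallow"].each { |path| lines << "Disallow: #{path}" }
  lines << ""
end

File.write("/var/www/robots.txt", lines.join("\n"))
```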

I’m sorry, I’m not sure I understand this point correctly. Do you mean that if there is an overridden version, we should try to parse it and send it in JSON format, the way we do for the default version, like this?

{
   "header":"# See http://www.robotstxt.org/robotstxt.html ...",
   "agents":[
      {
         "name":"*",
         "disallow":[
            "/auth/",
            "/assets/browser-update*.js",
            ...
         ]
      },
      {
         "name":"mauibot",
         "disallow":[
            "/"
         ]
      },
     ...
   ]
}

Oooh, I see, the format is much more complicated than I thought. You can disregard my comment, sorry. Let’s change the API the way you suggested.

This looks good now, thanks! Merge when ready.