FIX: First pass to improve efficiency of secure uploads rake task (PR #9284)

Get rid of the harmful each loop over the uploads to update. Instead, we put all of the unique access control posts for those uploads into a hash for fast lookup (vs. the slow .find through an array) and look the post up as needed while looping through the uploads in batches.

On a Discourse instance with ~93k uploads, a simplified version of the old method takes > 1 minute:

unique_access_control_posts = Post.unscoped.select(:id, :topic_id).includes(topic: :category)
  .where(id: uploads_to_update.pluck(:access_control_post_id).uniq)
uploads_to_update.each do |upload|
  upload.access_control_post = unique_access_control_posts.find { |post| post.id == upload.access_control_post_id }
end

A simplified version of the new method takes ~18s and uses a lot less memory: each lookup is a constant-time hash access instead of a scan through the posts array for every upload, and find_each loads the uploads in batches rather than all at once:

unique_access_control_posts = {}
Post.unscoped.select(:id, :topic_id).includes(topic: :category)
  .where(id: uploads_to_update.pluck(:access_control_post_id).uniq).find_each do |post|
  unique_access_control_posts[post.id] = post
end

uploads_to_update.find_each do |upload|
  upload.access_control_post = unique_access_control_posts[upload.access_control_post_id]
end


Does this not pull a huge number of posts into memory? Why not just keep the ids and then just update access_control_post_id = id instead?

@SamSaffron because the post’s topic and that topic’s category are used in UploadSecurity to determine whether the post has secure media. See:

https://github.com/discourse/discourse/blob/master/lib/upload_security.rb#L51-L53
https://github.com/discourse/discourse/blob/master/app/models/post.rb#L500-L504

On the forum I have tested on, this results in ~3400 posts in memory for ~93k uploads.
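
Roughly, the check that needs the topic and category loaded looks something like this (a simplified sketch only, not the exact code in the linked files; the method name here is approximate):

# Simplified sketch; see the UploadSecurity and Post links above for the
# real implementation.
def access_control_post_needs_secure_media?(post)
  topic = post.topic
  # The upload needs to be secure if its access control post lives in a
  # private message or in a read-restricted category, which is why the topic
  # and category are preloaded for every post.
  topic&.private_message? || topic&.category&.read_restricted
end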

I see. Then why not just keep a list of ids and do a Post.find(id) in the loop? That way you don't keep 3000 posts in memory.
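
That alternative would look roughly like this (a sketch only, assuming the same uploads_to_update relation; it avoids holding the posts in memory but does one query per upload):

uploads_to_update.find_each do |upload|
  # Fetch each access control post on demand instead of preloading them all:
  # no ~3400 posts held in memory, but one DB round trip per upload.
  upload.access_control_post =
    Post.unscoped.select(:id, :topic_id).includes(topic: :category)
        .find_by(id: upload.access_control_post_id)
end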

I thought it would be more efficient to keep 3000 posts in memory than to go to the DB 93,000 times?

It's tricky; it depends on how big the list is. If you can comfortably hold it in memory then I guess it is fine.

I guess this is already much improved, so we can go with it, but if you are doing this treatment on a DB with 100k posts that need touching, there is a chance you cannot hold all of that in memory.
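
If that ever becomes a problem, one way to bound the memory use (a sketch only, not what the PR does) would be to slice the post ids and only hold one slice of posts at a time:

uploads_to_update.pluck(:access_control_post_id).uniq.each_slice(1000) do |post_ids|
  # Hold at most 1000 access control posts in memory at any one time.
  posts_by_id = Post.unscoped.select(:id, :topic_id).includes(topic: :category)
                    .where(id: post_ids).index_by(&:id)

  uploads_to_update.where(access_control_post_id: post_ids).find_each do |upload|
    upload.access_control_post = posts_by_id[upload.access_control_post_id]
    # ... rest of the per-upload work goes here
  end
end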

I say merge for now, you need to unblock yourself.

OK, all good. I guess it is a balancing act. I might just be being overly cautious about DB performance based on past experience, probably because I have never worked anywhere with a full infrastructure team that is great at handling this side of things :slight_smile: