PERF: Speed up `Upload.migrate_to_new_scheme` by limiting remap scope.

PERF: Speed up Upload.migrate_to_new_scheme by limiting remap scope.

Doing a LIKE on Post#raw and Post#cooked takes forever on large sites.

diff --git a/app/jobs/scheduled/migrate_upload_scheme.rb b/app/jobs/scheduled/migrate_upload_scheme.rb
index bcb89b6..5516c71 100644
--- a/app/jobs/scheduled/migrate_upload_scheme.rb
+++ b/app/jobs/scheduled/migrate_upload_scheme.rb
@@ -16,7 +16,7 @@ module Jobs
       # migrate uploads to new scheme
-      problems = Upload.migrate_to_new_scheme(50)
+      problems = Upload.migrate_to_new_scheme(limit: 50)
       problems.each do |hash|
         upload_id = hash[:upload].id
diff --git a/app/models/upload.rb b/app/models/upload.rb
index 8b3088c..b68fa53 100644
--- a/app/models/upload.rb
+++ b/app/models/upload.rb
@@ -202,17 +202,17 @@ class Upload < ActiveRecord::Base
     self.posts.where("cooked LIKE '%/_optimized/%'").find_each(&:rebake!)
-  def self.migrate_to_new_scheme(limit = nil)
+  def self.migrate_to_new_scheme(limit: nil)
     problems = []
     if SiteSetting.migrate_to_new_scheme
       max_file_size_kb = [SiteSetting.max_image_size_kb, SiteSetting.max_attachment_size_kb].max.kilobytes
       local_store =
-      scope = Upload.by_users.where("url NOT LIKE '%/original/_X/%' AND url LIKE '%/uploads/#{RailsMultisite::ConnectionManagement.current_db}%'")
-        .order(id: :desc)
+      scope = Upload.by_users.where("url NOT LIKE '%/original/_X/%' AND url LIKE '%/uploads/#{RailsMultisite::ConnectionManagement.current_db}%'").order(id: :desc)
       scope = scope.limit(limit) if limit
+      remap_scope = nil
       scope.each do |upload|
@@ -255,7 +255,25 @@ class Upload < ActiveRecord::Base
           # remap the URLs
           DbHelper.remap(UrlHelper.absolute(previous_url), upload.url) unless external
-          DbHelper.remap(previous_url, upload.url)
+          DbHelper.remap(
+            previous_url,
+            upload.url,
+            excluded_tables: %w{
+              posts
+              post_search_data
+            }
+          )
+          remap_scope ||= begin
+            Post.with_deleted.where("raw ~ '/uploads/default/\\d+/'").all
+          end
+          remap_scope.each do |post|
+            post.raw.gsub!(previous_url, upload.url)
+            post.cooked.sub!(previous_url, upload.url)
+  ! if post.changed?
+          end

GitHub sha: 149411ec

you know this is one of the reasons AR is slow. if we gave up this pattern and instead did:

post.raw = post.raw.gsub(previous_url, upload.url)

a huge amount of waste would be eliminated

I guess there is a reasonable limit here for the count … but yeah in theory this could blow memory.

I’m not following here? How does gsub! relate to the performance of active record?

I really want my pet fix here.

If this thing ran and did 0 work, it should disable the job by turning off the site setting.

It means that AR needs to dup every string so it can figure out is raw_changed? and so on.

With gsub won’t we end up allocating an extra string anyway?

Yeah… but how often are you doing p = Post.find(121) and following that up with p.raw.gsub!?

We now pay a tax each time you materialize a Post object just so if maybe you decide to mutate strings AR will have an old copy around.

I’m not following at all but I guess the change is trivial so I’m make the changes.

no … zero point making the change unless we redo internals of Active Record in a monkey patch.

Follow up to 149411ec90ad443cc662b7ef307ce76f83a774e1.

@tgxworld I think what @sam is saying is that when you do post.raw.gsub!, you’re mutating the same string. So when you ask post.changed?, AR will internally ask if raw_changed? and will have to duplicate the string to check whether it has changed or not.

But if you do post.raw = post.raw.gsub(), my guess is that AR already nows that raw has changed since you’ve assigned a new value.

Am I correct @sam?

It should be gsub right?

Yup already followed up in Follow up to 149411ec90ad443cc662b7ef307ce76f83a774e1. · discourse/discourse@3094a60 · GitHub

1 Like

Already fixed in 3094a603b7fdc23441d5e101d178a0ee466bd5a8

1 Like

I don’t understand the concern here because post.raw.gsub ends up allocating the string as well so allocations wise we’re still the same. Based on my brief scan of rails/dirty.rb at 5-2-stable · rails/rails · GitHub, the original attributes are cached once an AR object is instantiated and dirty methods are comparing the current value against the cached attributes.

I think my point has been totally missed here, I did a terrible job communicating it.

IF in a hypothetical world AR disallowed mutating of strings THEN AR would be faster cause it would get away with duplicating less strings.

So in code:

post = Post.first
# Support for this by rails comes at an unconditional cost.
post.raw.gsub!('XYZ', 'ABC')!

post = Post.first
# this is slower cause rails needs to support the `gsub!` thing
puts post.raw

# In my mind this code is fine and in many ways easier to reason about cause less magic
post.raw = post.raw.gsub('XYZ', 'ABC')

So just to clarify here 100%,

Rails works the way it works now, but needs to duplicate all these strings, I wish it just returned frozen strings and then it could do change tracking without needing a dup

This was fixed in

1 Like