FIX: empty the missing list on each post loop

diff --git a/lib/tasks/posts.rake b/lib/tasks/posts.rake
index 09f2242..adbc892 100644
--- a/lib/tasks/posts.rake
+++ b/lib/tasks/posts.rake
@@ -392,10 +392,12 @@ desc 'Finds missing post upload records from cooked HTML content'
 task 'posts:missing_uploads' => :environment do
   name = "missing_uploads"
   PostCustomField.where(name: name).destroy_all
-  posts = Post.where("posts.cooked LIKE '%<a %' OR posts.cooked LIKE '%<img %'").select(:id, :cooked)
-  missing = []
+  posts = Post.where("(posts.cooked LIKE '%<a %' OR posts.cooked LIKE '%<img %') AND posts.cooked LIKE '%/uploads/%'").select(:id, :cooked)
+  count = 0
   posts.find_each do |post|
+    missing = []
     Nokogiri::HTML::fragment(post.cooked).css("a/@href", "img/@src").each do |media|
       src = media.value
       next if src.blank? || (src =~ /\/uploads\//).blank?
@@ -406,9 +408,13 @@ task 'posts:missing_uploads' => :environment do
       missing << src unless Upload.get_from_url(src) || OptimizedImage.get_from_url(src)
-    missing.each { |src| PostCustomField.create!(post_id: post.id, name: name, value: src) }
+    if missing.present?
+      missing.each { |src| PostCustomField.create!(post_id: post.id, name: name, value: src) }
+      count += missing.count
+    end
     putc "."
-  puts "", "#{missing.count} post uploads are missing.", ""
+  puts "", "#{count} post uploads are missing.", ""

GitHub sha: bfdd0fe6
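
To see why the list has to be emptied for each post, here is a minimal plain-Ruby sketch with no Rails or database involved. The `POSTS` hashes and both method names are hypothetical stand-ins for the `Post` rows and the `PostCustomField.create!` calls in the rake task; only the shape of the loop is taken from the diff.

```ruby
# Hypothetical stand-ins: each "post" lists the upload URLs found missing in it.
POSTS = [
  { id: 1, missing_srcs: ["/uploads/default/a.png"] },
  { id: 2, missing_srcs: [] },
  { id: 3, missing_srcs: ["/uploads/default/b.png"] },
].freeze

# Old shape: one list shared by every iteration, so URLs from earlier posts
# are recorded again against each later post.
def record_with_shared_list(posts)
  records = []
  missing = []
  posts.each do |post|
    missing.concat(post[:missing_srcs])
    missing.each { |src| records << [post[:id], src] }
  end
  records
end

# New shape: the list starts empty for each post and a separate counter
# keeps the running total for the final summary line.
def record_with_fresh_list(posts)
  records = []
  count = 0
  posts.each do |post|
    missing = post[:missing_srcs].dup
    next if missing.empty?

    missing.each { |src| records << [post[:id], src] }
    count += missing.count
  end
  [records, count]
end
```

With the three sample posts, the shared-list version writes four records (the first URL is attached to all three posts), while the per-post version writes two records and reports a count of 2.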

I queried for posts with images via the raw column instead, since we know the rules for which kinds of text get converted into uploads.

Right now, any other link that contains /uploads/ will match too.

I think there are two issues in the recovery SQL query you mentioned above.

  1. It won’t recover downloaded onebox images or hotlinked image URLs.
  2. It won’t work for Markdown images, ![](). That affects posts where a user quoted an image or manually copied the URL into image Markdown. While checking missing post uploads on another site, I noticed missing images inside quotes, which is probably related.
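
The recovery query itself is not shown in this thread, so the following is only a sketch of the second point: text that uses Markdown image syntax contains no HTML tag for a tag-based filter to find. The `MARKDOWN_IMAGE` pattern and the `markdown_image_urls` helper are hypothetical and deliberately simplified; they are not what Discourse uses.

```ruby
# Simplified pattern for Markdown images: ![alt](url ...).
# An assumption for illustration; real Markdown parsing also handles
# titles, escapes, and nested brackets.
MARKDOWN_IMAGE = /!\[[^\]]*\]\(([^)\s]+)/

# Returns the URL of every Markdown image found in a post's raw text.
def markdown_image_urls(raw)
  raw.scan(MARKDOWN_IMAGE).flatten
end

# A quoted image, as it might appear in raw text (sample data, not from a real post).
raw = <<~RAW
  [quote="someone"]
  ![diagram](/uploads/default/original/1X/abc.png)
  [/quote]
  Plain link: /uploads/default/original/1X/def.pdf
RAW

urls = markdown_image_urls(raw)
has_img_tag = raw.include?("<img")
```

Here `urls` holds only the quoted image path and `has_img_tag` is false, so a filter that looks for `<img` in this text would skip the post.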

I think we have to change the SQL query in upload_recovery.

Ah, I see. In that case, can you update the query for upload_recovery too?

Is there a way around matching on %/uploads/%? That might match uploaded content from other sites as well.


No, I’m unable to find a better way to exclude other sites. We can filter on something like /uploads/default/ or /uploads/{name}/, but I’m not sure how reliable that is.
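
A rough sketch of that narrower filter, written as a plain Ruby regular expression rather than the actual SQL. The `db_name` parameter and both helper names are hypothetical; "default" is used only because it appears in the example path above.

```ruby
# Builds a pattern for /uploads/{name}/ where {name} is the site's upload
# directory. `db_name` is a hypothetical parameter for illustration.
def local_upload_pattern(db_name)
  %r{/uploads/#{Regexp.escape(db_name)}/}
end

# True when the URL contains the site-specific upload path.
def local_upload?(src, db_name = "default")
  src.match?(local_upload_pattern(db_name))
end

# Sample URLs (made up for this sketch).
own_upload   = "/uploads/default/original/1X/abc.png"
other_site   = "https://other.example/uploads/2019/photo.jpg"
other_forum  = "https://forum.example/uploads/default/original/1X/xyz.png"
```

The narrower pattern accepts `own_upload` and rejects `other_site`, which a bare /uploads/ match would have accepted. It still accepts `other_forum`, because another site can use the same directory name, which is the reliability concern raised above.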

I think that is better. I used that pattern for the migration to the new scheme as well: discourse/upload.rb at master · discourse/discourse · GitHub

The where query is now changed: DEV: optimize sql query to narrow down the filtering of post with upl… · discourse/discourse@4878ee9 · GitHub


FIX: should look through posts for image markdown