DEV: improve missing uploads query and skip checking file size

From bcdf5b2f475366c9de2570b9625a2e1d495f5da3 Mon Sep 17 00:00:00 2001
From: Vinoth Kannan <vinothkannan@vinkas.com>
Date: Tue, 27 Nov 2018 02:21:33 +0530
Subject: [PATCH] DEV: improve missing uploads query and skip checking file
 size


diff --git a/lib/file_store/s3_store.rb b/lib/file_store/s3_store.rb
index 4ad4f2a..08d2600 100644
--- a/lib/file_store/s3_store.rb
+++ b/lib/file_store/s3_store.rb
@@ -128,7 +128,7 @@ module FileStore
         verified_ids = []
 
         files.each do |f|
-          id = model.where("url LIKE '%#{f.key}' AND filesize = #{f.size}").pluck(:id).first
+          id = model.where("url LIKE '%#{f.key}'").pluck(:id).first if f.size > 0
           verified_ids << id if id.present?
           marker = f.key
         end
@@ -138,7 +138,7 @@ module FileStore
         files = @s3_helper.list(prefix, marker)
       end
 
-      missing_uploads = model.joins('LEFT JOIN verified_ids ON val = id').where(val: nil)
+      missing_uploads = model.where("id NOT IN (SELECT val FROM verified_ids)")
       missing_count = missing_uploads.count
 
       if missing_count > 0

What is the rationale for this change? A NOT IN query is going to be slower than a LEFT JOIN on large datasets.
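
For readers comparing the two forms in the diff, here is a minimal in-memory sketch of what each query selects. It is plain Ruby, not ActiveRecord or SQL; the `uploads` and `verified_ids` arrays and both helper methods are hypothetical stand-ins for the tables in the patch. Both forms return the same rows as long as `val` is never NULL, so the difference under discussion lies in how the database executes them, which this sketch does not measure.

```ruby
require "set"

# Hypothetical stand-ins for the uploads table and the verified_ids
# temp table (column `val`) used in the patch.
uploads = [{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }]
verified_ids = [{ val: 1 }, { val: 3 }]

# Mirrors: LEFT JOIN verified_ids ON val = id ... WHERE val IS NULL
def missing_via_left_join(uploads, verified_ids)
  by_val = verified_ids.group_by { |row| row[:val] }
  uploads
    .flat_map { |u| (by_val[u[:id]] || [{ val: nil }]).map { |v| u.merge(v) } }
    .select { |row| row[:val].nil? }   # keep rows with no match on the right side
    .map { |row| row[:id] }
end

# Mirrors: WHERE id NOT IN (SELECT val FROM verified_ids)
def missing_via_not_in(uploads, verified_ids)
  vals = verified_ids.map { |row| row[:val] }.to_set
  uploads.reject { |u| vals.include?(u[:id]) }.map { |u| u[:id] }
end

p missing_via_left_join(uploads, verified_ids)
p missing_via_not_in(uploads, verified_ids)
```

Both calls print the ids of uploads that have no entry in `verified_ids`.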

The file size check is insufficient for our needs. A corrupted file can also have a size that is greater than 0.

I fixed it in the S3 inventory PR using ETags.
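
To illustrate why an ETag catches what a size check misses, here is a rough sketch. It is not the code from the S3 inventory PR. It assumes a single-part upload without SSE-KMS, where the S3 ETag is the hex MD5 of the object body (multipart uploads use a different scheme). The `etag_matches?` helper is hypothetical.

```ruby
require "digest"

# Hypothetical helper: true when the remote ETag equals the MD5 of the
# bytes we expect to be stored.
def etag_matches?(expected_bytes, remote_etag)
  # S3 returns the ETag wrapped in double quotes, so strip them first.
  Digest::MD5.hexdigest(expected_bytes) == remote_etag.delete('"')
end

original  = "hello world"
corrupted = "hello w0rld" # same byte size, different content

# Simulated ETag as S3 would report it for the original content.
etag = %("#{Digest::MD5.hexdigest(original)}")

puts original.bytesize == corrupted.bytesize # true: a size check passes
puts etag_matches?(original, etag)           # true
puts etag_matches?(corrupted, etag)          # false: the ETag check fails
```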
