FIX: Make migrations from S3 more robust; fix bare URL migration (PR #10093)

Improve URL handling to (more) correctly process URLs present in S3 or an S3 clone when migrating to local storage, including rewriting them to upload: pseudo-protocol URLs, tagging audio and video files, and, in the process, not creating non-URLs that point to nothing.
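
For illustration, a minimal sketch of the kind of rewrite this describes (not the PR’s exact code; Upload#short_url is the real Discourse helper that yields an upload:// URL, everything else here is an assumption):

    # Sketch only: rewrite a bare S3 URL in post raw to an upload:// URL,
    # wrapping audio and video files in tags so they render as players.
    def rewrite_raw_url(raw, old_url, upload)
      new_url = upload.short_url # e.g. "upload://abc123.mp4"
      case File.extname(old_url).downcase
      when ".mp3", ".ogg", ".wav"
        raw.gsub(old_url, "<audio controls><source src=\"#{new_url}\"></audio>")
      when ".mp4", ".webm", ".mov"
        raw.gsub(old_url, "<video controls><source src=\"#{new_url}\"></video>")
      else
        raw.gsub(old_url, new_url)
      end
    end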

Also, fix batch_migrate_from_s3 to take both a max number of posts to actually modify in a migration (the original, but unrealized, intent of limit) and a limit on the total posts to consider for migration (for cheaper database queries). Because a primary purpose of max is debugging, also use it to enable verbose output.
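
A hypothetical sketch of how the two knobs interact (the argument plumbing and the migration helper are assumptions; the names match the discussion):

    # Sketch only: limit caps how many posts the query considers,
    # max caps how many posts are actually modified.
    task "uploads:batch_migrate_from_s3", %i[limit max] => :environment do |_, args|
      limit = args[:limit]&.to_i # total posts to consider (cheaper DB queries)
      max = args[:max]&.to_i     # posts to actually modify
      verbose = !max.nil?        # a set max implies a debugging run
      migrated = 0
      Post.where("cooked LIKE '%amazonaws%'").limit(limit).each do |post|
        break if max && migrated >= max
        migrated += 1 if migrate_post_from_s3(post, verbose: verbose) # hypothetical helper
      end
    end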

Finally, in the case of a truly unexpected error, print the error and stop processing, so that error cascades do not become hidden failures that silently block an otherwise successful migration, but can instead be diagnosed and resolved.
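
The shape of that fail-fast behavior, sketched (the helper name is hypothetical):

    begin
      migrate_post_from_s3(post) # hypothetical helper
    rescue => e
      puts "Failed on post #{post.id}: #{e.message}"
      raise # stop processing instead of letting the failure cascade silently
    end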


This pull request has been mentioned on Discourse Meta. There might be relevant details there:

https://meta.discourse.org/t/migrate-from-s3-problems/119064/6

The title of this pull request changed from “FIX: Do not break non-upload URLs when migrating from s3” to “FIX: Make migrations from S3 more robust; fix bare URL migration”.

This pull request has been mentioned on Discourse Meta. There might be relevant details there:

https://meta.discourse.org/t/migrating-s3-spaces-non-image-uploads-to-local/132484/4

The only failures I’ve seen with this code running so far on my live site are:

  • Videos larger than the current max file size are not migrated; that’s expected, not a bug.
  • Transient failures from DigitalOcean Spaces, which are resolved by the workaround I have at https://github.com/johnsonm/discourse/commit/7dfac12a2ea6ec04ba4e0616b4e0dbd1d806cff7. I can pull that commit into this PR or not, as you wish. I don’t know the right number of retries in general, and I haven’t implemented exponential backoff yet, since this was intended as a quick hack to see whether it would work around my problem.

I think this is ready for review, and valuable generally.

I’ve run into the need to retry downloads often enough that it’s worth pulling in that workaround. Exponential backoff would be better in theory, but as far as I can tell this simple fixed-retry approach hasn’t failed yet. Good enough, I hope; it can always be made exponential later, but really no S3 clone should be that unreliable.
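
For reference, a minimal sketch of the fixed-count retry being pulled in (the retry count and the download call are assumptions):

    # Sketch only: retry transient download failures a fixed number of times.
    MAX_RETRIES = 3

    def download_with_retries(upload)
      attempts = 0
      begin
        Discourse.store.download(upload) # assumed download path
      rescue => e
        attempts += 1
        retry if attempts < MAX_RETRIES
        raise
      end
    end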

It looks like the upload mocks behave inconsistently. They aren’t acting as I would have expected: the hash sometimes changes and sometimes doesn’t. This seems to be the source of the occasional test failures.

With https://meta.discourse.org/t/requesting-help-with-failure-finding-discourseredis-connector-running-tests/155599 addressed, the test failures are understood and resolved. I believe this is ready to merge.

                raise "#{url} not found for #{post_tag}" unless file

I’d rather have a separate variable for this (verbose?), which could be set to true automatically when max is non-nil, or via ENV["VERBOSE"]. You could consider using a helper function for logging, like: https://github.com/discourse/discourse/blob/630ec7b49cd98358736c5445302c509c62c68653/lib/shrink_uploaded_image.rb#L238-L240
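
Something like this sketch of the suggested pattern (using a lambda so it works inside a rake task body):

    # Derive verbosity once, then route all debug output through a helper.
    verbose = !max.nil? || ENV["VERBOSE"].present?
    log = ->(*args) { puts(*args) if verbose }

    log.call("migrating #{url} for #{post_tag}")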

The full list (I hope it’s full :sweat_smile:) of non-post upload types can be seen, for example, in the ShrinkUploadedImage class:

https://github.com/discourse/discourse/blob/630ec7b49cd98358736c5445302c509c62c68653/lib/shrink_uploaded_image.rb#L165-L173

This section of the script would break all of those types of upload references (other than the profile uploads handled above).
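
For illustration, a rough sketch of what checking those references could look like (the first two columns are real Discourse columns; the rest of the list is elided here, and the authoritative version lives at the link above):

    # Sketch only: an upload can be referenced by records other than posts.
    NON_POST_REFERENCES = [
      [User, :uploaded_avatar_id],
      [UserAvatar, :custom_upload_id],
      # ... user profiles, categories, custom emoji, site settings, ...
    ]

    def referenced_outside_posts?(upload)
      NON_POST_REFERENCES.any? { |model, column| model.where(column => upload.id).exists? }
    end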

@davidtaylorhq Can I get your eyes on this? :slightly_smiling_face: (the name of the task and what it does)

This re-creates uploads, preserving only the actual file plus the user_id, filename, and origin attributes?

I’m fine with creating new upload records (there was even an idea to make them immutable, to have a permanent record of changes) but all original attributes should be preserved.
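
For context, the re-creation pattern under discussion looks roughly like this (UploadCreator is Discourse’s real API; whether these are the only attributes worth carrying over is exactly the question):

    # Sketch: re-create the upload locally from the downloaded file,
    # carrying over only the attributes named above.
    file = Discourse.store.download(upload) # assumed download path
    new_upload = UploadCreator.new(
      file,
      upload.original_filename,
      origin: upload.origin,
    ).create_for(upload.user_id)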

limit is the limit to consider migrating, max is the max to actually migrate

I think I know what both of these do by reading the code, but this comment doesn’t make it clear. max is the number of uploads to be migrated? And limit is used in SQL queries? (does that need to be configurable?)

Naming is hard :sweat_smile: but it would be better to use something that isn’t a model name in Discourse. :smiley:

From meta:

It feels like it makes sense to start with posts, since you want to rebake each post to change the URL in its cooked HTML right after migrating its uploads; migrating all the uploads first and only then starting the rebakes would leave a large site very broken for a potentially long time.

You can process Uploads one by one and rebake posts in place, like downsize_uploads/ShrinkUploadedImage does.
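
Roughly, as a sketch (migrate_to_local is hypothetical; upload.posts and rebake! are real Discourse API):

    # Sketch only: migrate one upload at a time, rebaking its posts immediately
    # so the cooked HTML never lags far behind the migration.
    Upload.where("url LIKE '%amazonaws%'").find_each do |upload|
      next unless migrate_to_local(upload) # hypothetical helper
      upload.posts.find_each(&:rebake!)
    end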


Ideally we would extract the code from those two files into a class that’d be responsible for updating all objects related to a given upload (posts, user profiles, avatars, etc.).
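
A rough skeleton of that extraction (class and method names are hypothetical):

    # Sketch only: one place that knows every object referencing an upload.
    class UploadReferenceUpdater
      def initialize(upload)
        @upload = upload
      end

      def update_all!
        @upload.posts.find_each(&:rebake!)
        # ... update user profiles, avatars, categories, site settings, etc. ...
      end
    end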