FIX: Make migrations from S3 more robust; fix bare URL migration (PR #10093)

Improve URL handling to (more) correctly process URLs present in S3 or an S3 clone when migrating to local storage: rewrite them to upload: pseudo-protocol URLs, tag audio and video files, and avoid creating broken non-URLs that point to nothing.
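
For illustration, here is a minimal Ruby sketch of the kind of rewrite described above; the helper names, extension lists, and markdown forms are assumptions for this sketch, not the exact task code:

```ruby
require "uri"

# Illustrative extension lists; the real task relies on existing upload metadata.
VIDEO_EXTENSIONS = %w[mp4 mov webm ogv].freeze
AUDIO_EXTENSIONS = %w[mp3 ogg wav m4a].freeze

# Build the pseudo-protocol form of an upload URL from its SHA1 and extension.
def upload_pseudo_url(sha1, extension)
  "upload://#{sha1}.#{extension}"
end

# Replace a single S3 URL found in raw post text. Audio and video files are
# tagged so the rendered post shows a player rather than a bare link; anything
# else is left as a plain upload:// URL instead of becoming a broken non-URL.
def replace_s3_url(raw, s3_url, sha1)
  extension = File.extname(URI.parse(s3_url).path).delete_prefix(".").downcase
  new_url = upload_pseudo_url(sha1, extension)

  replacement =
    if VIDEO_EXTENSIONS.include?(extension)
      "![|video](#{new_url})"
    elsif AUDIO_EXTENSIONS.include?(extension)
      "![|audio](#{new_url})"
    else
      new_url
    end

  raw.gsub(s3_url, replacement)
end
```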

Also, fix batch_migrate_from_s3 to take both a maximum number of posts to actually modify in a migration (the original, but broken, intent of limit) and a limit on the total number of posts to consider for migration (for cheaper database queries). Because a primary purpose of the max limit is debugging, also use it to enable verbose output.
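
A rough sketch of the two-knob behaviour, assuming a hypothetical migrate_post_from_s3 helper and an illustrative query; the real rake task differs in detail:

```ruby
def batch_migrate_from_s3(max: nil, limit: nil)
  # A small max implies a debugging run, so use it to switch on verbose output.
  verbose = !max.nil?
  migrated = 0

  # `limit` caps how many posts the query considers at all (cheaper queries);
  # `max` caps how many posts are actually modified.
  Post.where("raw LIKE ?", "%.amazonaws.com/%").limit(limit).each do |post|
    break if max && migrated >= max
    migrated += 1 if migrate_post_from_s3(post, verbose: verbose)
  end

  migrated
end
```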

Finally, in the case of a truly unexpected error, print the error and stop processing, so that error cascades do not turn into hidden failures blocking a successful migration, but can instead be resolved.
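
In code terms the behaviour is roughly this (a sketch only; the post variable and helper are placeholders):

```ruby
begin
  migrate_post_from_s3(post, verbose: verbose)
rescue StandardError => e
  # Surface the problem instead of silently skipping: print it, then re-raise
  # so the run stops and the underlying cause can be fixed.
  puts "Unexpected error migrating post #{post.id}: #{e.class} #{e.message}"
  raise
end
```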


This pull request has been mentioned on Discourse Meta. There might be relevant details there:

https://meta.discourse.org/t/migrate-from-s3-problems/119064/6

The title of this pull request changed from “FIX: Do not break non-upload URLs when migrating from s3” to “FIX: Make migrations from S3 more robust; fix bare URL migration”.

This pull request has been mentioned on Discourse Meta. There might be relevant details there:

https://meta.discourse.org/t/migrating-s3-spaces-non-image-uploads-to-local/132484/4

The only failures I’ve seen with this code running so far on my live site are:

  • Not migrating videos larger than the current max file size; not a bug.
  • Transient failures from DigitalOcean Spaces, which are resolved by the workaround I have at https://github.com/johnsonm/discourse/commit/7dfac12a2ea6ec04ba4e0616b4e0dbd1d806cff7. I can pull that commit into this PR or not, as you wish. I don’t know the right number of retries in general, and I haven’t implemented exponential backoff on the retries yet, since it was intended as a quick hack to see whether it would work around my problem.

I think this is ready for review, and valuable generally.


I’ve run into the need to retry downloads often enough that it’s worth pulling that workaround in. Exponential backoff would be better in theory, but as far as I can tell this simple workaround hasn’t failed yet. Good enough, I hope; it can always be made exponential later, and really, any S3 clone shouldn’t be that unreliable.
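
As a sketch of what that workaround amounts to (the attempt count, rescued error class, and download helper are placeholders, not the committed code):

```ruby
MAX_DOWNLOAD_ATTEMPTS = 3

def download_with_retries(url)
  attempts = 0
  begin
    attempts += 1
    download_from_store(url) # hypothetical downloader standing in for the real call
  rescue StandardError
    if attempts < MAX_DOWNLOAD_ATTEMPTS
      # Plain retry, no exponential backoff; this has been enough in practice.
      retry
    end
    raise
  end
end
```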

It looks like the upload mocks behave inconsistently. They aren’t acting the way I expected; the hash sometimes changes and sometimes doesn’t, and that seems to be the source of the occasional test failures.

With https://meta.discourse.org/t/requesting-help-with-failure-finding-discourseredis-connector-running-tests/155599 addressed, the test failures are understood and resolved. I believe this is ready to merge.