FIX: Backfill topic_id for EmailLog (PR #13469)

In the previous commit 5222247746340a2f915b17ae15f156e6324ae79a we added a topic_id column to EmailLog. This simply backfills it and gets rid of the topic method defined on EmailLog in favour of belongs_to

GitHub

We need to do this in a post migration since we need the updated code to be deployed before running this migration to ensure we are backfilling correctly.

I’m alittle uneasy about updating so many columns at once since we may end up locking the table and temporary cause email sending to fail. On mailing list enabled sites, the email logs table is huge. Are we able to update in batches instead?

Dang it I second guessed myself here…will switch to a post migrate

I am going to test this query on a couple of sites first (with ROLLBACK) to see the impact…I have not been able to yet because the deploy keeps failing

I will look into the locking…would really prefer not to overcomplicate this query, I have done big UPDATE FROM queries in the past without issue.

@martin-brennan Btw I think I remember why I dropped the topic_id column previously. We can basically get the topic_id via the post association. Is there another need for the topic_id here?

@tgxworld relying on an association is kind of a pain because you have to join across tables just to get the topic ID. Would rather just have the topic ID on this table like we do for IncomingEmail, it will make things a lot easier. Filtering all outbound emails for a topic becomes a lot easier.

I still haven’t gotten around to testing out this migration, doing that today and will batch if needed.

@tgxworld Moved this update into batches and I tested on Meta with a rollback; this is what I did there:

Benchmark.measure do
  ActiveRecord::Base.transaction do
    BATCH_SIZE = 30000
    offset = 0
    email_log_count = EmailLog.count

    loop do
      DB.exec(<<~SQL, offset: offset, batch_size: BATCH_SIZE)
          WITH cte AS (
            SELECT topic_id, post_id
            FROM email_logs
            ORDER BY id
            LIMIT :batch_size
            OFFSET :offset
          )
          UPDATE email_logs
          SET topic_id = posts.topic_id
          FROM cte
          INNER JOIN posts ON posts.id = cte.post_id
          WHERE email_logs.post_id = cte.post_id
      SQL

      offset += BATCH_SIZE
      break if offset > (email_log_count + BATCH_SIZE * 2)
    end
    raise ActiveRecord::Rollback
  end
end

The results were that it took 34s to run this on Meta. Meta has 506230 email log records right now, so that’s about 2s per batch of 30,000. I am actually a little wary of removing the topic method in this PR, it leaves us with no wiggle room if this migration goes bad, I think I will remove that in another PR.

What are your thoughts on this now?

We need to avoid using ActiveRecord in migrations :slight_smile:

I think it is fine to keep this method but I don’t think we need to change the commit history here.

Oh I must have put it in the wrong place sorry

Argh every time!! Will fix

@tgxworld any more changes here?

It seems like we do not have to select topic_id here although I’m wondering if we’re missing a filter to not update rows WHERE topic_id IS NOT NULL

Good point, we do not need topic_id here and for the update I will do this:

WHERE email_logs.post_id = cte.post_id AND email_logs.topic_id IS NULL

I think we can add the email_logs.topic_id IS NULL query to the CTE instead.

Oh right, done

O no we can’t because the rows may change between queries. haha ignore my comment here.

Okay I will remove the additional WHERE lol