FIX: Heartbeat check per sidekiq process (PR #7873)

GitHub

You’ve signed the CLA, OsamaSayegh. Thank you! This pull request is ready for review.

I think we should rename this method, perhaps trigger_heartbeat. When I think set_xyz it sounds to me like replacing the member xyz with something else.

Overall this method seems sound. I was a bit worried about having so many queues but there is only one worker per custom queue so it should not be very inefficient.

I only have one minor quibble about the naming of a method.

The one tricky thing is testing. There are a couple of unit tests which is good, but before we merge, how comprehensively have you tested the queues/workers running locally? There is a fair amount that is not tested automatically here.

Since you’re only looking for one item in the list, you could do something like this

if demon = Demon::Sidekiq.demons.values.find { |d| d.pid == pid }
  demon.stop
  demon.start
  restarted = true
end

You don’t need the &. or the || 0 when using to_i since nil.to_i == 0 :wink:

    $redis.hget(queues_last_heartbeat_hash_key, name).to_i
      next if queue.name !~ /^\h{32}_\d+$/

Shouldn’t that be?

      queue.clear if queue.size > 0

Or just?

      queue.clear

Maybe this could be refactored like this?

def self.before_start
  # cleans up heartbeat queues from previous boot up
  Sidekiq::Queue.all.each { |queue| queue.clear if queue.name[/^\h{32}_\d+$/] }
end

Yeah I think we can just drop this condition altogether because the queue name already matches the name of the special heartbeat queues.

@eviltrout the way I tested this change is I made sidekiq boot up without any queues locally by passing an empty array to cli.parse here:

that way sidekiq couldn’t process the jobs we enqueue here:

Then I lowered the heartbeat check interval to 5 seconds (so that I don’t have to wait 30 minutes) and made the Heartbeat job run every 2 seconds. After that I ran UNICORN_SIDEKIQS=2 ./bin/unicorn and shortly after that the check_sidekiq_heartbeat method ran and the 2 sidekiq processes that were spawned restarted. Does this sound like correct/enough testing?


There is still one problem that I’d like address before merging this: the problem is that every time the master unicorn boots up, 2 unique keys are created in redis (heartbeat_queues_list_key and queues_last_heartbeat_hash_key). This means that if there are 3 app servers and you hit deploy, 6 redis keys are created per deploy. Over time this would accumulate lots of redis keys that aren’t used at all. I’d like to clean up keys as they become unneeded.

One way to do that would to prefix the 2 keys with the server hostname, and then in the before_start method (I just added it) we can iterate over all redis keys and delete the ones that aren’t needed. The prefix is needed to ensure no server deletes other servers’ keys. That way each time master unicorn boots up, it’d clean up the keys of the previous boot up. Does that sound good? Is there a better approach to this?

Does that sound good? Is there a better approach to this?

Keeping track of what keys you’ve set in the past can be messy, and it wouldn’t account for the situation where we retire a host and its previous keys would exist forever.

Instead I recommend giving the keys an EXPIRY, which you can bump whenever the heartbeat runs. If the heartbeat runs every 3 minutes for example, you could set it as expiring in 60 minutes or something just to be very careful.

After one hour of not being used, the keys will vanish.

hmmm is this checking all heartbeat queues? Should it not just check heartbeats for child sidekiq processes it has?

Did you test with unicorn_sidekiqs set to 2 to see how it properly spawns and monitors? What are the blockers for a merge here, overall this is a fantastic change!

Did you test with unicorn_sidekiqs set to 2 to see how it properly spawns and monitors? What are the blockers for a merge here, overall this is a fantastic change!

Thanks! Yes I did test with UNICORN_SIDEKIQS, I posted details of how I tested this change above https://github.com/discourse/discourse/pull/7873#issuecomment-510970132.

Edit: I’m not aware of any blockers.

Demon::Sidekiq.heartbeat_queues returns all sidekiq processes that are spawned by unicorn master and we check their heartbeat one by one here. Is this not what we are supposed to do?

Oh I see… so each process will give a different list here …

A confusing thing here is that if you call Demon::Sidekiq.heartbeat_queues you will get one list … and if you call this on a forked child process you will get an empty list. Maybe we call this Demon::Sidekiq.child_process_heartbeat_queues ?

I am really liking this change … all I seem to have here is superficial comments, I would like you to merge in the beginning of your day, then watch on meta… then deploy another bigger cluster and watch on it as well. Critical that this gets merged when you have a few hours to double check we are not getting a swamp of “restarting child process cause it died”

@OsamaSayegh can you have a look to see if you can merge it this week?