A fictional encounter containing useful information about WebSphere MQ and affinities

—— Scene 1/1 —————
<Herd of buffalo exit stage right. A few seconds later, Ian enters stage left and Dylbert enters stage top>

Dylbert: Hi Ian. Got a minute? I have a quick MQ clustering question for you.
Ian: Hey Dylbert. Sure I have a minute, fire away.
Dylbert: Thanks. How should I stop my queue manager, QM1? It’s in a cluster and I want to bring it down for maintenance.

Ian: That’s easy and really shows off the power of clustering. You just stop it nicely and clustering routes messages elsewhere.
Dylbert: Great. So what actually happens when I stop it?
Ian: Any inbound channels in RUNNING state will be quiesced, so the channels from other queue managers in the cluster into QM1 will go into RETRYING state. RETRYING is less preferred than RUNNING or INACTIVE, so the cluster workload balancing algorithm routes traffic over the RUNNING and INACTIVE channels, and the messages go to the queue managers that are still available.
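
An aside for the reader: the preference Ian describes can be pictured with a toy model. This is a simplified sketch, not MQ's actual workload algorithm, which weighs other attributes too. The rank values and the helper names `candidates` and `route` are invented for illustration.

```python
from itertools import cycle

# Hypothetical ranks: lower is preferred. RUNNING and INACTIVE are treated as
# equally good and RETRYING as worse, following Ian's description.
STATE_RANK = {"RUNNING": 0, "INACTIVE": 0, "RETRYING": 1}

def candidates(channels):
    """Return the queue managers whose channel has the best available rank."""
    best = min(STATE_RANK[state] for state in channels.values())
    return sorted(qm for qm, state in channels.items() if STATE_RANK[state] == best)

def route(channels, n_messages):
    """Round-robin n_messages across the preferred candidates."""
    targets = cycle(candidates(channels))
    return [next(targets) for _ in range(n_messages)]

# QM1 has been stopped, so the channel to it has gone into RETRYING.
channels = {"QM1": "RETRYING", "QM2": "RUNNING", "QM3": "INACTIVE"}
routed = route(channels, 4)
print(routed)
```

In this model QM1 is only chosen when every channel is RETRYING, which is the single-instance case that comes up later in the conversation.
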
Dylbert: Wow, cool.
Ian: Of course if any inbound channels are in INACTIVE state, stopping QM1 will not affect the channel status. This means that the cluster workload balancing algorithm is still likely to choose the channel to QM1. If the channel is chosen by the cluster workload algorithm before QM1 is back up, the cluster sender channel is triggered and will fail to start because we have stopped the queue manager on the other end of the channel. This results in the channel going into RETRYING state. And if the channel is not chosen by the cluster workload algorithm before QM1 is back up, then the QM1 outage will not even be noticed and all is good.
Dylbert: So whether the inbound channels are INACTIVE or RUNNING when QM1 is stopped, if a channel is used before QM1 is available again, that channel will go into RETRYING state.
Ian: Correct, and the great part is that if the channels are in RETRYING state when you bring QM1 back online, the channels from the other cluster queue managers will start automatically. You don’t have to go round to all the other queue managers starting channels. Good eh?
Dylbert: Awesome. We have about 500 queue managers in this cluster, so that would be a pain if we had to do that.
Ian: A small word of warning with channel retry: Once all short and long retries have been attempted, the channel will go into STOPPED state. And channels in STOPPED state do require a manual start. But the good news is that if you are using the default retry values on your channels, you will have a very long time to wait before all retries have been attempted.
Dylbert: We’re using the defaults, so we’re ok.
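
An aside for the reader: a rough calculation of how long "a very long time" is. The attribute values below are what I believe the default channel retry settings to be (SHORTRTY 10, SHORTTMR 60, LONGRTY 999999999, LONGTMR 1200). Treat them as an assumption and check your own channel definitions. The function name is made up for this sketch.

```python
# Assumed default channel retry attributes; verify against your own channels.
SHORTRTY, SHORTTMR = 10, 60            # short retry count, seconds between attempts
LONGRTY, LONGTMR = 999_999_999, 1200   # long retry count, seconds between attempts

def seconds_until_stopped(shortrty, shorttmr, longrty, longtmr):
    """Rough total wait before all retries are used up (ignores connect time)."""
    return shortrty * shorttmr + longrty * longtmr

total = seconds_until_stopped(SHORTRTY, SHORTTMR, LONGRTY, LONGTMR)
years = total / (365 * 24 * 3600)
print(f"{total} seconds, roughly {years:,.0f} years")
```

With much smaller retry values the same arithmetic can come out at minutes or hours, which is when the STOPPED state becomes a practical concern.
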
Ian: Anyway, I wasn’t quite done with my explanation before… So QM1 comes back up, the channels restart automatically through retry, and you end up with a healthy channel status again, so the cluster workload algorithm will start choosing the queues on QM1 again. Not only have you not had to go round administering remote queue managers, you also haven’t had to do anything with the applications connected to them to a) make them route their traffic away from QM1 while it is down, and b) start using QM1 again when it comes back up.
Dylbert: Well, thanks for the explanation Ian, I’ll see you…
Ian: Well hold on again, I’m still not quite done… So this all works great. Unless, of course, you have affinities.
Dylbert: No, we avoid them, because we know they can cause problems. But, it wouldn’t hurt to hear what you have to say on the subject. Give me some examples where affinities have been created.
Ian: Sure. Well, firstly, a typical example is using bind-on-open. This is controlled by the MQOO_BIND_* open options and the DEFBIND queue attribute. The problem arises when an application opens a cluster queue with bind-on-open and starts putting messages, and then the queue manager that hosts the chosen queue goes down. This results in messages building up on the cluster transmit queue.

A second good example is where applications fully qualify the destination with the queue manager name, by specifying a non-blank ObjectQMgrName in the MQOD. Obviously if the specified queue manager goes down, all subsequent puts will build up on the cluster transmit queue.

And thirdly, I’ve seen instances where applications are putting to, say, Q1, and there is only one instance of Q1 in the cluster. Again, obviously, if the single queue manager that hosts Q1 goes down, all subsequent puts will build up on the cluster transmit queue.
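
An aside for the reader: the bind-on-open case can be sketched with a toy model. None of this is the MQ API. `Cluster` and `Handle` are hypothetical classes, and the round-robin choice stands in for the real workload algorithm.

```python
class Cluster:
    """Toy cluster: tracks which hosts are up and what is stuck in transit."""

    def __init__(self, hosts):
        self.up = {qm: True for qm in hosts}
        self.delivered = {qm: 0 for qm in hosts}
        self.xmitq = {qm: 0 for qm in hosts}  # messages waiting on the transmit queue
        self._next = 0

    def choose(self):
        """Pick an available host round-robin; fall back to all hosts if none is up."""
        hosts = [qm for qm, ok in self.up.items() if ok] or list(self.up)
        qm = hosts[self._next % len(hosts)]
        self._next += 1
        return qm

    def send(self, qm):
        """Deliver if the host is up, otherwise the message waits in transit."""
        if self.up[qm]:
            self.delivered[qm] += 1
        else:
            self.xmitq[qm] += 1


class Handle:
    """An open queue handle; bind-on-open fixes the destination at open time."""

    def __init__(self, cluster, bind_on_open):
        self.cluster = cluster
        self.bound_to = cluster.choose() if bind_on_open else None

    def put(self):
        self.cluster.send(self.bound_to or self.cluster.choose())


cluster = Cluster(["QM1", "QM2"])
bound = Handle(cluster, bind_on_open=True)     # destination fixed now
unbound = Handle(cluster, bind_on_open=False)  # destination chosen per message
cluster.up["QM1"] = False                      # QM1 goes down after the open
for _ in range(3):
    bound.put()
    unbound.put()
print(bound.bound_to, cluster.xmitq, cluster.delivered)
```

The bound handle keeps targeting the queue manager it picked at open time, so its messages wait in transit, while the unbound handle is routed to the host that is still up.
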

Dylbert: Yeah I think we should be covered on all three points.
Ian: Glad to hear it. I’d recommend building some checking into your admin and apps change control processes too, so that no affinities sneak in.
Dylbert: But hang on, this doesn’t even sound that bad to me. What you’re saying is that even if they have created an affinity, all that happens is that MQ will queue up their messages until they bring the queue manager back online. The apps will actually continue as normal.
Ian: Correct. But, as usual, it all depends on your SLAs. If your service is supposed to be providing a one second response time, this type of affinity issue could cause you some headaches. But as you point out, if you are servicing batch requests that have to be processed within six hours, you have a lot less of a headache.
Dylbert: Sure. Ok then, so if someone had created affinities and this is going to cause them headaches, how should they stop their cluster queue manager?
Ian: Basically, in that situation you need to ensure there are alternate destinations, keep the queue manager up, and persuade the applications in the cluster to stop using it and start using the alternate destinations. Once all the applications have lost their affinity with the queue manager, you bring it down.
Dylbert: Sounds reasonable, but what’s all this persuading?
Ian: There’s an MQSC command, SUSPEND QMGR, that will inform other members of the cluster to stop using the queue manager, unless they have an affinity.
Dylbert: “Unless they have an affinity”? But I thought you said?…
Ian: Well, the purpose of this command is to try and stop new affinities being created. We wait until the original set of applications, that had an affinity with the queue manager, has stopped. Using SUSPEND QMGR we’ve ensured that no new apps can create an affinity with the queue manager, so once the old apps have lost their affinity with the queue manager we can stop it.
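
An aside for the reader: in MQSC the command takes a cluster name, for example SUSPEND QMGR CLUSTER(DEMO), where DEMO is a made-up name. The sequence Ian describes can be pictured with the toy model below. It is an illustration only. `choose_host` and the app names are hypothetical, and the real command is a request to the rest of the cluster rather than a hard block.

```python
def choose_host(hosts, suspended):
    """New choices avoid suspended hosts whenever an alternative exists."""
    preferred = [qm for qm in hosts if qm not in suspended]
    return (preferred or hosts)[0]

hosts = ["QM1", "QM2"]
suspended = set()

# An application binds to a host before the suspend is issued.
bound_apps = {"old_app": choose_host(hosts, suspended)}

# Model of issuing SUSPEND QMGR on QM1: new choices now avoid it.
suspended.add("QM1")
bound_apps["new_app"] = choose_host(hosts, suspended)

def safe_to_stop(qm, bound_apps):
    """True once no application still holds an affinity with qm."""
    return all(target != qm for target in bound_apps.values())

before = safe_to_stop("QM1", bound_apps)  # old_app is still bound to QM1
del bound_apps["old_app"]                  # old_app ends and loses its affinity
after = safe_to_stop("QM1", bound_apps)
print(before, after)
```

The suspend does nothing for the app that was already bound. It only keeps the set of bound apps from growing, so that waiting eventually pays off.
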
Dylbert: Ah, I understand. Ok, next question: How do you know when these apps have lost their affinity with the queue manager?
Ian: You can check the number of messages coming over the channels into the queue manager, or use monitoring products to report usage, but I also recommend knowing your apps and keeping a close eye on transmit queue depths. That way you have a better idea of how long apps can hold an affinity with a queue, and you can spot problems if, for instance, you brought down a queue manager while an app still had an affinity with it.
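
An aside for the reader: one simple way to turn "check the number of messages coming over the channels" into a rule is to sample a message counter at intervals and wait until it stops rising. The sketch below assumes you already collect such samples somehow. The function name, the window size and the sample values are all invented.

```python
def quiet_for(samples, window):
    """True if the last `window` intervals show no change in the message count."""
    if len(samples) < window + 1:
        return False  # not enough history to judge
    recent = samples[-(window + 1):]
    return all(later == earlier for earlier, later in zip(recent, recent[1:]))

# Hypothetical inbound message counts for QM1, sampled at regular intervals.
msgs = [120, 134, 140, 140, 140, 140]
still_busy = quiet_for(msgs[:4], window=3)
gone_quiet = quiet_for(msgs, window=3)
print(still_busy, gone_quiet)
```

A quiet counter is only a hint. An app can hold an affinity and simply be idle, which is why Ian also recommends knowing your apps.
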
Dylbert: Ok, I think I’ve got it. It sounds like I’m ok just stopping QM1 this time. Nice and easy. That I like!
Ian: Yeah, and remember, I do recommend checking those change control processes so that your cluster stays in such a good state.
Dylbert: Thanks, Ian.
Ian: No problem. See you soon.
Dylbert: Sure thing. Bye.

<Ian exits stage left and Dylbert exits down a ladder. A few seconds later, a herd of buffalo enter stage right>
—— End of Scene —————
