A fictional encounter containing useful information about WebSphere MQ and affinities
—— Scene 1/1 —————
<Herd of buffalo exit stage right. A few seconds later, Ian enters stage left and Dylbert enters stage top>
| Dylbert: | Hi Ian. Got a minute? I have a quick MQ clustering question for you. |
| Ian: | Hey Dylbert. Sure I have a minute, fire away. |
| Dylbert: | Thanks. How should I stop my queue manager, QM1? It’s in a cluster and I want to bring it down for maintenance. |
| Ian: | That’s easy and really shows off the power of clustering. You just stop it nicely and clustering routes messages elsewhere. |
| Dylbert: | Great. So what actually happens when I stop it? |
| Ian: | Any inbound channels in RUNNING state will be quiesced and so the channels from other queue managers in the cluster into QM1 will go into RETRYING state. RETRYING is less preferred than RUNNING/INACTIVE, so the cluster workload balancing algorithm routes traffic to the RUNNING/INACTIVE channels, thus routing messages to the available queue managers. |
| Dylbert: | Wow, cool. |
| Ian: | Of course if any inbound channels are in INACTIVE state, stopping QM1 will not affect the channel status. This means that the cluster workload balancing algorithm is still likely to choose the channel to QM1. If the channel is chosen by the cluster workload algorithm before QM1 is back up, the cluster sender channel is triggered and will fail to start because we have stopped the queue manager on the other end of the channel. This results in the channel going into RETRYING state. And if the channel is not chosen by the cluster workload algorithm before QM1 is back up, then the QM1 outage will not even be noticed and all is good. |
| Dylbert: | So whether the inbound channels are in INACTIVE or RUNNING state when QM1 is stopped, if the channels are used before QM1 is available again, they will go into RETRYING state. |
| Ian: | Correct and the great part is that if the channels are in RETRYING state, when you bring QM1 back online, the channels from the other cluster queue managers will start automatically. You don’t have to go round to all the other queue managers starting channels. Good eh? |
| Dylbert: | Awesome. We have about 500 queue managers in this cluster, so that would be a pain if we had to do that. |
| Ian: | A small word of warning with channel retry: Once all short and long retries have been attempted, the channel will go into STOPPED state. And channels in STOPPED state do require a manual start. But the good news is that if you are using the default retry values on your channels, you will have a very long time to wait before all retries have been attempted. |
| Dylbert: | We’re using the defaults, so we’re ok. |
| Ian: | Anyway, I wasn’t quite done with my explanation before… So QM1 comes back up, the channels auto-start, using retry, and so you end up with a healthy channel status again and therefore the cluster workload algorithm will start choosing the queues on QM1 again. So not only have you not had to go round administering remote queue managers, but you also haven’t had to do anything with the applications connected to them to a) make them route their traffic to avoid QM1 while it is down, and b) start using QM1 again when it comes back up. |
| Dylbert: | Well, thanks for the explanation Ian, I’ll see you… |
| Ian: | Well hold on again, I’m still not quite done… So this all works great. Unless, of course, you have affinities. |
| Dylbert: | No, we avoid them, because we know they can cause problems. But, it wouldn’t hurt to hear what you have to say on the subject. Give me some examples where affinities have been created. |
| Ian: | Sure. Well firstly, a typical example is using bind-on-open. This is controlled by the MQOO_BIND_ options and the DEFBIND queue attribute. The problem arises when an application opens a cluster queue with bind-on-open and starts putting messages and the queue manager that hosts the chosen queue goes down. This results in messages building up on the cluster transmit queue. A second good example is where applications fully qualify the destination with the queue manager name, by specifying a non-blank ObjectQMgrName in the MQOD. Obviously if the specified queue manager goes down, all subsequent puts will build up on the cluster transmit queue. And thirdly, I’ve seen instances where applications are putting to, say Q1, and there is only one instance of Q1 in the cluster. Again, obviously if the single queue manager which hosts Q1 goes down, all subsequent puts will build up on the cluster transmit queue. |
| Dylbert: | Yeah I think we should be covered on all three points. |
| Ian: | Glad to hear it. I’d recommend building some checking into your admin and apps change control processes too, so that no affinities sneak in. |
| Dylbert: | But hang on, this doesn’t even sound that bad to me. What you’re saying is that even if they have created an affinity, all that happens is that MQ will queue up their messages until they bring the queue manager back online. The apps will actually continue as normal. |
| Ian: | Correct. But, as usual, it all depends on your SLAs. If your service is supposed to be providing a one second response time, this type of affinity issue could cause you some headaches. But as you point out, if you are servicing batch requests that have to be processed within six hours, you have a lot less of a headache. |
| Dylbert: | Sure. Ok then, so if someone had created affinities and this is going to cause them headaches, how should they stop their cluster queue manager? |
| Ian: | Basically, in that situation you need to ensure there are alternate destinations, keep the queue manager up, and persuade any applications in the cluster to stop using the queue manager and start using the alternate destinations. Once all the applications have lost their affinity with the queue manager, bring it down. |
| Dylbert: | Sounds reasonable, but what’s all this persuading? |
| Ian: | There’s an MQSC command, SUSPEND QMGR, that will inform other members of the cluster to stop using the queue manager, unless they have an affinity. |
| Dylbert: | “Unless they have an affinity”? But I thought you said?… |
| Ian: | Well, the purpose of this command is to try and stop new affinities being created. We wait until the original set of applications, that had an affinity with the queue manager, has stopped. Using SUSPEND QMGR we’ve ensured that no new apps can create an affinity with the queue manager, so once the old apps have lost their affinity with the queue manager we can stop it. |
| Dylbert: | Ah, I understand. Ok, next question: How do you know when these apps have lost their affinity with the queue manager? |
| Ian: | You can check the numbers of messages coming over channels into the queue manager or use monitoring products to report usage, but I also recommend knowing your apps and keeping a close eye on transmit queue depths. This way you have a better idea how long apps can hold an affinity on a queue and also spot problems, if, for instance, you brought down a queue manager whilst an app still had an affinity to it. |
| Dylbert: | Ok, I think I’ve got it. It sounds like I’m ok just stopping QM1 this time. Nice and easy. That I like! |
| Ian: | Yeah and remember I do recommend checking those change control processes so that you make sure that your cluster stays in such a good state. |
| Dylbert: | Thanks Ian |
| Ian: | No problem. See you soon. |
| Dylbert: | Sure thing. Bye. |
<Ian exits stage left and Dylbert exits down a ladder. A few seconds later, a herd of buffalo enter stage right>
—— End of Scene —————
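
Ian's description of how channel state steers the workload algorithm can be illustrated with a small sketch. This is a toy model, not MQ's actual algorithm: it only encodes the two points made in the dialogue (RETRYING is less preferred than RUNNING/INACTIVE, and a suspended queue manager is avoided when there is a choice). The function name, the `suspended` parameter and the fallback when every candidate is suspended are assumptions made for illustration.

```python
# Toy model only: the real cluster workload algorithm weighs more factors
# than channel state. The names here are hypothetical, not an MQ API.

# Lower number = more preferred. RUNNING and INACTIVE tie; RETRYING ranks below.
CHANNEL_STATE_PREFERENCE = {"RUNNING": 0, "INACTIVE": 0, "RETRYING": 1}


def candidate_destinations(channel_states, suspended=frozenset()):
    """Return the queue managers that survive the channel-state comparison.

    channel_states maps a queue manager name to the state of the cluster
    sender channel pointing at it. suspended holds the names of queue
    managers that have issued SUSPEND QMGR.
    """
    # Avoid suspended queue managers when an alternative exists.
    # Assumption: if every candidate is suspended, fall back to all of them.
    pool = {qm: st for qm, st in channel_states.items() if qm not in suspended}
    if not pool:
        pool = dict(channel_states)

    # Keep only the candidates whose channel is in the best available state.
    best = min(CHANNEL_STATE_PREFERENCE[st] for st in pool.values())
    return sorted(
        qm for qm, st in pool.items() if CHANNEL_STATE_PREFERENCE[st] == best
    )


states = {"QM1": "RETRYING", "QM2": "RUNNING", "QM3": "INACTIVE"}
print(candidate_destinations(states))
```

With QM1's channel in RETRYING, only QM2 and QM3 remain as candidates. If QM1's channel were still INACTIVE, it would stay in the list, which is the case Ian warns about.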

9 comments
March 16, 2007 at 2:01 pm
Chris Wilton
You missed something here? If you stop a Queue Manager, channels to that QM do not go into retry until either a message is sent or a heartbeat happens. If it is a message that is sent, then that message will be “stuck” until the QM is back. Suspending the Queue Manager before stopping, and resuming when back (and often resuming after the application is restarted) is the way to go.
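
The suspend-then-stop sequence suggested here, combined with the drain step from the dialogue, could be written down as an ordered checklist. The sketch below only builds the list of steps; it does not run anything. The cluster name `CLUS1` is hypothetical, and the choice of `endmqm -w` (a controlled shutdown that waits) is an assumption rather than something the post prescribes.

```python
def maintenance_plan(qmgr, cluster):
    """Return the ordered steps for taking a cluster queue manager offline.

    Each step is a (kind, text) pair:
      'mqsc'   - an MQSC command to run against qmgr, for example via runmqsc
      'shell'  - a control command run on the host
      'manual' - a check or action for the administrator
    """
    return [
        ("mqsc", f"SUSPEND QMGR CLUSTER({cluster})"),
        ("manual", f"wait until applications with an affinity stop using {qmgr}"),
        ("shell", f"endmqm -w {qmgr}"),
        ("manual", "carry out the maintenance"),
        ("shell", f"strmqm {qmgr}"),
        ("manual", "restart the applications that serve the cluster queues"),
        ("mqsc", f"RESUME QMGR CLUSTER({cluster})"),
    ]


for kind, text in maintenance_plan("QM1", "CLUS1"):
    print(f"{kind:6} {text}")
```

Resuming last, after the serving applications are back, follows the point that the service should not be advertised until it is actually running.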
March 16, 2007 at 3:08 pm
Mårten Gustafson
Good one Ian
March 21, 2007 at 8:52 am
Ian Vanstone
“If you stop a Queue Manager, channels to that QM do not go into retry until either a message is sent or a heartbeat happens.”
Correct.
“If it is a message that is sent, then that message will be ’stuck’ until the QM is back.”
Correct for bind-on-open, incorrect for bind-not-fixed.
When the message is put to the SYSTEM.CLUSTER.TRANSMIT.QUEUE, the channel tries to transmit it and then detects the failure (as the remote queue manager is not available). If the message was put bind-not-fixed it is then re-workload balanced and is not “stuck”.
If the message was put bind-on-open it cannot be re-workload balanced and is therefore “stuck” on the SYSTEM.CLUSTER.TRANSMIT.QUEUE until the channel runs again. This is the price you pay for using bind-on-open.
“resuming when back (and often resuming after the application is restarted)”
Great point. You may not want to advertise your service until the service is actually running.
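
The difference between the two bind options on a channel failure can be sketched as follows. This is a simplified model of the behaviour described above, not MQ code: messages are plain dictionaries, the function name is hypothetical, and picking the first alternative queue manager stands in for a real workload-balancing decision.

```python
def rebalance_on_channel_failure(transmit_queue, failed_qmgr, available_qmgrs):
    """Re-route what can be re-routed after the channel to failed_qmgr fails.

    transmit_queue is a list of dicts with the keys 'body', 'bind_on_open'
    and 'destination'. Returns (moved, stuck) as two lists of messages.
    """
    moved, stuck = [], []
    for msg in transmit_queue:
        if msg["destination"] != failed_qmgr:
            continue  # not affected by this channel failure
        if msg["bind_on_open"] or not available_qmgrs:
            # Bound at open time (or nowhere else to go): waits for the channel.
            stuck.append(msg)
        else:
            # Bind-not-fixed: a new destination may be chosen.
            msg["destination"] = available_qmgrs[0]  # toy choice
            moved.append(msg)
    return moved, stuck


queue = [
    {"body": "a", "bind_on_open": True, "destination": "QM1"},
    {"body": "b", "bind_on_open": False, "destination": "QM1"},
]
moved, stuck = rebalance_on_channel_failure(queue, "QM1", ["QM2"])
print([m["body"] for m in moved], [m["body"] for m in stuck])
```

Message `b` moves to QM2, while message `a` stays bound to QM1 until the channel runs again.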
March 21, 2007 at 2:29 pm
Chris Wilton
The first message does get stuck when using bind-not-fixed – at least every time I have tested it (and under v6). All subsequent messages are OK.
March 22, 2007 at 10:20 am
Ian Vanstone
Chris, is the first message “stuck” because the channel is in doubt? If so, you can guard against this by using the batch heartbeat channel attribute BATCHHB.
See here for the manual: http://publib.boulder.ibm.com/infocenter/wmqv6/v6r0/topic/com.ibm.mq.csqzae.doc/batchhb.htm
… and here for SupportPac MD0C, which contains more info on BATCHHB and other channel availability issues: http://www.ibm.com/support/docview.wss?uid=swg24006699
Let me know if the channel is not in doubt, or if you would like further clarification on why in-doubt channels affect re-workload balancing.
May 24, 2007 at 2:20 am
Peter Potkay
Ian,
When messages with an affinity ARE stuck on the S.C.T.Q. waiting to get to QM1, how often does the CLUSSNDR MCA (or is it the Cluster Workload Algorithm?) pull the messages off of the S.C.T.Q. to see if it can put the message(s) somewhere else? Since they can’t go anywhere else, do they get put back to the S.C.T.Q., or rolled back?
Our XMITQ monitors look at q depth AND dequeue rate. Even though the S.C.T.Q. was backed up with over 500 messages, the alert did not fire because the 500 affinity messages kept getting pulled off every x seconds for another try. What is x?
-Peter
IBM Global Services
May 24, 2007 at 7:37 pm
Ian Vanstone
If the messages were bound on open they will not be workload balanced, so assume they are bind not fixed. If they are bind not fixed and have an affinity I assume they were put to a destination specifying both queue and qmgr in the MQOD.
When the bind not fixed messages on the SCTQ are re-workload balanced they are simply got off the queue and re-put. This put does a normal workload balance, which in this case, where there is only one valid destination, will mean the messages are put back on the SCTQ. There is no rollback.
The re-workload balance function is called when the channel fails (note, this means every time a retry fails) or when the destination cluster queue manager is removed from the local cache.
Hope this helps, let me know if you have more questions.
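
Why a monitor that watches both depth and dequeue rate can miss this backlog can be shown with a toy queue that counts gets and puts. This is a sketch under the assumptions stated above (bind-not-fixed messages with only one valid destination); the class and function names are hypothetical and the counters only imitate the kind of statistics a monitor might read.

```python
from collections import deque


class ToyTransmitQueue:
    """A queue that counts gets and puts, like the figures a monitor reads."""

    def __init__(self, messages):
        self._messages = deque(messages)
        self.dequeue_count = 0
        self.enqueue_count = 0

    def depth(self):
        return len(self._messages)

    def get(self):
        self.dequeue_count += 1
        return self._messages.popleft()

    def put(self, message):
        self.enqueue_count += 1
        self._messages.append(message)


def rebalance_with_single_destination(queue):
    """Get each message and re-put it; the only valid destination is unchanged,
    so every message lands back on the same transmit queue."""
    for _ in range(queue.depth()):
        queue.put(queue.get())


sctq = ToyTransmitQueue(range(500))
for _ in range(3):  # three failed channel retries
    rebalance_with_single_destination(sctq)
print(sctq.depth(), sctq.dequeue_count)
```

The depth never changes, but the dequeue count grows by the full backlog on every failed retry, so a rule that treats a non-zero dequeue rate as healthy would stay quiet.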
July 16, 2007 at 10:33 am
Morten Kjærulff
Hi,
As I understand it, the re-workload balance function does a destructive MQGET + MQPUT of the messages on the SCTQ. On the “new” MQPUT, the messages might go to another instance of the queue. How does it avoid doing the MQGET+MQPUT for messages that were put with bind-on-open?
Morten