A few tips for successful new TBSM 4.2.1 failover + existing OMNIbus 7.3 failover

Wednesday, August 31st, 2011

I’d like to present a quick checklist for a successful TBSM 4.2.1 failover implementation in an existing TIP/WebGUI and OMNIbus environment. Some of you may have gone through this and can confirm it’s no piece of cake unless you know a few secrets. True, the details always change the final picture, so I’m presenting just a checklist to follow, as guidance especially for newcomers to the topic.

1. First, update your software to the most recent releases, including fix packs and interim fixes. Even better, if a new release is coming soon, wait for it. There’s always something in the fixed code that you may want to have. Or at least read the APAR list for the upcoming or latest fix pack to learn whether it brings anything useful for failover.
Second, collect all the manuals and tips and tricks borrowed from good mates who have done it and got stuck a couple of times before. Every single script that automates the pain of configuring many files matters. If you’re lucky and have good mates, you may not even need this checklist 😉
2. I assume the worst-case scenario: OMNIbus 7.3.0 exists before the TBSM installation, so you’ll have to go through schema updates and install the dashboard server over WebGUI. On the other hand, this situation should give you a stable ObjectServer failover set that is already up and running, for free. I’m assuming that you have two ObjectServers in the multitier architecture; to be specific, they form a virtual aggregation pair. I’m also assuming that the gateway is configured properly, with correct sync types and intervals, and just does its job.
3. Study the alerts.status table schema on both ObjectServers. The two schemas must be the same. When switching from the primary ObjectServer to the backup one, the TBSM data server does not discover the schema again automatically. The TBSM Event Broker reads alerts.status columns ordered by ‘OrdinalPosition’, which means that the ordinal position of each column must be the same on each ObjectServer. So you need to run the BSM schema updates on existing ObjectServers carefully, with respect to all the previous schema updates. And do not change the previous order: the original, predefined columns in alerts.status must remain in their places.
Also make sure that the alerts.service_deps table and the clear_service_deps automation were imported correctly.
4. After all that, update the gateway mapping and table definition files. Update the mapping for the alerts.status RAD_, TEC_, ITM_ and BSM_ fields, and the mapping for the alerts.service_deps fields.
5. TBSM does not support failback to the original primary data server. The last primary stays primary until it goes down itself. Also, after a failover, TBSM does not automatically fail back from the backup ObjectServer once that one has become primary to TBSM. This second point is important for your FO_GATE configuration, as events synced from the primary ObjectServer to the backup one must stay updated in the backup’s alerts.status and alerts.service_deps after the failback too.
6. For TBSM data and dashboard server configuration, better use the fo_config script. It’s smart and easy. Watch out for the DASH_ settings; they are applied on the dashboard server. The script also backs up the previous versions of the files it updates.
7. If you apply FO to a running production server, plan a maintenance window, as the FO implementation requires restarting the data and dashboard servers, and we assume your dashboard servers will run on existing production TIPs with WebGUI, with a band of NOC guys online, just watching the AEL, right?
8. The key store file created during the “primary” TBSM server installation should be reused during the “backup” TBSM installation. Keep an eye on it, secure it, and supply it when the secondary TBSM installer asks for it.
9. The Service Details portlet on the Service Administration and Service Availability pages is part of Webtop. During failover it gets updated with the clickedOn event from the Service Tree portlet only if the data sources in WebGUI were set up correctly beforehand.
10. Tipadmin may not be the best choice of user for testing the failover. Go for nco users and the VMM ObjectServer plugin instead. Install the plugin on all machines, that is, on both data and dashboard servers. Unless you have to integrate with LDAP.
11. If you really want to, you may experiment with the newest settings introduced with TBSM 4.2.1 FP2 and IF3 – consumerQueue and eventsInThread. No risk, no fun.
12. If the ObjectServer contains a lot of previously raised events, prepare for hard times for EventBroker and ConsistencyChecker. You may need to analyze the Java heap size and increase its values on both data servers.
13. Switch on finer or finest tracing and logging levels. Don’t get scared though.
14. Start the primary data server first, then wait and watch the primary’s trace.log until it contains exceptions about detecting a non-operational backup RAD facade. That’s your signal to start the backup data server. Even before starting any TBSM data server, make sure that the ObjectServers and the gateway are up. After the data servers, start the dashboard servers.
15. When testing the failover, give TBSM, TIP and the ObjectServers some time to synchronize and resynchronize before sending the next test events. Of course in real life, during regular operations, there may be no time for a switch to the backup server or a failback, and TBSM may fail to match an event to service instances. Keep this in mind.
16. Observe, take notes, and report the results of the failover scenarios. Make the needed corrections, clean out old events to get a clear picture, and restart again and again until it works perfectly. Then let it loose on real events.
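
To make the schema check from step 3 less of an eyeball exercise, here is a minimal Python sketch that compares column positions between two ObjectServers. It is only an illustration under assumptions: the function name compare_ordinals and the sample data are hypothetical, and the sketch expects that you have already exported (column name, OrdinalPosition) pairs for alerts.status from each ObjectServer by whatever means you prefer. The export step itself is not shown.

```python
from typing import Dict, List, Tuple

# One entry per alerts.status column: (column name, ordinal position).
Column = Tuple[str, int]


def compare_ordinals(primary: List[Column], backup: List[Column]) -> List[str]:
    """Return a readable list of differences between two column lists.

    An empty list means both schemas have the same columns at the
    same ordinal positions.
    """
    p: Dict[str, int] = dict(primary)
    b: Dict[str, int] = dict(backup)
    problems: List[str] = []
    # Walk over the union of column names so that columns missing
    # on either side are reported as well.
    for name in sorted(set(p) | set(b)):
        if name not in b:
            problems.append(f"{name}: missing on backup")
        elif name not in p:
            problems.append(f"{name}: missing on primary")
        elif p[name] != b[name]:
            problems.append(
                f"{name}: position {p[name]} on primary, {b[name]} on backup"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical sample data, not a real export: a BSM column was
    # added at a different position on the backup ObjectServer.
    primary = [("Identifier", 0), ("Serial", 1), ("Node", 2), ("BSM_Identity", 3)]
    backup = [("Identifier", 0), ("Serial", 1), ("BSM_Identity", 2), ("Node", 3)]
    for line in compare_ordinals(primary, backup):
        print(line)
```

If the function returns anything at all for your two exports, fix the schema on the offending ObjectServer before going any further with the TBSM failover setup.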

That’s it. A simple, nice, straightforward procedure, isn’t it?