Posts Tagged ‘TBSM’

Triggering TBSM Rules

Friday, May 13th, 2016

Introduction

Tivoli Business Service Manager can calculate amazing things for you, if only you need them. This is thanks to the powerful rules engine at the core of TBSM and the Netcool/Impact policy engine running under the hood of every TBSM edition. You can later present your calculation results on a dashboard or in reports, whether you need a real-time scorecard or historical KPI reports.

In this article, I'll show how you can make TBSM process its inputs by triggering various template rules in various ways. It is something that isn't really well documented, or at least not documented well in a single place.

Status, numerical and text rules triggers

In this chapter I'll show three kinds of rules (status, numerical and text) and how TBSM triggers them, that is, how it processes the input data, runs the rules and returns their outputs.

In general, these three techniques always kick off TBSM rules based on the same two conditions: time and a new value. Here they are:

  1. OMNIbus Event Reader service (for incoming status, numerical or text rules with the OMNIbus ObjectServer as the data feed)
  2. TBSM data fetchers (for numerical or text rules with a fetcher-based data feed)
  3. TBSM/Impact services of type Policy activator (using an Impact policy with PassToTBSM function calls to send data to numerical or text rules)

Figure 1. Three techniques of triggering status, numerical and text rules in TBSM

Make note. There are also ITM policy fetchers, as well as largely undocumented any-policy fetchers, configurable in TBSM. I won't comment on them in this material; however, their basis is the same as any fetcher's: time.

Let's take a look at the first and most popular type of rule trigger, the OMNIbus Event Reader service in Impact, widely used in TBSM to process events stored in the ObjectServer against the service trees.

OMNIbus Event Reader

The OMNIbus Event Reader is an Impact service that runs regularly (by default every 3 seconds) and connects to Netcool/OMNIbus to get events stored in the ObjectServer memory that might affect TBSM service tree elements. It selects the events based on the following default filter:

(Class <> 12000) AND (Type <> 2) AND ((Severity <> RAD_RawInputLastValue) or (RAD_FunctionType = '')) 
AND (RAD_SeenByTBSM = 0) AND (BSM_Identity <> '')

(Severity <> RAD_RawInputLastValue) is the condition ensuring that each event is tested for carrying a new value in the Severity field compared to the previous event.

The Event Reader itself can be found in the Impact UI server on the Services page in TIP, among the other services included in the Impact project called TBSM_BASE:

Figure 2. Configuration of TBSMOMNIbusEventReader service

Make note. TBSM allows you to configure other event readers, but you can use only one at a time.

All incoming status rules use this Event Reader by default. There is an embedded mapping between the name of the Event Reader and the Data source field in status/text/numerical rules, hence just the "ObjectServer" caption occurs in the new rule form:

Figure 3. Screenshot of New Incoming status rule form

Policy activators

Impact services of type Policy activator simply run a given policy at a defined time interval.

Figure 4. Screenshot of a policy activator service configuration for TBSMTreeRuleHeartbeat

The policy needs to be created first. In order to trigger TBSM rules, it must call the PassToTBSM() function and pass an event object as its argument. Let's say this is my TBSMTreeRulesHeartbeat policy:

Seconds = GetDate();
Time = LocalTime(Seconds, "HH:mm:ss");
randNum = Random(100000);

ev = NewEvent("TBSMTreeRuleHeartbeatService");
ev.timestamp = String(Time);
ev.bsm_identity = "AnyChild";
ev.randNum = randNum;
PassToTBSM(ev);
log(ev);


In my example, a new value is generated every time the policy is activated, using the GetDate() and Random() functions. Pay attention to the field called ev.bsm_identity; I'll be referring to this field later on. For simplicity, this field always has the value "AnyChild".

Make note. Unlike the TBSM OMNIbus event reader, the policy-activating services, and the policies themselves, don't have to be included in the TBSM_BASE Impact project.

Netcool/Impact policies give you the freedom of reaching any data source, via SQL, SOAP, XML, REST, the command line or anything you like, and processing any data you find useful to process in TBSM. The only requirement is passing that data to TBSM via the PassToTBSM function.
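For illustration, here is a minimal, hypothetical sketch of that pattern: an Impact policy reads rows from an SQL data source with DirectSQL and forwards each row to TBSM with PassToTBSM. The data source name MyDB, the table kpi_table and its column names are my assumptions, not product defaults; the data feed name reuses the TBSMTreeRuleHeartbeatService example from above:

// Hypothetical sketch: forward rows from an SQL data source to TBSM.
// "MyDB", "kpi_table" and the column names are assumptions; adjust to your environment.
DirectSQL("MyDB", "select service_id, kpi_value from kpi_table", false);

i = 0;
while(i < Length(DataItems)) {
  ev = NewEvent("TBSMTreeRuleHeartbeatService");
  ev.bsm_identity = DataItems[i].service_id;
  ev.kpiValue = DataItems[i].kpi_value;
  PassToTBSM(ev);
  i = i + 1;
}

Remember that any new field sent this way (like kpiValue here) has to be mapped to the right data type in the Customize Fields form of the consuming rule, as shown later in this article.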


TBSM data fetchers

TBSM data fetchers combine an Impact policy with a DirectSQL function call and an Impact policy activator service. Additionally, data fetchers have a mini scheduling concept: you can set their run interval, or pin the run to a specific hour and minute and run them once a day (e.g. 12:00 AM daily). It also allows the next runtime to be postponed or brought forward in case of longer-running SQL queries.

Make note. Data fetchers can be SQL fetchers, ITM policy fetchers or any-policy fetchers. Unfortunately, the TBSM GUI was never fully adjusted to reconfigure ITM policy fetchers and was never enabled to configure any-policy fetchers; for those two you have to perform a few manual, command-line level steps instead. Fortunately, the PassToTBSM function available in Impact 6.x policies can be used instead, so the any-policy fetchers aren't that useful anymore.

Every fetcher by default runs every 5 minutes.

Figure 5. Screenshot of data fetcher Heartbeat fetcher

In the example presented on the screenshot above, the data fetcher connects to a DB2 database and runs a DB2 SQL query. A new value is ensured every time by calling the native DB2 random function, and the whole query is the following:

select 100000*rand() as "randNum", 'AnyChild' as "bsm_identity" from sysibm.sysdummy1

Pay attention to the bsm_identity field. It always returns the same value, "AnyChild", just like the policy explained before.

Triggering status, numerical, text and formula rules

In the previous chapter I presented various rule-triggering methods. It's now time to show how those triggers work in practice. I'll create a template with 5 numerical and text rules (I don't want to change any instance status, so I won't create any incoming status rule this time) plus 3 supportive formulas, and I'll present the output values of all those rules on a scorecard in Tivoli Integrated Portal. Below you can see my intended template with rules:

Figure 6. Screenshot of t_triggerrulestest template with rules

OMNIbus Event Reader-based rules

Let's start with the 2 rules utilizing the TBSM OMNIbus Event Reader service. Like I said, I won't use a status rule; instead I'll use one text and one numerical rule, in order to return the last event's severity and the last event's summary. Before I do that, let me configure my SimNet probe, which will be sending random events to my ObjectServer. My future service instance implementing the t_triggerrulestest template will be called TestInstance (or at least it will have such a value in its event identifiers). I want one event of each type, each with a different probability of being sent:

#######################################################################
#
# Format:
#
#       Node Type Probability
#
# where Node        => Node name for alarm
#       Type        =>  0 - Link Up/Down
#                      1 - Machine Up/Down
#                      2 - Disk space
#                      3 - Port Error
#       Probability => Percentage (0 - 100%)
#
#######################################################################


TestInstance 0 10
TestInstance 1 15
TestInstance 2 20
TestInstance 3 25


Let’s see if it works:

Figure 7. Screenshot of Event Viewer presenting test events sent by simnet probe.

Now to my two rules; I marked the important settings in green. My Data feed is ObjectServer, the source is the SimNet probe, the field containing service identifiers is Node, and the output value taken back is Severity:

Figure 8. Screenshot of the rule numr_event_lastseverity form

The same applies here, except this time the rule is a text rule and I have a fancy output expression:

'AlertGroup: '+AlertGroup+', AlertKey: '+AlertKey+', Summary: '+Summary

The rule itself:

Figure 9. Screenshot of the rule txtr_event_lastsummary

Fetcher-based rule

That was easy. Now for something a little bit more complicated: the data fetcher. I already have my data fetcher created and shown above in this material. Let's check if it works fine, i.e. if it fetches data every 5 minutes; the log shows that it does:

1463089289272[HeartbeatFetcher]Fetching from TBSMComponentRegistry has started on Thu May 12 23:41:29 CEST 2016
1463089289287[HeartbeatFetcher]Fetched successfully on Thu May 12 23:41:29 CEST 2016 with 1 row(s)
1463089289287[HeartbeatFetcher]Fetching duration: 00:00:00s
1463089289412[HeartbeatFetcher]1 row(s) processed successfully on Thu May 12 23:41:29 CEST 2016. Duration: 00:00:00s. The entire process took 00:00:00s
1463089589412[HeartbeatFetcher]Fetching from TBSMComponentRegistry has started on Thu May 12 23:46:29 CEST 2016
1463089589427[HeartbeatFetcher]Fetched successfully on Thu May 12 23:46:29 CEST 2016 with 1 row(s)
1463089589427[HeartbeatFetcher]Fetching duration: 00:00:00s
1463089589558[HeartbeatFetcher]1 row(s) processed successfully on Thu May 12 23:46:29 CEST 2016. Duration: 00:00:00s. The entire process took 00:00:00s


And the data preview looks good too:

Figure 10. The Heartbeat fetcher output data preview

This time there will be just one rule, a numerical rule returning the randNum value. I marked the important settings in green again: I select HeartbeatFetcher as the Data Feed, bsm_identity as the service event identifier and randNum as the output value:

Figure 11. Screenshot of numr_fetcher_randNum rule

Policy activated rules

Last but not least, I will create two rules getting data from my policy activated by my custom Impact service. I showed the policy and the service in the previous chapter; let's just make sure they both work. This is how the service works: every 5 minutes my policy gets activated, and every time it returns another value in the randNum field:

12 maj 2016 23:41:29,652: [TBSMTreeRulesHeartbeat][pool-7-thread-87]Parser log: (PollerName=TBSMTreeRuleHeartbeatService, randNum=71648, timestamp=23:41:29, bsm_identity=AnyChild)
12 maj 2016 23:46:29,673: [TBSMTreeRulesHeartbeat][pool-7-thread-87]Parser log: (PollerName=TBSMTreeRuleHeartbeatService, randNum=8997, timestamp=23:46:29, bsm_identity=AnyChild)
12 maj 2016 23:51:29,674: [TBSMTreeRulesHeartbeat][pool-7-thread-91]Parser log: (PollerName=TBSMTreeRuleHeartbeatService, randNum=73560, timestamp=23:51:29, bsm_identity=AnyChild)
12 maj 2016 23:56:29,700: [TBSMTreeRulesHeartbeat][pool-7-thread-91]Parser log: (PollerName=TBSMTreeRuleHeartbeatService, randNum=60770, timestamp=23:56:29, bsm_identity=AnyChild)
13 maj 2016 00:01:29,724: [TBSMTreeRulesHeartbeat][pool-7-thread-92]Parser log: (PollerName=TBSMTreeRuleHeartbeatService, randNum=55928, timestamp=00:01:29, bsm_identity=AnyChild)


Let's then create the rules. I will have two rules again, one numerical and one text. The numerical rule will have TBSMTreeRuleHeartbeatService as the Data Feed, the bsm_identity field selected as the service event identifier field, and the randNum field as my output:

Figure 12. Screenshot of numr_heartbeat_randNum rule

Make note. Every time you add another field to the policy activated by your service, make sure that new field is mapped to the right data type in the Customize Fields form. You will need to add that field first:

Figure 13. Screenshot of CustomizedFields form

And the second rule looks as follows; this time it is a text rule and I return the timestamp value:

Figure 14. Screenshot of txtr_heartbeat_lasttime rule

Formula rules

The last rules I'll create will be three formula (policy-based) text rules. Each of them will watch one of the rules created previously and "spy" on its activity. Let's see the first example:

Figure 15. Screenshot of nfr_triggered_by_events rule

This rule will use a policy and will be a text rule. It is important to tick those fields before continuing, because the Text Rule field later greys out and becomes inactive. After ticking both fields, I click the Edit policy button. All three rules will look the same at this level, only their names will differ, hence I won't include all 3 screenshots. I'll create a separate IPL policy for each of them. Here's the mapping:

Rule name                   Policy name
nfr_triggered_by_events     p_triggered_by_events
nfr_triggered_by_fetcher    p_triggered_by_fetcher
nfr_triggered_by_service    p_triggered_by_service

Each of those three policies will look similar; each will just watch different rules created so far. The p_triggered_by_events policy will do this:

// Trigger numr_event_lastseverity
// Trigger txtr_event_lastsummary

Seconds = GetDate();
Time = LocalTime(Seconds, "HH:mm:ss");

Status = Time;
log("TestInstance triggered by SimNet events at "+Time);
log("Output value of numr_event_lastseverity: "+InstanceNode.numr_event_lastseverity.Value);
log("Output value of txtr_event_lastsummary: "+InstanceNode.txtr_event_lastsummary.Value);

Policy p_triggered_by_fetcher will do this:

// Trigger numr_fetcher_randnum
 
 Seconds = GetDate();
 Time = LocalTime(Seconds, "HH:mm:ss");
 
 Status = Time;
 log("TestInstance triggered by HeartbeatFetcher at "+Time);
 log("Output value of numr_fetcher_randnum: "+InstanceNode.numr_fetcher_randnum.Value);

And policy p_triggered_by_service will do this:

// Trigger numr_heartbeat_randnum
// Trigger txtr_heartbeat_lasttime
 
 Seconds = GetDate();
 Time = LocalTime(Seconds, "HH:mm:ss");
  
 Status = Time;
 log("TestInstance triggered by TBSMHeartbeatService at "+Time);
 log("Output value of numr_heartbeat_randnum: "+InstanceNode.numr_heartbeat_randnum.Value);
 log("Output value of txtr_heartbeat_lasttime: "+InstanceNode.txtr_heartbeat_lasttime.Value);

You can notice that each policy starts with a comment section. This is important: this is how the formula rules get triggered. It is enough to mention another rule by its name in a comment to trigger your formula every time that referenced rule returns a new output value. This is why we have the randNum-related rules in every formula; those rules are designed to return a new value every time they run. Only the first formula's triggers are different, but I assume they will fire every time the combination of the Summary, AlertGroup and AlertKey field values in the source event changes.

The trigger numerical and text rules are also mentioned later in these policies, when they are called to obtain their output values, e.g. to put those values into the log file. But that is not necessary for triggering my formulas; I log those trigger rule outputs for troubleshooting purposes only.

The purpose of these 3 policies and 3 formulas is to report the time when the trigger numerical or text rules last worked.

Below you can see an example of one of the policies in its actual form.

Figure 16. Screenshot of one of the policies text

Testing the triggers

Now it's time to test the trigger rules and the triggers, and to troubleshoot if needed.

Triggering rules in normal operations

In order to do that, we will need a service instance implementing our newly created template. I call it TestInstance and this is its base configuration:

Figure 17. Screenshot of configuration of service instance TestInstance – templates

It is important to make sure that the right event identifiers are selected in the Identification Fields tab. I need to remember what bsm_identity values I set in all the rules: AnyChild (the policy and the fetcher) and TestInstance (the SimNet probe).

Figure 18. Screenshot of configuration of service instance TestInstance – identifiers

Make note. In real life your instance will have its own individual identifiers, like a TADDM GUID or a ServiceNow sys_id. It is important to find a match between that value and the affecting events or matching KPIs, and if necessary to define new identifiers that ensure such a match.

Let's see if it works in general. I created a scorecard and a page to present all values of my new instance. I'll also put fragments of my formula-related policy logs on top, to see if the data returned in the policies and the timestamps match:

Figure 19. Screenshot of the scorecard with policy logs on top

Let's take a closer look at the first section. The same event arrived just once, but since the formula is triggered by two rules, it was triggered twice in a row. The last event arrived at 20:27:00, its severity was 4 (major) and the summary was about a Link Down. Both rules, numr_event_lastseverity and txtr_event_lastsummary, triggered my formula correctly.

The next section is about the fetcher: the latest random number is 16589,861 and the rule numr_fetcher_randnum triggered my formula correctly.

The last section covers the policy-activated rules and formula. This time I have two rules again, and they both triggered the formula correctly; the last run was at 20:26:30. I have two different randNum values in the two runs. This is caused by referring to the numerical rules twice in my formula policy.

Triggering rules after TBSM server restart

I'll now show a problem that TBSM has with rules that lack any trigger. Like I said in the previous chapters, TBSM needs rules to be triggered every now and then, and the value to change between triggers, in order to return the value again.

This causes issues in TBSM server restart situations. If a value hasn't changed before a server restart and is still the same after it, TBSM may be unable to display or return it correctly if the rule that returns it is not triggered. A server restart means clearing the TBSM memory, so no rule output values are preserved across the restart.

Here’s an example. I’ll create one new formula rule with this policy in my test template:

Status = ServiceInstance.NUMCHILDREN;
log("Service instance "+ServiceInstance.SERVICEINSTANCENAME+" ("+ServiceInstance.DISPLAYNAME+") has children: "+Status);

Here’s the rule itself:

Figure 20. Screenshot of nfr_numchildren rule configuration

As the next step, I add one more column to my scorecard to show the output of the newly created rule. I also created 3 service instances and made them children of the TestInstance instance.

Figure 21. Screenshot of the scorecard showing a children count of 3

My formula policy log also returns the number 3:

13 maj 2016 12:17:56,664: [p_numchildren][pool-7-thread-34 [TransBlockRunner-2]]Parser log: Service instance TestInstance (TestInstance) has children: 3

Now, if I simply restart the TBSM server, the value shown will be 0 and I will see no new entry in the log:

Figure 22. Screenshot of the scorecard after server restart showing 0 children

I can change this situation by taking one of three actions:

  1. Adding new or removing old child instances from TestInstance
  2. Modifying the formula policy
  3. Introducing a trigger to the formula policy

However, the first two options don't protect me from another server restart.

Let's say I add another child instance. This is how the scorecard will look:

Before the restart: [screenshot TriggeringTBSMRules_23]
After the restart: [screenshot TriggeringTBSMRules_24]

Alternatively, I may want to modify my rule. After saving my changes, the value will display correctly. However, another server restart will reset it back to 0 again anyway.

So let’s say I change my policy to this:

Status = ServiceInstance.NUMCHILDREN;
log("Service instance "+ServiceInstance.SERVICEINSTANCENAME+" ("+ServiceInstance.DISPLAYNAME+") has children: "+Status);
log("Service instance ID: "+ServiceInstance.SERVICEINSTANCEID);

And my policy log now contains two entries per run:

13 maj 2016 12:56:07,023: [p_numchildren][pool-7-thread-4 [TransBlockRunner-1]]Parser log: Service instance TestInstance (TestInstance) has children: 4
13 maj 2016 12:56:07,023: [p_numchildren][pool-7-thread-4 [TransBlockRunner-1]]Parser log: Service instance ID: 163

But the situation before and after the restart is the same:

Before the restart: [screenshot TriggeringTBSMRules_23]
After the restart: [screenshot TriggeringTBSMRules_24]

It's not a frequent situation, though. If your rules are normally event-triggered or data fetcher-triggered, you can expect frequent updates to your output values even after your TBSM server restarts. But in case you want to present an output value from a rule that normally is not triggered, make sure you include a reference to a trigger in your rule. Let's use two of the triggers we configured previously in my new formula policy:

// Trigger by numr_fetcher_randnum
// Trigger by numr_heartbeat_randnum

Status = ServiceInstance.NUMCHILDREN;
log("Service instance "+ServiceInstance.SERVICEINSTANCENAME+" ("+ServiceInstance.DISPLAYNAME+") has children: "+Status);
log("Service instance ID: "+ServiceInstance.SERVICEINSTANCEID);

By following the policy log you can already notice that there are now many entries per run cycle: precisely as many as the number of times the formula was triggered by one of the trigger rules.

The first pair of entries was added after saving the rule. The next 2 pairs were added as a result of the triggers working fine:

13 maj 2016 13:22:36,837: [p_numchildren][pool-7-thread-3 [TransBlockRunner-1]]Parser log: Service instance TestInstance (TestInstance) has children: 4
13 maj 2016 13:22:36,837: [p_numchildren][pool-7-thread-3 [TransBlockRunner-1]]Parser log: Service instance ID: 163
13 maj 2016 13:24:12,833: [p_numchildren][pool-7-thread-3 [TransBlockRunner-1]]Parser log: Service instance TestInstance (TestInstance) has children: 4
13 maj 2016 13:24:12,833: [p_numchildren][pool-7-thread-3 [TransBlockRunner-1]]Parser log: Service instance ID: 163
13 maj 2016 13:24:18,465: [p_numchildren][pool-7-thread-3 [TransBlockRunner-1]]Parser log: Service instance TestInstance (TestInstance) has children: 4
13 maj 2016 13:24:18,465: [p_numchildren][pool-7-thread-3 [TransBlockRunner-1]]Parser log: Service instance ID: 163

Let's make the final test: a TBSM server restart:

Before the restart: [screenshot TriggeringTBSMRules_23]
After the restart: [screenshot TriggeringTBSMRules_23]

It seems to be working fine now!

This exercise ends my material for tonight. I'll continue in another article on triggering status propagation rules and numerical aggregation rules. See you soon!

mp


Unique Grand Children Count in TBSM

Tuesday, May 10th, 2016

Introduction

Tivoli Business Service Manager can calculate amazing things for you, if only you need them. This is thanks to the powerful rules engine at the core of TBSM and the Netcool/Impact policy engine running under the hood of every TBSM edition. You can later present your calculation results on a dashboard or in reports, whether you need a real-time scorecard or historical KPI reports.
In this article, I'll show how you can use the TBSM rules engine to calculate a unique grandchildren count for a grandparent-level service instance. It is something that isn't really documented at all, and the case isn't very common, but should you need it, you can find it here in this material.
In this material I will use the following hierarchy of three templates:

  • T_NetworkSite – acting as grandparent template level
  • T_Interface – acting as parent template level
  • T_Router – acting as child template level

An interface as a parent of a router? – you may ask. It is not really what's being promoted in various documents, and definitely not what is documented here:

https://www.ibm.com/support/knowledgecenter/SSSPFK_6.1.1.3/com.ibm.tivoli.itbsm.doc/bsma/10/bsma_cust_c_custom_service_model_example.html?lang=en

Well, this depends very much on what you want to present in TBSM dashboards and how, so it depends on what your business service is about. The example in the article mentioned above concentrates more on VPN services:


Figure 1. Source: https://www.ibm.com/support/knowledgecenter/api/content/nl/en-us/SSSPFK_6.1.1.3/com.ibm.tivoli.itbsm.doc/bsma/10/images/bsma_cust_sm_network_topology.jpg

In my example, I'm concentrating on Layer 2 connectivity. In other words: I cannot connect to my network site, or it is unavailable, if all router interfaces are down. All router interfaces can be down while the routers themselves are up; it doesn't matter, it means the same thing for the service: an outage. And if whole routers get switched off, the interfaces get switched off too, so my network site becomes unavailable as well.

Figure 2. Templates hierarchy used in this material

The desired effect is the following:

  • There is one grandparent, KrakowSite
  • There are 2 routers in total
  • There are 4 interfaces in total, 2 per router

Figure 3. Access to Krakow network site – business service sample diagram

In other words, KrakowSite should report 4 installed interfaces but only 2 router devices. The scorecard below is what we will be building during this exercise.

Figure 4. Target scorecard to build

Before I continue, I need to introduce the Heartbeat and PassToTBSM concepts.

PassToTBSM and Heartbeat

PassToTBSM is an Impact function that can be used to send any data from a Netcool/Impact policy straight to TBSM. It doesn't have to be the Impact instance running jointly with TBSM on the same server; it can be a standalone Impact server too (but I haven't tried that). It can also be either Impact 6.1 or Impact 7.1 (announced not to have PassToTBSM, but I hear it's still there; not tested by myself though).
A policy that sends data to TBSM with the PassToTBSM function can be as follows:

Seconds = GetDate();
Time = LocalTime(Seconds, "HH:mm:ss");
ev = NewEvent("TBSMTreeRuleHeartbeatService");
ev.timestamp = String(Time);
ev.bsm_identity = "AnyChild";
PassToTBSM(ev);

So we construct an IPL policy in which we take the current time (it is important to have at least one changing value; I'll explain why in another article on this blog) and specify the service instance identifier that the affected service instance is expected to have defined for its incoming status rules or numerical rules. Because I'm going to affect two routers, RouterA and RouterB, I specify something generic like "AnyChild". I could also send two events to TBSM, one with ev.bsm_identity = "RouterA" and the other with ev.bsm_identity = "RouterB", as sketched below. In large implementations it is easier to specify something generic like AnyChild and add such an identifier to every service instance automatically during an import process via the SCR API/XMLtoolkit.
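As a minimal, hypothetical sketch of that per-instance alternative (the router list is my assumption; the service name TBSMTreeRuleHeartbeatService is reused from the policy above):

// Hypothetical sketch: one heartbeat event per router instead of the generic AnyChild
Seconds = GetDate();
Time = LocalTime(Seconds, "HH:mm:ss");

routers = {"RouterA", "RouterB"};
i = 0;
while(i < Length(routers)) {
  ev = NewEvent("TBSMTreeRuleHeartbeatService");
  ev.timestamp = String(Time);
  ev.bsm_identity = routers[i];
  PassToTBSM(ev);
  i = i + 1;
}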
Back to the generic variant: let me call the policy TBSMTreeRulesHeartbeat.
Such a policy now needs to be called by an Impact service:

Figure 5. Impact service to run the heartbeat policy

Make note. Alternatively, a data fetcher could be used, which can also be scheduled to run every 30 seconds, or even once a day at 12:00 AM or another time. However, I wanted to show the PassToTBSM function in action, and in large solutions you may not want to involve an SQL SELECT statement against a database simply to run such a heartbeat function. You could also create a policy fetcher, but that requires more skills since there's no UI for it in TBSM.

Make note. Such a service doesn't really need to be added to any Impact project.
Now, in order to use such a service and policy in a numerical rule in TBSM, you do two things: you set that service as the data source and you set the field mapping. I have created my HeartbeatRule in TBSM with the following settings:

Figure 6. Numerical Rule with heartbeat service as data feed

Then, in the Customize Fields form, you should have:

Figure 7. Custom fields mapping

Save this rule to your LEAF template:

Figure 8. Heartbeat rule in the LEAF template definition

And the last thing: don't forget to make sure your service instances have the "AnyChild" instance identifier specified:

Figure 9. Adding new instance ID – AnyChild

What is it for, you may ask?
The answer is: we will be calculating the unique number of grandchildren in one of the TBSM functions. All functions in TBSM need a trigger, which is an input value that changes, in order to return a fresh value. If the input value doesn't change, you won't see a new value on the output. The output can even be the same value, but your rule won't work if you don't trigger it from outside somehow.

Example? Sure.
On the next level in the templates hierarchy there will be a NumberOfRouters rule defined (and the heartbeat rule too):

Figure 10. T_Interface template's rules list

Let’s see inside the NumberOfRouters rule:

Figure 11. NumberOfRouters rule definition

This rule will return the output value of the function NumberOfAllChildren, defined in the policy NumericalAttributeFunctions.ipl, every time the HeartbeatRule triggers it.
In other words, the number of routers below the interfaces won't change in the output of this function, even if it really changes (grows, shrinks), unless the rule is kicked again.
So you need that extra rule on the children level, like HeartbeatRule, running periodically every 30 seconds and returning a fresh timestamp every time, to ensure a different output value on every run.

Why so much hassle, you may say?

Why not use ServiceInstance.NUMCHILDREN inside a policy-based numerical formula?
Well, first of all, a numerical formula is also a rule that needs a trigger to run. Every rule in TBSM needs a trigger to run; I can dedicate a separate post to that topic.
Second of all, I do use ServiceInstance.NUMCHILDREN. Check out my policy function:

function NumberOfAllChildren(ChildrenStatusArray, AllChildrenArray, ServiceInstance, Status) {
  Status = ServiceInstance.NUMCHILDREN;
}

So this policy, or more precisely this function, will return the NUMCHILDREN value any time you trigger the rule.

The main reason for the hassle is that, unfortunately, you cannot use NUMCHILDREN directly on a scorecard; you can only return it through rules. And rules need a trigger. NUMCHILDREN isn't an additional attribute either, which could be shown directly on a JazzSM dashboard.
Is it clear? I know, it's a bit weird, but only at first sight.

You may also wonder: why am I using ServiceInstance.NUMCHILDREN? Is there any other attribute returning the same value? And why am I using TIP, not JazzSM, in my examples? The answers: there is no additional attribute that you could return straight in JazzSM without wrapping it in a rule (and in TIP you cannot return an additional attribute without packing it in a rule either) that would give you anything like the number of children. So you have two choices:

  1. Use ServiceInstance object’s field NUMCHILDREN – see above
  2. Use a policy that will iterate through the array of children objects of your service instance and return the array's length (see the sketch below).

As you can see, it's still a policy either way, so a numerical aggregation rule or a numerical formula rule must be used. There's really no other way: rules are your way, and you need to trigger them.
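For completeness, here is a minimal, hypothetical sketch of that second option, counting children by iterating over CHILDINSTANCEBEANS (the same array the grandchildren function below uses); treat it as an illustration, not a product-supplied function:

// Hypothetical sketch: count children by iterating the children array
function NumberOfChildrenByIteration(ChildrenStatusArray, AllChildrenArray, ServiceInstance, Status) {
  i = 0;
  while(ServiceInstance.CHILDINSTANCEBEANS[i] <> NULL) {
    i = i + 1;
  }
  // The loop counter now equals the number of direct children
  Status = i;
}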


Recalculate correct number of objects after server restart

There's an alternative to the Heartbeat rule. Starting with TBSM 6.1.1 FP2, you can run the following policy and associate it with the server start, run it manually from time to time, or schedule it with an Impact service. There are actually two policies: one recalculates all nodes and the other just the leaf nodes.


Leaf nodes only:

USE_SHARED_SCOPE;
Type = "StateModel";
Filter = "RECALCSTATENODESLEAF";
log("Recalc Leaf Node Only. Policy Start.");
GetByFilter(Type, Filter, false);
log("Recalc Leaf Node Only. Policy Finish.");

All nodes:

USE_SHARED_SCOPE;
Type = "StateModel";
Filter = "RECALCSTATENODESALL";
log("Recalc All Nodes. Policy Start.");
GetByFilter(Type, Filter, false);
log("Recalc All Nodes. Policy Finish.");

This alternative is documented here:
http://www.ibm.com/support/knowledgecenter/SSSPFK_6.1.1.3/com.ibm.tivoli.itbsm.doc/customization/bsmc_pol_num_agg_rule_restrt_recalc_c.html

The difference between my heartbeat solution and the policy documented above is that my heartbeat function is selective: I decide which elements of the service tree get recalculated (not just the leafs, but also not the entire service tree) and when (not only during a restart but every now and then). This matters because the number of children on an intermediate level may change independently of changes on the leaf level, and I still need to trigger that change. At the same time, recalculating the whole tree is an effort for TBSM, especially if my service tree contains 100k instances.

That's why I prefer to make it selective, and so I use the Heartbeat concept.


Unique grandchildren count rule

Now that we have the children count rule created and triggered, it's time to build the unique grandchildren count rule.
What's the difference?
It's simple: you don't want to just sum up your children's children counts, because every interface will report the 1 router below it, which gives you 4 routers while the true number is just 2.
So you need a smart Impact policy that will calculate that for you.
Since we're clear on which rules need to be created on the Router level and the Interface level, it's time to present the rules on the NetworkSite template level:

Figure 12. Rules defined inside T_NetworkSite template

The NumberOfInterfaces rule just calculates the number of interfaces below the network site; inside that rule the same function, NumberOfAllChildren, is called from NumericAttributeFunctions.ipl. The trigger should again be a heartbeat rule, since the number of interfaces inside the site may change independently. As you could see above, I defined a heartbeat rule inside the T_Interface template and called it HeartbeatRuleIfc.

The more interesting rule is UniqueGrandChildren, which runs another function from the NumericAttributeFunctions policy, called NumberOfUniqueGrandChildren:

function NumberOfUniqueGrandChildren(ChildrenStatusArray, AllChildrenArray, ServiceInstance, Status) {
  i = 0;
  uniquegrandchildrenarray = {};
  log("MP: " + ServiceInstance);
  while(i < length(ServiceInstance.CHILDINSTANCEBEANS)) {
    child = ServiceInstance.CHILDINSTANCEBEANS[i];
    log("Child " + child.DISPLAYNAME + " of grand parent " + ServiceInstance.DISPLAYNAME + " was found.");
    j = 0;
    while(j < length(child.CHILDINSTANCEBEANS)) {
      grandchild = child.CHILDINSTANCEBEANS[j];
      log("Child " + grandchild.DISPLAYNAME + " of child " + child.DISPLAYNAME + " was found.");
      // Testing if the currently analyzed grandchild has already occurred
      k = 0;
      occurence = 0;
      while(k < length(uniquegrandchildrenarray)) {
        if(uniquegrandchildrenarray[k].SERVICEINSTANCEID == grandchild.SERVICEINSTANCEID) {
          // if yes, mark occurence = 1 (true) so the grandchild is not added again
          occurence = 1;
          // k = length(uniquegrandchildrenarray); // uncomment this line to speed up in case of large child arrays
          log("Duplicate found: " + uniquegrandchildrenarray[k].SERVICEINSTANCEID + " and " + grandchild.SERVICEINSTANCEID + ". Skipping.");
        }
        k = k + 1;
      }
      if(occurence == 0) {
        uniquegrandchildrenarray = uniquegrandchildrenarray + grandchild;
        log("Unique grand child found: " + grandchild.DISPLAYNAME + ". Added to the list.");
      }
      j = j + 1;
    }
    i = i + 1;
  }
  Status = length(uniquegrandchildrenarray);
  log("Grand parent " + ServiceInstance.DISPLAYNAME + " has # unique grand children: " + Status);
}

So basically the function traverses the service tree two levels down, to the grandchildren level, and collects grandchildren in an array, tracking them by their SERVICEINSTANCEID. Every grandchild seen for the first time is added to the array; every reoccurring one is skipped. The size of the array is the returned value.

Is it simple? Not quite, but it's probably one of those functions you implement once and reuse all the time, so it's worth learning about. Let's see the rule at the end:

Figure 13. NumberOfUniqueGrandChildren rule

So this is your desired effect:

Figure 14. Unique GrandChildrenCount on the scorecard


I hope you like this type of small hint on how to achieve something useful in TBSM. If so, please comment and I'll try to post as many of this kind of posts as I can. Thanks!

TBSM 6.1.1 FP4 released

Friday, April 29th, 2016

Fix Pack 4 was released tonight. See the full list of APARs and improvements, and get the downloads here:

http://www-01.ibm.com/support/docview.wss?uid=swg24041505

Please note that, apart from installing the fix, there are a few manual actions to take to fully benefit from some of the new features, which are:

  • RADEVENTSTORE index creation – to prevent the TBSM Event Reader from hanging
  • Installation of the new right-click context menu item "Delete Service Instance"
  • SLAPrunePolicyActivator – an Impact policy activator service to prune RADEVENTSTORE and 6 other tables in the TBSM DB
  • TIP 2.2.0.17 version 3 installation – now certified for use with TBSM

mp

Total Event Count in TBSM

Tuesday, April 26th, 2016

Introduction

Tivoli Business Service Manager can calculate amazing things for you, if only you need them. This is thanks to the powerful rules engine at the core of TBSM and the Netcool/Impact policy engine running under the hood of every TBSM edition. You can later present your calculation results on a dashboard or in reports, whether you need a real-time scorecard or historical KPI reports.

In this article, I'll show how to calculate a total event count throughout a multi-level service tree. It is something that TBSM doesn't do right after a fresh install, because it doesn't provide the right rules out of the box. TBSM also doesn't ship any predefined service tree, so in order to see this working you'd need to do both: add the rules to your service templates and import or create by hand a service tree structure to test it with.

In this material, I'll create a simple, multi-level service tree consisting of 3 levels of instances and I'll use my own template, T_Regions. In order to repeat this exercise you can also simply reuse the template SCR_ServiceComponentRawStatusTemplate, which comes with every TBSM installation and is widely used in integrations with Tivoli Application Dependency Discovery Manager (TADDM). The key thing is that your template must:

  • Have at least 1 incoming status rule
  • Be in use across the whole service tree, so that all service instances on all levels in your service tree implement that template.

Figure 1. Template settings for this exercise

Figure 2. Incoming Status Rule body used in this exercise

Figure 3. Simple service tree used in this material

Make note. This document implements functionality that in a sense already exists: the total number of events on every service tree level is stored in the numRawEventsInt parameter. This parameter is visible as the last value in the RAD_prototype widget typically used on Custom Canvases on TBSM dashboards created in Tivoli Integrated Portal. But that parameter value isn't accessible to numerical rules or policies for further processing.

Figure 4. numRawEventsInt value used on RAD_prototype widget

The newest add-on to TBSM, the debug Spy tools, also offers a per-level parameter called Matching Events. While that value is correct too, it isn't accessible from numerical rules or policies either.

Make note. There is a BSM Accelerator template called BSMAccelerator_EventCount, which was designed to present the correct number of events for every service instance. However, it was tailored to the BSM Accelerator's needs and service tree structure and doesn't scale to arbitrarily high service trees. Some of the concepts introduced to support the BSM Accelerator package will nevertheless be covered in this document. If you want to read more, see this document:

http://www.ibm.com/support/knowledgecenter/SSSPFK_6.1.1.3/com.ibm.tivoli.itbsm.doc/bsma/10/bsm_tbsm_BSMAcceleratorEventCount_rules.html

Make note. TBSM 6.1.1 FP3 is a prerequisite for all rules described in this material to work correctly. However, it is highly recommended to install Fix Pack 4 or higher to get the latest improvements.

What is the multi-level event count?

TBSM runs an Impact service called TBSMOMNIbusEventReader, which comes with the product out of the box and is responsible for reading events from Netcool/OMNIbus on a regular basis (every 3000 milliseconds by default), finding events to be processed by TBSM using its special predefined filter.

Here’s the default filter:

(Class <> 12000) AND (Type <> 2) AND ((Severity <> RAD_RawInputLastValue) or (RAD_FunctionType = '')) AND (RAD_SeenByTBSM = 0) AND (BSM_Identity <> '')

All events which pass that filter get processed further by TBSM service template rules, specifically by the special kind called Incoming Status Rules. The most typical incoming status rule, ComponentRawEventStatusRule, predefined inside the SCR_ServiceComponentRawStatusTemplate template, has a precondition called a discriminator, which filters out all events previously let through by the event reader that don't have one of the following classes:

  • TPC Rules(89200),
  • IBM Tivoli Monitoring Agent(87723),
  • Predictive Events(89300),
  • IBM Tivoli Monitoring(87722),
  • Default Class(0),
  • TME10tecad(6601),
  • Tivoli Application Dependency Discovery Manager(87721),
  • Precision [Start](8000),
  • MTTrapd(300),
  • Precision [End](8049)

Make note. In my example, my Incoming Status rule will simply expect just Default Class (0) in all my test events.

This is not the end; there's one more filter. It is called the event identification field, and by default TBSM will look for its value in the event field called BSM_Identity. The value expected in that field comes from the service instance's event identifier, which by default is the same as the service instance name. So the event identifiers for my simple service tree will be the following:

Service instance name    Event identifier
Europe                   Europe
Poland                   Poland
Malopolska               Malopolska

I will not discuss in this material how to maintain event identifiers, how many event identifiers you can have, or how to set up event identifiers in the XMLtoolkit configuration files (if you're interested in those topics, please see my private blog entry: http://www.marcinpaluch.pl/wordpress/?p=231). I will also not discuss how the event severity may affect the service instance status; I go with the defaults in my example and will not focus on that area this time.

To sum it up: there are 3 filters your event has to pass before it affects your service instance:

  1. The TBSMOMNIbusEventReader’s filter
  2. The Incoming Status Rule discriminator / event class filter
  3. The event identifier

If your event makes it through all the filters, you can call it a service instance affecting event.

It doesn't mean your event has to change your service instance status; it only means that your event was processed by the Incoming Status Rule implemented in your service instance's template. If you use TBSM 6.1.1 FP4, you can use the Service Model Spy tool to see that your Incoming Status rule updated various attributes like Matching Events (a number), Max Event Status (the event's severity) and a timestamp of when the rule processed the event.

The Matching Events parameter is what I’ll be calling in this material the EventCount.

Now, why a multi-level event count?

Every service instance can have its own individual EventCount. Every level of the service tree can contain more than one service instance, and the best way to sum them up is to calculate their sum on the parent level. The parent service instance may itself implement a template with an Incoming Status rule and therefore have its own individual EventCount. And the parent service instance can be one of many parent service instances, so the best way of summing those up is to calculate a TotalEventCount on the grandparent service instance level. And so on. The multi-level event count is thus a feature that calculates the total number of events being processed by TBSM in the whole service tree.

Why would you need it? There are several use cases possible:

  • Your service tree consistency check and verification – in a development phase, to see if all levels of your service tree get processed correctly
  • Statistics – to see the current and true load on TBSM by source, class, alert type, any event field in order to perform some further analysis of event storms and their reasons
  • To monitor the operations – for example to compare total events count to total acknowledged events count to total count of events escalated by opening an incident etc.
  • To monitor service component quality – especially important when service components are managed or provided by a 3rd-party provider – you can assess how much trouble each of them gives your company or your operations team

Once the use case is agreed, you may want to use this material to start collecting your total event counts in order to present them on a dashboard or in a report. Let me now explain how to set it up.


Implementation

As the first step, let's make sure I'm collecting the event count for each of my service tree elements. Let me create my new rule: the OwnEvents count.

Make note. This step has a prerequisite: I need to have my Incoming Status rule already created.

This is perhaps not well documented, but every Incoming Status Rule can be used in a Numerical Formula rule to get the number of events processed. It is documented in this technote:

http://www-01.ibm.com/support/docview.wss?uid=swg21626616

So let me do exactly what the technote does. This is my numerical formula, a rule called OwnEvents, which will return only the non-clear events count via the Incoming Status Rule's parameter NumEventsSevGE2 (available by default since TBSM 6.1.1 FP1). Whenever my Incoming Status Rule processes another event with severity 1 or higher, the output of my numerical formula will refresh and increase by 1.
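The exact formula lives in the rule form shown in Figure 5 below. Purely as a hedged sketch, assuming a policy-based numerical formula, an incoming status rule named isr_regions_events (a made-up name) and that the rule's instrumentation parameters are addressable the same way rule values are elsewhere in this material, the logic would be along these lines:

// Hypothetical sketch only: expose the incoming status rule's NumEventsSevGE2 counter.
// The rule name isr_regions_events and the parameter access path are assumptions.
Status = 0;
if(ServiceInstance.STATEMODELNODE.isr_regions_events.NumEventsSevGE2 <> NULL) {
  Status = Int(ServiceInstance.STATEMODELNODE.isr_regions_events.NumEventsSevGE2);
}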

Figure 5. OwnEvents rule settings

And on my scorecard:

Figure 6. OwnEvents in a scorecard

Let’s send a test event to the last level now:

Figure 7. Sending test event

Figure 8. Test event settings

Figure 9. OwnEvents after sending test event

As you can see, the event's severity was passed up through the whole service tree; that is why the icon in the Events column changed color to purple from the bottom level right up to the top one.

After sending a critical event to the 2nd level, the icons from the 2nd level up to the top changed their color to red.

Figure 10. OwnEvents after sending 2nd test event

Make note. In order to perform this exercise, I haven’t created a status propagation rule. And I will not!

Take a look at the OwnEvents column. Even though status was propagated through the service tree from bottom to top, the OwnEvents rule worked for every level individually: Europe shows bad events in the Events column, but its OwnEvents column shows that 0 events affected that level directly.

Now, let’s try to make every level aware of events happening on the level below it.

Prepare such a policy:

/* trigger_totalevents */
log("Triggered: " + ServiceInstance.STATEMODELNODE.trigger_totalevents.Value);

Status = 0;

si = ServiceInstance.SERVICEINSTANCENAME + " (" + ServiceInstance.DISPLAYNAME + ")";

if(ServiceInstance.STATEMODELNODE.count_ownevents.Value <> NULL) {
  Status = Int(ServiceInstance.STATEMODELNODE.count_ownevents.Value);
}

log("Service instance: " + si + " own events count: " + Status);

i = 0;
while (ServiceInstance.CHILDINSTANCEBEANS[i] <> NULL) {
  ci = ServiceInstance.CHILDINSTANCEBEANS[i].SERVICEINSTANCENAME + " (" + ServiceInstance.CHILDINSTANCEBEANS[i].DISPLAYNAME + ")";

  if(ServiceInstance.CHILDINSTANCEBEANS[i].NUMCHILDREN > 0) {
    // The child has children of its own, so take its (already aggregated) total events count
    grandChildEvents = 0;

    if(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_totalevents.Value <> NULL) {
      grandChildEvents = Int(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_totalevents.Value);
    }
    log("Service instance: " + si + ", child: " + ci + " children events: " + grandChildEvents);

    Status = Status + grandChildEvents;
  } else {
    // The child is a leaf, so take its own events count only
    childOwnEvents = 0;
    if(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_ownevents.Value <> NULL) {
      childOwnEvents = Int(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_ownevents.Value);
    }
    log("Service instance: " + si + ", child: " + ci + " own events: " + childOwnEvents);

    Status = Status + childOwnEvents;
  }

  i = i + 1;
}

log("Service instance: " + si + " total events count: " + Status);

I called this policy count_totalevents_policy_1 and saved it within a numerical formula rule called count_totalevents.

Figure 11. TotalEvents rule settings

At the same time, create another rule, a numerical aggregation rule, in which you point to the just-created rule within the same template. Make sure you name this rule exactly as indicated in the comment header of the policy inside the numerical formula created a moment ago (trigger_totalevents).

Figure 12. TriggerTotalEvents rule settings

By the end you should have the following list of rules in your template:

Figure 13. T_Regions template complete rules set

Make note. After creating a template rule pointing to the same template as a child template, the template will disappear from the templates list in the Service Navigator portlet. To fix this, add that template to any other template by associating it via any type of status propagation rule:

Figure 14. T_Regions template associated to templateFinder

And this is the result that should appear at the end in your scorecard:

Figure 15. TotalEvents column in a scorecard

It looks like the concept works fine. Let's push it further and send another event to every level, starting from Malopolska, then Poland, then Europe.

Figure 16. TotalEvents column after sending more test events

It looks correct: every level's OwnEvents count increased by 1 and I have 5 events in total in the entire tree, 2 on the leaf, another 2 in the middle and just 1 on the root level.

Let's add a new level below Malopolska and call it Krakow. This simulates expanding the service tree, e.g. after a fresh import from TADDM or a CMDB.

Figure 17. OwnEvents and TotalEvents after adding a new child service

Let's now send a new event, with Severity 3, to Krakow:

Figure 18. OwnEvents and TotalEvents after sending a test event to the new child service

The new event affected Krakow and was included correctly in the TotalEvents calculations on all levels. Let's now create one level above them all, called Earth:

Figure 19. OwnEvents and TotalEvents after adding a new root service

Adding Earth didn't change the TotalEvents count, of course, but the current total was reflected on the new top/root level. Let's send another event to Poland:

Figure 20. OwnEvents and TotalEvents after sending test events to the new root service

The total event count increased by 1 again. Only Europe’s OwnEvents column value increased by 1.

Let's now remove Krakow from the leaf level to see if the TotalEvents count will decrease by 1:

Figure 21. OwnEvents and TotalEvents after removing the child service from the tree

And it is correct again: after removing Krakow with its 1 event, the overall TotalEvents count dropped by 1 too and now equals 6.

This is it. If you like this post, let me know, also in case of questions.

Take care, until the next one!

mp

TBSM multiple event identifiers

Friday, August 14th, 2015

This blog entry explains how to set up and use multiple event identifiers in IBM Tivoli Business Service Manager 6.1.1 FP3 and previous releases.


Introduction

The official documentation doesn't say much about multiple event identifiers:

http://www-01.ibm.com/support/knowledgecenter/SSSPFK_6.1.1.3/com.ibm.tivoli.itbsm.doc/ServiceConfigurationGuide/bsmu_srvt_edit_id_fields.html
http://www-01.ibm.com/support/knowledgecenter/SSSPFK_6.1.1.3/com.ibm.tivoli.itbsm.doc/ServiceConfigurationGuide/bsmu_isrt_create_good_marg_bad_rules.html
http://www-01.ibm.com/support/knowledgecenter/SSSPFK_6.1.1.3/com.ibm.tivoli.itbsm.doc/customization/bsms_dlkc_id_rules.html


So let's do a quick summary of what we can do:
– we can set multiple event identifiers in an incoming status rule
– we can set multiple event identifiers in EventIdentifierRules.xml or any artifact of category eventidentifiers in the XMLtoolkit


But how do we make sure they match, and how do we know that they do?


Solution

Multiple event identifiers in incoming status rules are logically associated as if there were a logical AND operator between them:

pic1

pic2

So this definition of the CAM_FailedRequestsStatusRule_TDW rule should be understood as: the rows returned by CAM_RRT_SubTrans_DataFetcher will affect my service only if the data returned in the following fields has the following values:

APPLICATION=MyApp AND
SUBTRANSACTION=MySubTrans AND
TRANSACTIONS=MyTrans


At the same time, if I have multiple values for the same label, it means OR.
For an incoming status rule like:

pic3

pic4

So my instance expects that the CAM_BSM_Identity_OMNI rule will catch all events with either of two alternative BSM_Identity values:

– MyApp#t#MyTrans#s#MySubTrans OR
– MySubTrans(5D21DD108FD43941892543AA0872D0EA)-cdm:process.Activity


Looking at EventIdentifierRules.xml, there's a concept of policies and rules, for example:

<Policy name="ITM6.x">
  <Rule name="ManagedSystemName">
    <Token keyword="ATTR" value="cdm:ManagedSystemName"/>
  </Rule>
</Policy>
<Mapping policy="ITM6.x" class="%" />

You can have many policies mapped to many classes (which can be mapped to many templates), and you can have many rules within every policy.


In our case, for the ITCAM Tx subtransactions class we have one policy with many rules:

<Policy name="CAM_SubTransaction_Activity">
  <Rule name="CAM_GetBSM_Identity">
    <Token keyword="ATTR" value="cdm:ActivityName" />
    <condition operator='like' value='%#s#%' />
  </Rule>
  <Rule name="CAM_GetApplicationName" field="APPLICATION">
    <Relationship relationship='cdm:uses'
        relationshipSource='cdm:process.Activity'>
      <Relationship relationship='cdm:federates'
          relationshipSource='cdm:process.BusinessProcess'>
        <Token keyword="ATTR" value="cdm:ActivityName" />
      </Relationship>
    </Relationship>
  </Rule>
  <Rule name="CAM_GetTransactionName" field="TRANSACTIONS">
    <Relationship relationship='cdm:uses'
        relationshipSource='cdm:process.Activity'>
      <Token keyword="ATTR" value="cdm:Label" />
    </Relationship>
  </Rule>
  <Rule name="CAM_GetSubTransactionName" field="SUBTRANSACTION">
    <Token keyword="ATTR" value="cdm:Label" />
    <condition operator='like' value='%#s#%' />
  </Rule>
</Policy>

 

and one mapping of that policy to a class:
<Mapping policy="CAM_SubTransaction_Activity" class="cdm:process.Activity" />

 

But one class can have many policies mapped onto it:
<Mapping policy="CAM_Transaction_Activity" class="cdm:process.Activity" />
<Mapping policy="CAM_SubTransaction_Activity" class="cdm:process.Activity" />
<Mapping policy="CAM_TT_Object" class="%" />

 

This means every mapping of a policy to a class acts as an element of a logical OR operation, and every rule acts as an element of a logical AND operation together with the other rules within the same policy.

It is all conditional, though, because here the additional aspect of the field parameter of the <Rule> tag comes into play.

 

The field parameter

The field parameter of a rule within a policy in EventIdentifierRules.xml means that the rule will be used only if a field with the same name is also specified as a service instance name field in the incoming status rule.

 

So there is no AND operator applied to those rules in an EventIdentifierRules.xml policy whose fields haven’t been specified in the incoming status rule of the template.

 

On the other hand, no value will be assigned to the service instance name fields selected in the incoming status rule of Template A if the corresponding fields haven’t been configured in the rules of the policies mapped to the class (which is mapped to Template A in CDM_TO_TBSM4x_MAP_Templates.xml) in EventIdentifierRules.xml.

 

Conclusion.

You need to go to two places to configure your event identifiers:

  1. Templates and incoming status rules / numerical rules / text rules – the Service Instance Name Fields
  2. The XMLtoolkit artifact EventIdentifierRules.xml (or any custom artifact from the eventidentifiers category) – the field parameters in the rules defined within policies

Additionally, don’t forget: your policies defined in eventidentifiers artifacts must be mapped to CDM or custom classes that have a mapping definition stored in CDM_TO_TBSM4x_MAP_Templates.xml and that map to the same template holding the incoming status rule (or numerical/text rule) with the Service Instance Name fields you want.

Otherwise your events or fetched KPIs won’t affect your service tree elements, you will not show correct status or availability on dashboards, and your outage reports will miss data and generate false monthly results!

Drop me a note if you experience any troubles reading this article or trying to apply what’s written here, thanks!

mp

Netcool/Impact and ServiceNow!

Friday, December 6th, 2013

Have you ever tried to integrate Netcool/Impact and ServiceNow!?

ServiceNow!… It’s an interesting piece of software, I must admit, and it lives in a public cloud, on the Internet. You can create a lot of customizations and your own projects and applications like CMDB, incident management or problem management – the marketing people from SN! would probably tell you about more benefits.

I’ve recently had a need to integrate my Impact 6.1.1 with SN! via SOAP and just wanted to share a few general tips.

1. SN! WSDLs don’t seem to be generated correctly for Axis2 and must be adjusted manually before you approach compiling them in Netcool/Impact. I basically open my WSDL, visit every complexType entry and remove the name attribute and its value, so I leave <complexType> only. Only then will your WSDL compile in nci_compilewsdl. Thank you, Yasser from the Impact dev team, for pointing me to that one! 🙂
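For the record, here’s a rough sketch of that manual fix (the file name is an example, the element may carry an xs:/xsd: prefix in your WSDL so adjust the pattern, and back the file up first – this is what works for me, not an official recipe):

# strip the name attribute and its value from every complexType element,
# leaving <complexType> only, so that nci_compilewsdl accepts the WSDL
sed -i.bak -E 's/<(xsd?:)?complexType[[:space:]]+name="[^"]*"/<\1complexType/g' servicenow_incident.wsdl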

2. The SN! CA-signed certificate must be imported into your ImpactProfile WAS. It’s from Entrust Inc. And here’s the full instruction for Impact: http://www-01.ibm.com/support/docview.wss?uid=swg21592616

3. You’ve got basic HTTP authentication in SN! and probably won’t be allowed to switch it off; in that case compilation of the WSDL is possible only locally, with a downloaded WSDL file (both from the CLI and the GUI).

4. The policy generated by the wizard is good and works well; usually having a single parameter selected helps it work better from the beginning (so you generate the WSParams variable correctly).

The documentation of SN! in the wiki is pretty good, however not everything is documented – and beware of your version of ServiceNow!

Everything else depends on your needs – whether it’s an incident management integration (opening tickets) or CMDB (importing service trees to TBSM).

That would be it. I’ll share more after I finish my integration with TBSM.

 

TBSM 6.1.1 database installation gotchas – real disk space requirements

Wednesday, May 1st, 2013

Have you ever tried to install TBSM 6.1.x on disk space that is not a single partition? It’s easy: you usually go for it on your demo VMware image, and it can be a production case too. It usually looks like this after all the installations (in this case we have a logical volume that we can extend at any time by adding some more disk space):

Picture 1. Regular disk partition layout on a Unix-like system, easy and boring.

 

Piece of cake. Let’s try something harder. Typically, AIX system storage is extensively divided across numerous disks in a farm or Power Systems storage. Nevertheless, a similar scenario can easily be emulated on your VMware box, assuming you create a number of independent disks during your virtual OS installation, like in the example below:

Picture 2. Creating non-single-disk-based disk space in virtual Linux.

The goal is to achieve the following setup: I have the home directory on one disk (sda5), the opt directory (historically for “optional” packages; it’s the default directory for most Tivoli packages on the Linux platform) on another disk (sda2), the tmp directory on sda6 and the root directory on sda1. See this:

Picture 3. Six successfully installed virtual disks on Linux

Now let’s follow the degradation of the available disk space as we install the next components of TBSM. We start with the DB2 9.7 installation.

Picture 4. Disk space after installation of DB2 database manager code.

It looks quite obvious: the DB2 default installation directory sits in /opt, which mounts the sda2 disk. You can check the exact space in MB taken by the DB2 database manager by running this command:

du -sm /opt/ibm/db2/V9.7

871

If you compare this to the disk space consumed by the DB2 instance itself, you’ll see the consumption in the /home directory. This is because I installed the default DB2 instance on the Linux platform, called db2inst1, which runs as the db2inst1 user and stores its data files in that user’s home directory. Here’s how a fresh DB2 instance impacts the available disk space on my Linux VM:

Picture 5. Disk space after configuring one default DB2 database instance.

 

So the disk space taken by the DB2 instance files (and don’t forget you have to create a fenced user for running stored procedures and at least one administrative user for managing all instances, hence db2fenc1 and dasuser1) is about 80 MB.
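If you want to verify this on your own box, a quick check could look like this (user names and home directory locations as created in this walkthrough; adjust if yours differ):

# rough footprint in MB of the fresh instance and its companion users in /home
du -sm /home/db2inst1 /home/db2fenc1 /home/dasuser1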

I’m sorry for dragging you this far to tell you this: it’s not going to work with your default TBSM installation. Why? There’s too little space for the DB2 instance data files. If you continue with the TBSM database installation, you’ll see these sizing options:

Picture 6. TBSM database installation size options.

According to the documentation for TBSM 6.1, available here, you need to secure the following disk space:

  • Small – 3G of disk space
  • Medium – 6G of disk space
  • Large – 10G of disk space

Well, these are maximum limits, someone may say. If you plan on running just a simple demo, it’s not going to break anything. This is true; however, would you have assumed that you need to monitor disk space consumption especially because of that? If yes, good for you, skip ahead. If not, consider this. Let me show you what’s going to happen and why you don’t want to use the default settings for data paths and log paths during the TBSM installation.

Let’s assume we simply continue the installation with the defaults. The database configuration, especially the transaction log file size for the TBSM database, won’t accept your disk space offer. This is how your installation is going to finish if you continue with it as presented above – it will simply fail:

Picture 7. Example of failure message during TBSM database installation if disk space is too low.

 

The same goes for the TBSM Metric Marker tables and the demo tables. If you go to the /opt/IBM/tivoli/tbsmdb/logs/db2_stdout.log log file, you’ll read:

CREATE TABLESPACE TBSMSCR16KTS PAGESIZE 16K MANAGED BY AUTOMATIC STORAGE INITIALSIZE 100 M BUFFERPOOL TBSMSCR16KBP

DB21034E  The command was processed as an SQL statement because it was not a

valid Command Line Processor command.  During SQL processing it returned:

SQL0968C  The file system is full.  SQLSTATE=57011

 

Off the record: after that entry in the log you can see that the installer tries to continue executing the DDL script without validating the available disk space based on the first failure occurrence, which starts a whole series of unfortunate events in consequence and doesn’t allow your installation to succeed at the end.
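A quick manual check of the file systems the installer is going to touch would have shown the problem upfront, for example:

# check free space on the mounts a TBSM DB installation writes to
df -h / /home /opt /tmp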

So it looks bad. Let’s see the disk space now:

Picture 8. Disk space after failing installation.

Well, it doesn’t look that bad now after it failed – we still have some space available. It means that creating the table space must have used up all the space first, then failed, then rolled back, and so released some of the taken disk space after all. But what happened? Well, after the TBSM database migrated from Postgres to DB2 in version 6.1.0 (the previous TBSM 4.2.1 – there is no TBSM 5.x – was using Postgres 8.x), it became exposed to all the challenges and gains coming from that fact. In order to understand it step by step, we need to get closer to a few DB2 design assumptions. Here’s a couple of settings which should draw your attention:

Picture 9. Default buffer pool settings for TBSM database.

So, the buffer pool settings come first. Rows read from tables are cached in memory in objects called pages (each holding a number of rows) every time DB2 has to read that data from external storage, meaning disk; the same pages are used when data is written from the buffer pool to a table space. A buffer pool’s capacity is simply its number of pages times the page size, so from the settings above you can calculate that the maximum amount of data in the 16K buffer pool will be 48 MB, and 32 MB in the 32K buffer pool. This is not something dangerous to your installation yet; it tells you about potential RAM consumption in the future, when you launch TBSM into production mode. But the screen also tells you about the table spaces being created for the TBSM database, so let’s take a closer look at them.

Note: if you actively use the TBSM Metric Marker and Metric History databases, they have their own separate settings and disk and memory consumption rates.

From this command:

CREATE TABLESPACE TBSMSCR16KTS PAGESIZE 16K MANAGED BY AUTOMATIC STORAGE INITIALSIZE 100 M BUFFERPOOL TBSMSCR16KBP

This tells you that the TBSM database uses automatic storage management, which means the database manager creates new data containers whenever needed. It is neither a System Managed Space (SMS) table space nor a Database Managed Space (DMS) one. The maximum capacity the data can reach is defined by the storage available while creating the TBSM database, determined by the path. In the screenshot below you can see that the TBSM database path is <default>.

Picture 10. Sample configuration of TBSM database, it shows default database path.

What is <default>? Well, go to your DB2 command line and, as the DB2 instance owner, run the following command:

db2 get dbm cfg | grep DFTDBPATH

By default it is the instance user’s home directory, in my case /home/db2inst1.

It means your database will grow until it hits the storage limits – in other words, endlessly, as long as any free disk space remains. That’s good to remember if you didn’t realize it or didn’t have a clear answer. It also means you don’t want to select <default> during your TBSM database installation; you rather want to check the space on the other disks and allocate the TBSM data files there.
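If you prefer the command line, one option is to change the default database path before the TBSM databases get created – a minimal sketch, where /data/db2 is just an example path:

# run as the instance owner; point newly created databases at a roomier mount
db2 update dbm cfg using DFTDBPATH /data/db2
# verify the change
db2 get dbm cfg | grep DFTDBPATH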

Next, let’s see the transaction logs, like below. These steps let you define how many logs can be created for your TBSM database and how big they can be. Again – for the TBSM Metric Marker and history databases it’s a totally separate story. By default you define 10 primary logs and 2 secondary logs, each 16000 4K pages big, which means up to 12*16000 = 192000 4K pages, i.e. 768 MB of data at most. Note: on the Unix and Windows platforms, the default log file size, both primary and secondary, is 1000 4K pages, with an allowed range of 4 – 1 048 572 (always times the 4K page size). This space gets allocated as soon as your database activates, which means you need to have it available on your hard disk immediately when you start your database manager. Logs, similarly to data files, by default use the default log path, see the screenshot below. By default it all goes to the /home/db2inst1 directory again:

Picture 11. Transaction log for TBSM database configuration snippet.

What to do then? Again, if you are short of space in your /home directory mount, select another value for the Log path name. Be knowledgeable: know what it all means for your installation. Take the installation hardware requirements seriously and monitor the usage.
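If you prefer to fix an existing database from the command line, a minimal sketch could look like this (the database name TBSM is an assumption – check yours with db2 list db directory – and the paths and sizes are examples):

# inspect the current transaction log configuration
db2 get db cfg for TBSM | grep -E "LOGFILSIZ|LOGPRIMARY|LOGSECOND|Path to log files"
# move the logs off /home to a dedicated mount (takes effect on the next activation)
db2 update db cfg for TBSM using NEWLOGPATH /data/db2logs
# or shrink the allocation: 12 logs of 4000 4K pages each = 192 MB instead of 768 MB
db2 update db cfg for TBSM using LOGFILSIZ 4000 LOGPRIMARY 10 LOGSECOND 2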

Last but not least, the TBSM DB directory, which is used to store the TBSM DB DDL files, the executables to recreate the TBSM DB, jars etc., takes its piece of the cake too: it’s 160 MB declared by the installer, and you can see it at the Summary step:

Picture 12. Summary view

 

Don’t forget about the temporary disk space the installer takes and returns – it must be available for the installation time. The installer is flexible and will look for 200 MB in /tmp or in the home directory of the user who runs it, in our case db2inst1’s home directory (I have only 170 MB available in /tmp):

Picture 13. /tmp disk space is too low

Conclusions.

I’ll need more than the 1 GB of disk space that I assigned to the /home directory to install the TBSM database successfully, and the Install Guide is not specific about that.

This is because by default my data files and transaction logs go to the /home/db2inst1 directory, and because I simply didn’t create one single disk space for all directories – which can be a real case in production environments too. Additionally, all the temporary files of the installer will be copied there for the TBSM DB installation time.

So what is the real hard disk space requirement for TBSM DB installation?

/home

a) at least 1x768 MB for transaction logs (10 primary, 2 secondary, 16000 4K pages each, all in one TBSM database – meaning you don’t create a separate TBSMHIST db)

b) 80 MB for fresh db2 instance installation

c) 200 MB for temporary files in case you don’t have space in /tmp

d) suggested 3 GB for up to 5000 service instances

Total: 4048 MB minimum in /home

 

/opt

a) at least 160 MB for tbsmdb installation in /opt/IBM/tivoli/tbsmdb

b) 871 MB for DB2 database manager in /opt/ibm/db2/V9.7

Total: 1031 MB minimum in /opt

 

Keep in mind this is still before installing the TBSM data server, dashboard server, JazzSM, the XMLtoolkit and the ITM Agent for TBSM or the embedded Netcool/OMNIbus with the EIF probe.

So, if you’re lucky and have one single / root partition for all your files on your Linux or Unix box, prepare a minimum of 5.1 GB of disk space in total for the TBSM database preparation. If you get tempted to create the TBSMHIST database, it will be another 768 MB for its separate transaction logs; you can lower the number of logs or decrease the single log file size to accommodate it, though. You’ll be 200 MB ahead if you secure enough space in the /tmp directory (I usually give it 500–1000 MB to be safe).

That’s all for now, thanks and see you next time.

 

 

XMLtoolkit stop issues

Thursday, March 28th, 2013

If this has ever happened to you – the XMLtoolkit doesn’t want to stop normally or gives you other issues related to creating a connection to itself – it must be a registry error.

Here’s the symptom:

 

[netcool@tbsm61 bin]$ ./tbsmrdr_stop.sh

GTMCL5478W: The request could not be delivered, the toolkit may be down. If the toolkit is busy processing data, allow it to complete and shutdown gracefully. If the toolkit is idle but will not stop, reissure the request with the -f flag.

The exception was: Exception creating connection to: 172.16.1.103; nested exception is:

java.net.NoRouteToHostException: No route to host

retCode: 4

 

So I try to stop my XMLtoolkit instance, the script fails, and the toolkit itself keeps running.

 

This is mainly because of the XMLtoolkit failover capability. Each instance registers itself by the IP of the machine it runs on, usually the IP configured for the first network interface. You can check on this at any time:

 

[netcool@tbsm61 bin]$ ./registryupdate.sh -U db2inst1 -P smartway -v

GTMCL5457I: Toolkit registry table information.

ID: 1

Name: 172.16.1.103

Primary: true

Action: 0

LastUpdate: Thu Jan 01 01:00:00 CET 1970

ID: 2

Name: null

Primary: false

Action: 0

LastUpdate: Thu Jan 01 01:00:00 CET 1970

GTMCL5358I: Processing completed successfully.

retCode: 0

 

The corresponding value is written to xmltoolkitsvc.properties under the DL_Toolkit_Instance_ID property.

 

It may happen especially on a virtual machine which is being reconfigured as it moves to new networks. There’s a quick remedy for this: update the DL_Toolkit_Instance_ID property with a new value, like a static IP or a unique hostname, in the xmltoolkitsvc.properties file, and register that value in the database.
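A minimal sketch of the first step (the path to the properties file is an assumption for a default install and the hostname is an example; adjust both to your environment):

# assumed default location of the toolkit properties file
PROPS=/opt/IBM/tivoli/tbsm/XMLtoolkit/bin/xmltoolkitsvc.properties
# replace the registered instance ID with a stable, resolvable hostname
sed -i.bak 's/^DL_Toolkit_Instance_ID=.*/DL_Toolkit_Instance_ID=tbsm61.example.com/' "$PROPS"

Then register the new value in the toolkit registry table: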

 

[netcool@tbsm61 bin]$ ./registryupdate.sh -U db2inst1 -P smartway -s 1

GTMCL5458I: Setting the toolkit registry table

GTMCL5358I: Processing completed successfully.

retCode: 0

[netcool@tbsm61 bin]$ ./registryupdate.sh -U db2inst1 -P smartway -v

GTMCL5457I: Toolkit registry table information.

ID: 1

Name: 10.10.10.21

Primary: true

Action: 0

LastUpdate: Thu Jan 01 01:00:00 CET 1970

ID: 2

Name: null

Primary: false

Action: 0

LastUpdate: Thu Jan 01 01:00:00 CET 1970

GTMCL5358I: Processing completed successfully.

retCode: 0

 

The second run of the script, with the -v flag, will help you verify that the value was set correctly.

 

Now it’s time to stop the toolkit without issues:

[netcool@tbsm61 bin]$ ./tbsmrdr_stop.sh

GTMCL5443I: The script toolkit_stop.xml has been submitted for processing.

retCode: 0

And this is it.

 

 

Business Service Composer – pain in the… artifact?

Friday, March 22nd, 2013

I’ve started actively using the Business Service Composer for creating my service trees in TBSM 6.1.0.1. It’s a really powerful tool that helps you do a lot of things with the service structure, regardless of the components’ CDM classes (or any classes in any namespace, regardless of their origin) – however, with one big glitch: it’s a huge step back from the single administration console approach, deselecting TIP as the portal of choice in Tivoli and using a Java GUI instead. Two practical remarks:

– If you really want to enjoy the BSC GUI, don’t use it over an ssh tunnel from within your PuTTY session to the TBSM server (MS Windows users). Losing focus on buttons or just-selected menu items is so frequent that it confuses more often than it works fine. It’s a nightmare, let’s put it straight. Just run it on your local workstation instead; it will save you a lot of nerves and time.

– But once you run it on your desktop, you’ll have to get over the pain of updating project files via SCP (Windows users), there and back again, in order to upload any single change to your static resources or policies.

It would probably be better to run it with everything mounted locally from the remote site, where the projects directory already sits in the right place for the xmltoolkit script to pick it up and load it into artifacts, without all that copying nightmare.

This is it. Powerful, conditionally usable, still some way from nice tooling. And it has nothing to do with the TBSM Service Editor. Confusing? We have application descriptors and templates to model the dependency of CIs on applications first, then we have some functions of the TADDM GUI, then we have BSC, and at the end we have the TBSM Service Editor. Confusing? It all works and each has its special use, but I’d have a piece of better advice for development. And I promise to share it.

 

A few tips for successful new TBSM 4.2.1 failover + existing OMNIbus 7.3 failover

Wednesday, August 31st, 2011

I’d like to present a quick checklist for a successful TBSM 4.2.1 failover implementation in an existing TIP/WebGUI and OMNIbus environment. Some of you may have gone through this and can confirm it’s not that piece of cake unless you know some secrets. True, details always change the final picture, hence I’m presenting just a checklist to follow – at least a guidance, especially for newcomers to the topic.

1. First: update your software to the most recent releases, including fix packs and interim fixes. Even better – if one is releasing soon, wait for it. There’s always something in the fixed code that you may want to have. Or at least read the APAR records for the incoming or latest fix pack to learn whether there’s something useful for failover coming.
Second: collect all the manuals and tips/tricks borrowed from your good mates who have done it and got stuck a couple of times before. Every single script automating the pain of configuring many files matters. If you’re lucky and have good mates, you may not even need this checklist 😉
2. I assume the worst scenario – OMNIbus 7.3.0 exists before the TBSM installation, so you’ll have to go through schema updates and a dashboard server over WebGUI installation. On the other hand, this situation should come with a stable ObjectServer failover set already up, running, and for free. I’m assuming that your ObjectServers are two in a multitier architecture – specifically, a virtual aggregation pair – and, secondly, that the gateway is configured properly with correct sync types and intervals and just does its job too.
3. Check both ObjectServers’ alerts.status table schemas – they must be the same (see the sketch after this list). When switching from the primary ObjectServer to the backup one, the TBSM data server does not discover the schema automatically again. The TBSM Event Broker reads alerts.status columns ordered by ‘OrdinalPosition’, which means the ordinal position of each column must be the same on each ObjectServer. So you need to run the BSM schema updates on the existing ObjectServers carefully, with respect to all the previous schema updates, and do not change the previous order – the original, predefined columns in alerts.status must remain in their places.
Secondly, make sure that the alerts.service_deps table and the clear_service_deps automation were imported correctly too.
4. Update the gateway mapping and table definition files after all this. Do the update for the alerts.status RAD_, TEC_, ITM_ and BSM_ field mappings, and for the alerts.service_deps field mapping.
5. TBSM does not support a failback to the original primary data server; the last primary stays primary as long as it doesn’t go down itself. Also, TBSM does not automatically fail back from the backup ObjectServer, if that one is primary to TBSM after a failover situation. The second point is important for your FO_GATE configuration, as events synced from the primary ObjectServer to the backup one must stay updated in the backup’s alerts.status and alerts.service_deps after the failback too.
6. For the TBSM data and dashboard server configuration, better use the fo_config script. It’s smart and easy. Watch out for the DASH_ settings; they will be applied to the dashboard server. The script will also secure the previous versions of the files it updates.
7. If you apply failover against a running production server, plan for a maintenance window, as the failover implementation does require restarting the data and dashboard servers – and we assume your dashboard servers will be running on existing production TIPs with WebGUI, with a band of NOC guys online, just watching the AEL, right?
8. The key store file created during the “primary” TBSM server installation should be reused during the “backup” TBSM installation. Keep an eye on it, secure it, and pass it when the secondary TBSM installer asks for it.
9. The Service Details portlet on the Service Administration and Service Availability pages is part of Webtop and will get updated with the clickedOn event from the Service Tree portlet during failover only if the data sources in WebGUI were previously set correctly.
10. tipadmin may not be the best user to test the failover with. Go for nco users and the VMM ObjectServer plugin instead, and install the plugin on all machines, so both data and dashboard servers. Unless you have to integrate with LDAP.
11. If you really want to, you may experiment with the newest settings introduced with TBSM 4.2.1 FP2 and IF3 – consumerQueue and eventsInThread. No risk, no fun.
12. If the ObjectServer contains a lot of previously raised events, prepare for hard times with the EventBroker and the ConsistencyChecker. Java heap size analysis and value increases may become necessary on both data servers.
13. Switch on the finer or finest tracing and logging levels. Don’t get scared, though.
14. Run the primary data server first, then wait and observe the primary’s trace.log until it contains exceptions about detecting a non-operational backup RAD facade – that’s the signal for you to start your backup data server. Even before running any TBSM data server, make sure that the ObjectServers and the gateway are up. After the data servers, start the dashboard servers.
15. When testing the failover, give TBSM, TIP and the ObjectServers some time to synchronize and resynchronize before sending the next test events. Of course in real life, during regular operations, there may be no time for the switch to the backup server or for the failback, and TBSM may miss an event match to service instances. Keep this in mind.
16. Observe, make notes, and report the failover scenario results. Make the needed corrections, clean old events to have a clear picture, and restart again and again, until it works perfectly. Then let it go for real events.
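As promised in item 3, here’s a rough sketch of how to compare the two alerts.status schemas (the server names AGG_P and AGG_B and the empty root password are assumptions – adjust them to your environment):

# dump the alerts.status schema from both ObjectServers and compare
echo "describe alerts.status;" | nco_sql -server AGG_P -user root -password '' > primary.schema
echo "describe alerts.status;" | nco_sql -server AGG_B -user root -password '' > backup.schema
diff primary.schema backup.schema && echo "schemas match"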

This is it. A simple, nice, straightforward procedure, isn’t it?