Event Management System
-----------------------

This document explains how to format the data handled by epush and passed to
eventd, and what happens to it.

1. Events structure

1.1 Event fields

An event has the following fields:

* level
    Gives the level of the event. The value is one of the following strings
    (from most to least problematic):

      EMERGENCY (or EMERG) : system is unusable
      URGENT (or URG)      : action must be taken immediately
      CRITICAL (or CRIT)   : critical conditions
      ERROR (or ERR)       : error conditions
      WARNING (or WARN)    : warning conditions
      NOTICE               : normal, but significant, condition
      INFO                 : informational message
      DEBUG                : debug-level message

    These values are taken from the PIKT level values, which are in turn
    derived from the syslog ones.

* targethost
    The host affected by the problem, or from which the data are taken. The
    value is a string containing the hostname. The name might be misleading,
    so it might change in the future. Choose the hostname carefully, since it
    affects mapping in the PHP web interface. FQDN CNAMEs are preferred, in
    order to avoid confusion.

* offender
    When used by an IDS or by tcp_wrappers, it might be useful to record the
    IP address of the remote offender. This field is here just for that. You
    can later consolidate intrusion attempts by intruder, for instance.

* type
    The event type, as discussed in section 2. It ranges from 0 to 2:

      0 : 'down' events, which indicate a failure condition
      1 : 'up' events, which indicate a recovery condition, preferably after
          a down event
      2 : 'data' events, which pass measurements to RRDTool

* subtype
    A supplemental free-form field that can hold more detail about the type
    of event. For instance, in a 'down' event, subtype could contain
    'reboot', 'maintenance', etc. This can be useful for later reference when
    calculating availability and failure ratios.

* source
    The source of the event. This is a free-form string indicating the
    process that generated the event (e.g. pikt, tcp_wrappers, etc.).
    It has a special meaning for type 2 events, explained in section 3,
    "Data gathering".

* task
    The task description, in the PIKT sense. Explains in English the purpose
    of the process, alarm, etc. that generated the event. For instance:

      task:Checks system state

* class
    A very important free-form string field. eventd uses it for correlation
    with ancestors. So, if you generate events with, say, the
    HostUpChkEmergency PIKT alarm, you ALWAYS have to send those events with
    the same class. Example:

      Monitor/HostUpChkEmergency/tux

* comment
    Free-form text that describes the problem more extensively. This field
    can be repeated in the message.

* extended
    Free-form text that can hold specifically formatted data for external
    applications. It is a kind of 'comment', but intended for machine
    processing. Example:

      extended:timeout

1.2 Events formatting

Events created by clients for epush have the following structure:

    fieldname: value
    fieldname: value
    EOF

The order in which the fields appear is not important. The only fields that
can appear several times are 'comment' and 'extended'. If any other field
appears more than once, its last occurrence is used.

There are 4 mandatory fields: targethost, type, level, class.

1.3 Server side supplemental fields

When the event arrives at eventd via UDP, the server adds other fields. Also,
when the event is inserted in the database, an 'id' (primary key) is
automatically assigned to the event.

* id
    A unique ID auto-assigned by the database.

* father_id
    The 'id' of the father (ancestor) to which the received event is
    correlated.

* host
    The host from which the UDP datagram was received. Most of the time, this
    field will be identical to targethost, but it may vary because of reverse
    name resolution. Also, an event may concern a host other than the one
    that sent it (in the case of remote service monitoring, for instance).

* date_emitted
    This field is filled by epush (client side). It might be used to detect
    time alignment differences between clients, and between clients and the
    server.

* date_received
    This field is filled by eventd (server side). It is there as a reference
    when there is a need to manually correlate or sort events together.

* state
    Holds the state of the event:

      open   (no corresponding 'up' event received yet)
      closed (a corresponding 'up' event resolving that problem has been
              received)

    Another state, 'archived', is also provisioned.

1.4 Example

    level:EMERG
    targethost:www.microsoft.com
    type:0
    source:test
    task:Checks system state
    class:Monitor/HostUpChkEmergency/tux
    comment:Host www.microsoft.com is down
    extended:
    extended:time out

If you wish to pass that event via epush at the command line, you could do:

    # echo "level:EMERG
    > targethost:www.microsoft.com
    > type:0
    > source:test
    > task:Checks system state
    > class:Monitor/HostUpChkEmergency/tux
    > comment:Host www.microsoft.com is down
    > extended:
    > extended:time out" | epush -s server
    #

and your event would go to 'server', port 2131.

2. Types of events and correlation

Currently, EMS handles three types of events:

    - 'down' events, which indicate a failure condition,
    - 'up' events, which indicate a recovery condition, preferably after a
      down event,
    - 'data' events, which pass measurements to RRDTool.

When eventd receives a type 0 event, it tries to find an ancestor, i.e. the
closest (in time) identical down event. If it finds one, the new event is
considered to be a child of that ancestor (called the father event). If there
is no ancestor, or if the ancestor is a type 1 event, the new event is
considered to be a father.

When eventd receives a type 1 event, it tries to find the necessary previous
type 0 event, and marks the previous type 0 events as 'closed'. If it can't
find one, it generates an error event itself.

When eventd receives a type 2 event, it passes the received data to RRDTool;
see section 3, "Data gathering".

3. Data gathering

Type 2 events are handled separately. If you have RRDTool installed on the
eventd server, you can pass data through EMS using the following fields:

    type:2
    subtype:update
    source:N:123:456
    class:/var/lib/rrdtool/CounterRejectedMail.rrd

This will pass 123 for the first DS update, 456 for the second, etc. The
database file is passed in 'class'. See the rrdupdate man page, and the
'REMOTE CONTROL' section in the rrdtool man page.
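
The type 2 field layout above can also be assembled by a small script. What
follows is only a minimal sketch in Python, not part of EMS: the function
names format_event and rrd_update_event, the hostname mail.example.org and the
default level INFO are all assumptions made for illustration. It also assumes
that a data event still carries the four mandatory fields from section 1.2,
and that fields are written as 'name:value' with no space after the colon, as
in the example of section 1.4.

```python
# Hypothetical helpers, not part of EMS: they only build the text an
# epush client would be fed on standard input.

MANDATORY_FIELDS = ("targethost", "type", "level", "class")


def format_event(fields):
    """Render (name, value) pairs as one 'name:value' line per field."""
    names = [name for name, _ in fields]
    missing = [name for name in MANDATORY_FIELDS if name not in names]
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))
    return "".join(f"{name}:{value}\n" for name, value in fields)


def rrd_update_event(targethost, rrd_file, values, level="INFO"):
    """Build a type 2 event carrying one RRDTool update.

    Assumption: level and targethost are included because section 1.2
    lists them as mandatory; the level chosen here is arbitrary.
    """
    # 'N' tells RRDTool to use the current time; each following value
    # fills one DS, in order.
    source = ":".join(["N"] + [str(value) for value in values])
    return format_event([
        ("level", level),
        ("targethost", targethost),
        ("type", 2),
        ("subtype", "update"),
        ("source", source),
        ("class", rrd_file),
    ])


if __name__ == "__main__":
    print(rrd_update_event("mail.example.org",
                           "/var/lib/rrdtool/CounterRejectedMail.rrd",
                           [123, 456]), end="")
```

The printed text could then be piped to 'epush -s server' in the same way as
the echo example in section 1.4.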