Event Management System
-----------------------
This document explains how to format the events handled by epush and
passed to eventd, and what happens to them.
1. Events structure
1.1 Event fields
An event has the following fields :
* level
Gives the level of the event.
Value is one of the following strings (from most to least problematic) :
EMERGENCY (or EMERG) : system is unusable
URGENT (or URG) : action must be taken immediately
CRITICAL (or CRIT) : critical conditions
ERROR (or ERR) : error conditions
WARNING (or WARN) : warning conditions
NOTICE : normal, but significant, condition
INFO : informational message
DEBUG : debug-level message
Those values are taken from the PIKT level values, which are in turn
derived from the syslog ones.
* targethost
The host affected by the problem, or from which the data
are taken.
Value is a string containing the hostname. The field name might
be misleading, so it might change in the future. Choose the
hostname carefully, since it affects the mapping in the PHP web
interface. FQDN CNAMEs are preferred, in order to avoid confusion.
* offender
When EMS is used by an IDS or by tcp_wrappers, it might be useful
to record the IP address of the remote offender. This field is there
just for that. You can later consolidate intrusion attempts by
intruder, for instance.
* type
The event type, as discussed in section 2.
It can range from 0 to 2 :
0 : 'down' events, which indicate a failure condition,
1 : 'up' events, which indicate a recovery condition, preferably after a down
event,
2 : 'data' events, which pass measurements to RRDTool
* subtype
A supplemental free-form field that can hold more detail
about the type of event.
For instance, in a 'down' event, subtype could contain 'reboot',
'maintenance', etc... This can be useful for later reference
when calculating availability and failure ratios.
* source
The source of the event. This is a free-form string
indicating the process that generated the event (e.g. pikt,
tcp_wrappers, etc...). It has a special meaning for type 2 events,
explained in section 3, "Data gathering".
* task
The task description, in the PIKT sense. Explains in English what
the purpose was of the process, alarm, etc... that generated the event.
For instance : task:Checks system state
* class
A very important free-form string field. It is used by
eventd for correlation with ancestors. So, if you generate events
with, say, the HostUpChkEmergency PIKT alarm, you ALWAYS have to send
them with the same class. Example : Monitor/HostUpChkEmergency/tux
* comment
A free-form text that describes the problem more
extensively. This field can be repeated in the message.
* extended
A free-form text that can hold specifically formatted data for
external applications. A kind of 'comment', but intended for
machine processing.
Example : extended:timeout
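The level and type values listed above can be checked on the client before an event is sent. The following is a minimal Python sketch, not code taken from EMS: the names LEVELS, ALIASES, TYPE_NAMES, normalize_level and severity_rank are hypothetical, and accepting lower-case input is an assumption.

```python
# Canonical level names, ordered from most to least problematic,
# with the short aliases listed in the 'level' field description.
LEVELS = ["EMERGENCY", "URGENT", "CRITICAL", "ERROR",
          "WARNING", "NOTICE", "INFO", "DEBUG"]
ALIASES = {"EMERG": "EMERGENCY", "URG": "URGENT", "CRIT": "CRITICAL",
           "ERR": "ERROR", "WARN": "WARNING"}

# Meaning of the numeric 'type' field.
TYPE_NAMES = {0: "down", 1: "up", 2: "data"}


def normalize_level(value):
    """Return the canonical level name, or raise ValueError."""
    # Assumption: surrounding blanks and letter case are not significant.
    name = value.strip().upper()
    name = ALIASES.get(name, name)
    if name not in LEVELS:
        raise ValueError("unknown level: %r" % value)
    return name


def severity_rank(value):
    """Return 0 for the most problematic level, 7 for the least."""
    return LEVELS.index(normalize_level(value))
```

Sorting events by severity_rank puts the most problematic ones first.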
1.2 Events formatting
Events created by clients for epush have the following structure :
fieldname: value
fieldname: value
EOF
The order in which the fields appear is not important. The only
fields that can appear several times are 'comment' and 'extended'.
If any other field appears more than once, its last occurrence is
used.
There are 4 mandatory fields :
targethost, type, level, class
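The formatting rules above can be sketched as a small Python parser. This is an illustration under stated assumptions, not the parser used by epush or eventd: the function name parse_event is hypothetical, blank lines are skipped, and blanks around names and values are stripped.

```python
MANDATORY = ("targethost", "type", "level", "class")
REPEATABLE = ("comment", "extended")


def parse_event(text):
    """Parse 'fieldname:value' lines into a dict.

    Repeatable fields collect every value in a list; for any other
    field the last occurrence wins.
    """
    event = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        # Split on the first ':' only, so values may contain ':' too
        # (e.g. source:N:123:456).
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError("not a 'fieldname:value' line: %r" % line)
        name, value = name.strip(), value.strip()
        if name in REPEATABLE:
            event.setdefault(name, []).append(value)
        else:
            event[name] = value
    missing = [f for f in MANDATORY if f not in event]
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))
    return event
```

Values are kept as strings; converting 'type' to an integer is left to the caller.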
1.3 Server side supplemental fields
When the event arrives at eventd via UDP, the server adds other
fields. Also, when the event is inserted in the database, an 'id'
(primary key) is automatically assigned to the event.
* id
A unique ID auto-assigned by the database.
* father_id
The 'id' of the father (ancestor) to which the received
event is correlated.
* host
The host from which the UDP datagram was received. Most of the
time, this field will be identical to targethost, but it may vary
because of reverse name resolution. Also, when an event is sent, it
might concern another host (in the case of remote service monitoring
for instance).
* date_emitted
This field is filled in by epush (client side). It might
be used to detect clock differences between clients, and between
clients and the server.
* date_received
This field is filled by eventd (server side). It is
there as a reference when there is a need to manually correlate or
sort events together.
* state
Holds the state of the event :
open (no corresponding 'up' event received yet)
closed (a corresponding 'up' event resolving that problem has been received)
A further state, 'archived', is also provided for.
1.4 Example
level:EMERG
targethost:www.microsoft.com
type:0
source:test
task:Checks system state
class:Monitor/HostUpChkEmergency/tux
comment:Host www.microsoft.com is down
extended:
extended:time out
If you wish to pass that event via epush at the command line, you
could do :
# echo "level:EMERG
> targethost:www.microsoft.com
> type:0
> source:test
> task:Checks system state
> class:Monitor/HostUpChkEmergency/tux
> comment:Host www.microsoft.com is down
> extended:
> extended:time out" | epush -s server
and your event would go to 'server', port 2131.
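The same thing can be done from a script. The sketch below, in Python, builds the event text and pipes it to 'epush -s server' exactly as the echo command does. The function names are hypothetical, and it is an assumption that epush reports failure through its exit status.

```python
import subprocess


def build_event_text(fields):
    """Build the 'fieldname:value' text from a list of (name, value) pairs.

    A list of pairs is used rather than a dict so that 'comment' and
    'extended' can appear several times.
    """
    return "".join("%s:%s\n" % (name, value) for name, value in fields)


def send_with_epush(fields, server):
    """Pipe the event text to 'epush -s <server>' on standard input."""
    text = build_event_text(fields)
    # check=True raises CalledProcessError on a non-zero exit status
    # (assumption: epush signals errors that way).
    subprocess.run(["epush", "-s", server], input=text, text=True, check=True)


EXAMPLE = [
    ("level", "EMERG"),
    ("targethost", "www.microsoft.com"),
    ("type", "0"),
    ("source", "test"),
    ("task", "Checks system state"),
    ("class", "Monitor/HostUpChkEmergency/tux"),
    ("comment", "Host www.microsoft.com is down"),
    ("extended", ""),
    ("extended", "time out"),
]
```

Calling send_with_epush(EXAMPLE, "server") requires epush to be installed and in the PATH.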
2. Type of events and correlation
Currently, EMS handles three types of events :
- 'down' events, which indicate a failure condition,
- 'up' events, which indicate a recovery condition, preferably after a
  down event,
- 'data' events, which pass measurements to RRDTool
When eventd receives a type 0 event, it tries to find an ancestor,
i.e. the closest (in time) identical down event. If it
finds one, the new event is considered to be a child of that ancestor
(called the father event). If there is no ancestor, or if the ancestor
is a type 1 event, the new event is considered to be a father.
If eventd receives a type 1 event, it tries to find the corresponding
previous type 0 events, and considers them
'closed'. If it can't find one, it generates an error event itself.
When eventd receives a type 2 event, it passes the received data to
RRDTool, see "Data gathering" section below.
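The correlation rules for type 0 and type 1 events can be sketched in Python on an in-memory list. This is only an illustration, not eventd's actual code: the function name correlate is hypothetical, "identical" is assumed to mean "same class", and the content of the returned error event is invented for the example.

```python
def correlate(events, new):
    """Add 'new' to 'events' (oldest first); return an error event or None.

    Events are dicts with at least 'id', 'type' (0 or 1) and 'class'.
    'father_id' and 'state' are filled in here for type 0 events.
    """
    # Assumption: two events are "identical" when their class matches.
    same = [e for e in events if e["class"] == new["class"]]
    if new["type"] == 0:
        ancestor = same[-1] if same else None  # closest in time
        if ancestor is not None and ancestor["type"] == 0:
            new["father_id"] = ancestor["id"]  # child of that ancestor
        else:
            new["father_id"] = None            # the new event is a father
        new["state"] = "open"
        events.append(new)
        return None
    if new["type"] == 1:
        opened = [e for e in same
                  if e["type"] == 0 and e["state"] == "open"]
        for e in opened:
            e["state"] = "closed"
        events.append(new)
        if not opened:
            # Hypothetical error event; its real fields are not documented.
            return {"type": 0, "level": "ERROR", "class": new["class"],
                    "comment": "up event without a matching down event"}
        return None
    raise ValueError("type 2 events are not correlated")
```

A real server would do the same lookups in the database instead of scanning a list.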
3. Data gathering
Type 2 events are handled separately. If you have RRDTool installed
on the eventd server, you can pass data through EMS using the following
fields :
type:2
subtype:update
source:N:123:456
class:/var/lib/rrdtool/CounterRejectedMail.rrd
This will pass 123 for the first DS update, 456 for the second, etc...
The database file is passed in 'class'.
See the rrdupdate man page, and the 'REMOTE CONTROL' section in the
rrdtool man page.
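To show how the fields map to RRDTool, here is a Python sketch that turns a type 2 event into an 'update' command line of the form "update <file> <time>:<value>:<value>", as accepted by 'rrdtool update' and by rrdtool's remote control mode. How eventd really hands the data to RRDTool is not described here, so the function name rrd_update_command and the checks it performs are assumptions.

```python
def rrd_update_command(event):
    """Build an rrdtool 'update' command line from a type 2 event.

    'class' holds the RRD file and 'source' holds the update string,
    e.g. N:123:456 (N means "now" for rrdtool).
    """
    if str(event["type"]) != "2" or event.get("subtype") != "update":
        raise ValueError("not a type 2 'update' event")
    return "update %s %s" % (event["class"], event["source"])


def data_values(event):
    """Return the DS values carried in 'source', without the timestamp."""
    return event["source"].split(":")[1:]
```

With the fields shown above, data_values returns the two strings '123' and '456', one per DS.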