Tuesday 17 December 2013

Track and trace with Camel and Splunk

We have been using Camel for system integration for some years now, and are very pleased with the flexibility it has provided us.
Some time ago I built a Camel component for Splunk, which is now on Camel master and will ship with the coming 2.13.0 release, and I think a blog post about how it came about is in order.


The big picture

We run a JMS hub-and-spoke architecture with a central message hub consisting of topics and queues.

On each side of the hub we have Camel routes that act as integration points to other systems.
Some routes act as consumers that collect data from provider systems. These integrations are typically pull based, e.g. databases of various flavors, file, FTP, S3 or JMS.
The data collected is transformed to a common format (XML) and published to a queue or topic on the hub.
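A consumer route of this kind might look roughly like the sketch below; the file directory, transformer bean and hub queue name are made up for illustration.

    import org.apache.camel.builder.RouteBuilder;

    public class OrderConsumerRoute extends RouteBuilder {

        // Hypothetical transformer that turns the raw payload into the common XML format.
        public static class OrderToXmlTransformer {
            public String toXml(String body) {
                return "<order>" + body + "</order>";
            }
        }

        @Override
        public void configure() throws Exception {
            // Pull-based pickup from a provider system (a file drop in this sketch),
            // transformed to the common XML format and published to a queue on the hub.
            from("file:/data/inbox/orders")
                .bean(OrderToXmlTransformer.class, "toXml")
                .to("jms:queue:ORDER_HUB");
        }
    }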

On the other side of the hub we have Camel routes that act as event providers. These routes consume the messages from the hub, transform them to a target-system-specific format, and send them on to the destination system using a variety of protocols such as SOAP, HTTP, database tables, stored procedures, JMS, file, FTP and raw sockets.
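An outbound route could look like the sketch below; the XSLT stylesheet and the target URL are placeholders for whatever the destination system actually expects.

    import org.apache.camel.builder.RouteBuilder;

    public class OrderDeliveryRoute extends RouteBuilder {
        @Override
        public void configure() throws Exception {
            // Consume the common XML format from the hub, transform it to the
            // target system's format and deliver it over HTTP.
            from("jms:queue:ORDER_HUB")
                .to("xslt:xslt/order-to-target.xsl")            // hypothetical stylesheet on the classpath
                .to("http://target-system.example.com/orders"); // hypothetical target endpoint
        }
    }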

All in all we are very pleased with this architecture since it has provided us with the desired flexibility and robustness.

Tracing messages

Early in the process we discovered that integration can often be something of a black box, so you have to think about insight and traceability from the start. We needed insight into what was going on when we routed messages between systems, and we also needed to keep a history of the message flow.
Therefore every integration adapter publishes audit messages (a copy of the original payload received or published), with some additional metadata about the integration in the header, to an audit queue on the hub.

Example of a route with a custom Camel component that creates an audit trail.
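Our audit component is in-house and not public, so the sketch below shows the same pattern with plain Camel instead: a wire tap sends a copy of the payload, together with metadata headers about the integration, to the AUDIT_HUB queue. The header names, integration name and target endpoint are illustrative.

    import org.apache.camel.builder.RouteBuilder;

    public class AuditedDeliveryRoute extends RouteBuilder {
        @Override
        public void configure() throws Exception {
            from("jms:queue:ORDER_HUB")
                // Wire-tap a copy of the original payload to the audit sub-route
                // before delivering the message to the target system.
                .wireTap("direct:audit")
                .to("http://target-system.example.com/orders");

            from("direct:audit")
                .setHeader("integrationName", constant("order-delivery"))
                .setHeader("auditTimestamp", simple("${date:now:yyyy-MM-dd'T'HH:mm:ss}"))
                .to("jms:queue:AUDIT_HUB");
        }
    }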

Audit trail adapter

The audit trail adapter consumes the messages from the audit queue (AUDIT_HUB) and stores the message payload in a database for short-term storage. This is usually enough to answer questions like "what happened yesterday between 9 and 10 on integration x, what did the message contain, and did the target system receive the message?".
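A minimal sketch of such an adapter, assuming the camel-sql component with a DataSource configured, and with made-up table and column names, could look like this:

    import org.apache.camel.builder.RouteBuilder;

    public class AuditTrailAdapter extends RouteBuilder {
        @Override
        public void configure() throws Exception {
            // Consume audit messages from the hub and store them for short-term lookup.
            // The :#name parameters are resolved from the message headers.
            from("jms:queue:AUDIT_HUB")
                .setHeader("payload", body())
                .to("sql:insert into audit_event (integration, created, payload) "
                    + "values (:#integrationName, current_timestamp, :#payload)");
        }
    }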


There is also an Angular app that lets users search and view events passing through the integration platform.


This has made it possible to gain fine-grained insight over a short period of time, but for a more holistic and proactive approach something else was needed.

Splunk

That was when I stumbled upon Splunk. It can ingest data of almost any kind, offers powerful search over large amounts of data, alerting, and a really easy way to build dashboards with real-time data if needed.
To get data into Splunk I created the Camel Splunk component, and with that in place it is quite easy to get data in, as this example illustrates.
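The original example is not reproduced here, but a route that pushes the audit events into a Splunk index with the new component could look roughly like the sketch below. The Splunk host, credentials and index name are placeholders, and the SplunkEvent mapping is just one way of building the event.

    import org.apache.camel.Exchange;
    import org.apache.camel.Processor;
    import org.apache.camel.builder.RouteBuilder;
    import org.apache.camel.component.splunk.event.SplunkEvent;

    public class AuditToSplunkRoute extends RouteBuilder {
        @Override
        public void configure() throws Exception {
            // Map each audit message to a SplunkEvent and submit it to the
            // audit-trail index over the Splunk REST API.
            from("jms:queue:AUDIT_HUB")
                .process(new Processor() {
                    public void process(Exchange exchange) throws Exception {
                        SplunkEvent event = new SplunkEvent();
                        event.addPair("integration", exchange.getIn().getHeader("integrationName", String.class));
                        event.addPair("payload", exchange.getIn().getBody(String.class));
                        exchange.getIn().setBody(event);
                    }
                })
                .to("splunk://submit?host=splunkhost&port=8089"
                    + "&username=admin&password=secret&index=audit-trail&sourceType=camel-audit");
        }
    }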


With data in a Splunk index the fun begins!
Now we can use Splunk Web to search the data and build a dashboard with panels to go on a display in our office, both for insight and proactive alerting.
First up is the data we have ingested into the audit-trail index. We want to display a real-time graph of events flowing through the platform, broken down by the most active adapters.
This is done in Splunk Web by creating a search query and, when happy with the result, choosing a way to visualise it. The end result is a panel in XML format:


Panels can be combined to build a dashboard page like this one.

Splunk lets you install different apps from a Splunk "App Store". The middle two panels are built using the Splunk JMX app, which can connect to running JVMs and ingest JMX data into a Splunk index.
The app has a configuration file where you specify which MBeans and attributes should be ingested.
Since Camel exposes a lot of JMX statistics, you can even ingest those into Splunk, as this sample config snippet illustrates.

The Camel statistics can then be used in Splunk for ad hoc reporting, dashboards and alerting as needed.

Search and view Camel JMX attributes in Splunk


We had cases where event processing took too long, which mattered because we were dealing with recording of live streams. With the Camel statistics we could build a report that showed the integrations (routes) involved, and at which times processing was slow. With this information at hand it is easier to drill down and decide where to fix a problem.



The final piece of the dashboard is a platform health indicator (OK, Warn and Error).
Since our integration platform already has a REST endpoint that our monitoring system collects status information from, we can use that to ingest status data into Splunk.

For that we installed another Splunk app, rest_ta. The app calls the REST endpoint and ingests status information into a surveillance-status index.
The dashboard panel uses a range component from Splunk to indicate the status:


Final dashboard on our office wall, with the health status at the top.



I nearly forgot to mention that we also created alerts for when certain events happen, e.g. when no data has been ingested on a given integration.

My final words on Splunk would be that it's a Swiss Army knife for analyzing, understanding and using data, and that I'm only starting to get a grip on the possibilities because there are so many.

If you want to try out the Camel and Splunk combo, there is a small Twitter sample hosted in my GitHub repo.