What stack are you using for Logs/Monitor/Alerts


Total Posts: 446
Joined: Jan 2015
Posted: 2018-01-08 11:53
I'm curious what other NP'ers who are running automated trading systems are using when it comes to logging, monitoring, and alerts. I'm poking my nose into this topic, since I want to upgrade my current setup to something shinier. I haven't really put much effort into this side of things. Up until now, I've pretty much gotten by dumping output to stdout, piping to log files, then just regularly checking things with grep/sed/awk by shelling into the production machine.

However, I have a baby at home and am doing a lot of trading in a different timezone. So, I'd like to make it easier to step away, plus offload some of the responsibility to a non-technical person on my team. It'd be interesting to hear what other solutions people are using in this area. Particularly any good open source or relatively cheap software that can just be plugged in and turned on. It's hard to do research in this area, since everything's so web-dev focused. Off the top of my head here's a rough outline of what I'm looking at (critique or suggestions definitely welcome):

- Log in application to syslog (instead of stdout)
- Logstash for sync'ing logs from prod to archive
- Nagios to let me know if the server blows up or quoter dies
- Logstash/Splunk to pub/sub trading events from the quoter output
- Pagerduty to blow up my phone in case shit hits the fan
- Some sort of web frontend for easy monitoring: refresh PnL, positions, trades, other strat-specific stats.
- Bonus points if that frontend could also plot intraday PnL, etc. Unfortunately can't really find any good type of project that does this out of the box. Would be nice if Graphite or Kibana could be easily shoehorned into doing this...

Good questions outrank easy answers. -Paul Samuelson


Total Posts: 1091
Joined: Nov 2004
Posted: 2018-01-08 12:47
I would be interested to see proposals on that topic too actually!

As a side note, and this will be my very modest contribution, my experience over the years has told me that the worst always happens silently... somewhere deep inside the systems and stays undiscovered and unlogged until sh... hits the fan. As a result I have been a supporter of jobs always systematically and politely saying: "I am done now".

If you are not living on the edge you are taking up too much space.


Total Posts: 1251
Joined: Jun 2007
Posted: 2018-01-08 13:05
Non-Trading applications, but probably still interesting:
we use the ELK stack a lot. And grafana.

In addition, we started to blur the line between data output and logs a bit.

We have some metadata (like counts, averages, medians, standard deviations and ranges of values and of processing times, deltas of these values to historical data, "hasFinished" flags with timestamps, etc.) as structured data that is written to a database.

This makes it easier and more straightforward to create supervisor and sanity-check jobs, as well as jobs that create technical reports and dashboards.
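
A minimal sketch of the "hasFinished" flag idea, assuming SQLite as a stand-in for the database (the post does not say which one is used). The table name `job_runs`, its columns, and the job names are all hypothetical; the point is only that a supervisor job can find silent failures by asking which expected jobs never reported that they were done.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn):
    # One row per job per day; has_finished stays 0 until the job says it is done.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS job_runs (
               job_name     TEXT NOT NULL,
               run_date     TEXT NOT NULL,
               row_count    INTEGER,
               has_finished INTEGER NOT NULL DEFAULT 0,
               finished_at  TEXT,
               PRIMARY KEY (job_name, run_date)
           )"""
    )


def mark_started(conn, job_name, run_date):
    conn.execute(
        "INSERT OR REPLACE INTO job_runs (job_name, run_date) VALUES (?, ?)",
        (job_name, run_date),
    )


def mark_finished(conn, job_name, run_date, row_count):
    # The job politely says "I am done now" and records a simple sanity metric.
    conn.execute(
        "UPDATE job_runs SET has_finished = 1, row_count = ?, finished_at = ? "
        "WHERE job_name = ? AND run_date = ?",
        (row_count, datetime.now(timezone.utc).isoformat(), job_name, run_date),
    )


def unfinished_jobs(conn, expected_jobs, run_date):
    # Supervisor check: expected jobs that never started or never finished.
    rows = conn.execute(
        "SELECT job_name FROM job_runs WHERE run_date = ? AND has_finished = 1",
        (run_date,),
    ).fetchall()
    done = {name for (name,) in rows}
    return sorted(set(expected_jobs) - done)


conn = sqlite3.connect(":memory:")
init_db(conn)
mark_started(conn, "load_trades", "2018-01-08")
mark_finished(conn, "load_trades", "2018-01-08", row_count=1250)
mark_started(conn, "load_quotes", "2018-01-08")  # started but never finishes
missing = unfinished_jobs(
    conn, ["load_trades", "load_quotes", "eod_report"], "2018-01-08"
)
print(missing)
```

Here `missing` contains both the job that started without finishing and the one that never ran at all, which is the case a plain log file tends to hide.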

Here we use Scala, Python, the database itself and (I am embarrassed) Jenkins for alerts and e-mails.

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...


Total Posts: 7
Joined: Oct 2010
Posted: 2018-01-09 12:53
This is what we use for monitoring our infrastructure (Micro$oft .NET and Python shop).
Besides an automated trading systems marketplace, we run a traditional online brokerage business. I don't have any reference to compare scales, but we're generating around 100 GB of logs per day and 15,000 real-time metrics.

Everything is an open source project running on autopilot, with almost no maintenance needed :)


At the application level we verbosely log almost everything using log4net. We started out saving logs into daily rolling text files, but soon ran into disk issues, so we created an appender that fires and forgets all log traces to RabbitMQ.
These traces are consumed by a small process that indexes them into Elasticsearch.

We're evaluating a migration from log4net to Serilog to create structured log traces, which are much easier to deal with at the monitoring stage.
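
To illustrate what "structured log traces" buys you, here is a rough sketch in Python rather than .NET, so it is not Serilog or log4net. It uses only the standard `logging` module; the `JsonFormatter` class, the `fields` key passed via `extra`, and the example field names are assumptions made for the illustration. Each event becomes one JSON object, so a downstream indexer can filter on `symbol` or `reason` without regex-parsing a message string.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields are passed via extra={"fields": {...}} (our convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name, stream=None):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buf = io.StringIO()
log = make_logger("orders", buf)
log.info(
    "order rejected",
    extra={"fields": {"symbol": "MSFT", "reason": "price_band"}},
)
event = json.loads(buf.getvalue())
print(event["symbol"], event["reason"])
```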


We used Graphite for a couple of years. It's a pain in the ass to install, but super powerful when it comes to transforming, combining and performing computations on series data.

Again we faced some scalability issues (disk bottleneck), and moved to InfluxDB (on Windows!). It's extremely easy to install, and Telegraf (InfluxDB's plugin-driven agent for collecting and reporting metrics) is just awesome.

The downside is that InfluxDB is way worse in terms of metric aggregation and computation. We've had a hard time trying to replicate the real-time dashboards we had with the same Graphite metrics.


Grafana. period.

Very easy to install, and super powerful. You can add multiple datasources (we've used Graphite, InfluxDB and Elasticsearch) and combine metrics into real-time dashboards. You can visualize time series, create single-panel metrics with alerting colors, and much more.

We do use Kibana for some manual Elasticsearch log analysis, but once you identify something you want to monitor in real time, don't hesitate: Grafana.


Years ago we developed a homemade alerting system based on some database flags. I've done some tests using Grafana's alerting module and it's quite impressive. Out of the box you have email alerting (attaching fancy chart images), but the great thing is that you have plenty of channel plugins (like Telegram, Slack and so on).

For telephone/SMS we use traditional SMS providers, but if I had to choose now I'd have a look at Twilio.

My two cents!


Total Posts: 18
Joined: Jun 2011
Posted: 2018-04-11 13:36
I've been doing some labs lately to move from a similar scenario (ssh into servers and grep) to something more practical.

I'm testing two separate pipelines.
Each trading server runs a filebeat agent that collects FIX logs in realtime and dumps them into a kafka topic as is, line by line.

#Pipeline 1 (Statistics and Monitoring)
Kafka -> Logstash -> ElasticSearch

Logstash parses the FIX logs to json using
At the end I use Kibana for easily searching through logs and Grafana for plotting stuff like orders/sec, rejects/sec, etc.
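
For readers unfamiliar with FIX, the parsing step in Pipeline 1 amounts to something like the following. This is a Python illustration, not the Logstash configuration used in the post (which is not shown). It assumes standard tag=value FIX with the SOH (`\x01`) delimiter; the `TAG_NAMES` table covers only a handful of common tags and the sample message is made up.

```python
import json

SOH = "\x01"  # standard FIX field delimiter

# Small subset of well-known FIX tags; unknown tags keep their numeric key.
TAG_NAMES = {
    "8": "BeginString",
    "35": "MsgType",
    "11": "ClOrdID",
    "55": "Symbol",
    "54": "Side",
    "38": "OrderQty",
    "44": "Price",
}


def parse_fix(line, delimiter=SOH):
    """Turn one raw FIX log line into a flat dict suitable for JSON indexing."""
    fields = {}
    for pair in line.strip().split(delimiter):
        if not pair:
            continue  # trailing delimiter yields an empty chunk
        tag, _, value = pair.partition("=")
        fields[TAG_NAMES.get(tag, tag)] = value
    return fields


# Hypothetical NewOrderSingle (35=D), as it might appear on one log line.
raw = SOH.join(
    ["8=FIX.4.2", "35=D", "11=ord-1", "55=MSFT", "54=1", "38=100", "44=88.5"]
) + SOH
doc = parse_fix(raw)
print(json.dumps(doc))
```

Once messages are in this shape, counts such as orders/sec or rejects/sec reduce to filtering on `MsgType` and bucketing by timestamp.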

#Pipeline 2 (PnL and Trade database)
For this I have a custom python service that gets trades from the kafka topic and marketdata from our ticker plant.
The service calculates positions and marks to market in realtime and pushes to a kafka topic.
Another service reads this topic and writes to Postgres.
Then I use Grafana to query periodically and update the PnL on the dashboards.
For actual production workloads this might be a lot of data and I'm looking into the TimescaleDB extension in case Postgres chokes.
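
The position and mark-to-market calculation in Pipeline 2 can be sketched roughly as below. This is not the poster's service: the `Book` class is hypothetical, it handles a single instrument, and it ignores fees, multipliers and currencies. It tracks signed position and cash, so total PnL at any mark is cash plus position times mark.

```python
from dataclasses import dataclass


@dataclass
class Book:
    """Running position and cash for one instrument."""

    position: float = 0.0
    cash: float = 0.0

    def on_trade(self, side, qty, price):
        # Buys add to position and spend cash; sells do the opposite.
        signed = qty if side == "buy" else -qty
        self.position += signed
        self.cash -= signed * price

    def pnl(self, mark):
        # Realized plus unrealized PnL, marking the open position at `mark`.
        return self.cash + self.position * mark


book = Book()
book.on_trade("buy", 100, 10.0)
book.on_trade("sell", 40, 10.5)
total = book.pnl(mark=10.25)
print(book.position, total)
```

In this example 40 shares were closed for a gain of 0.50 each and 60 remain open with a gain of 0.25 each at the mark, so the total is 35.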

For application logs I haven't had the time yet but I want to try something similar to pipeline 1, just writing to ElasticSearch then use for live tail and grep.


Total Posts: 446
Joined: Jan 2015
Posted: 2018-05-30 19:33
Finally got around to doing this properly. It turned out to be a fun little project, and thought I'd add my personal retrospective. Thanks so much to everyone who contributed in this thread. All of the suggestions were great, and even when I didn't use the specific tools, seeing the case studies really helped me build a mental framework for this domain.

Approaching the Problem

Given the Cambrian explosion of tools written by the hordes of hipster devops at SV unicorns, getting started can be overwhelming. Plus add in that the categories between projects can be fairly nebulous. There are lots of products that are both complements and substitutes to each other. There are lots of products that span categories in weird ways, so that if you use X you might not need Y, but if you replace X with Z then you definitely need Y, but can drop T. It makes it hard to plan out a full stack.

That being said, as someone who was a total neophyte I really thought this podcast was a good way to get into the mindset of people who do this for a living. James Turnbull also wrote a book (Art of Monitoring), which was good, but promoted his hobbyhorse project too much.

An important question to keep in mind: are your servers pets or cattle? If you're monitoring a handful of co-located trading boxes, then it's definitely the former. And that makes your needs very different from those of the typical web startup using an elastic cloud. Another is: what metrics and processes do you really care about? It's easy to get feature bloat trying to track everything under the sun, but probably not worth the effort. Finally, push vs pull-based is one of the big ideological divisions in this space, so at least be aware of the difference and the pros/cons.

Circuit Breakers

This was outside the scope of my project, but I just want to echo goldorak here. Monitors and alerts are no substitute for safe behavior inside the application. When a box is touching real money, it absolutely needs built-in safeguards. Millions can be vaporized in milliseconds, and the best dashboard in the world isn't going to fix that. Bugs are going to crop up and do crazy shit in well under human reaction time. Monitoring only exists to help the system operator clean up after the fact.
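
As a rough illustration of what an in-application safeguard might look like (a sketch only, with a hypothetical `RiskGuard` class and made-up limits, not anything described in the thread): a pre-trade check that rejects orders breaching a position limit and latches into a halted state once a loss limit is hit, so no further orders go out until a human resets it.

```python
class RiskGuard:
    """Pre-trade check with a latching kill switch."""

    def __init__(self, max_position, max_loss):
        self.max_position = max_position
        self.max_loss = max_loss
        self.tripped = False
        self.reason = None

    def allow_order(self, position, signed_qty, pnl):
        if self.tripped:
            return False  # stays halted until reset() is called
        if pnl <= -self.max_loss:
            self.tripped = True
            self.reason = f"loss limit breached: pnl={pnl}"
            return False
        if abs(position + signed_qty) > self.max_position:
            return False  # reject this order only, no halt
        return True

    def reset(self):
        # Meant to be called by a human operator after investigation.
        self.tripped = False
        self.reason = None


guard = RiskGuard(max_position=1000, max_loss=5000)
ok_small = guard.allow_order(position=900, signed_qty=50, pnl=0)
ok_large = guard.allow_order(position=900, signed_qty=200, pnl=0)
ok_after_loss = guard.allow_order(position=0, signed_qty=10, pnl=-6000)
ok_latched = guard.allow_order(position=0, signed_qty=10, pnl=0)
print(ok_small, ok_large, ok_after_loss, ok_latched, guard.reason)
```

The check runs synchronously in the order path, which is the point: it acts in the same time frame as the bug, while the dashboard only reports it afterwards.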


I ultimately decided that ELK was probably overkill. My chosen solution was a lot more rudimentary. Log output to stdout, save a single compressed flat file per quoter instance, then stash the individual files in S3 tagged by date, market and server. Anything that needs to be formally structured (like market data or trading activity) can be pulled out in an ETL pipeline at end-of-day.
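
A small sketch of the archive step, under stated assumptions: the key layout, the `archive_key` and `compress_log` helpers, and the example market/server/instance names are all hypothetical, and the actual upload is left out since it depends on whichever S3 client you use. The idea is just that date, market and server are encoded in the object key, so finding yesterday's file for one quoter is a direct lookup.

```python
import gzip
from datetime import date


def archive_key(trade_date, market, server, instance):
    # Hypothetical layout: one compressed file per quoter instance per day.
    return (
        f"logs/date={trade_date.isoformat()}/market={market}/"
        f"server={server}/{instance}.log.gz"
    )


def compress_log(text):
    # Compress the day's stdout capture before upload.
    return gzip.compress(text.encode("utf-8"))


key = archive_key(date(2018, 5, 30), "NASDAQ", "box-01", "quoter-msft")
blob = compress_log("09:30:00.001 quote MSFT bid=98.10 ask=98.12\n")
print(key, len(blob))
```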

I think Logstash is a great product, but I don't think you really need it in a trading context. First, you're not running continuously, so real-time ingestion isn't really necessary. Formal, persistent and structured data is probably only required after the market closes for the day. Second, you probably do everything in a single binary, and don't need to collect data from a bunch of disparate sources and microservices. Third, Elasticsearch is good when you have a deluge from a bunch of undifferentiated hosts. Something weird might happen and you have no idea where it came from, so you need the ability to carefully search everything. In contrast, if you want to know what happened with MSFT yesterday, you already know which quoter traded it. It's as simple as grabbing the specific log file from S3. (Or from the production machine, if it's before end-of-day rotation.)


Prometheus for the backend, Grafana for the frontend. I'd recommend Grafana hands down. It just works, and produces beautiful, responsive dashboards. Dead simple to set up.

At first I was hesitant to go with Prometheus. My intuition was that I really needed something check-based or event-based, rather than a metric-based solution. I looked at Nagios/Icinga/Sensu. Nagios is terrible, and no new project in the 21st century should be using it. The other options are big improvements, but are still working off Nagios' core design flaws. Ultimately I think event-based is just a flawed paradigm. Almost all the solutions suffer from some combination of painfully inefficient setup, heavyweight resource usage, ill-defined data models, confusing separation between collection and storage, poor scalability, and/or unreliability.

In Prometheus "everything is a metric", which seems unintuitive at first, but is a more powerful data model. For one, that means everything drops out into nice time series that you can throw up on a Grafana dashboard. That makes diagnosing problems and understanding behavior a lot easier. It also means that you don't need to maintain separate pipelines for continuous metrics and discrete events.

With Prometheus you get a pretty rich metadata model and query language that lets you turn metrics into events and vice versa. Sync'ing data between servers is dead simple. Nagios style checks can easily be done with the pushgateway just by piping your pre-existing scripts to curl.

That technique is also what I used to switch over my pre-existing check scripts into the monitoring framework. Rather than adding explicit scrapes to the quoter binary, it was really simple just to have crontab periodically scrape each instance's stdout logs and send the relevant metrics to pushgateway.
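
The post does this with shell scripts piped to curl; the same idea in Python might look like the sketch below. The log line format, the regex, the metric names `quoter_pnl` and `quoter_position`, and the host in the usage note are all assumptions. What is not assumed is the Pushgateway convention itself: metrics are sent in the Prometheus text format to `/metrics/job/<job>/instance/<instance>`.

```python
import re
import urllib.request

# Hypothetical log format: "... pnl=<float> position=<int>"
LINE_RE = re.compile(
    r"pnl=(?P<pnl>-?\d+(?:\.\d+)?)\s+position=(?P<position>-?\d+)"
)


def latest_metrics(log_lines):
    """Return metrics from the most recent log line that carries them."""
    for line in reversed(log_lines):
        m = LINE_RE.search(line)
        if m:
            return {
                "quoter_pnl": float(m["pnl"]),
                "quoter_position": float(m["position"]),
            }
    return {}


def exposition(metrics):
    # Prometheus text format: one "name value" per line, newline-terminated.
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def push_url(base, job, instance):
    return f"{base}/metrics/job/{job}/instance/{instance}"


def push(base, job, instance, metrics):
    # Performs the HTTP POST; needs a reachable Pushgateway.
    req = urllib.request.Request(
        push_url(base, job, instance),
        data=exposition(metrics).encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


lines = [
    "09:30:01 quote MSFT pnl=12.5 position=100",
    "09:30:02 heartbeat",
    "09:30:03 quote MSFT pnl=-3.25 position=-40",
]
metrics = latest_metrics(lines)
body = exposition(metrics)
print(body, end="")
```

Run from cron, a script like this would read the tail of each instance's log and call something like `push("http://localhost:9091", "quoter", "msft-1", metrics)`, where the host, job and instance names are placeholders.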

There are a few trading specific things that Prometheus isn't designed to natively deal with. E.g. I don't care about alerts that occur outside market hours. The good news is the metadata and query language is flexible enough to handle this. But you may have to scratch your head for a second, and read the docs. Overall Prometheus takes 30 seconds to get running, but there's a decent learning curve for its deeper features.
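
The market-hours logic itself is simple; the sketch below expresses it in Python purely to make the condition concrete. It is not the poster's Prometheus rule. The session window is a hypothetical fixed UTC range that ignores daylight saving and holidays. On the Prometheus side, the query language has `hour()` and `day_of_week()` functions (both in UTC) that can be combined with an alert expression to similar effect.

```python
from datetime import datetime, time, timezone

# Hypothetical session window in UTC; ignores DST and exchange holidays.
SESSION_OPEN_UTC = time(14, 30)
SESSION_CLOSE_UTC = time(21, 0)


def in_session(now):
    """True on weekdays between the configured open and close times."""
    now = now.astimezone(timezone.utc)
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and SESSION_OPEN_UTC <= now.time() < SESSION_CLOSE_UTC


def should_page(alert_firing, now):
    # Only page a human if the alert fires while the market is open.
    return alert_firing and in_session(now)


wednesday_open = datetime(2018, 5, 30, 15, 0, tzinfo=timezone.utc)
wednesday_late = datetime(2018, 5, 30, 22, 0, tzinfo=timezone.utc)
saturday = datetime(2018, 6, 2, 15, 0, tzinfo=timezone.utc)
print(
    should_page(True, wednesday_open),
    should_page(True, wednesday_late),
    should_page(True, saturday),
)
```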


The thing I like about Prometheus is that both the time-series database and collection agent live in the same system. If you use Graphite or InfluxDB to store metrics, then you probably need some sort of scraper or collection agent like statsD. Consolidating these two functions together simplifies the stack, and allows for richer metadata, and makes collection easier.

What you give up is having a professional-grade TSDB. You could use InfluxDB for long-term persistent storage of data, and even heavy-duty research. Like, I wouldn't use Prometheus for backtesting, but I might use InfluxDB.

But for me it was helpful to divide data into two buckets. Use Prometheus to store ephemeral but real-time metrics. Recognize that these are only used for production monitoring, with no need to keep them clean or maintain them for more than a day or two. Then ingest long-term persistent data into a separate structured solution at the end of each day. I think this is a better approach, but you do give up some cool capabilities by forgoing a unified system that ingests in real time.


PagerDuty pretty much just works. I'd definitely recommend it, although you have to pay $10-30 per person per month. It's got pretty much every possible feature you'd need. It routes to phone, email, SMS, and Slack. It integrates natively with Prometheus, and pretty much all the other monitoring solutions. Straightforward setup. It allows you to pre-define on-call schedules, and set escalation policies of arbitrary complexity. My only criticism is that maybe it comes with too many superfluous features. (I don't really need to see a chart of notifications broken down by phone vs email over time.)

Anyway, unless money is tight, forget about rolling your own alerts scripts. Just use Pagerduty.



Total Posts: 1251
Joined: Jun 2007
Posted: 2018-06-04 09:55
Thanks for that retrospective! Very much appreciated!



Total Posts: 1251
Joined: Jun 2007
Posted: 2018-09-20 10:27
Regarding alerting:
I am playing around with CUSUM at present to catch decaying predictive power in my models intraday.
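
For reference, a one-sided CUSUM for detecting a downward shift can be sketched as follows. This is a textbook form, not necessarily the variant used in the post, and the monitored series, `target`, `slack` and `threshold` values are made up. The statistic accumulates shortfalls below the target (less a slack allowance) and resets at zero, so a sustained decay triggers an alarm while isolated bad observations do not.

```python
def cusum_lower(values, target, slack, threshold):
    """Return the index of the first alarm for a downward mean shift, else None.

    s_t = max(0, s_{t-1} + (target - x_t) - slack); alarm when s_t > threshold.
    """
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (target - x) - slack)
        if s > threshold:
            return i
    return None


# Hypothetical intraday hit-rate series: stable around 0.54, then decaying.
hit_rate = [0.55, 0.52, 0.56, 0.54, 0.45, 0.44, 0.46, 0.43, 0.45]
alarm = cusum_lower(hit_rate, target=0.54, slack=0.02, threshold=0.2)
print(alarm)
```

In this example the shift starts at index 4 and the alarm fires at index 6. The `slack` and `threshold` parameters trade detection delay against false alarms, which is the calibration question raised later in the thread.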

What do you guys use to identify behaviour that should lead to alerts?



Total Posts: 67
Joined: Jul 2018
Posted: 2018-09-22 02:42
>What do you guys use to identify behaviour that should lead to alerts?

Depends on the strategy? I don't think there's a one-size-fits-all. Fill rates, exposure, rejects etc. If you have a strategy that tries to hit an order that you speculate will appear before any of your own fills get acked or any market data event is broadcast, then rejects are fairly important. If you have a strategy that has many open orders relative to typical exposure, then exposure is important.

[rant truncated]


Total Posts: 446
Joined: Jan 2015
Posted: 2018-09-23 05:34

Thanks for the very insightful and detailed post. A lot of wisdom from the trenches. Gonna have to digest some of the more philosophical points a little bit more.

But I did want to mention something, which is potentially very convenient if you don't already know...

> you don't want to drop your timestamp calls into the kernel,

On modern versions of the linux kernel, clock_gettime() syscalls never leave userland thanks to vDSO. The overhead is something on the order of 50 nanos depending on your specific setup.



Total Posts: 67
Joined: Jul 2018
Posted: 2018-09-24 00:24
Glad it was of use, I thought it was too verbose.

>On modern versions of the linux kernel, clock_gettime() syscalls never leave userland thanks to vDSO.

Yes, sorry, I use that phrase as a force of habit. What I mean is when you want to offload making the system call or issuing rdtsc(p) to the host CPU. This is less of a matter of latency but more of a matter of accuracy and determinism.


Total Posts: 1176
Joined: Jun 2005
Posted: 2018-09-24 12:06

Shame that you have cut that valuable rant. Apart from qualitative stuff which I stored into my memory, could you reiterate the list of components?

Here is a summary of the above to make your life easier.

Logging: log4net, Elastic Search, RabbitMQ
Metrics: Graphite, InfluxDB (+Telegraf), Prometheus
Monitoring: Grafana, Prometheus
Alerts: Grafana, PagerDuty

From my side, you guys look like Titans, so, perhaps, there is no interest in my DIY projects, but anyway, my two cents.

Having spent years in risk management at the office, at home I developed some "slow" algo-trading platforms, first in Java and now in Python, with:
Logging: flat files
Metrics: flat files + zipping (tried HDF5, very good, but could not find the will to keep developing infrastructure around it rather than focusing on strategies)
Monitoring: not yet... still choosing. I tend to think about something like plotly. In Java it was Swing+JFreeChart.
Alerts: 10 years ago it was a simple check of PnL consistency between my accounts and the exchange. If anything was wrong: a Loud Alert in the house (my wife was complaining) and a simple SMS. Now we will do the same.

A lot of thanks to all of you! Valuable read.


Total Posts: 67
Joined: Jul 2018
Posted: 2018-09-25 06:35
nikol: Logging, metrics and monitoring are mostly just composed of UI, messaging, serialization and persistence, all of which can be homebrewed. The external dependencies here could be any native GUI frameworks (GTK, Qt, Cocoa), web-based UI libraries, or the inbetween (Electron), a 3rd party brokered or brokerless messaging layer (RabbitMQ, Solace, Redis), any serialization format (protobuf, bson, msgpack, hdf5), and/or any DBMS (mongo, postgres, redis, snowflake).


Total Posts: 1251
Joined: Jun 2007
Posted: 2018-09-25 07:31
"Fill rates, exposure, rejects etc. If you have a strategy that tries to hit an order that you speculate will appear before any of your own fills get acked or any market data event is broadcast, then rejects are fairly important. "

Also thanks for the valuable input.

These are "features", like in "what variable am I tracking?". My question was more geared towards "given that I decided which variable represents the state of my system, how or when do I throw alerts?"

A simple threshold based on risk management or technical considerations? Some control chart thing like the CUSUM algo I mentioned? Do I throw warnings? How do I calibrate to improve precision and recall?

Monitoring is an exercise in state detection or classification. So which approach is used to detect something fishy?



Total Posts: 1176
Joined: Jun 2005
Posted: 2018-11-05 17:36
As an alert system I ended up with a Telegram bot, which lets me send push notifications.
It seems to be quite straightforward. The downside is that I must have an internet connection on my device.
This can be solved with the Telegram API (more complicated), which can duplicate the message by sending an SMS.
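
A minimal sketch of the bot approach, using only the standard library and the Telegram Bot API's `sendMessage` method. The token, chat id and alert text below are placeholders, and the helper names `build_send_message` and `send_alert` are made up for the example. Building the request is kept separate from sending it so the former can be checked without network access.

```python
import json
import urllib.request


def build_send_message(token, chat_id, text):
    """Build the HTTP request for the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(token, chat_id, text):
    # Performs the network call; returns True if the API reports success.
    req = build_send_message(token, chat_id, text)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return bool(json.load(resp).get("ok", False))


# Placeholder credentials: nothing is sent here, the request is only constructed.
req = build_send_message("123:TEST", 42, "PnL mismatch vs exchange")
print(req.full_url)
```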


Total Posts: 355
Joined: Feb 2014
Posted: 2020-09-09 14:31
@EspressoLover: I understand your reasoning, but Elastic delivers more value when you are using structured logging: you then know where each entry came from, and you can tag and later search the logs by any property. It's also easy to set up in the cloud now, so there is no longer much maintenance involved in upgrading and managing the stack (ML, Kibana, Elastic, etc.). We are using it in that setup, pushing to std*, to file and to the Elasticsearch database.

First Commander of the USS Enterprise


Total Posts: 903
Joined: Jun 2004
Posted: 2020-09-12 23:03
@Maggette I like CUSUM tests as they are intuitive. Recently saw this paper: timeseries change point techniques


Total Posts: 1251
Joined: Jun 2007
Posted: 2020-09-13 12:00
Excellent. Thx.
