Forums  > Software  > What stack are you using for Logs/Monitor/Alerts  
     

EspressoLover


Total Posts: 320
Joined: Jan 2015
 
Posted: 2018-01-08 11:53
I'm curious what other NP'ers who are running automated trading systems are using when it comes to logging, monitoring, and alerts. I'm poking my nose into this topic since I want to upgrade my current setup to something shinier. I haven't really put much effort into this side of things. Up until now, I've pretty much gotten by dumping output to stdout, piping to log files, then just regularly checking things with grep/sed/awk by shelling into the production machine.

However, I have a baby at home and am doing a lot of trading in a different timezone. So, I'd like to make it easier to step away, plus offload some of the responsibility to a non-technical person on my team. It'd be interesting to hear what other solutions people are using in this area. Particularly any good open source or relatively cheap software that can just be plugged in and turned on. It's hard to do research in this area, since everything's so web-dev focused. Off the top of my head here's a rough outline of what I'm looking at (critique or suggestions definitely welcome):

- Log in application to syslog instead of stdout (rough sketch below the list)
- Logstash for sync'ing logs from prod to archive
- Nagios to let me know if the server blows up or quoter dies
- Logstash/Splunk to pub/sub trading events from the quoter output
- Pagerduty to blow up my phone in case shit hits the fan
- Some sort of web frontend for easy monitoring: refresh PnL, positions, trades, other strat-specific stats.
- Bonus points if that frontend could also plot intraday PnL, etc. Unfortunately I can't really find any project that does this out of the box. Would be nice if Graphite or Kibana could be easily shoehorned into doing this...
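For the syslog piece, the application side looks like it should only be a few lines. Something like this, assuming Python and the usual local syslog socket at /dev/log ("quoter" is just a placeholder logger name):

# Route application logging to the local syslog daemon instead of stdout.
# Assumes a Unix domain socket at /dev/log (the usual location on Linux).
import logging
import logging.handlers

logger = logging.getLogger("quoter")
logger.setLevel(logging.INFO)

handler = logging.handlers.SysLogHandler(address="/dev/log")
handler.setFormatter(logging.Formatter("quoter[%(process)d]: %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("fill sym=MSFT px=92.17 qty=100")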

Good questions outrank easy answers. -Paul Samuelson

goldorak


Total Posts: 1042
Joined: Nov 2004
 
Posted: 2018-01-08 12:47
I would be interested to see proposals on that topic too actually!

As a side note, and this will be my very modest contribution, my experience over the years has told me that the worst always happens silently... somewhere deep inside the systems and stays undiscovered and unlogged until sh... hits the fan. As a result I have been a supporter of jobs always systematically and politely saying: "I am done now".

If you are not living on the edge you are taking up too much space.

Maggette


Total Posts: 1047
Joined: Jun 2007
 
Posted: 2018-01-08 13:05
Non-Trading applications, but probably still interesting:
we use the ELK stack a lot. And grafana.

In addition, we have started to blur the line between data output and logs a bit.

We have some metadata (counts, averages, medians, standard deviations and ranges of values and of processing times, deltas of these values against historical data, "hasFinished" flags with timestamps, etc.) written to a database as structured data.

This makes it straightforward to create supervisor and sanity-check jobs, as well as jobs that generate technical reports and dashboards.

Here we use Scala, Python, the database itself and (I am embarrassed to admit) Jenkins for alerts and e-mails.
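To make that concrete, one of those metadata writes looks roughly like the sketch below (sqlite and the job_runs table/column names are stand-ins for our actual database and schema):

# Rough sketch of the "metadata as structured data" idea: each job writes a
# summary row that supervisor/sanity-check jobs can later compare to history.
import sqlite3
import time

conn = sqlite3.connect("pipeline_meta.db")  # placeholder; in reality our normal DB
conn.execute("""
    CREATE TABLE IF NOT EXISTS job_runs (
        job_name      TEXT,
        finished_at   REAL,
        row_count     INTEGER,
        value_mean    REAL,
        value_median  REAL,
        proc_seconds  REAL,
        has_finished  INTEGER
    )
""")

def record_run(job_name, row_count, value_mean, value_median, proc_seconds):
    conn.execute(
        "INSERT INTO job_runs VALUES (?, ?, ?, ?, ?, ?, 1)",
        (job_name, time.time(), row_count, value_mean, value_median, proc_seconds),
    )
    conn.commit()

record_run("eod_loader", row_count=125000, value_mean=3.2, value_median=2.9, proc_seconds=41.7)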

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

victorin


Total Posts: 6
Joined: Oct 2010
 
Posted: 2018-01-09 12:53
This is what we use for monitoring our infrastructure (Micro$oft .NET and python shop).
Besides an automated trading systems marketplace, we run a traditional online brokerage business. I don't have any reference to compare scales, but we're generating around 100GB of logs per day and 15,000 realtime metrics.

Everything is open source, running on auto-pilot with almost no maintenance needed :)

Logging
=====

At the application level we verbosely log almost everything using log4net. We started out saving to daily rolling text files, but soon ran into disk issues, so we created an appender that fires and forgets all log traces to rabbitmq.
These traces are consumed by a small process that indexes them into Elasticsearch.
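The consumer side is tiny. Ours is .NET, but a Python equivalent would look roughly like this (queue and index names are made up):

# Drain log traces from rabbitmq and index each one into Elasticsearch.
import json
import pika
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="log_traces", durable=True)

def on_message(ch, method, properties, body):
    doc = json.loads(body)  # one log trace per message
    es.index(index="app-logs", document=doc)  # 'document=' in elasticsearch-py 8.x; older clients use body=
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="log_traces", on_message_callback=on_message)
channel.start_consuming()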

We're evaluating a migration from log4net to serilog to get structured log traces, which are much easier to deal with at the monitoring stage.

Metrics
=====

We used graphite for a couple of years. A pain in the ass to install, but super powerful for transforming, combining and performing computations on series data.

Again we faced some scalability issues (disk bottleneck), and moved to influxdb (on windows!). Extremely easy to install, and telegraf (influxdb plugin-driven agent for collecting and reporting metrics) is just awesome.

The downside is that influxdb is much worse at metric aggregation and computation. We've had a hard time trying to replicate the real-time dashboards we had with the same graphite metrics.
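To give an idea of how simple the ingestion side is, a raw write against the 1.x HTTP API looks roughly like this (in practice telegraf does all of this for us; the database, measurement and tag names below are invented):

# Push one point into InfluxDB using the line protocol over HTTP.
import time
import requests

def write_metric(measurement, tags, value):
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    line = f"{measurement},{tag_str} value={value} {int(time.time() * 1e9)}"
    resp = requests.post(
        "http://localhost:8086/write",
        params={"db": "metrics", "precision": "ns"},
        data=line,
    )
    resp.raise_for_status()

write_metric("order_latency_us", {"host": "prod01", "venue": "xetra"}, 37.5)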

Monitoring
======

Grafana. period.

Very easy to install, and super powerful. You can add multiple datasources (we've used graphite, influxdb and Elasticsearch) and combine metrics into real-time dashboards. You can visualize time series, create single-panel metrics with alerting colors and much, much more.

We do use kibana for some manual Elasticsearch log analysis, but once you identify something you want to monitor in real time, don't hesitate: grafana.

Alerting
=====

Years ago we developed a homemade alerting system based on some database flags. I've done some tests with grafana's alerting module and it's quite impressive. Out of the box you get email alerting (with fancy image charts attached), but the great thing is that there are plenty of channel plugins (telegram, slack and so on).

For telephone/SMS we use traditional SMS providers, but if I had to choose now I'd have a look at twilio.
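For reference, the twilio version would only be a few lines — a sketch, since we haven't actually switched (the SID, token and numbers are obviously placeholders):

# Fire an SMS alert through twilio's REST client.
from twilio.rest import Client

client = Client("ACXXXXXXXXXXXXXXXX", "your_auth_token")
client.messages.create(
    to="+34600000000",
    from_="+15005550006",
    body="ALERT: order gateway heartbeat missed for 60s",
)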

My two cents!

indiosmo


Total Posts: 12
Joined: Jun 2011
 
Posted: 2018-04-11 13:36
I've been doing some labs lately to move from a similar scenario (ssh into servers and grep) to something more practical.

I'm testing two separate pipelines.
Each trading server runs a filebeat agent that collects FIX logs in realtime and dumps them into a kafka topic as is, line by line.

#Pipeline 1 (Statistics and Monitoring)
Kafka -> Logstash -> ElasticSearch

Logstash parses the FIX logs to json using https://github.com/connamara/logstash-filter-fix_protocol.
At the end I use Kibana for easily searching through logs and Grafana for plotting stuff like orders/sec, rejects/sec, etc.

#Pipeline 2 (PnL and Trade database)
For this I have a custom python service that gets trades from the kafka topic and market data from our ticker plant.
The service calculates positions and marks to market in realtime and pushes to a kafka topic.
Another service reads this topic and writes to Postgres.
Then I use Grafana to query periodically and update the PnL on the dashboards.
For actual production workloads this might be a lot of data and I'm looking into the TimescaleDB extension in case Postgres chokes.
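The core of that python service is basically the following (a heavily stripped-down sketch using kafka-python; the topic names and message format are simplified, and the real service marks against the ticker plant rather than the last fill):

# Consume trades, maintain net position and mark-to-market PnL, publish downstream.
import json
from collections import defaultdict
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "trades",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

position = defaultdict(int)    # symbol -> net quantity
cash = defaultdict(float)      # symbol -> realized cash flow
last_px = {}                   # symbol -> current mark

for msg in consumer:
    trade = msg.value          # e.g. {"symbol": "PETR4", "qty": 100, "px": 31.25}
    sym, qty, px = trade["symbol"], trade["qty"], trade["px"]
    position[sym] += qty
    cash[sym] -= qty * px
    last_px[sym] = px          # placeholder mark; real marks come from market data
    pnl = cash[sym] + position[sym] * last_px[sym]
    producer.send("positions", {"symbol": sym, "position": position[sym], "pnl": pnl})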


For application logs I haven't had the time yet, but I want to try something similar to pipeline 1: just write to Elasticsearch and use https://github.com/sivasamyk/logtrail for live tail and grep.

EspressoLover


Total Posts: 320
Joined: Jan 2015
 
Posted: 2018-05-30 19:33
Finally got around to doing this properly. It turned out to be a fun little project, and I thought I'd add my personal retrospective. Thanks so much to everyone who contributed in this thread. All of the suggestions were great, and even when I didn't use the specific tools, seeing the case studies really helped me build a mental framework for this domain.

Approaching the Problem

Given the Cambrian explosion of tools written by the hordes of hipster devops at SV unicorns, getting started can be overwhelming. On top of that, the boundaries between project categories can be fairly nebulous. There are lots of products that are both complements and substitutes to each other. There are lots of products that span categories in weird ways, so that if you use X you might not need Y, but if you replace X with Z then you definitely need Y, but can drop T. It makes it hard to plan out a full stack.

That being said, as someone who was a total neophyte I really thought this podcast was a good way to get into the mindset of people who do this for a living. James Turnbull also wrote a book (Art of Monitoring), which was good, but promoted his hobbyhorse project too much.

An important question to keep in mind: are your servers pets or cattle? If you're monitoring a handful of co-located trading boxes, then it's definitely the former. And that makes your needs very different than the typical web startup using an elastic cloud. Another question is which metrics and processes you really care about. It's easy to get feature bloat trying to track everything under the sun, but it's probably not worth the effort. Finally, push- versus pull-based collection is one of the big ideological divisions in this space, so at least be aware of the difference and the pros/cons.

Circuit Breakers

This was outside the scope of my project, but I just want to echo goldorak here. Monitors and alerts are no substitute for safe behavior inside the application. When a box is touching real money, it absolutely needs built-in safeguards. Millions can be vaporized in milliseconds, and the best dashboard in the world isn't going to fix that. Bugs are going to crop up and do crazy shit in well under human reaction time. Monitoring only exists to help the system operator clean up after the fact.

Logging

Ultimately I decided that ELK was probably overkill. My chosen solution was a lot more rudimentary: log output to stdout, save a single compressed flat file per quoter instance, then stash the individual files in S3 tagged by date, market and server. Anything that needs to be formally structured (like market data or trading activity) can be pulled out in an ETL pipeline at end-of-day.

I think logstash is a great product, but I don't think you really need it in a trading context. First, you're not running continuously, so real-time ingestion isn't really necessary; formal, persistent, structured data is probably only required after the market closes for the day. Second, you probably do everything in a single binary and don't need to collect data from a bunch of disparate sources and microservices. Third, Elasticsearch is good when you have a deluge from a bunch of undifferentiated hosts: something weird might happen and you have no idea where it came from, so you need the ability to carefully search everything. In contrast, if you want to know what happened with MSFT yesterday, you already know which quoter traded it. It's as simple as grabbing the specific log file from S3 (or the production machine, if it's before end-of-day rotation).
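The end-of-day rotation amounts to a few lines of boto3 (sketch only; the bucket name and paths are placeholders):

# Gzip one quoter instance's log and stash it in S3, keyed by date/market/server.
import gzip
import shutil
import boto3

def archive_log(log_path, date, market, server, bucket="quoter-log-archive"):
    gz_path = log_path + ".gz"
    with open(log_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    key = f"{date}/{market}/{server}/{gz_path.split('/')[-1]}"
    boto3.client("s3").upload_file(gz_path, bucket, key)

archive_log("/var/log/quoter/msft_quoter.log", "2018-05-29", "nasdaq", "prod01")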

Monitoring

Prometheus for the backend, Grafana for the frontend. I'd recommend Grafana hands down. It just works, and produces beautiful, responsive dashboards. Dead simple to set up.

At first I was hesitant to go with Prometheus. My intuition was that I really needed something check-based or event-based, rather than a metric-based solution. I looked at Nagios/icinga/sensu. Nagios is terrible, and no new project in the 21st century should be using it. The other options are big improvements, but still inherit nagios' core design flaws. Ultimately I think event-based is just a flawed paradigm. Almost all the solutions suffer from some combination of painfully inefficient setup, heavyweight resource usage, ill-defined data models, confusing separation between collection and storage, poor scalability, and/or unreliability.

In Prometheus "everything is a metric", which seems unintuitive at first, but is a more powerful data model. For one, that means everything drops out into nice time series that you can throw up on a Grafana dashboard. That makes diagnosing problems and understanding behavior a lot easier. It also means that you don't need to maintain separate pipelines for continuous metrics and discrete events.

With Prometheus you get a pretty rich metadata model and query language that lets you turn metrics into events and vice versa. Sync'ing data between servers is dead simple. Nagios-style checks can easily be done with the pushgateway just by piping your pre-existing scripts to curl.

That technique is also how I folded my pre-existing check scripts into the monitoring framework. Rather than adding an explicit scrape endpoint to the quoter binary, it was really simple to just have cron periodically scrape each instance's stdout logs and push the relevant metrics to the pushgateway.
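For what it's worth, the same thing is just as easy from Python with prometheus_client instead of curl. This is roughly what my cron job does after parsing the latest log lines (the metric and job names are mine, nothing standard):

# Push per-instance stats scraped from the quoter logs to the pushgateway.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def push_quoter_stats(instance, pnl, open_orders):
    registry = CollectorRegistry()
    Gauge("quoter_pnl_dollars", "Mark-to-market PnL scraped from the log",
          ["instance"], registry=registry).labels(instance).set(pnl)
    Gauge("quoter_open_orders", "Open order count scraped from the log",
          ["instance"], registry=registry).labels(instance).set(open_orders)
    push_to_gateway("localhost:9091", job="quoter_cron", registry=registry)

push_quoter_stats("msft_quoter", pnl=1234.56, open_orders=7)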

There are a few trading specific things that Prometheus isn't designed to natively deal with. E.g. I don't care about alerts that occur outside market hours. The good news is the metadata and query language is flexible enough to handle this. But you may have to scratch your head for a second, and read the docs. Overall Prometheus takes 30 seconds to get running, but there's a decent learning curve for its deeper features.

Metrics

The thing I like about Prometheus is that both the time-series database and the collection agent live in the same system. If you use Graphite or InfluxDB to store metrics, then you probably need some sort of scraper or collection agent like statsd. Consolidating these two functions simplifies the stack, allows for richer metadata, and makes collection easier.

What you give up is having a professional-grade TSDB. You could use InfluxDB for long-term persistent storage of data, and even heavy-duty research. Like, I wouldn't use Prometheus for backtesting, but I might use InfluxDB for that.

But for me it was helpful to divide data into two buckets. Use Prometheus to store ephemeral but real-time metrics. Recognize that these are only used for production monitoring, with no need to keep them clean or maintain them for more than a day or two. Then ingest long-term persistent data into a separate structured solution at the end of each day. I think this is a better approach, but you do give up some cool capabilities by forgoing a unified system that ingests in real-time.

Alerts

Pagerduty pretty much just works. I'd definitely recommend it, although you have to pay $10-30 per person per month. It's got pretty much every possible feature you'd need. It routes to phone, email, SMS, and slack. It integrates natively with Prometheus, and pretty much all the other monitoring solutions. Straightforward setup. It allows you to pre-define on-call schedules, and set escalation policies of arbitrary complexity. My only criticism is that maybe it comes with too many superfluous features. (I don't really need to see a chart of notifications broken down by phone vs email over time.)

Anyway, unless money is tight, forget about rolling your own alert scripts. Just use Pagerduty.
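(If you ever do need to fire a page from your own code, the Events API v2 is about as simple as it gets. A sketch below — the routing key and source are placeholders, and in my setup Alertmanager's native PagerDuty integration does this for me.)

# Trigger a PagerDuty incident via the Events API v2.
import requests

def trigger_page(summary, severity="critical"):
    resp = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": "YOUR_INTEGRATION_KEY",
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": "quoter-prod01",
                "severity": severity,
            },
        },
    )
    resp.raise_for_status()

trigger_page("quoter heartbeat missing for 60s")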

Good questions outrank easy answers. -Paul Samuelson

Maggette


Total Posts: 1047
Joined: Jun 2007
 
Posted: 2018-06-04 09:55
Thanks for that retrospective! Very much appreciated!

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...