zeromq and friends for low latency systems communication


Total Posts: 222
Joined: Jul 2013
Posted: 2020-06-01 15:29
Hi guys

I am designing a low latency system from scratch and would like to open a discussion on how to disseminate data in and out of a market data ingesting and order routing system which feeds strategy engines living in different machines (xconnected).

One of my guys said zeromq, rabbitmq etc. are fine and add very little latency (hard to say how much on paper). The other advantage is that the large amount of data we ingest and send out can then be processed in parallel, as this is managed by the queuing system itself, so there is no bottleneck in the FIX encoding/decoding.

To me it sounds a bit too good to be true, so I would like to know if anybody has experience with this.

We are handling a ton of data from FIX sessions from/to many (6-10) FX ECNs, so it is not a typical single-exchange connectivity project.

"amicus Plato sed magis amica Veritas"


Total Posts: 361
Joined: Feb 2014
Posted: 2020-06-01 15:53
@rickyvic: you can't throw zeromq and rabbitmq into the same group, as they are not related to each other. The rabbit one adds a shitload of latency and requires an insane amount of CPU time, which will prevent you from scaling it and using it in any real-time trading. zeromq will work, but you need to know from the start what the setup will be, as it's tightly coupled with the actual low-level code that your company will write around receiving and sending data. This includes service discovery, parallelism, connection partitioning/sharding/channels, recovery and failover scenarios, active/active or active/passive architecture handling, the callback for missing messages (TCP backpressure or lost msgs), etc.

First Commander of the USS Enterprise


Total Posts: 379
Joined: Mar 2018
Posted: 2020-06-01 16:59
why use rabbitmq in this case and not kafka?


Total Posts: 25
Joined: Apr 2011
Posted: 2020-06-01 18:20
zeromq - think of this as a replacement for unix message queues built around unix sockets. Like @svisstack said it's tightly coupled to what your low level code looks like. Don't think it offers persistence, that is, if your system goes down, all your data goes down with it. Upside is it's fast.

kafka - Works nicely if you've got a bunch of writers and a bunch of readers. If you want your data to be persisted in the case of a failure, and have a nice way of partitioning your data, this is a good choice (also they've recently taken out the zookeeper dependency). Downside is, it's a pain to set up, you probably need someone to manage it, and from my experience it adds a couple of ms latency. The language APIs are also pretty decent (I used the C one, I'm sure the Java one is even better)

rabbitmq - never used it, but seems more flexible than kafka (i.e. request/response type stuff), but slower and not as widely used

Just my 2 cents (honestly I'd ask around your organization and see what other people are doing)


Total Posts: 18
Joined: Jun 2011
Posted: 2020-06-02 00:18
I don't think I'd classify any of the given options as low latency.

Especially for market data distribution, and if your servers are all in the same network, TCP is also probably not the way to go and you'd be better off with something based on UDP multicast, as most exchanges do.

As for zeromq (granted it's been a few years since I last used it), even though it should be the fastest of the 3, it still doesn't give you much control over how to run the event loop, and you have to rely on it making system calls like epoll() under the hood. You can't busy spin in user space, for instance.
Even using something like OpenOnload is not easy due to how threads interact, and you can easily end up deadlocking.

Lately I've been looking into aeron.
It implements tcp-like capabilities on top of udp (reliability, flow control, etc) with very low latency by means of shared memory and mmap files.


Total Posts: 459
Joined: Jan 2015
Posted: 2020-06-02 04:04
Low latency means very different things depending on who you ask. To a web guy, low latency might mean 100 milliseconds. To someone doing custom FPGAs, it might mean 100 nanoseconds. If you're the former, RabbitMQ is fine; if you're the latter, it definitely is not.

The real question is what's the underlying reason you need low latency? Are you trying to achieve latency supremacy in an open market? E.g. Capturing first position in the queue or latency-arbing between venues? Or are you simply trying to hit some pre-defined threshold? Like rebalancing at 1-second intervals or taking advantage of a last look window as a liquidity provider (you mentioned this was FX)?

My opinion is if it's the latter, then you pretty much know your SLA requirements already. Or at least it shouldn't be too hard to figure them out. Once you have those, most of the major stacks have publicly available benchmarks (including the ones you're looking at). So just pick the best fit that clocks in under your SLA.

But if you're living in the Hobbesian jungle that comes with playing the latency supremacy game, then you probably have to go back to the whiteboard. I can't say for sure how competitive your market is, but unless you have some sort of monopoly, I'd assume you have competitors measuring tick-to-trade latency in the tens of microseconds. (Be careful just looking at the spacing of events in the data feed. That's not an accurate reflection of latency requirements since it incorporates both client- and exchange-side latency. Plus if I recall a lot of FX ECNs coalesce events in the datafeed.)

If you're aiming to be the fastest gun, I think the architecture's a non-starter. The problem is that you're introducing unnecessary hops between machines in the tick-to-trade path. It doesn't matter what the transport-layer looks like. The guy you're racing against is running the whole stack not only on the same machine, but the same thread on the same binary. Secondarily if it's multi-venue, you likely need microwave uplinks to truly compete. Or at the very least colocation at each separate datacenter. Finally if the ECNs expose native protocols, you won't be able to compete while running FIX.

Anyway, it really all depends on what your underlying objectives are. Low-latency is certainly a rabbit hole that you can keep digging further and further to eke out increasingly marginal gains. So, it's important to define a clear picture of what constitutes acceptable performance before you begin.

Good questions outrank easy answers. -Paul Samuelson


Total Posts: 1268
Joined: Jun 2007
Posted: 2020-06-02 09:05
As a small, slightly off-topic note: if you happen to decide that Kafka's latency is good enough and you want to use Kafka because of its other advantages... I suggest giving Apache Pulsar a closer look. IMHO it is the vastly superior design.

I have a Pulsar application that feeds a Flink application, which does some CEP and serves an ML model and an optimization model. It has been in production for almost a year and I am quite happy so far.

Ich kam hierher und sah dich und deine Leute lächeln, und sagte mir: Maggette, scheiss auf den small talk, lass lieber deine Fäuste sprechen...


Total Posts: 89
Joined: Jul 2018
Posted: 2020-06-02 09:22
@EL Great points - I'm assuming the OP is not playing the speed game as they mention they are new/small and there is no room for new players in that game.

I was in exactly this position (small hft, fast but not jump/virtu fast). ZeroMQ is a great piece of kit in this position - saves a ridiculous amount of time and still very fast. Great flexibility in routing patterns with the pub/sub, req/rep, router/dealer patterns. Our system was distributed with many components in different countries - the patterns make this pretty easy to handle from a socket perspective.

did you use VWAP or triple-reinforced GAN execution?


Total Posts: 222
Joined: Jul 2013
Posted: 2020-06-02 15:57
Thanks for the many good answers.
We do market making on major ECNs, so we try not to be too far back in the queue, but we are not competitive with the big guys.
For the lower-latency stuff we do have a dedicated machine that does everything inside the same machine, same thread and so on... (I am not the guy who coded it).
We have colo with dedicated xconnects.
We don't arb or compete on hitting two sides on different ecns, we try to play a different more relationship game which is non toxic.

This new system is more for strategies that don't mind a few microsec of added latency.
I could live with 20 microsec of added latency in total honesty, so I was looking for an easy solution to disseminate this functionality to different engines.

Another way to send a book snap to the consumers is FIX. What do you think? Then I can just maintain a buffer of recent snapshots within my code.

"amicus Plato sed magis amica Veritas"


Total Posts: 74
Joined: Jul 2018
Posted: 2020-06-03 08:46
> Another way to send a book snap to the consumers is fix.

If you've already built the book on 1 data ingestion server, I wouldn't reserialize the book delta as a FIX message because that creates unnecessary work. Presumably you have an internal data structure for the book delta that resembles a C struct, you can just write that as your message payload.
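
To make that concrete, here is a rough sketch of what a fixed-layout payload could look like, written in Python with the standard `struct` module only so it's easy to try out. The field names, widths and the 1e-5 price scale are all assumptions for illustration, not anyone's actual wire format.

```python
import struct

# Hypothetical fixed layout for one book delta (little-endian, no padding).
# Field names, widths and the 1e-5 price scale are assumptions for illustration.
#   seq    uint64   sequence number stamped by the ingestion server
#   ts_ns  uint64   receive timestamp, nanoseconds
#   symbol char[8]  NUL-padded ASCII, e.g. "EURUSD"
#   side   char     "B" or "S"
#   level  uint8    depth level, 0 = top of book
#   price  int64    fixed-point price (units of 1e-5)
#   qty    int64    quantity now resting at that level
BOOK_DELTA = struct.Struct("<QQ8scBqq")


def encode_delta(seq, ts_ns, symbol, side, level, price, qty):
    """Pack one delta into a fixed-size payload."""
    return BOOK_DELTA.pack(
        seq, ts_ns, symbol.encode("ascii"), side.encode("ascii"), level, price, qty
    )


def decode_delta(payload):
    """Unpack a payload produced by encode_delta."""
    seq, ts_ns, symbol, side, level, price, qty = BOOK_DELTA.unpack(payload)
    return {
        "seq": seq,
        "ts_ns": ts_ns,
        "symbol": symbol.rstrip(b"\0").decode("ascii"),
        "side": side.decode("ascii"),
        "level": level,
        "price": price,
        "qty": qty,
    }
```

With this layout every delta is 42 bytes, versus a FIX message that has to be formatted and then parsed tag by tag. A C peer would need the same packed layout and byte order on its side.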


Total Posts: 222
Joined: Jul 2013
Posted: 2020-06-03 12:57
Basically yes, a C structure needs to be shared with another process in a C program.
I was thinking myself of simple shared memory. Since it is static information I can then sample instead of reading messages.

"amicus Plato sed magis amica Veritas"


Total Posts: 459
Joined: Jan 2015
Posted: 2020-06-03 16:14
One approach you may want to consider is off-loading some of the latency-critical, low-level logic to the edge nodes. Especially since you already have the facility to run a full trading stack in-core.

Think about it this way. Right now you have your edge nodes which handle market data parsing and order routing. Behind those sit your strategy engines, which contain heavy-weight logic and make all trading decisions. But I'm guessing there are certain times when the effective trading logic is simple and latency critical. That could be offloaded to lightweight fast-responders that live directly in-core inside the edge node.

The analogy here is autonomic reflexes in the human body. There are certain actions that are so important and time-critical that the nervous system has built in machinery without the latency of conscious processing. Similarly there may be times your strategies are following simple but latency critical directives.

Say you want to buy 1000 lots by joining the best bid. If the price moves, you want to quickly move your order to stay with the touch. That's an easy-to-implement directive, but managing it through the slow strategy-engine path means that you'll wind up with worse queue positions and fills. So instead of directly managing those quotes, the strategy engine asynchronously tells the edge system that its directive is to quote 1000 lots at the best bid until filled or the directive is countermanded.
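
A minimal sketch of what such a directive might look like inside the edge node, in Python for readability. The class name, the action tuples and the callback names are hypothetical; a real fast-responder would live in the same process as the feed handler and would also need risk checks and order-state handling that are left out here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Action = Tuple[str, int, int]  # (kind, price, qty); hypothetical action format


@dataclass
class JoinBestBid:
    """Directive: keep `remaining` lots working at the best bid until filled."""

    remaining: int
    resting_price: Optional[int] = None
    active: bool = True

    def on_best_bid(self, best_bid: int) -> List[Action]:
        """Called by the edge node whenever the best bid is (re)published."""
        if not self.active:
            return []
        if self.resting_price is None:
            self.resting_price = best_bid
            return [("new", best_bid, self.remaining)]
        if best_bid != self.resting_price:
            # The touch moved: follow it without asking the strategy engine.
            self.resting_price = best_bid
            return [("replace", best_bid, self.remaining)]
        return []

    def on_fill(self, qty: int) -> None:
        self.remaining -= qty
        if self.remaining <= 0:
            self.active = False

    def countermand(self) -> List[Action]:
        """Called when the strategy engine withdraws the directive."""
        if not self.active:
            return []
        self.active = False
        if self.resting_price is None:
            return []
        return [("cancel", self.resting_price, self.remaining)]
```

The strategy engine only sends the directive and, later, possibly a countermand. Everything in between happens on the market data path inside the edge node.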

In a lot of ways this gives you the best of both worlds. You can identify the small subset of the most latency critical logic and move it into the edge nodes. Yet you still have complex, heavy-weight, frequently changing strategy code running on segregated machines. You don't have to worry about the performance or reliability of frequently-changing strategy code bringing down the entire edge node. Managing strategies as services across the network is a lot cleaner from a devops perspective.

Good questions outrank easy answers. -Paul Samuelson


Total Posts: 222
Joined: Jul 2013
Posted: 2020-06-03 22:05

Thanks for this... I find it particularly in line with what I had in mind.
The market making that is more "reflex" stays in a box that runs autonomously and in process with no external communication.
That is highly optimised but also difficult to change and delicate so to speak. That is done and out of my hands.

For the more complex and always-changing logic, potentially with external IP at some stage, etc... I was going to use a subset of the data/order routing API decoder/encoder, then disseminate the market data to, and receive orders from, the workers, which are most likely on one or many different machines.

Basically as you just explained it.

So thanks for pointing this out... my question was most likely not as clear as the way you posed it: how do I communicate between these nodes, and do I handle the job of parsing FIX messages and disseminating C struct data in and out on one machine only?

The two proposed designs are:

1. aggregator distributor on server 1, disseminates to xconnected strategy server 2, 3, 4, etc

2. an open architecture with native mqtt (or I would say zmq) will run on potentially multiple machines and will take care of the messages and will disseminate to the strategies in different boxes too.

Now I think the first one, with the aggregator distributor on server1 and zeromq sending messages to the xconnected workers, seems reasonable speedwise, but is it reliable? Big question.
Either I add 10 microsec and sleep in peace, or I add 1 microsec and don't sleep at all at night.

Am I thinking right or should I look into udp multicast as suggested and lower level stuff?

"amicus Plato sed magis amica Veritas"


Total Posts: 459
Joined: Jan 2015
Posted: 2020-06-04 19:17
One thing I think is worth doing is reading through the technical documentation for ITCH or a similar system. Not because you want to replicate it, but rather because the system architects thought a lot about the problem domain.

Beyond even the protocol considerations there are some fundamental questions that you should think about if you haven't already:

* What do you actually need represented at client-side? Do you need the full order book? Or is a price book, possibly with N top-levels sufficient?
* What kind of time granularity is needed client-side? Do they need the full unaltered history with every update or are they fine with throttled updates during bursts or even periodic snapshots?
* Do you intend to distribute snapshots or deltas in the messages? Deltas are smaller and faster, but you need a reliable recovery system, because one dropped packet corrupts the book forever.
* Along those lines, how do you intend to deal with recovering from missed messages? Consider everything from a single dropped packet all the way to a 30 minute network partition.
* How do you intend to deal with late-joiners? Do you re-broadcast the full history or consolidate a snapshot? Even if you don't intend to support late-joining, sometimes client systems will need to be bounced in the middle of the day.
* What's the average messaging rate? Peak messaging rate?

ITCH-style multicast works very fast and is pretty reliable when set up correctly. With high-end hardware and correctly tuned settings, packet loss should be under one in a hundred million. That being said the biggest risk of message loss isn't the network stack, it's the client systems themselves. If someone pushes some slow strategy code that can't process peak market data rates, then the best network stack in the world won't make a difference.

To second @doomanx, ZeroMQ is rock-solid and a lot easier to work with than multicast. It won't be the part of the stack to lose messages. That being said with a bunch of heterogeneous client systems, sooner or later one will fall over during peak activity and need to recover. Conceptually this shouldn't look any different than a late-joining client that starts fresh.

ITCH solves this by broadcasting deltas, then having an out-of-band request-reply snapshot service. A late-joiner or recovering client starts buffering the delta stream. Then it requests the latest snapshot over a TCP session. The snapshot is stamped with the sequence in the delta stream. All the buffered deltas *after* that stamp are applied, and the client is then current. This is definitely an approach to consider. You combine a low-latency fast broadcast (say ZeroMQ), with a slow but reliable out-of-band system to let clients catch up (say RabbitMQ or Kafka or even a REST API).
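
The catch-up step can be sketched in a few lines of Python. This is a simplified illustration under assumed data shapes (a price-to-quantity dict for the snapshot, `(seq, price, qty)` tuples for deltas, with `qty == 0` meaning the level is removed), not the ITCH message format itself. The function name is made up.

```python
def recover_book(snapshot_seq, snapshot_levels, buffered_deltas):
    """Rebuild a price book from a snapshot plus deltas buffered while waiting for it.

    snapshot_levels: dict mapping price -> qty as of snapshot_seq
    buffered_deltas: iterable of (seq, price, qty); qty == 0 removes the level
    Returns (book, last_applied_seq).
    """
    book = dict(snapshot_levels)
    last_seq = snapshot_seq
    for seq, price, qty in sorted(buffered_deltas):
        if seq <= last_seq:
            continue  # already reflected in the snapshot, or a duplicate
        if seq != last_seq + 1:
            # A hole between snapshot and buffer: the caller needs a newer snapshot.
            raise LookupError(f"gap: expected {last_seq + 1}, got {seq}")
        if qty == 0:
            book.pop(price, None)
        else:
            book[price] = qty
        last_seq = seq
    return book, last_seq
```

The important detail is that the client starts buffering before it requests the snapshot. Otherwise there can be a hole between the snapshot's stamp and the first buffered delta, which is the case the `LookupError` guards against.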

Another option is to bake recovery/durability directly into the ZeroMQ system. That certainly simplifies things by having to only manage one system. You can use the Majordomo pattern as a broker with different levels of durability, but that does add latency overhead. One thing to be aware of is that if recovery involves replaying the entire day, then you need to persist *a lot* of data in the broker. ZeroMQ is not optimized for that much durability. ITCH gets around this by only snapshotting the orders that are currently open in the book. However, you need to build this yourself. No messaging protocol will do this out of the box, since it requires awareness of how to construct an order book.

Finally an option is to just flatten the representation to the point that you don't need historical recovery. Obviously this is impossible with a stream of order-book deltas. But maybe the clients don't need anything beyond the top-5 levels in the level book. Then you can easily fit the full snapshot in a single message packet. If a buffer overflows or client disconnects, then it instantly has an uncorrupted view of the book on the next message in the stream.

Worst case a client loses some history that it's dependent on for trading logic. But if you append a sequence stamp to each message, then the client's aware of this and can act accordingly. Losing a message in a normally functioning client should be a rare enough event that the occasionally censored history doesn't make a meaningful difference to strategy performance.
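
Here is a small Python sketch of the flattened approach: every message carries a sequence stamp plus the full top-5 on both sides, and the client just counts what it missed. The layout and the names are assumptions for illustration. At 168 bytes the whole snapshot fits comfortably in a single packet.

```python
import struct

LEVELS = 5
# Hypothetical layout: uint64 seq, then 5 bid (price, qty) pairs, then 5 ask pairs.
TOP_OF_BOOK = struct.Struct("<Q" + "qq" * (2 * LEVELS))  # 168 bytes


def encode_snapshot(seq, bids, asks):
    """bids/asks: lists of exactly LEVELS (price, qty) tuples, best level first."""
    flat = [value for level in bids + asks for value in level]
    return TOP_OF_BOOK.pack(seq, *flat)


class SnapshotClient:
    """Keeps only the latest snapshot and counts messages it never saw."""

    def __init__(self):
        self.last_seq = None
        self.missed = 0
        self.bids = []
        self.asks = []

    def on_message(self, payload):
        """Apply one snapshot; return how many messages were skipped before it."""
        seq, *flat = TOP_OF_BOOK.unpack(payload)
        if self.last_seq is not None:
            if seq <= self.last_seq:
                return 0  # duplicate or out-of-order: ignore
            gap = seq - self.last_seq - 1
        else:
            gap = 0
        self.missed += gap
        levels = list(zip(flat[0::2], flat[1::2]))
        self.bids, self.asks = levels[:LEVELS], levels[LEVELS:]
        self.last_seq = seq
        return gap
```

A non-zero return value tells the strategy that its history has a hole, while the book itself is already correct again because each message replaces the previous state outright.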

Good questions outrank easy answers. -Paul Samuelson


Total Posts: 74
Joined: Jul 2018
Posted: 2020-06-05 01:52
P.S.: I'll send you an email separately, just recovered from sickness and backlogged on inbox, figured this part of the conversation is useful to share since there's other folks who can chime in. -L

Regarding latency, ZeroMQ's internal overhead is quite small. I'd like to say 8-14 mics mean time to send a 1 byte message, from process to wire, on the machines I've tested in the past. And most of it is just coming from the kernel socket implementation - you can keep ZeroMQ and still shed 70+% off that number with a low latency userspace network stack with sockets API for practically no effort. If you have a strong prior that you can live with an extra 20 mics here and your developer is used to working with ZeroMQ rather than raw sockets, I think it's a good starting point. It doesn't lock your application into a particular design that makes it hard to optimize later.

>Now I think the first one with the aggregator distributor on server1 with zeromq sending messages to the xconnected workers seems reasonable speedwise, but is it reliable. Big question.

The way you've proposed using ZeroMQ is like a layer 5 abstraction over TCP/UDP transport to achieve support for 2 more things (i) multiple endpoints per socket and (ii) lightweight semantics of a message queue. The kind of reliability issue (packet loss from network congestion, equipment downtime, protocol-level guarantees of message delivery) you're thinking of is at the transport layer (layer 4) or hardware itself and not with the ZeroMQ library (layer 5), and switching to raw sockets and multicast isn't going to mitigate the issue.

I think the primary tradeoff of using raw sockets and UDP multicast vs ZeroMQ here is one of manageability vs latency. How easy is it for you to engineer (i) and (ii) by your own means, and is it worth it for the latency savings? You need to configure the multicast addresses at the network level, which presumably is outsourced to your hosting provider. (This isn't that bad, it usually need only happen once during initial install.) And working with a stream of bytes could take more development time than working with message frames with boundaries out of the box using the library.

I'd actually try to squeeze everything onto a single host if it's possible. I see a common (sub-optimal) situation where people split onto separate servers on the data side and have 1-2 servers build the book to disseminate it to other strategy servers. That's when you have a software licensing issue that makes it prohibitively expensive to have every server equipped with a feed parser. I also see 3 common (and suboptimal) situations where you want to aggregate the orders on 1 server before sending them to the ECN. That's when session/port charges are prohibitively expensive, the orders need to pass through a common pre-trade risk layer, or you need to internalize orders while keeping teams siloed. I presume it's not because of port charges as FX ECNs are pretty liberal with FIX sessions. e.g. Currenex is free for as many session IDs as you want, and Cboe waives every pair per 25M ADV.

In all of these cases, I see a very high chance that you can still achieve better performance (and much lower colocation costs) pinning your parser and order router, and then multiplexing the rest of your critical path code over a lightweight thread pool and passing events between your virtual threads in inner queues. The network hop is just that much more expensive than CPU time. This isn't a great design for low latency and curiously bears some similarity to ZeroMQ's threading model, but I know this approach can handle 5+% ADV of entire US cash equities and 2,000+ tickers with only 8 cores and 3-5 mics wire-to-wire - and this was nearly 10 years ago. Nowadays with 64 core processors the balance is even more in favor of the single host setup.

If you can't do that and multiple servers are absolutely necessary, and suppose they're at the same site, I'd usually try to fan out the original exchange feeds on the switch and let every server build its own independent book and maintain independent order sessions. Then the outgoing order messages can be muxed onto the port outbound to ECN or to your order internalizer. There's actually a reason for building independent books and order sessions that I consider more important than the latency, which is that making multiple server hops then makes it incredibly inconvenient to simulate properly and sync with your post-trade order logs. That's even if you're not latency-sensitive and can take a 20 mic delay on every order.

If you feel very confident about braving the path of multiple server hops in the critical path, I'd suggest you tap all of your inbound and outbound traffic onto 1 port and capture all of it in 1 place and augment tag 58 on your orders with triggering seqnum, which is practically unique in any short window of time even if you have like 70+ UDP channels like full US equities or something. This is essentially how Corvil appliances work.

For persistence and recovery, +1 to EspressoLover's last post.

FWIW as you know I'm working on the data distribution side project where we have no choice but to disseminate data to multiple servers. It's not meant to be very fast as we intended it to be deployed in similar scale as Bloomberg B-PIPE. We use TCPDirect (yes it can do UDP) and layer 1 filtered multicast between our servers. The inter-server, application-to-application latency is sub-mic over a switched network and the port-to-port latency is under 90 nanos. TCPDirect is easier to use than ef_vi and I like it over Onload for market data because the cache is always warm. Unfortunately they didn't provide a mechanism for warming the cache on TCPDirect so it's about 110-112 cycles slower than Onload on the minimum path for sparse events like order messages. For fast filtered multicast the switch reads into a packet only deep enough to check the filter, for packets that fail the filter it just cuts off and aligns to 64 bytes with 0xff to poison the FCS. The other textbook approach is using RDMA (or RoCE) - Exegy does this for their ticker plants - but I prefer multicast.


Total Posts: 222
Joined: Jul 2013
Posted: 2020-06-06 00:20
@prikolino I hope all is well man with all that's been going on....
@EL thanks a lot things are coming together finally

Ok I am gonna have to study here, there are a number of pros and cons.
This is enough information to get me started.

Love you guys

"amicus Plato sed magis amica Veritas"