Forums  > Software  > Hybrid FPGA  
     

EspressoLover


Total Posts: 451
Joined: Jan 2015
 
Posted: 2020-09-23 18:56
I've been looking to tighten up the latency for an HFT strat. Of course, FPGAs are always a potential answer when it comes to this question. Rather than trying to scrape the software stack down at O(10 uSec), it's tempting to just nuke it from orbit and get the O(1 uSec) that comes with FPGAs.

I don't have the manpower to re-write the entire quoter in Verilog. But the vast majority of latency-critical events seem to be easy to infer at eval time. Things like new level formation, a large trade, or a tick in the index futures. Most of the complex logic can be pre-computed in the software stack. Then the software just asynchronously hands off flat event triggers to the FPGA.

This would seem to vastly reduce the complexity of FPGA development. 99% of the logic stays in the pre-existing quoter software. In most cases the FPGA just acts as a NIC, passing north/south-bound packets to/from the CPU. The FPGA only needs a simple hot path that tests each incoming packet against a set of triggers and, if one trips, injects a pre-defined message into the gateway session. The FPGA layer doesn't have to build the book or even parse anything but a few critical fields.
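To make the handoff concrete, here's a minimal sketch of that hot path, emulated in Python. Everything here is invented for illustration (field offsets, the trigger-table shape, the pre-canned message bytes); a real design would be fixed-latency RTL, but the logic the FPGA needs is no deeper than this:

```python
import struct

# Software side: precomputed triggers handed to the "FPGA" asynchronously.
# Key: (symbol_id, side). Value: (price threshold, pre-canned order bytes).
triggers = {
    (1001, b'B'): (10050, b'<precanned-new-order-msg>'),
}

def hot_path(packet: bytes):
    """Test one incoming packet against the trigger table.

    Parses only three fields at fixed offsets (symbol, side, price),
    exactly as a thin FPGA parser would; no book building, no full decode.
    """
    symbol, side, price = struct.unpack_from('<IcI', packet, 0)
    entry = triggers.get((symbol, side))
    if entry is not None:
        threshold, msg = entry
        if price >= threshold:
            return msg   # inject into the gateway session
    return None          # otherwise pass the packet north to the CPU

pkt = struct.pack('<IcI', 1001, b'B', 10060)
assert hot_path(pkt) == b'<precanned-new-order-msg>'
```

The point of the sketch is how little state lives on the hardware side: a flat match table and a byte blob per trigger, all owned and refreshed by the software stack.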

All of this seems deceptively easy to implement. Of course, there's nothing new under the sun. The idea seems common enough that more than a handful of vendors already sell products that purport to provide something like this out of the box. It's tough to tell much about these products just from Googling, because they only seem to put up light-on-details sales sheets. Anybody with reviews, positive or negative, of any of the products from this space? Pricing also seems to be completely opaque. Overall my bias is leaning towards build instead of buy.

Anyone with general opinions on this topic who can share? (I realize this is skirting competitive proprietary information.) The paradigm sounds pretty simple in theory. But in theory, theory and practice are the same. In practice they're quite different. There are always hidden pitfalls, when you go from whiteboard into the weeds. In particular, I'd care to hear about any "unknown unknown" that I'm overlooking.

I'd also be curious if anyone has experience with this approach in contrast to putting the full quoter stack entirely into the FPGA, i.e. removing the CPU/software layer entirely. Did you think the marginal gains, either in performance or maintainability, from the full-stack FPGA were worth the much steeper development curve?

Good questions outrank easy answers. -Paul Samuelson

Lebowski


Total Posts: 78
Joined: Jun 2015
 
Posted: 2020-09-23 19:41
At some point I really gotta find a way to reciprocate with knowledge but just the fact that you’re rolling one deep over there and find yourself confronted with these problems is great motivation for all of us junior glorified IT boys strapped to the desk. Gratitude and good luck with your FPGA. Hopefully there’s a less junior and more empowered full blown IT man around to make this worth your while.

tabris


Total Posts: 1272
Joined: Feb 2005
 
Posted: 2020-09-24 00:08
In my previous life working with FPGAs, I always found syncing to be the biggest issue if you go hybrid. Basically: either you force delays on calculations in the CPU, or the event flags that come back from the FPGA to the CPU need to be synced, or the timing of new calculations from the CPU that depend on FPGA flags needs to be synced... etc etc
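One common (hedged) way to frame the sync problem tabris describes: tag every trigger download with an epoch, and have the hardware stamp any fired event with the epoch of the trigger that fired, so the CPU can tell whether a returning flag belongs to its current calculation or a stale one. A toy sketch (all names invented):

```python
class TriggerStore:
    """Illustrative CPU<->FPGA sync via epochs, not a real driver API."""

    def __init__(self):
        self.epoch = 0
        self.triggers = {}

    def download(self, new_triggers: dict) -> int:
        """CPU pushes a freshly computed trigger set; bumps the epoch."""
        self.epoch += 1
        self.triggers = dict(new_triggers)
        return self.epoch

    def fire(self, key):
        """'FPGA' side: fire a trigger, reporting the epoch it was armed under."""
        if key in self.triggers:
            return (self.epoch, self.triggers[key])
        return None

store = TriggerStore()
e1 = store.download({'ESZ0-tick': b'order-v1'})
event = store.fire('ESZ0-tick')                    # fired under epoch e1
e2 = store.download({'ESZ0-tick': b'order-v2'})    # CPU recalculated meanwhile
assert event[0] != e2                              # CPU detects the stale flag
```

In hardware the "epoch" is just a counter written alongside the trigger table and echoed in the event FIFO; the sketch only shows the bookkeeping.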

Depending on the calculations, it might be more trouble than it's worth. But I have never tried it in HFT, so my experience might not apply to what you are planning to do.

Dilbert: Why does it seem as though I am the only honest guy on earth? Dogbert: Your type tends not to reproduce.

doomanx


Total Posts: 89
Joined: Jul 2018
 
Posted: 2020-09-24 08:44
@EL we paid for an external provider to build an FPGA to do some very simple stuff (mostly unpacking UDP packets and a few bits of simple logic) and it wasn't crazy expensive. I've never tried to do it myself but I imagine there's a decent learning curve when you get down into the nitty gritty of getting the FPGA interfacing with networking kit and such. So maybe look into a third party solution?

In terms of full-stack FPGA, again only from talking to people, I know there are some shops that rhyme with bump and kerchu that have 100% FPGA strategies, but as you would expect they are simple guns-blazing mechanical strategies.

did you use VWAP or triple-reinforced GAN execution?

nikol


Total Posts: 1198
Joined: Jun 2005
 
Posted: 2020-09-24 10:20
General opinion about O(1us):
Back at the end of the '90s, when I did some work in this field, it was already 100 ns with a data snap every 25 ns (4 internal FPGA pipelines were used). It must be better now.

Split your problem into blocks and work out which one takes the most time.

Processors try to analyze events in full and be first in the queue at the right price slot. You can try using the FPGA to filter out everything except potential opportunities, and let the GPU/CPU make the final decision.
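That filter-then-decide split can be sketched as a cheap fixed-field predicate (all thresholds and field names here are invented for illustration): the "FPGA" stage is only a few comparators, and everything surviving it is queued for full analysis on the CPU/GPU.

```python
def fpga_filter(price: int, size: int,
                band_lo: int, band_hi: int, min_size: int) -> bool:
    """Cheap comparisons only -- the kind of logic that maps to a handful
    of comparators in hardware. True = potential opportunity, wake the CPU."""
    return band_lo <= price <= band_hi and size >= min_size

# (price, size) events; only one falls inside the band AND meets min size
events = [(10010, 5), (10055, 200), (9000, 500)]
survivors = [e for e in events
             if fpga_filter(e[0], e[1],
                            band_lo=10000, band_hi=10100, min_size=100)]
assert survivors == [(10055, 200)]
```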
Main suggestion: the FPGA engineer does not develop the smart algo; his job is to program it efficiently (maybe there are exceptions, but I never met them). In algo development you have to use very "modest" model features:
- The best is to reduce your problem to lookup tables (LUTs), which can possibly be updated dynamically (better not).
- Conceptually, think of the FPGA mesh as a physical representation of the LOB where potential scenarios can be applied.
- Multiplication/division is a no-no. Develop really simple algos with sums, subtracts, accumulations, walk-throughs, LUTs/ifs, etc.
For example, a "natural" model for an FPGA is a binomial mesh for option pricing, or least-squares Monte Carlo for American options, etc. These are not LOB models, but they can give you an idea.
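The "no multiply/divide" rule above has standard workarounds, sketched here in Python with arbitrary example constants: multiplication by a known constant becomes shifts and adds, and division by a known tick size becomes a precomputed lookup table.

```python
def mul_by_10(x: int) -> int:
    """x * 10 computed as (x << 3) + (x << 1): two shifts and an add,
    i.e. cheap combinational logic instead of a DSP multiplier."""
    return (x << 3) + (x << 1)

# Division by a known tick size via LUT: precompute price -> level index
TICK = 25
LEVEL_LUT = {p: p // TICK for p in range(0, 10000, TICK)}

assert mul_by_10(37) == 370
assert LEVEL_LUT[775] == 31   # 775 / 25
```

On real hardware the LUT would be a block RAM indexed by the price field, and the shift-add form would simply be wiring; the Python only shows the arithmetic identity being exploited.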

Recently, I saw this (google/video: Optiver FPGA or high frequency trading FPGA)
https://www.youtube.com/watch?v=RCb8PsdipHI
https://www.youtube.com/watch?v=Kq7Q3PFIcWc

EspressoLover


Total Posts: 451
Joined: Jan 2015
 
Posted: 2020-10-08 17:55
Thanks so much guys! A lot of great points that I wasn't aware of before. I'm still in the early stages here, but you guys pointed out a lot of things that should keep me from wasting time on a wrong avenue. I'll update the board if/when I have more progress to report.

Good questions outrank easy answers. -Paul Samuelson

EspressoLover


Total Posts: 451
Joined: Jan 2015
 
Posted: 2020-10-21 20:46
I erroneously assumed that having an FPGA inject packets into a pre-existing TCP/IP session would be a solved problem with off-the-shelf products. As it turns out, this is the "and then a miracle occurs" step in the plan.

The cheapest quote I've found for a TCP/IP block is $100k. And there don't appear to be any decent open-source solutions. It seems like most people in this space run TCP on a softcore or an on-board hardcore. But the consultants I've talked to are telling me that's 5 uS of latency minimum, which nearly defeats the advantage over the most optimized software stack. (As an aside, I'm starting to wonder if putting flat triggers into C code on a SoC SmartNIC is actually the low-hanging fruit here...)
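Part of why injection into a live session is hard: every injected segment needs a sequence number and a checksum consistent with state the software stack owns. The checksum half at least is mechanical. Below is a sketch (Python for readability, simplified to ignore the TCP pseudo-header) of the standard Internet checksum plus the RFC 1624 incremental update, which lets an injector patch one 16-bit field of a pre-built packet without re-summing the whole segment:

```python
def cksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: 16-bit one's-complement sum, folded."""
    if len(data) % 2:
        data += b'\x00'
    s = sum(int.from_bytes(data[i:i + 2], 'big')
            for i in range(0, len(data), 2))
    while s >> 16:                       # fold carries back in
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def incr_update(old_ck: int, old_word: int, new_word: int) -> int:
    """RFC 1624 eqn 3: HC' = ~(~HC + ~m + m').
    Patch the checksum when one 16-bit word changes from old_word to new_word."""
    s = (~old_ck & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

old = bytes.fromhex('12345678')
new = bytes.fromhex('AAAA5678')          # first word patched
assert incr_update(cksum(old), 0x1234, 0xAAAA) == cksum(new)
```

This is one reason the "FPGA patches a pre-built template" approach is attractive: the hardware only edits a couple of words and fixes up the checksum, while connection state stays in software. The sequence-number half is the genuinely hard part, since it moves with every byte the software side sends.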

That brings me to my question. Anybody who has experience here with TCP/IP on an FPGA and can share? Even outdated perspectives or vague recollections would help me get the lay of the land.

Good questions outrank easy answers. -Paul Samuelson

nikol


Total Posts: 1198
Joined: Jun 2005
 
Posted: 2020-10-22 06:05
Without looking into the TCP/IP thing:
I haven't done it for years, but that would be the weirdest approach I could think of. A direct cable with fixed addressing is the only cure. See the readout, data acquisition, and trigger themes in experimental nuclear (high-energy) physics.

A multilevel trigger system within your C code is the cheaper/faster solution to test the idea. Later you may come back to the FPGA again.

PS. See example discussion
https://www.researchgate.net/publication/221193225_Efficient_PC-FPGA_communication_over_Gigabit_Ethernet

UDP/IP is better.

here are interesting discussions:
https://forums.xilinx.com/t5/PCIe-and-CPM/Write-latency-on-PCIe-from-PC-host-to-FPGA/td-p/541535

https://stackoverflow.com/questions/43964832/why-did-pci-express-suffer-high-latency-in-pipeline-transfer-mode

About characteristics: the reaction of the FPGA is sub-nanosecond, but the signal train travels slowly (hundreds of ns). An additional interpretation difficulty is that everyone measures bandwidth in ZilloBytes/second, not speed of travel (latency). Before thinking about an FPGA you have to work out how data will be represented and how it will be injected. It is not just plug-and-play like a GPU with CUDA (or else) on top.
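The bandwidth-vs-latency distinction is worth a back-of-envelope number. A sketch (standard figures, nothing proprietary): serialization delay of a minimal Ethernet frame at 10 Gb/s versus propagation delay in the cable itself, at roughly 5 ns per meter (signal travels at about 2/3 c).

```python
FRAME_BYTES = 64                 # minimal Ethernet frame
LINK_BPS = 10e9                  # 10GbE line rate

# Time to clock the frame onto the wire
serialization_ns = FRAME_BYTES * 8 / LINK_BPS * 1e9

PROP_NS_PER_M = 5                # ~5 ns/m in fiber or copper (~2/3 c)
cable_ns = 30 * PROP_NS_PER_M    # e.g. 30 m of cable inside a colo

assert round(serialization_ns, 1) == 51.2
assert cable_ns == 150           # cable dwarfs serialization even at 30 m
```

So a link advertised purely by its ZilloBytes/second tells you nothing about the ~200 ns that frame spends in flight before any logic, FPGA or otherwise, can react to it.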

PPS. A C-code implementation (emulation) of the to-be FPGA algorithmic idea could already be a significant improvement. Without that step you should stop wasting your time.

Handsome Jack


Total Posts: 1
Joined: Jul 2020
 
Posted: 2020-10-22 19:12
For context, my $dayjob is embedded software engineer at a fabless semiconductor firm. I only browse this forum out of personal interest, and cannot speak to the mathematical and theory side of quant finance.

We had a hardware team tasked with designing and implementing some TCP offload features for a custom MAC, which, for unstated reasons, we were looking to do in-house rather than license existing IP. This wasn't your run-of-the-mill offloading (e.g. TSO or the like); it was having the hardware (an FPGA while testing) do a lot of heavy lifting so our core could wash its hands of dealing with most connections.

Long story short, after 3 months of dev time by a full team, the project ended up getting scrapped from above because the amount of effort was quickly mounting. Because TCP has so many possible states, building an FSM that could accurately cover all possible transitions was becoming very cumbersome, very quickly. It also involved a lot of combinational logic that forced us to run our MAC at lower and lower clock speeds on the FPGA, and introduced many timing violations into our design.
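To give a feel for the state-count problem: here is a toy transition table covering just a sliver of the RFC 793 TCP state machine, with invented event labels. Even the happy path of open, data, close already needs six states, and a hardware FSM must additionally cover every timeout, RST, retransmission, and simultaneous-open corner, each multiplying the combinational logic.

```python
# Fragment of the TCP connection state machine (RFC 793 state names,
# event labels invented for this sketch).
TCP_FSM = {
    ('CLOSED',      'active_open/SYN'): 'SYN_SENT',
    ('SYN_SENT',    'rcv_SYN_ACK/ACK'): 'ESTABLISHED',
    ('ESTABLISHED', 'close/FIN'):       'FIN_WAIT_1',
    ('FIN_WAIT_1',  'rcv_ACK'):         'FIN_WAIT_2',
    ('FIN_WAIT_2',  'rcv_FIN/ACK'):     'TIME_WAIT',
    ('TIME_WAIT',   'timeout_2MSL'):    'CLOSED',
}

def step(state: str, event: str) -> str:
    """Advance the FSM; unknown (state, event) pairs hold the current state."""
    return TCP_FSM.get((state, event), state)

s = 'CLOSED'
for ev in ['active_open/SYN', 'rcv_SYN_ACK/ACK', 'close/FIN',
           'rcv_ACK', 'rcv_FIN/ACK', 'timeout_2MSL']:
    s = step(s, ev)
assert s == 'CLOSED'   # full round trip through a clean open/close
```

The full RFC 793 diagram has 11 states before any of the quirks, which is exactly why targeting only the exchange's known stack, as suggested above, can shrink the table so much.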

Now, the problem of "too many possible states" may not be as big of an issue here, because you don't have to support a wide variety of TCP stacks and quirks that they may have - you only have to design to accommodate any idiosyncrasies that the exchange's TCP stack may have, which may reduce your dev time.

Also from my understanding, the softcore approach may not be the best. Any time we've had to put a softcore on our FPGAs we've always had tons of issues with having enough space for all of our other components. YMMV

P.S. Switching to DACs instead of fiber will shave off a little of your latency too, the transceivers for fiber add noticeable latency.