Forums  > Software  > Flat Files vs DB for Book Data  
Page 1 of 1
Display using:  

Its Grisha

Total Posts: 92
Joined: Nov 2019
Posted: 2021-04-30 14:37
At risk of starting another database conversation, are you guys storing order book snapshots + deltas in simple flat files?

Was using Timescale (postgres extension) with some success until inability to buffer more than a week or so of data in ram began killing basic query performance even with nvme ssd's.
i.e. my backtester asks
select * from orderbook where ts between(x, x + 1 hour)
If x is more than ~10 days ago (cold cache), this is slow to the point that pulling data is the main bottleneck in backtesting.

I'm thinking to just devise a flat file scheme since I'm not able to call any analytic queries on the raw book deltas anyway. Can I expect reading specialized flat files from ssd to be much faster than db? Use case is running through history for backtesting in c++ and also reconstructing book in python for general research.

"Nothing is more dangerous to the adventurous spirit within a man than a secure future."


Total Posts: 80
Joined: Jul 2007
Posted: 2021-04-30 17:11
kdb if you can or

Master of ES

Total Posts: 493
Joined: Jan 2015
Posted: 2021-04-30 17:32
How big is the data? On my shitty laptop I can ingest from a gzip flat file at around one million order book deltas per second per core. It uses maybe 70 mb of memory. That's all customized C++ code though.

For order book deltas, it probably doesn't make sense to pay the overhead of a database. There's not really much you can query on. You pretty much always need to grab deltas atomically on any single symbol-date. Therefore the only thing you can condition on is date or symbol. SQL is kinda overkill.

The big decision with flat files is whether you do one file per date or one file per symbol-date. That depends on how wide your universe is. For the former, you may waste time parsing the whole date just to get one symbol's data. For the latter, sequencing multiple symbols into time order has a lot of overhead. Personally, I think S3 storage is cheap enough that I'd just duplicate the data and store both.

Good questions outrank easy answers. -Paul Samuelson

Its Grisha

Total Posts: 92
Joined: Nov 2019
Posted: 2021-04-30 17:33
@ax Maybe one day I'll touch the storied kdb, for now too expensive. Parquets are an idea though, used them for some equities minute bars at one point.

@EL Playing in crypto so not as big as it could be. Past 2 weeks are a billion rows. Yeah, I'm thinking I need to bite the bullet and go flat files and write a custom parser like yours. DB was more out of laziness, thinking that with a nice server and a trendy new time series solution this could actually be fast enough. Apparently not..

"Nothing is more dangerous to the adventurous spirit within a man than a secure future."

Its Grisha

Total Posts: 92
Joined: Nov 2019
Posted: 2021-05-01 07:48
So after battling the problem today, I have new slightly unorthodox idea.

Timescale is nice because out-of-order inserts are handled well, it can handle very high insert rates, and there is an efficient system for chunking the data and compressing older slices. So at least in my use case, flat files are good for reading but the database is much easier to stream live data to.

So this is kind of backwards, but I'm thinking to use the database as a staging area combined with some utils to generate flat files on demand for backtesting. Even if it takes an hour to generate a big flat file, I can the then mmap the csv in my backtester and run as many fast backtests as I want before deleting the temporary slice.

It may not be the best use of drive space but I'm thinking this will save a lot of dev time and maintenance.

"Nothing is more dangerous to the adventurous spirit within a man than a secure future."


Total Posts: 87
Joined: Nov 2013
Posted: 2021-05-02 16:21
I haven't even tested this yet (using Postgres at the moment), but since I'm using Python for a lot of exploratory-stage prototyping, this flat file dataframe indexing scheme looks interesting - PyStore

A quasi-summary:
"It’s built on top of Pandas, Numpy, and Dask and stores the data in the Parquet file format (via Fastparquet) in a hierarchical directory structure. Files are compressed using Snappy, a fast and efficient compression/decompression library from Google.

The end result?

A powerful, pythonic datastore for Pandas dataframes that can easily query millions of rows in sub-second speed.

PyStore provides namespaced collections of data. These collections allow bucketing data by source, user, or some other metric (for example frequency: End-Of-Day; Minute Bars; etc.). Each collection (or namespace) maps to a directory containing partitioned Parquet files for each item (e.g. symbol)."

"How be caught up in a game and have no idea of the rules." - C.S.


Total Posts: 1350
Joined: Jun 2007
Posted: 2021-05-02 18:44
Parquet files, stored ordered and partitioned by some meaningful key does the job. Like ES I don't hesitate to duplicate data to store it in a different use case specific key-partition/aggregation level. If the data is big, I use apache spark. Slow, but get's the job done.

Jslade is a fan of jd ( Obscure stuff and I never used it, but smart "get the job done" people seem to gravitate to it.

All of the above should be fast enough for most cases. None of the above supports good asOfJoins. To me having solid windowing and asOfJoin capability is more important than super fast read- write.

Playing around with shakti a bit lately and like it a lot. Also started (for fun) to abuse PyTorch as an aclerated Numpy.

Ich kam hierher und sah dich und deine Leute lächeln, und sagte mir: Maggette, scheiss auf den small talk, lass lieber deine Fäuste sprechen...


Total Posts: 249
Joined: Jul 2013
Posted: 2021-05-03 12:46
I think we had this conversation somewhere before.
I use hdf5 and flat csv text files for the logs.
I dont use deltas but I compress it, which makes it really small.

Ideally one should store directly into the preferred reading format (binary, hdf5, teafiles, there are more) and then reading should be easy directly on the storage machine, cloud, research machine (or all of them).
We never got to the point where we are doing everything in one format so a bit of a waste there.

That is how I see it.

"amicus Plato sed magis amica Veritas"


Total Posts: 97
Joined: Jul 2018
Posted: 2021-05-03 14:39
It depends on your most common operations.

Storing the deltas is what I've seen most commonly done. This usually results in slightly smaller compressed size than storing the snapshots. Networked IO is more expensive and grows at a much slower rate than CPU, so on all multi-PB systems I've seen, effort is spent towards squeezing the size of data that needs to be moved on the wire down by 100s of bps before worrying about the performance of the tasks (e.g. single core backtesting throughput, single host memory use, DBMS/querying framework performance, simulation threading model) themselves.

You'll need snapshots if you use deltas, and the optimal frequency to dump snapshots depends on your use case. Sometimes you'd just inherit it from the source protocol of your largest source of revenue - like weekly for Globex, daily for TCP/unicast-style feeds used by many dark pools. At one extreme, you could have frequent, intraday snapshots (QuantHouse's normalization format is the only one in the wild I've seen doing this).

It is also quite natural to store deltas because the pareto-optimal compression scheme (in the speed-compression ratio plane) for order book data involves delta-of-delta encoding for timestamps.

If you don't have tooling dependencies (e.g. HDF5 is good for MATLAB, Parquet is good for Apache/pandas), then a zero copy binary format is absolutely the way to go.

Sometimes you need to sacrifice some of the simplicity in your structure and introduce metadata in your storage format to optimize certain intraday lookups. For order book data, there's usually structure in the data that you can exploit for fast intraday lookups, e.g. a hybrid search algorithm that passes from binary -> linear interpolation -> linear scan.
Previous Thread :: Next Thread 
Page 1 of 1