Forums  > Trading  > Recs for options backtesting  
     

ctd


Total Posts: 62
Joined: Jan 2008
 
Posted: 2021-09-11 02:25
Wondering about the best approaches to storing/retrieving options price data for backtesting purposes. Anything SQL-based seems unwieldy (i.e., very slow) given the amount of data; wondering if it's necessary to go to a flat-file or TSDB solution, or if there's a better approach.

errrb


Total Posts: 12
Joined: Oct 2007
 
Posted: 2021-09-19 21:05
I use memory-mapped files to store options data for my simulations. Another fast tool is kdb (the commercial version is extremely expensive, but the 32-bit version can be used for free).
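A minimal sketch of the memory-mapped approach, using numpy's structured arrays; the record layout and file name here are illustrative, not anything errrb specified:

```python
import numpy as np

# One fixed-width record per quote, so any slice can be read without parsing.
dtype = np.dtype([
    ("ts", "i8"),       # epoch nanoseconds
    ("strike", "f8"),
    ("expiry", "i4"),   # days to expiry
    ("bid", "f8"),
    ("ask", "f8"),
])

# Write: build an array and dump it to a flat binary file.
quotes = np.zeros(3, dtype=dtype)
quotes["strike"] = [100.0, 105.0, 110.0]
quotes["bid"] = [4.2, 2.1, 0.9]
quotes.tofile("quotes.bin")

# Read: map the file. The OS pages data in lazily, so opening a
# multi-gigabyte file is cheap and a slice touches only the bytes it uses.
mm = np.memmap("quotes.bin", dtype=dtype, mode="r")
print(mm["strike"][1])  # 105.0
```

The appeal is that "loading" the whole history is effectively free; you pay I/O only for the rows your backtest actually reads.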

Maggette


Total Posts: 1330
Joined: Jun 2007
 
Posted: 2021-09-21 17:03
QuestDB is fine.
https://questdb.io/

It won't give you much over your own memory-mapped file solution; maybe 32-bit kdb+ is even superior. But I like it.

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

Dizzy


Total Posts: 259
Joined: May 2006
 
Posted: 2021-09-22 09:57
QuestDB looks interesting. (Note to self: I really should try it out.)

Also, 32-bit kdb is only for non-commercial use.

"Although the code snippet makes taking over the earth look fairly easy, you don't see all the hard work going on behind the scenes." - Programming F#, Chris Smith

Maggette


Total Posts: 1330
Joined: Jun 2007
 
Posted: 2021-09-22 12:52
@dizzy
If you look at the people pushing QuestDB, it's fairly obvious they took a lot out of Arctic's playbook. But it is nicely integrated with all the fluff you need these days to convince people it's a useful tool.

I still think Shakti will rule them all. But QuestDB is nice enough.

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

JTDerp


Total Posts: 85
Joined: Nov 2013
 
Posted: 2021-09-23 13:48
You might have a look at Parquet or, if Python is in your workflow, a package called PyStore, which stores pandas DataFrames in Parquet files; no SQL involved. https://medium.com/@aroussi/fast-data-store-for-pandas-time-series-data-using-pystore-89d9caeef4e2

"How dreadful...to be caught up in a game and have no idea of the rules." - C.S.

nikol


Total Posts: 1402
Joined: Jun 2005
 
Posted: 2021-09-23 17:03
I was keeping my head low... but now that Python has been mentioned, I would propose looking into

pandas.DataFrame(...).to_hdf(file, key=key, mode='a')

The optimal way is to store your data with "key" = something like "year/month/day/hour/maturity/strike"; it helps to split the collected data by hour or maybe by day, depending on chunk size. Experiment within your setup. Feel free to try a different order of tags within the key; it will depend on the usage scenario. Maybe use the option ticker as a tag; don't limit your imagination.

Read it back with
hdf_file = pandas.HDFStore(fname)
hdf_file.keys()
hdf_file.get(key)

I like HDFs.

The TSDB InfluxDB is aimed at distributed, cloud-style storage.

... What is a man
If his chief good and market of his time
Be but to sleep and feed? (c)

Maggette


Total Posts: 1330
Joined: Jun 2007
 
Posted: 2021-09-25 20:58
Nothing wrong with well-partitioned HDF5 plus a well-written API that lets you pull the data the way you want it. I've had plenty of good experiences with it in Python-centric gigs (either setting it up myself in greenfield projects or working with an established solution).

There are limits, and other stuff can be more performant. But Python + HDF5 is well established.

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

EspressoLover
Master of ES

Total Posts: 492
Joined: Jan 2015
 
Posted: 2021-10-06 20:51
OLTP databases very rarely make sense for research. The data is WORM (write once, read many); there's no reason to pay the operational and computational overhead of ACID compliance.

I think (gzipped) flat files are a pretty good starting point. No reason to go fancier unless you can identify a specific advantage or already have the infra set up. If you need to index over multiple dimensions, just duplicate the data sliced multiple ways. Storage is pretty cheap.
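A minimal sketch of the gzipped flat-file approach, using only the standard library; the file naming scheme (one file per date/underlying slice) and the field layout are illustrative assumptions:

```python
import csv
import gzip

# Write one slice: all quotes for one underlying on one date.
rows = [
    {"ts": "2021-09-10T14:30:00", "strike": "100", "bid": "4.2", "ask": "4.3"},
    {"ts": "2021-09-10T14:30:01", "strike": "105", "bid": "2.1", "ask": "2.2"},
]
with gzip.open("SPY_2021-09-10.csv.gz", "wt", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["ts", "strike", "bid", "ask"])
    w.writeheader()
    w.writerows(rows)

# Read: stream-decompress. No server, no schema migrations, trivially
# copyable between machines. To "index" by expiry or strike as well,
# write the same rows again partitioned along that dimension.
with gzip.open("SPY_2021-09-10.csv.gz", "rt") as f:
    back = list(csv.DictReader(f))
print(back[0]["strike"])  # 100
```

The duplicate-the-data trick trades disk (cheap) for avoiding any query engine at read time: each access pattern gets its own pre-sliced copy.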

Good questions outrank easy answers. -Paul Samuelson