 flare9x
|
|
Total Posts: 2 |
Joined: Nov 2016 |
|
|
Data: EOD data, source = .csv files.
1. Import the .csv files into PostgreSQL.
2. Perform calculations (use column A and B to make C), perhaps performing the calculation in Python and storing the final result in SQL.
3. Multiple data sources: options, price, volume, etc., all to be matched up per date.
4. Search scripts to pull data from the database, perhaps written in Python.
5. Automate the .csv-source-to-SQL-destination process on a daily basis.
The above is of course open to suggestions.
Looking to figure out the best way to achieve the above workflow and how to organise the structure of the SQL database: one table, multiple tables, etc.?
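For reference, steps 1 and 2 might look roughly like the sketch below with pandas and SQLAlchemy. The file, table, and column names are placeholders, not anything from an actual setup.

# Rough sketch of steps 1-2, assuming pandas + SQLAlchemy; file, table and
# column names (eod.csv, eod_prices, col_a, col_b, trade_date) are made up.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/marketdata")

df = pd.read_csv("eod.csv", parse_dates=["trade_date"])             # 1. import the .csv
df["col_c"] = df["col_a"] + df["col_b"]                             # 2. derive C from A and B
df.to_sql("eod_prices", engine, if_exists="append", index=False)    #    store the result in SQL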
Thanks
|
|
|
|
 |
 FDAXHunter
|
Founding Member |
Total Posts: 8356 |
Joined: Mar 2004 |
|
|
For this sort of thing, a single table is usually sufficient. And because it's daily data, you can't really do anything wrong with regard to data types (which is why I'm leaving them up to you).
As a structure for the table, I would suggest:
[Timestamp - Start] // The beginning of the day. In its simplest form this will be 00:00:00.
[Timestamp - End] // Allows variable-length daily bars (a use case, for example, with EOD data that you calculate yourself).
[Timestamp - Calculation] // Possibly irrelevant, but if you calculate the data yourself, this is when the data was calculated and persisted.
[Instrument ID Key] // Your way of storing the instrument identification. Can be simple for stocks, more complex for options.
[Price Open]
[Price High]
[Price Low]
[Price Close]
[Total Quantity] // Also known as volume.
[Open Interest]
[Open Interest Release Timestamp] // Open interest is not normally calculated and updated the same day.
[Data Source] // Allows you to store different sources for the same data.
[Record Create - Timestamp] // Debugging and traceability/maintenance.
[Record Modify - Timestamp] // Debugging and traceability/maintenance.
[Source Data - Raw] // It's actually feasible to store the raw source data as a string in the table for daily data. It won't take up much space, but can help tremendously with debugging and traceability problems at a later date.
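As a rough illustration only, and with types picked arbitrarily since they are deliberately left open above, that layout might translate into PostgreSQL DDL along these lines (run here via psycopg2; database and table names are placeholders):

# Sketch of the suggested single-table layout; column types are illustrative only.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS eod_bar (
    ts_start        timestamptz NOT NULL,   -- beginning of the day
    ts_end          timestamptz NOT NULL,   -- allows variable-length daily bars
    ts_calculation  timestamptz,            -- when the data was calculated/persisted
    instrument_id   text        NOT NULL,   -- instrument identification key
    price_open      numeric,
    price_high      numeric,
    price_low       numeric,
    price_close     numeric,
    total_quantity  numeric,                -- volume
    open_interest   numeric,
    oi_release_ts   timestamptz,            -- OI is usually published later
    data_source     text,                   -- different sources for the same data
    record_created  timestamptz DEFAULT now(),
    record_modified timestamptz,
    source_raw      text                    -- raw source line, for traceability
);
"""

with psycopg2.connect("dbname=marketdata user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)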
|
The Figs Protocol. |
|
 |
|
FDAX's schema is great, if you decide to go forward. But I'd ask yourself first, do you even need a database? If so, why? You could end up doing a lot less work by skipping points 4) and 5). Just store the raw (compressed) CSVs, and use the scripts from 2) to perform the calculations on demand. If you need point 3) just use GNU join. Pipe the calculator scripts directly into the consuming application. Yes that means you re-run the calcs on every access, but processor power and SSD access is *way* cheaper than developer effort. No need to deal with any of the headaches of a persistence layer.
I'd really doubt if you need SQL. Do you need ACID transactions? Indexing? Scalability? Probably not. It's a WORM dataset small enough to fit into L3 cache, with a max of under a hundred dozen simultaneous readers. When in doubt - KISS. |
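Roughly the same pipeline sketched in Python/pandas (the shell version would use GNU join and pipes; the file and column names below are placeholders):

# Sketch of the "no database" route: keep the raw CSVs gzipped on disk,
# join per date and calculate on demand.
import sys
import pandas as pd

prices  = pd.read_csv("prices.csv.gz",  parse_dates=["trade_date"])   # pandas reads .gz directly
options = pd.read_csv("options.csv.gz", parse_dates=["trade_date"])

merged = prices.merge(options, on=["trade_date", "symbol"], how="inner")  # the "join per date" step
merged["signal"] = merged["close"] / merged["implied_vol"]                # example on-demand calc

# pipe the result straight into the consuming application (here: stdout)
merged.to_csv(sys.stdout, index=False)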
Good questions outrank easy answers.
-Paul Samuelson |
|
|
 |
|
Look into to_sql and read_sql within pandas if you're keen on using Python.
In line with EspressoLover, spend some time thinking about whether you need a database.
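A minimal round trip with those two functions, assuming SQLAlchemy and made-up file/table names:

# to_sql / read_sql sketch; "eod.csv", "eod_raw" and "trade_date" are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/marketdata")

# write: pandas creates (or replaces) the table straight from the DataFrame
df = pd.read_csv("eod.csv", parse_dates=["trade_date"])
df.to_sql("eod_raw", engine, if_exists="replace", index=False)

# read: pull it back (or run any SQL query) into a DataFrame
df_back = pd.read_sql("SELECT * FROM eod_raw ORDER BY trade_date", engine)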
|
|
|
 |
 sv507
|
|
Total Posts: 165 |
Joined: Aug 2010 |
|
|
I found pandas to_sql very slow (hours vs. minutes) compared to the Postgres COPY command (I don't remember why it was so much slower). Personally I would definitely go for the DB option. I agree it's WORM, but that's precisely the advantage of the DB: so much easier to slice and dice, and to do all the queries. |
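For comparison, a bulk load via COPY from Python might look like this (psycopg2's copy_expert; table and file names are placeholders, and the CSV column order is assumed to match the table):

# Bulk load with the Postgres COPY command, usually far faster than
# row-by-row inserts or to_sql.
import psycopg2

with psycopg2.connect("dbname=marketdata user=postgres") as conn:
    with conn.cursor() as cur:
        with open("eod.csv") as f:
            # assumes the CSV columns appear in the same order as the table's
            cur.copy_expert(
                "COPY eod_bar FROM STDIN WITH (FORMAT csv, HEADER true)",
                f,
            )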
|
|
|
 |
 Lebowski
|
|
Total Posts: 66 |
Joined: Jun 2015 |
|
|
FDAXHunter's table structure is quite comprehensive given the question posed, and some of the details he's given, such as the OI release timestamp, are the type of wisdom you only gain by implementing this stuff. It's by no means necessary, but perhaps worth mentioning: if your applications of this data need the "Record Modify" timestamp, you might consider event sourcing as a more comprehensive pattern for dealing with the potential for modifications. |
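As a sketch of what that could mean in practice (names and types invented for illustration): corrections land as new rows in an append-only event table, and the "current" bar is derived by query rather than by updating rows in place.

# Append-only "event" table sketch; the current bar is whatever the latest
# event for (instrument_id, ts_start) says. Names and types are illustrative.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS eod_bar_event (
    event_id      bigserial PRIMARY KEY,
    event_ts      timestamptz NOT NULL DEFAULT now(),  -- when we learned it
    instrument_id text        NOT NULL,
    ts_start      timestamptz NOT NULL,
    price_close   numeric,                             -- ...other fields as needed
    data_source   text
);
"""

LATEST = """
SELECT DISTINCT ON (instrument_id, ts_start) *
FROM eod_bar_event
ORDER BY instrument_id, ts_start, event_ts DESC;
"""

with psycopg2.connect("dbname=marketdata user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(LATEST)
        current_bars = cur.fetchall()   # corrections never overwrite history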
|
|
 |
|
Sorry to be thick, but wouldn't you want a date stamp in the schema as well, or is that included within all the various timestamps? 
I played with a Postgres database years ago, and decided to just stick with CSV files for daily time-series data. They are nice and simple. |
|
|
|
 |
|
@tradenator: a timestamp includes the date. @lebowski: everything revolves around your specific requirements.
Time well wasted. |
|
 |
|
I guess you have a schema that doesn't change. But for flexibility, isn't it better to use a variable/value structure? What I mean is, instead of:
Open | High | Low | Close
something like:
record 1: Open | 16
record 2: High | 17
record 3: Low | 15
record 4: Close | 16
Cord |
Vespertilio homo est cientificus |
|
|
 |
 FDAXHunter
|
Founding Member |
Total Posts: 8356 |
Joined: Mar 2004 |
|
|
That may sound like it gives you flexibility, but for highly structured data (which this is), an "Attribute-Value" pair layout is a really bad idea. You will run into countless operational issues. It's a really bad idea. Entire careers have been lost to Attribute-Value or Entity-Attribute-Value ("EAV") schemas.
There are hundreds of references detailing why it's bad. I just googled a couple without really reading them:
EAV - The Good and Bad
OTLT/EAV design - why do people hate it? |
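To illustrate one of the operational issues: with an attribute-value layout, even reconstructing a single OHLC row needs a pivot, whereas the wide table is a plain SELECT. The table and column names below are hypothetical.

# Hypothetical comparison of the two layouts for one day's bar (PostgreSQL syntax).

WIDE_QUERY = """
SELECT price_open, price_high, price_low, price_close
FROM eod_bar
WHERE instrument_id = 'SPY' AND ts_start = '2016-11-01';
"""

# The same question against an attribute/value table: every field becomes a
# conditional aggregate, and typing, NULLs and constraints all get harder.
EAV_QUERY = """
SELECT
    MAX(value) FILTER (WHERE attribute = 'Open')  AS price_open,
    MAX(value) FILTER (WHERE attribute = 'High')  AS price_high,
    MAX(value) FILTER (WHERE attribute = 'Low')   AS price_low,
    MAX(value) FILTER (WHERE attribute = 'Close') AS price_close
FROM eod_bar_eav
WHERE instrument_id = 'SPY' AND ts_start = '2016-11-01';
"""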
The Figs Protocol. |
|
 |