Forums  > Software  > Jd database?  
     

procrastinatus


Total Posts: 8
Joined: Oct 2012
 
Posted: 2020-07-30 15:39
I'm looking into different timeseries databases and I remember a few folks here spoke highly of Jd. I'd appreciate any details about how that's worked out (good and bad).

Generally, my philosophy is to be conservative with technology for data storage (i.e. old ass fs/db that have a long enough history that data corruption bugs are unlikely), but I guess Jd has been around for a while (albeit with what seems like relatively few people using it) and I wanted to see if anyone here has kicked the tires.

I'm also all ears if you want to volunteer an opinion on another timeseries db.

doomanx


Total Posts: 78
Joined: Jul 2018
 
Posted: 2020-07-30 15:45
What language are you interfacing it with? If you're using python you could take a look at https://github.com/man-group/arctic.

did you use VWAP or triple-reinforced GAN execution?

procrastinatus


Total Posts: 8
Joined: Oct 2012
 
Posted: 2020-07-30 15:58
python and c++ for now.

Thanks for the link, will take a look. I've been allergic to Mongo for a while - mostly from some underwhelming experiences talking to Mongo engineers and my conservatism about data stores - but it's been long enough that it's probably time to reassess.

Patrik
Founding Member

Total Posts: 1364
Joined: Mar 2004
 
Posted: 2020-07-30 16:14
Depending on your use case (amount of data, granularity, latency reqs, etc) timescale may be in the running - https://www.timescale.com/

The foreign data wrapper part is perhaps not old and super battle-hardened, but it leans heavily on the overall postgres foundation, and the extension is open source, so I can get pretty comfortable with that compared to something like e.g. mongo.

Capital Structure Demolition LLC Radiation

svisstack


Total Posts: 351
Joined: Feb 2014
 
Posted: 2020-07-30 23:33
I took a quick look at Jd. It looks like total shit; I don't understand why anyone would waste time on this, please explain.

clickhouse / influxdb are also options here; there is no generally correct answer.

It looks like avoiding data corruption is the most important requirement on your side; in that case you don't need a database at all and can just store the data in S3 (99.999999999% durability) and query it with Amazon Athena.
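
Something like this is all it takes - a rough boto3 sketch, with the bucket, database and table names made up:

# Rough sketch: keep the files in S3 and let Athena scan them; nothing to administer.
# Bucket, database, table and column names below are made up for illustration.
import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT ts, px, qty FROM ticks WHERE sym = 'ES' AND dt = '2020-07-30'",
    QueryExecutionContext={"Database": "marketdata"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Athena is asynchronous: poll until the query finishes, then pick up the
# result CSV it wrote to the output location above.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)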

First Commander of the USS Enterprise

kloc


Total Posts: 34
Joined: May 2017
 
Posted: 2020-07-31 06:10
No intention to start flame wars here, but I'm really interested so I have to ask... What's wrong with just pure, plain old postgres? :-)

I've tried several other open-source options (including TimeScale, which was mentioned above), but I've never managed to convince myself they're clearly better solutions for TS data (TimeScale actually performed somewhat *worse* than pure postgres in some simple tests I performed recently).

Re InfluxDB... it does not look like a clear winner either:
https://portavita.github.io/2018-07-31-blog_influxdb_vs_postgresql

I'm sure it's just me being thick, inept and unable to squeeze out optimal performance by tweaking the configuration correctly. I'm also sure everyone has different datasets and different platforms and it's not easy to cover everyone's needs, but... is it really that hard to make the default configuration settings reasonable enough out of the box to be at least comparable to something standard like postgres?

I'm really curious to hear others' opinions and more than happy to be convinced that I'm wrong...


jslade


Total Posts: 1217
Joined: Feb 2007
 
Posted: 2020-08-01 20:46
> why someone could waste time on this

Jd has similar performance to Kx on analytics queries/research. If you write code in J it's a no brainer. If you don't and don't want to, you should do something else, same as Kx/K. I write code in J. It's good and I can embarrass idjits doing 100 node "spark cluster" BS on my shitty thinkpad.

Influx is a hilarious turd which should die in a fire. Clickhouse is OK. Postgres with columns is OK as long as you don't care too much about performance/ have small queries.

"Learning, n. The kind of ignorance distinguishing the studious."

TonyC
Nuclear Energy Trader

Total Posts: 1344
Joined: May 2004
 
Posted: 2020-08-01 22:29
Jslade, you're still using Jd? Still writing in J? (says the guy that still does APL)
... I must admit that if I were starting out fresh I would probably skip both APL & J and just go to k/kx/kdb/q

Although Arthur's new "shakti/k9" promises to be even zippier than what FirstDerivatives/KX is doing

flaneur/boulevardier/remittance man/energy trader

procrastinatus


Total Posts: 8
Joined: Oct 2012
 
Posted: 2020-08-03 16:03
Entertaining responses all around.

patrik - recently I've noticed timescale pop up (although mostly on people's side projects). pg backend in principle seems fine.

svisstack - thanks for the suggestions - I had not looked at clickhouse yet. Yeah, for maximum reliability, I have been looking into etching the data into quartz silica glass and encasing the disks in artificial amber - I hear that's the best to prevent data corruption... More seriously, I just have a lower floor on data reliability, but given that (and other constraints) am optimizing for latency running queries (like everyone else). S3 is great, and Athena is helpful for certain kinds of data, but it is slow as balls (not to mention that at the top of the hour queries grind to a halt from everyone's cron jobs) - redshift spectrum is probably what you meant and is a little better - I'll probably take a look at that.

kloc - pg sounds worth looking more into - I didn't know it could be configured with non-{btree,rtree,hash} indexes. pg and mysql (mariadb) have good caching and reliable data storage/parallelization, and I've seen both used to serve massive and important datasets at latencies and reliabilities that crush "web-scale" crap; they also tend to be better at supporting mutable data, although I'm not sure I want that in this case.

jslade - I have a cursory interest in J which is part of my interest in Jd. Do you use it mostly for prototyping/analysis or is it hooked up to talk to anything important? Any regrets or PITA?
I've used spark clusters for some text processing stuff - which I guess it's well suited for - but it hurts my teeth how slow and full of overhead it is to run simple map-reduce tasks on smaller datasets (~100GB or smaller). I messed with hadoop (for some analytics work in the late 00s), which was such a turd that maybe spark seems good to a certain audience by comparison? Also, don't get me started on the container cluster stuff...

tonyc - how much is a commercial kdb license (I've heard O($100k/year))? What's better about k/kdb compared to J/Jd? I've generally found that for large complex projects (like in evolution) intermediate steps need to be viable on their own - spending $100k on step 1 is probably not viable for me, but might be ok on step N down the road.

jslade


Total Posts: 1217
Joined: Feb 2007
 
Posted: 2020-08-04 15:14
I like J better than K; it's more expressive and has better tooling (dissect and the debugger are great) and a better ecosystem of add-ons.

K is of course better developed for financial applications, simpler, and has all the support and documentation money can buy; it probably has better threading too (probably not for long). Downsides: I think K splitting into Shakti/Kx isn't a good situation for a small language, and of course it costs money.

TBH I think all APLs are eccentric at this point. They represent a truly different way of doing things, and can be very efficient, but the only way I can see them setting the world on fire again is if they start targeting GPUs. Most people should just develop code in Python and C++ and buy more computers like everyone else.

related: https://mcfunley.com/choose-boring-technology

"Learning, n. The kind of ignorance distinguishing the studious."

prikolno


Total Posts: 55
Joined: Jul 2018
 
Posted: 2020-08-05 00:37
Can you provide more details about the use case? Typical types of queries that you will run, the number of client applications, and the hardware everything is running on? e.g. ETL? Post-trade analysis? Embarrassingly parallel jobs on a HPC cluster? Backtests? Ad hoc tasks on 1-2 beefy servers? A bit of everything? Multi-cloud, hybrid cloud, self-hosted? Preferred language clients?

I know some of the Man AHL engineers who wrote Arctic - actually had to interview one of them so I was impelled to read through the entire source - and generally don't recommend it for time series data. It just happens to fit very specifically in Man AHL's niche. It's OK if your working set is under 5 TB, your typical workflow involves materializing the entire dataset at once for a date (day-granularity) range, and you like using pandas dataframes in most of your workflows; or if you really absolutely need versioning and journaling of metadata and cannot invest a few weeks building your own. But there are many inefficiencies otherwise, e.g. the way it serializes binary blobs to work around Mongo's BSON document size limit, zero hardware codesign considerations, lack of server-side compute, many other performance and maintenance issues inherited from Mongo.
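
For context, the workflow Arctic is actually good at looks roughly like this - a sketch from memory, with the Mongo host, library and symbol names made up:

# Sketch of the idiomatic Arctic usage: write/read whole pandas DataFrames per
# symbol over a date range, backed by MongoDB. Names here are made up.
import datetime as dt
import pandas as pd
from arctic import Arctic
from arctic.date import DateRange

store = Arctic("localhost")            # points at a MongoDB instance
store.initialize_library("NYSE")       # a VersionStore library
lib = store["NYSE"]

df = pd.DataFrame({"px": [100.0, 100.5]},
                  index=pd.to_datetime(["2020-01-02", "2020-01-03"]))
lib.write("AAPL", df)                  # versioned write of the whole frame

item = lib.read("AAPL", date_range=DateRange(dt.datetime(2020, 1, 1),
                                             dt.datetime(2020, 6, 30)))
df_back = item.data                    # materializes the requested range as a DataFrame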

Only have a cursory exposure to Influx from maintaining a few instances for logging and devops-related tasks, but I find it's never on the pareto-frontier for typical financial analysis workflows, i.e. conditioned on you investing time to maintain a DBMS, you will find that it is always strictly dominated by something else in all categories.

kdb, clickhouse, Snowflake, Vertica are OK. Happen to know both the biggest user of Vertica and biggest (until they got acquired in 2019) user of Snowflake and their engineering insights from over the years, so I can provide some color there. I prefer Clickhouse over Vertica if you don't need relational integrity, which is likely for time series data, and it is quite strong if your queries are usually a composition of range select on the time index and field = f1 on another column. This is idiomatic for low frequency data where all your tickers reside in a single table, e.g. CRSP, corporate bonds.
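
To make that access pattern concrete, a sketch with the clickhouse-driver Python client - table and column names are made up:

# Sketch of the query shape ClickHouse is strong at: a range scan on the time
# index plus an equality filter on another column. Table/columns are made up.
from clickhouse_driver import Client

client = Client(host="localhost")
rows = client.execute(
    """
    SELECT ts, bid, ask
    FROM daily_quotes
    WHERE sym = %(sym)s
      AND ts BETWEEN %(t0)s AND %(t1)s
    ORDER BY ts
    """,
    {"sym": "IBM", "t0": "2019-01-01", "t1": "2019-12-31"},
)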

But for higher frequency data, you often wouldn't put everything in a single table and subset from that table, because in practice your cross-sectional features don't require the entire universe, so it's cheaper to pay logarithmic complexity on selected tickers than scanning through the table. I can probably write a whole book about this, so you need to be specific about the requirements.

I second using boring technology: For mixed use cases and not knowing exactly what you do, I think using Parquet as a primary on-disk format and then building everything on top, no DBMS, works OK, especially if you will be using its C++ or Python client libraries. (My current project is in Rust, so we gave up on this option since it was annoying to write language bindings.) This will survive up till 1 rack of servers and a naive shared file system if you simply put 10G between your data and the compute nodes on which your client applications reside. It will quite likely outperform Jd and require less commitment.

Just drop in a performant file system under this and you'll end up with something quite similar to what's used at most premier mid-frequency shops. You can also add clustering/parallelization frameworks from the Apache ecosystem over this for whatever reason, most of them can read pq. If your working set is like 2 orders of magnitude smaller than your cold data set, I would consider tiering it with your cold set on cheap storage and the hot data getting rotated into something like Weka.
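
A minimal sketch of that Parquet-first layout with pyarrow - the paths, partition keys and columns are just examples, not a recommendation of a specific schema:

# Sketch: Parquet as the primary on-disk format, partitioned by date/symbol,
# no DBMS on top. Paths and columns are illustrative only.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Writing: append to one partitioned dataset on a shared file system
# (or S3 via pyarrow.fs).
table = pa.table({
    "date": ["2020-08-03", "2020-08-03"],
    "sym": ["ES", "ES"],
    "ts": [1596441600000, 1596441600500],
    "px": [3294.25, 3294.50],
    "qty": [2, 5],
})
pq.write_to_dataset(table, root_path="/data/ticks", partition_cols=["date", "sym"])

# Reading: partition pruning + column projection means you only touch what you need.
dataset = ds.dataset("/data/ticks", format="parquet", partitioning="hive")
subset = dataset.to_table(
    columns=["ts", "px", "qty"],
    filter=(ds.field("date") == "2020-08-03") & (ds.field("sym") == "ES"),
)
df = subset.to_pandas()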

Maggette


Total Posts: 1237
Joined: Jun 2007
 
Posted: 2020-08-05 07:55
Hi,
I use SPARK on ETL jobs a lot. Results are written into reasonably partitioned parquets.

To save my customers money on EC2 instances, everything is stored only in S3. So no EC2 instances store data; they just compute and go to sleep.

That works reasonably well for two of my customers. One now has data in the PB ballpark.

It is robust and can be maintained by any Java drone developer who is able to read the S3 and Spark documentation. I have rather complex ETL jobs running that actually haven't failed in years.

But it is obviously not a database, slow as fuck...and crazy slow for small to medium data.
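
For reference, the skeleton of such a job is roughly this - a PySpark sketch, with bucket names, paths and columns made up:

# Rough skeleton of the ETL jobs described above: read raw data from S3,
# transform, write reasonably partitioned Parquet back to S3. No EC2 instance
# keeps any state. Buckets and columns are made up.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-etl").getOrCreate()

raw = spark.read.json("s3a://my-raw-bucket/trades/2020/08/04/")

clean = (
    raw.withColumn("ts", F.to_timestamp("ts"))
       .withColumn("date", F.to_date("ts"))
       .filter(F.col("qty") > 0)
)

(clean.repartition("date", "sym")
      .write.mode("append")
      .partitionBy("date", "sym")
      .parquet("s3a://my-curated-bucket/trades/"))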

I have ok experience with postgres. A customer of mine uses timescale. Seems ok, but I am not overly impressed.

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

svisstack


Total Posts: 351
Joined: Feb 2014
 
Posted: 2020-08-05 08:48
@procrastinatus: kdb is $12k/core/year or 2x for perpetual.

I'm not convinced by the value that kdb provides; the benchmarks look sketchy, as they compare against elasticsearch, Cassandra or mongodb. There is no mention of PostgreSQL, clickhouse, MS SQL or Oracle, and no way to reproduce the results. Comparisons to InfluxDB are tricky since at some point they replaced the DB engine; I agree that in the past it was total crap, but I've heard things have changed for the better. We are using influxdb 2 for telemetry, but I can't declare success yet as it's just a few GBs.

First Commander of the USS Enterprise

EspressoLover


Total Posts: 437
Joined: Jan 2015
 
Posted: 2020-08-05 15:11
I'll second @maggette in being a Spark proponent, especially with the workflow he recommended of using S3 as a backing store. A really nice feature of that is that you're totally cloud-elastic. You can arbitrarily scale up or down on dirt-cheap preemptible VMs. Run it on Kubernetes with auto-scaling groups, fire off a single command on your local machine, and all the compute allocation is provisioned elastically and transparently. Zero machines to sysadmin. All cattle, no pets.

A lot of people's impression of poor Spark performance comes from the older versions. Modern Spark with Kryo, Tungsten and Catalyst is actually pretty competitive with traditional SQL databases. (And then consider that you can buy 5 preemptible worker cores for the cost of 1 database core.)

If you're getting bad performance make sure to check your query plans and also to add cache() and broadcast() at chokepoints. Also evaluate your RDD partition scheme and move shuffle operations to as late in the query plan as possible. Another thing to keep in mind is that you can always call out to pipe() and run sub-components of the pipeline in other software. For example, I use pipe() to construct order books inside an optimized C++ binary, as that task is much faster than in relational algebra. In the limit case, you can treat Spark as a very convenient cloud-native cluster scheduler.
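
To illustrate a couple of those tricks - a sketch, not production code, and the paths, columns and external binary are made up:

# Sketch of the tuning points above: cache a reused frame, broadcast the small
# side of a join, and pipe() a partition through an external binary.
# Paths, columns and the binary name are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

ticks = spark.read.parquet("s3a://curated/ticks/").cache()    # reused below, keep it warm
universe = spark.read.parquet("s3a://curated/universe/")      # small dimension table

# Broadcasting the small table lets the join avoid a full shuffle of `ticks`.
joined = ticks.join(F.broadcast(universe), on="sym")

# Push a heavy per-partition step into an external binary via pipe();
# here a (hypothetical) C++ order book builder reading stdin / writing stdout.
books = (joined.select("sym", "ts", "side", "px", "qty")
               .rdd.map(lambda r: ",".join(str(x) for x in r))
               .pipe("./build_order_book")
               .map(lambda line: line.split(",")))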

Embarrassingly, I have to admit I've never used the legendary kdb. Both philosophically and practically, I strongly prefer open-source. But also, every time I see the price tag I just think of how much hardware one could buy instead. If your dataset is under 10 TB, you can easily fit that into the memory of a single cloud instance that costs way less than a kdb license. I can't imagine that vectorized R isn't nearly equivalent in performance, let alone a much more expressive language. But that's just my uninformed opinion. Having never used kdb, what am I missing here?

Good questions outrank easy answers. -Paul Samuelson

Its Grisha


Total Posts: 56
Joined: Nov 2019
 
Posted: 2020-08-05 15:57
@EL For what it's worth, on a recent call with a data vendor I asked how clients are storing the stuff. He said that in the last few years there has been a huge shift from KDB+ to a parquet-and-Spark type stack.

Having never used KDB either, I think of it as persistent pandas, because the in-memory time series maneuvers seem to have quite similar qualities. It just has more production database features on top, like ingesting feeds and backing up to disk. Would be interested to know if this is roughly a correct understanding.

Maggette


Total Posts: 1237
Joined: Jun 2007
 
Posted: 2020-08-05 16:32
And I in turn second what ES said.

Spark has greatly improved its optimization. If you
- leverage Tungsten as much as you can
- restrict yourself to the DataFrame API to benefit from code generation (see the sketch below)
- and organize your data in a way that it is well parallelized,

stuff has become faster for quite big datasets. Still, it feels very slow for medium sets. But that is not what it was built for.
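
As a toy illustration of the DataFrame API point above (column names made up):

# The first aggregation stays in the DataFrame API, so Catalyst/Tungsten can
# optimize and code-generate it; the second drops to opaque Python lambdas on
# the RDD and loses all of that. Columns are made up.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()
trades = spark.read.parquet("/data/trades")

# Optimizable: expressed in relational operators Catalyst understands.
vwap_df = trades.groupBy("sym").agg(
    (F.sum(F.col("px") * F.col("qty")) / F.sum("qty")).alias("vwap")
)

# Not optimizable: the same thing as Python closures over the RDD.
vwap_rdd = (trades.rdd
            .map(lambda r: (r["sym"], (r["px"] * r["qty"], r["qty"])))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
            .mapValues(lambda s: s[0] / s[1]))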

@EL: Never used kdb+ in a production setting either. But IMHO kdb+ is much more than a simple database. When people rave about kdb+, they are actually raving about "kdb+ tick". You get a:
=> ticker plant that writes data into a log first (apache pulsar does the same by writing to apache bookkeeper first)
=> in-memory database
=> on-disk database
=> ticker plant clients / CEP engine...

So you actually have a "one stop shop" that lets you query rather gigantic on-disk data, plus a machine that lets you process tens of millions of messages per second. You also have a very smooth transition from batch to stream processing. All in an application smaller than 800KB.

You are a highly capable individual, hence you are able to plug several open source tools together without doing stupid shit. After years in several industries I can attest that this is not the case for everybody/every organization :).

Many organizations spend millions on developers and dev/ops to plug together the whole apache jungle to get what kdb+ does for you.

As I said: I never used kdb+ and I use Spark a lot. But I think that kdb+ outperforms the Apache world in a lot of things and can do a lot of things exceptionally well. Will take a closer look at Shakti.

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

prikolno


Total Posts: 55
Joined: Jul 2018
 
Posted: 2020-08-05 17:01
@Grisha The query language in kdb is substantially more powerful than the whole API surface of pandas. You can solve a regression with a one-liner in kdb, just like MATLAB, whereas pandas can't do that.

Moreover, that is a server-side query, you're pushing that to the location of the data, so you're not paying I/O penalty, whereas pandas is on client-side.

Then of course we have the issue that pandas is not well-suited for datasets larger than host memory. For small multiples of memory, you might hack a workaround with some combination of numpy mmap and chunking, but usually I see people just make the leap to Databricks/Spark/dask.
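
That workaround usually looks something like this - a sketch where the file, dtype and per-chunk computation are all made up:

# Sketch of the memmap + chunking workaround for data a few times larger than
# RAM: map the file, then walk it in fixed-size chunks so only one chunk is
# resident at a time. File name, dtype and the computation are made up.
import numpy as np

n_rows = 2_000_000_000
prices = np.memmap("prices.f64", dtype=np.float64, mode="r", shape=(n_rows,))

chunk = 50_000_000
total, count = 0.0, 0
for start in range(0, n_rows, chunk):
    block = prices[start:start + chunk]   # only this slice of the mapping gets touched
    total += float(block.sum())
    count += block.size

mean_px = total / count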

The other matter is that kdb has deterministic real-time read/write. I don't have a scientific way to confirm this, but I've occasionally had to use Integral FX's web dashboard (backed by kdb) for a MM strategy and it updates consistently, unlike, say, State Street's X2 interface - not sure how much of that is attributable to poor backend application design.

So it's a very different beast.

procrastinatus


Total Posts: 8
Joined: Oct 2012
 
Posted: 2020-08-05 17:25
jslade -
>Most people should just develop code in Python and C++ and buy more computers like everyone else.
>related: https://mcfunley.com/choose-boring-technology
This is good advice. I should probably admit that some part of considering J/Jd is intellectual curiosity and I should save trying risky shit for hobbies.
One counterexample (although not really) - I worked on a project years ago building a high performance messaging server that needed to support a very large number of simultaneous connections; we wrote the thing in erlang, which at the time had just been open sourced and was a weird swedish language nobody used, but it turned out to be super stable, very performant, and quick to write this kind of thing in. I think Ericsson used erlang for some mission critical stuff long before releasing the language, so it was quite mature despite low adoption. I was sort of hoping J/Jd might be vaguely similar - stable, performant, productive despite the small community. Then again, the code we wrote later got folded into a giant company and I think they eventually rewrote it in C++ bc nobody wanted to learn erlang...

prikolno - this is all great advice. I am into the idea of keeping things simple and directly reading a bunch of parquet files - there is a lot of flexibility in this (e.g. I can mirror the files in s3 if I later decide to move or create parallel workflows in spark) and it should be quite performant. To take it a step further, I could use apache arrow and mmap data to disk (or a subset of "hot" data).
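
In pyarrow terms that would look roughly like this (rough sketch; file name and columns made up):

# Rough sketch of the Arrow idea: keep hot data as Arrow IPC files on disk and
# mmap them, so the OS page cache does the work and nothing is copied until a
# column is actually touched. File name and columns are made up.
import pyarrow as pa
import pyarrow.ipc as ipc

with pa.memory_map("/hot/ticks_2020-08-03.arrow", "r") as source:
    table = ipc.open_file(source).read_all()   # zero-copy reads against the mapping
    px = table.column("px")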

maggette - point taken about spark, and this is consistent with my experience as well, particularly for batch ETL type pipelines; another thing that's sort of crazy about the spark/s3 setup is you can run multiple different clusters simultaneously against the same data like it's nothing.

svisstack - thanks for the pricing info; sometimes I also wonder if APL folks are playing an elaborate joke on everyone - it's amazing how disconnected the APL world seems to be from everything else.

EL - I've used fairly recent versions of Spark, and definitely take your point about cloud elasticity / low admin burden. The latency is what bothers me - perhaps I messed things up badly, although I did spend some time trying to optimize the query plan / caching / partitioning. pipe() on binaries and thinking about spark as a scheduler is interesting - I also noticed you can run stuff on GPUs if you provision them which could be useful.
Btw, although Spark is open source, AWS EMR and databricks runtimes are supposedly 10x faster and not contributed back. Also, debugging errors in Spark jobs on a cluster is a PITA unless you build or pay for (databricks) tooling to help aggregate and filter through logs efficiently.

Its Grisha


Total Posts: 56
Joined: Nov 2019
 
Posted: 2020-08-05 17:28
@prikolno
Thank you for the info! One more question on kdb, maybe you know.
For the in-memory component of kdb, how are multiple clients managed - do they all have access to the same shared memory? I imagine this starts to create crowding based on what different clients want to manipulate in memory and the temporary objects they might create.

Sandboxing, on the other hand, poses the issue that there are some tables you may want to keep in memory for all clients throughout the day. Any idea how they manage this?

procrastinatus


Total Posts: 8
Joined: Oct 2012
 
Posted: 2020-08-05 17:34
maggette - this sheds quite a bit of light for me about kdb being a tool that handles multiple domains of time/space.

prikolno - frankly, pandas sucks. It is such a memory hog that it exacerbates the gap where people run out of memory and move to spark/dask. The intermediate domain between these is where I often want something performant, but haven't found anything good. Maybe dask will get there eventually, but I think you just need to write your own stuff now.

prikolno


Total Posts: 55
Joined: Jul 2018
 
Posted: 2020-08-05 20:32
@Grisha No problem. kdb's architecture is probably lower level than what you're thinking. A kdb system is really a group of q processes that you coordinate. The q processes are like lego blocks. It is entirely up to you to architect your own strategy how to partition a table, point one or more q processes to serve those partitions, and how to load balance your client requests across those q processes. It also gives you low level messaging facilities to communicate both synchronously and asynchronously between these processes. So your answer will depend on your setup:

A single q process can be multithreaded by spawning secondary threads from a main thread. Each secondary thread maintains a copy of its own local variables and has access to global variables in the main thread's scope, since they share the same address space. Since queries are processed synchronously on the main thread and there are low level language constraints only allowing the main thread to mutate the global variables, by construction the language prevents race conditions and guarantees write consistency for a single q process.

When you have multiple q processes servicing the same data for clients, there are no concurrency primitives to synchronize I/O between those q processes - it behaves just like ordinary file I/O. This leaves it entirely up to you how to route queries to those q processes to avoid consistency issues, and there's different ways for doing this. (TLDR: The conventional design essentially creates partition-level locks at the load balancer level.) This is really rarely a problem for financial data though, most of the data is WORM and you almost never need to write to the same table partition on multiple clients. In any case, you can see the trade-offs made to achieve flat file-like performance.

Going back to your question, you can intuit from what I've described about the typical kdb load balancing strategy that the performance deteriorates as there is greater write contention, as you would expect of most databases.

prikolno


Total Posts: 55
Joined: Jul 2018
 
Posted: 2020-08-05 20:59
@procrastinatus Yes I think you get the gist of my proposal, glad you like it.

I find it's the intermediate domain that separates the top tier firms from the mid tier. Every top 50 HFT firm or quant manager whose name or abbreviation matches the regex /[A-Z]{3}/ has both a Spark/Hadoop/Databricks installation and a developer writing some one-shot .py scripts for a single machine. But the top 5 are just a bit better in the intermediate domain.

svisstack


Total Posts: 351
Joined: Feb 2014
 
Posted: 2020-08-05 22:02
"This will survive up till 1 rack of servers"

@prikolno: I don't see that - why?

I'm not excited about outsourcing the regression, or handing my threading over to the database guys, as I will certainly want control over all the fine details. Also, it's easy to move the computation to where the data is these days.

Another risk for kdb, apart from the lack of expertise in the job market, is that k/q development is heavily linked to Arthur's consciousness. These things usually grow in an open-source environment - even Microsoft bent the knee on this matter.

First Commander of the USS Enterprise

Maggette


Total Posts: 1237
Joined: Jun 2007
 
Posted: 2020-08-05 22:27
"That things usually grow in the open-source environment."

Does it? I am still sitting on the fence on that.

I contributed to Flink and Spark, and I can say it isn't a good sign when an idiot like myself is able to push his nonsense into products :).

I think people sometimes are not really aware that a lot of the "open source" stuff is driven by heavy company investment that vanishes at the whim of some managers.

Apache PIG anyone?

Flink is basically the guys from DataArtisans and Alibaba....and Alibaba recently bought DataArtisans. Kafka is more or less Confluent.

That said, I am pretty confident you can deliver the stuff kdb+ does for you on your own for less money.


I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

svisstack


Total Posts: 351
Joined: Feb 2014
 
Posted: 2020-08-05 22:37
@Maggette: yes, they do for sure - and if you can push and merge to master without any approvals, the code must be low quality and a mess; if you care about quality you don't let that happen even in a small closed dev team.

Actually, spark looks like a mess, with a lot of workarounds patched in/added later (a student started it, which also supports that hypothesis), but the technology is mature enough for sure and a lot of players are on board, so I don't see that happening (re: lack of contributions) - but maybe I'm wrong and you can add some color or a 30k-foot view here.

First Commander of the USS Enterprise