
Nonius
Founding Member
Nonius Unbound
Total Posts: 12778
Joined: Mar 2004
 
Posted: 2019-01-02 11:01
anyone used this before and have some thoughts/experience with it?

Ray

came across it after some frustration with distributing some computations in the cloud.

Chiral is Tyler Durden

Maggette


Total Posts: 1106
Joined: Jun 2007
 
Posted: 2019-01-02 11:55
Played around with it on tutorial level stuff. Liked it and think it is very promising. I think it might fill a gap.

+ IMHO really well designed for specific applications.
- not very mature framework (I don't know anybody who uses it in production)

For most ETL/data pipeline stuff I am really happy with the later versions of Spark and would really recommend using it. It's pretty battle tested. But it's obviously a bad choice if you are looking for some HPC kind of application or real-time/streaming stuff. It's JVM and needs a distributed file system (HDFS/S3).

Do you want to distribute because of the size of the data or because of computational restrictions (like in applications for simulations, numerics etc.)?

May I ask what the application is?

Regards

Edit: added link and another comment

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

Nonius
Founding Member
Nonius Unbound
Total Posts: 12778
Joined: Mar 2004
 
Posted: 2019-01-02 18:41
I want a framework for implementing Evolution Strategies without mountains of logistics/configuration etc. in launching a job. Our jobs are linked to marrying neural nets with backtesting. We have set up some HPCs in the cloud but I'm not super satisfied with them thus far.

As you mention, we are looking at Spark as well; we are testing Databricks.
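To make the workload concrete, here is a minimal sketch of an evolution strategy of the Gaussian-perturbation kind, using only the Python standard library. Everything in it is an assumption for illustration: `fitness` is a hypothetical stand-in for "train the net and run the backtest", the optimum at (3.0, -1.0) is made up, and the thread pool merely stands in for whatever cluster ends up evaluating candidates.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fitness(params):
    # Hypothetical stand-in for "train net + run backtest": higher is better,
    # with a made-up optimum at (3.0, -1.0).
    x, y = params
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)


def evolve(generations=200, population=40, sigma=0.1, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(generations):
            # Sample one Gaussian perturbation per population member.
            noise = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(population)]
            candidates = [[t + sigma * n for t, n in zip(theta, eps)] for eps in noise]
            # Each evaluation is independent; this map is the part a cluster would run.
            scores = list(pool.map(fitness, candidates))
            mean = sum(scores) / population
            # Move theta along the score-weighted average of the perturbations.
            for i in range(len(theta)):
                grad = sum((s - mean) * eps[i] for s, eps in zip(scores, noise))
                theta[i] += lr * grad / (population * sigma)
    return theta
```

The only step that needs distributing is the `pool.map` call; the update itself is cheap and only needs the scores back, which is why the amount of data shuffled per task stays small.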

Chiral is Tyler Durden

Nonius
Founding Member
Nonius Unbound
Total Posts: 12778
Joined: Mar 2004
 
Posted: 2019-01-02 22:56
Databricks is actually pretty cool

Chiral is Tyler Durden

Maggette


Total Posts: 1106
Joined: Jun 2007
 
Posted: 2019-01-03 06:59
I somehow seem to remember that you guys are an Azure shop?

When it comes to "industry ready" distributions in the Hadoop/Spark/HBase/Kafka ecosystem, I liked MapR (https://mapr.com/products/mapr-distribution-including-apache-hadoop/). Not quite as awesome, but still very solid, was Hortonworks HDP.

In both cases my customers have used these distributions (one uses MapR, the other HDP) for several years now, and we have some real-time streaming applications and lots of batch processing (only several TB per day). And I can't complain. We more or less DevOps the stuff on our own.

You always lag in the versions of the software when using these distributions, but it really makes configuring that stuff way easier.

But again, for your problem (marrying evolution strategies with backtest results... you are wandering in dark places, sifu), you might reconsider the whole Apache/Hadoop world. It is solid parallel computing, but it's not always fast.

jslade will probably come to support this position with his usual (and well deserved) Hadoop bashing.


I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

Nonius
Founding Member
Nonius Unbound
Total Posts: 12778
Joined: Mar 2004
 
Posted: 2019-01-03 12:01
hahah, "you guys?"

we are an AWS shop....(kids nowadays, they hate MSFT etc, whilst I love C#).

for some things we use MapR. but not for this.

I also want some relatively general tools for distributing some computations without the need to do, as the Ray guys mention, a lot of "software engineering", which I'm not very good at.

Chiral is Tyler Durden

Maggette


Total Posts: 1106
Joined: Jun 2007
 
Posted: 2019-01-03 13:02
I am offended:) I'll take C# over the JVM shit all day. Scala made the playing field a bit more even....but still. Loved C# and in general think that the MSFT universe is a good place to be for many companies.

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

EspressoLover


Total Posts: 368
Joined: Jan 2015
 
Posted: 2019-01-06 17:32
I think Spark's the right choice. Ray's basically designed to be Spark++ in the same way that Spark is Hadoop++. That being said, it doesn't really have any lower "software engineering" overhead than Spark, and it's a much less mature, much less widely used product. The main point is to make distributed tasks have less overhead and support more nested dependencies. Basically, it bridges some of the gap for workflows that currently live in MPI because they don't cleanly fit into the RDD/DAG framework.

One thing to consider is maybe all you need is a cluster scheduler, instead of a full-fledged distributed computing platform. The relevant question is how compute- vs. data-intensive your individual task workloads are. If it's the former, you can basically abstract away any considerations about data locality. Just pick some centralized store (S3, NFS, Redis, etc.), launch the tasks, grab the inputs, then write back the outputs to the datastore. If your tasks do a lot of compute, and don't shuffle that much data, then the bandwidth+IO inefficiency is de minimis.

In which case you can treat slave nodes as fungible resources. Just launch tasks on any node with room as needed. (Plus maybe some resiliency support if your cluster's large enough that the probability of node failure during a job is more than epsilon.) There's plenty of options in this space: Slurm, Mesos, even Kubernetes these days. And after initial setup, they pretty much "just work" as a transparent layer. Set it and forget it.
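A rough sketch of that pattern, under loud assumptions: a temp directory stands in for the centralized store (S3, NFS, Redis), a thread pool stands in for the cluster nodes, and the `in_*.json`/`out_*.json` naming and the sum-of-squares "compute" are made up for illustration. A real setup would hand `run_task` to Slurm, Mesos or Kubernetes instead.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_task(store: Path, task_id: str) -> str:
    """Grab the inputs from the shared store, compute, write the output back."""
    spec = json.loads((store / f"in_{task_id}.json").read_text())
    # Hypothetical compute-heavy step; a real task would be a backtest or simulation.
    result = sum(i * i for i in range(spec["n"]))
    (store / f"out_{task_id}.json").write_text(json.dumps({"result": result}))
    return task_id


def run_job(sizes):
    # A temp directory stands in for S3/NFS/Redis; threads stand in for nodes.
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        task_ids = [str(i) for i in range(len(sizes))]
        for tid, n in zip(task_ids, sizes):
            (store / f"in_{tid}.json").write_text(json.dumps({"n": n}))
        # Tasks share nothing except the store, so any free worker can take any task.
        with ThreadPoolExecutor(max_workers=3) as pool:
            done = list(pool.map(lambda tid: run_task(store, tid), task_ids))
        return [
            json.loads((store / f"out_{tid}.json").read_text())["result"]
            for tid in done
        ]
```

Because a task touches only its own input and output keys, retrying a failed task is just running `run_task` again with the same id, which is most of the resiliency support a small cluster needs.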

But if your workflow requires data locality awareness, then that's a whole 'nother can of worms. Unfortunately there's really no way to just abstract away the distributed layer. You'll always have to spend some mindshare on how the underlying system operates, either because the platform restricts you to a non-generalized compute paradigm, like MapReduce, and/or because it gives you enough rope to hang yourself, like Spark lineages growing unbounded inside iterative algorithms.

Distributed computing is hard. Even on a theoretical level. Trying to pick one platform to rule them all is probably a futile effort. It'd be nice if a single option would cover all possible use cases that we could imagine. But it won't happen. The better approach is to figure out the requirements of your current workloads and select the best framework(s) suited to that. With full awareness that the choice may need to be re-evaluated in the not-too-distant future.
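To illustrate the lineage point, here is a toy model in plain Python, not Spark: `LazyData` is a hypothetical class that, like an RDD, only remembers how to recompute itself from its parent. Each loop iteration adds one link to the chain, and a checkpoint materializes the data and cuts the chain. In Spark itself the corresponding tool is checkpointing (e.g. `RDD.checkpoint()`); the class and function names below are assumptions made for the sketch.

```python
class LazyData:
    """Toy lazy dataset: remembers how to recompute itself, like an RDD lineage."""

    def __init__(self, compute, parent=None):
        self._compute = compute
        self.parent = parent

    @classmethod
    def from_list(cls, values):
        values = list(values)
        return cls(lambda: values)

    def map(self, fn):
        # Nothing runs yet; we only record one more step in the chain.
        return LazyData(lambda: [fn(x) for x in self.collect()], parent=self)

    def collect(self):
        # Recomputes the whole chain from the root every time it is called.
        return self._compute()

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth

    def checkpoint(self):
        # Materialize now and drop the parents, so the chain restarts at depth 0.
        return LazyData.from_list(self.collect())


def iterate(data, steps, checkpoint_every=None):
    for step in range(1, steps + 1):
        data = data.map(lambda x: x + 1)
        if checkpoint_every and step % checkpoint_every == 0:
            data = data.checkpoint()
    return data
```

Without `checkpoint_every`, the depth equals the number of iterations, so every `collect()` replays the entire history; with it, the depth never exceeds the checkpoint interval while the results stay the same.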

Good questions outrank easy answers. -Paul Samuelson

Nonius
Founding Member
Nonius Unbound
Total Posts: 12778
Joined: Mar 2004
 
Posted: 2019-01-07 03:59
Thanks espresso. In fact we are probably going with Spark.

Will continuously monitor Ray's progress though.

Chiral is Tyler Durden