How to use "Machine Learning" to build a trading system?
     

vx2008


Total Posts: 11
Joined: Jan 2016
 
Posted: 2016-03-18 07:22
I have been studying machine learning lately, after taking some tutorial courses on it, but I still don't know exactly how to apply it to trading stocks or futures. I want to learn how to use machine learning in trading, so should I get some tutorial material on that, especially in Matlab and Python, such as some books?
Thank you!

Nonius
Founding Member
Nonius Unbound
Total Posts: 12736
Joined: Mar 2004
 
Posted: 2016-03-19 10:34
I used to use a pretty decent unsupervised learning method called PCA. It "clusters" "points" along a hyperplane.

Chiral is Tyler Durden

Hansi


Total Posts: 300
Joined: Mar 2010
 
Posted: 2016-03-19 13:35
Can't comment on the quality, but this popped up in my RSS reader just after I saw this post: http://www.quantinsti.com/blog/free-resources-learn-machine-learning-trading/

ElonMust


Total Posts: 3
Joined: Sep 2015
 
Posted: 2016-03-25 13:18
This article makes some points against it: http://www.priceactionlab.com/Blog/2016/03/machine-learning-rediscovered/

Although the author uses data mining himself, he calls it "deterministic".

http://www.priceactionlab.com/

I suspect he means that the rules in the data mining are dependent, and that this limits the data-mining bias.

Question

How important is reproducibility in the case of data mining/machine learning? I know it is extremely important in the physical sciences.


EspressoLover


Total Posts: 320
Joined: Jan 2015
 
Posted: 2016-03-25 16:51
> This article makes some points against it: http://www.priceactionlab.com/Blog/2016/03/machine-learning-rediscovered/

> "Attempting to discover trading algos via machine learning is a dangerous practice because of the problem of multiple comparisons."

Nope

> But when many independent tests are performed, the error gets out of control and significance at the 5% level does not suffice for rejection of the null hypothesis.

Nope

> In practice, machine learning works by combining indicators with entry and exit rules

That's not how it works at all.

> [T]his process... was abandoned in favor of other more robust methods, such as the testing of unique hypotheses based on sound theories of market microstructure and order book dynamics. These approaches also led to the discovery of HFT algos for market making and statistical arbitrage.

That's not even close to being right.


Good questions outrank easy answers. -Paul Samuelson

jslade


Total Posts: 1132
Joined: Feb 2007
 
Posted: 2016-03-25 18:13
To be fair, EL, some of those statements are roughly correct taken on their own, though the article taken as a whole is complete bullshit. In such threads, I'm occasionally tempted to mention the uplifting insights of my friend Lars, but that would be marketing.

"Learning, n. The kind of ignorance distinguishing the studious."

Mat001


Total Posts: 33
Joined: Aug 2014
 
Posted: 2016-03-25 18:22
"Nope"

https://en.wikipedia.org/wiki/Multiple_comparisons_problem

See also "Evaluating Trading Strategies" by Harvey and Liu, or provide justification.

"Nope"

See above wiki link or provide justification

"That's not how it works at all."

AFAIK, everyone is doing it this way, e.g. https://www.youtube.com/watch?v=V7UGqi83iJw&utm

All the software I have tested does this. If you know a different way, other than NNs, please post it; I would be interested.

If you know of successful ML applications please let us know.

a路径积分


Total Posts: 80
Joined: Dec 2014
 
Posted: 2016-03-25 19:15
Joining this thread to fan the flames, don't mind me.

>In such threads, I'm occasionally tempted to mention the uplifting insights of my friend Lars, but that would be marketing.

Lars? Are you referring to the same Lars who was at Tower - GS now I believe?

jslade


Total Posts: 1132
Joined: Feb 2007
 
Posted: 2016-03-25 23:13
Never hoyd of da bum.
Maybe someone should send OP the NP dispersion trading whitepaper.

"Learning, n. The kind of ignorance distinguishing the studious."

EspressoLover


Total Posts: 320
Joined: Jan 2015
 
Posted: 2016-03-26 07:17
@Mat001

I don't disagree with the assertion that multiple hypotheses require higher p-thresholds. What I disagree with is that this is in any way a problem specific to machine learning. If anything, this is far more of an issue in classical statistics than in ML. One of the central tenets of ML is that computer cycles are cheap, and statisticians who derive closed-form solutions to weird distributions are expensive.

It's easier to measure error by simply sub-dividing the data umpteen ways, re-running the entire fit process on each sub-division, and directly sampling out-sample performance. You can go hog-wild on an insane number of parameter and meta-parameter degrees of freedom, without worrying about multiple hypothesis testing. A cross-validated ML system always tests a single hypothesis: the out-sample error rate. Goodness-of-fit is completely abstracted from the fit algorithm.
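To make that concrete, here's a minimal sketch of the workflow in Python with scikit-learn. It's purely illustrative: the data is random noise standing in for features and returns, and the random forest is just an example of an arbitrarily complex fit.

    # Estimate out-of-sample error directly by re-running the whole fit
    # on each sub-division, instead of deriving a closed-form significance test.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))   # stand-in features
    y = rng.normal(size=1000)         # stand-in returns (pure noise here)

    oos_mse = []
    for train_idx, test_idx in KFold(n_splits=5).split(X):
        model = RandomForestRegressor(n_estimators=100)
        model.fit(X[train_idx], y[train_idx])    # full refit per fold
        pred = model.predict(X[test_idx])
        oos_mse.append(np.mean((pred - y[test_idx]) ** 2))
    print("mean out-of-sample MSE:", np.mean(oos_mse))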

In contrast, with classical statistics you're only fitting once on the entire data set. You derive some sort of probability distribution for the parameter estimate errors. Then you use that (possibly with some Bayesian prior) to determine the significance of the fit. As the fitted model becomes increasingly complex, the fragility and intractability of the significance test get out of control. At a certain complexity, it's almost guaranteed that some subtle assumption is violated. The p-values become increasingly meaningless. Good luck trying to come up with an a priori significance test on the parameters of a random forest.

Of course, overfitting bias isn't impossible with ML. That is, if you as the researcher are repeatedly data-mining the same data: you try some method on the data, look at the result, try another method, and repeat until you achieve an acceptable result. But that's nothing specific to ML; the very same pitfall is present in classical statistics. In fact, in ML, if you were foresightful, you could encode this meta-logic in your system. If you plan on using methods A, B and C and picking the best performer, then simply apply that selection rule inside the cross-validation. Voila, you've just eliminated the multiple hypothesis bias with a simple wrapper function (see the sketch below). Applying the same logic in classical statistics is much harder, largely because of the possible dependence between the estimation error distributions of methods A, B and C.
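A toy version of that wrapper, again with scikit-learn; the three candidate regressors are just stand-ins for methods A, B and C:

    # Encode "try A, B, C and keep the best" inside the fit itself, so the
    # outer cross-validation also prices in the model-selection step.
    import numpy as np
    from sklearn.base import clone
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.model_selection import KFold, cross_val_score

    def fit_best_of(X, y, candidates):
        # inner CV picks among candidates using the training data only
        scores = [cross_val_score(clone(m), X, y, cv=3).mean() for m in candidates]
        best = clone(candidates[int(np.argmax(scores))])
        return best.fit(X, y)

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(600, 5)), rng.normal(size=600)
    candidates = [LinearRegression(), Ridge(alpha=1.0),
                  GradientBoostingRegressor(n_estimators=50)]

    outer_mse = []
    for tr, te in KFold(n_splits=5).split(X):
        model = fit_best_of(X[tr], y[tr], candidates)   # selection happens in-fold
        outer_mse.append(np.mean((model.predict(X[te]) - y[te]) ** 2))
    print("out-of-sample MSE, model selection included:", np.mean(outer_mse))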

> AFAIK, everyone is doing it this way... If you know of successful ML applications please let us know.

ML algorithms usually don't work out of the box in quant trading, for two reasons. First, the EMH means that the signal-to-noise ratio in financial markets is extremely low. For example, deep learning is awesome in a lot of contexts, but it's extremely hard to get right in a market context. If I'm trying to predict which pictures are of cute cats, the relationship between the raw pixel values and the image classification may be extremely complex. But it still exists, and the former strongly predicts the latter. We know this because humans can correctly classify cute cat pictures 99%+ of the time.

In contrast, even the best possible alpha still predicts far less than 1% of the variance of major asset returns (except at extremely short horizons). The corollary is that the vast majority of the information in any indicator is worthless for predicting returns. Deep learning relies heavily on encoding the structure of the independent variables before even trying to predict the dependent variable. In trading, that makes it easy to fall into equilibria where the encoder thinks it's doing a great job, but it's still entirely missing the relevant 1% of information.

I use deep learning as an example because it starkly lays out the contrast between traditional ML problem domains and quant trading. But the same issues still apply, to some extent, in nearly every ML model. Which brings us to the second challenge of using ML in quant trading: almost all algorithms are designed for classification, not regression.

With good reason: classification is a simpler problem domain than regression. Often something that works well in classification will fail miserably in regression, or at least need some major tweak or layer added to it. Trading is fundamentally about the expected return to an asset. If I'm considering buying hog futures today, I want to know how much I can expect to make. Classifying whether an asset goes up or down could be useful, but frequently isn't. 80% of options go to zero; a classifier would tell you that selling puts is the best investment ever.

You frequently see people, like the guy in the video you linked, try to kludge a solution by using a classifier to determine exit/entry points. That is a very bad approach. It stems from day traders erroneously thinking in terms of exit/entry rather than expected returns and risk. A properly designed alpha system should be completely abstracted from the trading logic. To work with ML methods, that frequently means making changes to get them working in a regression context. That's not a trivial task.

Good questions outrank easy answers. -Paul Samuelson

tbretagn


Total Posts: 259
Joined: Oct 2004
 
Posted: 2016-03-26 21:59
To be fair, I'm not familiar with ML, but I have the feeling you could train it to spot trend-following systems' behaviour (in a broad sense).
But I agree with EL. I can't see a tool "explaining" (in the sense that it can predict and monetize) the market reaction we had to the last ECB meeting, for example.

And even if it is not true, one must believe in ancient history

levkly


Total Posts: 28
Joined: Nov 2014
 
Posted: 2016-03-27 07:19
EspressoLover,

1. About CV: do you randomly divide the time series into N folds and perform CV for each variation?
2. How much history do you use for that?
3. I retrain the ML very frequently in order to adapt to the changing market. How do you take that into account in the CV and the history length?

ElonMust


Total Posts: 3
Joined: Sep 2015
 
Posted: 2016-03-27 14:56
Espressolover:

When you make statements like this:

"Trading is fundamentally about the expected return to an asset."

Nope

We get the idea that you are unfamiliar with the subject.

That is not what trading is fundamentally about. Not even close. Maybe for some speculators.

Now, this immortal fable:

" It stems from day traders erroneously thinking in terms of exit/entry rather than expected returns and risk."

Nope

You need exit/entry to pocket return. Do you know this? It boils down to this.

Day traders are way more advanced than your college statistics.

Bye


akimon


Total Posts: 566
Joined: Dec 2004
 
Posted: 2016-03-27 15:53
If you train a machine learning system on historical JGB prices, it'll just tell you to buy.

tabris


Total Posts: 1255
Joined: Feb 2005
 
Posted: 2016-03-28 04:00
akimon:

And why would that be wrong! *evil smile*

Dilbert: Why does it seem as though I am the only honest guy on earth? Dogbert: Your type tends not to reproduce.

akimon


Total Posts: 566
Joined: Dec 2004
 
Posted: 2016-03-28 06:27
Nothing wrong with it. :-)

Sometimes Machines, rather than Gaijins, have an easier time learning the age-old advice of the JGB market.

A wise man had once told me never to be short JGB futures.


EspressoLover


Total Posts: 320
Joined: Jan 2015
 
Posted: 2016-03-28 21:01
@levkly

1) You want to make sure that there's little to no correlation in residuals across separate CV bins. Think about keeping contiguous points in atomic groupings; that is, make sure every point in an atom ends up in the same bin. For example, say you're training 24-hour returns on the previous 24 hours' returns. You don't want 10:00 and 11:00 from the same day to be in different bins: they're going to have heavy residual correlation because they share 23 of their 24 hours. If cases like this frequently wind up in separate bins, then the CV error will significantly under-estimate the true out-sample error. In this particular case you probably want to atomically group at the granularity of a month or coarser. If you're looking at a pure intraday strategy, with no overnight returns or indicators, you can group at the level of the trading day. (A sketch of this grouping follows below.)
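One way to enforce that kind of atomic grouping, as a sketch with pandas/scikit-learn (monthly atoms, random stand-in data):

    # Group samples by month so no month is ever split across CV bins;
    # GroupKFold never puts members of the same group in different folds.
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import GroupKFold

    dates = pd.date_range("2003-01-01", periods=2500, freq="B")  # ~10y of business days
    X = np.random.normal(size=(len(dates), 8))   # stand-in features
    y = np.random.normal(size=len(dates))        # stand-in returns

    groups = dates.to_period("M")                # month labels are the "atoms"
    for tr, te in GroupKFold(n_splits=5).split(X, y, groups=groups):
        assert not set(groups[tr]) & set(groups[te])  # no month straddles bins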

2) It depends on the strength of the signal you're fitting (weaker signals require more history) and its horizon (shorter usually allows for less history). Most times it's better to use more history than less. Yes, regimes change, but the explanatory power from larger training sets usually outweighs the difference. Plus regimes may also change in the future, so training across different historical periods can increase robustness (at the cost of reducing regime-specific fit).

3) I'd say, if you have the development time and computing resources, treat history length and retraining frequency as more meta-parameters to be selected in cross-validation. Try multiple history lengths and rolling re-train schedules and select what works best.
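For instance, a sketch of a walk-forward grid over those two meta-parameters (stand-in data and model; the grid values are arbitrary):

    # Walk-forward error over a grid of (history length, retrain interval).
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(2500, 8)), rng.normal(size=2500)

    def walk_forward_mse(window, retrain_every):
        errs, model = [], None
        for t in range(window, len(y)):
            if model is None or (t - window) % retrain_every == 0:
                model = Ridge().fit(X[t - window:t], y[t - window:t])  # rolling refit
            errs.append((model.predict(X[t:t + 1])[0] - y[t]) ** 2)
        return np.mean(errs)

    grid = [(w, r) for w in (250, 500, 1000) for r in (1, 5, 21)]
    best = min(grid, key=lambda p: walk_forward_mse(*p))
    print("selected (history length, retrain interval):", best)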

As with anything, YMMV. Frequently the answer is highly dependent on the data and the model being used. Developing an intuitive understanding of both is really one of the most important aspects of quant research.

@ElonMust

"[Expected return] is not what trading is fundamentally about..."

Whatever, dude. This statement is so incorrect on a basic level that I'm not even going to bother disputing it. I'm trying to help you, by pointing out how something you read is misleading. If you want to ignore what people in the industry actually do, and instead read the next Seeking Alpha article about Fibonacci sequences, then go for it.

Every major quant shop in the world abstracts out alpha and monetization. RenTech, KCG, PDT, Citadel, Jump, Teza, AQR, Two Sigma, Tower, and HRT all start by building models that predict expected return. But you could ignore what the best trading shops in the world do, in favor of a blog written by a charlatan whose only expertise lies in selling worthless, overpriced software to gullible day traders.

Good questions outrank easy answers. -Paul Samuelson

levkly


Total Posts: 28
Joined: Nov 2014
 
Posted: 2016-03-28 22:36
Thanks EL,

If I have 10 years of historical data and divide it into, for example, monthly bins, I have 120 bins. On these bins I perform CV.
In your first post you said that every time you perform CV you choose different bins (randomly), otherwise you will burn your data within more than a few retries.
How do you randomly choose different bins every time?

I don't have a problem with computing resources; I retrain daily on a sliding window with a long history, and the retraining handles the non-stationarity and regime changes.
The problem with this method is high variance, because I validate only on the next day's data.

goldorak


Total Posts: 1042
Joined: Nov 2004
 
Posted: 2016-03-29 06:55
> I don't have a problem with computing resources; I retrain daily on a sliding window with a long history, and the retraining handles the non-stationarity and regime changes

What about the non-stationarity of the sliding window's length parameter?

If you are not living on the edge you are taking up too much space.

levkly


Total Posts: 28
Joined: Nov 2014
 
Posted: 2016-03-29 08:20
What do you mean? The length can be fixed or dynamic (concept drift).
You can calibrate it.

goldorak


Total Posts: 1042
Joined: Nov 2004
 
Posted: 2016-03-29 08:46
Well, this is what I mean. I hope for your sake it IS dynamic, and that you recalibrate the window length daily, just like the stuff you're retraining daily based on that particular window length.

If you are not living on the edge you are taking up too much space.

EspressoLover


Total Posts: 320
Joined: Jan 2015
 
Posted: 2016-03-29 10:32
> If I have 10 years of historical data I divide it for example for monthly bins, I have 120 bins.

Well, you have 120 "atoms", but that doesn't mean you have to use 120 bins. There are tradeoffs in how many bins to use in CV, particularly when it comes to the variance of the estimate of out-sample error. You should google this topic and check it out for yourself. But say you decide to use 5 bins, and your data starts in 2003. You could assign Jan-2003 to bin 1, Feb-2003 to 2, Mar-2003 to 3, Apr-2003 to 4, May-2003 to 5, June-2003 to 1, July-2003 to 2, etc.
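In code, that round-robin assignment is a one-liner (sketch, using 0-indexed bins):

    # 120 monthly "atoms" assigned round-robin into 5 CV bins
    import pandas as pd
    months = pd.period_range("2003-01", periods=120, freq="M")
    bins = {m: i % 5 for i, m in enumerate(months)}  # Jan-2003 -> 0, ..., Jun-2003 -> 0 again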

> In your first post you said that every time you perform CV you choose different bins (randomly), otherwise you will burn your data within more than a few retries.

I think I must have mis-communicated my point. Data gets "burned", in a sense, anytime you or your software makes a choice about your model or model parameters based on the results from a previous analysis on the same data. Any time you interact with the data more than once, you're potentially biasing your error estimates. Binning the data in a different way doesn't really fix this problem.

Say you decide to try a linear regression and look at the results. You're disappointed, so you try a Gaussian process instead. The results look good and you decide to go with the GP. In some sense you're probably biasing your error: if the Gaussian process results had been unacceptable, you would probably have tried something else, so the performance of your final model is upwardly biased by this sort of online selection process.

That being said, if you try linear regression and then decide to try a GP, you have two options. You could just fit the Gaussian process directly and compare the results. Or you could encode the selection between the two within each run of the CV. On certain runs, regression might get selected, so this captures some of the noise associated with model selection. Even so, there is still some final bias, because you did look at the aggregate performance of the regression.

Obviously, interacting with the data is inevitable in any sort of serious endeavor. Given that, it's a good idea to keep some of the data in total reserve. Don't use it, backtest on it, or even look at it until you've completely decided on a final model that you're happy with. Then even if you are introducing bias into your error, you're not too worried, because there's still untouched data left to get a true out-sample estimate.
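Mechanically that can be as simple as this sketch (the file names are hypothetical):

    # Lock away the most recent slice before any research touches it.
    import numpy as np
    X, y = np.load("features.npy"), np.load("returns.npy")  # hypothetical inputs
    cut = int(len(y) * 0.8)
    X_research, y_research = X[:cut], y[:cut]  # everything you iterate on
    X_reserve, y_reserve = X[cut:], y[cut:]    # untouched until the final model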

Good questions outrank easy answers. -Paul Samuelson

Lebowski


Total Posts: 66
Joined: Jun 2015
 
Posted: 2016-03-29 15:53
Hi @EspressoLover, thanks for all the thankless anonymous effort you're putting in helping people all over this site. Your breadth of knowledge is impressive. It's great for "two ears, one mouth" folks like me who are here to learn and actively seek out your posts because they're worth a read.

I have a question about CV that might be so basic as to detract from this thread. In class, when we covered CV, we of course studied how the choice of K (in K-fold CV) impacts the bias/variance trade-off. In general, it does appear that a higher K is the way to go (to a fault), but a concern I had that would pertain (if founded) directly to applying CV to time series/trading is this:

So suppose @levkly has 10 * 252 ~ 2500 data points over ten years. Maybe with 120 bins it isn't a problem, but if we chop up a time series too much, wouldn't that begin to destroy the serial correlation of returns at some point?


Thanks

EspressoLover


Total Posts: 320
Joined: Jan 2015
 
Posted: 2016-03-30 03:07
@Lebowski

Thanks, man! It's nice to know the effort's appreciated.

1) If you're targeting some sort of serial correlation, I'd definitely advocate making sure your "atomic groups" are significantly larger than the correlation horizon you're looking at. This not only minimizes how much the training sets are chopped up, but also avoids potential correlation in the residuals across the bins.

The latter happens because consecutive points tend to have very similar long-run history. If you're looking at lagged returns from the previous month, consecutive days will tend to have very similar independent variables. Dividing your data set into every odd and even day doesn't really produce uncorrelated data sets.

Let's say the longest-horizon lookback you're using is a week. If you atomically group at the level of 3 months, the effect of binning is pretty minimal: only 6% of your points will touch any history outside the bin. If you had a lot of history, you could set the atom size to 1 year, and that falls to 1.5% of points. Assign atoms to bins, not individual points. E.g. if Jan 2010-Mar 2010 is one atom, then every point in this range always gets the same bin value.

2) The chopping effect becomes less, not more, of a problem with more bins. Remember you're training on the complement of the bin. Say we have 100 points and are using 2 bins (disregarding atomic grouping for the example). [1,3,5,7,...] forms bin 1, so you train on points [2,4,6,8,...]. Every point is 'chopped' in the sense of missing its neighbor.

Now say we go the other extreme and use one-against-all (i.e. 100 bins). Bin 57 comprises [57], and its training set is [1,2,3,...,56,58,...]. Out of the 99 training points only 2 are chopped in the sense of missing their neighbors.

Good questions outrank easy answers. -Paul Samuelson

Lebowski


Total Posts: 66
Joined: Jun 2015
 
Posted: 2016-03-30 17:01
@EL

Good point about training on the complement of the bin; I guess I had it backwards in that sense, now that it's presented this way. That being said, as you pointed out towards the end of 1), you'll want to be conscious of your bin size vs. lookback period. Thanks. It's great to hear about some of the practical considerations of applying these ideas from class to trading.

Here's another model validation question, and the answer may not be ML related at all. How does one go about determining the performance of an options pricing model? Is MSE sufficient to capture all the dynamics of an option model, or do quants in practice adjust somehow for known faults of these models when they go to do this sort of thing? For example, it's well known (enough for them to mention in class) that certain stochastic vol models such as Heston overprice OTM options and underprice ITM options. Is there a way one could discount for this phenomenon during model validation?

Say I'm an active trader, maybe trying to make markets in some options chain. Presumably (I'm fairly new to this) I'm interested in staying closer to the money, because a) more volume = more fills = more spread captured, and b) if I sell a .50 delta, I'd rather buy a .40 delta to hedge than a .04 delta. Maybe a trader of this ilk would want to penalize errors near the money more than ITM/OTM, for example?

Maybe I just need to pick up a good book on option pricing. I haven't read them cover to cover, but neither our textbook nor Natenberg seems to really address how to actually determine the performance of a pricing model, from what I could find. Again... very, very new to this; I realize there are some people on NP who have been doing options pricing for a while who will probably be shaking their heads reading this.