Hyperparam tuning of large models for timeseries
     

dlwlrma


Total Posts: 15
Joined: Jul 2019
 
Posted: 2021-07-23 19:02
I've been building models using more modern ML techniques that have a lot of hyper-parameters, which need to be searched in some way for good convergence and generalization.

Typically I use a walk-forward approach / rolling calibration for training my models (i.e. after a fixed amount of time, I retrain the model using data from a fixed look-back period) and stick to a reasonably stable, fixed set of hyper-parameters across every training window.
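For concreteness, the loop looks roughly like this (a toy sketch, not my actual pipeline - `fit_model`, the hyperparams dict and the window sizes are placeholders):

```python
import pandas as pd

def walk_forward_predict(data: pd.DataFrame, lookback: int, refit_every: int,
                         fit_model, hyperparams: dict) -> pd.Series:
    """Refit on a fixed look-back window every `refit_every` rows,
    then predict the next block strictly out-of-sample."""
    preds = []
    for start in range(lookback, len(data), refit_every):
        train = data.iloc[start - lookback:start]        # fixed look-back window
        test = data.iloc[start:start + refit_every]      # next block, after training
        model = fit_model(train, **hyperparams)          # same hyperparams every window
        preds.append(pd.Series(model.predict(test), index=test.index))
    return pd.concat(preds)
```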

I notice that there can sometimes be a lot of performance left on the table due to un-optimized early stopping, model capacity that is too large or too small, etc., since a different set of hyper-params might give better out-of-sample performance on each training window (i.e. if I retrain a model every quarter, the optimal hyper-params for predicting Q1 vs Q2 might be different). Part of the reason the models can be so sensitive to hyper-params, especially early stopping / the max number of training iterations, is that they depend on random initializations / random seeds (e.g. weight initialization in a NN).

The typical approach to optimizing the hyper-params is to use a holdout / validation set: search the hyper-param space on the train set, pick the configuration that gives the best results on the val set, and use that model in production.

Unfortunately there are several issues when it comes to my data (and I suspect this is common for anyone using ML to do quant trading).

1) if I randomly pick some of the training data as the val set, the val set lies on the same manifold as the train set, since not all features are particularly fast moving - so the val results are very similar to what I'd get by just running the model on the training set, and give no indication of generalization

2) if I choose just the last X% of the data as the validation set, the slow features still make the validation set fairly similar in nature to the train set, so the val results are inflated relative to the test results and still don't predict generalization very well (although this is better than the above)
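To make the two schemes concrete (a toy sketch - the point is just that the random split keeps the slow features on the same manifold):

```python
import numpy as np

def random_split(X, y, val_frac=0.2, seed=0):
    # scheme 1: random rows as validation -- val lies on the same manifold as train
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * (1 - val_frac))
    tr, va = idx[:cut], idx[cut:]
    return X[tr], X[va], y[tr], y[va]

def temporal_split(X, y, val_frac=0.2):
    # scheme 2: last X% as validation -- better, but slow features still leak
    cut = int(len(X) * (1 - val_frac))
    return X[:cut], X[cut:], y[:cut], y[cut:]
```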

Maybe this is some secret quant sauce but I'm curious whether any of you guys have any tips or ideas about this!

gnarsed


Total Posts: 95
Joined: Feb 2008
 
Posted: 2021-07-28 20:16
i'm not sure i fully appreciate the second concern. if there is a genuine effect whereby performance is better closer to the training set, you can just re-fit very frequently and reap the benefits.

dlwlrma


Total Posts: 15
Joined: Jul 2019
 
Posted: 2021-07-29 01:23
Maybe I didn't explain properly.

Let's say you have time periods A,B,C,D,E (in order). Let E be the purely out of sample data.

You could train different configurations of your model on A,B,C and pick the one that performs best on D. However, that configuration is unlikely to perform well on E.

You could alternatively just train on all of A,B,C,D - but then you have no validation set on which to try different configurations. You would probably get good performance on E using a reasonable configuration, but you miss out on being able to optimize that configuration.

And just to be clear, by configuration I mean a set of hyper-parameters.
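In code, the two options look roughly like this (sketch only; `fit`, `score` and the period datasets are placeholders, assumed concatenable with +):

```python
def tune_then_test(A, B, C, D, E, configs, fit, score):
    """Option 1: train on A,B,C, pick the hyper-parameter set that scores best
    on D, then measure that single configuration out-of-sample on E."""
    train = A + B + C
    best_cfg = max(configs, key=lambda cfg: score(fit(train, cfg), D))
    return score(fit(train + D, best_cfg), E)   # refit with D included, judge on E

def fixed_config_test(A, B, C, D, E, default_cfg, fit, score):
    """Option 2: skip tuning, train on all of A,B,C,D with one 'reasonable' config."""
    return score(fit(A + B + C + D, default_cfg), E)
```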

Maggette


Total Posts: 1319
Joined: Jun 2007
 
Posted: 2021-07-29 10:57
Well. Everybody who works with non-stationary data struggles with your problem.

(Hyperparameter) calibration and model validation become hard...if not impossible. I doubt there is an optimal solution or secret sauce to solve that.

The question is: why does a model trained on A,B,C and optimized on D not generalize to E?

If your model trained on A,B,C performed well on D but E is nothing like D, then either
- E is also nothing like A, B or C, in which case nothing you do on A,B,C,D will improve your performance on E,
- or you over-fitted on D.

"You would probably get good performance on E from using a reasonable configuration but you are missing out on being able to optimize your configuration."
Sure. And how could that be otherwise. You train for some general thing that is hidden in A,B,C,D and E.

I understand A,B,C,D and E are in temporal order. But depending on the frequency of your data and your trading frequency (say intra-day strats), I would really consider committing the cardinal sin of splitting each of A,B,C,D AND E into train, test and validation, and then either:
- train a model that generalizes well over all of A,B,C,D and E,
- or optimize a model for each and find a way to detect whether you are in state A,B,C,D or E (e.g. paper trade all the models and switch depending on performance, or trade all of them and dynamically adapt the weights/allocation - see the sketch below).

I guess that's not really helpful, but it's a hard problem and I don't think there is a one-size-fits-all solution.
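A rough sketch of the dynamic-allocation variant (everything here is made up for illustration; `recent_pnls` would be the recent paper-trading P&L of each regime-specific model):

```python
import numpy as np

def dynamic_weights(recent_pnls: dict, temperature: float = 1.0) -> dict:
    """Allocate more weight to whichever regime-specific model has been
    working lately, via a softmax over crude rolling Sharpe ratios."""
    names = list(recent_pnls)
    sharpes = np.array([np.mean(recent_pnls[n]) / (np.std(recent_pnls[n]) + 1e-9)
                        for n in names])
    w = np.exp(sharpes / temperature)
    w /= w.sum()
    return dict(zip(names, w))

# e.g. dynamic_weights({"model_A": pnl_a[-60:], "model_B": pnl_b[-60:]})
```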

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

ronin


Total Posts: 684
Joined: May 2006
 
Posted: 2021-07-29 12:02
@dlwlrma,

You talk about "optimizing" a strategy like it's a good thing. It's not.
"Optimized" = "overfitted".

Either the strategy works, or it doesn't.

If it works, improve it by increasing the universe. Or by improving the diversification. Or by tightening up the portfolio construction. Or by reducing slippage. Or by doing any of a million other things. Which, you know, improve a strategy.

Trying to optimize some parameters will ruin it. Provided there is anything to ruin.

"There is a SIX am?" -- Arthur

dlwlrma


Total Posts: 15
Joined: Jul 2019
 
Posted: 2021-07-29 12:10
@Maggette - it's definitely that I end up overfitting on D too much. When I do train on A,B,C,D, I do get reasonable performance on E - I've just been trying to incrementally improve it.

@Ronin - appreciate the perspective, maybe you're right. To be clear the alpha does work well (live) without a "robust" hyper param optimization technique. However, since the return of the alpha is very low (but high Sharpe), even adding like 0.5bps improvement to the return would be non-trivial.

ronin


Total Posts: 684
Joined: May 2006
 
Posted: 2021-07-30 12:50
Got it. That doesn't really sound like optimization at all.

Presumably you don't trade very often, and you only trade small sizes. Lower the signal threshold and increase the size. You will pick up more vol than return, but there may be a sweet spot somewhere.

It doesn't sound like re-calibrating the strategy makes too much sense.

"There is a SIX am?" -- Arthur

gnarsed


Total Posts: 95
Joined: Feb 2008
 
Posted: 2021-07-31 01:30
back to your abcde example, and independent from the discussion on the futility of that, the idea would be to "tune/optimize" your hyperparameters so they work well for all of a->b, ab->c, abc->d, abcd->e, etc.
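i.e. something like this (sketch; `fit`, `score` and the period lists are stand-ins):

```python
def expanding_window_score(periods, cfg, fit, score):
    # average out-of-sample score over a->b, ab->c, abc->d, abcd->e
    fold_scores = []
    for i in range(1, len(periods)):
        train = sum(periods[:i], [])              # everything strictly before fold i
        fold_scores.append(score(fit(train, cfg), periods[i]))
    return sum(fold_scores) / len(fold_scores)

# pick the config that does well across all folds:
# best_cfg = max(configs, key=lambda c: expanding_window_score([A, B, C, D, E], c, fit, score))
```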

EspressoLover


Total Posts: 486
Joined: Jan 2015
 
Posted: 2021-08-02 15:15
I think there are two related, but ultimately different factors at play here. One is training error. How much fit decay you'd expect even if the underlying probability distributions remain exactly the same. Two is regime change. How much the probability distributions drift over time.

Cross validation is ultimately about the former, not the latter. Regime change is a lot messier of a beast to tackle. The way I like to get a handle on it is to fit a rolling model, then quantify the rate that it decays from month to month as you move away from the training period. You also want to make some effort to optimize for an ideal fit period. You're trading off regime recency from shorter training periods against lower training error from the larger datasets that come with longer periods. Somewhere in between is the sweet spot.
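Something along these lines, say with monthly buckets (a sketch; `fit_on` and `score_month` are placeholders):

```python
import numpy as np

def decay_profile(months, fit_window, fit_on, score_month, horizon=12):
    """Fit on each rolling window of `fit_window` months, score the model
    1..horizon months past the end of training, and average across fit dates.
    The slope of the resulting curve is the regime-drift penalty you trade off
    against the lower training error of longer windows."""
    curves = []
    for end in range(fit_window, len(months) - horizon):
        model = fit_on(months[end - fit_window:end])
        curves.append([score_month(model, months[end + h]) for h in range(horizon)])
    return np.mean(curves, axis=0)   # average score indexed by months-since-training

# compare decay_profile(...) across several fit_window values to locate the sweet spot
```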

Finally, one thing to keep in mind in regime-driven environments: you want to bias towards robust local minima in a neighborhood of flat gradients. Think of a fitness landscape. This is the difference between gently rolling hills and narrow spikes. Regime drift means you're living in a perturbation of your fitted point, so you want to find a point with a good neighborhood.
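One cheap way to bias the search that way (a sketch; `evaluate` is whatever validation metric you already use, and the perturbation scheme is just for illustration):

```python
import numpy as np

def neighborhood_score(params: dict, evaluate, rel_jitter=0.1, n_samples=20, seed=0):
    """Score a hyper-parameter point by the performance of its *neighborhood*
    rather than the point itself: flat, robust optima survive random
    perturbation, narrow spikes get penalized."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_samples):
        jittered = {}
        for k, v in params.items():
            if isinstance(v, (int, float)) and not isinstance(v, bool):
                noisy = v * (1 + rng.uniform(-rel_jitter, rel_jitter))
                jittered[k] = int(round(noisy)) if isinstance(v, int) else noisy
            else:
                jittered[k] = v
        scores.append(evaluate(jittered))
    return np.mean(scores) - np.std(scores)   # reward flat neighborhoods, punish variance
```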

Similarly, when regime drift is prominent, saddle points can be particularly nasty. This becomes a bigger problem in higher-dimensional models. Ostensibly they'd seem okay, because a perturbation can actually improve performance. But in practice they make a model behave unpredictably: a saddle point can look regime-robust for a long time, until the perturbation hits the wrong dimension and all of a sudden you see performance collapse seemingly out of nowhere.


Good questions outrank easy answers. -Paul Samuelson