scipy.optimize.minimize?
     

Strange


Total Posts: 1453
Joined: Jun 2004
 
Posted: 2018-06-29 01:17
So I have decided to be all scientific and optimize the parameters for my strategy using gradient descent. scipy.optimize.minimize seems to do it all, but I am not sure which method is preferable, and it's not clear how to set boundaries on the parameters. Does anyone have experience with it?

I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always 'how much?'

EspressoLover


Total Posts: 338
Joined: Jan 2015
 
Posted: 2018-07-03 23:23
Assuming the objective function is backtest PnL?

If that's the case, your fitness landscape is probably non-smooth. Trades are included or excluded discretely, so as the parameters change the objective function varies discontinuously: nudge a param by an epsilon and the objective either doesn't move at all or suddenly jumps to a different value. That means any algorithm that relies on the gradient will be unstable.

(Aside: theoretically, you could define your own smoothed gradient. That would look something like perturbing the param far enough that the trade set actually changes, then interpolating back down to an epsilon-sized step. This is very likely more effort than it's worth.)

Unless you're compute-constrained, probably just stick with Nelder-Mead. Trading parameters have such huge generalization error that the slight accuracy difference in finding the exact local maximum is irrelevant. The only potential annoyance with NM is setting the termination tolerances. You could also try Powell or COBYLA, which are also gradient-free, but they often require more hand-tuning than NM.

The easiest way to handle constraints is just to bake them into the objective function, i.e. if an evaluated point steps outside the bounds, add a penalty of infinity. This isn't the most computationally efficient approach, and you need to make sure your initial point starts inside the constraints. But again, it's the route of least headaches.
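Something like this is what I have in mind (a rough sketch, not tested against a real backtest; backtest_pnl, the bounds and the tolerances are all placeholders you'd swap for your own):

    import numpy as np
    from scipy.optimize import minimize

    def backtest_pnl(params):
        # Stand-in for the real backtest: returns PnL for a given param vector.
        return -np.sum((params - 0.3) ** 2)

    bounds = [(0.0, 1.0), (0.0, 1.0)]  # hypothetical (low, high) box per parameter

    def objective(params):
        # Infinite penalty outside the box; otherwise negative PnL, since minimize() minimizes.
        if any(p < lo or p > hi for p, (lo, hi) in zip(params, bounds)):
            return np.inf
        return -backtest_pnl(params)

    x0 = np.array([0.5, 0.5])  # must start inside the bounds
    result = minimize(objective, x0, method='Nelder-Mead',
                      options={'xatol': 1e-3, 'fatol': 1e-3, 'maxiter': 2000})
    print(result.x, -result.fun)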

Also, most trading parameter optimizations are non-convex. Therefore you should probably use scipy.optimize.basinhopping to look for the global maximum, not just whatever local maximum is in the neighborhood of your initial point.
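For the global search, basinhopping just wraps repeated local minimizations. Reusing objective and x0 from the sketch above (niter and stepsize would need tuning for your problem):

    from scipy.optimize import basinhopping

    result = basinhopping(objective, x0, niter=50, stepsize=0.25,
                          minimizer_kwargs={'method': 'Nelder-Mead'})
    print(result.x, -result.fun)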

Good questions outrank easy answers. -Paul Samuelson

Strange


Total Posts: 1453
Joined: Jun 2004
 
Posted: 2018-07-04 09:04
Thanks!

I checked out basin hopping; it looks very interesting and I will experiment with it.

Out of curiosity, for a really slow backtest that has a fair number of parameters (let's say 30 parameters and 1 min per backtest), would NM still be the best choice?

I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always 'how much?'

TonyC
Nuclear Energy Trader

Total Posts: 1282
Joined: May 2004
 
Posted: 2018-07-04 13:32
Here is a 23-yr-old bit of code that I probably should get around to rewriting as an operator on a target function instead of a function executing a character string.


flaneur/boulevardier/remittance man/energy trader

ronin


Total Posts: 365
Joined: May 2006
 
Posted: 2018-07-04 14:47
> (let's say 30 parameters

Just out of interest, is that seriously a sensible approach? Surely with 30 parameters you can overfit to any amount of historical data on the planet. Just 10 grid points per parameter would give you 10^30 states.


"There is a SIX am?" -- Arthur

Strange


Total Posts: 1453
Joined: Jun 2004
 
Posted: 2018-07-04 17:27
Valid question regarding curve-fitting.

It is actually a portfolio of 7 different sub-strategies generating separate targets; each one produces sensible results in its own right. The “parameters” that go into each sub-strategy are continuous and fairly "benign", like the EWM centers of mass for the historical regressions. My usual test is to check that the sensitivity to the strategy “parameters” is fairly low and the response curve is smooth - a discontinuity in either PNL/trade or risk metrics in response to any parameter is a bad sign, obviously. So let’s assume I know that each sub-target is not badly curve-fit.

This brings us to the combined portfolio. Multiple sub-targets, when combined, create more parameters. Each strategy has only 1, at most 2, parameters, but once you add various portfolio management inputs such as weights etc., the input parameters add up to a scarily high number. I am pretty sure the combined results will be curve-fit, but it’s sensible to assume that they should still be the most valid starting point (IMHO).

As an alternative, I can keep doing what I am doing already. I.e. optimize each strategy separately to generate the best target on a standalone basis. Then combine them using some form of portfolio optimization and optimize the execution parameters as a separate step.

I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always 'how much?'

EspressoLover


Total Posts: 338
Joined: Jan 2015
 
Posted: 2018-07-04 18:08
I’d consider just ditching scipy.optimize altogether. Given how expensive evaluating the backtest is, you need to parallelize and utilize all the cores at your disposal. That’s not really easy to do in Python. You’re also outside the realm of traditional optimization and pretty far into metaheuristics; scipy.optimize doesn’t give you a lot in this domain.

An aside on the dimensionality: I’d guess that the parameter space is effectively low-dimensional. That is, the objective function really only varies along some much smaller manifold embedded in the full high-dimensional space. For any given param-set, all that really matters is where it projects onto the manifold. There are lots of reasons this might be the case: some parameters are irrelevant, some parameters have basically the same effect, some group of params only matters through its L2 norm, etc. So the search space is actually much smaller than it appears, but we only know how to look in the larger space.

With low effective dimensionality, random search tends to perform pretty well. The space is too large for grid search, but even a small number of random points in the embedding space will tend to pretty closely approximate grid search on the manifold. Random search also benefits from being brain-dead easy to parallelize, even across multiple machines. A final benefit is that you’ll probably get better generalization error.
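For example, a bare-bones parallel random search might look like this (a sketch under assumptions: backtest_pnl is a toy stand-in for the real backtest, the bounds are made up, and the sample count is whatever your compute budget allows):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    LOW = np.zeros(30)   # hypothetical lower bounds for 30 params
    HIGH = np.ones(30)   # hypothetical upper bounds

    def backtest_pnl(params):
        # Stand-in for the real (slow) backtest.
        return -np.sum((params - 0.3) ** 2)

    def random_search(n_samples=512, n_workers=8, seed=0):
        rng = np.random.default_rng(seed)
        candidates = rng.uniform(LOW, HIGH, size=(n_samples, len(LOW)))
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            scores = list(pool.map(backtest_pnl, candidates))
        best = int(np.argmax(scores))
        return candidates[best], scores[best]

    if __name__ == '__main__':
        best_params, best_pnl = random_search()
        print(best_pnl)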

Overfitting tends to be a bigger issue with peaks in the fitness landscape that are narrow and steep. In one sense, we want to bias towards local maxima that are high-volume as well as high. Random search naturally does this because it’s more likely to land in high-volume regions. So, I’d keep random search as your baseline to beat.

Beyond that, the next logical approach would be simulated annealing, which is basically what basinhopping does, but you want to use a library that you can parallelize across cores, and probably even across machines. Again, more compute time is going to trump better algorithms. If you really want to get into the weeds of metaheuristics you could try the various flavors of iterated local search, tabu search, or particle swarms. Honestly though, I doubt whether the potential improvement over random search would be large enough to justify your time.
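A cheap way to parallelize here is just to run independent basinhopping restarts from different starting points and keep the best one (again a sketch, reusing the toy backtest_pnl and LOW/HIGH bounds from the previous snippet):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor
    from scipy.optimize import basinhopping

    def objective(params):
        return -backtest_pnl(params)   # minimize negative PnL

    def one_restart(x0):
        # One independent basinhopping run from a random seed point.
        res = basinhopping(objective, x0, niter=25,
                           minimizer_kwargs={'method': 'Nelder-Mead'})
        return res.fun, res.x

    if __name__ == '__main__':
        rng = np.random.default_rng(1)
        starts = rng.uniform(LOW, HIGH, size=(16, len(LOW)))
        with ProcessPoolExecutor(max_workers=8) as pool:
            best_fun, best_x = min(pool.map(one_restart, starts), key=lambda t: t[0])
        print(-best_fun)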

EDIT: Based on your last post, I'd also maybe suggest starting with the params you get from the individual fits. Then use those points as seeds for the starting values of simulated annealing or some other hill-climbing algorithm. In fact I'd use three sources of seed points: completely random points, fully fitted points, and individual-strategy fitted points with randomized portfolio-level params.

Good questions outrank easy answers. -Paul Samuelson

schmitty


Total Posts: 56
Joined: Jun 2006
 
Posted: 2018-07-05 08:51
Before attempting the time-consuming or possibly intractable optimization suggested in the posts above, I would try a quick approximation to see if the full optimization would be worth the effort.

Given 7 models, two params each, 100 increments per parameter, and 1mm rows of 1-min market data: create the log-returns matrix R, 1,000,000 x 70,000 (one column per parameterization). Remove columns whose returns sum to a negative number. Solve for the minimum-variance portfolio directly via the Kempf/Memmel regression. Compare the Sharpe of the min-var portfolio to your current portfolio run over the same rows. If the Sharpe is significantly higher, then running your optimization is probably worth the effort.

This is nowhere near exact but can give you a general idea. The creation of the 1,000,000 x 70,000 return matrix could probably be done in parallel Python over the weekend, even on a slow 24-core machine. And the rest of it only takes a few minutes.
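If I'm reading the Kempf/Memmel trick right, the min-variance weights drop out of a single OLS: regress asset 1's returns on the differences (r_1 - r_j). Roughly, as a sketch on a T x N return matrix small enough to fit in memory:

    import numpy as np

    def min_var_weights(R):
        # R: T x N matrix of log returns, one column per candidate parameterization.
        # Regress column 1 on (r_1 - r_j), j = 2..N; the slopes are w_2..w_N,
        # and w_1 = 1 - sum of the others (my reading of Kempf/Memmel).
        T, N = R.shape
        y = R[:, 0]
        X = np.column_stack([np.ones(T), y[:, None] - R[:, 1:]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        w = np.empty(N)
        w[1:] = beta[1:]
        w[0] = 1.0 - w[1:].sum()
        return w

    # Sharpe comparison over the same rows:
    # port = R @ min_var_weights(R); sharpe = port.mean() / port.std()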

Also note that the full optimization, as described in previous posts in this thread, does not account for the possible inclusion of multiple parameterizations of a model.

katastrofa


Total Posts: 458
Joined: Jul 2008
 
Posted: 2018-07-05 11:05
"An aside on the dimensionality, I’d guess that the parameter space is effectively low-dimension. That is the objective function really only varies along some much smaller manifold embedded in the full high-dimensional space."

That's what people believe in many disciplines, but there's usually no proof given for the statement. Everybody talks about this manifold, but who has ever seen one?

ronin


Total Posts: 365
Joined: May 2006
 
Posted: 2018-07-05 11:07
> As an alternative, I can keep doing what I am doing already. I.e. optimize each strategy separately to generate the best target on a standalone basis. Then combine them using some form of portfolio optimization and optimize the execution parameters as a separate step.

I agree. Quite apart from the dimensionality issue, you want to keep your signals separate from execution and portfolio construction. The last thing you want when one of your signals deteriorates is for your execution and portfolio construction to go south as well.

With the situation that you have, I would go with several versions of each substrategy with different centres of mass, and let the portfolio allocator sort them out. If the strategy level optimization is pushing them to be too correlated, that will come out.



"There is a SIX am?" -- Arthur

EspressoLover


Total Posts: 338
Joined: Jan 2015
 
Posted: 2018-07-05 19:35
My last post was made in a race condition with @Strange. Given the follow-up information, I'm a lot less confident that he's dealing with an effectively low-dimensional situation. It seems more likely to be a situation of (nearly) orthogonal subspaces along the axes of the individual sub-strategies.

I do agree with the others: I'd definitely start by fitting the strategies individually, then use something like traditional portfolio optimization to combine them. The possible gains I see from pooled fitting are 1) pushing the individual strategy parameters to make the strategies more anti-correlated with each other, and 2) transaction-cost reductions from pushing strategies to internally cross each other.

Both aren't possible with individual optimization because the sub-strategy parameters are already fixed by the time you optimize around any portfolio effects. Kempf/Memmel is definitely a clever heuristic to gauge the impact, but even if it turns up negative, pooled fitting might still produce the above gains.

That being said, the portfolio-optimal params probably live in close proximity to their individually optimal levels. So your best bet is probably starting off with the individually fit params, then doing something like iterated local search on the pooled param set.

Another approach that might work is doing something like coordinate descent. Start off with the individual fits, then do the portfolio fits. Then start with the first sub-strat: adjust that strategy's params, and only that strategy's params, with portfolio-level performance as the objective, leaving all the other sub-strat params fixed. Then do the same with the second sub-strat. And so on, repeating until convergence.

This approach makes it easier to get stuck in local optima and saddle points, and you give up improvements that require coordinated changes between strats. But that being said, you drastically reduce the dimensionality at each optimization step, and you still get the opportunity to reap most of the plausible pooled-parameter gains.
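In pseudo-Python, that blockwise scheme would be something like this (a rough sketch; portfolio_pnl and the per-block refit are placeholders for your own pooled backtest and local optimizer):

    import numpy as np
    from scipy.optimize import minimize

    def refit_block(objective, x0):
        # Local refit of one sub-strat's params (Nelder-Mead, as discussed above).
        return minimize(objective, x0, method='Nelder-Mead').x

    def blockwise_fit(param_blocks, portfolio_pnl, sweeps=5):
        # param_blocks: list of per-sub-strategy param vectors, seeded from the individual fits.
        # portfolio_pnl(blocks): pooled backtest PnL for the full set of blocks.
        for _ in range(sweeps):
            for i in range(len(param_blocks)):
                def objective(x, i=i):
                    trial = list(param_blocks)
                    trial[i] = x
                    return -portfolio_pnl(trial)
                param_blocks[i] = refit_block(objective, np.asarray(param_blocks[i]))
        return param_blocks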

> Everybody talks about this manifold, but who has ever seen one?

Absolutely disagree. How else would you describe an autoencoder? How could image recognition even be possible unless the problem was effectively lower-dimensional?

Good questions outrank easy answers. -Paul Samuelson

Strange


Total Posts: 1453
Joined: Jun 2004
 
Posted: 2018-07-06 03:56
Yeah, my main aim is to create a joint execution across all strategies (it's a small set of instruments). I am going to think about all this over the weekend.

I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always 'how much?'