Strange


Total Posts: 1561 
Joined: Jun 2004 


Let's imagine I have a highly skewed and kurtotic distribution of returns (e.g. returns from selling S&P skew). I also have a model that selects a subset of these returns. Are there any reliable techniques that would tell me how statistically significant my model is?
My initial inclination is to use either the Mann–Whitney U test or the Wilcoxon test. With the first, I'd draw a random sample from the original distribution and compare it to the model sample. With the second, if I understand it correctly, I can simply compare a sample to the full population.
“My dear, here we must run as fast as we can, just to stay in place. And if you wish to go anywhere you must run twice as fast as that.” 



TonyC

Nuclear Energy Trader

Total Posts: 1298 
Joined: May 2004 


sitting here in the Louis Armstrong Airport nursing my third beer waiting for my twice delayed flight to get me back home ... given three beers and two Bourbon and sodas, I vaguely recall that Mann-Whitney makes the assumption that the two distributions are independent while Wilcoxon makes the assumption that the two distributions are related
to the extent that one of your distributions is a subset of the other, that would tend to make me think that it violates the independent-distribution assumption of Mann-Whitney, and that Wilcoxon is more appropriate
flaneur/boulevardier/remittance man/energy trader 


Strange




I thought that if I take random samples, they would be independent for the purposes of the test. But yes, I hear you.



TonyC



you make a good point about the random resampling, and now that you brought it up, I've had some second thoughts.
Wilcoxon requires "paired" data sets. if I understand your hypothetical example, the model decides when to engage in shorting the skew as opposed to just naively always shorting the skew until you go bust.
so the model distribution is a subset of the stb (short till bust) distribution. the other "comparison distribution" is the excluded distribution: the stb distribution with the model's trades excluded
let's suppose that excluded distribution is 1000 observations and the model is 150 observations. I suppose you could randomly resample with replacement a hundred observations of the model and a hundred observations of the excluded distribution, run the Wilcoxon test on those hundred observation pairs, and wash, rinse, repeat ad infinitum so that you had a whole bunch of Wilcoxon statistics.
but if my memory serves (and consider that I am in the window of the bar at Molly's on Decatur Street because my flight did not leave tonight), if you do that you're really doing three-quarters of a Mann-Whitney test
in fact if I remember correctly, you can actually run a Mann-Whitney test with the Wilcoxon routine in R by calling wilcox.test(x, y, paired = FALSE) rather than paired = TRUE (and if the only difference betwixt a Wilcoxon test and a Mann-Whitney test is setting a flag from true to false, that has to count for something)
but like I said ... I'm in the window ... at the bar ... at Molly's. so take all this not with just grains of salt but with truckloads of salt, as there is lots of handwaving involved.
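A minimal Python sketch of the unpaired comparison described above: the model's trades against the excluded trades, with no pairing. The return generator, sample sizes and crash probabilities are hypothetical, and the p-value relies on a normal approximation without a tie correction, so it is only approximate.

```python
import math
import random


def mann_whitney_u(x, y):
    """Return (U for x, z-score, two-sided p-value).

    Uses midranks for ties and a normal approximation without a tie
    correction, so the p-value is approximate.
    """
    n, m = len(x), len(y)
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n * (n + 1) / 2.0
    z = (u_x - n * m / 2.0) / math.sqrt(n * m * (n + m + 1) / 12.0)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u_x, z, p_two_sided


def toy_short_vol_return(rng, crash_prob):
    """Hypothetical payoff: small premium most days, rare large loss."""
    if rng.random() < crash_prob:
        return -rng.expovariate(0.1)
    return rng.uniform(0.5, 1.5)


rng = random.Random(42)
excluded = [toy_short_vol_return(rng, 0.05) for _ in range(1000)]
model = [toy_short_vol_return(rng, 0.02) for _ in range(150)]
u, z, p = mann_whitney_u(model, excluded)
print(f"U={u:.0f}  z={z:.2f}  p={p:.3f}")
```

Because the two groups do not overlap, no resampling into artificial pairs is needed for this version of the test.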


ronin


Total Posts: 470 
Joined: May 2006 


It helps if you think a bit about what you think is different about your subset. Mean? Variance? Skew? Kurtosis?
The out-of-the-box statistical tests measure the sameness of the mean: are the two means closer than some measure of uncertainty due to finite sampling? Mann-Whitney even assumes that the uncertainty is normally distributed. Wilcoxon doesn't.
I assume that is what you are hoping for: do I generate some value by deciding when to sell puts, as opposed to just selling them all the time? But I would expect that the appropriate measure of noise in the denominator would be specific to your returns distribution, rather than coming out of some standardised test.

"There is a SIX am?" - Arthur




EspressoLover


Maybe you're putting the cart before the horse. Just because the underlying distribution is skewed and leptokurtic doesn't mean that the sample statistics deviate significantly from normality. Unless the sample size is very small (fewer than 100 independent points) or the distribution is very non-normal, the CLT probably means that you can use plain ole' t-tests.
At the very least, I'd get a handle on the issue by bootstrapping the relevant sample stats. (I'm assuming that you primarily care about the difference between the population means.) Histogram out the values you get from bootstrapping. If you can't really eyeball a significant non-normality from this, then you're probably fine assuming the CLT applies.
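As a rough illustration of the bootstrap check above, here is a sketch in plain Python. The toy return generator and sample sizes are assumptions, and the skewness of the bootstrapped differences is used as a crude numeric stand-in for eyeballing the histogram.

```python
import random
import statistics


def bootstrap_mean_diff(a, b, n_boot=2000, seed=0):
    """Sorted bootstrap replicates of mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))  # resample with replacement
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    diffs.sort()
    return diffs


def percentile_interval(sorted_vals, level=0.95):
    """Simple percentile interval from sorted bootstrap replicates."""
    lo_idx = int((1 - level) / 2 * len(sorted_vals))
    hi_idx = int((1 + level) / 2 * len(sorted_vals)) - 1
    return sorted_vals[lo_idx], sorted_vals[hi_idx]


def sample_skewness(vals):
    """Third standardized moment; near 0 for a roughly symmetric sample."""
    mu = statistics.fmean(vals)
    sd = statistics.pstdev(vals)
    return sum(((v - mu) / sd) ** 3 for v in vals) / len(vals)


def toy_return(rng, crash_prob):
    """Hypothetical skewed payoff: small gains, rare large losses."""
    return -rng.expovariate(0.1) if rng.random() < crash_prob else rng.uniform(0.5, 1.5)


rng = random.Random(7)
model = [toy_return(rng, 0.02) for _ in range(150)]
rest = [toy_return(rng, 0.05) for _ in range(1000)]

diffs = bootstrap_mean_diff(model, rest)
low, high = percentile_interval(diffs)
print(f"95% interval for mean difference: [{low:.3f}, {high:.3f}]")
print(f"skewness of bootstrap replicates: {sample_skewness(diffs):.3f}")
```

If the replicates look clearly skewed, the percentile interval is a more defensible summary than a t-test; if they look roughly normal, the two should broadly agree.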
Good questions outrank easy answers.
Paul Samuelson 


TonyC



Pay attention to what Ronin said, he's smarter 'n i am 



Strange




@ronin Since I want to gauge the expectation of future PnL, it's the mean.
@EspressoLover Can we actually assume i.i.d. considering that these are a sample and a subsample? However, I'll try that tomorrow.
Spinning this in my head, I feel a little like there are two very distinct cases in terms of the significance testing. Can't quite put my finger on it, but something like "am I aligned with the median or against it"...
Let's say for simplicity that I buy or sell puts on S&P. In the first case, let's say I am a seller, collect the bleed but somehow I seem to avoid the drawdowns. Did I avoid the drawdowns due to dumb luck or due to some information content in my signal? Second case is I buy puts, don't seem to bleed much and seem to catch most of the big down moves. Do I think that my lack of bleed is just due to random luck or is it due to information content in my signal?



TonyC



maybe the Henriksson and Merton test of the effectiveness of a market-timing signal?
Attached File: onmarkettimingpart2.pdf
but beware of options causing false findings of timing ability, as described here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1947323



Strange




@TonyC - thanks, I'll take a look!


ronin




> he's smarter 'n i am
I wish I was mate, I wish I was...
@strange, in all seriousness.
Say you are picking some times to sell puts. You collect your pennies, and you get steamrolled every once in a while. Are you better than a dumb robot who sells puts all the time?
I would factor out various ways of being 'better':
- are you collecting higher premia than average?
- are you having fewer drawdowns than average?
- are your drawdowns shallower than average?
Each of these (premia, drawdown frequency, drawdown depth) is probably more log-Gaussian than your return distribution. So as a first go, I would try testing them separately using standardised tests. Out of those, the second one is going to be most suspicious, so if that's your edge you would probably have to dig in quite deep before you get comfortable.
And then there are ratios. Premium size to drawdown depth, premium size to drawdown frequency etc. Once you start digging in, something should come out.
If it doesn't, you'll have to just randomise your picks and work out where you are in the distribution of randomly picking strategies. I don't think there is a simple formula.
But that last scenario would probably not be tradable. "I am better than average, but I don't really know why" - it doesn't really fly off the shelves.
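The "randomise your picks" fallback mentioned above could look roughly like the sketch below. It assumes the statistic of interest is the mean return, that a random strategy picks the same number of days without replacement, and that time ordering and clustering can be ignored. The function name and toy data are hypothetical.

```python
import random
import statistics


def random_pick_pvalue(all_returns, picked_idx, n_sims=5000, seed=0):
    """One-sided Monte Carlo p-value for the mean return of the picked days.

    Compares the observed mean with the means of random subsets of the
    same size drawn from all_returns without replacement.
    """
    rng = random.Random(seed)
    k = len(picked_idx)
    observed = statistics.fmean(all_returns[i] for i in picked_idx)
    hits = 0
    for _ in range(n_sims):
        if statistics.fmean(rng.sample(all_returns, k)) >= observed:
            hits += 1
    # The +1 terms count the observed strategy itself, so p is never 0.
    return observed, (hits + 1) / (n_sims + 1)


rng = random.Random(11)
# Hypothetical always-sell return history: small gains, rare large losses.
history = [
    -rng.expovariate(0.1) if rng.random() < 0.05 else rng.uniform(0.5, 1.5)
    for _ in range(1150)
]
# Stand-in for a model's picks; here they are random, so no edge is expected.
picks = rng.sample(range(len(history)), 150)
obs_mean, p_value = random_pick_pvalue(history, picks)
print(f"observed mean={obs_mean:.3f}  p={p_value:.3f}")
```

Swapping the mean for drawdown frequency, drawdown depth or one of the ratios only requires changing the statistic computed on each subset.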

"There is a SIX am?"  Arthur 



Strange




@ronin
I get it now, I confused a lot of people by talking about selling puts or whatever. It was a poor mnemonic device (in retrospect) because people right away start thinking about the distribution of returns.
- assume that this is a process for signal validation, involving the actual prices and P&L quality is not the right approach because it will mask the signal quality. If anything, let's convert the results into a digital form: "vol wins" or "vol loses".
- also, assume that each signal has some reasonable hypothesis behind it. It's not like "I am better than average, but I don't really know why" but more like "does buying puts after Trump tweets actually work" and "is it better to avoid selling puts on rainy days".
- there are several of these signals, but not a large number, so we don't need to make any form of data mining corrections.
- the total number of these digital observations is large, but not unlimited (so it's not tick data).
So, to reframe: I'm looking at two types of models. Both are based on some external bit of data, and both models are drawing a subsample from a highly asymmetrical binomial distribution:
(a) model 1 aims to maximize draws from the bigger bucket
(b) model 2 aims to maximize draws from the smaller bucket
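One way to score the digital framing above is a hypergeometric tail probability: how likely is it that a random subsample of the same size contains at least as many draws from the target bucket? This is a sketch under the assumption that a skill-less signal behaves like drawing without replacement; the counts in the example are invented.

```python
from math import comb


def hypergeom_upper_tail(n_total, n_target, n_drawn, n_hits):
    """P(X >= n_hits) when n_drawn items are drawn without replacement
    from n_total items, of which n_target belong to the target bucket."""
    top = min(n_drawn, n_target)
    favourable = sum(
        comb(n_target, j) * comb(n_total - n_target, n_drawn - j)
        for j in range(n_hits, top + 1)
    )
    return favourable / comb(n_total, n_drawn)


# Hypothetical history: 1150 days, "vol loses" on 1000 of them.
N_DAYS, BIG_BUCKET = 1150, 1000
SMALL_BUCKET = N_DAYS - BIG_BUCKET

# Model 1 targets the bigger bucket: 150 picks, 140 land in it.
p_model_1 = hypergeom_upper_tail(N_DAYS, BIG_BUCKET, 150, 140)

# Model 2 targets the smaller bucket: 60 picks, 15 land in it.
p_model_2 = hypergeom_upper_tail(N_DAYS, SMALL_BUCKET, 60, 15)

print(f"model 1 p-value: {p_model_1:.4f}")
print(f"model 2 p-value: {p_model_2:.6f}")
```

The same function serves both models; only the definition of the target bucket changes. Hitting the small bucket by luck is much harder, so a model of type (b) can reach significance with far fewer observations.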

“My dear, here we must run as fast as we can, just to stay in place. And if you wish to go anywhere you must run twice as fast as that.” 


ronin




> "does buying puts after Trump tweets actually work"
That's not a bad one. Let me know if it does...
I guess the problem is that I don't really know what sort of distribution we are dealing with here.
E.g., how much is 3 standard deviations in your distribution? Is it pretty much all there is, or is it just a little bit?

"There is a SIX am?"  Arthur 



nikol


Total Posts: 751 
Joined: Jun 2005 


What is the point in comparing a set and a subset unless there is some selection (filtering) model in between which extracts useful info, am I right?
If you care only about the plain distribution (no time is involved, no special ordering, etc.) then compare empirical CDFs: x=CDF(set), y=CDF(subset). If they are equal, you get a diagonal Q-Q plot.
Test the distribution of U = inverse_CDF_set(CDF_subset(i_point)) for uniformity with, for example, Kolmogorov-Smirnov or Anderson-Darling.
These examples can give you further inspiration by giving more weight to tails or by other censoring, or by choosing the right metric: (F_set - F_subset)^2 or abs(F_set - F_subset), etc. The particular choice requires some research and thinking about what kind of info is stored in your data.
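A minimal sketch of the empirical-CDF comparison, written as a two-sample Kolmogorov-Smirnov statistic (the largest absolute gap between the two ECDFs). The p-value uses the standard asymptotic Kolmogorov series with a small-sample adjustment, so it is approximate. The toy data are hypothetical, and Anderson-Darling is not shown.

```python
import math
import random
from bisect import bisect_right


def ks_two_sample(x, y):
    """Return (D, approximate p-value) for the two-sample KS test."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at a data point.
    for v in xs + ys:
        fx = bisect_right(xs, v) / n
        fy = bisect_right(ys, v) / m
        d = max(d, abs(fx - fy))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        return d, 1.0  # series converges poorly here; the true value is ~1
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam) for j in range(1, 101)
    )
    return d, min(1.0, max(0.0, 2.0 * series))


rng = random.Random(3)
# Hypothetical returns: the subset has a lower chance of a large loss.
complement = [
    -rng.expovariate(0.1) if rng.random() < 0.05 else rng.uniform(0.5, 1.5)
    for _ in range(1000)
]
subset = [
    -rng.expovariate(0.1) if rng.random() < 0.02 else rng.uniform(0.5, 1.5)
    for _ in range(150)
]
d_stat, p_val = ks_two_sample(subset, complement)
print(f"D={d_stat:.3f}  p={p_val:.3f}")
```

KS is most sensitive near the middle of the distribution. If the edge lives in the tails, a tail-weighted statistic is likely to have more power.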





EspressoLover


> Can we actually assume i.i.d. considering that these are a sample and a subsample? However, I'll try that tomorrow.
I would suggest just comparing the subsample with the complement of the subsample. That eliminates any issues of overlapping sample points.
The alternative hypothesis is that the distribution of the subpopulation is different from the distribution of the population. That's true if and only if the subpopulation is different from the complement subpopulation. So comparing subsample and complement tells you the same thing.
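The equivalence can be written out in one line. The notation is introduced here only for illustration: F is the CDF of the full set, F_S that of the selected subsample, F_C that of its complement, and p the fraction of points selected.

```latex
\begin{align}
F(x) &= p\,F_S(x) + (1-p)\,F_C(x), \qquad 0 < p < 1, \\
F_S(x) - F(x) &= (1-p)\,\bigl(F_S(x) - F_C(x)\bigr).
\end{align}
```

Since 1-p is strictly positive, F_S differs from F at some x exactly when F_S differs from F_C at that x.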
#drawdown
I just want to add here that estimating drawdown is a whole 'nother bucket of worms. Unlike mean/variance/skew, it's not a population summary statistic. It's a time-series property, because the sequence of returns matters. In which case coming up with a tractable, analytical statistical test is a lot harder.
Plus drawdown has all sorts of pathological issues in terms of its mathematical behavior. For example, a shorter sample will always have lower expected drawdown than a longer sample, even if the returns are i.i.d. drawn from the same population. Now add on top that you want a non-parametric test, which is really hard to do with time series.
Personally, I'll use drawdown to give me a "gut feeling" about how a strategy behaves. But I'd just pass in terms of using it in any formal statistical sense (like testing the hypothesis that one series has larger expected drawdown than another series). I think as long as you estimate mean, variance, skew, kurtosis, autocorrelation, heteroskedasticity and regime effects, that pretty much captures everything relevant unless your time series is super-weird.
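The sample-length effect on drawdown is easy to see in a small simulation. The sketch below assumes i.i.d. Gaussian log returns with made-up drift and volatility, and measures maximum drawdown on prefixes of the same simulated paths, so the only thing that changes is the sample length.

```python
import random


def max_drawdown(returns):
    """Largest peak-to-trough drop of the cumulative sum of (log) returns."""
    cum = peak = worst = 0.0
    for r in returns:
        cum += r
        peak = max(peak, cum)
        worst = max(worst, peak - cum)
    return worst


def mean_drawdown_by_length(lengths, n_paths=300, seed=0):
    """Average max drawdown over the first n observations of each path."""
    rng = random.Random(seed)
    longest = max(lengths)
    totals = {n: 0.0 for n in lengths}
    for _ in range(n_paths):
        # Hypothetical daily log returns: i.i.d. normal, same law throughout.
        path = [rng.gauss(0.0005, 0.01) for _ in range(longest)]
        for n in lengths:
            totals[n] += max_drawdown(path[:n])
    return {n: totals[n] / n_paths for n in lengths}


for n_obs, avg_dd in mean_drawdown_by_length([50, 250, 1000]).items():
    print(f"{n_obs:5d} observations: mean max drawdown {avg_dd:.3f}")
```

A subsample picked by a model is shorter than the full history, so its drawdown looks better even when the model has no information at all.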
> assume that this is a process for signal validation, involving the actual prices and P&L quality is not the right approach because it will mask the signal quality.
If you're talking about signal fitting, not portfolio/risk management, then just use R-squared. Technically least-squares is MLE for the normal distribution, but the target variable has to be really skewed or leptokurtic to meaningfully change the results. It's very rare for returns, even VIX-type returns, to be affected.
You can try for yourself: cap your dependent variable at three standard deviations from the mean and refit the least-squares model. I'm willing to bet this fitted model and the vanilla model have 90%+ correlation with each other.
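The capping experiment can be tried on synthetic data. Everything in the sketch below is assumed for illustration: two made-up signals, hypothetical true coefficients, and a noise term with occasional large negative shocks. It fits least squares on the raw and on the capped target and correlates the two sets of fitted values.

```python
import math
import random
import statistics


def ols_two_regressors(x1, x2, y):
    """Slopes of y ~ a + b1*x1 + b2*x2 from the 2x2 normal equations."""
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det


def cap_at_sigma(y, k=3.0):
    """Clip values to within k standard deviations of the mean."""
    mu, sd = statistics.fmean(y), statistics.pstdev(y)
    return [min(max(v, mu - k * sd), mu + k * sd) for v in y]


def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


def skewed_noise(rng):
    """Hypothetical noise: mostly N(0, 1), with a 5% chance of a large negative shock."""
    return rng.gauss(-3.0, 3.0) if rng.random() < 0.05 else rng.gauss(0.0, 1.0)


rng = random.Random(1)
n = 2000
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [0.5 * a - 0.3 * b + skewed_noise(rng) for a, b in zip(x1, x2)]

b_raw = ols_two_regressors(x1, x2, y)
b_cap = ols_two_regressors(x1, x2, cap_at_sigma(y))
fit_raw = [b_raw[0] * a + b_raw[1] * b for a, b in zip(x1, x2)]
fit_cap = [b_cap[0] * a + b_cap[1] * b for a, b in zip(x1, x2)]
corr_fitted = pearson(fit_raw, fit_cap)
print(f"raw slopes={b_raw}  capped slopes={b_cap}  corr={corr_fitted:.4f}")
```

Two regressors are used on purpose: with a single regressor both fitted lines are linear in the same variable, so the correlation would be trivially plus or minus one.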



nikol




> I would suggest just comparing the subsample with the complement of the subsample. That eliminates any issues of overlapping sample points.




Strange




@EspressoLover "I would suggest just comparing the subsample with the complement of the subsample."
That. Is. Smart.
The whole thing is about signal fitting only; in fact, it is rather specific to using a bunch of alternative data sets. The more I think about it, the more I like the idea of using signs instead of returns.
PS. Would anyone really try to use historical option returns to understand their risk?




TonyC



> subsample with the complement of the subsample
complementary set, that's what I meant when I said excluded set as total set minus subset ... I am an inarticulate idiot 


goldorak


Total Posts: 1060 
Joined: Nov 2004 


Aside from the base question and re: assumptions to apply the t-test, you may find the paper below interesting in the sense that you can compute an equivalent of the Sharpe ratio (hence a t-stat) without relying on any normality assumption.
"Sharper asset ranking from total drawdown durations"
The paper is of course available from the usual sources. 
If you are not living on the edge you are taking up too much space. 



Strange




@goldorak - thanks! looks interesting


nikol




@goldorak - thank you. duration is an interesting metric, simple and informative




goldorak




and does not rely on distributional assumptions



doomanx


Total Posts: 23 
Joined: Jul 2018 


Really nice paper goldorak, thanks for the recommendation. For those interested, you can find implementations of the proposed estimator at https://cran.r-project.org/web/packages/sharpeRratio/index.html for R and https://pypi.org/project/pysharperratio/0.1.10/ for Python.



