Nonius

Founding Member Nonius Unbound

Total Posts: 12779 
Joined: Mar 2004 


So, I built this neural net with about 70 inputs and one output. It is very, very simple in design so far. It has some inner nodes but no feedback loops. Presently, it is topologically just a tree.
Now I have PARTIAL input data for training. I have about 3000 examples covering maybe half to 2/3 of the inputs, along with the relevant output values.
What can I do in this case?
Chiral is Tyler Durden 



Anthis

It's all Greek to me

Total Posts: 1180 
Joined: Jul 2004 


I guess you mean that you don't have enough data for one of the 70 input variables.
In such a case you either drop the whole variable, which may have a negligible effect since you have so many variables, or you trim the data sample.
It's recommended to experiment with both and choose the best model out of sample. Try to avoid overfitting. Also try to be parsimonious.
If I recall correctly, with NNs you need to split the sample into three subsamples: a training sample, a validation sample and a testing sample.
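A minimal sketch of the two options above (drop the incomplete variables, or trim the sample to complete rows) followed by the three-way split. The function names, the toy data, the use of None for a missing input and the 60/20/20 proportions are all assumptions for illustration, not anything specified in the thread.

```python
import random

def drop_incomplete_columns(rows):
    """Option 1: keep only the input columns observed in every example."""
    keep = [j for j in range(len(rows[0])) if all(r[j] is not None for r in rows)]
    return [[r[j] for j in keep] for r in rows], keep

def drop_incomplete_rows(rows):
    """Option 2: keep every column but only the fully observed examples."""
    return [r for r in rows if all(v is not None for v in r)]

def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle, then cut into training, validation and testing subsamples."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

# Toy data: None marks a missing input (the real set has ~70 inputs, ~3000 rows).
toy = [[1.0, None, 3.0], [2.0, 5.0, 6.0], [3.0, 7.0, None], [4.0, 8.0, 9.0]]
by_column, kept_columns = drop_incomplete_columns(toy)
by_row = drop_incomplete_rows(toy)

# Stand-in for the ~3000 examples: split their indices 60/20/20.
train, val, test = three_way_split(range(3000))
print(kept_columns, len(by_row), len(train), len(val), len(test))
```

Both reduced datasets would then be fitted on the training part, compared on the validation part, and the chosen model reported once on the testing part.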
Αίεν Υψικράτειν/Τύχη μη πίστευε/Άνδρα Αρχή Δείκνυσι/Νόησις Αρχή Επιστήμης //Σε ενα κλουβί γραφείο σαν αγρίμι παίζω ατέλειωτο βουβό ταξίμι



Nonius

Founding Member Nonius Unbound

Total Posts: 12779 
Joined: Mar 2004 


Thanks dude. Yes, I don't have all the inputs, but intuitively I think they should be important. I'll take your comments under advisement.
Chiral is Tyler Durden 



Anthis

It's all Greek to me

Total Posts: 1180 
Joined: Jul 2004 


Also, I guess it would be a good start to run a multiple regression first before moving to non-parametric methods. In the absence of nonlinearities the NN won't add any insights. Moreover, I recall that the MINITAB package (you may download a demo) has a best subsets multiple regression function that helps you choose or rank the best subset(s) of explanatory variables. It's something like combinatorial optimisation. But I am not sure if the function can handle such a large number of variables.....
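To make the best-subsets idea concrete, here is a rough sketch in Python with NumPy. It is not MINITAB's routine, just a brute-force search over column subsets scored by residual sum of squares, and the synthetic data and the name best_subsets are assumptions. With 70 candidate variables the full search has roughly 10^21 subsets, so in practice one would cap the subset size or switch to a stepwise search.

```python
import itertools
import numpy as np

def best_subsets(X, y, max_size):
    """Return {k: (rss, columns)}: the lowest-RSS subset of each size k."""
    n, p = X.shape
    best = {}
    for k in range(1, max_size + 1):
        for cols in itertools.combinations(range(p), k):
            # OLS with an intercept on the chosen columns only.
            A = np.column_stack([np.ones(n), X[:, list(cols)]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if k not in best or rss < best[k][0]:
                best[k] = (rss, cols)
    return best

# Synthetic check: only columns 1 and 4 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 1] - 3.0 * X[:, 4] + 0.1 * rng.normal(size=200)

best = best_subsets(X, y, max_size=3)
for k, (rss, cols) in best.items():
    print(k, cols, round(rss, 2))
```

Comparing subsets of different sizes on raw RSS always favours the larger one, so an out-of-sample error or a penalised criterion is the usual way to pick k.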

Αίεν Υψικράτειν/Τύχη μη πίστευε/Άνδρα Αρχή Δείκνυσι/Νόησις Αρχή Επιστήμης //Σε ενα κλουβί γραφείο σαν αγρίμι παίζω ατέλειωτο βουβό ταξίμι



Johnny

Founding Member

Total Posts: 4333 
Joined: May 2004 


Anthis has raised a good question: dude, why are you using a neural net rather than, for example, a regression based method? The main reason would be if you think there are strong nonlinear relationships, but with 70 explanatory variables you're unlikely to be sure that you're really capturing nonlinearities and not just seeing the effects of something else, such as multicollinearity.

Stab Art Capital Structure Demolition LLC 



Anthis

It's all Greek to me

Total Posts: 1180 
Joined: Jul 2004 


Johnny, thanks for the compliment. I was Refenes' student; I hope the know-how I got was worth the fees.

Αίεν Υψικράτειν/Τύχη μη πίστευε/Άνδρα Αρχή Δείκνυσι/Νόησις Αρχή Επιστήμης //Σε ενα κλουβί γραφείο σαν αγρίμι παίζω ατέλειωτο βουβό ταξίμι



Baltazar


Total Posts: 1769 
Joined: Jul 2004 


I would add: try typical PCA and then try kernel PCA (there is code for Matlab on the web).
You may be familiar with NNets, but I would suggest going into support vector machines if linear techniques do not work well. They are as potent as NNets but a lot less "black box".
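For the plain PCA step, a minimal NumPy sketch via the SVD of the centred data. The synthetic setup (70 inputs driven by 3 hidden factors) is an assumption chosen only to show how much variance a few components can carry when the inputs are strongly dependent.

```python
import numpy as np

def pca(X, n_components):
    """Project centred data onto its leading principal components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)  # variance share per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

# Synthetic data: 70 observed inputs generated from 3 latent factors plus noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 70))
X = factors @ loadings + 0.01 * rng.normal(size=(500, 70))

scores, explained = pca(X, n_components=3)
print(scores.shape, explained.sum())
```

Kernel PCA follows the same pattern but takes the eigenvectors of a centred kernel matrix instead of the data covariance.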
Qui fait le malin tombe dans le ravin 



Johnny

Founding Member

Total Posts: 4333 
Joined: May 2004 


I ve been Refenes' student ...
Me too.

Stab Art Capital Structure Demolition LLC 


Anthis

It's all Greek to me

Total Posts: 1180 
Joined: Jul 2004 


Very nice, that makes two of us. How many more do we need to make a club?
Also feel free to email me for details, experiences, etc. 
Αίεν Υψικράτειν/Τύχη μη πίστευε/Άνδρα Αρχή Δείκνυσι/Νόησις Αρχή Επιστήμης //Σε ενα κλουβί γραφείο σαν αγρίμι παίζω ατέλειωτο βουβό ταξίμι




Johnny

Founding Member

Total Posts: 4333 
Joined: May 2004 


Two is plenty for that sort of club. What's he doing now anyway? Working for a hedge fund probably?

Stab Art Capital Structure Demolition LLC 


Nonius

Founding Member Nonius Unbound

Total Posts: 12779 
Joined: Mar 2004 


I am describing this as a neural net but it is very, very simplistic. Think of a tree of depth 2. The leaves have a total of around 70 input nodes. Some nodes call for numerical inputs while other nodes call for a subjective "rating". There are about 10 inner nodes that have real-world descriptions. The values of the "ratings" at the inner nodes are given by weighted sums of the leaf output values. Then, the value of the root of the tree is a weighted sum of the outputs of the inner nodes. This is why I call the network topologically trivial.
What I need to choose are the weights for each of the 80 or so branches and certain parameters of the functions that map a numerical input into a number lying in a certain interval.
The tree network was initially designed with the idea that it would be very difficult to statistically calibrate the model, but our intention was to "sell" the concept to the regulators as being reasonable. Anyway, now I have met a guy who says he has a reasonably good dataset for a portion of the inputs.
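The depth-2 tree described above can be sketched in a few lines. Everything specific here is hypothetical: the clipping function used to map a numerical input into [0, 1], the two inner nodes, and all weights and values are made up for illustration.

```python
def squash(x, lo, hi):
    """Map a numerical input into [0, 1] by clipping to [lo, hi] (one possible choice)."""
    x = min(max(x, lo), hi)
    return (x - lo) / (hi - lo)

def weighted_sum(values, weights):
    return sum(w * v for w, v in zip(weights, values))

def rate(leaf_groups, leaf_weights, inner_weights):
    """Leaves -> inner nodes -> root, each level a plain weighted sum."""
    inner = [weighted_sum(v, w) for v, w in zip(leaf_groups, leaf_weights)]
    return weighted_sum(inner, inner_weights), inner

# Two hypothetical inner nodes with two leaves each.
leaf_groups = [
    [squash(15.0, 0.0, 20.0), 0.5],  # a numerical input and a subjective rating
    [1.0, 0.0],                      # two subjective ratings
]
leaf_weights = [[0.5, 0.5], [0.25, 0.75]]
inner_weights = [0.5, 0.5]

root, inner = rate(leaf_groups, leaf_weights, inner_weights)
print(inner, root)
```

The parameters to choose are exactly the entries of leaf_weights and inner_weights plus the (lo, hi) pairs of each mapping function.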

Chiral is Tyler Durden 



Johnny

Founding Member

Total Posts: 4333 
Joined: May 2004 


My thoughts are:
1. If you've got data for the inner nodes you could do a 2 step process of (step one) calibrating each inner node against its 7 or so feeder nodes and (step two) calibrating the final node against the inner nodes. You could just use OLS or something to calibrate it, KISS.
2. If you haven't got data for the inner nodes then you need to reduce the scale of the problem, which probably makes it a PCA kind of thing, doesn't it?
On the off chance that this is some kind of lending-to-HF credit-scoring model, even if you've got tons of cross-section data, I doubt you've got enough time series data to say anything reasonable. In this scenario I'd come up with my own a priori weights for the 70 input factors, rather than trying to justify/calibrate them empirically. And then defy the regulator to disagree with you. "What, you don't think xxx is important??"
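The two-step calibration in point 1 could look roughly like this, assuming data exists for the inner nodes. The data is synthetic (10 inner nodes, 7 feeders each, known weights plus noise) so that the recovered weights can be checked; the helper name ols and all numbers are assumptions.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares without an intercept."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n, n_inner, n_feed = 3000, 10, 7

# Synthetic tree: known weights generate the inner-node and root values.
leaves = rng.normal(size=(n, n_inner, n_feed))
true_leaf_w = rng.uniform(0.0, 1.0, size=(n_inner, n_feed))
true_root_w = rng.uniform(0.0, 1.0, size=n_inner)
inner = np.einsum("nij,ij->ni", leaves, true_leaf_w) + 0.05 * rng.normal(size=(n, n_inner))
root = inner @ true_root_w + 0.05 * rng.normal(size=n)

# Step one: calibrate each inner node against its own feeder nodes.
leaf_w = np.array([ols(leaves[:, i, :], inner[:, i]) for i in range(n_inner)])
# Step two: calibrate the final node against the inner nodes.
root_w = ols(inner, root)

print(np.abs(leaf_w - true_leaf_w).max(), np.abs(root_w - true_root_w).max())
```

Each regression has only 7 or 10 unknowns, which is why this stays well-behaved where a joint fit of all 80 or so weights might not.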
Stab Art Capital Structure Demolition LLC 


Anthis

It's all Greek to me

Total Posts: 1180 
Joined: Jul 2004 


Unless it is for some sort of weather forecasting model (are you into yachting or air sports?), I have a hunch that 70 input factors are way too many. 10-15 factors should be more than enough for any financial application.
Johnny check your gmail. 
Αίεν Υψικράτειν/Τύχη μη πίστευε/Άνδρα Αρχή Δείκνυσι/Νόησις Αρχή Επιστήμης //Σε ενα κλουβί γραφείο σαν αγρίμι παίζω ατέλειωτο βουβό ταξίμι




Nonius

Founding Member Nonius Unbound

Total Posts: 12779 
Joined: Mar 2004 


johnny, thanks for the comments... in fact, I think you're right... I'm gonna KISS and just set the weights... I did some tests and, perhaps not too surprisingly, the rating isn't that sensitive to changes in weights. There are just so many fucking inputs and there does seem to be some degree of dependence among the inputs: if a fund is highly transparent then there is a more than even probability that they have good stress testing, independent risk controls etc. You remember, this is all total bs and one could come up with a "model" in one day. funny thing, I spoke with Citadel today...those phuckers GOT an S&P rating....LOL!
Chiral is Tyler Durden 


prophet

Banned

Total Posts: 149 
Joined: Oct 2004 


You might try using NNs or other nonlinear, iteratively-trained (slow or difficult to converge) networks for your first stages only, then multiple regression or least squares in the second stage (fast, exact convergence). I find this is fairly successful and computationally efficient for hard problems like outright price forecasting or trading models. Pure textbook NNs can have serious convergence issues, especially with market data inputs/outputs.
In other words, structure the network. Figure out which input factors, basis vectors or parts of the problem benefit from some kind of nonlinear transform or combination, if any, but use linear or low-order models for everything else. Linear networks are nice to start with because the trained weights and errors can be readily understood. Otherwise a full NN, even for a few inputs, is forced to search over many degrees of freedom. Sometimes you may never find a good convergence, depending on the method used to train the NN.
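A rough sketch of the "structure the network" idea, with one simplification stated up front: the first stage here is a fixed, untrained sine basis applied to a single input, standing in for whatever nonlinear stage one would actually train. Only the second stage (least squares) is fitted. The synthetic target and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 4))
# Synthetic target (assumed): nonlinear in input 0, linear in inputs 1..3.
y = (np.sin(2.0 * X[:, 0]) + X[:, 1:] @ np.array([0.5, -1.0, 0.25])
     + 0.05 * rng.normal(size=n))

def design(X, nonlinear):
    """Intercept + linear terms, plus a fixed sine basis on input 0 if requested."""
    cols = [np.ones(len(X)), X]
    if nonlinear:
        x0 = X[:, 0]
        cols += [np.sin(x0), np.sin(2.0 * x0), np.sin(3.0 * x0)]
    return np.column_stack(cols)

def holdout_rmse(nonlinear):
    """Fit by least squares on the first half, score on the second half."""
    half = n // 2
    A_train, A_test = design(X[:half], nonlinear), design(X[half:], nonlinear)
    w, *_ = np.linalg.lstsq(A_train, y[:half], rcond=None)
    return float(np.sqrt(np.mean((y[half:] - A_test @ w) ** 2)))

rmse_linear = holdout_rmse(nonlinear=False)
rmse_structured = holdout_rmse(nonlinear=True)
print(rmse_linear, rmse_structured)
```

The fitted weights stay readable: one weight per linear input and one per basis function, with no iterative training to converge.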




Johnny

Founding Member

Total Posts: 4333 
Joined: May 2004 


funny thing, I spoke with Citadel today...those phuckers GOT an S&P rating....LOL!
fwiw, I've always thought this is such a shrewd thing for a hf manco to do and worth devoting a lot of time and effort to getting it right. 
Stab Art Capital Structure Demolition LLC 


Baltazar


Total Posts: 1769 
Joined: Jul 2004 


well prophet,
I thought price forecasting was impossible (never tried though).
Do you have an up or down forecast or even a value?
Qui fait le malin tombe dans le ravin 



prophet

Banned

Total Posts: 149 
Joined: Oct 2004 


Yes, forecasting price directly is a very hard problem, though definitely not impossible. Two years ago I worked on systems that directly forecasted 3K to 5K stock price change vectors two weeks into the future. These could be traded by ranking the forecasts, then going long/short the top/bottom ranked N stocks for fixed 2 to 4 week periods. These tested positive versus controls in walk-forward, survivorship-bias-free trials over my entire 5 year data * 3K to 5K (highest liquidity NYSE+Nasdaq) stock set, which was derived from NYSE TAQ T&S data. Elimination of survivorship bias was a real necessity for that analysis.
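The rank-then-trade step described above is simple enough to sketch. The tickers and forecast values are invented, and position sizing, costs and holding periods are left out.

```python
def long_short(forecasts, n):
    """forecasts: {ticker: forecast price change}. Long the top n, short the bottom n."""
    if 2 * n > len(forecasts):
        raise ValueError("need at least 2*n forecasts")
    ranked = sorted(forecasts, key=forecasts.get, reverse=True)
    return ranked[:n], ranked[-n:]

# Hypothetical two-week price change forecasts.
forecasts = {"AAA": 0.031, "BBB": -0.012, "CCC": 0.004,
             "DDD": -0.027, "EEE": 0.018, "FFF": 0.001}
longs, shorts = long_short(forecasts, n=2)
print(longs, shorts)
```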
Currently I find more success by instead forecasting the future trading statistics for a population of 100 to 200 different trading models (subsystems) a few thousand ticks into the future. Each subsystem has its own independent inductively trained network with a (short) 10 to 20 day lookback training period. Real-time trading (2 to 10 trades/day) is done by periodically ranking the real-time forecasts, then trading the top-ranked models forward. Most of this is based on the ordinary principle of diversification between weakly correlating returns across different models and markets. Yes, that seems obvious for trading systems, yet many try and fail. My “edge” has been to use adaptive networks that can function with shockingly small training sets versus the numbers of network inputs (seemingly low statistical significance), yet produce robust results in the independent walk-forward trials, demonstrating adaptability and profitability. This is why I am not a fan of textbook NNs... they can't perform well with noisy market data and borderline stat. significance in terms of low (training examples)/(input factor) ratios, and are often computationally inefficient too. Anyway, the real-time, real-money trades are consistent with expectations, despite some major implementation and scalability issues in the past. Currently I'm computationally limited to about 40 to 80 real-time systems. I hope to eventually trade thousands across many more markets. Currently I'm just considering CME ES, NQ, 6E and ER2 futures.



Baltazar


Total Posts: 1769 
Joined: Jul 2004 


That's very impressive. A friend of mine uses the same approach: a NN to choose between different models. I never really played with auto-adaptive NNs (all I did was feedforward stuff).
Don't you have problems training the network? And do you still work on overfitting/underfitting by acting on the number of hidden nodes?
More a remark: I played with SVM lately. Have you tried kernel methods instead of NN? (I dunno if we can do the equivalent of an adaptive NN with kernel methods.)
Qui fait le malin tombe dans le ravin 



prophet

Banned

Total Posts: 149 
Joined: Oct 2004 


I use some nonlinear transforms on the inputs to create a higher dimensional, more linearly separable feature space. The network inputs are now less numerically singular. I don't use any hidden nodes as NNs use them. I adjust the dimensionality of the feature space through empirical testing. Some transforms work. Others don’t. Some transforms require other transforms. Adding extra transforms and expanding the feature space always requires more training examples.
It’s not so much an auto-adaptive network. Some input transforms use feedback though. I retrain the networks from scratch, independently of past training, at daily intervals, per subsystem, then walk forward to paper-trade or trade.
I also use training example filtering to make the problem more linearly separable. After all, many instances of price action must be 100% unpredictable from the perspective of the system, due to news or whatever outside factors. Thus I felt it would be helpful to test heuristic methods to classify training examples. It worked. In trading terms the system doesn’t panic through news events, and does not let price shocks or price stagnation overly bias the training.
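The actual filtering heuristics are not given in the post, so the following is only a guess at what such a filter might look like: drop examples whose price change is an outlier relative to the median absolute deviation (a "shock") or is essentially zero ("stagnation"). The thresholds shock_k and stagnation_eps are hypothetical.

```python
import statistics

def filter_examples(examples, shock_k=5.0, stagnation_eps=1e-4):
    """examples: list of (inputs, price_change). Drops shocks and stagnant cases."""
    changes = [c for _, c in examples]
    med = statistics.median(changes)
    mad = statistics.median([abs(c - med) for c in changes])
    kept = []
    for inputs, change in examples:
        is_shock = mad > 0 and abs(change - med) > shock_k * mad
        is_stagnant = abs(change) < stagnation_eps
        if not (is_shock or is_stagnant):
            kept.append((inputs, change))
    return kept

# Toy examples: the inputs are just an index here, the second item is the price change.
changes = [0.01, -0.02, 0.015, 0.0, -0.01, 0.5, 0.02, -0.015]
examples = list(enumerate(changes))
kept = filter_examples(examples)
print([c for _, c in kept])
```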
I have not tried kernel methods yet. I need to do that. Thanks for the suggestion. 



Baltazar


Total Posts: 1769 
Joined: Jul 2004 


In fact the trick of kernel methods is: map your inputs into a higher dimensional space where they would be more separable.
SVM works this way:
1) map your input into a higher dim space (even an infinite dim space) using some operator Phi
2) construct the best (according to some criterion, of course) linear decision function in that space.
Seems simple; two key points however:
Step 2 only uses dot products of the mapped inputs. Therefore, instead of computing Phi(x), Phi(y) and then their dot product Phi(x)·Phi(y) (very difficult and demanding in a high dim space), we use a Mercer kernel k(x,y). This kernel is the reproducing kernel of the high dim space, so we don't compute Phi anymore, just k(x,y) (x and y belong to the low dim space, so easy calculus). That is the kernel trick used in all Mercer kernel methods.
Step 2 constructs a linear function in a very high dim space from a training set: over/underfitting!! The criterion used is the margin (more on that later if you want), not the Fisher metric or SNR. -> detector complexity is reduced when using this technique -> fewer overfitting problems (Vapnik uses the detector complexity as a measure of overfitting, roughly).
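A small worked check of the kernel trick, using the degree-2 polynomial kernel k(x,y) = (x·y)^2 on R^2 as an example (the choice of kernel is mine, not from the post). Its explicit feature map is Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), and the two routes give the same number.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def phi(v):
    """Explicit feature map for k(x, y) = (x . y)^2 on R^2."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def k(x, y):
    """Same quantity computed in the low-dim space, without ever forming Phi."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))  # map first, then dot in the high-dim space
implicit = k(x, y)              # kernel evaluated on the original inputs
print(explicit, implicit)
```

Here x·y = 11, so both values equal 121 up to floating-point rounding; for kernels such as the Gaussian one, Phi is infinite dimensional and only the second route is available.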
Qui fait le malin tombe dans le ravin 



prophet

Banned

Total Posts: 149 
Joined: Oct 2004 


Hi Baltazar,
Thanks again for your help. I may not understand kernel methods fully yet. This is what I understand so far based on your comments and the following docs:
http://www.learningwithkernels.org/sections/section17.pdf
http://www.learningkernelclassifiers.org/chapter_1.htm
I understand this is a faster method to calculate nearest neighbor Euclidean distances (between unseen data and support vectors) in the (much) higher dimensional feature space, avoiding the need to calculate Phi() or any dot products in the high-dim space. I like this very much.
My typical networks use ~50 inputs (after nonlinear transforms/expansions), a single output and ~50,000 training examples to produce a single 50-element weighting vector that can be rapidly dotted with either real-time data or historical unseen data in walk-forward trials. I do a separate independent network for every subsystem * system combination.
With a kernel method I will need to find a transform Phi() to take the 50 inputs to a higher dimensional, more separable space. I will also choose a small number of support vectors, N, much less than 50K, for nearest neighbor Euclidean classification of unseen data. I’m not sure how to do this yet, apart from taking group averages of various training examples meeting output criteria. This I can experiment with. Then I find N eigenvectors, each expressed as an expansion over the support vectors in the original input space. Is this part correct? Then to execute the model, for each desired classification estimate f_n(x) I will process my incoming 50 channel data by the n’th kernel and then dot with the n’th eigenvector to give the final classification estimate. Am I understanding any of this right?



Baltazar


Total Posts: 1769 
Joined: Jul 2004 


SVM is a decision technique.
In fact SVM combines two tricks.
The first is the kernel trick, that is, instead of
1) choosing the higher dim space, then finding the Phi() and then computing the dot product k(x,y) = Phi(x)·Phi(y),
2) in fact you just choose a k(x,y), verify the Mercer condition and then forget about the Phi's.
All kernel methods use that trick: formulate the problem in terms of dot products only, and then use k(x,y).
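One practical way to sanity-check a candidate k(x,y) is to build its Gram matrix on a sample and look at the smallest eigenvalue, since a Mercer kernel must give a positive semidefinite Gram matrix on every finite sample. This is only a necessary check on the sample at hand, not a proof. The two kernels below are my own examples: a Gaussian (RBF) kernel, which passes, and the negative squared distance, which fails.

```python
import numpy as np

def gram(kernel, X):
    """Matrix of kernel values between all pairs of rows of X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def min_eigenvalue(K):
    # Symmetrise to guard against rounding before the symmetric eigen-solver.
    return float(np.linalg.eigvalsh((K + K.T) / 2.0).min())

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def neg_sq_dist(a, b):
    return -np.sum((a - b) ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))

rbf_min = min_eigenvalue(gram(rbf, X))
bad_min = min_eigenvalue(gram(neg_sq_dist, X))
print(rbf_min, bad_min)
```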
The second trick in SVM is the margin (and the support vectors). SVM is a linear decision function (in the higher dim space). It's written only in terms of dot products (so you can use the kernel trick). The criterion you use to find your hyperplane is the margin, that is, the distance between that hyperplane and the nearest samples.
In fact that margin does not depend on all the training samples but just on some of the samples (if you've got clouds of data, typically these samples are the ones at the periphery of the cloud). These samples are called support vectors. So you don't choose your support vectors; they are selected automatically.
Now your problem, if I got it: you have 50,000 labelled training samples. Each sample is in R^50, right? You find a Phi that is an operator from R^50 to R^n with n>50, right? And then use your NN, right?
I'll have to freshen up on kernel regression, but for SVM (that is, kernel binary decision) all is done in one step. You choose the k(x,y) (you forget about Phi, it is implicitly mapped) and then your decision hyperplane is (in the higher dim space) a linear combination of some training samples (the support vectors). Add that optimizing the hyperplane is quadratic programming, so no problem of local minima (as opposed to NN).
I'll brush up on kernel regression.

Qui fait le malin tombe dans le ravin 



prophet

Banned

Total Posts: 149 
Joined: Oct 2004 


I could treat this as either a regression or a decision problem. I will try both. Perhaps if I leave out the final threshold (sigma) it will work for ranking classifications within the same network or ranking across different SVM networks.
Your assumptions are correct… 50000 training examples composed of x in R^50 (inputs) and y in R^1 (outputs). The outputs can be converted to {-1,1}. I frequently will solve (and walk-forward test) a few hundred of these problems per hour, each with different data. Is this 50000 set too large for fast determination of support vectors? I may be able to reduce it by binning or subsampling, though I would prefer that the algorithm can process 50000 or more examples in a reasonable amount of time. Maybe I am asking too much?
It appears that the computational complexity to find support vectors may scale as O(N^2), where N is the training set size, for solving for eigenvectors or doing a quadratic programming maximization. Perhaps it can be made O(N*logN) by sacrificing some accuracy, or with Monte Carlo methods. I also do not have any a priori knowledge of which training examples are "better" than others, beyond some simple filters to remove excessive price shocks and stagnation.
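As a back-of-envelope illustration of the N^2 scaling, consider just the memory for a dense N x N kernel matrix in 8-byte floats. Whether a given solver actually stores the full matrix is an assumption; many do not.

```python
def gram_bytes(n, bytes_per_entry=8):
    """Memory for a dense n x n kernel matrix of 8-byte floats."""
    return n * n * bytes_per_entry

for n in (5_000, 50_000):
    print(n, gram_bytes(n) / 1e9, "GB")
```

At N = 50,000 the full matrix needs 20 GB, while subsampling by a factor of 10 cuts that by a factor of 100 to 0.2 GB, which is the appeal of binning or subsampling here.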
Here is what else I think I understand about SVM: I identify a smaller set of support vectors (automatic feature extraction?) by choosing a kernel and solving for the alphas, which are either eigenvectors of the Phi(x)’s or solutions of a quadratic programming problem, maximizing the margin. The choice of kernel (meeting the Mercer condition) will dictate the Phi(). Never is Phi(x) evaluated. Then to classify data, I dot the appropriate alpha with the kernel as: alpha_sv * k(x_sv,x), where x_sv is a support vector and alpha_sv is the solved alpha for that support vector. x is the data to classify (forecast future trading model returns, or just binary +/- perhaps).
It may take me a few days to study and understand this enough to implement it properly. I found this link to be very helpful: http://www.cse.msu.edu/~lawhiu/intro_SVM.ppt I also found several Matlab and C algorithms, but want to write my own algorithm to help me better understand this before I use other people's code. I need to understand this from a speed versus accuracy point of view, simply because I have a LOT of data I will need to process, if this should be successful. I have at least 200 GB of training examples readily available, across several markets and trading models, most of it can be calculated on the fly with low overhead thankfully. 



Baltazar


Total Posts: 1769 
Joined: Jul 2004 


"Then to classify data, I dot the appropriate alpha with the kernel as: alpha_sv * k(x_sv,x) where x_sv is a support vector, alpha_sv is the solvedalpha for that support vector. x is data to classify (forecast future trading model returns, or just binary +/ perhaps)."
In fact you do not choose an appropriate alpha, you use all of them: sum_i alpha_sv_i * k(x_sv_i, x)
That points out a limitation: to use your decision function, you need (number of support vectors) kernel evaluations. You can approximate the solution by setting small alphas to zero or by some more complex technique.
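A sketch of that decision function and of the simple pruning approximation. The support vectors and alphas below are made-up numbers rather than the output of a trained SVM, the alphas are assumed to already carry the label sign, and the bias term and RBF kernel are my additions.

```python
import math

def rbf(a, b, gamma=0.5):
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def decision(x, svs, alphas, kernel=rbf, bias=0.0):
    """f(x) = sum_i alpha_i * k(x_sv_i, x) + bias: one kernel call per support vector."""
    return sum(a * kernel(sv, x) for a, sv in zip(alphas, svs)) + bias

def prune(svs, alphas, tol):
    """Approximate the detector by dropping support vectors with tiny alphas."""
    keep = [i for i, a in enumerate(alphas) if abs(a) >= tol]
    return [svs[i] for i in keep], [alphas[i] for i in keep]

# Hypothetical support vectors and signed alphas.
svs = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (0.0, 2.0)]
alphas = [1.0, -0.8, 0.001, -0.002]

x = (0.2, 0.1)
full = decision(x, svs, alphas)
small_svs, small_alphas = prune(svs, alphas, tol=0.01)
approx = decision(x, small_svs, small_alphas)
print(full, approx, len(small_svs))
```

Since each kernel value is at most 1 for the RBF kernel, the error from pruning is bounded by the sum of the dropped |alpha| values, here 0.003.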
"I also do not have any apriori knowledge of which training examples are "better""
In fact the better ones are the ones that, once transformed, will lie near the hyperplane. You have no way to know, pre-mapping, which sample will be a support vector.
But because you update your detector continuously, such information could be of interest; therefore I propose (after you've gotten familiar with SVM) to check "http://www.irccyn.ecnantes.fr/hebergement/Publications/2003/1668.pdf".
That is: you get a new sample and, using the ideas of this paper, you check if such a sample is "usual" or not. If it's not, retrain your classifier; otherwise, no need to do so as the result won't change.
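One simple stand-in for such a check, which is not taken from the linked paper: for a margin-based SVM, a new labelled sample that already sits on the correct side with y*f(x) >= 1 would get a zero alpha, so adding it should leave the solution unchanged. Only samples inside the margin or misclassified would trigger a retrain. The function name and the example values of f(x) are hypothetical.

```python
def needs_retrain(f_x, label, margin=1.0):
    """True when a new labelled sample falls inside the margin or is misclassified."""
    return label * f_x < margin

# (decision value f(x) from the current classifier, true label in {-1, +1})
stream = [(2.3, 1), (-1.7, -1), (0.4, 1), (0.9, -1)]
flags = [needs_retrain(f_x, label) for f_x, label in stream]
print(flags)
```

This requires the label to be known when the check is made; a novelty test that uses only x, as the "usual or not" wording suggests, would need a different criterion.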
Regarding whether SVM will be fast enough, honestly I don't know. I use them with Matlab and sketchy code for method validation. I didn't work on that aspect of things. I bet it's faster than NN in training but slower in use (depends on sizes, for sure).
Qui fait le malin tombe dans le ravin 


