Overfitting to polynomials

This entry is part of the 15-part series Grails Finance.


Overfitting occurs when a model fitted to historical data describes only the dataset it was fitted on and generalizes poorly beyond it. I am using polynomial fits up to the 4th degree, but in principle you could use any other type of function. I fitted the closing prices of two securities to each other, and the closing price of a single security to time. To test the robustness of each fit, I added noise to the data and calculated the resulting relative error.

Services

I used the following services:

  • correlationService
  • polynomialFitService
  • relativeErrorService
  • noiseService

Correlation service

The correlation service calculates Spearman's rank correlation and the corresponding z-score. The assumption is that higher correlation leads to better fits.

...
    def correlate(data1, data2) {
        def n = data1.size()
        def correlation = new SpearmansCorrelation().correlation(data1 as double[],
            data2 as double[])
        // z-score via Fisher's transformation; 1.06 is the variance
        // correction for Spearman's rho
        def z = fisher( correlation ) * Math.sqrt( (n - 3) / 1.06 )

        return [correlation : correlation, z : z]
    }

    def fisher(r) {
        return 0.5 * Math.log( ( 1 + r ) / ( 1 - r ) )
    }
...
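The Fisher transformation and the z-score can be checked in isolation. Here is a minimal Java sketch of the same two formulas, using only the JDK rather than Commons Math (the class name is mine):

```java
public class FisherZ {
    // Fisher's z-transformation, arctanh(r): stabilizes the variance of a
    // correlation estimate so it can be compared against a normal distribution.
    static double fisher(double r) {
        return 0.5 * Math.log((1 + r) / (1 - r));
    }

    // z-score for a Spearman correlation r over n observations;
    // 1.06 is the variance correction used for Spearman's rho.
    static double zScore(double r, int n) {
        return fisher(r) * Math.sqrt((n - 3) / 1.06);
    }

    public static void main(String[] args) {
        System.out.println(fisher(0.0));      // 0.0: no correlation maps to zero
        System.out.println(zScore(0.8, 30));  // roughly 5.5: clearly significant
    }
}
```

A z-score above about 2 would reject the no-correlation hypothesis at the usual 5% level.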

Polynomial fit service

The polynomial fit service uses the PolynomialFitter of Apache Commons Math to produce polynomial fits.

...
    def fit(degree, vals) {
        PolynomialFitter fitter = new PolynomialFitter(degree, new LevenbergMarquardtOptimizer())

        // addObservedPoint(weight, x, y): fit the values against the time index
        for(i in 0 ..< vals.size() )
            fitter.addObservedPoint( 1, i + 1, vals[i] )

        PolynomialFunction fitted = fitter.fit()

        def regression = new SimpleRegression(new TDistributionImpl(degree))

        // regress the observed values against the fitted ones to obtain R^2
        for(i in 0 ..< vals.size() )
            regression.addData(vals[i], fitted.value(i + 1))

        return [coefficients : fitted.getCoefficients(), rsquare: regression.getRSquare()]
    }

    def fit(degree, vals, vals2) {
        PolynomialFitter fitter = new PolynomialFitter(degree as int, new LevenbergMarquardtOptimizer())

        // fit vals2 as a function of vals, all points weighted equally
        for(i in 0 ..< vals.size() )
            fitter.addObservedPoint( 1, vals[i], vals2[i] )

        return fitter.fit()
    }

    def eval(vals, poly) {
       def map = [:]

       vals.each {
           map[ it ] = poly.value( it )
       }

       return map
    }
...
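Under the hood PolynomialFitter is doing a least-squares fit. For low degrees the same result can be obtained directly from the normal equations; the following Java sketch (my own helper, not the Commons Math API) builds the normal-equation matrix from the Vandermonde moments and solves it with Gaussian elimination:

```java
import java.util.Arrays;

public class PolyFit {
    // Least-squares polynomial fit: returns c[0] + c[1]x + ... + c[d]x^d
    static double[] fit(int degree, double[] x, double[] y) {
        int m = degree + 1;
        double[][] a = new double[m][m + 1];   // augmented normal-equation matrix
        for (int r = 0; r < m; r++) {
            for (int c = 0; c < m; c++)
                for (int i = 0; i < x.length; i++)
                    a[r][c] += Math.pow(x[i], r + c);
            for (int i = 0; i < x.length; i++)
                a[r][m] += Math.pow(x[i], r) * y[i];
        }
        // Gaussian elimination with partial pivoting
        for (int p = 0; p < m; p++) {
            int best = p;
            for (int r = p + 1; r < m; r++)
                if (Math.abs(a[r][p]) > Math.abs(a[best][p])) best = r;
            double[] tmp = a[p]; a[p] = a[best]; a[best] = tmp;
            for (int r = p + 1; r < m; r++) {
                double f = a[r][p] / a[p][p];
                for (int c = p; c <= m; c++) a[r][c] -= f * a[p][c];
            }
        }
        // back substitution
        double[] coeff = new double[m];
        for (int r = m - 1; r >= 0; r--) {
            double s = a[r][m];
            for (int c = r + 1; c < m; c++) s -= a[r][c] * coeff[c];
            coeff[r] = s / a[r][r];
        }
        return coeff;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {3, 7, 13, 21, 31};       // exactly y = 1 + x + x^2
        System.out.println(Arrays.toString(fit(2, x, y)));
    }
}
```

Because the sample data is an exact quadratic, the recovered coefficients should be [1, 1, 1] up to numerical round-off. The normal equations become badly conditioned for high degrees, which is one reason Commons Math uses an iterative optimizer instead.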

Relative error service

The relative error service calculates the relative error between a list of observed values and a list of reference values.

...
    def error(vals, vals2) {
        def relError = []
        for( i in 0..<vals.size() ) {
            if( vals2[i] != null && vals[i] != null)
                relError << ( vals2[i] - vals[i] ) / vals[i]
        }
 
        return relError
    }
...
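The same computation in plain Java, mirroring the null check of the Groovy service (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class RelativeError {
    // (observed - reference) / reference for each pair; pairs with a
    // missing value on either side are skipped.
    static List<Double> error(List<Double> reference, List<Double> observed) {
        List<Double> relError = new ArrayList<>();
        for (int i = 0; i < reference.size(); i++) {
            Double ref = reference.get(i), obs = observed.get(i);
            if (ref != null && obs != null)
                relError.add((obs - ref) / ref);
        }
        return relError;
    }

    public static void main(String[] args) {
        System.out.println(error(List.of(100.0, 200.0), List.of(110.0, 190.0)));
        // [0.1, -0.05]
    }
}
```

Note that the division assumes the reference values are never zero, which holds for closing prices.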

Noise service

The noise service adds noise to a list of values. My definition of noise is a value drawn from a uniform random distribution between 0 and some factor times the variance of the input values. Other definitions exist, for instance one based on the Average True Range. I leave that implementation as an exercise for the reader :).

...
    def add(vals, factor) {
        if( factor == 0 ) {
            return vals
        }
 
        def rdi = new RandomDataImpl()
        def range = factor * StatUtils.variance(vals as double[]) 
 
        return vals.collect {  it + rdi.nextUniform(0, range) }
    }
...
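A rough Java equivalent, seeded for reproducibility (StatUtils.variance computes the bias-corrected sample variance, which is mirrored here; the class name is mine):

```java
import java.util.Random;

public class Noise {
    // Adds uniform noise in [0, factor * variance) to every value.
    // Note: like the service above, this shifts values upward on average;
    // noise centred on zero would be an easy variation.
    static double[] add(double[] vals, double factor, long seed) {
        if (factor == 0) return vals.clone();
        double mean = 0;
        for (double v : vals) mean += v;
        mean /= vals.length;
        double variance = 0;                   // sample variance (n - 1 divisor)
        for (double v : vals) variance += (v - mean) * (v - mean);
        variance /= (vals.length - 1);
        double range = factor * variance;
        Random rnd = new Random(seed);
        double[] out = new double[vals.length];
        for (int i = 0; i < vals.length; i++)
            out[i] = vals[i] + rnd.nextDouble() * range;
        return out;
    }

    public static void main(String[] args) {
        // every output value lies in [v, v + 0.05 * variance)
        double[] noisy = add(new double[]{10, 12, 14, 16}, 0.05, 42L);
    }
}
```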

The controller

The fit is produced using the first 200 points of a one-year closing price dataset, so that some values, about 50, remain that were not fitted to. This is the "control group".

...
    def fieldname = "Close"

    def index = {

        if(params.select1 != null) {
            def map = historicalQueryService.pairedValuesMap(params.select1,
                params.select2, fieldname)
            def responseMap = [:]
            def data1 = map["data1"]
            def data2 = map["data2"]
            responseMap["corrParams"] = correlationService.correlate(data1, data2)
            // fit on the first 200 points only; the remainder is the control group
            responseMap["polynomial"] = polynomialFitService.fit(params.power,
                data1[0..<200], data2[0..<200])

            responseMap["fitData"] =
                polynomialFitService.eval( data1, responseMap["polynomial"] )

            def descStats = errorStats( data2,
                responseMap["fitData"] )
            responseMap["stats"] = descStats

            def data = [:]

            for( i in 0..<data1.size())
                data[data1[i]] = data2[i]

            responseMap["data"] = data

            return responseMap
        }
    }

    def single = {
        if(params.select1 != null) {
            def data = historicalQueryService.queryVals(params.select1,
                fieldname)
            def responseMap = [:]
            responseMap["polynomial"] = polynomialFitService.fit(params.power,
                0..<200, data[0..<200])

            responseMap["fitData"] =
                polynomialFitService.eval( 0..data.size(),
                    responseMap["polynomial"] )

            def descStats = errorStats( data,
                responseMap["fitData"] )
            responseMap["stats"] = descStats

            def dataMap = [:]

            for( i in 0..<data.size())
                dataMap[i + 1] = data[i]

            responseMap["data"] = dataMap

            return responseMap
        }
    }

    def errorStats(data, fittedData) {
        def statsList = []
        // noise levels: 1%, 2%, 4%, 8%, 16% and 32% of the variance
        for(i in 0 .. 5) {
            def noisyData = noiseService.add( fittedData.values().asList(),
                0.01 * 2 ** i)
            def error = relativeErrorService.error( data,
                noisyData)
            statsList << descriptiveStatisticsService.stats(error)
        }

        return statsList
    }
...

The view

Here is the view for pairs of securities. The view for a single security is almost a copy of this one.

...
      <table>
         <tr>
            <g:form>
            <td>
                    <g:select name="select1" from="${['DIA', 'SPY', 'GLD']}" value="${params.select1}"/>
            </td> 
            <td>
                    <g:select name="select2" from="${['DIA', 'SPY', 'GLD']}" value="${params.select2}"/>
            </td>
            <td>
                    Power <g:select name="power" from="${[1, 2, 3, 4]}" value="${params.power}"></g:select>
            </td>
            <td>
               <g:actionSubmit name="doChart" action="index" value="Chart"/>
            </td>
            </g:form>
         </tr>
      </table>
 
      <g:if test="${params.select1 != null}">
        <p>&nbsp; &nbsp;EOD Close ${params.select1} vs ${params.select2}</p>
        <p>&nbsp; &nbsp;y = ${polynomial}</p>
 
        <table border="1">
           <tr>
               <g:each in="${corrParams}" var="entry" >
                      <td>${entry.key}</td>
                    </g:each>
           </tr>
           <tr>
               <g:each in="${corrParams}" var="entry" >
               <g:each in="${entry.value}" var="item" >
                      <td>${sprintf('%.3g', item)}</td>
                    </g:each>
              </g:each>
           </tr>
        </table>
 
        <br/>
 
            <%
               def dataMax = data?.values()?.max()
               def fitDataMax = fitData?.values()?.max()
               def yMax = Math.max(dataMax, fitDataMax)
               def priceData = [data?.values()?.asList(), 
                   fitData?.values()?.asList()]
            %>
            <g:lineChart type="lc" 
                size="${[600,200]}" 
                colors="${['FF0000','0000FF']}" 
                axes="x,y" 
                lineStyles="${[[2,2,2],[2,8,4]]}" 
                legend="${['Data', 'Fit']}" 
                gridLines="${dataMax/11},25" 
                axesLabels="${[0:xLabels,1:[0,yMax/2,yMax]] }"
                data="${priceData}"
                dataType="simple"/>
 
            <br/>
 
        <table border="1">
           <tr>
               <td align="center" colspan="${stats[0].size() + 1}">
                   Relative error fit statistics
               </td>
           </tr>
           <tr>
               <td>Added Noise % of variance</td>
               <g:each in="${stats[0]}" var="entry" >
                     <td>${entry.key}</td>
                    </g:each>
           </tr>
 
           <g:each in="${0..<stats.size()}">
           <tr align="center">
                   <td>${ 2 ** it } %</td>
               <g:each in="${stats[it]}" var="entry" >
               <g:each in="${entry.value}" var="item" >
                      <td>${sprintf('%.3g', item)}</td>
                    </g:each>
                       </g:each>
           </tr>
      </g:each>
        </table>
        <br/>
        </g:if>
...

Result

Here are the results for polynomials of 1st to 4th degree. The higher the polynomial degree, the worse the fit on the points outside the fitted range. We can also see that stronger correlation leads to better fits: the DIA-SPY fits are better than the DIA-GLD ones.

[Charts: polynomial fits of degree 1 to 4 for the DIA-SPY pair, the DIA-GLD pair, and DIA alone.]


By the author of NumPy Beginner's Guide, NumPy Cookbook and Instant Pygame.