Statistical Bootstrapping by Case Resampling

This entry is part 10 of 23 in the series NumPy Strategies

NumPy Strategies 0.1.7

I haven’t blogged in a while because I am supposed to be working on a Big Secret Project (BSP). Obviously, I am not allowed to talk about that. The Product Owner/Manager of our FHF (Fantasy Hedge Funds) has come up with the following user story:

  • Measure the margin of error of a small data sample.

The story concerns the data that we are using. Our Product Master is worried that we don’t have enough data to do anything meaningful. One way to address this is statistical bootstrapping, and in particular the type of bootstrapping called case resampling. We will apply this method to estimating the mean of the AAPL stock price and, for comparison, the mean of a sample drawn from the normal distribution.

The steps of the algorithm are:

  1. Store the empirical distribution from our data.
  2. Generate random samples, drawn with replacement from this distribution, each the same size as the original sample.
  3. Calculate and store the means of these samples.
  4. Determine in which percentile of the means distribution the mean of the original sample lies.
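
The four steps above can also be sketched in a vectorized form. This is only a minimal illustration, not the program from this post: the `bootstrap_means` helper, the fixed seed, and the synthetic `sample` array (standing in for real price data) are all assumptions made for the example.

```python
import numpy as np
from scipy import stats


def bootstrap_means(values, n_resamples, rng):
    """Case resampling: means of n_resamples samples drawn with replacement."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Steps 1-2: each row holds n random indices into the original sample,
    # so every resample has the same size as the original data.
    indices = rng.integers(0, n, size=(n_resamples, n))
    # Step 3: the mean of each resample.
    return values[indices].mean(axis=1)


rng = np.random.default_rng(42)  # fixed seed, only for reproducibility
# Synthetic stand-in data (an assumption), not real AAPL quotes.
sample = rng.normal(loc=100.0, scale=5.0, size=250)

means = bootstrap_means(sample, 400, rng)

# Step 4: percentile of the original sample mean among the bootstrap means.
percentile = stats.percentileofscore(means, sample.mean())
print("Resamples=%d Percentile=%.2f" % (len(means), percentile))
```

Drawing all the indices at once trades memory for speed; for very large samples a loop over resamples may be the safer choice.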

The code, available on GitHub and listed below, performs these steps.

import numpy
import sys
import matplotlib.pyplot
# Note: matplotlib.finance was removed in matplotlib 2.2, so this import needs an older release.
from matplotlib.finance import quotes_historical_yahoo
from datetime import date
import scipy.stats
 
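def random_indices(N):
   # N random indices in [0, N), drawn with replacement
   return numpy.random.randint(0, N, N)
 
def random_values(values):
   # One resample of the data, the same size as the original
   return numpy.take(values, random_indices(len(values)))
 
def generate_means(values):
   # Number of resamples, given as the second command-line argument
   NTRIES = int(sys.argv[2])
 
   means = numpy.zeros(NTRIES)
 
   # range instead of xrange, so the loop also runs on Python 3
   for i in range(NTRIES):
      means[i] = random_values(values).mean()
 
   return means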
 
def format_mean(values):
   return "Mean=%.3f" % (values.mean())
 
def plot_percentile(values, means):
   matplotlib.pyplot.hist(means)
   percentile = scipy.stats.percentileofscore(means, values.mean())
   matplotlib.pyplot.legend([format_mean(means), "Percentile=%.2f" %(percentile)])
 
def plot(values):
   matplotlib.pyplot.hist(values)
   matplotlib.pyplot.legend([format_mean(values)])
 
today = date.today()
start = (today.year - 1, today.month, today.day)
quotes = quotes_historical_yahoo(sys.argv[1], start, today)
close =  numpy.array([q[4] for q in quotes])
close_means = generate_means(close)
 
normal_values = numpy.random.normal(size=len(close))
normal_means = generate_means(normal_values)
 
matplotlib.pyplot.subplot(221)
matplotlib.pyplot.title("Close Values")
plot(close)
 
matplotlib.pyplot.subplot(222)
matplotlib.pyplot.title("Normal Values")
plot(normal_values)
 
matplotlib.pyplot.subplot(223)
matplotlib.pyplot.title("Close Means")
plot_percentile(close, close_means)
 
matplotlib.pyplot.subplot(224)
matplotlib.pyplot.title("Normal Means")
plot_percentile(normal_values, normal_means)
 
matplotlib.pyplot.show()

After running the program with 400 generated samples, I get the following plots for the AAPL close price and the Gaussian distribution.

Case Resampling
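
The user story asked for a margin of error, while the program above reports a percentile. One common way to read a margin of error off the bootstrap means is a percentile interval. The sketch below is an addition, not part of the original program: it assumes a 95% level, a fixed seed, and synthetic normal data instead of real quotes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
sample = rng.normal(size=250)   # synthetic stand-in for the real data

# 400 resamples with replacement, each the size of the original sample.
indices = rng.integers(0, len(sample), size=(400, len(sample)))
means = sample[indices].mean(axis=1)

# The 2.5th and 97.5th percentiles of the bootstrap means bound a 95% interval.
lower, upper = np.percentile(means, [2.5, 97.5])
margin = (upper - lower) / 2.0

print("Mean=%.3f Interval=[%.3f, %.3f] Margin=%.3f"
      % (sample.mean(), lower, upper, margin))
```

The half-width of the interval is used here as a rough margin of error. The interval need not be symmetric around the sample mean, so reporting both bounds is usually more informative.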

If you liked this post and are interested in NumPy, check out NumPy Beginner’s Guide by yours truly.

By the author of NumPy Beginner's Guide, NumPy Cookbook and Instant Pygame.