# Hunting Black Swans

This entry is part 21 of 23 in the series Numpy Strategies

Once upon a time, in a cold but prosperous kingdom, people only knew about the existence of white swans. Swans were white by definition; there was no doubt about it. Then one day a scholar, with the blessing of the King, traveled to a far land to find … something, I don’t remember what. In that exotic place on a different continent, the scholar discovered swans. And they were black! This put a dent in the white swans’ supremacy. Black swans defied the model of the world.

If you look at http://en.wikipedia.org/wiki/List_of_largest_daily_changes_in_the_S%26P_500, you can explain some of the “Black Swans”, but not all. There is a stylized fact called “volatility clustering”: large moves tend to be followed by more large moves, and calm periods by more calm. Clustering is also the name of a group of unsupervised learning algorithms. Some of those algorithms need to be told the number of clusters, while others can figure it out themselves. I prefer the latter category, and scikit-learn provides AffinityPropagation and MeanShift. From my own experiments I concluded that MeanShift is faster, but still not fast enough, so I decided to pickle the results. Another thing that seems to improve performance is scaling the input data. As input I provided the returns of the low price (I should have used the close price for easier comparison) and the standard deviation of those returns as a proxy for the volatility.
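The pipeline described above can be sketched roughly as follows. This is not the original code: the synthetic price series, the use of log returns, the 20-day rolling window, the `rolling_std` helper, the bandwidth settings and the cache file name are all assumptions made for illustration, and `ms_cluster` here is only a guess at what such a helper might look like.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import preprocessing
from sklearn.cluster import MeanShift, estimate_bandwidth


def rolling_std(values, window):
    """Standard deviation over a trailing window; NaN until the window fills."""
    out = np.full(len(values), np.nan)
    for i in range(window - 1, len(values)):
        out[i] = values[i - window + 1:i + 1].std()
    return out


def ms_cluster(x, cache_path="ms_labels.pkl"):
    """Return MeanShift labels for x, reusing a pickled result if one exists.

    The cache is keyed by file name only, so delete the file when x changes.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    bandwidth = estimate_bandwidth(x, quantile=0.3, random_state=0)
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(x)
    with open(cache_path, "wb") as f:
        pickle.dump(labels, f)
    return labels


# Synthetic "low" prices stand in for the real data (assumption).
rng = np.random.default_rng(0)
low = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=500)))

r = np.diff(np.log(low))            # log returns (assumption)
vols = rolling_std(r, window=20)    # rolling volatility proxy (assumption)
valid = ~np.isnan(vols)             # drop the warm-up period
r, vols = r[valid], vols[valid]

# Scale both features and stack them as columns: one row per observation.
x = np.vstack((preprocessing.scale(r), preprocessing.scale(vols))).T

cache_path = os.path.join(tempfile.mkdtemp(), "ms_labels.pkl")
labels = ms_cluster(x, cache_path)        # fits and writes the cache
labels_again = ms_cluster(x, cache_path)  # loads from the cache
print("Clusters found:", len(np.unique(labels)))
```

The second call returns immediately because it only unpickles the stored labels, which is the whole point of caching a slow fit.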

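```python
import numpy as np
from sklearn import preprocessing

# r (returns), vols (volatility proxy) and ms_cluster() are defined elsewhere.
# Scale both features and stack them as columns: one row per observation.
x = np.vstack((preprocessing.scale(r), preprocessing.scale(vols)))
x = x.T
labels = ms_cluster(x)

# Report the size of each cluster.
for i in np.unique(labels):
    cnt = (labels == i).sum()
    print('Cluster', i, '#', cnt)
```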
```
Cluster 0 # 15744
Cluster 1 # 97
Cluster 2 # 125
Cluster 3 # 4
Cluster 4 # 2
Cluster 5 # 1
Cluster 6 # 180
Cluster 7 # 5
Cluster 8 # 1
Cluster 9 # 3
```
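
The cluster sizes printed above can be summarized with a few lines of NumPy. The sketch below computes the share of the largest cluster and fits a straight line to the rank-size relation on a log-log scale. With only ten clusters such a fit is at best a rough hint, not evidence for any particular distribution.

```python
import numpy as np

# Cluster sizes copied from the output above, indexed by cluster label.
counts = np.array([15744, 97, 125, 4, 2, 1, 180, 5, 1, 3])

total = counts.sum()
share = counts[0] / total
print("Observations:", total)
print("Share of cluster 0: {:.1%}".format(share))

# Rank-size view: sort sizes in descending order and regress
# log(size) on log(rank). A roughly linear relation would be
# compatible with a power law, but ten points prove very little.
sizes = np.sort(counts)[::-1]
ranks = np.arange(1, len(sizes) + 1)
slope, intercept = np.polyfit(np.log(ranks), np.log(sizes), 1)
print("Log-log slope: {:.2f}".format(slope))
```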

Most of the returns (about 97%) are in the first cluster. There are a few mid-size clusters and several very small ones, some of which have only one member. There may be some kind of power law at work here, but it is difficult to tell. I defined a Black Swan as a move larger than 5 standard deviations.
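
For a sense of how extreme a 5 standard deviation move is, the sketch below computes how often one would occur if returns were normally distributed, which real returns are not. The figure of 16162 observations is the sum of the cluster sizes listed above. Treating the ratio of return to volatility as a standard normal variable is a simplifying assumption.

```python
import math

n_days = 16162    # total observations, summed from the cluster sizes
threshold = 5.0   # Black Swan threshold in standard deviations

# Two-sided tail probability P(|Z| > threshold) for a standard normal Z.
p_tail = math.erfc(threshold / math.sqrt(2))

# Number of such days expected in the sample under the Gaussian assumption.
expected = n_days * p_tail

print("P(|Z| > 5) = {:.2e}".format(p_tail))
print("Expected 5-sigma days in {} observations: {:.4f}".format(n_days, expected))
```

Under that assumption far less than one such day would be expected over the whole sample, so any sizeable number of hits says more about the Gaussian model than about the swans.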

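```python
# Black Swans: moves larger than 5 standard deviations in either direction.
swan_idxs = np.where(np.abs(r / vols) > 5)[0]
swan_labels = labels[swan_idxs]
print("Swan Indices", swan_idxs, '#', len(swan_idxs))

# df is the price DataFrame loaded elsewhere; its index holds the dates.
for i in swan_idxs:
    print(df.index.values[i])

print("Swan Labels", swan_labels)
```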
```
Swan Indices [  119  1378  1435  1958  2835  3105  3114  3115  3116  4756  5101  6179
6205  7585  7965  8189  8236  9217  9495  9496  9498  9501  9505  9525
9552  9999 12031 12033 12243 12245 12276 12633 12645 12646 12653 13009
13218 13222 13237 13277 14606 14772 14778 14783 14786 14787 14788 14789
14790 14792 14795 14800 14806 14812 14816 14818 14821 14822 14827 14883
14890 15181] # 62
1950-06-23T00:00:00.000000000Z
1955-07-05T00:00:00.000000000Z
1955-09-23T00:00:00.000000000Z
1957-10-22T00:00:00.000000000Z
1961-04-17T00:00:00.000000000Z
1962-05-14T00:00:00.000000000Z
1962-05-25T00:00:00.000000000Z
1962-05-28T00:00:00.000000000Z
1962-05-29T00:00:00.000000000Z
1969-01-10T00:00:00.000000000Z
1970-05-27T01:00:00.000000000+0100
1974-09-03T01:00:00.000000000+0100
1974-10-09T01:00:00.000000000+0100
1980-03-26T01:00:00.000000000+0100
1981-09-28T01:00:00.000000000+0100
1982-08-17T02:00:00.000000000+0200
1982-10-22T01:00:00.000000000+0100
1986-09-10T02:00:00.000000000+0200
1987-10-15T01:00:00.000000000+0100
1987-10-16T01:00:00.000000000+0100
1987-10-20T01:00:00.000000000+0100
1987-10-23T01:00:00.000000000+0100
1987-10-29T01:00:00.000000000+0100
1987-11-27T01:00:00.000000000+0100
1988-01-07T01:00:00.000000000+0100
1989-10-12T01:00:00.000000000+0100
1997-10-24T02:00:00.000000000+0200
1997-10-28T01:00:00.000000000+0100
1998-08-28T02:00:00.000000000+0200
1998-09-01T02:00:00.000000000+0200
1998-10-15T02:00:00.000000000+0200
2000-03-16T01:00:00.000000000+0100
2000-04-03T02:00:00.000000000+0200
2000-04-04T02:00:00.000000000+0200
2000-04-13T02:00:00.000000000+0200
2001-09-18T02:00:00.000000000+0200
2002-07-18T02:00:00.000000000+0200
2002-07-24T02:00:00.000000000+0200
2002-08-14T02:00:00.000000000+0200
2002-10-10T02:00:00.000000000+0200
2008-01-23T01:00:00.000000000+0100
2008-09-18T02:00:00.000000000+0200
2008-09-26T02:00:00.000000000+0200
2008-10-03T02:00:00.000000000+0200
2008-10-08T02:00:00.000000000+0200
2008-10-09T02:00:00.000000000+0200
2008-10-10T02:00:00.000000000+0200
2008-10-13T02:00:00.000000000+0200
2008-10-14T02:00:00.000000000+0200
2008-10-16T02:00:00.000000000+0200
2008-10-21T02:00:00.000000000+0200
2008-10-28T01:00:00.000000000+0100
2008-11-05T01:00:00.000000000+0100
2008-11-13T01:00:00.000000000+0100
2008-11-19T01:00:00.000000000+0100
2008-11-21T01:00:00.000000000+0100
2008-11-26T01:00:00.000000000+0100
2008-11-28T01:00:00.000000000+0100
2008-12-05T01:00:00.000000000+0100
2009-02-27T01:00:00.000000000+0100
2009-03-10T01:00:00.000000000+0100
2010-05-05T02:00:00.000000000+0200
Swan Labels [2 6 4 6 2 6 4 2 5 2 6 2 9 2 6 6 2 2 2 8 7 2 6 2 2 2 2 9 2 6 6 6 2 6 2 2 2
6 6 6 6 9 2 2 2 2 7 6 2 6 2 7 2 6 2 7 6 2 7 2 6 2]
```
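
The labels printed above are easier to read as a tally per cluster. A minimal sketch, with the label array copied from the output:

```python
import numpy as np

# Cluster labels of the 62 Black Swan days, copied from the output above.
swan_labels = np.array([
    2, 6, 4, 6, 2, 6, 4, 2, 5, 2, 6, 2, 9, 2, 6, 6, 2, 2, 2, 8,
    7, 2, 6, 2, 2, 2, 2, 9, 2, 6, 6, 6, 2, 6, 2, 2, 2,
    6, 6, 6, 6, 9, 2, 2, 2, 2, 7, 6, 2, 6, 2, 7, 2, 6, 2, 7,
    6, 2, 7, 2, 6, 2,
])

# Count how many swans fall into each cluster.
values, counts = np.unique(swan_labels, return_counts=True)
tally = dict(zip(values.tolist(), counts.tolist()))

for label, cnt in tally.items():
    print("Cluster", label, "#", cnt)
```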

So most of the Black Swans (30 and 20 of the 62, respectively) can be found in clusters 2 and 6. I believe cluster 6 is a “good” cluster, while cluster 2 is related to “crashes”.
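
One way to check the “good” versus “crash” reading would be to look at the sign of the average return of the swans in each cluster. The helper below, `mean_return_by_label`, is hypothetical, and the toy arrays are invented purely to show the mechanics. On the real data, `r`, `labels` and `swan_idxs` from the code above would be passed in instead.

```python
import numpy as np


def mean_return_by_label(r, labels, idxs):
    """Average return of the selected observations, grouped by cluster label."""
    r_sel, lab_sel = r[idxs], labels[idxs]
    return {int(l): float(r_sel[lab_sel == l].mean()) for l in np.unique(lab_sel)}


# Toy data (invented for illustration, not taken from the S&P 500 series).
r = np.array([0.01, -0.08, 0.09, -0.07, 0.002, 0.10])
labels = np.array([0, 2, 6, 2, 0, 6])
swan_idxs = np.array([1, 2, 3, 5])

means = mean_return_by_label(r, labels, swan_idxs)
for label, mean in means.items():
    print("Cluster", label, "mean return", round(mean, 4))
```

A clearly negative mean for one cluster and a clearly positive mean for another would support the interpretation, while mixed signs within a cluster would argue against it.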