Scraping 10 years of historical data per security from NASDAQ

This entry is one of 15 in the Grails Finance series.

Grails Finance 1.1

You can request 10 years of historical OHLC data from the NASDAQ website, which seems more than enough for all intents and purposes. You can also download the index composition of the Dow Jones Industrial Average from another website. Combining these two sources gives me plenty of data to play around with, structured around the index of course.

Dependencies issue

In order to scrape the web, I wanted to add HtmlUnit as a dependency. Unfortunately, this caused a SAXParseException: HtmlUnit drags in an xml-apis jar that conflicts with the XML parser Grails already provides.

The fix is to exclude xml-apis in grails-app/conf/BuildConfig.groovy:

// grails-app/conf/BuildConfig.groovy
inherits("global") {
    // HtmlUnit drags in xml-apis, which breaks Grails' XML parsing
    excludes 'xml-apis'
}
...
dependencies {
    ...
    runtime 'org.apache.commons:commons-math:2.1', 'net.sourceforge.htmlunit:htmlunit:2.8'
    ...
}
...

Dow Jones Service

So first I start by fetching all the Dow Jones composition info and storing it in the database. I also made an Index domain class, which is probably not a good idea, because "index" is already used in all kinds of other contexts, such as SQL indexes and index pages. The Index entity contains a reference to an instrument, an index name, a symbol and a weight.
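
The post doesn't show the domain class itself; here is a minimal sketch of what it could look like, inferred from how it is used in the service below (the constraint is my assumption):

    // grails-app/domain/Index.groovy -- hypothetical reconstruction,
    // inferred from new Index(name: ..., symbol: ..., instrument: ..., weight: ...)
    class Index {
        String name             // e.g. 'Dow Jones Industrial Average'
        String symbol           // e.g. 'DJIA'
        Instrument instrument   // the component this weight belongs to
        Double weight           // component weight as a fraction of 1

        static constraints = {
            weight min: 0d, max: 1d
        }
    }

The service method below fetches the components file and populates these tables: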

    def retrieve() {
        // the components file lives at
        // http://www.djindexes.com/mdsidx/?event=components&symbol=DJI
        def url = "http://www.djindexes.com"

        // withAsyncHttp comes from the Grails REST plugin
        withAsyncHttp(poolSize : 4, uri : url, contentType : HTML) {
            def result = get(path : '/mdsidx/', query : [event : 'components', symbol : 'DJI']) {
               resp, html -> println ' got async response!'
               return html
            }

            assert result instanceof java.util.concurrent.Future

            // poll until the asynchronous request completes
            while (!result.done) {
               Thread.sleep(2000)
            }

            def html = result.get().toString()

            // the response is tab-separated with a COMPANY NAME header row
            for (line in html.split('\n')) {
                if (!line.startsWith('COMPANY NAME')) {
                   def fields = line.split('\t')
                   def name = fields[0]
                   def symbol = fields[2]
                   // the weight column is a percentage, so convert it to a fraction
                   def weight = Double.valueOf(fields[6]) / 100
                   def instrument = saveInstrument(name, symbol)
                   new Index(name : 'Dow Jones Industrial Average', symbol : 'DJIA',
                       instrument : instrument, weight : weight).save()
                }
             }
        }
    }

    def saveInstrument(name, symbol) {
        def type = InstrumentType.findByType('Stock')
        def nasdaq = Datasource.findByName('NASDAQ')

        return new Instrument(name : name, symbol : symbol,
            source : nasdaq, instrumentType : type).save()
    }
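
The snippet above doesn't show how the service gets called; a minimal way to wire it up would be from BootStrap (the service name and property follow Grails conventions, but this driver is my assumption, not the post's):

    // grails-app/conf/BootStrap.groovy -- hypothetical wiring
    class BootStrap {
        def dowJonesService   // injected by name, per Grails convention

        def init = { servletContext ->
            // populate the Instrument and Index tables on startup
            dowJonesService.retrieve()
        }
    }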

NASDAQ Service

With a little bit of help from Wireshark I found a way to construct URLs that download historical prices and volumes for the last 10 years. I downloaded the data for all the Dow Jones Index components and for the index itself. By the way, I would like to start a petition against JavaScript. JavaScript is so not cool, especially for web scrapers :). I have been lazy again with regular expressions and XPath. Don't do this, peeps.

...
    def retrieve(symbol) {
        final WebClient webClient = new WebClient()
        // JavaScript is not needed for the table and only slows HtmlUnit down
        webClient.setJavaScriptEnabled(false)
//        HtmlPage page = webClient.getPage("http://www.nasdaq.com/aspx/historical_quotes.aspx?symbol=AAPL&selected=AAPL")
        def instrument = Instrument.findBySymbol(symbol)
        log.info(instrument.symbol)
        requestAndSave(instrument, webClient)

        webClient.closeAllWindows()
    }

    def saveValueByFieldName(fieldName, dateTime, instrument, val) {
        def mnemonic = Field.findByName(fieldName)
        new FieldValue(added : dateTime, val : val, instrument : instrument, field : mnemonic).save()
    }

    def requestAndSave(instrument, webClient) {
        // URL reverse engineered with Wireshark; returns 10 years of daily data as one table
        def page = webClient.getPage("http://charting.nasdaq.com/ext/charts.dll?2-1-14-0-0-5120-03NA000000"
                + instrument.symbol
                + "-&SF:4|5-WD=539-HT=395--XTBL-")
        def trs = page.getByXPath("//table/tbody/tr[2]/td[2]/center/table/tbody/tr")

        trs.each {
            // each row: MM/DD/YYYY open high low close volume (with thousands separators)
            (it.asText() =~ /(\d+\/\d+\/\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d,]+)/).each {
               all, date, open, high, low, close, volume ->
               // Joda-Time parser
               def dateTime = DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(date)
               saveValueByFieldName('Open', dateTime, instrument, open)
               saveValueByFieldName('High', dateTime, instrument, high)
               saveValueByFieldName('Low', dateTime, instrument, low)
               saveValueByFieldName('Close', dateTime, instrument, close)
               // strip the thousands separators from the volume
               saveValueByFieldName('Volume', dateTime, instrument, volume.replaceAll(",", ""))
            }
        }

        page.cleanUp()
    }
...
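
The driver that loops over the components isn't shown in the post; a sketch of how it could look, reusing the Index rows stored earlier (retrieveAll and the 'DJIA' instrument row are my assumptions):

    // Hypothetical driver: fetch history for every Dow component
    // and for the index itself.
    def retrieveAll() {
        def symbols = Index.findAllBySymbol('DJIA').collect { it.instrument.symbol }
        // assumes an Instrument row also exists for the 'DJIA' symbol itself
        (symbols + 'DJIA').each { symbol -> retrieve(symbol) }
    }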

Open Office

Another good virtue, besides laziness, is being cheap, or, as we say in IT, supporting open source. As an alternative to Microsoft Office I suggest OpenOffice. In order to share the results with you, I uploaded my ODS spreadsheet to Google Docs, which is yet another Office alternative. Currently there is a bug/feature that prevents you from exporting charts, so I had to generate HTML from OpenOffice with File/Preview in Web Browser, save the chart images from the browser, and then insert those images into the Google Docs document. Here you can see the component weights and the correlation of each component's close price with the DJIA value.

So what is going on with Merck, Pfizer and Walmart?! I invented my own weighting scheme based on normalized R-squared and I get some crazy results, as you can see in the diagrams (see the sketch below).
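
The post doesn't spell the scheme out, but a plausible reading is: compute each component's R-squared against the DJIA close series and normalize so the weights sum to one. A minimal sketch under that assumption, using the commons-math dependency declared earlier (the method and variable names are mine):

    import org.apache.commons.math.stat.correlation.PearsonsCorrelation

    // componentCloses: symbol -> array of close prices, aligned by date
    // with indexCloses, the DJIA values for the same dates
    def rSquaredWeights(Map componentCloses, double[] indexCloses) {
        def corr = new PearsonsCorrelation()
        def r2 = [:]
        componentCloses.each { symbol, closes ->
            def r = corr.correlation(closes, indexCloses)
            r2[symbol] = r * r   // R squared of component vs. index
        }
        def total = r2.values().sum()
        def weights = [:]
        // normalize so the weights add up to 1
        r2.each { symbol, value -> weights[symbol] = value / total }
        return weights
    }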
