Now that we have survived the end of the Mayan calendar, we have the Singularity to look forward to. After writing twice about sharpening saws, I found that I have many more saws left to sharpen. There were two spikes in traffic caused by “Sharpen the Vim Saw” being tweeted by @UnixToolTip and the “Sharpen the Python saw” submission by GovindReddy on Reddit. Thanks for that!
Every day I read something about Big Data. A decade ago it was all about Business Intelligence, Data Warehousing and data mining. Who knows what will be “hot” in 10 years? In the meantime, Big Data analysis has been used by politicians, marketers and scientists, and to search the web:
- How Obama used big data
- How Target knows you are pregnant
- Astronomers crunch big data to map the galaxies
- How Nonprofits Can Use Data to Solve the World’s Problems
- How algorithms secretly shape the way we behave
After the previous post I was asked on Google+ to say something about Big Data and Python, because apparently this has not been given enough visibility. Python is used extensively nowadays to analyze gigantic data sets, for instance in the field of gene sequencing and finance. It’s easy to find “proof” on the Internet:
- Python brings simplicity to Big Data Analytics
- A Python Compiler for Big Data
- Python: Big Data’s secret power tool
If that’s not enough you can also watch this video of a talk by Travis Oliphant (creator of SciPy and NumPy) entitled “Python in Big Data”. Some jokes were really funny as far as I remember.
In order to achieve the next paradigm shift (at least until “Big Data” is replaced by something else) it’s necessary to change our mindset. These 7 habits, loosely based on “The 7 Habits of Highly Effective People”, are absolutely essential in my humble opinion to achieve this goal.
1. Be pragmatic when collecting data
In other words: keep it simple! Things always start simple. A shell script that retrieves data here. A little Perl program that does something with the data there. You end up with hundreds of programs in different programming languages exchanging data using a wide range of formats and the data being scattered in various databases.
I am not saying that you should use only one wonderful programming language like Python, although that is possible, for instance with NumPy. However, we should try to be pragmatic and maintain a balance between high productivity and low complexity.
2. Know thy algorithms and data structures
Algorithms are like recipes. Every good chef knows hundreds of yummy recipes. An amateur would just throw some random stuff in the frying pan and probably burn the food or undercook it. Of course, you can look up recipes in a cookbook. So for instance, if you are interested in NumPy, you could look up interesting things to do with it in “NumPy Cookbook”. Or, if you are more interested in general algorithms and data structures, you could go through this Quora board.
3. Know thy databases
You can’t have Big Data without storing it somewhere. In the good old days you only needed to know about relational databases. Not any more. We now have to deal with all sorts of new types of databases – so here is a map to find your way around the database landscape. You might also want to take the “Introduction to Databases” course from Coursera.
4. Know thy subject matter and models
Knowing something about your Big Data will help you avoid some embarrassing errors down the road. For instance, if your data is about subatomic particles, it’s good to know that they decay after an extremely short period of time. If you are measuring the engagement and purchasing propensity of website visits, it helps to know what the metrics you are using mean. This all may seem very obvious, but I can tell you from experience that metrics can be very confusing.
Models are supposed to be simplified representations of reality based on certain assumptions. You should know about the models:
- what the assumptions are and why they were made
- what the pain points are in terms of computation and memory usage
- the accuracy of the model
- the conditions under which the model is no longer valid
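To make the checklist above a bit more concrete, here is a minimal Python sketch. It is not taken from any of the linked material. It uses the textbook small-angle approximation sin(x) ≈ x as a stand-in for a "model", and the function names and the 1% tolerance are hypothetical choices made only for illustration.

```python
import math


def small_angle_sin(x):
    """Simplified model: sin(x) is roughly x. Assumption: x is small (radians)."""
    return x


def relative_error(x):
    """Accuracy of the model: relative error against math.sin.

    Not defined for x == 0, where the exact value is zero.
    """
    exact = math.sin(x)
    return abs(small_angle_sin(x) - exact) / abs(exact)


def is_valid(x, tolerance=0.01):
    """Hypothetical validity rule: trust the model while the error is under tolerance."""
    return relative_error(x) < tolerance


# Walk away from the "small angle" assumption and watch the model degrade.
for angle in (0.05, 0.2, 0.5, 1.0):
    print(f"x={angle:<4} error={relative_error(angle):.4%} valid={is_valid(angle)}")
```

The model costs almost nothing in computation or memory, which is why it is attractive, but the loop shows that it stops being trustworthy once the angle is no longer small.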
5. Have a social graph
Data scientists and “Big Data” people are digging, drilling and slicing into our social graphs. They can be found on all the “social” networks. Sometimes doing things average users never do, such as downloading data via APIs or scraping web pages. They can probably tell whether you are pregnant or not and where you live. There is even a network chart of the most influential data scientists on Twitter. Probably created with Python.
It’s important to synergize and increase the “Win-Win” within your social graph. Here are some world class “Big Data” Twitter users I follow:
- @DataJunkie – data scientist and reviewer of both NumPy 1.5 Beginner’s Guide and NumPy Cookbook
6. Learn to love statistics
Statistics helps us make sense of data. It is a very useful tool, but not one that is necessarily easy to master. The learning curve is steep and the obstacles many. The rewards however are huge. Being able to extract value from data is after all no small feat. Still not convinced? Read this essay by Zed Shaw – “Programmers Need To Learn Statistics Or I Will Kill Them All”. As the title suggests it’s not for the faint of heart.
You can get started with these statistics tutorials on Khan Academy or check out these online books:
- Elements of Statistical Learning (free PDF)
- Collaborative Statistics (free download)
- Probabilistic and Statistical Modeling in Computer Science (free PDF)
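As a small taste of why this matters, here is a sketch using Python’s standard `statistics` module. The order values are made up for illustration. It shows how a single outlier can make the mean a misleading summary while the median barely notices.

```python
import statistics

# Hypothetical order values: seven ordinary orders and one very large one.
order_values = [12, 15, 14, 13, 16, 15, 14, 950]

mean_value = statistics.mean(order_values)      # pulled up by the outlier
median_value = statistics.median(order_values)  # robust to the outlier
spread = statistics.stdev(order_values)         # sample standard deviation

print("mean:  ", mean_value)
print("median:", median_value)
print("stdev: ", round(spread, 1))
```

If you reported only the mean here, you would conclude that a typical order is roughly nine times larger than it really is.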
7. Sharpen thy saw
“Big Data” is a big deal and a big challenge. Keeping up to date with the latest developments is already pretty hard. Sharpening the saw requires a lifestyle of continuous learning in an upward spiral. There are lots of online courses about data science and “Big Data”. Here is a small selection:
I hope you enjoyed this post. Next week I will do a “Python Meme” writeup. I would like to point you to the reminder, just in case you missed it.
Remember sharing is part of the abundance mentality in service of society. The following is a list of Python links I came across this week:
- Python best language on Linux Journal
- Teaching Python with the Raspberry Pi
- How one scientist learned Python
- Python love page
- Restricted Boltzmann Machines in Python
- Passing torch of NumPy
- Galry: High-Performance Interactive Visualization in Python
- The Python SMTP Server
- Singular Value Decomposition in SciPy
- Will it Python? Machine Learning for Hackers: Naive Bayes Text Classification
- A brief tour of the IPython notebook
- Python Meme 2012 by Tarek Ziade. I will do mine next week.