Sharpen the Big Data Saw Towards the Singularity

Reminder – you can still win a free copy of NumPy Cookbook during the giveaway period. Check out this Amazon review by cbrunet.

 

Merry Christmas! Don't forget the cookbook giveaway.

 

Now that we have survived the end of the Mayan calendar, we have the Singularity to look forward to. After writing twice about sharpening saws, I found that I have plenty of other saws left to sharpen. There were two spikes in traffic, caused by “Sharpen the Vim Saw” being tweeted by @UnixToolTip and by the “Sharpen the Python saw” submission by GovindReddy on Reddit. Thanks for that!

Every day I read something about Big Data. A decade ago it was all about Business Intelligence, Data Warehousing and data mining. Who knows what will be “hot” in 10 years? In the meantime Big Data analysis has been used by politicians, marketers and scientists, and to search the web:

After the previous post I was asked on Google+ to say something about Big Data and Python, because apparently this has not been given enough visibility. Python is used extensively nowadays to analyze gigantic data sets, for instance in gene sequencing and finance. It’s easy to find “proof” on the Internet:

If that’s not enough, you can also watch this video of a talk by Travis Oliphant (creator of SciPy and NumPy) entitled “Python in Big Data”. Some of the jokes were really funny, as far as I remember:

In order to achieve the next paradigm shift (at least until “Big Data” is replaced by something else), it’s necessary to change our mindset. These 7 habits, loosely based on “The 7 Habits of Highly Effective People”, are in my humble opinion absolutely essential to reach that goal.

1. Be pragmatic when collecting data

In other words: keep it simple! Things always start simple: a shell script that retrieves data here, a little Perl program that does something with the data there. Before you know it, you end up with hundreds of programs in different programming languages, exchanging data in a wide range of formats, with the data itself scattered across various databases.

I am not saying that you should use only one wonderful programming language like Python, although with NumPy, for instance, that is certainly possible. We should try to be pragmatic and maintain a balance between high productivity and low complexity.
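As a rough illustration, here is a minimal sketch of what “one language, one format” could look like with NumPy. The file name and column layout are made up for the example.

    import numpy as np

    # Hypothetical CSV file with a header row and a numeric "value" column.
    data = np.genfromtxt("measurements.csv", delimiter=",", names=True)

    # Filter, aggregate and store in a few lines, all in one language.
    valid = data[data["value"] > 0]           # drop non-positive readings
    print("Kept", len(valid), "rows, mean value", valid["value"].mean())
    np.save("valid_measurements.npy", valid)  # one binary format for later steps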

2. Know thy algorithms and data structures

Algorithms are like recipes. Every good chef knows hundreds of yummy recipes; an amateur would just throw some random stuff in the frying pan and probably burn the food or undercook it. Of course, you can look up recipes in a cookbook. If you are interested in NumPy, for instance, you can find plenty of interesting recipes in “NumPy Cookbook“. If you are more interested in general algorithms and data structures, you could go through this Quora board.
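To get a taste of why this matters, here is a tiny sketch (not from the cookbook) comparing membership tests on a list and on a set; the sizes are arbitrary. The set lookup should come out much faster, simply because it does not scan every element.

    import timeit

    setup = "items = list(range(100000)); as_list = items; as_set = set(items)"

    # A list scans every element (O(n)); a set hashes straight to it (O(1) on average).
    list_time = timeit.timeit("99999 in as_list", setup=setup, number=1000)
    set_time = timeit.timeit("99999 in as_set", setup=setup, number=1000)

    print("list lookup:", list_time)
    print("set lookup: ", set_time)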

3. Know thy databases

You can’t have Big Data without storing it somewhere. In the good old days you only needed to know about relational databases. Not any more. We now have to deal with all sorts of new types of databases – so here is a map to find your way around the database landscape. You might also want to take the “Introduction to Databases” course from Coursera.
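If you just want to experiment, Python ships with SQLite in its standard library, so a relational database is only a few lines away. The table and values below are just an example.

    import sqlite3

    # In-memory database, so nothing is written to disk.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE visits (page TEXT, hits INTEGER)")
    con.executemany("INSERT INTO visits VALUES (?, ?)",
                    [("home", 120), ("blog", 45), ("about", 7)])

    # Plain SQL still goes a long way before you need anything fancier.
    for page, hits in con.execute("SELECT page, hits FROM visits ORDER BY hits DESC"):
        print(page, hits)

    con.close()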

4. Know thy subject matter and models

Knowing something about your Big Data will help you avoid some embarrassing errors down the road. For instance, if your data is about subatomic particles, it’s good to know that they decay after an extremely short period of time. If you are measuring the engagement and purchasing propensity of website visitors, it helps to know what the metrics you are using actually mean. This may all seem very obvious, but I can tell you from experience that metrics can be very confusing.

Models are supposed to be simplified representations of reality based on certain assumptions. For each model you should know (see the sketch after this list):

  • what the assumptions are and why they were made
  • what the pain points are in terms of computation and memory usage
  • how accurate the model is
  • under which conditions the model is no longer valid
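As a toy example of the last two points, the sketch below fits a straight line to made-up data with NumPy and inspects the residuals; both the data and the choice of a degree-one polynomial are assumptions for illustration.

    import numpy as np

    # Made-up data: roughly linear with some noise.
    x = np.arange(20, dtype=float)
    y = 3.0 * x + 2.0 + np.random.normal(scale=2.0, size=x.size)

    # Model assumption: a straight line (degree-1 polynomial) is good enough.
    coeffs = np.polyfit(x, y, deg=1)
    predicted = np.polyval(coeffs, x)

    # Accuracy check: the residuals tell you how far off the model is.
    residuals = y - predicted
    print("slope, intercept:", coeffs)
    print("max absolute error:", np.abs(residuals).max())

If the residuals show a clear pattern instead of random noise, that is usually a sign the model’s assumptions no longer hold.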

5. Have a social graph

Data scientists and “Big Data” people are digging, drilling and slicing into our social graphs. They can be found on all the “social” networks, sometimes doing things average users never do, such as downloading data via APIs or scraping web pages. They can probably tell whether you are pregnant and where you live. There is even a network chart of the most influential data scientists on Twitter. Probably created with Python.
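Downloading data via an API, for example, usually boils down to a few lines of Python. The endpoint and the shape of the JSON below are placeholders, not a real service.

    import json
    import urllib.request

    # Placeholder endpoint; substitute a real API that returns JSON.
    url = "https://api.example.com/users/42/followers"

    with urllib.request.urlopen(url) as response:
        # Assuming the API returns a list of objects with a "name" field.
        followers = json.loads(response.read().decode("utf-8"))

    # A tiny "social graph": map a user to the people following them.
    graph = {"user42": [follower["name"] for follower in followers]}
    print(graph)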

It’s important to synergize and increase the “Win-Win” within your social graph. Here are some world-class “Big Data” Twitter users I follow:

6. Learn to love statistics

Statistics helps us make sense of data. It is a very useful tool, but not one that is necessarily easy to master. The learning curve is steep and the obstacles many. The rewards however are huge. Being able to extract value from data is after all no small feat. Still not convinced? Read this essay by Zed Shaw – “Programmers Need To Learn Statistics Or I Will Kill Them All“. As the title suggests it’s not for the faint of heart.
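To get a feel for it, here is a minimal sketch using NumPy and SciPy: summarize two made-up samples and test whether their means differ. The data is random, so the exact numbers will vary.

    import numpy as np
    from scipy import stats

    # Two made-up samples, e.g. page load times for two website designs.
    a = np.random.normal(loc=10.0, scale=2.0, size=50)
    b = np.random.normal(loc=11.0, scale=2.0, size=50)

    print("mean a:", a.mean(), "std a:", a.std())
    print("mean b:", b.mean(), "std b:", b.std())

    # Two-sample t-test: a small p-value suggests the means really differ.
    t_stat, p_value = stats.ttest_ind(a, b)
    print("t statistic:", t_stat, "p-value:", p_value)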

You can get started with these statistics tutorials on Khan Academy or check out these online books:

7. Sharpen thy saw

“Big Data” is a big deal and a big challenge. Keeping up to date with the latest developments is already pretty hard. Sharpening the saw requires a lifestyle of continuous learning in an upward spiral. There are lots of online courses about data science and “Big Data”. Here is a small selection:

 

I hope you enjoyed this post. Next week I will do a “Python Meme” writeup. I would also like to point you to the giveaway reminder at the top, just in case you missed it.

 

Remember, sharing is part of the abundance mentality in service of society. The following is a list of Python links I came across this week:
