The increasing popularity of Python as a language for Big Data analysis is yet another reason to learn some optimisation techniques, so that your software scales easily without straining the hardware unnecessarily. Here is a list of optimisation strategies, from string manipulation to loops. The list is not exhaustive, but it is a good starting point for improving your code.
No string pumping in a for loop
Whenever you are building a large string, avoid growing it piece by piece in a for loop. Strings are immutable, so each += may have to build a new string. The pattern to avoid usually looks like this:
s = ""
for substr in somelist:
    s += substr
Instead you should use
s = "".join(somelist)
The same applies when you are generating strings via a function foo:
s = ""
for x in somelist:
    s += foo(x)
This is much better:
slist = [foo(x) for x in somelist]
s = "".join(slist)
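To check the claim on your own machine, here is a minimal sketch that times both approaches with the standard timeit module. The sample data and the function names concat_loop and concat_join are made up for illustration. The actual gap depends on the interpreter and version, so treat the printed numbers as indicative only.

```python
import timeit

# Hypothetical sample data: many short substrings.
words = ["spam"] * 10_000

def concat_loop(parts):
    """Grow the string one piece at a time."""
    s = ""
    for part in parts:
        s += part
    return s

def concat_join(parts):
    """Build the string in a single pass with str.join."""
    return "".join(parts)

# Both approaches must produce the same string.
assert concat_loop(words) == concat_join(words)

loop_time = timeit.timeit(lambda: concat_loop(words), number=50)
join_time = timeit.timeit(lambda: concat_join(words), number=50)
print(f"loop: {loop_time:.4f}s  join: {join_time:.4f}s")
```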
Still in the realm of string manipulation, building a string with repeated concatenation
out = "<html>" + head + post + query + tail + "</html>"
is correct but not nice, because every + creates an intermediate string. Instead, use the C-like sprintf form
out = "<html>%s%s%s%s</html>" % (head, post, query, tail)
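As a quick sanity check, the sketch below shows that the concatenated and the formatted versions produce the same string. The values of head, post, query and tail are placeholders invented for the example. The f-string line is an addition of mine, not part of the original tip: it is the equivalent spelling available from Python 3.6 onwards.

```python
# Placeholder fragments, invented for this example.
head = "<head></head>"
post = "<p>post</p>"
query = "<p>query</p>"
tail = "<footer></footer>"

# Repeated concatenation: correct, but builds intermediate strings.
concatenated = "<html>" + head + post + query + tail + "</html>"

# sprintf-style formatting: one template, one substitution step.
formatted = "<html>%s%s%s%s</html>" % (head, post, query, tail)

# Equivalent f-string (Python 3.6+), shown only for comparison.
f_string = f"<html>{head}{post}{query}{tail}</html>"

assert concatenated == formatted == f_string
```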
Appending the result of a function
Another scenario: you want to append the result of a function to a list. The snippets below show an optimisation that can save you a lot of time. Let’s say that you are converting the words of a list to uppercase. A naive programmer would write
newlist = []
for word in oldlist:
    newlist.append(word.upper())
That’s correct, but terribly slow. The interpreted for loop, the method lookup for upper and the call to append on every iteration add up, and on very large lists they can keep even a powerful computer busy for a long time. The built-in function map gets rid of the interpreted loop and runs the loop in the C code of the Python virtual machine instead.
Check it out
newlist = list(map(str.upper, oldlist))  # in Python 3, map() is lazy, so wrap it in list()
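The sketch below puts the explicit loop and the map version side by side and times them with timeit. A list comprehension is included for comparison, as it is the other common way to avoid calling append. The sample list and the function names are invented for the example, and the relative speeds vary between Python versions, so measure before deciding.

```python
import timeit

# Hypothetical input: a list of lowercase words.
oldlist = ["alpha", "beta", "gamma"] * 10_000

def with_loop(words):
    """Explicit loop with append on every iteration."""
    newlist = []
    for word in words:
        newlist.append(word.upper())
    return newlist

def with_map(words):
    """map() drives the loop from C; list() materialises the result."""
    return list(map(str.upper, words))

def with_comprehension(words):
    """List comprehension, shown for comparison."""
    return [word.upper() for word in words]

# All three must agree before the timings mean anything.
assert with_loop(oldlist) == with_map(oldlist) == with_comprehension(oldlist)

for func in (with_loop, with_map, with_comprehension):
    elapsed = timeit.timeit(lambda: func(oldlist), number=20)
    print(f"{func.__name__}: {elapsed:.4f}s")
```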
Use local variables
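Another trick that many programmers overlook is using local variables. The Python interpreter resolves local names much faster than global names or attribute lookups such as str.upper and newlist.append, so binding them to local names once, before the loop, saves a lookup on every iteration. The gain is largest inside a function, where local names are resolved fastest. The loop body also becomes shorter.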
Write something like this
upper = str.upper
newlist = []
append = newlist.append
for word in oldlist:
    append(upper(word))
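To see the effect in isolation, here is a minimal sketch that wraps both variants in functions, so that the cached names are true locals, and times them. The function names and the sample data are invented for the example. The difference is usually modest and depends on the Python version.

```python
import timeit

# Hypothetical input: a list of lowercase words.
oldlist = ["alpha", "beta", "gamma"] * 10_000

def upper_repeated_lookups(words):
    """Looks up str.upper and newlist.append on every iteration."""
    newlist = []
    for word in words:
        newlist.append(str.upper(word))
    return newlist

def upper_local_names(words):
    """Binds the function and the bound method to local names once."""
    upper = str.upper
    newlist = []
    append = newlist.append
    for word in words:
        append(upper(word))
    return newlist

# Same result either way; only the lookups differ.
assert upper_repeated_lookups(oldlist) == upper_local_names(oldlist)

for func in (upper_repeated_lookups, upper_local_names):
    elapsed = timeit.timeit(lambda: func(oldlist), number=20)
    print(f"{func.__name__}: {elapsed:.4f}s")
```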
Fast histogram without plotting
This one might seem a bit off topic, but since it is quite common in data analysis I will report it here. A recent improvement I found lets me compute the histogram of a dataset without plotting it (which already saves a lot of time), and it is also much faster than the equivalent matplotlib.pyplot version. The magic is in numpy. I used to write this:
from matplotlib import pyplot as plt
binned_data, bin_edges, patches = plt.hist(data, bins)
Now I do this:
import numpy as np
binned_data, bin_edges = np.histogram(data, bins)
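Here is a small worked example of the numpy call, assuming numpy is installed. The data values and the bin edges are made up for illustration. np.histogram returns the counts per bin and the array of bin edges, which always has one more element than the counts.

```python
import numpy as np

# Made-up sample data and explicit bin edges.
data = np.array([0.5, 1.5, 2.5, 3.5, 3.9, 5.0])
bins = [0, 2, 4, 6]

# No figure is created: only the counts and the edges are computed.
counts, bin_edges = np.histogram(data, bins)

print(counts)     # → [2 3 1]
print(bin_edges)  # one more edge than there are bins
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.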
Enjoy Python. Happy optimisation!