At the moment I’m preparing a workshop for the Python programming language and thus I’m looking for interesting examples. They should be short and sweet and I already stumbled over The sieve of Erastosthenes. Unfortunately, I was a bit keen to present some advanced data structures from Python. Because of that the performance of my implementation was bad,
-
def sieve1(N):
-
primes = set()
-
numbers = set(range(2,N))
-
while numbers:
-
current_prime = min(numbers)
-
multiples = set([n for n in numbers if not n % current_prime])
-
numbers -= multiples
-
primes.add(current_prime)
-
return primes
If you are a little bit into computational complexity and python, you should spot several problems with performance:
- I’m using
min(...)which has a runtime of O(N) in the number of elements in the set. So we are doing an O(N) operation every while-loop. Not good. - The outer while-loop runs, until all numbers are removed from the set. Here some unnecessary loops might be made.
Frank pointed these problems out and I made a futile attempt at reducing the runtime cost of min() by using lists instead of sets. Since now the lists are ordered, min() can be done in O(1).
-
def sieve2(N):
-
primes = list()
-
numbers = range(2,N)
-
while numbers:
-
current_prime = min(numbers)
-
numbers = [n for n in numbers if n % current_prime]
-
primes.append(current_prime)
-
return primes
I compared these two approaches with the cProfile module, for sieves up to 100000 and got the following results:
38375 function calls in 63.917 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 63.917 63.917 :1()
1 24.233 24.233 40.626 40.626 erastosthenes.py:10(sieve1)
1 18.460 18.460 23.277 23.277 erastosthenes.py:20(sieve2)
1 0.014 0.014 63.917 63.917 erastosthenes.py:52(main)
9592 0.024 0.000 0.024 0.000 {method 'add' of 'set' objects}
9592 0.016 0.000 0.016 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
19184 21.163 0.001 21.163 0.001 {min}
2 0.007 0.004 0.007 0.004 {range}
We can already see a speedup of approximately 50% from sieve1 to sieve2 but Frank then proposed an even faster approach. It uses an array of boolean values, whose indices represent the corresponding numbers. If ‘x’ is prime, the value at that position in the array is True. The algorithm now sweeps over the array and looks for already found prime numbers up to sqrt(N), if a number is marked as prime (i.e. array field is set to true) we mark all it’s multipliers as not prime. This can be done easily with a second for-loop. The array is initialized with all values as True and we start with 2, mark all even numbers as non-prime and go to 3. Now 3 is found to be prime and all it’s multipliers are marked as non-prime, then 4 is found as non-primed and skipped,… and this is continued up to sqrt(N).
At first I could not grasp why we only need to look until sqrt(N). But consider that we have reached sqrt(N), and marked all non-prime numbers according to our algorithm. Frank now told me boldly that at this point there are only correctly marked primes left in the array above sqrt(N) and up to N. This is true, even if it’s not obvious at first. But consider the following: every non-prime number greater than sqrt(N) can be factorized into two primes, where one prime is smaller than sqrt(N), otherwise we have a non-prime number which is greater than sqrt(N)*sqrt(N) = N, so it’s not in our search range. So if we have correctly found all primes up to sqrt(N) and marked all their multipliers as non-primes, only correctly marked primes greater sqrt(N) and smaller N remain.
-
from array import array
-
-
def sieve3(N):
-
numbers = array('b', [1] * N)
-
sN = int(math.sqrt(N))
-
-
for i in xrange(2,sN):
-
if numbers[i]:
-
for j in xrange(2*i, N, i):
-
numbers[j] = 0
-
-
return [i for i in xrange(2,N) if numbers[i]]
This algorithm does not need high-level data structures like linked-lists, so we tried to get the most simple array which could be found in Python and hence tried the array module. But for reference we also tried the standard lists of Python.
-
def sieve4(N):
-
numbers = [1] * N
-
sN = int(math.sqrt(N))
-
-
for i in xrange(2,sN):
-
if numbers[i]:
-
for j in xrange(2*i, N, i):
-
numbers[j] = 0
-
-
return [i for i in xrange(2,N) if numbers[i]]
These are our performance results:
38379 function calls in 98.769 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 98.769 98.769 :1()
1 37.161 37.161 60.856 60.856 erastosthenes.py:10(sieve1)
1 29.776 29.776 37.521 37.521 erastosthenes.py:20(sieve2)
1 0.211 0.211 0.211 0.211 erastosthenes.py:29(sieve3)
1 0.120 0.120 0.120 0.120 erastosthenes.py:40(sieve4)
1 0.061 0.061 98.769 98.769 erastosthenes.py:52(main)
2 0.000 0.000 0.000 0.000 {math.sqrt}
9592 0.027 0.000 0.027 0.000 {method 'add' of 'set' objects}
9592 0.017 0.000 0.017 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
19184 31.387 0.002 31.387 0.002 {min}
2 0.009 0.005 0.009 0.005 {range}
Now the speedup is quite gigantic, at a factor of roundabout 180. Interestingly, the standard Python lists are two times faster than the arrays provided by the array module.I must have missed the point of that module. But we wanted to go down to the bare metal and compare the performance with a C program.
Due to the fact, that all you need is an array and two for-loops, we tried to implement the algorithm in C and embed it as a module in Python, so we could make a better comparison. The most difficult part here was getting the algorithm properly wrapped and packed into a .so-library usable from Python. We saved the following source code in a file called erastomodule.c and compiled it into a .so-library for dynamic linking.
-
#include "Python.h"
-
#include <math .h>
-
#include <stdlib .h>
-
#include <string .h>
-
#include <stdio .h>
-
-
static PyObject *
-
erasto_sieve(PyObject *self, PyObject *args)
-
{
-
unsigned long N = 0;
-
unsigned char *numbers = NULL;
-
-
PyObject *primes = NULL;
-
PyObject *prime = NULL;
-
-
unsigned long i,j;
-
-
/* Parse the given parameter */
-
if (!PyArg_ParseTuple(args, "k", &N))
-
return NULL;
-
-
-
/* Initialize datastructure for sieve */
-
numbers = malloc(sizeof(unsigned char) * N);
-
memset(numbers, 1, N);
-
-
/* This is the actual sieve !!! */
-
for (i = 2; i < sqrt(N); i++) {
-
if (numbers[i]) {
-
for(j = 2*i; j < N; j+=i) {
-
numbers[j] = 0;
-
}
-
}
-
}
-
-
/* create a new PyList */
-
primes = PyList_New(0);
-
if (primes == NULL)
-
return NULL;
-
-
/* Put primes into list */
-
for(i = 2; i < N; i++) {
-
if (numbers[i]) {
-
prime = Py_BuildValue("k", i);
-
if(PyList_Append(primes, prime))
-
return NULL;
-
}
-
}
-
-
return primes;
-
}
-
-
static PyMethodDef ErastoMethods[] = {
-
{"sieve", erasto_sieve, METH_VARARGS, "Sieve of Erastosthenes."},
-
{NULL, NULL, 0, NULL} /* Sentinel */
-
};
-
-
-
PyMODINIT_FUNC
-
initerasto(void) {
-
PyObject *m;
-
-
m = Py_InitModule("erasto", ErastoMethods);
-
if (m == NULL)
-
return;
-
}
Then we could import it like a normal module from Python and use the function we called erasto.sieve for our computation. The following is a comparison of sieve4 and the erasto.sieve function we implemented in C. Again we have a speedup of roundabout 180, resulting in a total speedup of roundabout 32400 when compared to my initial approach.
6 function calls in 0.149 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.149 0.149 :1()
1 0.126 0.126 0.126 0.126 erastosthenes.py:40(sieve4)
1 0.017 0.017 0.149 0.149 erastosthenes.py:52(main)
1 0.006 0.006 0.006 0.006 {erasto.sieve}
1 0.000 0.000 0.000 0.000 {math.sqrt}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
The conclusion should be clear:
- Think twice about your algorithm
- Profile mercilessly
- And port expensive functions to C
Happy optimizing!
2 comments so far...
Psyco?
- Paddy.
Psyco is a nice idea and upon the suggestion I tried to integrate it into my experiment, but if I create a proxy for the sieve4 function, this proxy is invisible to all profilers. Thus, a significant comparison needs a completely new approach. Any ideas are welcome.
Here is the source-code for my experiment if anyone wants to work with it.
leave a reply