Sep 13

At the moment I’m preparing a workshop for the Python programming language and thus I’m looking for interesting examples. They should be short and sweet and I already stumbled over The sieve of Erastosthenes. Unfortunately, I was a bit keen to present some advanced data structures from Python. Because of that the performance of my implementation was bad, really bad. Frank Hellweg, a friend of mine, and someone who excels at theoretical computer-science had a little optimization-session with me, resulting in a total speedup of roundabout 30000. This was my first approach of implementing the sieve:

  1. def sieve1(N):
  2.     primes = set()
  3.     numbers = set(range(2,N))
  4.     while numbers:
  5.         current_prime = min(numbers)
  6.         multiples = set([n for n in numbers if not n % current_prime])
  7.         numbers -= multiples
  8.         primes.add(current_prime)
  9.     return primes

If you are a little bit into computational complexity and python, you should spot several problems with performance:

  • I’m using min(...) which has a runtime of O(N) in the number of elements in the set. So we are doing an O(N) operation every while-loop. Not good.
  • The outer while-loop runs, until all numbers are removed from the set. Here some unnecessary loops might be made.

Frank pointed these problems out and I made a futile attempt at reducing the runtime cost of min() by using lists instead of sets. Since now the lists are ordered, min() can be done in O(1).

  1. def sieve2(N):
  2.     primes = list()
  3.     numbers = range(2,N)
  4.     while numbers:
  5.         current_prime = min(numbers)
  6.         numbers = [n for n in numbers if n % current_prime]
  7.         primes.append(current_prime)
  8.     return primes

I compared these two approaches with the cProfile module, for sieves up to 100000 and got the following results:

   38375 function calls in 63.917 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   63.917   63.917 :1()
        1   24.233   24.233   40.626   40.626 erastosthenes.py:10(sieve1)
        1   18.460   18.460   23.277   23.277 erastosthenes.py:20(sieve2)
        1    0.014    0.014   63.917   63.917 erastosthenes.py:52(main)
     9592    0.024    0.000    0.024    0.000 {method 'add' of 'set' objects}
     9592    0.016    0.000    0.016    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    19184   21.163    0.001   21.163    0.001 {min}
        2    0.007    0.004    0.007    0.004 {range}

We can already see a speedup of approximately 50% from sieve1 to sieve2 but Frank then proposed an even faster approach. It uses an array of boolean values, whose indices represent the corresponding numbers. If ‘x’ is prime, the value at that position in the array is True. The algorithm now sweeps over the array and looks for already found prime numbers up to sqrt(N), if a number is marked as prime (i.e. array field is set to true) we mark all it’s multipliers as not prime. This can be done easily with a second for-loop. The array is initialized with all values as True and we start with 2, mark all even numbers as non-prime and go to 3. Now 3 is found to be prime and all it’s multipliers are marked as non-prime, then 4 is found as non-primed and skipped,… and this is continued up to sqrt(N).

At first I could not grasp why we only need to look until sqrt(N). But consider that we have reached sqrt(N), and marked all non-prime numbers according to our algorithm. Frank now told me boldly that at this point there are only correctly marked primes left in the array above sqrt(N) and up to N. This is true, even if it’s not obvious at first. But consider the following: every non-prime number greater than sqrt(N) can be factorized into two primes, where one prime is smaller than sqrt(N), otherwise we have a non-prime number which is greater than sqrt(N)*sqrt(N) = N, so it’s not in our search range. So if we have correctly found all primes up to sqrt(N) and marked all their multipliers as non-primes, only correctly marked primes greater sqrt(N) and smaller N remain.

  1. from array import array
  2.  
  3. def sieve3(N):
  4.     numbers = array('b', [1] * N)
  5.     sN = int(math.sqrt(N))
  6.  
  7.     for i in xrange(2,sN):
  8.         if numbers[i]:
  9.             for j in xrange(2*i, N, i):
  10.                 numbers[j] = 0
  11.  
  12.     return [i for i in xrange(2,N) if numbers[i]]

This algorithm does not need high-level data structures like linked-lists, so we tried to get the most simple array which could be found in Python and hence tried the array module. But for reference we also tried the standard lists of Python.

  1. def sieve4(N):
  2.     numbers = [1] * N
  3.     sN = int(math.sqrt(N))
  4.  
  5.     for i in xrange(2,sN):
  6.         if numbers[i]:
  7.             for j in xrange(2*i, N, i):
  8.                 numbers[j] = 0
  9.  
  10.     return [i for i in xrange(2,N) if numbers[i]]

These are our performance results:
   38379 function calls in 98.769 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   98.769   98.769 :1()
        1   37.161   37.161   60.856   60.856 erastosthenes.py:10(sieve1)
        1   29.776   29.776   37.521   37.521 erastosthenes.py:20(sieve2)
        1    0.211    0.211    0.211    0.211 erastosthenes.py:29(sieve3)
        1    0.120    0.120    0.120    0.120 erastosthenes.py:40(sieve4)
        1    0.061    0.061   98.769   98.769 erastosthenes.py:52(main)
        2    0.000    0.000    0.000    0.000 {math.sqrt}
     9592    0.027    0.000    0.027    0.000 {method 'add' of 'set' objects}
     9592    0.017    0.000    0.017    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    19184   31.387    0.002   31.387    0.002 {min}
        2    0.009    0.005    0.009    0.005 {range}

Now the speedup is quite gigantic, at a factor of roundabout 180. Interestingly, the standard Python lists are two times faster than the arrays provided by the array module.I must have missed the point of that module. But we wanted to go down to the bare metal and compare the performance with a C program.

Due to the fact, that all you need is an array and two for-loops, we tried to implement the algorithm in C and embed it as a module in Python, so we could make a better comparison. The most difficult part here was getting the algorithm properly wrapped and packed into a .so-library usable from Python. We saved the following source code in a file called erastomodule.c and compiled it into a .so-library for dynamic linking.

  1. #include "Python.h"
  2. #include <math .h>
  3. #include <stdlib .h>
  4. #include <string .h>
  5. #include <stdio .h>
  6.  
  7. static PyObject *
  8. erasto_sieve(PyObject *self, PyObject *args)
  9. {
  10.     unsigned long N = 0;
  11.     unsigned char *numbers = NULL;
  12.  
  13.     PyObject *primes = NULL;
  14.     PyObject *prime = NULL;
  15.  
  16.     unsigned long i,j;
  17.  
  18.     /* Parse the given parameter */
  19.     if (!PyArg_ParseTuple(args, "k", &N))
  20.         return NULL;
  21.  
  22.  
  23.     /* Initialize datastructure for sieve */
  24.     numbers = malloc(sizeof(unsigned char) * N);
  25.     memset(numbers, 1, N);
  26.    
  27.     /* This is the actual sieve !!! */
  28.     for (i = 2; i < sqrt(N); i++) {
  29.         if (numbers[i]) {
  30.             for(j = 2*i; j < N; j+=i) {
  31.                numbers[j] = 0;
  32.             }
  33.         }
  34.     }
  35.  
  36.     /* create a new PyList */
  37.     primes = PyList_New(0);
  38.     if (primes == NULL)
  39.         return NULL;
  40.  
  41.     /* Put primes into list */
  42.     for(i = 2; i < N; i++) {
  43.         if (numbers[i]) {
  44.             prime = Py_BuildValue("k", i);
  45.             if(PyList_Append(primes, prime))
  46.                 return NULL;
  47.         }
  48.     }
  49.  
  50.     return primes;
  51. }
  52.  
  53. static PyMethodDef ErastoMethods[] = {
  54.     {"sieve",  erasto_sieve, METH_VARARGS, "Sieve of Erastosthenes."},
  55.     {NULL, NULL, 0, NULL}        /* Sentinel */
  56. };
  57.  
  58.  
  59. PyMODINIT_FUNC
  60. initerasto(void) {
  61.     PyObject *m;
  62.  
  63.     m = Py_InitModule("erasto", ErastoMethods);
  64.     if (m == NULL)
  65.         return;
  66. }

Then we could import it like a normal module from Python and use the function we called erasto.sieve for our computation. The following is a comparison of sieve4 and the erasto.sieve function we implemented in C. Again we have a speedup of roundabout 180, resulting in a total speedup of roundabout 32400 when compared to my initial approach.

   6 function calls in 0.149 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.149    0.149 :1()
        1    0.126    0.126    0.126    0.126 erastosthenes.py:40(sieve4)
        1    0.017    0.017    0.149    0.149 erastosthenes.py:52(main)
        1    0.006    0.006    0.006    0.006 {erasto.sieve}
        1    0.000    0.000    0.000    0.000 {math.sqrt}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

The conclusion should be clear:

  • Think twice about your algorithm
  • Profile mercilessly
  • And port expensive functions to C

Happy optimizing!

2 comments so far...

  • Paddy3118 Said on September 13th, 2008 at 13:41:

    Psyco?

    - Paddy.

  • Prakti Said on September 25th, 2008 at 01:49:

    Psyco is a nice idea and upon the suggestion I tried to integrate it into my experiment, but if I create a proxy for the sieve4 function, this proxy is invisible to all profilers. Thus, a significant comparison needs a completely new approach. Any ideas are welcome.

    Here is the source-code for my experiment if anyone wants to work with it.

leave a reply