Thursday, October 27, 2011

PyPy and the road towards SciPy

Hello


PyPy's recent effort to bring NumPy to PyPy, and the associated fundraiser,
caused a lot of discussion in the SciPy community regarding PyPy, NumPy,
SciPy and the future of numeric computing in Python.


There were discussions on the topic, as well as various blog posts
from the SciPy community addressing a few of the issues. It seems there was
a lot of talking past each other, and I would like to clarify a few points
here, although this should be taken as my personal opinion on the subject.


So, let's start from the beginning. There are no plans for PyPy to
reimplement everything that's out there in RPython. That has been pointed
out from the beginning as a fallacy of our approach -- we simply don't plan
to do that. We agree that Python is a great glue language and we would like
to keep it that way. PyPy can interface nicely with C using ctypes; the
story for C++ is slightly worse (even though there have been experiments).
What we know by now is that the CPython C API is not a very good glue for
PyPy: it's too tied to CPython and it prevents a lot of interesting
optimizations from happening. There are a few contenders, with Cython being
the favorite for now. However, for Cython to be usable we need to have a
story for C++ (I know Cython does have one, but it's unclear how that would
work with the PyPy backend).
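
As a rough illustration of the ctypes route mentioned above, here is
a minimal sketch that calls the C math library's sqrt() directly. It assumes
a POSIX-like system where the math library can be located; the c_sqrt
wrapper name is made up for this example.

```python
import ctypes
import ctypes.util

# Locate the C math library. The file name differs per platform, so fall
# back to the symbols already loaded into the current process (POSIX only).
libm_path = ctypes.util.find_library("m")
libm = ctypes.CDLL(libm_path) if libm_path else ctypes.CDLL(None)

# Declare the C signature: double sqrt(double)
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double


def c_sqrt(x):
    """Hypothetical wrapper: compute a square root by calling into C."""
    return libm.sqrt(x)
```

Nothing here goes through the CPython C API: the library is loaded and
described from Python code alone.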


Which brings me to the second point: while a lot of code in packages like
SciPy or matplotlib should be reusable in PyPy, it's probably not reusable
in its current form. Either a lot of it has to move to Cython or some other
way of interfacing with C will come along. This should make it clear that
we want to interface with SciPy and reuse as much as possible.


Another recurring question is why we don't just reuse Cython for NumPy
instead of reimplementing everything. The problem is that we need
a robust array type with the whole interface before we can start using
Cython for anything. Since we're going to implement it anyway, why not go
all the way and implement the full NumPy module? The topic of the current
funding proposal is exactly that -- to provide a full NumPy module. That
would be a very good start for integrating the full stack of SciPy and
matplotlib and all other libraries out there.


But the trick is also that a robust array module can go a long way alone.
It allows you to prototype a lot of algorithms on its own and generally has
its uses, without having to worry "but if I read all the elements from the
array it's going to be dog slow".
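
The kind of prototype this has in mind is plain Python that reads array
elements one at a time. Below is a small sketch of such a loop; it uses the
standard library's array.array purely as a stand-in for a NumPy array so
that it runs anywhere, and the running_mean function is a made-up example,
not part of any library.

```python
from array import array


def running_mean(values, window):
    """Running mean over `window` consecutive elements, read one by one."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    out = array("d")
    total = 0.0
    for i in range(len(values)):
        total += values[i]              # element-by-element read
        if i >= window:
            total -= values[i - window]  # drop the element leaving the window
        if i >= window - 1:
            out.append(total / window)
    return out


data = array("d", [1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = running_mean(data, 2)
```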


The last accusation is that we're trying to split the community. The answer
is simply no. We have a relatively good roadmap for getting to support
what's out there in the scientific community, and ideally to support all
the people out there. This will however take some time, and the group of
people that can run their stuff on top of PyPy will be growing over time.
This is indeed precisely what is happening in other areas of the Python
world -- more and more stuff runs on PyPy and people find it more and more
interesting to try to adapt their own stuff to run on it.


To summarize, I don't really think there is that much of a gap between us
and the SciPy people. We'll start small (by providing a full NumPy
implementation) and then gradually move forward, reusing as much as possible
from the entire stack.


Cheers,
fijal

7 comments:

  1. I'm going to play devil's advocate and ask the question of why PyPy should care one bit about the existing Numpy implementation or supporting C++ right now. I think it would be cool if the PyPy folks simply built the array type that *they* want. Make it fast. Do every kind of crazy optimization you can think of with it. Sure, call it something other than numpy to start, but make it something that programmers who want to live on the bleeding edge can play around with and try out (I know I'd be interested in messing around with something like that). Providing full numpy compatibility and all of that can come later on after more experience has been gained.

  2. Hi Dave.

    If you download a PyPy nightly, you can play with a numpy.array that does exactly this. We're working on adding features (like multidimensional arrays), and the numpy API is simply kind of good.

  3. The numpy interface is battle-tested over many years of use, and is pretty flexible. I am usually pleasantly surprised when applying it to new problems.

    Given the effort required to integrate a multidimensional array type into PyPy, I don't think it makes sense to try to reinvent the wheel by designing a completely new API. I could see someone experimenting with the API after a numpy-derived core is in place.

  4. You can write "full" in bold, but that doesn't make it so. It should be clear to you by now that by claiming to provide a full numpy implementation you are at the very least confusing the issue for many users. To spell it out once more, here is what numpy provides and what you plan to implement:

    - Python API; ndarray, dtypes (yes)
    - C API; ndarray, dtypes (no)
    - Fourier transforms (no - I think)
    - Linear algebra (no - I think)
    - Random number generators (yes - I think)

    Furthermore, several people (Travis, David, Ian, Dave Beazley above) mentioned you shouldn't call your implementation numpy. Before, you were using micronumpy; that makes a lot more sense.

  5. When I say full, I mean full. It's all yes in your table except the C API. A way to call C using those arrays will be provided, but not via the CPython C API.

    We'll rename it to numpypy for the time being (at least until it's reasonably complete).

  6. I'm not quite sure why people are getting so fussed about it. Most of the work in SciPy is in the C code, and it will still be easy to point some algorithm written in C at the memory held by the new PyPy arrays, just as it is with the current numpy.

    Why would people use PyPy for science if its implementation of numpy was slower than CPython's? They wouldn't, and that's why PyPy can't expose the existing CPython C API: simply exposing that API would make it much slower, due to the overhead of simulating ref-counting etc. There's no point in PyPy trying to make a numpy implementation that exposes the CPython C API.

  7. I think that linear algebra and Fourier transforms are frequently needed.
    Come on guys, let's donate:
    http://pypy.org/numpydonate.html
