Tuesday, February 19, 2013

JVMs and kill signals

Ever wondered how a JVM reacts to various kill signals? The (intended) behaviour might be documented somewhere already, but I found having a table of the actual behaviour available quite useful. In particular I wanted to know which kill signals trigger the JVM to run registered shutdown hooks, and which kill signals don't actually terminate the JVM. So I decided to compile a table of that information.

I wrote a small Java application that registers a shutdown hook (so I can detect whether it ran) and then sleeps until I get a chance to kill it:

class Death {
  public static void main(String... args) throws Exception {
    Runtime.getRuntime().addShutdownHook( new Thread(){
      @Override
      public void run()
      {
        System.out.println("Shutting down");
      }
    } );
    for (;;) Thread.sleep(100);
  }
}

Then I ran the program in one terminal window (java Death; echo $?) while iterating through all kill signals (0-31) in another:

kill -$SIGNAL $(jps | grep Death | cut -d\  -f1)
signal        | shutdown | runs hook | exit code | comment
default (15)  | yes      | yes       | 143       | SIGTERM is the default unix kill signal
0             | no       | -         | -         |
1 (SIGHUP)    | yes      | yes       | 129       |
2 (SIGINT)    | yes      | yes       | 130       | SIGINT is the signal sent on ^C
3 (SIGQUIT)   | no       | -         | -         | Makes the JVM dump threads / stack-traces
4 (SIGILL)    | yes      | no        | 134       | Makes the JVM write a core dump and abort on trap 6
5             | yes      | no        | 133       | Makes the JVM exit with "Trace/BPT trap: 5"
6 (SIGABRT)   | yes      | no        | 134       | Makes the JVM exit with "Abort trap: 6"
7             | yes      | no        | 135       | Makes the JVM exit with "EMT trap: 7"
8 (SIGFPE)    | yes      | no        | 134       | Makes the JVM write a core dump and abort on trap 6
9 (SIGKILL)   | yes      | no        | 137       | The JVM is forcibly killed (exits with "Killed: 9")
10 (SIGBUS)   | yes      | no        | 134       | Emulates a "Bus Error"
11 (SIGSEGV)  | yes      | no        | 134       | Emulates a "Segmentation fault"
12            | yes      | no        | 140       | Makes the JVM exit with "Bad system call: 12"
13            | no       | -         | -         |
14            | yes      | no        | 142       | Makes the JVM exit with "Alarm clock: 14"
15 (SIGTERM)  | yes      | yes       | 143       | This is the default unix kill signal
16            | no       | -         | -         |
17            | no       | -         | 145       | Stops the application (sends it to the background), same as ^Z
18            | no       | -         | 146       | Stops the application (sends it to the background), same as ^Z
19            | no       | -         | -         |
20            | no       | -         | -         |
21            | no       | -         | 149       | Stops the application (sends it to the background), same as ^Z
22            | no       | -         | 150       | Stops the application (sends it to the background), same as ^Z
23            | no       | -         | -         |
24            | yes      | no        | 152       | Makes the JVM exit with "Cputime limit exceeded: 24"
25            | no       | -         | -         |
26            | yes      | no        | 154       | Makes the JVM exit with "Virtual timer expired: 26"
27            | yes      | no        | 155       | Makes the JVM exit with "Profiling timer expired: 27"
28            | no       | -         | -         |
29            | no       | -         | -         |
30            | yes      | no        | 158       | Makes the JVM exit with "User defined signal 1: 30"
31            | yes      | no        | 134       | Makes the JVM exit on Segmentation fault

This list was compiled using a (quite old) Oracle HotSpot Java 8 EA build on Mac OS X:

java version "1.8.0-ea"
Java(TM) SE Runtime Environment (build 1.8.0-ea-b65)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b09, mixed mode)

Hope this is useful to more people than myself.

Wednesday, October 12, 2011

Be my friend, work with me

We are hiring engineers for Neo4j at Neo Technology, and I want to work with you!

I never grew up and got a job. While I was still in college I found myself in a career as an open source developer and international speaker. Sounds like a job? Well, I never got paid (other than different organizations sponsoring my trips and hotel expenses); I did it for the fun of it. This got me a bit of attention, and a few offers from high profile companies. I said no. Even before I graduated I had joined a startup (pre-funding), founded by some friends of mine. They worked on interesting technology and did things differently, challenging the established technology in the field of database management systems. Perhaps that spoke to my inner rebel.

Now, four years after I joined Neo Technology, this once small group of friends, who used to camp on empty desks at various offices where our friends worked, has grown a lot and keeps growing faster than ever before. But to me, that original feeling never disappeared: we are still a group of friends playing with technologies that we think are interesting.

I don't have much of a social life outside of work. Why would I? Many of my closest friends are at work. Even though we are over 20 people now, compared to three when I started, I really feel like we are all friends playing with cool tech. Sure, there have been times when I've been a bit annoyed over having to share my toys with some new kid, but after getting to know them and the cool things they can do, we've always ended up being friends.
So no, I wouldn't say that I've grown up and gotten a job yet, nor do I intend to; why would I? It doesn't feel like a job, it really does feel like play. I have the privilege to keep playing and making a living out of it.

So why don't you come play with us? We've got room for more in our sandbox!

We are looking for great talent worldwide who can help us build the coolest sandcastle ever! You don't have to be a Java developer, but JVM experience helps. The important part is a talent for using technology in exciting new ways. Neo4j is (to my knowledge) the most widely used graph database in the world. It is an awesome product, but it still has lots of room for improvement, from the lower levels, to the higher levels for modeling, to bindings for different languages. Everything is open source, so the best resume you can send us is a code contribution.

While we are recruiting from anywhere in the world, we are still trying to keep the organization focused around only a few locations, so being able to relocate to the San Francisco Bay Area or Malmö, Sweden is a huge plus. We are looking for product engineers, QA experts, and support engineers to help our customers with everything from domain design to troubleshooting. Versatility is expected: you will not be doing only one task. Product engineering experience makes for better support engineers, and customer support experience makes for better product engineers.

For official details see: http://neotechnology.com/about-us/jobs/

I'm looking forward to playing with you.

Friday, February 18, 2011

Better support for short strings in Neo4j

In the past few days I've been working on a feature in the Neo4j Graph Database to store short strings with less overhead. I'm pleased to announce that this feature is now in trunk and will be part of the next milestone release. In this blog post I will describe what short strings are, and how Neo4j now stores them more efficiently.

At Neo Technology we spend one day each week working on what we call "lab projects": a chance to explore new, experimental features outside of the regular roadmap that might be useful. Two weeks ago I spiked a solution for storing short strings in a compressed way as a lab day project. To understand why, we first need a bit of background on how strings are usually stored in Neo4j. Since strings can be of variable length, Neo4j stores them in something called the DynamicStringStore. This consists of a number of blocks, 120 bytes in size, plus a 13 byte header for each block. A string is divided into chunks of 60 characters, and each such chunk is stored in its own block (a character is two bytes). For a short string such as "hello", which encoded in UTF-8 would occupy only 5 bytes, the overhead of storing it in the DynamicStringStore (including the property record of 22 bytes needed to reference the block in the DynamicStringStore) is almost 97 percent!
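
To put numbers on that claim, here is the back-of-the-envelope arithmetic, using the record sizes quoted above (my own illustration, not code from Neo4j):

# Back-of-the-envelope overhead calculation for storing "hello" the old way,
# using the record sizes quoted above.
property_record = 22    # bytes for the property record referencing the first block
block_header    = 13    # bytes of header per block in the DynamicStringStore
block_payload   = 120   # bytes of payload per block (60 two-byte characters)

total_on_disk = property_record + block_header + block_payload   # 155 bytes
useful_data   = len("hello".encode("utf-8"))                     # 5 bytes

overhead = 1.0 - float(useful_data) / total_on_disk
print "%.1f%% overhead" % (overhead * 100)                       # ~96.8%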

My initial spike looked at all strings of 9 characters or fewer and, if every character was found to be 7-bit ASCII, stored the string directly in the property record, without involving the DynamicStringStore at all. The 7-bit part is important. The property record contains a 64-bit payload field, which, when the DynamicStringStore is involved, contains the id of the first block. Nine 7-bit characters sum to 63 bits, so they fit in the 64-bit payload field. The high order bit then denotes that the content is a full 9-character string; if it is not set, the first byte instead holds the length of the string, and the remaining 56 bits (7*8) hold the actual characters.
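
A minimal sketch of that packing scheme (my own illustration, not the actual Neo4j code; the exact bit positions may differ from what Neo4j uses):

# Pack a string of up to 9 characters of 7-bit ASCII into a 64-bit payload.
# Illustrative only; the exact bit layout used by Neo4j may differ.
def pack_short_ascii(s):
    assert len(s) <= 9 and all(ord(c) < 128 for c in s)
    if len(s) == 9:
        payload = 1 << 63                       # high bit set: a full 9-character string
        for i, c in enumerate(s):
            payload |= ord(c) << (7 * (8 - i))  # 9 * 7 = 63 bits of characters
    else:
        payload = len(s) << 56                  # first byte holds the length
        for i, c in enumerate(s):
            payload |= ord(c) << (7 * (7 - i))  # up to 8 * 7 = 56 bits of characters
    return payload                              # fits in the 64-bit payload field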

While this started out as something I thought was a fun project to hack on for a day, we quickly found use for it. When importing the OpenStreetMap data for Germany with this feature in place, we found that the DynamicStringStore was now 80% smaller than before! Not only that, but the time for reading and writing strings had improved by at least 25%! (The benchmark I got this from creates nodes and relationships as well, so pure string operations are probably even faster.) Such figures are great for getting a feature into the backlog.

I am not a big fan of ASCII though. It was designed for communicating with line printers, not for storing text. Also, with short strings the number of exotic characters that people use drops significantly; a short string is more likely to be some simple alphanumerical name or identifier, such as "hello world", "UPPER_CASE", "192.168.0.1", or "+1.555.634.5773". So the next thing I did was to write a tool that could analyze the data stored in actual Neo4j instances and generate a report on the statistics of the strings actually stored. I then sent this to our public users mailing list. The feedback confirmed my suspicions about what kind of text people store, and also suggested that we would be able to store up to 65% of our users' strings as short strings.

Armed with statistics about actual strings I set out (along with my most recent colleague, Chris Gioran) to write an even better short string encoding and incorporate it into Neo4j. Last night we pushed it to git. The format we ended up with can select between six different encodings, with the choice encoded in the high order nibble of the payload entry of the property record:

  • Numerical, up to 15 characters in binary coded decimal, with the six additional code points used to encode punctuation characters commonly used in phone numbers or as thousand separators. This can encode any integer from -10^15 to 10^16 (both edges exclusive), most international phone numbers, IPv4 addresses, etc.
  • All UPPER CASE strings or all lower case strings up to 12 characters, including space, underscore, dot, dash, colon, or slash. Useful for identifiers (Java enum constant names etc.). This doesn't support mixed case though.
  • Alphanumerical strings up to 10 characters, including space or underscore. Supports mixed case.
  • European words up to 9 characters; this includes alphanumerical characters, space, underscore, dash, dot and the accented characters in the latin-1 table. Useful for building translation graphs.
  • Latin-1 up to 7 characters. Will give you parentheses if you have those in a short string.
  • UTF-8 if the string can be encoded in 7 bytes (or less). Useful for short CJK strings for example.
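
To make the selection concrete, here is a rough sketch of how a string might be classified into one of these encodings. The character sets are simplified and the exact tables Neo4j uses may differ:

# Rough sketch: pick the densest short-string encoding that can hold a string.
# Character sets are simplified; the exact tables used by Neo4j may differ.
NUMERICAL = set("0123456789 +-.,/")  # digits plus an approximate punctuation set
UPPER     = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 _.-:/")
LOWER     = set("abcdefghijklmnopqrstuvwxyz0123456789 _.-:/")
ALNUM     = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 _")

def fits(s, max_len, allowed):
    return len(s) <= max_len and all(c in allowed for c in s)

def classify(s):
    """Return the first (densest) encoding that can hold s, or None.
    s is assumed to be a unicode string."""
    if fits(s, 15, NUMERICAL):
        return "numerical"
    if fits(s, 12, UPPER) or fits(s, 12, LOWER):
        return "upper case or lower case"
    if fits(s, 10, ALNUM):
        return "alphanumerical"
    if len(s) <= 9 and all(c in ALNUM or c in "-." or 0xC0 <= ord(c) <= 0xFF for c in s):
        return "european"
    if len(s) <= 7 and all(ord(c) <= 0xFF for c in s):
        return "latin-1"
    if len(s.encode("utf-8")) <= 7:
        return "utf-8"
    return None   # not a short string, falls back to the DynamicStringStore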

The code is still in internal review, and shouldn't be considered stable until its inclusion in the next milestone release a week from now. But I am very excited about the benefits this will give to Neo4j users, both in terms of lower storage size and in terms of performance improvements. Reading (and writing) a string that is encoded as a short string is much faster than reading (or writing) a string in the DynamicStringStore, since it is only one disk read instead of two.

A big thank you goes out to the people in the Neo4j community who provided me with the string statistics that made this possible.

Happy hacking

Tuesday, December 15, 2009

Seamless Neo4j integration in Django

About a year ago I gave a presentation at Devoxx where I showed off how easy it was to use any Java library with Django in Jython. The library I demonstrated this with was of course Neo4j. I had written some code for using Neo4j to define models for Django, and now it is ready to be released for you to use.

The integration between Django and Neo4j is implemented in the model layer. Since Neo4j does not have a SQL engine, it would not have been efficient or practical to implement the support as a database layer for Django. Google did their implementation in the same way when they integrated BigTable with Django for App Engine. This means that there will be some minor modifications needed in your code compared to using PostgreSQL or MySQL. Just as with BigTable on App Engine, you will have to use a special library for defining your models when working with Neo4j, but the model definition is very similar to Django's built-in ORM. With persistence systems that integrate on the database layer the only difference is in configuration, but that requires the database to fit the mold of a SQL database.

Why the **** has this taken a year to finish?

Short answer: The cat ate my source code.

A mess of symlinks, which stemmed from the fact that Jython didn't have good support for setuptools when I started writing this code, actually led to the complete loss of my source code. But to be honest, the code wasn't that good anyway. I wanted to add support for Django's administration interface, and I knew that undertaking would require a complete rewrite of my code. That complete rewrite is done, and now it will be possible for me to support the administrative interface of Django in the next release. So why not until now, a year after the first prototype? I was working on other things, it's that simple.

Getting started

While the demonstration I gave a year ago was geared towards Jython, since that was the topic of the presentation, the Python bindings for Neo4j work equally well with CPython. That is all you need: Neo4j and Django. The Python bindings for Neo4j come with a Django integration layer built in as of the most recent revisions in the repository. The source distribution also contains a few sample applications demonstrating how the integration works. The Django integration is still in a very early stage of development, but the base is pretty solid, so new features should be much easier to add now. Since the state is pre-alpha, installation from source is the only option at the moment. Let me walk you through how to get things up and running:

  • Set up and activate a virtualenv for your development. This isn't strictly necessary, but it's so nice to know that you will not destroy your system Python installation if you mess up. Since we got Jython to support virtualenv I use it for everything. If you use CPython your virtualenv will contain a python executable, and if you use Jython it will contain a jython executable. I will refer to either simply as python from here on; substitute jython if you, like me, prefer that implementation.
  • If you are using CPython: Install JPype, it is currently a dependency for accessing the JVM-based core of Neo4j from CPython:
    $ unzip JPype-0.5.4.1.zip
    $ cd JPype-0.5.4.1
    $ python setup.py install
    
  • Check out the source code for the Python bindings for Neo4j, and install it:
    $ svn co https://svn.neo4j.org/components/neo4j.py/trunk neo4j-python
    $ cd neo4j-python
    $ python setup.py install
    
  • Install Django:
    $ easy_install django
    
  • Create a new Django project:
    $ django-admin.py startproject neo4django
    
  • Create a new app in your Django project:
    $ python neo4django/manage.py startapp business
    
  • Set up the configuration parameters for using Neo4j with Django by adding the following configuration to your settings.py:
    NEO4J_RESOURCE_URI = '/var/neo4j/neo4django'
    # NEO4J_RESOURCE_URI should be the path to where
    #    you want to store the Neo4j database.
    
    NEO4J_OPTIONS = {
        # this is optional and can be used to specify
        # extra startup parameters for Neo4j, such as
        # the classpath to load Neo4j from.
    }
    
    You can ignore the default Django configurations for RDBMS connections if you only plan to use Neo4j, but if you want to use Django's built-in admin interface (not supported with Neo4j quite yet) or the authentication module, you will need to configure them.
  • You are now ready to create your first Neo4j backed domain objects for your Django application, by editing business/models.py. Let's create a simple model for companies with owners and employees:
    from neo4j.model import django_model as model
    
    class Person(model.NodeModel):
        first_name = model.Property()
        last_name = model.Property()
        def __unicode__(self):
            return u"%s %s" % (self.first_name, self.last_name)
    
    class Company(model.NodeModel):
        name = model.Property(indexed=True)
        owners = model.Relationship(Person,
            type=model.Outgoing.OWNED_BY,
            related_name="owns",
        )
        employees = model.Relationship(Person,
            type=model.Incoming.WORKS_AT,
            related_name="employer",
            related_single=True, # Only allow Persons to work at one Company
        )
        def __unicode__(self):
            return self.name
    
  • That's it, you've created your first Django domain model using Neo4j. Let's try it out:
    $ python neo4django/manage.py shell
    >>> from neo4django.business import models
    >>> seven_eleven = models.Company.objects.create(name="Seven Eleven")
    >>> seven_eleven.employees.add(
    ...     models.Person.objects.create(
    ...         first_name="Sally", last_name="Getitdone"),
    ...     models.Person.objects.create(
    ...         first_name="John", last_name="Workerbee"))
    >>> seven_eleven.save() # store the newly created relationships
    >>> people = list(seven_eleven.employees.all())
    >>> someone = people[0]
    >>> print someone, "works at", someone.employer
    

Notice how the model objects are compatible with model objects created using Django's built-in ORM, making it easy to port your existing applications to a Neo4j backend: all you need to change is the model definitions. For more examples, see the example directory in the repository: https://svn.neo4j.org/components/neo4j.py/trunk/src/examples/python/.

Future evolution

There is still more work to be done. As this is the first release, there are likely to be bugs, and I know about a few things (mainly involving querying) that I have not implemented support for yet. I also have a list of (slightly bigger) features that I am going to add as well. To keep you interested, I'll list them with a brief explanation:

  • Add support for the Django admin interface. You should be able to manage your Neo4j entities in the Django administration interface, just as you manage ORM entities. To do this I need to dig further into the internals of the admin source code, to find out what it expects from the model objects to be able to pick up on them and manage them. The hardest part with this is that the admin system has a policy of silent failure, meaning that it will not tell me how my code violates its expectations.
  • Add support for Relationship models. Currently you can only assign properties to nodes in the domain modeling API; you should also be able to have entities represented by relationships. The way you will do this is by extending the Relationship-class.
  • Add a few basic property types. I will add support for creating your own property types by extending the Property-class (this is implemented already, but not tested, so if it works it's only by accident). I will also add a few basic subtypes of Property, a datetime type at the very least, as well as support for choosing what kind of index to use with each indexed property; in the case of datetime, a Timeline-index seems quite natural, for example. Supporting enumerated values for Properties is also planned, i.e. limiting the set of allowed values to an enumerated set of values.
  • Tapping into the power of Neo4j, by adding support for methods that perform arbitrary operations on the graph (such as traversals), where the returned nodes are automatically converted to entity objects. I think this will be a really cool and powerful feature, but I have not worked out the details of the API yet.

Report any bugs you encounter to either the Neo4j bug tracker or the Neo4j mailing list. Suggestions for improvements and other ideas are also welcome on the mailing list, to me personally, or why not as a comment on this blog.

Happy Hacking

Friday, August 07, 2009

Java integration in future Jython

I've seen a lot of discussion about this lately, so I thought that it was time for an actual Jython developer (myself) to share some ideas on how Java integration in Jython could be improved. At the same time I'd like to propose some changes that could make the different Python implementations more unified, and that could even lead to a common Java integration API in all of them.

The most basic part of the Java integration in Jython is the ability to import and use Java classes. This is impossible for other Python implementations to do in the same way, and thus breaks compatibility fundamentally. I therefore propose that we remove this functionality as it is in Jython today (!). Instead we should look at how IronPython enables using CLR (.NET) classes. In IronPython you first need to import clr before you can access any of the CLR types. The same is done in other languages on the JVM as well, for example JRuby where you need to require 'java' before using any Java libraries. I propose we require something similar in Jython, and what better package to require you to import than java?

An observation: the java package in Java does not contain any classes, only sub-packages. Furthermore, all the sub-packages of the java package follow the Java naming conventions, i.e. they all start with a lowercase letter. This gives us a name space to play with: anything under the java package that starts with an uppercase letter.

What happens when you import java? The java Python module is a "magic module" that registers a Python import hook. This import hook will then enable you to import real Java packages and classes. In Jython many of the builtin libraries will of course import java, which means that this will be enabled by default in Jython. But writing code that is compatible across Python implementations would now be possible, by simply ensuring that you import java before any other Java packages.
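
As a sketch of how that could be wired up (illustrative only; every name below is made up), the magic module could install a standard PEP 302 meta-path hook when it is first imported:

# Illustrative sketch of the "magic module" idea, using the standard
# PEP 302 import hook machinery. All names here are made up.
import sys

class JavaImportHook(object):
    """Meta-path finder/loader that claims imports of Java packages and classes."""

    def find_module(self, fullname, path=None):
        if self._is_java_package(fullname):    # e.g. "java.util" or "org.example"
            return self                        # we act as the loader as well
        return None                            # let the normal machinery handle it

    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]
        module = self._wrap_java_package(fullname)  # build a module-like proxy
        sys.modules[fullname] = module
        return module

    def _is_java_package(self, name):
        # ask the JVM (through JNI or the Jython runtime) whether the name exists
        raise NotImplementedError

    def _wrap_java_package(self, name):
        # expose the package's classes and sub-packages as attributes of a proxy
        raise NotImplementedError

# importing the java module would, as a side effect, do something like:
sys.meta_path.append(JavaImportHook())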

The content of the java module

Most if not all of what is needed to utilize Java classes from Python code is provided by the import hook that the java module registers when it is loaded. This means that the content of the java module needs to deal with the other direction of the interfacing: defining and implementing APIs in Python that Java code can utilize. I propose that the Python java module contain the following:

JavaClass
A class decorator that exposes the decorated class as a class that can be accessed from Java. Accepts a package keyword argument for defining the Java package to define the class in; if omitted, it is derived from the __module__ attribute of the class.
Possibly JavaClass should also be the Python type of imported Java classes.
Field
An object for defining Java fields in classes. Takes a single argument, the type of the field. Example usage:
@java.JavaClass
class WithAField:
    data = java.Field(java.lang.String)
Array
An object for defining Java arrays. This is used to define Java array types. Examples:
  • Array[java.Primitive.int] corresponds to the Java type int[]
  • Array[java.lang.String] corresponds to the Java type java.lang.String[]
  • Array corresponds to the Java type java.lang.Object[]
  • Array[Array[java.lang.String]] corresponds to the Java type java.lang.String[][]
Access
A set of Java access definition decorators. Contains:
  • Access.public
  • Access.package - this needs to be explicitly available since it does not make sense as the default in Python code.
  • Access.protected
  • Access.module - for the new access modifier in the upcoming module system (a.k.a. Project Jigsaw) for Java.
  • Access.private
  • The default access modifier should either be public, or the absence of an access modifier decorator could mean that the method is not exposed in the Java class at all. This needs further discussion.
Primitive
The set of primitive types in Java:
  • Primitive.void
  • Primitive.boolean
  • Primitive.byte
  • Primitive.char
  • Primitive.short
  • Primitive.int
  • Primitive.long
  • Primitive.float
  • Primitive.double
  • These can be used as type parameters for Array but not for Generic types (Since primitives are not allowed as generic type parameters in Java).
Overload
Used to implement (and define) overloaded methods, several different methods with the same name, but different type signatures. Example usage:
@java.JavaClass
class WithOverloadedMethod:
    @java.Access.public
    def method(self, value:java.lang.String) -> java.util.List[java.lang.String]:
        ...
    @java.Overload(method)
    @java.Access.public
    def method(self, value:java.lang.Integer) -> java.lang.String:
        ...
    @java.Overload(method)
    @java.Access.public
    def method(self, value:java.lang.Iterable[java.lang.String]) -> java.Primitive.void:
        ...

Java classes and interfaces, when imported, are Pythonized in such a way that they can be used as bases for Python classes. Generics are specified by subscripting the generic Java class. Java annotations are Pythonized in a way that turns them into decorators that add a special attribute to the decorated element: __java_annotations__. Annotations on imported Java classes and methods would also be exposed through the __java_annotations__ property for consistency. Access modifiers would similarly add a __java_access__ property to the object they decorate.
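
Put together, a class using an imported Java base class, generics and a Java annotation might end up looking something like this. This is hypothetical usage of the proposal above; none of it exists yet:

# Hypothetical usage of the proposal above; none of this exists yet.
import java          # registers the import hook
import java.util
import javax.annotation

@java.JavaClass
class NameList(java.util.ArrayList[java.lang.String]):  # generic base via subscripting

    @javax.annotation.Generated("example")  # a Java annotation used as a decorator
    @java.Access.public
    def first(self) -> java.lang.String:
        return self.get(0)

# The decorators leave their marks on the method, e.g.:
#   NameList.first.__java_annotations__  -> the Generated annotation
#   NameList.first.__java_access__       -> java.Access.public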

Kay Schluer also suggested allowing decorators on assignments, to be able to support annotations on fields. I don't really have an opinion on this. Since I don't think fields should be exposed in any public API anyway it's a bit useless, and for the cases where fields are used (such as dependency injection systems) I think it suffices to have it all in the same assignment: dependency = javax.inject.Inject(java.Access.private(java.Field(JavaClassIDependOn))), where the name will be extracted as "dependency" when the class is processed by the JavaClass class decorator. But if others find assignment decorators useful, I am not opposed to them. If assignment decorators are added to Python, it might be worth considering having a slightly different signature for these decorator functions, so that the name of the target variable is passed as a parameter as well. Then my example could look like this:

@java.JavaClass
class WithInjectedDependency:
    @javax.inject.Inject # This is JSR 330 by the way
    @java.Access.private
    @java.Field
    dependency = JavaClassIDependOn
    # could expand to: dependency = javax.inject.Inject(
    #     "dependency", java.Access.private(
    #         "dependency", java.Field(
    #             "dependency", JavaClassIDependOn)))
    # or to the same thing as above, depending on how
    # assignment decorators were implemented...

When defining methods in Java integration classes we use Python 3 function annotations to define the method signatures. These can be omitted; the default types in that case would of course be java.lang.Object. It is important that we support exposing, from Jython, classes that don't have any Java integration added to them, since we want to enable importing existing Python libraries into Java projects and using them without having to port them. These classes will not have the JavaClass decorator applied to them. Instead this will be done automatically by Jython at the point when the Python class first needs to be exposed to Java. This is not something that the java module needs to deal with, since it doesn't fit with other Python implementations.

Outstanding issues

There are still a few Java integration issues that I have not dealt with, because I have not found a solution that I feel good about yet.

Defining Java interfaces
Is this something we need to be able to do? If so, the proper approach is probably to add a JavaInterface decorator to the java module, similar to the JavaClass decorator.
Defining Java enums
This might be something that we want to support. I can think of two options for how to declare the class. Either we add a JavaEnum decorator to the java module, or we add special case treatment for when a class extends java.lang.Enum (I am leaning towards this approach). Then we need to have some way to define the enum instances. Perhaps something like this:
@java.JavaClass
class MyEnum(java.lang.Enum):
    ONE = java.EnumInstance(1)
    TWO = java.EnumInstance(2, True)
    THREE = java.EnumInstance(3, True)
    FOUR = java.EnumInstance(4)
    def __init__(self, number, is_prime=False):
        self.number = number
        self.is_prime = is_prime
    def __str__(self):
        return self.name()
    class SEVENTEEN(java.EnumInstance):
        """This is an enum instance with specialized behavior.
        Will extend MyEnum, but there will only be one instance."""
        def __init__(self):
            """This class gets automatically instantiated
            by the __metaclass__ of Enum."""
            self.number = 17
            self.is_prime = True
        def __str__(self):
            return "The most random number there is."
Defining generic types
I have discussed how to specify type parameters for generic types, but how would you define a generic Java type in Python? How about something like this:
@java.JavaClass
class GenericClass:
    T = java.TypeParameter() # default is "extends=java.lang.Object"
    C = java.TypeParameter(extends=java.util.concurrent.Callable)
This gets complicated when wanting to support self references in the type parameters, but the same is true for implemented interfaces, such as:
class Something implements Comparable<? extends Something> {
    ...
}
Defining Java annotations
I have dealt with supporting the use of Java annotations, but what about defining them? I highly doubt that defining Java annotations in Python is going to be useful, but I prefer to not underestimate what developers might want to do. I do however think we could get far without the ability to define Java annotations in Python, but if we were to support it, what would it look like? Defining the class would probably be a lot like how enums are defined, either by special casing java.lang.annotation.Annotation or providing a special java.Annotation decorator.
@java.JavaInterface
class MyAnnotation(java.lang.annotation.Annotation):
    name = java.AnnotationParameter(java.lang.String)
    description = java.AnnotationParameter(java.lang.String, default="")

java for other Python implementations

I mentioned that requiring the user to explicitly import java to make use of Java classes would make it possible for other Python implementations to support the same Java integration API. So what would the default implementation of the java module look like? There is a very nice standardized API for integrating with Java from other external programming languages: JNI. The default java module would simply implement the same functionality as the Jython counterpart by interacting with JNI using ctypes. Since ctypes is supported by all Python implementations (Jython support is under development) the java integration module would work across all Python implementations without additional effort. Right there is a major advantage over JPype and JCC (the two major Java integration modules for CPython today).
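
As a rough illustration of what that default implementation could build on, the sketch below brings up a JVM from CPython through the JNI invocation API (JNI_CreateJavaVM) using nothing but ctypes. The library path is platform specific and only an example:

# Sketch: starting a JVM from CPython via the JNI invocation API, using only ctypes.
# The libjvm path below is platform specific and just an example.
import ctypes

libjvm = ctypes.CDLL("/usr/lib/jvm/java-6-sun/jre/lib/amd64/server/libjvm.so")

class JavaVMOption(ctypes.Structure):
    _fields_ = [("optionString", ctypes.c_char_p),
                ("extraInfo", ctypes.c_void_p)]

class JavaVMInitArgs(ctypes.Structure):
    _fields_ = [("version", ctypes.c_int),
                ("nOptions", ctypes.c_int),
                ("options", ctypes.POINTER(JavaVMOption)),
                ("ignoreUnrecognized", ctypes.c_ubyte)]

JNI_VERSION_1_6 = 0x00010006

options = (JavaVMOption * 1)()
options[0].optionString = "-Djava.class.path=."

args = JavaVMInitArgs(JNI_VERSION_1_6, 1, options, 0)

jvm = ctypes.c_void_p()
env = ctypes.c_void_p()
result = libjvm.JNI_CreateJavaVM(ctypes.byref(jvm), ctypes.byref(env), ctypes.byref(args))
assert result == 0   # 0 (JNI_OK) means the JVM is up; env now points to a JNIEnv
# From here the java module would use the JNIEnv function table to look up
# classes (FindClass), call methods, and so on.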

Integration from the Java perspective

I have not given as much thought to the area of utilizing Python code from Java. Still this is one of the most important tasks for Jython to fulfill. This section is therefore just going to be some ideas of what I want to be able to do.

Use Python for application scripting
This is possible today, and a quite simple case, but I still think that it can be improved. Specifically the problem with Jython today is that there is no good API for doing so. Or to be frank, there is hardly an API at all. This is being improved upon though, the next update of Jython will include an updated implementation of the Java Scripting API, and the next release will introduce a first draft of a proper Jython API, something that we will support long term after a few iterations, and that you can build your applications against.
Use Jython to implement parts of your application
We want to be able to write polyglot applications, where parts of the application are implemented in Python. This is more than just scripting the application. Applications generally work without scripts. We want to be able to write the implementation of parts of an application in Python with Jython. This is possible today, but a bit awkward without an official Jython API. This is being worked on in a separate project called PlyJy, where we are experimenting with an API for creating object factories for Jython. Jython object factories are objects that call into a Python module, instantiate a Python class, conform it to a Java interface and return it. So far this project is looking good and there is a good possibility that this will get included in the Jython API.
Directly link (Java) applications to Python code
This is where things start to get advanced. It would be nice if you could write a library in Python (or import an existing one) and link your Java code directly with the classes and functions defined in that library. This would require Jython to generate Java proxies: actual Java classes where the methods correspond to the actual signatures, with proper constructors and everything else you would need to use it like any other Java code, while hiding away the dynamic aspects that make it Python. This could either be done through a compilation step, where some Jython proxy compiler generates the proxies that the Java code can link with, or through utilizing a ClassLoader that loads a Python module and inspects the content, automatically generating the required proxies. With the ClassLoader approach, javac would need to know about it and use it to load signatures from Python code. This is of course where the Java integration decorators described above fit in.

What do you think?

I would love to get feedback on these ideas. Either through comments to this entry, via Twitter or on the Jython-dev mailing list.

Please note that the ideas presented in this blog post are my own and do not reflect any current effort in the Jython project.

Friday, July 31, 2009

"Social networking" killed productivity

Twitter has become work. Not acceptable for work, i.e. something that is not frowned upon to do at work, but actual work, something you are required to do at work. At least this is the case if you are involved somewhere where the development team is the marketing team, like a startup or an open source project. For the record, my involvement in Neo4j qualifies in both categories, and Jython is most certainly an open source project, and quite a high profile one at that.

In order to stay on top of things in this situation you easily find yourself with push-based Twitter notifications, or at least reading a lot of material on a regular basis. I, for example, get about 150 to 200 tweets per day from the people I follow. Combine this with the expectation to stay on top of email (yet again you go for push), and you've got a constant stream of interrupts, and this really kills productivity.

Just the other day I read the Life Offline posts by Aaron Swartz, and found that I very much recognize myself in how he describes the problems of constant online presence. It would be wonderful if I, like he was able to do, could take a long stretch of time away from being connected, but I don't think that is possible, at least not now or in the near future. The problem stands though: I am not being productive, and some things don't get done in time. And this is a problem.

I've tried shifting my email and Twitter use to only process these things once per day, but it still takes two hours or more from my day to simply process the incoming stream. By processing I mean:

  • Read all the email and define actions.
  • Read all tweets, open tabs for links that seem interesting, skim those pages and define actions.
  • Read feeds and define actions.

That takes two hours. Then I still have to perform the actions that I have defined, which could take up the rest of the day.

I noticed already about twelve years ago how destructive online communities and social networks could be, and how much time they consume. I have thus tried to stay away from them, which is why I don't use my Facebook account. But when social networking has become part of work it is much harder to avoid. Twitter is also difficult to ignore because of how hugely influential it is: it is the de facto way to find out about new things and interesting articles.

I am starting to believe that Donald Knuth made a wise decision in not having an email address, but as he points out, having an email address is for people who need to be on top of things, and he does not have one because he does not have to be on top of things anymore. I will agree with that: Donald Knuth has contributed a lot to the field of computer science, but he is definitely not on top of things anymore. So how do you cope with being on top of things while still being productive? Is it possible? I would love to get any insight into the secrets that I am obviously unaware of.

Wednesday, July 15, 2009

Improving performance in Jython

About two weeks ago I published a writeup of my findings on the performance of synchronization primitives in Jython, from my presentation at JavaOne. During the presentation I said that these performance issues were something that I was going to work on and improve. And indeed I did. I cannot take full credit for this; Jim Baker played a substantial part in this work as well. The end result is still something I'm very proud of, since we managed to improve the performance of this benchmark by as much as 50 times.

The benchmarks

The comparisons were performed based on the execution of this benchmark script invoked with:

  • JAVA_HOME=$JAVA_6_HOME jython synchbench.py
  • JAVA_HOME=$JAVA_6_HOME jython -J-server synchbench.py
# -*- coding: utf-8 -*-
from __future__ import with_statement, division

from java.lang.System import nanoTime
from java.util.concurrent import Executors, Callable
from java.util.concurrent.atomic import AtomicInteger

from functools import wraps
from threading import Lock

def adder(a, b):
    return a+b


count = 0
def counting_adder(a, b):
    global count
    count += 1 # NOT SYNCHRONIZED!
    return a+b


lock = Lock()
sync_count = 0
def synchronized_counting_adder(a, b):
    global sync_count
    with lock:
        sync_count += 1
    return a+b


atomic_count = AtomicInteger()
def atomic_counting_adder(a,b):
    atomic_count.incrementAndGet()
    return a+b


class Task(Callable):
    def __init__(self, func):
        self.call = func

def callit(function):
    @Task
    @wraps(function)
    def callable():
        timings = []
        for x in xrange(5):
            start = nanoTime()
            for x in xrange(10000):
                function(5,10)
            timings.append((nanoTime() - start)/1000000.0)
        return min(timings)
    return callable

def timeit(function):
    futures = []
    for i in xrange(40):
        futures.append(pool.submit(function))
    sum = 0
    for future in futures:
        sum += future.get()
    print sum

all = (adder,counting_adder,synchronized_counting_adder,atomic_counting_adder)
all = [callit(f) for f in all]

WARMUP = 20000
print "<WARMUP>"
for function in all:
    function.call()
for function in all:
    for x in xrange(WARMUP):
        function.call()
print "</WARMUP>"

pool = Executors.newFixedThreadPool(3)

for function in all:
    print
    print function.call.__name__
    timeit(function)
pool.shutdown()

glob = list(globals())
for name in glob:
    if name.endswith('count'):
        print name, globals()[name]

And the JRuby equivalent for comparison:

require 'java'
import java.lang.System
import java.util.concurrent.Executors
require 'thread'

def adder(a,b)
  a+b
end

class Counting
  def initialize
    @count = 0
  end
  def count
    @count
  end
  def adder(a,b)
    @count = @count + 1
    a+b
  end
end

class Synchronized
  def initialize
    @mutex = Mutex.new
    @count = 0
  end
  def count
    @count
  end
  def adder(a,b)
    @mutex.synchronize {
      @count = @count + 1
    }
    a + b
  end
end

counting = Counting.new
synchronized = Synchronized.new

puts "<WARMUP>"
10.times do
  10000.times do
    adder 5, 10
    counting.adder 5, 10
    synchronized.adder 5, 10
  end
end
puts "</WARMUP>"

class Body
  def initialize
    @pool = Executors.newFixedThreadPool(3)
  end
  def timeit(name)
    puts
    puts name
    result = []
    40.times do
      result << @pool.submit do
        times = []
        5.times do
          t = System.nanoTime
          10000.times do
            yield
          end
          times << (System.nanoTime - t) / 1000000.0
        end
        times.min
      end
    end
    result.each {|future| puts future.get()}
  end
  def done
    @pool.shutdown
  end
end

body = Body.new

body.timeit("adder") {adder 5, 10}
body.timeit("counting adder") {counting.adder 5, 10}
body.timeit("synchronized adder") {synchronized.adder 5, 10}

body.done

Where we started

A week ago the performance of this Jython benchmark was bad. Compared to the equivalent code in JRuby, Jython required over 10 times as much time to complete.

When I analyzed the code that Jython and JRuby generated and executed, I came to the conclusion that the reason Jython performed so badly was that the call path from the running code to the actual lock/unlock instructions introduced too much overhead for the JVM to have any chance at analyzing and optimizing the lock. I published this analysis in my writeup on the problem. It would of course be possible to lower this overhead by importing and utilizing the pure Java classes for synchronization instead of using the Jython threading module, but we like how the with-statement reads for synchronization:

with lock:
    counter += 1
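
For comparison, the alternative hinted at above, calling a java.util.concurrent lock directly from Jython, works but reads considerably worse. A sketch of what the benchmark's synchronized adder could look like that way:

# What the synchronized adder could look like using a Java lock directly,
# bypassing the Jython threading module (illustrative sketch).
from java.util.concurrent.locks import ReentrantLock

java_lock = ReentrantLock()
sync_count = 0

def synchronized_counting_adder(a, b):
    global sync_count
    java_lock.lock()
    try:
        sync_count += 1
    finally:
        java_lock.unlock()
    return a + b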

Getting better

Based on my analysis of how the with-statement compiles and the way that this introduces overhead, I worked out the following redesign of the with-statement context manager interaction that would allow us to get closer to the metal, while remaining compatible with PEP 343:

  • When entering the with-block we transform the object that constitutes the context manager to a ContextManager-object.
  • If the object that constitutes the context manager implements the ContextManager interface it is simply returned. This is where context managers written in Java get their huge benefit by getting really close to the metal.
  • Otherwise a default implementation of the ContextManager is returned. This object is created by retrieving the __exit__ method and invoking the __enter__ method of the context manager object.
  • The compiled code of the with-statement then only invokes the __enter__ and __exit__ methods of the returned ContextManager object.
  • This has the added benefit that even for context managers written in pure Python the ContextManager could be optimized and cached when we implement call site caching.
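
In Python terms, the fallback path of that protocol could be sketched roughly like this. This is my own rendering, not the actual Jython code, and ContextManager stands in for the proposed Java-level interface:

# Rough Python rendering of the redesigned protocol described above.

class ContextManager(object):
    """Stand-in for the proposed Java-level interface."""
    def __enter__(self): raise NotImplementedError
    def __exit__(self, exc_type, exc_value, traceback): raise NotImplementedError

class WrappedContextManager(ContextManager):
    """Fallback wrapper used when the context manager is a plain Python object."""
    def __init__(self, obj):
        self._obj = obj
        self._exit = type(obj).__exit__            # __exit__ retrieved up front
        self._entered = type(obj).__enter__(obj)   # __enter__ invoked right away
    def __enter__(self):
        return self._entered
    def __exit__(self, exc_type, exc_value, traceback):
        return self._exit(self._obj, exc_type, exc_value, traceback)

def as_context_manager(obj):
    if isinstance(obj, ContextManager):   # context managers written in Java pass straight through
        return obj
    return WrappedContextManager(obj)

# The compiled with-statement then only ever calls __enter__ and __exit__
# on whatever as_context_manager returns.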

This specification was easily implemented by Jim, who could then rewrite the threading module in Java to let the lock implementation benefit directly from the rewritten with-statement, and thereby get the actual code really close to the locking and unlocking. The results were instantaneous and beyond expectation:

Not only did we improve performance, but we passed the performance of the JRuby equivalent! Even using the client compiler, with no warm up, we perform almost two times better than JRuby. Turn on the server compiler, let the JIT warm up and perform all its compilation, and we end up with a speedup of slightly more than 50 times.

A disclaimer is appropriate here. With the first benchmark (before this was optimized) I didn't have time to wait for a full warmup, because the benchmark was so incredibly slow at that point, and because I was doing the benchmarks quite late before the presentation and didn't have time to leave it running over night. Instead I turned down the compilation threshold of the Hotspot server compiler and ran just a few warmup iterations. It is possible that the JVM could have optimized the previous code slightly better given (a lot) more time. The actual speedup might be closer to the speedup from the first code to the new code using the client compiler and no warmup. But that is still a speedup of almost 20 times, which is still something I'm very proud of. There is also the possibility that I didn't run or implement the JRuby version in the best possible way, meaning that there might be ways of making the JRuby version run faster that I don't know about. The new figures are still very nice, much nicer than the old ones for sure.

The current state of performance of Jython synchronization primitives

It is also interesting to see how the current implementation compares to the other versions in Jython that I included in my presentation:

Without synchronization the code runs about three times as fast as with synchronization, but the counter does not return the correct result here due to race conditions. It's interesting from the point of view of analyzing the overhead added by synchronization but not for an actual implementation. Two times overhead is quite good in my opinion. What is more interesting to see is that the fastest version from the presentation, the one using AtomicInteger, is now suffering from the overhead of reflection required for the method invocations compared to the synchronized version. In a system with more hardware threads (commonly referred to as "cores") the implementation based on AtomicInteger could still be faster though.

Where do we proceed from here?

Now that we have proven that it is possible to get a nice speedup from this redesign of the code paths, the next step is to provide the same kind of optimizations for code written in pure Python. Providing a better version of contextlib.contextmanager that exploits these faster code paths should be the easiest way to improve context managers written in Python. Then there is of course a wide range of other areas in Python where performance could be improved through the same kind of thorough analysis. I don't know at this point what we will focus on next, but you can look forward to many more performance improvements in Jython in the time to come.