Wednesday, October 12, 2011

Be my friend, work with me

We are hiring engineers for Neo4j at Neo Technology, and I want to work with you!

I never grew up and got a job. While I was still in college I found myself in a career as an open source developer and international speaker. Sounds like a job? Well, I never got paid (other than different organizations sponsoring my trips and hotel expenses); I did it for the fun of it. This got me a bit of attention, and a few offers from high-profile companies. I said no. Even before I graduated I had joined a pre-funding startup founded by some friends of mine. They worked on interesting technology and did things differently, challenging the established technology in the field of database management systems. Perhaps that spoke to the mindset of my inner rebel.

Now, four years after I joined Neo Technology, this once small group of friends, who used to camp on empty desks at various offices where our friends worked, has grown a lot and keeps growing faster than ever before. But to me, that original feeling never disappeared: we are still a group of friends, playing with technologies that we think are interesting.

I don't have much of a social life outside of work. Why would I? Many of my closest friends are at work. Even though we are over 20 people now, compared to three when I started, I really feel like we are all friends playing with cool tech. Sure, there have been times when I've been a bit annoyed over having to share my toys with some new kid, but after getting to know them and the cool things they can do, we've always ended up being friends.
So no, I wouldn't say that I've grown up and gotten a job yet, nor do I intend to. Why would I? It doesn't feel like a job; it really does feel like play. I have the privilege to keep playing and making a living out of it.

So why don't you come play with us? We've got room for more in our sandbox!

We are looking for great talent worldwide who can help us build the coolest sandcastle ever! You don't have to be a Java developer, but JVM experience helps. The important part is a talent for using technology in exciting new ways. Neo4j is (to my knowledge) the most widely used graph database in the world. It is an awesome product, but it still has lots of room for improvement, from the low-level internals, to higher-level modeling support, to bindings for different languages. Everything is open source, so the best resume you can send us is a code contribution.

While we are recruiting from anywhere in the world, we are still trying to keep the organization focused around only a few locations, so being able to relocate to the San Francisco Bay Area or Malmö, Sweden is a huge plus. We are looking for product engineers, QA experts, and support engineers to help our customers with everything from domain design to troubleshooting. Versatility is expected: you will not be doing one task only. Product engineering experience makes for better support engineers, and customer support experience makes for better product engineers.

For official details see: http://neotechnology.com/about-us/jobs/

I'm looking forward to playing with you.

Friday, February 18, 2011

Better support for short strings in Neo4j

In the past few days I've been working on a feature in the Neo4j Graph Database to store short strings with less overhead. I'm pleased to announce that this feature is now in trunk and will be part of the next milestone release. In this blog post I will describe what a short string is, and how Neo4j now stores them more efficiently.

At Neo Technology we spend one day each week working on what we call "lab projects", a chance to explore new, experimental features outside of the regular roadmap that might be useful. Two weeks ago I spiked a solution for storing short strings in a compressed way as a lab day project. To understand why, we first need a bit of background on how strings are usually stored in Neo4j. Since strings can be of variable length, Neo4j stores them in something called the DynamicStringStore. This consists of a number of blocks, each 120 bytes in size plus a 13-byte header. A string is divided into chunks of 60 characters, with each chunk stored in its own block (a character is two bytes). For a short string such as "hello", which would occupy only 5 bytes if encoded in UTF-8, the overhead of storing it in the DynamicStringStore (including the 22-byte property record needed to reference the block in the DynamicStringStore) is almost 97 percent!
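To make that figure concrete, here is a small back-of-the-envelope sketch in Python. It only uses the sizes quoted above; the function names are made up for illustration, and it assumes that even an empty string occupies one block and that every character takes exactly two bytes.

```python
BLOCK_PAYLOAD = 120   # bytes of string data per DynamicStringStore block
BLOCK_HEADER = 13     # bytes of header per block
PROPERTY_RECORD = 22  # bytes for the property record that references the block
CHARS_PER_BLOCK = 60  # 120 bytes / 2 bytes per character


def dynamic_store_footprint(text):
    """Bytes used when a string goes through the DynamicStringStore."""
    # Ceiling division; assumption: at least one block is always allocated.
    blocks = max(1, -(-len(text) // CHARS_PER_BLOCK))
    return PROPERTY_RECORD + blocks * (BLOCK_PAYLOAD + BLOCK_HEADER)


def overhead_ratio(text):
    """Fraction of the footprint that is not the UTF-8 content itself."""
    useful = len(text.encode("utf-8"))
    total = dynamic_store_footprint(text)
    return (total - useful) / total


print(dynamic_store_footprint("hello"))   # 22 + 120 + 13 bytes
print(round(overhead_ratio("hello"), 3))  # 150 of 155 bytes are overhead
```

For "hello" that is 155 bytes on disk for 5 bytes of content, which is where the "almost 97 percent" comes from.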

My initial spike analyzed all strings of 9 characters or fewer, and if all characters were found to be 7-bit ASCII, stored the string directly in the property record, without involving the DynamicStringStore at all. The 7-bit part is important. The property record contains a 64-bit payload field, which contains the id of the first block when the DynamicStringStore is involved. Nine 7-bit characters add up to 63 bits, which I can store in the 64-bit payload field. I can then use the high-order bit to denote that the content is a full 9-character string. If it isn't, the high-order bit doesn't get set; instead the first byte denotes the length of the string, and the remaining 56 (7*8) bits are the actual string.
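A minimal sketch of that layout in Python follows. This is not the code from the spike: the function names are hypothetical, and the order in which characters are packed into the payload (first character in the most significant position) is an assumption made for the example.

```python
def encode_short_ascii(text):
    """Pack up to nine 7-bit ASCII characters into a 64-bit payload.

    Returns None when the string does not qualify and would have to
    go to the DynamicStringStore instead.
    """
    if len(text) > 9 or any(ord(c) > 0x7F for c in text):
        return None
    bits = 0
    for c in text:
        bits = (bits << 7) | ord(c)  # assumed order: first char ends up highest
    if len(text) == 9:
        # High-order bit set: a full 9-character string, 63 bits of data.
        return (1 << 63) | bits
    # High-order bit clear: the top byte holds the length (0..8),
    # the low 56 bits hold up to eight 7-bit characters.
    return (len(text) << 56) | bits


def decode_short_ascii(payload):
    """Inverse of encode_short_ascii."""
    if payload >> 63:
        length, bits = 9, payload & ((1 << 63) - 1)
    else:
        length, bits = payload >> 56, payload & ((1 << 56) - 1)
    chars = []
    for _ in range(length):
        chars.append(chr(bits & 0x7F))  # last character sits in the low bits
        bits >>= 7
    return "".join(reversed(chars))


payload = encode_short_ascii("hello")
print(hex(payload), decode_short_ascii(payload))
```

Because a length of at most 8 never sets the top bit of the first byte, the two cases cannot be confused when decoding.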

While this started out as something I thought was a fun project to hack on for a day, we quickly found use for it. When importing the OpenStreetMap data for Germany with this feature in place, we found that the DynamicStringStore was now 80% smaller than before! Not only that, but the time for reading and writing strings had improved by at least 25%! (The benchmark I got this from creates nodes and relationships as well, so pure string operations are probably even faster.) Such figures are great for getting a feature into the backlog.

I am not a big fan of ASCII though. It was designed for communicating with line printers, not for storing text. Also, with short strings the number of exotic characters that people use drops significantly. A short string is more likely to be some simple alphanumerical name or identifier, such as "hello world", "UPPER_CASE", "192.168.0.1", or "+1.555.634.5773". So the next thing I did was to write a tool that could analyze the data stored in actual Neo4j instances and generate a report with statistics on the strings actually stored. I then sent this to our public users mailing list. The feedback confirmed my suspicions about what kind of text people store, and also suggested that we would be able to store up to 65% of our users' strings as short strings.

Armed with statistics about actual strings, I set out (along with my newest colleague, Chris Gioran) to write an even better short string encoding and incorporate it into Neo4j. Last night we pushed it to git. The format we ended up with can select between 6 different encodings, with the choice of encoding stored in the high-order nibble of the payload entry of the property record:

  • Numerical up to 15 characters, stored as binary coded decimal, with the additional 6 code points used to encode punctuation characters commonly used in phone numbers or as thousands separators. This can encode any integer from -10^15 to 10^16 (both edges exclusive), most international phone numbers, IPv4 addresses, etc.
  • All UPPER CASE strings or all lower case strings up to 12 characters, including space, underscore, dot, dash, colon, or slash. Useful for identifiers (Java enum constant names etc.). This doesn't support mixed case though.
  • Alphanumerical strings up to 10 characters, including space or underscore. Supports mixed case.
  • European words up to 9 characters. This includes alphanumerical characters, space, underscore, dash, dot, and the accented characters in the Latin-1 table. Useful for building translation graphs.
  • Latin-1 up to 7 characters. Will give you parentheses if you have those in a short string.
  • UTF-8 if the string can be encoded in 7 bytes (or less). Useful for short CJK strings for example.
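
The selection between the encodings listed above could be sketched roughly like this in Python. This is an illustration, not the Neo4j implementation: the function name and labels are made up, the exact six punctuation characters of the numerical encoding and the exact set of accented letters are guesses, and trying the encodings in the order of the list is an assumption.

```python
import string

DIGITS = set(string.digits)
LOWER = set(string.ascii_lowercase)
UPPER = set(string.ascii_uppercase)
IDENT_PUNCT = set(" _.-:/")  # space, underscore, dot, dash, colon, slash
# Assumption: the post does not name the six extra numerical code points.
NUMERIC_PUNCT = set(" .-+,'")
# Assumption: "accented characters" = Latin-1 letters 0xC0..0xFF, minus the
# multiplication and division signs that sit in that range.
ACCENTED = {chr(c) for c in range(0xC0, 0x100)} - {"\u00d7", "\u00f7"}


def _fits(text, max_len, allowed):
    return len(text) <= max_len and all(c in allowed for c in text)


def select_encoding(text):
    """Return a label for the first listed encoding that fits, else None."""
    if _fits(text, 15, DIGITS | NUMERIC_PUNCT):
        return "numerical"
    if _fits(text, 12, UPPER | IDENT_PUNCT):
        return "upper"
    if _fits(text, 12, LOWER | IDENT_PUNCT):
        return "lower"
    if _fits(text, 10, DIGITS | LOWER | UPPER | set(" _")):
        return "alphanumerical"
    if _fits(text, 9, DIGITS | LOWER | UPPER | ACCENTED | set(" _-.")):
        return "european"
    if len(text) <= 7 and all(ord(c) <= 0xFF for c in text):
        return "latin1"
    if len(text.encode("utf-8")) <= 7:
        return "utf8"
    return None  # too long or too exotic: falls back to the DynamicStringStore


for sample in ["hello world", "UPPER_CASE", "192.168.0.1", "+1.555.634.5773"]:
    print(sample, "->", select_encoding(sample))
```

The length limits of the first three encodings line up with the 60 bits left after the header nibble: 15 characters at 4 bits, 12 characters at 5 bits (26 letters plus 6 punctuation characters is 32 code points), and 10 characters at 6 bits (26 + 26 + 10 + 2 is 64 code points).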

The code is still in internal review, and shouldn't be considered stable until its inclusion in the next milestone release a week from now. But I am very excited about the benefits this will give to Neo4j users, both in terms of lower storage sizes and in terms of performance improvements. Reading (and writing) a string that is encoded as a short string is much faster than reading (or writing) a string in the DynamicStringStore, since it takes only one disk read instead of two.

A big thank you goes out to the people in the Neo4j community who provided me with the string statistics that made this possible.

Happy hacking