Quote Crate Works!

Seriously, this is a big deal to me. About eight months ago Alex Payne wrote a quick blurb about an idea he had for a site but wasn’t going to work on. He called it Quotidian, with a tagline of “A place to store and organize quotes”. The original text of the blurb is on his site as an unfinished idea. When I read it in June of last year I was immediately inspired. I shot him an email that basically said “I love this idea, I want to work on it” and promptly got a response of (essentially) “Have at it.”

So I started, thinking I’d be able to get something up on the relatively (at the time) newly released Google App Engine for Java. This was a little off, as it turned out. A key component of Alex’s idea was being able to search the stored quotes by text or author, and out of the box Google’s App Engine has no text-tokenizing or full-text search capabilities. When I realized this I thought, “no problem, I’ll use Lucene.” I’d used Lucene before, though a much earlier version of it, and it was pretty dead simple to set up and get things indexed.

As it turns out, trying to use Lucene on the App Engine does in fact pose a pretty significant problem. Lucene by default is set up to write its indexing data out to the file system, which is simply not possible in the sandboxed App Engine JVMs. Since I was somewhat in search of a project at the time anyway, I decided to implement the Lucene API to write to the App Engine datastore. I spent a lot of nights poring over Lucene API/Javadoc documentation trying to figure out what the expected results were when performing certain operations, trying to piece together the flow through the library and trying to understand exactly how my code was being called by the guts of Lucene.
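
To give a rough feel for what storing index files in a datastore instead of on disk involves, here is a minimal plain-Java sketch. It is not the Quote Crate code, and it uses neither Lucene nor the App Engine API: the class name ChunkedStoreSketch, the in-memory maps standing in for the datastore, and the "fileName#chunkIndex" key scheme are all made up for illustration. The chunking is there because datastore entities have a size limit, so a large index file has to be split across several of them.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustration only: a hypothetical chunked "file" store, not Lucene or App Engine code.
class ChunkedStoreSketch {
    private final int chunkSize;
    // Stand-in for the datastore: key is "fileName#chunkIndex", value is that chunk's bytes.
    private final Map<String, byte[]> chunks = new HashMap<>();
    // How many chunks each file currently occupies.
    private final Map<String, Integer> chunkCounts = new HashMap<>();

    ChunkedStoreSketch(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    // Splits the data into fixed-size chunks and stores each one under its own key.
    void write(String fileName, byte[] data) {
        // Drop chunks left over from a previous version of the file.
        Integer old = chunkCounts.get(fileName);
        if (old != null) {
            for (int i = 0; i < old; i++) {
                chunks.remove(fileName + "#" + i);
            }
        }
        int count = 0;
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int end = Math.min(offset + chunkSize, data.length);
            chunks.put(fileName + "#" + count, Arrays.copyOfRange(data, offset, end));
            count++;
        }
        chunkCounts.put(fileName, count);
    }

    // Reassembles the file by reading its chunks back in index order.
    byte[] read(String fileName) {
        Integer count = chunkCounts.get(fileName);
        if (count == null) {
            throw new IllegalArgumentException("no such file: " + fileName);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < count; i++) {
            byte[] chunk = chunks.get(fileName + "#" + i);
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    // Number of chunks stored for the file, or 0 if it was never written.
    int chunkCount(String fileName) {
        Integer count = chunkCounts.get(fileName);
        return count == null ? 0 : count;
    }
}
```

With a chunk size of 4, writing an 11-byte file stores three chunks, and reading it back reassembles them in order. Every read and write touching several stored chunks is also a plausible reason this kind of approach ends up slow.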

For a while this drove me a little crazy, but I eventually got something together that passed (slightly) modified versions of Lucene’s unit tests for IndexInput and IndexOutput implementations. I integrated it with the code to save quotes and threw it up on the App Engine servers. Once it was there I noticed something: it was reaaallllllly slow. Knowing I still had a ways to go with the rest of the application, I figured this was OK and left it alone.

Then my (former) day job at AKQA New York got crazy, I mean super busy. For months. And I had learned enough Scala (the other reason I took on this project) that my motivation to work on it simply disappeared. Time passed, I carried on with my life and started a different Scala project. Recently things changed a bit. I saw an opportunity at Arc90; they were looking for a Java engineer. I decided to give it a shot and they brought me in for an interview, where they asked me about things I’d worked on in Java, and I showed them Quote Crate without thinking. Fortunately, as I discovered later, it had lain completely dormant since my last batch of testing, and one of the guys submitted a quote. It was slow, but I knew it would be. He said “cool” and we moved on, but the encounter got this project back on my radar.

A week or two passed, Arc brought me on and I started thinking about this code again and what I had wanted to do with it when I started. So I went back and tested it out again, firing it up in the dev server and adding a bunch of comments to see what happened. BOOM!! 500 Error. Stack Trace. Looking at the error I saw it was in the Lucene indexing stuff I wrote and thought, shit. After some investigation I realized it blew up when adding the tenth (10th) quote. Yeah, that’s not very many, I know.

So I created a new branch, commented out the part where quotes get added to the Lucene index, called it stable and pushed it up. The stack traces were gone but now I was haunted by what was causing them. I had thought this problem solved, so I started digging again. Foolishly, I waited until recently to create a unit test (or spec) for the situation. Once I did, I saw a different problem when I ran my tests: something about the thread not having a datastore environment set up. I thought to myself “Nothing should be creating threads” and went on a hunt. It turns out Lucene periodically merges index data once the index gets sizeable enough, and this was spawning a background thread to handle it in an attempt to not bog down the application. This is generally a good idea, except you’re not supposed to be able to manually spawn threads in the App Engine environment. Maybe this was the problem. I changed the MergeScheduler used by the IndexWriter to be serial rather than concurrent and the errors have, so far, disappeared.
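
The serial-versus-concurrent difference is easier to see in isolation. The sketch below is plain Java and deliberately does not use Lucene: SerialExecutor and ThreadPerTaskExecutor are hypothetical stand-ins for the two scheduling styles, and the "merge" is just a task that records which thread ran it. The serial style does the work on the thread that asked for it, so no new thread is ever created, which is the property that matters in an environment that forbids spawning threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

// Illustration only: plain-Java stand-ins, not Lucene's scheduler classes.
class MergeSchedulingSketch {

    // Runs each task immediately on the calling thread (the "serial" style).
    static final class SerialExecutor implements Executor {
        @Override
        public void execute(Runnable task) {
            task.run();
        }
    }

    // Starts a new background thread for each task (the "concurrent" style).
    static final class ThreadPerTaskExecutor implements Executor {
        @Override
        public void execute(Runnable task) {
            new Thread(task, "background-merge").start();
        }
    }

    // Submits a pretend merge and reports whether it ran on the caller's own thread.
    static boolean runsOnCallingThread(Executor scheduler) {
        AtomicReference<Thread> worker = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        scheduler.execute(() -> {
            worker.set(Thread.currentThread());
            done.countDown();
        });
        try {
            // Wait until the pretend merge has finished, whichever thread ran it.
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for merge", e);
        }
        return worker.get() == Thread.currentThread();
    }

    public static void main(String[] args) {
        System.out.println("serial runs on caller: "
                + runsOnCallingThread(new SerialExecutor()));
        System.out.println("concurrent runs on caller: "
                + runsOnCallingThread(new ThreadPerTaskExecutor()));
    }
}
```

The trade-off is that with the serial style the caller waits while the merge runs, so adding a quote can occasionally take noticeably longer.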

I’ve pushed up a v2 of the Quote Crate code and though it’s still slow it doesn’t, to my knowledge, throw errors. The best part of this though is I’m mired in the quotidian code again and really excited to keep on improving it and adding the missing features to the application.

Thank you Alex for the inspiration. Thank you Chris Dary and Avi Flax at Arc90 for inspiring me to bring this application back from the dead.

Guerrilla career move: send your resume in its present form to the longest shot company that you’ve always wanted to work for.
/via @rands

Concurrency and Me

I got to play with the java.util.concurrent package today and I have to admit that, after some of the lessons I’ve learned in Scala, it didn’t seem nearly as scary as people make it out to be. Of course, I suppose part of the danger is exactly that it doesn’t seem difficult, but I’m optimistic.

The reason I was playing with it at all is that I’m trying to export data from Google Sites, but the current exporter was producing broken links and a lot of HTML cruft. Admittedly the cruft is most likely from the generated markup and not the export, but I wanted to try and clean it up anyway. Since it was open source I pulled it down (and imported it into Git from Mercurial) and started hacking away.

The easiest way I could think of to do XML manipulation was to include Scala and use its built-in libraries. After I did that I noticed the export was ‘hanging’, so I looked at the processes and saw it was only using the main thread. I hacked in some multi-threaded support with Executors. It isn’t perfect and I broke some tests, but I can export most of my data in a timely fashion now. The work I did on this is visible on GitHub.
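
For anyone who hasn’t used Executors, the general shape looks something like the sketch below. This is not the exporter’s code: ParallelExportSketch, exportPage and the page names are hypothetical placeholders, and the real per-page work (fetching and cleaning up markup) is replaced by a trivial string operation. The point is only to show a fixed-size thread pool from java.util.concurrent handing out one job per page and collecting the results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: a hypothetical parallel export, not the actual exporter.
class ParallelExportSketch {

    // Placeholder for the real per-page work (fetch, clean up markup, write to disk).
    static String exportPage(String pageName) {
        return "exported:" + pageName;
    }

    // Exports every page on a fixed-size thread pool and returns results in input order.
    static List<String> exportAll(List<String> pageNames, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> jobs = new ArrayList<>();
            for (String name : pageNames) {
                jobs.add(() -> exportPage(name));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until all jobs complete; futures come back in submission order.
            for (Future<String> future : pool.invokeAll(jobs)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("export interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a page failed to export", e.getCause());
        } finally {
            // Always release the worker threads, even if a job failed.
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> pages = new ArrayList<>();
        pages.add("home");
        pages.add("about");
        System.out.println(exportAll(pages, 2));
    }
}
```

Shutting the pool down in a finally block matters: worker threads that are left running can keep the JVM alive, which looks a lot like a hang.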

D’oh: GAELucene

As I said just a bit ago I took a look at GAELucene and I guess I should have spent more time reading than writing because I noticed this bit on the project home page when I checked out the source code.

The GAEDirectory is read only, that is, you can not use the Directory to build index! You should do indexing on another machine, then push the indices onto google appengine datastore with LuceneIndexPushUtil.

Because of the quota limitation of google appengine, GAELucene is not fit to run with huge indices, it does better for small indices, around 100Mb. For large changing indices, you need to find other solutions.

Since I am looking to use a changing index I fall into the category of “need to find other solutions.”

Sorry Google, I jumped the gun.

I recently tweeted that something broke when I updated my Quote Crate application from version 1.2.5 of the AppEngine SDK to version 1.3.1. As it turns out this was unfair to Google (sorry). It doesn’t appear to have had anything to do with the SDK change but rather my poor implementation of Lucene for the AppEngine datastore.

Since the data in my application was all test data, I wiped it and tested the application again. Much to my surprise, starting with no data, entries saved without a problem until I had nine entries. After the ninth entry it started dying again. This, along with the fact that it still failed after rolling back to 1.2.5, makes me pretty certain it’s my code and nothing to do with the AppEngine.

So it seems it’s time to go back to the drawing board on how that all operates, but it looks like someone else has tackled the Lucene on AppEngine problem in the form of GAELucene on Google Code. I’m going to dig through this a bit to see how it works and whether I can take some inspiration from it to clean up my code.