Update on Detexify

Hello again. Sorry I've not been very responsive to emails lately. I wanted to get this preview of the next version of Detexify out the door before doing anything related to the current version. So here it is:

What's inside? Recognition is done on the client side in JavaScript. Symbols are now an abstract concept with representations that can be LaTeX or Unicode. Yes, Unicode support at last. To train symbols you need to log in (currently only possible via Twitter; I'll add other OAuth providers later). Logging in also lets you vote on samples. The best samples for each symbol are used for recognition.
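To illustrate the idea of a symbol as an abstract concept with several representations, here is a minimal sketch in Python. It is not the actual Detexify code: the class names, the `kind` labels "latex" and "unicode", and the `render` helper are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Representation:
    kind: str   # hypothetical label: "latex" or "unicode"
    value: str  # e.g. r"\alpha" or the character itself


@dataclass
class Symbol:
    # The symbol is the abstract concept; representations are how it is written.
    name: str
    representations: List[Representation] = field(default_factory=list)

    def render(self, kind: str) -> Optional[str]:
        """Return the first representation of the given kind, or None."""
        for rep in self.representations:
            if rep.kind == kind:
                return rep.value
        return None


alpha = Symbol(
    "alpha",
    [Representation("latex", r"\alpha"), Representation("unicode", "\u03b1")],
)
```

With this shape, `alpha.render("latex")` gives the LaTeX command and `alpha.render("unicode")` gives the Greek letter, while an unknown kind simply yields `None`.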

I'll be adding more symbols over the next few days to see how it scales.

What's to come? Offline symbol recognition via http://www.w3.org/TR/offline-webapps/. Interface tweaks. The source code on GitHub.

P.S.: Thanks, everyone, for the financial support. Special thanks go to my employer Zweitag for paying for my AWS account.


Detexify needs help

Hi everyone! I am Daniel. I created Detexify (detexify.kirelabs.org), a tool that tries to identify handwritten LaTeX symbols. The first version was written as a side project over a year ago. In autumn of last year I wrote my diploma thesis with Detexify as its topic. While writing it I did research on symbol recognition and tweaked the recognition algorithms. Detexify works reasonably well now. It is used by over 1500 people on an average work day. I am really glad that I was able to create something that improves a few people's lives by saving them some time, because time is the most precious thing we have.

I've since left university and landed a nice job in software engineering*. I work only 32 hours a week, which leaves me time for projects like Detexify. Of course I am not getting paid for that. A new version is actually on the way because the current version has terrible scaling problems.

The root of the problem is that recognition runs on a single server and involves a lot of computation. In addition, booting and training a new recognition server takes a few hours, which makes it impossible to react to sudden spikes in the number of visitors. Thus every time a social news aggregator (or a famous blog like http://rjlipton.wordpress.com/2010/12/20/some-mathematical-gifts/ ) had Detexify on its front page, I had no choice but to take Detexify offline.

Until recently a pretty small server was enough to handle the average load of regular users. When recognition became slower and slower under the growing number of requests, I replaced that machine with a much faster one that has lots of computing power. Unfortunately that server also comes with a different price tag.

The price for the small server was almost always covered by donations from users of Detexify. That is not the case any more. The bill for the last month (Heroku, EC2 and S3) was over $200. Therefore I am now looking for sponsors: someone who wants to pay the hosting bill. I would also still be happy about donations to cover the costs of the last two months. I think I could downgrade to the next smaller EC2 instance, but that would still mean $100 each month. If you think you can help, please contact me!

danishkirel@gmail.com

Click here to lend your support to Detexify and make a donation

(Yes, I don't like PayPal either, but I still have not found an alternative. And posting my BIC and IBAN here doesn't feel right.)

A few words on the next version of Detexify: I hope I can solve the two main problems the software has right now, which are scaling and the quality of the training data. The first problem will be solved by moving the processing to the client. That will also enable Detexify to work offline (the recognition part of it at least). The second problem will be tackled with a community approach. The users of Detexify will be able to rate training samples, and only the best samples will be used for recognition.
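The community approach to training data could look roughly like the following Python sketch. It assumes each sample carries upvote and downvote counts and that "best" means the highest net score; the `Sample` fields, the `best_samples` function and the `per_symbol` cutoff are hypothetical names, not something taken from the Detexify code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    symbol: str       # which symbol this drawing is a sample of
    sample_id: str
    upvotes: int = 0
    downvotes: int = 0

    @property
    def score(self) -> int:
        # Assumed rating: net votes. Other scoring rules are possible.
        return self.upvotes - self.downvotes


def best_samples(samples: List[Sample], per_symbol: int = 2) -> Dict[str, List[Sample]]:
    """Keep only the top-rated samples for each symbol."""
    by_symbol: Dict[str, List[Sample]] = {}
    for s in samples:
        by_symbol.setdefault(s.symbol, []).append(s)
    return {
        symbol: sorted(group, key=lambda s: s.score, reverse=True)[:per_symbol]
        for symbol, group in by_symbol.items()
    }


samples = [
    Sample("alpha", "a1", upvotes=5, downvotes=1),
    Sample("alpha", "a2", upvotes=0, downvotes=3),
    Sample("alpha", "a3", upvotes=2, downvotes=0),
    Sample("beta", "b1", upvotes=1, downvotes=0),
]
best = best_samples(samples, per_symbol=2)
```

Here the badly rated `a2` drops out of the alpha samples, so only `a1` and `a3` would be handed to the recognizer.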

Because the heavy lifting will happen on the client there will also be no need for big servers and I will have a smaller hosting bill.

*I am working at Zweitag GmbH in Münster, Germany. Awesome folks. You should hire them!


Gödel's Lost Letter and P=NP

Wow - it's an honor to be mentioned in Dick Lipton's famous blog
( http://rjlipton.wordpress.com/2010/12/20/some-mathematical-gifts/ ).
Unfortunately the effect was very similar to the reddit effect a few
weeks ago. I will try to bring the computation to the client side to
avoid these scaling problems. Stay tuned and again sorry for the
inconvenience!

Update: I'll be setting up a bigger backend server this evening.

Update: The new backend server is an EC2 High-CPU Medium (M) Instance. It is already active and currently being trained. Detexify will be operating normally within the next two hours.


Reddit

Really sorry for the inconvenience but reddit found us. Again. The
architecture can't handle the load. I am working to get everything up
and running again.

And I need to think about doing all the processing on the client side...

Update: I've decided to just wait until the reddit madness has ended and meanwhile ponder a real architectural overhaul instead of just setting up a bigger server. I hope you have Detexify back by tomorrow :|


Haskell Backend

I have finally got the new backend, written in Haskell, into a usable
state. The source code is available at
https://github.com/kirel/detexify-hs-backend . I am training the new
backend right now and will switch out the servers as soon as it is
finished. Internal benchmarks suggest a modest increase in recognition
accuracy and it should be faster too. I will keep the old server
around for a few days until I am sure that the new one performs fine.

Technical background: Apart from the I/O code it has been a pleasure
to program in Haskell. I have also written a version in Clojure but
that one was too slow, probably due to my lack of understanding of the
JVM. Haskell is blazingly fast but of course I had a memory leak due
to excessive laziness.
I am still using nearest neighbor search but switched the distance
measure to a different algorithm. Strokes are now compared using a
greedy version of dynamic time warping.
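
For readers unfamiliar with the technique, here is a small Python sketch of nearest neighbor search with a greedy dynamic time warping distance. It is one plausible greedy variant and not a port of the Haskell backend: instead of filling the full DTW table, it walks from the first pair of points to the last and always takes the cheapest of the three possible next steps. The names `greedy_dtw` and `classify`, the single-stroke input and the Euclidean point distance are assumptions for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dist(p: Point, q: Point) -> float:
    """Euclidean distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def greedy_dtw(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Greedy approximation of the DTW cost between two strokes."""
    if not a or not b:
        raise ValueError("strokes must be non-empty")
    i, j = 0, 0
    cost = dist(a[0], b[0])
    # Advance until both strokes have reached their last point.
    while i < len(a) - 1 or j < len(b) - 1:
        candidates = []
        if i < len(a) - 1 and j < len(b) - 1:
            candidates.append((dist(a[i + 1], b[j + 1]), i + 1, j + 1))
        if i < len(a) - 1:
            candidates.append((dist(a[i + 1], b[j]), i + 1, j))
        if j < len(b) - 1:
            candidates.append((dist(a[i], b[j + 1]), i, j + 1))
        # Greedy choice: take the locally cheapest step, never look back.
        step, i, j = min(candidates)
        cost += step
    return cost


def classify(stroke: Sequence[Point], training: List[Tuple[str, Sequence[Point]]]) -> str:
    """Nearest neighbor: return the label of the closest training stroke."""
    label, _ = min(training, key=lambda item: greedy_dtw(stroke, item[1]))
    return label


training = [
    ("slash", [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]),
    ("dash", [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]),
]
query = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]
result = classify(query, training)
```

The greedy walk costs time linear in the combined stroke lengths rather than their product, at the price of possibly missing the optimal alignment. Real symbols usually consist of several strokes, which this sketch ignores.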


2500 alphas

This is the exciting work I am doing right now. Killing bad training
data. Awesome! Not.


The road ahead

Detexify 2.0 — or 1.0?

I have hesitated for some time now to add major new features to the current version of Detexify. That is because I have felt there is something wrong with the architecture. And although the current version is 2.0 it is using a web 1.0 architecture.

I have had some requests from users for an offline version of Detexify, but with the current architecture that demand is not easy to meet. To run Detexify one needs the frontend server, the backend server and the database. What is great about this setup is that components can be switched out easily. I have recently rewritten the backend in Haskell (which is much faster than Ruby) and could swap the current classification server out for the new one without problems. This is convenient for me but not convenient for anyone who wants to run Detexify locally.

Another annoyance is that I have to manage three servers. Currently the frontend is hosted at Heroku (awesome!) and the database and backend are running on Rackspace cloud servers. The expenses add up, too.

Detexify 3.0 (the better 2.0)

HTML5 to the rescue

But is there a better way? There sure is! Enter the world of HTML5 where everything is possible. Detexify 3.0 will work offline in the browser. When the user opens Detexify, the current relevant preprocessed training data will be downloaded as JSON (or the previously downloaded cached file will be used) and any classification task will be run in an HTML5 web worker.

CouchDB’s true powers

Now I have only the frontend, which is static HTML and JavaScript, and the database left. But wouldn’t it be awesome if I could serve the app directly from the database? As it turns out, CouchDB can do just that, and more…

So you want to have a full local copy of Detexify? Set up your own CouchDB and replicate it from my CouchDB. I know this all sounds crazy, but if you really want to know what the hell I am talking about, read up on it in the CouchDB book.

Detexify mobile

There will be a mobile version of Detexify that uses the same techniques and is pure HTML5. The current iPhone and Android versions will be deprecated. Actually, I hope that I can just provide another stylesheet for small screens and implement touch events for the canvas. This also ensures that every platform gets the same features, as everyone is accessing the same app after all.

Disclaimer

Doesn’t this sound amazing? Unfortunately all of it is just hot air. No code has been written. Everything is just an idea. And since symbol recognition is essentially number crunching and JavaScript isn’t known for its blazing speed, I am not sure the proposed architecture is even capable enough.

But I am convinced that this is the way to go and therefore I will not add anything new (apart from requested symbols) to Detexify 2.0 and pursue this vision. I hope things will turn out well.


The blog has moved

Hi and welcome to the new location of the Detexify blog. The old blog can be found at http://detexifyblog.kirelabs.org. I have not moved anything over. Read old posts over there and new posts here.
