I thought it was about time to step out of the coding closet where I’ve spent almost a year now. My name is Jon Kågström. I’ve had my own company since 1999, and I have a Master’s degree in Computer Science with AI as my major. My master’s thesis in 2004 improved a naive Bayesian method for spam classification, and since then I’ve been more or less obsessed with classification. Since 2004 I’ve been refining, improving and optimizing the results of my thesis further. I implemented the algorithm in a free spam filter, Cactus Spam Filter. However, the world of spam is quite limited (either spam or not) and I wanted to move on to something new. In December 2007 I contacted Mattias Östmar after he spoke at Daytona, with this e-mail:
“A friend saw your talk at Daytona and got the impression that you were missing technical tools for automatically classifying documents into different categories. […] If you’re interested, just give me a shout!”
Mattias got back to me and we set up a meeting. I had no idea what to expect, other than that it had something to do with text classification and some hippie personality stuff. We decided to find out whether Mattias’s hypothesis, “given a bunch of texts tagged with different personalities, can a classifier learn to tell previously unseen documents apart?”, held up and, if so, to what degree.
Testing Mattias’s hypothesis in the lab
For two months we ran tests on training data from Mattias. The training data he provided were blogs labelled with strange letters, INTJ, ESFP and so on. I could not see any obvious pattern myself. I didn’t understand, nor did I care, what the heck those letters were, only that they represented different personalities. Instead I trained the classifier on his labelled documents and ran 10-fold cross-validation tests to find out whether the computer was any better than me at those mysterious letters. To my surprise, the computer did well: it correctly categorized about 75% of the documents in our initial tests. These findings showed that his hypothesis was achievable. So we spent many weeks trying to figure out how to improve accuracy, in terms of both classifier technology and training data. Looking through the results, we found that there was a good amount of noise in the corpus. We did a couple of iterations on the training data, removing noise and adding more documents, and improved the performance further (to about 80% at the time).
Having achieved that performance basically means, as the classifier would have put it herself:
“Given that the corpus is a good representation of real-world data, I am able to take any real-world document and give it the correct label with a probability of 80%.”
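To make the 10-fold cross-validation mentioned above concrete, here is a minimal sketch in Python. It is not the classifier from my thesis or the one behind Typealyzer: it assumes a plain multinomial naive Bayes with Laplace smoothing and whitespace tokenization, and the function names (`train`, `classify`, `cross_validate`) are made up for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def train(docs):
    """Count labels and words from a list of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing so unseen words do not zero out a label.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, k=10, seed=0):
    """Split docs into k folds, train on k-1, test on the held-out fold."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    folds = [docs[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        training = [d for j, fold in enumerate(folds) if j != i for d in fold]
        model = train(training)
        correct += sum(classify(model, text) == label for text, label in held_out)
    return correct / len(docs)  # fraction of documents labelled correctly
```

Every document is classified exactly once, by a model that never saw it during training, so the returned fraction is the kind of accuracy figure quoted above.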
So how can we know how good a representation our training data is of the real world? Even though there are thousands and thousands of blogs in the training data, it may be biased. What better way than to let everyone “out there” run it on their blog and see what response we get? On this premise we decided to build a prototype, our first version of Typealyzer, where everyone could test their blog while we listened to their reactions. Typealyzer was well received, and we could use the feedback to get more and better training data (we are still improving the training data today, and we now have above 90% accuracy on most classifiers). By the time our first prototype was finished in February, even I had started to learn about the cryptic four-letter combinations and to see the patterns myself. We kept on developing and prototyping different models over the next months.
Inspired by Mattias’s “think big”, I got the idea to build a general-purpose classifier while we were at it. As the hard part, the classifier itself, was already working fine, we “only” needed to build a robust, parallel, scalable, high-performance classifier server. Since I was already working a lot at DICE, I handed some of the tasks to Roger Karlsson (who currently also makes computer games at Avalanche Studios) and Emil Kågström (my brother).
Getting it right
Building a classifier server turned out to take a lot of time, especially with our criteria. We needed to make sure that it was really stable, the same way SQL servers are (or should be). I was really worried that handling every possible out-of-memory exception that could occur (and will occur) would take more time than we had. But Roger pushed for doing it right, and so we did. Testing it on a 512 MB VPS without any page file turned out to be really exciting, as we could identify many flaws in our code, and even in Microsoft’s C++ STL library. Making the server transactional (making sure that no classifier is left in an undefined state if the server is turned off while it is being written to) was also challenging and took a lot of time. In April my project at DICE was over and I decided to quit and work full time on developing the classification server with PRfekt. This work has been going on until today, and the beta fruits can be harvested by everyone for free at uclassify.com. If anyone is interested in having their own classifier server (for processing huge amounts of data), please contact Björn Gustafsson.
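For the curious, the idea behind the transactional writes can be sketched in a few lines of Python. This is not how our server does it (that is C++ and considerably more involved); it only illustrates the common write-to-a-temporary-file-then-rename pattern. The function names and the JSON format are assumptions made up for this example.

```python
import json
import os
import tempfile


def save_classifier(path, state):
    """Write state so that path always holds either the old or the new version."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # Swap the new file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the half-written temporary file; the old file is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_classifier(path):
    """Read back a classifier state written by save_classifier."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

If the process dies before the rename, the old classifier file is still intact; if it dies after, the new one is complete. A half-written classifier is never visible under the real file name.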
Where we are at
Shortly after I quit my job at DICE, Ragnar Eklund and Björn Gustafsson joined PRfekt. Ragnar started to work on our tool and Björn took on the role of CEO. Over a five-month period Ragnar did an incredible job with our tool, visualising the yummy data, and Björn took control of all the loose ends in the company and improved the company structure a lot. Mattias has, as always, been coming up with new cool ideas that we have implemented (not all of them, that would be crazy).
- Today we have a tool that can be used in real time: shortly after a profile is added, we can see the charts building up.
- We have a robust classification server that processes 100 blog posts/second on five different classifiers. Since it is parallelized and we have 8 cores, we can theoretically classify about 2.8 million blog posts per hour.
- We have a constantly growing database, today with around 3000 profiles ranging from music artists to KIA sites.
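The throughput figure in the list above is simple arithmetic. A tiny sketch, assuming that the 100 posts/second is per core and that the work scales perfectly across all 8 cores (which is why it is only a theoretical number); the function name is made up for illustration:

```python
def posts_per_hour(posts_per_second_per_core, cores):
    """Theoretical hourly throughput, assuming perfect scaling across cores."""
    seconds_per_hour = 3600
    return posts_per_second_per_core * cores * seconds_per_hour


print(posts_per_hour(100, 8))  # 2880000
```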
Back to the closet!