Comparing clusterings

Monday, December 13th, 2010

I was looking at how to compare the clustering results of (at least) two algorithms, and took Wagner & Wagner’s Comparing clusterings: An overview as a starting point; it appeared to be a longer and less useful version of Meila’s 2002 paper. Anyway, in short, I decided to go with the latter’s suggestion of using Variation of Information (VoI) as a measure. My actual problem is that I have a bunch of data, I run an algorithm on it, and the results are essentially clusters. Thus, I need a systematic way of evaluating how ‘good’ these clusters are. VoI will hopefully be useful, as it can indicate which sets are the best ones for me to look at by hand (and make some sort of interpretation of).

I wrote a little Python script (which took far too long, mostly ‘cos I had to re-learn how to program after not having done much in about two years), so if anyone wants to borrow it, feel free to contact me. I’ll post a sample calculation of VoI at some point too.
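In the meantime, here is roughly what the calculation looks like. This is only a minimal sketch, not the script mentioned above: it assumes each clustering is given as a list of labels, one per data point and in the same order, and the function name variation_of_information is made up for illustration. It uses the standard definition VoI(A, B) = H(A) + H(B) - 2 I(A; B), written in the equivalent form H(A|B) + H(B|A), with natural logarithms.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of Information between two clusterings, in nats.

    labels_a[i] and labels_b[i] are the cluster labels that the two
    clusterings assign to data point i (an assumed input format).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same data points")
    n = len(labels_a)
    if n == 0:
        raise ValueError("need at least one data point")

    # Cluster sizes in each clustering, and sizes of their pairwise overlaps.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n          # fraction of points in both cluster a and cluster b
        p_a = count_a[a] / n     # fraction of points in cluster a
        p_b = count_b[b] / n     # fraction of points in cluster b
        # H(A|B) + H(B|A) = -sum p_ab * [log(p_ab / p_b) + log(p_ab / p_a)]
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


if __name__ == "__main__":
    # Same partition with different label names: the distance is zero.
    print(variation_of_information([0, 0, 1, 1], ["x", "x", "y", "y"]))
    # Two partitions that tell you nothing about each other.
    print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

A value of zero means the two clusterings are the same partition (label names don’t matter), and larger values mean they disagree more. Note that this compares two clusterings with each other; it says nothing on its own about whether either one is ‘good’.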

On a slightly (un)related note, I am getting tired of writing damn little Python scripts… little things require little scripts, which require a little more time… little + little = big, like few + few = lots. Grr…