Archive for the 'research' Category


Network Data sources

Thursday, July 19th, 2012

Data, data, data: it’s so much easier (and more reassuring) to use data that has been preprocessed and previously analysed. Here are some nice sources:

  • KONECT – The Koblenz Network Collection. These are properly big networks, ranging from a few tens of thousands of nodes to a few tens of millions. The page is laid out by type of network, e.g. Authorship, Lexical, Ratings etc., with neat symbols beside each name showing what type of graph it is: bipartite, directed, weighted, temporal, etc.
  • Social network datasets – a page of datasets associated with a social network analysis course. Most of the datasets were originally collected by sociologists studying the behaviour of animals/people, so they are understandably small (<100 nodes). However, they are very useful, as they are well studied and the ground truth is, well, well-grounded XD The format is less pretty – the layout is old-school, with links named after the authors/collectors of each dataset that jump to the section of the page describing it.
  • Clique Datasets – a list of links to datasets created by Clique researchers and made available for academic use. A smaller collection, but well documented, with links to related publications and blog posts (where they exist).
  • Stanford Large Network Dataset Collection, as well as their Web and Blog datasets – a bunch of which are also part of the KONECT collection. These are mostly large datasets (again ranging from tens of thousands to tens of millions of nodes) from Jure Leskovec’s page for the Stanford Network Analysis Package (SNAP).
  • And to get a bit meta, a couple of links that are themselves lists of links.

Plotting Heatmaps in R

Tuesday, January 24th, 2012

I recently had to create a heatmap visualisation as part of the results in a paper we submitted to a conference, and as it took far more time than I had anticipated, I figured it was worth documenting. My first port of call was obviously Sgt. Google, and the first hit was How to Make a Heatmap – a Quick and Easy Solution, which I naturally liked since the sample dataset was a basketball stats one 😀 However, I quickly realised that this was not enough for what I wanted – my x-axis showed time, so instead of nice, fat blocks, my heatmap showed thin, coloured lines. Another problem I ran into was interpretation. I initially had something like this:

n <- 50
matrix_to_be_plotted <- rnorm(n*n)  # generate 50 x 50 = 2500 random numbers
dim(matrix_to_be_plotted) <- c(n,n) # change the vector into a matrix of dimension 50 x 50

heatmap(matrix_to_be_plotted, # as the name suggests, the matrix of the data to be plotted
  # scale is important; I did not realise this at first and spent an evening wondering
  # why data values did not match what I was describing in the heatmap (I'd assumed
  # the darker the shade, the higher the value). You can have the colours scaled by
  # row or by column, and the default is by row. This is so important in my opinion
  # that I'll quote the documentation: "character indicating if the values should be
  # centered and scaled in either the row direction or the column direction, or none.
  # The default is 'row' if symm false, and 'none' otherwise."
  scale = "row",
  main = "Greyscale heatmap - squares", # name/title of the figure
  Colv = NA, Rowv = NA, # set to NA, or the columns/rows of the matrix get reordered into hierarchical clusters according to R's dendrogram function
  margins = c(3,3), # column/row margin space; the higher the number, the more space you get
  cexCol = 1, cexRow = 1, # size of column/row labels; 1 is the default
  col = grey(seq(1, 0, -0.01)) # greyscale, a sequence from 1 (white) down to 0 (black) in steps of 0.01, so higher values come out darker
)

which was a greyscale version of what I wanted. You can read the documentation for the parameters of R’s heatmap, but my own interpretation of each parameter, in the context of this example, is in the comments above.

Here’s a picture of what should come out (tips on Saving Plots in R):

Not bad: Example of Greyscale heatmap, square matrix

This doesn’t look so bad, and if this is all you need, great. However, here’s a rectangular matrix, i.e. more similar to what I originally had, which can be emulated by changing the dimensions like so:

matrix_to_be_plotted_thin <- matrix_to_be_plotted # same 2500 values as before...
dim(matrix_to_be_plotted_thin) <- c(n/2, 2*n)     # ...reshaped into a 25 x 100 matrix
colnames(matrix_to_be_plotted_thin) <- rep(" ", 2*n)
colnames(matrix_to_be_plotted_thin)[seq(1, 2*n, 3)] <- paste("Wk", seq(1, 2*n, 3))

The column names are changed so they don’t get so squished on the x-axis, and the heatmap call itself is unchanged apart from the matrix (see the sketch below):
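
A minimal sketch of that call, assuming the same parameters as the square version (the original post doesn’t show it explicitly):

heatmap(matrix_to_be_plotted_thin, # the reshaped 25 x 100 matrix from above
  scale = "row",
  Colv = NA, Rowv = NA,
  margins = c(3,3),
  cexCol = 1, cexRow = 1,
  col = grey(seq(1, 0, -0.01))
)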

Kind of ugly: Example of Greyscale heatmap, rectangular matrix

This is less visually appealing. Moreover, it didn’t suit my graph, because mine looked like this:

Yuck: My own graph – greyscale insufficient

Part of the problem was that I hadn’t set the scaling properly – by default it was row-scaled, but I was describing it as if it were column-scaled. In retrospect, though, it still wouldn’t have worked even with the correct scaling, because as a figure taking up barely one-sixth of a page it was fairly difficult to read when printed. (I couldn’t figure out how to get a border around the graphic either – there’s an untested idea in the sketch below – so if anyone knows better, please comment and let me know!) Anyway, my Saturday morning thus became an online Google image and documentation hunt with keywords of R, heatmap, image, matrix image, visualisation, <what have you>, and I finally converged on using an upgraded heatmap (the heatmap.plus library) together with gplots – to prettify the graph with custom colours.
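
Two quick asides before moving on. First, the easiest way I know to sanity-check the scaling is to turn it off entirely, so the colours map straight onto the raw values; a minimal sketch:

heatmap(matrix_to_be_plotted,
  scale = "none", # no per-row centering/scaling: colours now map directly to the raw matrix values
  Colv = NA, Rowv = NA,
  col = grey(seq(1, 0, -0.01)) # so darker really does mean higher
)

Second, on the border question: I haven’t verified this, but base heatmap has an add.expr argument that is evaluated right after the image is drawn, so something like this ought to outline the heatmap panel:

heatmap(matrix_to_be_plotted,
  scale = "row",
  Colv = NA, Rowv = NA,
  col = grey(seq(1, 0, -0.01)),
  add.expr = box(lwd = 2) # untested: box() here should draw a border around the image panel
)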

The apparently Easy Guide To Drawing Heat Maps To PDF With R (With Color Key) was a great starting point, but ultimately I found the full heatmap.2 documentation, combined with this colorRamp pdf, the most useful, as they actually explained what you need to do to customise.

A slight detour – colour

I am a bit particular about my colours, both from an explanatory viewpoint and from an aesthetic one. I utterly despise bad graphs, and badly-coloured ones even more so. In academia (or any good documentation that needs to survive printing), the best thing to do is to make graphs free of colour, so that they still make perfect sense when printed in greyscale, and I do prefer that. However, when colour is needed, I think it’s important to get it right – e.g. the most common type of colour blindness is red-green, so avoid using those two for distinguishing things.

This arbitrary compulsive requirement of mine led me to actually create my own palette for my heatmap… (Yay, as if I don’t have enough ways to spend my time.) More importantly, graphs are supposed to be a more efficient way of explaining data, not made just for their own sake. If I am going to use colour in a graph, there should be a reason for it, and it shouldn’t require more text to explain why. (Picture – a thousand words – that sort of thing.)

A useful resource for creating your own palette is this R colour chart. Here are a few examples:

library(RColorBrewer) # for brewer.pal and display.brewer.all

TestPalette <- colorRampPalette(c('aliceblue','aquamarine1','azure3','blue','blueviolet',
                                  'darkcyan','darkblue','darkgreen','darkmagenta','darkolivegreen',
                                  'darkmagenta','darkviolet','black'))
WarmPalette <- colorRampPalette(c('antiquewhite','pink','rosybrown3','rosybrown4','saddlebrown','brown','black'))
CoolPalette <- colorRampPalette(c('lavender','mediumslateblue','blue','turquoise4','seagreen2','seagreen4','black'))
BluesPalette <- colorRampPalette(brewer.pal(9,"Blues"))(100)

brewer.pal pulls from the built-in ColorBrewer palettes (as described in detail in the colorRamp pdf I linked earlier) – the handiest query for me was:

display.brewer.all()

which shows the name of each palette (like “Blues”) and the range of colours in it. Also, the (100) in the BluesPalette example specifies how fine you want the shading to be: if you had (3), there’d be three shades of blue, something like dark blue, blue, light blue, while (100) gives you one hundred shades varying from dark to light.
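
The same trick works for the custom palettes too: colorRampPalette returns a function, and calling it with a count gives back that many interpolated colours as hex strings. For example:

CoolPalette(5) # returns 5 hex colour codes interpolated along the lavender-to-black ramp defined above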

Finally…

The code for my final plot, with comments explaining the new bits:

library(gplots) # for heatmap.2 and its colour key
library(heatmap.plus)

heatmap.2(MyMatrix, # MyMatrix is the results matrix from our paper
  Rowv = FALSE, Colv = FALSE, dendrogram = "none",
  cexCol = 1, cexRow = 1,
  key = TRUE, keysize = 0.1, # display the colour key
  density.info = "none", # which plot, if any, to draw inside the colour key
  trace = "none", # whether a solid "trace" line should be drawn across rows, down columns, both or none
  margins = c(5,9),
  lmat = rbind(c(0,3), c(2,1), c(0,4)), lhei = c(0.2, 8.5, 2), # layout: 1 = heatmap, 2 = row dendrogram, 3 = column dendrogram, 4 = colour key (here placed at the bottom)
  col = CoolPalette # custom colours for the heatmap and its key
)

…and the result:

Much nicer: Heatmap with Cool Palette colours

My only remaining annoyance is the “Value” label, which floats a bit too close to the displayed numbers, but I don’t think it impedes readability enough to be worth further tweaking.
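
(A possible fix I haven’t tested: I believe later versions of gplots added key.title and key.xlab arguments to heatmap.2, which should let you suppress or rename that label – the label text below is just a placeholder:)

heatmap.2(MyMatrix,
  col = CoolPalette, trace = "none",
  key.title = NA,       # assumed, newer gplots only: drop the title above the colour key
  key.xlab = "My label" # assumed: replace the default "Value" label under the key
)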

Funnily enough, the greyscale versions don’t look as bad on screen in a blog post. But trust me, it makes one hell of a difference on paper.


Web Science Doctoral Summer School 2011 (belated)

Friday, August 26th, 2011

Here’s a very belated post on the Web Science Summer School I attended over a month ago (July 6–13, 2011).

My thoughts? It was great! Fortunately, there were also enough others who felt compelled to write about it so that I don’t feel a huge urge to have a massive brain squeeze of whatever memories I have of it. That said, there’s always room for a few (short) thoughts.

It was an intense week of lectures/tutorials, often starting at 9am and finishing at 6pm, which meant many of us were very much drained at the end. Tiredness aside, the highlight for me (minus the social bits) was the mini-project. Over the course of a few days, we collected different types of communication data (specifically, face-to-face communication, Facebook & Twitter), pretended to be web-social scientists, and tried to make as much sense of it as we could (until silly hours) – with a bit of help from compewters and algorisms to make it all sound legit, of course. (Actual details can be found in Aaron’s post, linked below.)

Here’s one example of what coordinated collective action at this school resulted in:

Some other people’s posts (thoughts) on the school:

  • Clare Hooper’s personal perspective on it, as well as the start of it (with subsequent posts on the attended lectures – well worth a read as summaries, to decide whether you want to watch the ~1.5 hours of video allotted to each talk).
  • Aaron McDaid and Owen Phelan’s posts on the school – the former has more details of our team’s mini-project.

Resources:

  • Here are some links to the lecture/tutorial videos and the corresponding slides.
  • Here are also some flickr photos of the school.
  • In particular, the very useful supporting material for the tutorial given by Derek Greene, a post-doc at UCD. (Not just promoting a colleague 😉 It really is a good intro to important concepts in social network analysis and the relevant software (networkx, gephi), with concrete examples and datasets.)

Oh, and my favourite quote of the week (and something I aspire to partially live by – i.e. automate as much of the ‘trivial’ stuff as I can, so that what I am is the creator/storyteller rather than the command follower (explanation especially for Ursula; even I’m not usually this pedantic!)):

“You are what you do not automate” — Marc Smith


IJCAI 2011: Part 2

Wednesday, July 27th, 2011

I didn’t expect to write so much, but it’s supposed to be good documentation, even if overly informal in places 🙂 and so it continues (from Part 1)…

My Talk/Poster Session

For those who asked me about this beforehand, you’ll have known that I was rather nervous about it. I’m not a naturally good public speaker, and I’m also afraid of sounding over-rehearsed, so it’s a tricky divide. I opted (well, was persuaded) to err on the side of preparation, so I rehearsed a lot; but even so, I was fairly nervous and had a shaky start. It didn’t help that my session chair was missing for the first half of my talk, as he was double-booked – no panic though: a kind and seemingly experienced member of the audience introduced me and started the session on time (a shame I didn’t catch his name, I really should have thanked him more).

Anyhow, once I got into the flow, it was grand. Being able to watch it afterwards, although initially very cringe-inducing, was also surprisingly helpful. Once I got over listening to my very weird accent (no wonder people can’t pin it!), I tried to evaluate it objectively and figure out what needs work. I think I need to be more aware of filler utterances (i.e. um, um, UM); even if they’re impossible to control exactly, better pacing might be a partial solution. I’m glad I gestured (but not overly, à la wavy arms), as I think it helps alleviate dullness. I was also more monotone than I’d realised, which is another thing that could be worked on. Overall I’m reasonably happy with how it went. I was able to answer the questions at the end (naturally, I could’ve answered better in retrospect, but I think it was really the best I could have done at the time). My only regret was not being able to catch the second person who asked a question, to continue the conversation ‘offline’ :/

As for the poster session – it essentially didn’t happen 🙁 I had brought the poster back to my hotel room on Monday, having had a colleague help transport it over. Friday morning came and I realised it was gone. I searched my room thoroughly and concluded it had been thrown out by the cleaners. Given that it was in a cardboard tube which I had haphazardly chopped bits off to get it to fit, it was a reasonable thing to mistake for rubbish. This was very frustrating at the time, but in retrospect, even if I had realised it was missing earlier, it still might not have been recovered. Moreover, I went to some of the talks arranged for Industry Day during my allotted time, so my time was spent reasonably productively either way.

Other/Overall impressions

I don’t think I can spend much more time expanding on my thoughts, but here are some final snippets:

  • I’ve been out of touch with the AI community for a while (~3 years), so it was interesting to see what’s going on in the (sub)field(s).
  • There is a project, AISN, which is attempting to create a sort of AI online community based on the conference attendees. I do hope it takes off; it’s an interesting idea, and data is always good…
  • I am not sure I liked the multiple-track format, but I guess that’s unavoidable with such a broad conference. For the same reason, it was a bit difficult to find people who worked in the same field, though I suppose the workshops – slightly more specialist – were organised for that.
  • The schedule was also quite tight and packed, which meant stress at times – and tiring!
  • That said, I did quite enjoy the extra events – the delightfully geeky Casparo opera at the beautiful Palau de la Música Catalana, the not-so-yummy “banquet” at Poble Espanyol – and I think the aforementioned Industry Day talks were useful, even if a bit commercial.

Summary/What I’ve learned

  • Large conference, broad areas -> plan what to do in advance.
  • Look at the organising committee of a workshop to gauge whether it’s worthwhile/has a future. Take advantage of tutorials.
  • The best quality talks are usually the invited ones, especially from seasoned researchers who aren’t there just to advertise their methods, but their field.
  • Don’t be afraid to ask questions. I should apply this more myself.
  • Networking isn’t as difficult as I expected. People are open, but do realise they also have personal agendas which may or may not align with yours.
  • Although this may sound contrary to the previous point, don’t be afraid to make new friends – people have more sides than just their academic ones 🙂

Honorable Mention: ICWSM

IJCAI was co-located (in that both were in Barcelona) with ICWSM this year, and a few of my colleagues attended it. Though I was not officially supposed to, I attended one talk there, “Twitter and Data Science” by Jimmy Lin, an academic currently taking a couple of sabbatical years to work at Twitter. His talk covered (see below) the types of techniques that are often applied to Twitter data, which reminded me of the concept of “builders and studiers” we’d used at the Web Science Summer School.

At this point, I must also extend a big congrats to John Breslin, who I believe had a major part in the result that ICWSM will be held in Dublin next year!

Last words (not terribly important)

And finally, the rumours have been quashed. We already knew IJCAI 2013 is to be held in Beijing; the people lobbying for Melbourne for 2015, despite their t-shirts, have been declined, and it will be held in Argentina (Buenos Aires, I think) instead. Moreover, it has been confirmed that the conference will become annual post-2015:

Who knows how the community will evolve by then, and whether I’ll have the chance to attend again. I would love to, but given the scope of it, I’d see it more as a high-end educational holiday than a directly relevant work week (especially given the exotic locations 😉 ). I was certainly inspired by it, and feel the need to push a bit more – my reading list will never end!


IJCAI 2011: Part 1

Wednesday, July 27th, 2011

I was at IJCAI last week; this was my first major academic conference and I very much enjoyed it.

IJCAI is a well-established biennial conference (running since 1969), and the acronym stands for “International Joint Conference(s) on Artificial Intelligence”. The term “Artificial Intelligence” is an interesting one, and it came up in several conversations throughout, though I won’t go into it here (we can pub-talk this if we meet!). In short, the discussion spawned from the fact that it is a very large and broad conference, with ~1500 participants, and I think a few of us were struggling to figure out where we fit.

Tangential Modularity Rant

Personally, I was playing the community-finding card, specifically communities of roles in a network (at least that’s what I’ve been convincing myself!), so I was browsing the clustering and web mining topics, and bits of the search landscape. My impression is that there is a lot of work going on focused on recommendation/prediction (I suppose this is a kind of ‘intelligence’ after all, so it shouldn’t be surprising), but I also noticed that Newman’s modularity heuristic was very popular. It was often used as the basis of a sub-part of a solution (to another problem), or reformulated in order to scale it up, with little attention paid to interpreting the results, which I think is a pity.
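
For reference, since it comes up so often: Newman’s modularity for a partition of a network into communities is

$$Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

where $A$ is the adjacency matrix, $k_i$ the degree of node $i$, $m$ the total number of edges, and $\delta(c_i, c_j)$ is 1 when nodes $i$ and $j$ are in the same community. It rewards partitions with more intra-community edges than a random graph with the same degree sequence would have, so a higher $Q$ is better.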

In other words, results sections often consisted of graphs showing how fast such-and-such performed in comparison to other methods, while the found clusters themselves were usually omitted. This is fair enough if they were generated from synthetic data, but for real-world data the only argument for why the clusters are good becomes “the higher the Q value, the better”, which is too weak IMHO. If I agree with your assumptions, then perhaps this makes sense, but it does require a leap of faith, and Fortunato & Barthélemy do a good job of making this leap look bigger than a simple hop.

Furthermore, although it was highlighted to me that my approach (a semi-supervised approach to finding roles) may be too specific to social networks, and that for very large networks it may not be possible to interpret everything, I still think there should be a place for thoughts on the trees rather than just the wood. In particular, the very first workshop talk I attended, Detecting Communities from Social Tagging Networks Based on Tripartite Modularity (as well as a few others in the IJCAI proceedings themselves), at least tried to show some of the clustered results and acknowledged that methods beyond modularity may be considered. OK, less rant, more conference details.

Workshops/Tutorials

The conference officially started on Monday evening, but workshops and tutorials ran during the weekend before (and the Monday itself). I took part in the Link Analysis in Heterogeneous Information Networks workshop, which, I unfortunately have to say, wasn’t very well organised. The talks were interesting enough, but it became clear that not a lot of thought had been put into scheduling them; there weren’t really any themes within or connecting the sessions.

The timing was also a bit off. Workshops were held in parallel with specific times for breaks, and we didn’t align with them in the second half of the day, which resulted in some confusion and the workshop finishing early. There was also a promise of a wiki where the slides and information on participants were to be put up, which I thought was a good idea at the time, but I haven’t heard anything about it since… However, all the workshop proceedings are available in one place, which is handy. As for recommendations, I would look at the Web Mining one among the tutorials.

Invited Talks

I didn’t manage to make it to all of them (9am Barcelona time meant 8am Irish time, which, combined with a 20-30 minute journey, made for a near-impossible feat for me). Of the ones I did attend, however, two particularly stood out – Daphne Koller, who spoke about “Rich probabilistic models for image understanding”, and Jonathon Schaeffer on “The Games People Play Revisited”. This is not to say the other talks weren’t well presented, but these two (I felt) struck the right balance between detail and genuine enthusiasm & passion – something I consider difficult to convey without sounding like a blind fan.

Remember, although they were addressing a very broad audience, it was an audience of fairly technical and critical minds, which can be hard to please. Both talks followed a similar format: they stated the problem in plain English, then showed the incremental (but important) developments in the field – how aspects of one approach worked, why it didn’t work overall, which approach became popular, what inspired a new one, and what they think is the “future”. I especially liked Jonathon’s slide below (quoting his wife); he very proudly gave his response to each line, and for the final one in particular he said he didn’t know what it meant – this (I assume he meant research into games) is his life 🙂

Note I’ve ended up splitting this entry into two parts, as I had more to say than I’d expected (here’s Part 2).


Comparing clusterings

Monday, December 13th, 2010

I was recently looking at how to compare (at least) two algorithms’ clustering results, and had Wagner & Wagner’s Comparing Clusterings: An Overview as a starting point, which appeared to be a longer and less useful version of Meila’s 2002 paper. Anyway, in short, I decided to go with the latter’s suggestion of using Variation of Information (VoI) as a measure. My actual problem is that I have a bunch of data – I run an algorithm on it – and the results are essentially clusters. Thus, I need a systematic way of evaluating how ‘good’ these clusters are. VoI will hopefully be useful as an indicator of which result sets are the best ones for me to look at by hand (and make some sort of interpretation of).

I wrote a little script in Python (which took far too long, mostly ’cos I had to re-learn how to program after not having done much in about two years), so if anyone wants to borrow it, feel free to contact me. I’ll post a sample calculation of VoI at some point too.
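
In the meantime, here’s a minimal sketch of the calculation itself – in R rather than Python, to match the other snippets on this blog, and only lightly checked. It follows Meila’s formulation VI(X,Y) = H(X) + H(Y) − 2·I(X;Y), with the function name voi being my own:

voi <- function(x, y) {
  # x, y: two clusterings of the same items, given as label vectors
  n <- length(x)
  pxy <- table(x, y) / n # joint distribution over pairs of cluster labels
  px <- rowSums(pxy)     # marginal distribution for clustering x
  py <- colSums(pxy)     # marginal distribution for clustering y
  hx <- -sum(px * log(px)) # entropy H(X)
  hy <- -sum(py * log(py)) # entropy H(Y)
  nz <- pxy > 0            # skip empty cells to avoid log(0)
  mi <- sum(pxy[nz] * log(pxy[nz] / (px[row(pxy)[nz]] * py[col(pxy)[nz]]))) # mutual information I(X;Y)
  hx + hy - 2 * mi # identical clusterings give 0; larger means more different
}

voi(c(1,1,1,2,2,2), c(1,1,2,2,3,3)) # example: two clusterings of six items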

On a slightly (un)related note, I am getting tired of writing damn little Python scripts…little things require little scripts which requires a little more time… little + little = big, like few + few = lots. Grr…


Reading material

Wednesday, November 10th, 2010

Before I lose track with semi-random reading, a general guiding list would be useful:

  • General graph theory
  • Blockmodelling – potentially the main focus of the PhD, I haven’t quite decided yet, but it’ll be an important component anyway. This splits into:
    • Structural
    • Regular (or generalized) and
    • Stochastic

    Incorporating overlapping community search is a thought, but not a very deep one as it stands – I will need to return to this.

  • General network theory (ties in with graph theory, but more applied) – specifically to look at current and competing theories and techniques.
  • Some statistics/machine learning – e.g. goodness-of-fit/clustering methods – and the theory behind them.
  • General programming top-up – in particular C++ and Python (also Matlab, if that counts 🙂 ).

Currently, I am reading about the interpretation and evaluation of blockmodels, as well as about comparing clusterings. Using Mendeley to share and keep track of papers appears to be handy.


Hello world!

Sunday, October 10th, 2010

Time to resurrect the blagger in me and put my thoughts into something more coherent than the strands in my head. This is likely to turn into a research-y sort of blog, so expect intermittent rants about the life of a PhD-er 🙂