July 19th, 2012

Data, data, data: it’s so much easier (and more reassuring) to use data that has already been preprocessed and analysed. Here are some nice sources:

  • KONECT – The Koblenz Network Collection. These are proper big networks, ranging from a few tens of thousands of nodes to a few tens of millions. The page is organised by network type (e.g. Authorship, Lexical, Ratings), with neat symbols beside each name showing what kind of graph it is: bipartite, directed, weighted, temporal, etc.
  • Social network datasets – a page of datasets associated with a social network analysis course. Most were originally collected by sociologists studying the behaviour of animals or people, so they are understandably small (<100 nodes). They are very useful, however, because they are well studied and the ground truth is, well, well-grounded XD The format is less pretty: the layout is pretty old-school, with the names of each dataset’s authors/collectors serving as links that jump to the section of the page describing it.
  • Clique Datasets – a list of links to datasets created by Clique researchers and made available for academic use. It is a smaller collection, but well documented, with links to related publications and blog posts (where they exist).
  • Stanford Large Network Dataset Collection, as well as their Web and Blog datasets – a bunch of which are also part of the KONECT datasets. They are mostly large datasets (again ranging from tens of thousands to tens of millions of nodes) from Jure Leskovec’s page for their Stanford Network Analysis Package (SNAP).
  • A bit meta here: a couple of links to pages of links:

