What I did
I hold a philosophy doctorate from the University of Southampton for computing research into Open Hypertext and the Semantic Web. This is the fairly whimsical summary. If you want the full-blooded academicese, scroll down. (If you just want pretty videos, scroll down further.)
I pretty much smushed two-and-a-half research areas together and saw what the resulting explosion looked like.
Open Hypertext is the research field that the web pretty much brushed aside when it took off. The web is a really simplistic hypertext system; research before it looked into more aggressive re-use and fancier ways to link things. Open hypertext splits the links out from the documents so you can apply them separately.
The web is a bunch of interlinked documents. The Semantic Web is a bunch of interlinked data. It's about a consistent, basic representation of knowledge (RDF), and giving things sane names (URIs) so they can refer to each other. (You might hear the resurgence of this latter part referred to as Linked Data.)
Semantic Wikis are what happens if you take wiki systems (like Wikipedia) and let people specify how pages are related when they link them together. If you look at it just right, this fits quite nicely with the Semantic Web defining relations between things in RDF.
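To make that concrete, here's a tiny sketch of the idea of a typed wiki link becoming an RDF-style triple. All the names, URIs, and the link syntax below are invented for illustration; they're not my system's actual syntax or vocabulary.

```python
# Hypothetical example: a typed wiki link such as "[[directed by::Ridley Scott]]"
# on the page "Blade Runner", read as a (subject, predicate, object) triple
# of URIs. The base URI and naming scheme are made up.

def typed_link_to_triple(page, relation, target, base="http://example.org/wiki/"):
    """Turn a semantic-wiki typed link into an RDF-style triple of URIs."""
    def uri(name):
        return base + name.replace(" ", "_")
    return (uri(page), uri(relation), uri(target))

triple = typed_link_to_triple("Blade Runner", "directed by", "Ridley Scott")
# ('http://example.org/wiki/Blade_Runner',
#  'http://example.org/wiki/directed_by',
#  'http://example.org/wiki/Ridley_Scott')
```

The point is just that the link's type gives you the predicate for free: the wiki's link structure *is* a knowledge graph, if you squint.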
My hypothesis was more-or-less that there were still useful things to loot from hypermedia to make semantic wikis more useful.
Studies on wiki editing
I carried out a couple of experiments on how people edit Wikipedia.
A macro-scale experiment looked at the kind of edits people had made over time, which happened to involve parsing about two terabytes of XML and wiki markup and developing a "sloppy" string comparison algorithm that could scale to handling that much prose comparison on an old Celeron laptop.
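The real algorithm is in the thesis; as a rough flavour of what "sloppy" means here, the sketch below compares token multisets instead of computing a full edit distance, and bails out early when the lengths are wildly different. It's a cheap stand-in I've written for this page, not the thesis algorithm.

```python
# Illustrative only: a cheap token-overlap similarity with an early exit,
# in the general spirit of trading accuracy for throughput. The cutoff
# value is arbitrary.

from collections import Counter

def sloppy_similarity(a, b, length_ratio_cutoff=0.5):
    """Rough 0..1 similarity between two chunks of prose, cheap enough
    to run over millions of revision pairs."""
    ta, tb = a.split(), b.split()
    if not ta and not tb:
        return 1.0
    # Early exit: texts of wildly different lengths can't be similar.
    if min(len(ta), len(tb)) < length_ratio_cutoff * max(len(ta), len(tb)):
        return 0.0
    ca, cb = Counter(ta), Counter(tb)
    overlap = sum((ca & cb).values())  # multiset intersection size
    return 2.0 * overlap / (len(ta) + len(tb))

sloppy_similarity("the quick brown fox", "the quick brown cat")  # 0.75
```

Anything in this vein is linear-ish in the text length, which is what makes terabytes of revision history tractable on weak hardware.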
Twice as many edits change just links as change content, and about 10% of edits are "list of" or category changes. Maintaining Wikipedia's link structure takes considerable effort. (About 5% of edits just revert someone else, too.) The usual power law holds: lots of people make few edits, few people make lots of edits; lots of edits are small, few edits are large.
A micro-scale experiment focused on a few people editing pages in detail (and was repeated against my system to compare). It's a bit wordy to summarise here, but it's interesting to note a few things like:
- people deliberately leave blank and broken parts of pages as an invitation for someone else to complete them
- because context differs, the same thing being described in several places doesn't mean the same text fits in each
- templates on Mediawiki are a nightmare
- people have a feel for how much should be linked, though it varies with whether they favour linking the simple terms or the complex ones
I developed a model capable of representing much of open hypermedia that also functioned as a wiki system and mapped into useful semantic web resources. I then modified my previous research wiki to partially support it, and evaluated how it helped and hindered people. By and large, it helped.
Here's a "real world" example of it showing how treating the links between articles as relations can be useful. For TVTropes, it can automatically create a listing for a trope of all the works that use it, and how, rather than this fact having to be listed in both pages.
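A minimal sketch of that inversion, with invented data: if each "uses trope" link is stored as a first-class relation (optionally annotated with how the trope is used), then the trope page's listing is just the inverse index over those relations, rather than text maintained by hand in both places. The relation name and the data are made up for the example.

```python
# Invented data: each entry is (work, relation, trope, note). In a system
# that stores links as relations, these come from the links themselves.
links = [
    ("Hamlet",  "uses trope", "The Ghost", "the murdered king returns"),
    ("Macbeth", "uses trope", "The Ghost", "Banquo haunts the feast"),
    ("Hamlet",  "uses trope", "Sword Fight", "the final duel"),
]

def works_using(trope, links):
    """Auto-generated listing for a trope page: every work linking to it,
    with the note on how it's used."""
    return [(work, note) for (work, rel, t, note) in links
            if rel == "uses trope" and t == trope]

works_using("The Ghost", links)
# [('Hamlet', 'the murdered king returns'), ('Macbeth', 'Banquo haunts the feast')]
```

State the fact once, on the link, and both directions of the listing fall out of a query.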
It also supports sharing parts of pages with each other, assisting filling in facts about a page by knowing what kind of thing it's about and what kind of facts you might want about that kind of thing, and globally linking terms to a page. The best way to grasp all that is to watch the videos below.
Get the thesis
You can get the bare thesis straight from here, or go to ePrints and get it with all the trimmings.
- Download as PDF: the printable 221-page compiled document
- Download source: LaTeX sources and supporting files*
- View on ePrints: Open Access, with all supporting resources
If you're a bit of an academic, you might find the plain-text abstract and references useful. ePrints can export a citation.
*To build from the sources, you will want a UNIX environment with LaTeX (probably TeX Live, since you'll need a bunch of extension styles), GNUPLOT, Inkscape, dot2tex, detex, and a few other sundry tools like GNU make. If that doesn't make any sense to you, you're probably best just leaving them alone. (It was only ever written to build on my machines; I provide the sources as a nicety for anyone wondering "how did you do that?" in the PDF output.)
See the system
A major part of my thesis was the development and evaluation of a wiki system called Open Weerkat.
- First-class linking
- Instance property editing
- Generic linking
English closed captioning is available.
If you can't do YouTube (or would rather have slightly fewer transcoding/sizing losses than subtitles), there are raw CamStudio/Microsoft Video 1 format videos on ePrints.
Sorry, but there's no public demo site. Being a prototype it never got reviewed for horrible security holes, so the only demo lives firewalled within a University network.
Get the code
Disclaimer: Open Weerkat is an academic research system. It's reasonably architected, but lacks the polish of a deployable product.
Download from Forge
Licensed under the GNU AGPLv3
You'll need an Apache 2 with mod_perl environment to get it running, and a bunch of CPAN modules, such as the Redland RDF bindings. I'm afraid that, again, it was written to run on a grand total of two machines and generate 'research', so you may need to tinker a bit.
Mediawiki bulk processing
For one of the experiments, I wrote tooling to semi-parse huge amounts of Wikipedia history (i.e. random samples of all of it, ever).
Download from ePrints
Licensed under the MIT license
The heavy lifting is plain C code, but needs libxml2. The final processing is in Perl. The ePrints distribution of the code includes some sample data (GFDL licensed).
Read the papers
I presented papers at WWW 2008's Social Web and Knowledge Management track, and Blogtalk 2009. The Wikipedia study was also published as a chapter of Weaving Services and People on the World Wide Web, in an expanded form including the micro-scale experiment. The Blogtalk model paper was compiled as the first chapter of Recent Trends and Developments in Social Software.
- Hyperstructure Maintenance Costs in Large-scale Wikis: View on ePrints
- Studies on Editing Patterns in Large-scale Wikis: View on ePrints
- A Model for Open Semantic Hyperwikis: View on ePrints
The presentation for my Blogtalk 2009 paper is also available in an online viewer:
- A Model for Open Semantic Hyperwikis: View on SlideShare
ePrints can export citation information; scroll to the bottom of the page.