Over this weekend (9–11 April) I watched the Citability CODEATHON on Ustream. I already knew about Citability.org from Silona Bonewald (@Silona on Twitter), but the codeathon was very interesting to follow as a spectator, for both its discussions and its prototypes.

What is citability​.org?
Citability supports making public government documents and data available online and citable such that they can be easily referenced for public debate, commentary and analysis. This requires that archived versions of documents be stored and linkable so that changes can be easily spotted and reference links remain intact.
(from http://dccodeathon.pbworks.com)

I was thinking about changes: a live document, as opposed to its archived snapshot, may:

  • change its location
  • change its presentation, possibly breaking intra-document addressing
  • undergo minor changes (let’s call them lexical ones) such as spellcheck/​grammar/​punctuation changes
  • undergo major changes, semantic ones (as opposed to the above lexical ones)

In 2000, Thomas A. Phelps and Robert Wilensky wrote “Robust Hyperlinks and Locations,” in which they describe how lexical signatures can be used to find a moved document (or copies of the same document); they also address the issue of “Robust Intra-document Locations.”

Basically, a lexical signature of a document is a set of keywords (computed with a TF/IDF-like algorithm) which, when used as a Google search query, returns the same document and/or copies of it as the top hit. Since this signature is computed against “Google’s corpus,” the results will drift over time as the corpus changes. But since Citability keeps snapshots of the original documents, their lexical signatures can be re-computed.
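To make the idea concrete, here is a minimal sketch of computing such a signature with a plain TF-IDF ranking. It is not the Phelps and Wilensky algorithm itself: the function names (`tokenize`, `lexical_signature`) are hypothetical, and a small local list of documents stands in for the search engine's corpus, which is an assumption of this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, size=5):
    """Return the `size` terms of `doc` with the highest TF-IDF score.

    `corpus` is a list of document strings (assumed to include `doc`)
    that stands in for the search engine's index.
    """
    corpus_terms = [set(tokenize(d)) for d in corpus]
    term_counts = Counter(tokenize(doc))
    n_docs = len(corpus)

    scores = {}
    for term, count in term_counts.items():
        # Document frequency: in how many corpus documents the term occurs.
        df = sum(1 for terms in corpus_terms if term in terms)
        # Smoothed IDF: a term present in every document scores zero.
        idf = math.log((1 + n_docs) / (1 + df))
        scores[term] = count * idf

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:size]


if __name__ == "__main__":
    corpus = [
        "the senate budget bill amends the tax code",
        "the weather report for the weekend",
        "the recipe for the apple pie",
    ]
    print(lexical_signature(corpus[0], corpus, size=3))
```

Words that appear in every document (such as "the") get a score of zero and drop out, so the signature keeps only the terms that distinguish the document from the rest of the corpus. Re-running the function against an updated corpus is what "re-computing" the signature would amount to in this sketch.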

I believe that Citability could easily employ this lexical-signature technique to locate moved or duplicate documents, and to help detect minor lexical changes (a signature comparison could be more flexible than raw hashes, which change with any edit). Moreover, the intra-document re-attachment algorithm could help detect structural changes in documents and re-attach citations, or just help observe how documents evolve.
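As a rough illustration of why a lexical comparison can be more forgiving than a raw hash, the sketch below compares two versions of a sentence both ways. It is a simplification under stated assumptions: it uses the Jaccard overlap of plain word sets rather than real lexical signatures, and the helper names (`same_hash`, `word_overlap`) are made up for this example.

```python
import hashlib
import re


def words(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def same_hash(a, b):
    """True only if the two texts are byte-for-byte identical."""
    digest_a = hashlib.sha256(a.encode("utf-8")).hexdigest()
    digest_b = hashlib.sha256(b.encode("utf-8")).hexdigest()
    return digest_a == digest_b


def word_overlap(a, b):
    """Jaccard similarity of the two texts' word sets, from 0.0 to 1.0."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 1.0  # two empty texts are treated as identical
    return len(wa & wb) / len(wa | wb)


if __name__ == "__main__":
    original = "The committee shall publish its report annually."
    punctuated = "The committee shall publish its report, annually."
    rewritten = "The agency may withhold all records indefinitely."

    for label, version in [("punctuation fix", punctuated), ("rewrite", rewritten)]:
        print(label, same_hash(original, version), round(word_overlap(original, version), 2))
```

Both edits change the hash, so a hash alone only says "something changed." The word overlap stays at its maximum for the punctuation fix and falls sharply for the rewrite, which is the kind of graded signal that could separate minor lexical edits from larger ones.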

Over the past few years, as part of my experiments on semantic navigation, I used such lexical signatures as input to an ontology search engine (Watson) to discover which ontologies can cover a specific document. This provides a way to discover semantically (and not just lexically) related documents, while steering document discovery through domain-specific ontologies.

I believe that such semantic signatures (think of the lexical ones, but elevated to interlinked concepts), apart from enabling topic-based navigation between documents or measuring semantic similarity, could be an interesting way to detect major, “semantic” changes as opposed to minor lexical ones.

I cannot wait for the Citability project to take off, as its archives would be a valuable corpus for researching document similarity, change detection and document evolution.

Note: The ‘semantic signatures’ I refer to are different from TextWise.com’s.
