Learning to Fear the Semantic Web, by Paul Ford (Ftrain)

Oct 14, 2008 · Post · The Semantic Web

Zotero is an open-sourced bibliography-management tool that runs inside Firefox-based browsers (see screencast). It helps you keep track of your research. I’ve enjoyed using it as I work on writing projects. From the about page:

Zotero is a production of the Center for History and New Media at George Mason University. It is generously funded by the United States Institute of Museum and Library Services, the Andrew W. Mellon Foundation, and the Alfred P. Sloan Foundation.

Nice! Except today, a good bit after the fact, I learned of a peculiar lawsuit that information and news giant Thomson Reuters Inc. filed last month against the makers of Zotero. From the website of The Chronicle of Higher Education, October 3, 2008, by Jeffrey R. Young (links added):

Thomson Reuters Inc. sued George Mason University in a Virginia court this month, arguing that a free software tool made by the university makes improper use of the company’s EndNote citation software....

Thomson Reuters argues that the latest release of George Mason’s software, which can import files created by EndNote and turn them into files that can be used and shared online using Zotero, “is willfully and intentionally destroying Thomson’s customer base for the EndNote software.” The company seeks $10-million in damages for each year the university has offered the software and to stop the university from distributing versions of Zotero that can convert EndNote files.

One person who commented on the lawsuit is Michael Feldstein, who writes a blog about online learning. He posted the following on October 5:

Apparently, the Zotero team did create their own style format and is crowd-sourcing the creation of import styles. As you can see from this Zotero developer discussion thread, the developers considered and explicitly rejected supporting the redistribution of Thomson-supplied EndNote conversion files. In fact, while Zotero can read EndNote style files, it specifically does not convert them into Zotero’s own format, in large part to discourage the redistribution (deliberately or accidentally) of Thomson-created files. What the import feature does facilitate is (a) users who have already licensed EndNote and want to migrate to Zotero can use the EndNote styles that they have already paid for, and (b) Zotero users can take advantage of the EndNote import styles that individual journal publishers (as opposed to Thomson itself) make available for the convenience of their subscribers. These uses strike me as totally within bounds.

(More is available from the Disruptive Library Technology Jester blog.)

Given my biases this lawsuit seems like an anachronistic, hamfisted attempt to block competition. While as a programmer I love being able to adapt open-source software to my particular needs, I use a mix of closed-source and open-source software without many qualms. That said, non-standard, closed-source document formats are awful stuff that block competition between software vendors and, worse, waste god-awful amounts of my time. If you wish to dispute me on this then come to my office tomorrow to help me, over the course of several hours, yank a magazine’s-worth of text out of QuarkXPress, using a mix of applications and balky Emacs macros. (Imagine if you could take back all the time spent wrangling closed, proprietary document formats. You could finish Perl 6; you could probably write it in Arc.)

I’m not an EndNote user and I don’t like to borrow trouble (which is why I’ve been avoiding this blog; blogging is a great way to borrow trouble). But not only does this lawsuit invoke the dread specter of legally-enforced proprietary data formats, it raises questions about Thomson Reuters’s legal attitude towards the data produced by its other software offerings, including, in this case, a piece of software called OpenCalais.

OpenCalais is a web-based application that consumes text and returns special Semantic Web-style metadata that you can use to do interesting, Semantic Web-style things, like: create topic pages, improve search, or enhance local taxonomies. It has a Facebook group and its website features both video of straight-talking bearded coders and a creatively borrowed terms of service statement:

We based these Terms of Service under those released by Automattic under a Creative Commons Sharealike license. Thanks to Automattic and WordPress.com for sharing.

I have a quarter-million-page corpus at work and I’m looking for simple, inexpensive ways to enhance it, so I’ve followed the development of their platform for some time: joining the Facebook group, signing up for an account, and using their free endpoint for testing (go ahead and give it a spin). My grand, entirely unrealized plan was to include a direct hook to OpenCalais in our content management system. The OpenCalais team seem trustworthy, progressive, smart, and committed to openness. But, at least for now, the lawsuit against Zotero has scared me off using the product.

This despite, as pointed out by the Panlibus blog at Talis, in a post on OpenCalais as it relates to the Zotero lawsuit, the following statement from the OpenCalais folk:

We want to make all the world’s content more accessible, interoperable and valuable. Some call it Web 2.0, Web 3.0, the Semantic Web or the Giant Global Graph—we call our piece of it Calais.

So why am I overreacting? Well, that “our piece of it” bit is a little tricky, but I think I get what they mean, and the EndNote people and the OpenCalais people are in different parts of a very large organization and working on different projects with different goals. But the parent company is the same, and professionally I feel required to overreact, because in every situation (as editor, coder, designer, and so forth) I must, to my great regret, always concern myself with liability.

I hate that part of my job. From worrying about copyright and fair use, to questioning whether we can reuse art or prose from our own archives, to sending out cease and desists—it all fills me with gloom and despair, the sense of being a culpable cog in a lumbering legal machine. It’s the opposite of creative, interesting work, but if you get something wrong the consequences can be dire, so worrying about getting sued is something that has to be done, every day, even on the subway. I’m worried about getting sued right now, sitting here, typing this. If you’ve had someone threaten you with a lawsuit, you know the sort of fear and second-guessing it engenders. Even if I am certain that I have followed every ethical and legal guideline, it’s an instant panic attack to see the words “contacting a lawyer” or “liable for damages” in an email; it leads to second-guessing, and I know there will be phone calls, meetings, and several months of followups to comply with the needs of insurers. If I can see the shadow of a lawsuit anywhere I am obligated to shine a light upon it and freak out at least a little; otherwise I’m not doing my job.

And that’s what’s going on here. This recent lawsuit against George Mason/Zotero immediately brought to mind a scenario: Thomson Reuters maintains control over the taxonomy, the thesaurus, of terms used in OpenCalais, and they do the indexing of content to associate that content with terms. The use pattern I was considering was as follows:

1. Create text within a content management system;
2. Send that text to OpenCalais;
3. Store the metadata it returns;
4. Over time, use aggregated metadata, integrated with our existing ~80,000 subjects, to create a local taxonomy for faceted search and automatically-compiled topic pages, along with other interesting interfaces;
5. Share as much of the taxonomy as possible as downloadable RDF;
6. Make sure to provide links back to OpenCalais wherever possible, on their terms, as defined in their Terms of Service (TOS) document.
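
For the programmers in the audience, the shape of that plan can be sketched in a few lines of Python. This is a rough illustration, not working integration code: `tag_text` is a hypothetical stand-in for the remote call (it matches a tiny hard-coded vocabulary rather than talking to any real service), the `urn:example:guid:` identifiers are invented, and the choice of `rdfs:label` and Dublin Core `subject` as predicates is my own assumption about how one might publish the result.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DC_SUBJECT = "http://purl.org/dc/terms/subject"


@dataclass(frozen=True)
class Tag:
    guid: str   # identifier assigned by the tagging service (invented format)
    label: str  # human-readable term


def tag_text(text: str) -> list[Tag]:
    """Hypothetical stand-in for step 2: a real version would call the
    remote service; this stub only matches a hard-coded vocabulary."""
    vocabulary = {
        "zotero": Tag("urn:example:guid:1", "Zotero"),
        "endnote": Tag("urn:example:guid:2", "EndNote"),
    }
    lowered = text.lower()
    return [tag for word, tag in vocabulary.items() if word in lowered]


def build_taxonomy(documents: dict[str, str]) -> dict[Tag, set[str]]:
    """Steps 3 and 4: keep the returned tags and aggregate pages by term."""
    taxonomy: dict[Tag, set[str]] = defaultdict(set)
    for url, text in documents.items():
        for tag in tag_text(text):
            taxonomy[tag].add(url)
    return taxonomy


def to_ntriples(taxonomy: dict[Tag, set[str]]) -> str:
    """Step 5: serialize as N-Triples, using the service's GUID as the
    term identifier so the origin of each term stays visible."""
    lines = []
    for tag in sorted(taxonomy, key=lambda t: t.guid):
        lines.append(f'<{tag.guid}> <{RDFS_LABEL}> "{tag.label}" .')
        for url in sorted(taxonomy[tag]):
            lines.append(f"<{url}> <{DC_SUBJECT}> <{tag.guid}> .")
    return "\n".join(lines)


if __name__ == "__main__":
    pages = {
        "http://example.org/a": "Zotero imports EndNote styles",
        "http://example.org/b": "Notes on Zotero",
    }
    print(to_ntriples(build_taxonomy(pages)))
```

Note that every published triple in this sketch carries an identifier that came from the service, which is the crux of the worry below.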

That’s probably not a big deal. I doubt anyone would even notice. But... is it at all possible, conceivable, even a tiny bit that at some point in the future Thomson Reuters could claim that we were misusing their data in step (4), above? From the TOS:

If you syndicate, publish or otherwise transmit any content containing, enhanced by or derived from Calais-generated metadata you will use your best efforts to incorporate the correct Calais-provided Globally Unique Identifier [GUID] in that content.

It seems straightforward, but that “best efforts....” The truth is, I don’t really know exactly what they mean there. Also from the TOS:

You will not use any metadata or GUIDs produced by Calais to create a metadata retrieval service similar to Calais.

And could they claim that we were somehow creating a derivative work without permission and distributing it in step (5)?

I would say, based on my far-from-authoritative reading of the TOS, and given the suit against George Mason University, there is now a precedent; that is, it is within the realm of possibility that if I passed thousands of web pages through OpenCalais and decided to adapt the resultant format for my own use in a way that Thomson Reuters disliked, I could get a fat letter from some lawyer someday demanding damages, accusing me of creating a derivative work based on their proprietary taxonomy, in violation of their terms.

I’m not saying it’s likely; I’m not saying I’m right; I’m not even saying that Thomson Reuters would be legally or ethically wrong to sue for damages. I would bet $10,000 right now against my fears coming to pass. But IANAL, which is exactly my problem here. And this is not a call to boycott anything, nor an attempt to get personalized service out of OpenCalais, where the developers are doing some very fine Semantic Web-bootstrapping work. I know Thomson Reuters could give a damn about me, and in that they are justified—I’m just another API key hash in their database, and even if I upgraded to their for-pay service I’d never represent more than a balance-sheet rounding error.

My only purpose in writing today is to point out how a lawsuit can have unintended chilling effects, at least for me. We’re in a remarkable downturn, and people are being told to “get real or go home.” One way corporations get “real” is to sue the living shit out of everything that blinks. It’s probably a good time to review the terms of service for all of your critical software to make sure you’re in compliance; I wonder if a lot of Web 2.0 mashup decentralized goodwill is going to go to good-faith heaven as companies under financial strain start to look closely at their patent portfolios and vendor agreements, and decide that printing out lawsuits is even cheaper than deploying to EC2. And now that the “Semantic Web,” or “Web 3.0,” or the “Linked Data Web,” or the “Web of Really, That’s How to Query Over an rdf:Bag?” or whatever they’re calling it, is viable enough that you can’t shrug off legal worries—now that the Semantic Web is no longer just a research project, if someone owns the taxonomy you’re using and changes it up on you, what rights do you have in the matter? Who owns the GUIDs? Your honor, I just wanted to build a hierarchy of topic pages. I never meant to hurt nobody. And so forth.

To summarize: working in web publishing, I have a healthy fear of lawsuits bordering on the insanely paranoid; and I wish it were not so, but that is now part of the job, as the web of ideas has given way to the web of pricks; and finally, actions speak louder than Creative Commons-licensed terms of service. You can still get handed a subpoena while you’re riding the Cluetrain.

Now that I got the fear, do I want to go to the effort to (1) educate a few people in management, none of whom would have great interest in the subject except as a soporific, about the far-fetched risks of using externally-generated taxonomies to organize our content; and do I (2) want to spend a number of hours in the near future educating myself over the completely nebulous rights issues connected to taxonomies, linking, and file formats, thus taking even more time away from code and prose to give it to the law; and do I possibly even (3) want to allocate the budget to work with a lawyer on taxonomy-related issues? All the while knowing that I’m overreacting and that this is probably pointless?

Not really. I’d rather let other people do that and read the judges’ opinions. Let deeper pockets set the precedent; what I do want to do is to port the CMS to Django, an open-sourced web framework published by a foundation, get the search into Solr, also published by a foundation, and introduce hierarchy to the 80,000 subjects we already have indexed. I’m just going to put OpenCalais away for a while and start looking at DBpedia again, then see how that whole Zotero suit works out over the next few months or decades.

In one way, this is all great because I love the Semantic Web to the point of stupidity—to the point of building a custom content management system entirely based on alpha-level technology using RDF for storage, creating a framework even slower than Rails. So I’m grateful to Zotero for taking the brunt of the lawsuit, because it gave me reason to take off my rose-tinted Linked Data goggles, and made me aware that all of my planned Semantic Web taxonomy-sharing fun could come crashing down if I don’t carefully track the provenance of every one of my triples, erring always on the side of raving terror.

Know what else is great? Now, finally, ten years on, I know that the Semantic Web is real and viable, because I’m afraid I’ll get sued for using it. That’s the true measure of a maturing technology—eat it, Gartner hype cycle.

I believe, as in don’t-get-him-started, that taxonomy-driven interactive editorial is essential to the future of the web, and thus to storytelling and narrative in general. Clearly a great deal of money is being spent by major companies in pursuit of the golden triple: It appears the AP is working on taxonomy tools, and Rupert Murdoch’s Dow Jones has Synaptica and publishes a cute taxonomy cookbook. A number of other companies are out there, building massive thesauri and indexing tools, hacking parsers and coding semantic disambiguators like mad, banging their heads against pronouns. There will be many, many competitors seeking to add their own structure to our increasingly Web-content-driven reality, and we will, if we use their services, find ourselves beholden to their methods of indexing, with all manner of legal compliance and copyright issues as of yet untested in courts. Creating good, broad, world-describing taxonomies is extraordinarily expensive, because reality is large, and these companies will need to strike a balance between sharing their work and protecting it, so I imagine this will be a subject I’ll revisit, professionally, many times over the next few decades (barring complete societal breakdown, or a personal spiritual awakening that allows me to stop thinking about this sort of thing).

Such questions could keep a librarian up at night, staring at the wall, petting his or her sleek gray cat Otlet and wondering what, for instance, a political campaign looks like when all of the news and columns are automatically classified before being published. Competition, he or she might conclude, must be encouraged between these platforms; there must be a free, and yet somehow regulated (perhaps by the W3C, or preferably by an organization with a more attractive website), market of taxonomies—you can’t have people claiming to own concepts conjoined to unique identifiers, can you? Can you? You probably can? Oh.

But there’s likely no reason to worry; and I am just borrowing trouble; and maybe the Semantic Web won’t matter that much after all. Even if taxonomies do become increasingly important in our web of linked data, thank God we live in a society with an enlightened understanding of intellectual property, and that we can trust the tiny handful of organizations that control the world’s supply of news, as they become software providers as well as content providers, to do the right thing when it comes to serving the needs of a wider populace, in a culture that would rather foster dialogue, discussion, and mutually beneficial resolutions than use the ugly, blunt tool of potentially profitable lawsuits. I’m sure—really, I am—that mine is an overreaction. And onward, to progress.