Monday, February 26, 2007

Wikipedia citations, with feed

Update: Changed feed URL.

I've added a cool new feature, building on some work by library programmer Lars Aronsson: Wikipedia citations on all work pages. That is, work pages now list all the Wikipedia articles that cite the work. The data is also available in feed form.

Here's how it goes. At the top of the page for J. F. C. Fuller's A Military History of the Western World, it lists the number of citations, with a link:



And, down below, it shows all the articles:



How I did it. Basically, I did a complete run through the Wikipedia dump files (source), parsing out anything that looked like an ISBN and checking whether it really is one. It's pretty easy. So it sees:

Fuller, J.F.C. A Military History of the Western World. Three Volumes. New York: Da Capo Press, Inc., 1987 and 1988. — v. 1. From the earliest times to the Battle of Lepanto; ISBN 0-306-80304-6: 255, 266, 269, 270, 273 (Trajan, Roman Emperor).

and gets the ISBN. I've started in on the harder problem, parsing books without ISBNs, like:

Bowersock, G.W. Roman Arabia, Harvard University Press, 1983.

It's not actually that hard. But it's fiddly. And it's one of those problems where each additional percent of accuracy costs 50% more effort.
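
For the curious, here is a minimal sketch of the ISBN pass described above. It is written in Python for illustration and is not the script that was actually run against the dumps; the names CANDIDATE, is_valid_isbn10, is_valid_isbn13 and find_isbns are made up for this example. It pulls out anything shaped like a ten- or thirteen-digit ISBN (hyphens and spaces allowed) and keeps only the candidates whose check digit validates.

```python
import re

# Anything shaped like an ISBN: an optional 978/979 prefix, then nine digits
# and a final digit or X, with optional hyphens or spaces between them.
# (Illustrative pattern; a production parser would need more care.)
CANDIDATE = re.compile(r"\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b", re.ASCII)


def is_valid_isbn10(s):
    """ISBN-10 check: weights 10 down to 1, sum divisible by 11; X means 10."""
    if len(s) != 10 or not s[:9].isdigit() or s[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """ISBN-13 check: alternating weights 1 and 3, sum divisible by 10."""
    if len(s) != 13 or not s.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def find_isbns(text):
    """Return normalized ISBNs (no hyphens) found in text that pass a checksum."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group()).upper()
        if is_valid_isbn10(digits) or is_valid_isbn13(digits):
            found.append(digits)
    return found


# A fragment of the Fuller citation quoted above.
sample = "Da Capo Press, Inc., 1987 and 1988. ISBN 0-306-80304-6: 255, 266, 269"
print(find_isbns(sample))  # ['0306803046']
```

The checksum does most of the work here: years and page numbers never reach ten digits, and a mistyped ISBN simply fails validation and is skipped.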

What are the most cited books? The most cited book on Wikipedia is... The Official Pokemon Handbook. Surprised? Don't be. In fact, eighteen of the top twenty most-cited works are Pokemon books. It boggles the mind. Somebody, or a bunch of somebodies, went ISBN-happy on all the Pokemon entries. Fortunately, the existence of so many citations to Pokemon does not impair the quality of the rest. It's just... Wikipedia. There's a decidedly quirky character to many of the other winners, testimony to some serious passions. Number 28, with 177 citations, is Richard Grimmett's Birds of India, Pakistan, Nepal, Bangladesh, Bhutan, Sri Lanka and the Maldives. I think this effect would be diminished a lot if non-ISBN books were added.

Where did this come from? I owe the idea to Lars Aronsson, who came up with a simple script and ran it against the Wikipedia dumps and posted the results on Web4Lib back in September. I wrote him soon after to see if he was going to provide a public data feed, or if he minded if I did. He did not. His results differed a bit from mine. I'll be in touch with him to square the differences.

Unfortunately, the Wikipedia data is not updated as often as one might like. The most recent dump is from November of last year. I'll keep an eye on the download page, and reparse the data when a new dump becomes available.

What's this about a feed? We're big fans of openness. And it's Wikipedia data anyway. So we've made a feed of it. You can get it here:

http://www.librarything.com/feeds/WikipediaCitations.xml.gz

UPDATE: I changed the URL and gzipped it. Needless to say, I'm not putting any restrictions on this, but if you do something cool, I'd love to hear about it.

As usual, tell me what you think.

*We've seriously considered open-sourcing LibraryThing. But given the state of the code, it would be, as Nabokov said of rough drafts, like passing around samples of our sputum. We may open-source pieces of the code, namely the pieces we're happiest about.
**LibraryThing is in the odd position of having almost as much bot traffic as we have person traffic. Google loves us. Guys, you love us too much!

22 Comments:

Blogger Caleb Bohon said...

Yes I have noticed that when I search for say an author LT is on the first or second page of results. I think that is a good thing.

2/26/2007 3:57 PM  
Anonymous Anonymous said...

It's an interesting idea, but I'm not sure it goes far enough. The true value, to my mind, would be not in seeing which/how many Wikipedia articles cite a book, but in which other books, articles, and sites are cited alongside it. If there were some way to pull that information and perhaps somehow rank it, then you'd have something really powerful.

2/26/2007 4:08 PM  
Blogger Tim said...

So, this book is cited in:

*Alexander the great, which also cites X Y and Z

?

2/26/2007 4:10 PM  
Anonymous Anonymous said...

You apparently scan only en.wikipedia.org... Is your parsing bot able to scan fr.wikipedia.org (in French)? Is everything Unicode-ready?

Clément

2/26/2007 4:22 PM  
Blogger Tim said...

Yes, it could do them all easily. I'll do them once I get some feedback on this.

What's the second-largest Wikipedia, anyone?

2/26/2007 4:31 PM  
Blogger Caleb Bohon said...

Tim look at this:

http://meta.wikimedia.org/wiki/List_of_Wikipedias

2/26/2007 4:42 PM  
Blogger Unknown said...

Can living people outgoogle google?

There was an old rock song by a group called the Godz. They had one hit.
One of the lyrics was:

"Stop the machine machine machine"

2/26/2007 8:50 PM  
Anonymous Anonymous said...

Very interesting. Re the source files you used: are the MediaWiki templates expanded out before or after the source file dump? I imagine parsing {{cite book | last = ... | first = ... | etc ... | isbn = 354063293X }} would be a lot easier than scraping the actual formatted text that it expands into.

In fact making such activities as yours more feasible was a primary reason for introducing citation templates in the first place

2/27/2007 5:11 AM  
Blogger Tim said...

Yeah, but as with much of what LT does, it's better to hit the lowest-common denominator—the mere presence of the ISBN, ten or thirteen digits with a validating checksum. This happens whether someone uses a citation, an ISBN link or just drops the ISBN in running text.

2/27/2007 7:50 AM  
Blogger librarian@play said...

Yes. Again, speaking for myself, I'm not sure what value knowing that any encyclopedia article cites which book holds. But knowing which books are cited along with it opens the possibility of discovery of other literature that is similar to the book in hand--in as much as books in a bibliography constitute a body of literature.

2/27/2007 9:12 AM  
Anonymous Anonymous said...

Replying to Tim @ 7.50am.
Oh right in that case we could look in the other direction - all that ISBN detection you've done could be used to update Wikipedia to use the citation templates. Although this would not make your life easier, it would benefit the encyclopedia - so thanks for making the data available!

2/27/2007 11:01 AM  
Blogger Kurt Beard said...

It would be nice to be able to add this information as a column in our "our Library" page so we can quickly see which books have citations. It would also be nice to have the author links in a column.

2/27/2007 11:59 AM  
Anonymous Anonymous said...

Hmm, I just checked a Wikipedia stub (Yamphu) I wrote in 2006. It has an ISBN, but LibraryThing says: Citations, Wikipedia NONE [LT#: 2263673 / ISBN 90-5789-012-7].

Stephan

2/27/2007 3:14 PM  
Blogger Tim said...

Did you go back to Nov. 5, 2006?

2/27/2007 3:16 PM  
Anonymous Anonymous said...

Ray Gray: Two links from a Wikipedia article later... would that be "Gotta Keep a Runnin"?

2/27/2007 7:15 PM  
Blogger jmnlman said...

Nice to see a reference to Fuller.:)

2/27/2007 9:00 PM  
Anonymous Anonymous said...

Tim said:
Did you go back to Nov. 5, 2006?


Thanks for the quick response! The date was October 23, 2006.

Stephan

2/28/2007 12:38 PM  
Anonymous Anonymous said...

I really like these links. I've recently begun looking into wikipedia as I enter and tag all my books to find general info on the book, original publication dates, and author information. So, my tags are as accurate as wiki.

I hope you find a good way to link without ISBNs. Relying on them limits the data quite a bit. For example, A Tale of Two Cities has no wiki links listed, even though it has its own page:

Social page:
http://www.librarything.com/work/17728

Wiki Page:
http://en.wikipedia.org/wiki/A_Tale_of_Two_Cities

3/01/2007 1:09 PM  
Blogger AbbotOfUnreason said...

Gosh, I'm a month late on noticing this. Have you thought about doing anything with the COinS microformat? I've been thinking about embedding them in my blog entries, but it would be easier if I could get the encoding from you. Then, I'm thinking it'd be pretty easy to make a greasemonkey script to link to you.

3/16/2007 5:17 PM  
Blogger Unknown said...

I'm impressed you are using the ISBN references on Wikipedia. Some of us Wikipedia editors spend quite a bit of time correcting invalid ISBNs. It's good you are getting ISBNs from the Wikipedia database dumps, instead of crawling the pages. A report of the Library Thing accomplishment has been posted in a Wikipedia discussion forum at [[CT:INV]] (entered in the Wikipedia search box). Of course you could log in to Wikipedia and add your own commentary, if you wish. I'm [[User:EdJohnston]] on Wikipedia.

4/03/2007 3:27 PM  
Anonymous Anonymous said...

Are these ISBNs harvested across all language versions of Wikipedia? or is it just the English version?

6/05/2007 4:47 PM  
Blogger Tim said...

Just the English right now.

T

6/05/2007 6:39 PM  
