Saturday, February 07, 2009

Distinct authors, phase 1 / Steve Martin is funny again

Short version. I've added a mechanism to "split" distinct authors with the same name. You can find it on the right of any author page, under "Author Disambiguation." The feature is only partially rolled out, without separate pages for distinct authors or other ramifications for the LibraryThing system.

Long version. Since its inception, LibraryThing has been plagued by the "Steve Martin" problem. We all know Steve Martin, the comic and author of Shopgirl. But what about Steve Martin the author of Britain's Slave Trade, Sold! How to Make it Easy for People to Buy from You, or some book about Newfoundland ships? Why was the original wild-and-crazy guy writing such evidently unfunny books, or who were these other people?

The problem is deep in the data. Libraries have a system for disambiguating authors, called Authority Control, based on coming up with authorized forms of a name and adding dates and other metadata to make them unique, and then applying these forms across the books. Authority control is a good idea—if often problematic to implement—but it falls down in the face of LibraryThing's data. Libraries don't coordinate their authority control as much as you'd think, and LibraryThing draws from almost 700 libraries. And even if authority control worked in libraries, 90% of LibraryThing content comes from other sources, mostly Amazon. This data has no concept of authority control. (See Steve Martin at Amazon, for example.)

In solving the problem, I decided to ignore how libraries solved the issue and concentrate on how LibraryThing could do it most easily. Authority control requires librarians to assemble data (e.g., birth and death dates) about name variants before a split is made. (Thus was born librarians' unfortunate policy of putting out hits on individuals they could not otherwise distinguish.*) Although LibraryThing members have done an amazing job finding birth and death dates, it is still a lot of work. And a full authority-control solution would have members updating each other's records with the "authorized" forms of the names!

I felt a better way could be found. Instead of establishing unique names and pushing them to records, members could split works arbitrarily, and the authors would come to be known by the name they share and the works that cluster under them. This is actually an old system: calling someone "the author of Ivanhoe" or "the one who wrote the Parthian history." And, as with other features of LibraryThing cataloging, it accords with how regular people talk. In a real-world situation, like a meeting of Newfoundland comedians, you wouldn't refer to "Martin, Steve, 1945-" and "Martin, Steve, 1947-" but "Steve Martin, you know, the one who wrote Shopgirl" and "Steve Martin, the one who wrote that book about that boat."

How it works. To split an author, find the area on the right labelled "Author Disambiguation." It will take you to a splitting page; here's Steve Martin's. This page allows you to assign all the author's works to numbers. As you assign the works, LibraryThing assigns separate colors, making it easy to see at a glance how the thing is going.
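For the curious, the numbering idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not LibraryThing's actual code: the function names `split_author` and `describe`, the dictionary of assignments, and the label format are all made up for the example. The one detail taken from the feature itself is that a work nobody has assigned stays under number 1.

```python
from collections import defaultdict

# Assumption for this sketch: works nobody has assigned stay under number 1.
DEFAULT_NUMBER = 1


def split_author(works, assignments):
    """Group one name's works by the number a member assigned to each.

    `works` is a list of titles filed under a single author name.
    `assignments` maps a title to a distinct-author number (hypothetical shape).
    """
    clusters = defaultdict(list)
    for title in works:
        clusters[assignments.get(title, DEFAULT_NUMBER)].append(title)
    return dict(clusters)


def describe(name, clusters):
    """Label each distinct author by the shared name plus one clustered work."""
    return {
        number: f"{name} ({number}), the one who wrote {titles[0]}"
        for number, titles in sorted(clusters.items())
    }


works = ["Shopgirl", "Britain's Slave Trade", "Some book about Newfoundland ships"]
assignments = {
    "Britain's Slave Trade": 2,
    "Some book about Newfoundland ships": 3,
}

clusters = split_author(works, assignments)
labels = describe("Steve Martin", clusters)
for number, label in labels.items():
    print(number, label)
```

Note that no authorized name form or birth date is needed anywhere: the distinct authors are identified only by the shared name, a number, and the works that cluster under it.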

More to do. This is just a first step. The "distinct authors" feature has to "go" all sorts of places on the site. First up will be separate pages for distinct authors--and a "disambiguation page" (a la Wikipedia) tying them together. Once that's done we can move on to separating author metadata, such as Common Knowledge, between distinct authors.

Quite frankly, I'm going to do a few more things and then let this sit for a while. My main focus right now—and Chris'—is to see "collections" to the finish line. When I realized I could bang out the first phase of distinct authors in a long evening (it's after 5am now), I went ahead and did it. But now I need to refocus on collections.

Talk about it. I've set up a New features post to discuss the change and its potential ramifications. I suspect that the Combiners! group will get in on the act quickly as well, working out various technical issues. They have a number of threads (here, here and here, at least), in which members have made lists of "identically named authors." Those lists would be a good starting point.

*The hits are, of course, carried out by OCLC.

Blogger Donogh said...

I'm sure this'll be a big hit with people - nice one!

2/07/2009 5:19 AM  
Anonymous Anonymous said...

Very nice, well done. Bob

2/07/2009 5:30 AM  
Anonymous Anonymous said...

A different and neat way to solve the problem. Thanks!

2/07/2009 7:23 AM  
Anonymous Anonymous said...

Ditto on all above. This is great. I'm off to test!

2/07/2009 7:27 AM  
Anonymous Anonymous said...

Very cool. (Of course, I'm biased, I've been wanting a system like this for libraries for a while. Somehow day to day stuff consumes my work hours and I just haven't done any projects at home for a while...)

The current state of the authority system for libraries is understandable. It probably was the best system for most libraries back when they were filing cards. Hence the extreme focus on finding a "unique string" to be filed by rather than just assigning them some sort of identifier. The sorting mechanism was also the user interface. We have the luxury now of being able to hide the identifier from users and offer much more useful information.

One could imagine that the Library of Congress could have tried to publish some sort of similar thing to the LCSH for authors, but it would have been expensive and it would have devoted a lot of space when most authors (at least allegedly from what I have heard) publish once or twice. Lookup time of course would have also been slow, even for the fastest librarians it would probably add five minutes to each record. So instead librarians just kept a smaller list of authority headings for just their institution and created new ones only when there would be a duplicate entry for two obviously different authors.

Of course now I find that authority records for many have become a form of tradition. It's really hard for them to think about ideas like not spending time finding a unique string and instead supplying information. (Or the fact that all we need to make sure is that the works are linked to an author id and then we could automatically display all the works in an "authors" page.)

I'm babbling though. Just thought some folks might want a little insight into why libraries' authority data isn't the greatest.

2/07/2009 9:49 AM  
Blogger Andrew said...

Tim, kudos on finishing a very useful feature in a clever way!

One issue that I've just discovered: if you make an author distinct, then merge the author with another, the distinctions are lost and need to be reedited.

2/07/2009 10:35 AM  
Anonymous Anonymous said...

Looks good. Been waiting to get these people disambiguated and clustered (see John Randolph)

Speaking from my experience as a cataloger, libraries do not usually maintain their own authority records. They use the records from the national Name Authority File (NAF). These records try to maintain different forms of entry for distinct authors across the entire universe of names (not just the ones in use at one particular library).

There are records for names that are undifferentiated (the same name form is used for multiple authors with the different works cited in the authority record body)

New names can be added to the national files by individual libraries through the NACO program.

Authority control is not simply a form of tradition. These authority records disambiguate authors and aggregate them in a catalog display. The authority records themselves often give librarians interactive control over the entries in the bibliographic records in a catalog.

But this new system you've developed looks like a very interesting way to get these authors disambiguated with distinctive author pages. (I remember one very angry author who refused to help when he saw his author page included other people's works.)

2/07/2009 10:55 AM  
Blogger Katya said...

The grand irony here? I'm using WorldCat and LC authorities records to split the authors. ;)

2/07/2009 11:33 AM  
Blogger Martin K Jones said...

Yay! Collections!

2/07/2009 2:37 PM  
Blogger Tim said...

It's too bad the Gorman wasn't Michael Gorman...

2/07/2009 2:41 PM  
Anonymous Anonymous said...

I don't have to put out a hit - I work in Cataloguing-in-Publication - instead I get to send the publisher an exasperated email requesting that they distinguish between their John Smith and the other handful on the catalogue.

2/07/2009 8:44 PM  
Anonymous Anonymous said...

Great! Thanks!

2/08/2009 4:50 AM  
Blogger Robert Seddon said...

Yay; though I wonder why the default is '1' rather than 'unknown'.

2/08/2009 9:51 AM  
Anonymous Anonymous said...

Thank you!

2/09/2009 10:17 AM  
Anonymous Anonymous said...

Why too bad my first name isn't Michael? I can't tell if it's because my comment (yes, hastily written) would have carried more weight or if people would be happy to see something like that from the glorious past president of ALA ;).

Sorry, but NAF and NACO seem painfully slow. For all practical intents and purposes if you attempt authority control, you'll diverge from NAF for at least some headings. Or you just export to some other vendor who claims to do NAF, but it's rarely double-checked.

My comment about authority practice becoming a tradition is more to reflect that, in my cataloging experience (hey, what do you know, I got an MSLIS too), there isn't a lot of questioning or experimentation going on from either the systems side or the cataloging side when it comes to authorities.

Undifferentiated names are a great example. Why don't we just create different authority records for any work where we cannot positively identify the author? Then make it easy to merge or split authors if more information ever becomes available? Why in this day and age, when it's so easy to create indexes, browse lists, etc., do we still rely on mechanisms that are based on printed organizational systems?

Of course, I'm somewhat exaggerating. There are people talking about change and experimenting in the library world, but it seems that there's far less than I would hope there to be.

2/09/2009 12:08 PM  
Anonymous Anonymous said...

Is anything in the works to solve the problem of joint (or multiple) authors?

2/12/2009 7:13 PM  
Blogger Brunellus said...

Just noticed this. Wonderful stuff – thanks.

2/15/2009 3:07 PM  
Blogger souci said...

It's a great idea to have users sort authors by works. Now how can we remove a disambiguated other author (David Mitchell #11) who is a LibraryThing author from our connections, when our books are by David Mitchell #1?

2/22/2009 7:50 AM  

