Saturday, November 18, 2006

8,388,608 books—sort of

Last night LibraryThing hit 8,388,608 books. That's not books in LibraryThing, which stands at 7,268,540, but books ever in LibraryThing, including ones later deleted and some shadows. You might not think it, but 8,388,608 is a significant number. It's half of 2^24, the number of values you can store in three bytes. It's also one past the limit for MySQL's signed MEDIUMINT, which tops out at 8,388,607. In binary that limit is 11111111111111111111111: twenty-three ones. The drawers are full of ones and there ain't no twos.
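
For the curious, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not LibraryThing's code; the helper name int_range is made up for the example. It shows that a three-byte signed integer tops out one short of 2^23, and that the unsigned version of the same field would have had twice the room.

```python
def int_range(num_bytes: int, signed: bool) -> tuple[int, int]:
    """Return (min, max) for an integer stored in num_bytes bytes."""
    bits = 8 * num_bytes
    if signed:
        # One bit is spent on the sign, so only bits - 1 carry magnitude.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


signed_min, signed_max = int_range(3, signed=True)
unsigned_min, unsigned_max = int_range(3, signed=False)

print(signed_max)       # 8388607
print(unsigned_max)     # 16777215
print(2 ** 24 // 2)     # 8388608, the first number that no longer fits
print(bin(signed_max))  # "0b" followed by twenty-three 1s
```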

Anyway, we hit the brick wall last night. I had previously expanded the book number field, but I forgot to change the databases that store some related metadata and reviews. So, last night, you couldn't add a book, and this morning you couldn't review one.
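
A check along the following lines might catch this kind of mismatch before it bites. It is a rough sketch under assumptions: the table and column names are hypothetical, the widened key type is assumed to be INT (the post doesn't say what it was changed to), and the schema is typed in by hand rather than read from the database.

```python
# Largest value each MySQL signed integer type can hold.
SIGNED_MAX = {
    "TINYINT": 2 ** 7 - 1,
    "SMALLINT": 2 ** 15 - 1,
    "MEDIUMINT": 2 ** 23 - 1,
    "INT": 2 ** 31 - 1,
    "BIGINT": 2 ** 63 - 1,
}


def columns_too_narrow(key_type, referencing_columns):
    """List columns that cannot hold every value of the key they point at."""
    limit = SIGNED_MAX[key_type]
    return [
        name
        for name, col_type in referencing_columns.items()
        if SIGNED_MAX[col_type] < limit
    ]


# Hypothetical schema: the main key was widened, two related tables were not.
related = {
    "reviews.book_id": "MEDIUMINT",
    "metadata.book_id": "MEDIUMINT",
    "tags.book_id": "INT",
}

print(columns_too_narrow("INT", related))
# ['reviews.book_id', 'metadata.book_id']
```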

I'm really sorry about this. We're good to go now. We won't hit another wall until 8.4 billion books.

Interestingly, the same thing happened to Slashdot last week. Even Homer nods.

PS: I also fixed a bad problem with "search all fields." Some queries ran quickly but some took ten or twenty minutes, by which time the user has generally gone on to better things (after re-running the query a dozen times which, let me tell you, doesn't help much). It turns out MySQL was making periodic mis-guesses about which index to use. Somehow the index with eight million integers looked better than the one with a few hundred strings.
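
How the index problem was fixed isn't spelled out here. As a general illustration of the idea (look at which index the planner picks, and pin one if it picks badly), here is a self-contained sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table, columns, and index names are invented for the example. SQLite's hint is INDEXED BY; in MySQL the comparable tools are EXPLAIN and index hints such as USE INDEX or FORCE INDEX.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, owner_id INTEGER, title TEXT)"
)
conn.execute("CREATE INDEX idx_owner ON books (owner_id)")
conn.execute("CREATE INDEX idx_title ON books (title)")
conn.executemany(
    "INSERT INTO books (owner_id, title) VALUES (?, ?)",
    [(i % 50, f"Title {i}") for i in range(1000)],
)


def plan(sql, params):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each row.
    return " | ".join(row[-1] for row in rows)


query = "SELECT id FROM books {hint} WHERE owner_id = ? AND title = ?"
params = (7, "Title 7")

# Whatever the planner guesses on its own.
default_plan = plan(query.format(hint=""), params)

# The same query with the index chosen explicitly.
forced_sql = query.format(hint="INDEXED BY idx_title")
forced_plan = plan(forced_sql, params)
rows_forced = conn.execute(forced_sql, params).fetchall()

print(default_plan)
print(forced_plan)
```

Pinning an index trades flexibility for predictability: the query stops benefiting if the planner later learns a better route, so it is usually a last resort after refreshing the table statistics.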

10 Comments:

Anonymous said...

I am of the opinion that 8 million is a very small number. Most of the big libraries have 20+ million and I've heard estimates of over 150 million possible books out there. Assuming 150 million, why have we only cataloged about 5%? Just about every new book I add to LT has already been added by someone else, which tells me that there is a long tail effect, where 90% of the books most people own are in the top 5% of all possible books. This is probably not the right thread to discuss this.

11/18/2006 8:52 PM  
Blogger Tim said...

lr: At a couple points it skipped numbers, so some numbers never had a book.

Anonymous: You're actually wrong there. Very few libraries have 20 million—only one library in the US has that many. I've never found a good list of world library holdings, but from what I can see the BL and BNF have 10 and 11 million respectively. What would be larger than them? Note that we're talking *volumes*. There are all sorts of other things in libraries and all sorts of ways of measuring them.

Within the United States, LibraryThing now surpasses all but fifteen libraries. Adding about one million books/mo, LibraryThing will be in second place—15.2 million—in eight or nine months. The Library of Congress, at 29 million, is a lot more. (See http://www.ala.org/ala/alalibrary/libraryfactsheet/alalibraryfactsheet22.htm)

Now, LibraryThing's 7.2 million count is of volumes, including duplicates. It's *VERY* hard to count "different books." The question is almost philosophical. LibraryThing's "works" system combines different editions so that, in theory, every Homer's Iliad falls under one work. By this count, LibraryThing has 1.39 million works. If you count unique ISBNs it's about this number. Of course, many books don't have ISBNs.

Anyway, all of this comes to the following: LibraryThing has a lot of data about books, but that shouldn't be taken too far. Having multiple records in LibraryThing adds to the metadata in the system (e.g., we have thousands of tags for many books), but there's still something odd about having eight thousand copies of the latest Harry Potter. LibraryThing's collection would surely lose out on coverage to all but the smallest college libraries. Most importantly, it's not a real library, of course. It's all a question of data, not stuff you can borrow and use like a real library.

LibraryThing is small compared to the LC, but it's getting up there compared to other collections. It is, however, dwarfed by all the books ever printed and by the books sold. The simple fact is that most books are in private hands, not in libraries. LibraryThing tries to get at those books, and make something of them.

LibraryThing's holdings are a CLASSIC long tail—no question about it. I need to spend some time and generate the graph.

11/18/2006 10:42 PM  
Anonymous said...

Why do you need a signed integer to count books? When are you ever going to have negative books?

11/19/2006 1:12 AM  
Blogger Tim said...

Mike: You're so right. I probably wasn't thinking.

11/19/2006 1:24 AM  
Anonymous said...

It seems from the individual who has some 13,000 books listed, that you no longer have to own books to list them in LibraryThing. You just have to be able to get up a list of specific books you say you desire to own.

Or am I misunderstanding something?

I'd just like to know if the books listed in LibraryThing are books people actually own, or now include also books people want to own.

Ellen

11/20/2006 10:44 AM  
Anonymous said...

@Ellen

Whatever made you think people would limit themselves to books they own? While the site's basic premise was to let people catalog what they actually own, it is the peculiarity of social software to end up being used in unexpected ways. Look at the Flickr colorpickr, for example.

11/20/2006 3:19 PM  
Anonymous said...

I knew all along that people could cite as books they owned any book they wanted to cite. Nevertheless, now that it's becoming a norm to the point that thousands of books may be so cited (or imagined, invented, copied out of other highly varied databases and then in one case boasted about), the whole nature of what's put onto Library Thing changes fundamentally.

As I've said before, I've found using the software in the way I originally envisaged (searching to find books in my own library) not useful. The name (Library Thing) misled me.

E.

11/21/2006 6:34 AM  
Blogger Tim said...

Well, I think the vast majority of books on LT are owned, with some read and returned to the library, lost, etc. Wish list books are maybe 2%. I'm not sure there are any other categories that matter. The person who tried to load the contents of Project Gutenberg was stopped.

As for searching, are you finding the search doesn't work? I fixed something a few days ago that should have helped that a lot. Do you have specific suggestions?

Lastly, if the point to you is searching for items within your personal library, why do you care about what other people have in theirs, where they get it, etc.? LibraryThing has a quite sharp distinction between your data and others'. They're not changing your data. It's in the catalog the way you entered it. It's only on global pages (and on the green data in your catalog, which is removable) that other users' data comes through.

11/21/2006 10:15 AM  
Anonymous said...

Quite right. I shouldn't care whether the books in other people's libraries are not owned by them. As I said before I understood immediately someone could type out data for a book he or she didn't own.

I was startled by the person with some 13,000-odd books who announced on his profile that about two-thirds were books he desired to own.

I do sometimes though look at other people's libraries under specific authors' names. I am then using Library Thing as a convenient bibliography. I've been told by others that they use my catalogue that way. Thus look up Burney and you have many books about her as well as by her. I like to think the book is somehow worth while since the person bought and kept it.

As to problems, I'll try to remember them. It's the usual thing of struggling to find books in my library I know I have, so instead I have to find them on the shelves. Part of the point of cataloguing them all, for me, was that this wasn't going to happen any more. I was going to find my books with ease.

Ellen

11/24/2006 5:59 PM  
Anonymous said...

Have you looked at PostgreSQL? In one situation I found it had less of a tendency than MySQL to guess, which, although it caused a bug in my application, helped me appreciate PostgreSQL even more than I already did. In general I have found PostgreSQL to be excellent, and I think guessing is generally a bad thing for databases to do, given their focus on data integrity.

11/27/2006 10:41 PM  
