
An Algorithm for a Better Bookshelf


Drop in at a library, and you’ll likely notice that most shelves aren’t full—librarians leave some empty space on each shelf. That way, when they get new books, they can slot them into place without having to move too many other books.

It’s a simple-enough idea, but one that arises in a host of settings in computer science that involve sorted data, such as an alphabetically ordered census repository, or a list of connections between members of a social network. In such situations, where the entries can number in the hundreds of billions, the strategic positioning of empty spaces takes on great significance.

“Problems are getting bigger and bigger as we get more and more data,” said Helen Xu, an assistant professor in the School of Computational Science and Engineering at the Georgia Institute of Technology (Georgia Tech) in Atlanta. “At these scales, it becomes important to efficiently manage how you add new entries.”

The bookshelf problem (which computer scientists call the “list labeling” problem) is one of the most basic topics in the field of data structures. “It’s the kind of problem you’d teach to freshman or sophomore undergraduates,” said Guy Blelloch, a professor of computer science at Carnegie Mellon University in Pittsburgh.

Yet until recently, there was a wide gap between what computer scientists could achieve algorithmically and what they knew about the theoretical lower limit on how many books you should expect to have to move when a new book arrives. On the algorithmic side, “There was pretty much no progress for 30 years or more,” Blelloch said.

Now, researchers have come up with a new algorithm that comes close to that theoretical lower limit. Blelloch described it as “a very elegant result.”

The new approach will “hopefully open the door to new applications of list labeling in settings where it wasn’t useful before because the cost was infeasible,” said William Kuszmaul, an assistant professor of computer science at Carnegie Mellon University and one of the researchers who came up with the new algorithm.

The list labeling problem starts off with a bookshelf of size n, along with some upper limit—say, 75% or 90%—for how much of the bookshelf you’re allowed to fill. There’s some chosen ordering for books (say, alphabetical by author), and as books arrive, one by one, you choose spots for them that respect the ordering, moving other books as needed. To find book-placing algorithms that are robust under many different conditions, computer scientists imagine that the books are being sent by an adversary who knows the algorithm and is trying to force you to move as many books as possible. (In some versions of the problem, the adversary may also remove books.)

If you were to use the most naïve possible algorithm—the one that places each book as close as possible to the start of the bookshelf—then your adversary could force you to move every book you’ve placed so far, simply by sending you books that precede every book they’ve already sent. As the bookshelf fills up, the cost of accommodating new books becomes proportional to n.
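The naive strategy is easy to simulate. In this sketch (names and details are illustrative, not from the article), each book is packed as far left as possible, and the adversary sends books in descending order so that every arrival must shift every book placed so far:

```python
def naive_insert(shelf, book):
    """Insert `book` at its sorted position, packed leftward; return how many books moved."""
    pos = 0
    while pos < len(shelf) and shelf[pos] < book:
        pos += 1
    shelf.insert(pos, book)       # shifts everything after `pos` one slot right
    return len(shelf) - pos - 1   # number of books that had to move

shelf = []
# Adversarial order: each new book precedes all previous ones.
moves = [naive_insert(shelf, b) for b in range(9, -1, -1)]
print(moves)       # [0, 1, 2, ..., 9] -- the k-th arrival moves k-1 books
print(sum(moves))  # 45 total shifts for 10 books
```

The k-th insertion moves k − 1 books, so the average cost per insertion grows in proportion to the number of books already shelved, which is what "proportional to n" means here.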

In 1981, computer scientists came up with a much better algorithm, whose average cost for adding a new book is only about (log₂ n)². This algorithm starts by dividing the bookshelf into two equal chunks; then it divides each of those chunks in half, and so on. For each different size scale, the algorithm sets a threshold for how full the chunks of that size are allowed to be, with small chunks allowed to be fuller than large chunks. Once the books start arriving, if a new book pushes some chunk over its density threshold, the algorithm spreads out the books in the next larger chunk so they are evenly spaced, easing the pressure on the smaller chunk.
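A minimal sketch of this density-threshold idea is below. The specific thresholds and chunk sizes are assumptions for illustration, not taken from the 1981 paper: the shelf is an array with gaps, chunks are aligned windows of size 2, 4, 8, …, n, and an insertion walks up to the smallest enclosing chunk that is still under its density cap, then evenly respreads that chunk's books:

```python
import math

class Shelf:
    """Illustrative sketch of hierarchical density thresholds (details assumed)."""

    def __init__(self, n=16):
        self.slots = [None] * n          # None marks an empty slot
        self.n = n
        self.levels = int(math.log2(n))  # chunk scales: 2^1 .. 2^levels

    def _density_ok(self, count, size):
        # Larger chunks get stricter caps: interpolate from 1.0 at the
        # smallest scale down to 0.75 for the whole shelf (illustrative numbers).
        depth = int(math.log2(size))
        cap = 1.0 - 0.25 * (depth - 1) / max(1, self.levels - 1)
        return count < cap * size

    def insert(self, book):
        # Find the slot index where `book` belongs in sorted order.
        pos = 0
        for i, b in enumerate(self.slots):
            if b is not None and b < book:
                pos = i + 1
        pos = min(pos, self.n - 1)
        # Walk up through enclosing chunks until one is under its threshold.
        size = 2
        while size <= self.n:
            start = (pos // size) * size
            chunk = self.slots[start:start + size]
            count = sum(x is not None for x in chunk) + 1  # +1 for the new book
            if self._density_ok(count, size):
                # Evenly respread the chunk's books plus the new one.
                books = sorted([x for x in chunk if x is not None] + [book])
                self.slots[start:start + size] = [None] * size
                step = size / len(books)
                for j, b in enumerate(books):
                    self.slots[start + int(j * step)] = b
                return
            size *= 2
        raise RuntimeError("shelf over capacity")

s = Shelf(16)
for b in [5, 1, 9, 3, 7, 2, 8]:
    s.insert(b)
print([x for x in s.slots if x is not None])  # books stay in sorted order
```

Because overfull small chunks trigger a respread of the next larger chunk, expensive reorganizations happen rarely, and the work averages out to polylogarithmic cost per insertion rather than the naive algorithm's linear cost.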
