This month we chat with longtime friend of Nestoria, Simon Wistow. Simon is Senior Search Engineer at Scribd - the world’s largest social publishing and reading site. Previously he survived the first Dot.Com boom and bust cycle, was a Senior Search Engineer at Yahoo! Europe and worked at blogging company Six Apart on search and a variety of other open web and social technologies. He also spent several years working on R&D for a visual effects company and once worked on a cattle and sheep farm in the Australian Outback - mostly because it seemed like a good idea at the time.
Simon, thanks for speaking with us about the state of search.
1. Scribd is one of the largest document sets in the world. What are some of the technical challenges of searching it?
The most immediate problem is that the documents tend to be so much larger: our average document is 10 pages long, which translates to about 3000 words or 21,000 characters. This is much longer than most web pages and certainly more than the average blog post. This in turn makes our indexes much bigger and means that some of the standard tricks, such as boosting the importance of the title, have to be more finely calibrated. For example, if a user searches for “Alexander” then a one-page document with one sentence and a title that exactly matches the search term isn’t likely to be as good ‘quality’ as a 20-page document on Alexander the Great and the Macedonian empire. It’s a slightly contrived example but we see the same class of problems a lot.
The other problem we have is the fact that we index so many different types of documents. It can be quite tricky to index both a thesis in PDF format and a PowerPoint presentation since they have very different information densities. Density doesn’t necessarily translate to quality. On top of that we have to handle documents in about 70 different languages and varying layers of privacy controls (private and public documents, offered for sale and for free). Because we have so much “long tail” content, popularity-based boosting in a style similar to Google’s PageRank algorithm isn’t an automatic win.
Small companies like Scribd deal with the volume of data that only huge multi-national web companies were able to handle just a few years back. While the hardware has improved a lot I still think it’s pretty amazing what we’re doing on comparatively tiny budgets.
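To make the “Alexander” example concrete, here is a minimal sketch of a title boost that is damped for very short documents. It is an illustration only, not Scribd’s ranking code: the function name `title_boost`, the `base_boost` value and the `pivot_words` constant (set to the 3000-word average mentioned above) are all assumptions made for this example.

```python
def title_boost(exact_title_match: bool, word_count: int,
                base_boost: float = 2.0, pivot_words: int = 3000) -> float:
    """Return a multiplier to apply to a document's text relevance score.

    Hypothetical heuristic: the full title boost is only approached by
    documents that are long relative to `pivot_words`; a one-sentence
    document with a matching title gets almost no boost.
    """
    if not exact_title_match:
        return 1.0
    # Saturating length factor in [0, 1): 0 for an empty document,
    # 0.5 at the pivot length, approaching 1 for very long documents.
    length_factor = word_count / (word_count + pivot_words)
    return 1.0 + (base_boost - 1.0) * length_factor


# A one-sentence document titled "Alexander" versus a 20-page document
# (roughly 6000 words) with the same title match.
short_doc = title_boost(True, word_count=10)
long_doc = title_boost(True, word_count=6000)
print(short_doc < long_doc)  # the long document receives the larger boost
```

The saturating curve is one design choice among many; the point is only that the weight given to a title match can depend on how much content stands behind the title.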
2. What are the interesting new trends you see in search?
If real estate is all about location, location, location then search is all about context, context, context. And part of that context is location. If I search for “taqueria” then that’s a very broad search term. But, if you happen to know that I’m searching on my mobile phone and that I’m currently standing in the Mission district in San Francisco, then suddenly you can give me much higher quality results. And the more information you know about me, the better the results can be. If I search for “polish” then you can serve me up different results depending on whether I’ve just purchased a new table off eBay or whether I’ve just booked plane tickets to Krakow.
As for more esoteric trends, there are a few new technologies beyond the standard inverted index setup which I think are really interesting. Latent Semantic Indexing and Contextual Network Graphs give scarily good results - they return documents that are about the same concept that you’re searching for rather than just documents with the exact terms. This means querying for “Kitty Litter” will return documents that refer to it as “Cat Toilet” as well because both those phrases are in the same conceptual locality. Classification algorithms like Support Vector Machines are pretty damn sexy as well.
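The “polish” example can be sketched as a simple context-aware re-ranking step. This is a toy heuristic written for illustration, not a technique named in the interview: the `rerank` function, the `tags` field on each result and the `weight` parameter are all hypothetical.

```python
def rerank(results, context_tags, weight=0.5):
    """Re-order search results using what is known about the searcher.

    results:      list of dicts with "title", "score" and "tags" (a set)
    context_tags: set of tags inferred from the user's recent activity
    Each tag shared between a result and the context raises that
    result's score by `weight` times its base score.
    """
    def contextual_score(result):
        overlap = len(result["tags"] & context_tags)
        return result["score"] * (1 + weight * overlap)

    return sorted(results, key=contextual_score, reverse=True)


results = [
    {"title": "Choosing a furniture polish", "score": 1.0, "tags": {"furniture"}},
    {"title": "Polish phrasebook", "score": 1.0, "tags": {"travel", "poland"}},
]

# Someone who has just booked flights to Krakow.
traveller = rerank(results, {"travel", "poland"})
# Someone who has just bought a table.
table_owner = rerank(results, {"furniture"})
print(traveller[0]["title"])
print(table_owner[0]["title"])
```

The same query produces a different top result for each user because only the context changes, which is the point being made above.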
3. One theory in the search world is that search (and information organization generally) will go “social” with Twitter, Facebook and others heralded as the future of information organisation. Your thoughts?
The other opportunity that sites like Facebook and Twitter represent is that it’s a much faster moving world. At one point Google only re-indexed everything once every 40 days or so; now they’re adding content in real time from Twitter’s fire hose. This leads to some interesting issues with index latency and replication but also allows for some interesting ranking based on timeliness, trending topics and ephemerality. Again, context is king here. If I’m searching for something and then one of my friends searches for similar terms soon after then it’s a pretty good bet that she’s looking for the same thing as I am. Based on that we can assume the results I clicked on would be more relevant to her. Using technologies like Multi Layer Perceptron Networks we can do some really interesting result time boosting based on what other users and, more importantly, your friends are doing.
The flip side of this, of course, is privacy. If I’m logged out and I search for a phrase and get regular results but then log in and search for the same thing again and get more, shall we say, adult results then I know that’s because of my friends. If I only have one friend then he’s well and truly busted. Or, less cynically, if you’re searching for a place to hold your birthday party and you’re getting a lot of results for places that specialise in throwing surprise parties then it kind of undermines your friends’ efforts. You could get around this by only using this data if the user has over a certain threshold number of friends. But incidents like the AOL search log leak and the extracted data from the Netflix challenge have shown that even if you try really, really hard to scrub this stuff, inevitably some information is still recoverable.
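The friend-threshold idea can be sketched in a few lines. Note that this swaps in a plain click-count heuristic rather than the Multi Layer Perceptron Networks mentioned above, and that `social_boost`, `min_friends` and `weight` are hypothetical names and values chosen for the example.

```python
def social_boost(base_scores, friend_clicks, friend_count,
                 min_friends=5, weight=0.1):
    """Boost results that the searcher's friends recently clicked on.

    base_scores:   dict mapping document id -> relevance score
    friend_clicks: dict mapping document id -> number of friend clicks
    friend_count:  how many friends the searcher has
    Friend data is ignored entirely below `min_friends`, so that a
    changed ranking cannot be traced back to one or two individuals.
    """
    if friend_count < min_friends:
        return dict(base_scores)
    return {
        doc: score * (1 + weight * friend_clicks.get(doc, 0))
        for doc, score in base_scores.items()
    }


scores = {"doc_a": 1.0, "doc_b": 1.0}
clicks = {"doc_b": 3}

lonely = social_boost(scores, clicks, friend_count=1)    # unchanged
popular = social_boost(scores, clicks, friend_count=40)  # doc_b boosted
print(lonely == scores, popular["doc_b"] > popular["doc_a"])
```

As the answer above points out, a threshold like this reduces the most obvious leak but does not by itself guarantee that nothing about individual friends is recoverable.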
4. Before moving to California you were a longtime member and organizer of London’s Perl community (Perl of course being the main programming language used here at Nestoria) and worked at several internet/technology companies. How does the Silicon Valley scene differ from London?
Well, I live and work in San Francisco itself rather than down in the Valley, although I have lots of friends who do both. The vibe in the city is different from down there - you have to drive everywhere in the Valley so it’s a much less social atmosphere. That said, San Francisco and the whole Bay Area is different from London. Plenty of people have elaborated at length on the subject and, while I’ve rarely outright disagreed with what they’ve said, I’ve never entirely agreed with them either. I’m not sure how to put my finger on it exactly, to be honest.
In general, there’s a lot less stop energy here. If you tell people an idea then they will generally overwhelm you with enthusiasm whereas, to some extent, in London, they will patiently and non-maliciously tell you all the reasons why that’s a stupid idea and what the obvious and gaping flaws are. Which can be enough to deflate you and knock all that crucial early momentum out of you. The flip side is that if you propose a stupid idea in San Francisco then you may not get that reality check you so sorely need, which is why you occasionally see proposed standards and protocols and products which just make you wonder what the hell the author was thinking. Sometimes it feels like there’s a willful ignorance of the lessons learnt from history and of the greater world outside the rarefied echo-chamber bubble that we exist in.
Also, while San Francisco has stuff like Maker Faire and Bacon and Cupcake Camp, I think people are more likely to do startups which involve something physical in the UK. I’m sure people can point out counterexamples but places like Newspaper Club and Moo seem to be somehow a British thing.
Other than that they’re actually remarkably similar. Hell, there’s such a flow of people back and forth across the Atlantic that I see some friends more often here than I did back in London.
Simon, thanks for the detailed insights. Couldn’t agree more that it’s amazing what small companies are technically able to do these days. Likewise we can confirm that different types of data require totally different search thinking. To learn more, or to try to understand what sort of twisted thinking would lead a perfectly normal British man to go work on an Australian sheep farm, I invite all readers to follow Simon over on Twitter, where he goes by the moniker @deflatermouse.
Past Nestoria interviews: Chris Osborne, Kevin Burke, and Nick Turner-Samuels.