Fellow Nestoriticians!
Today we wanted to give you a bit more insight into some of the challenges we face in building Nestoria. In attempting to provide our users with the easiest way to search for property in the UK, we consider four major factors: comprehensiveness, usability, relevancy, and freshness.
Comprehensiveness is seemingly the simplest of these to measure. Essentially it is asking - “how many properties are there in the database?” As with many things that seem simple at first glance, though, the actual answer is not so easy. The question is whether you measure the gross number of properties or the net. Of all the raw properties that come in, we unfortunately find some that are spam, and of course we don’t want to show those to our users. Next, we also attempt to remove non-residential properties. Then there is the significant number of sold or ‘sold subject to contract’ homes that we need to strip out. Detecting all of these types of ‘bad’ listings is conceptually straightforward (which isn’t to say we’re perfect - don’t hesitate to let us know when one has slipped through our nets).
The final challenge we face is a bit more difficult. Because we have listings from many sources, we often have to grapple with duplicates - the same property arriving from more than one source. Spotting them is often not trivial, because the same house can have a different description or slightly different address details. Often the data from different sources disagrees slightly; source A may tell us the property is a freehold, while source B thinks it’s a leasehold. With limited and/or conflicting information, the decision about what is and what isn’t a duplicate isn’t always clear. And of course the universe of properties we have to consider is continually changing - homes are continually coming on and off the market.
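
To make the duplicate problem a little more concrete, here is a minimal sketch in Python. It is an illustration only, not the matching logic Nestoria actually runs. The field names (`address`, `postcode`, `bedrooms`, `tenure`), the abbreviation table, the 0.85 threshold, and the example listings are all assumptions made up for this example. The idea is to ignore fields that sources tend to disagree on (such as tenure), insist that postcode, bedroom count, and house numbers match exactly, and then compare the cleaned-up addresses fuzzily.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a realistic one would be much longer.
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue", "ln": "lane"}


def normalise_address(address: str) -> str:
    """Lowercase, replace punctuation with spaces, expand abbreviations."""
    words = re.sub(r"[^a-z0-9\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def number_tokens(normalised: str) -> set:
    """Tokens starting with a digit (house or flat numbers such as '12' or '12a')."""
    return {word for word in normalised.split() if word[0].isdigit()}


def likely_duplicates(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Guess whether two listings describe the same property.

    Tenure and description are deliberately ignored, because different
    sources often disagree on them for the same home.
    """
    def compact(postcode: str) -> str:
        return postcode.replace(" ", "").upper()

    if compact(a["postcode"]) != compact(b["postcode"]):
        return False
    if a["bedrooms"] != b["bedrooms"]:
        return False

    addr_a = normalise_address(a["address"])
    addr_b = normalise_address(b["address"])
    # '12 Church Road' and '14 Church Road' are textually close but are
    # different homes, so numbers must match exactly rather than fuzzily.
    if number_tokens(addr_a) != number_tokens(addr_b):
        return False
    return SequenceMatcher(None, addr_a, addr_b).ratio() >= threshold


# Invented example listings, as two different sources might describe them.
source_a = {"address": "12 Church Rd., Richmond", "postcode": "TW9 1AA",
            "bedrooms": 3, "tenure": "freehold"}
source_b = {"address": "12 church road richmond", "postcode": "tw91aa",
            "bedrooms": 3, "tenure": "leasehold"}
neighbour = {"address": "14 Church Road, Richmond", "postcode": "TW9 1AA",
             "bedrooms": 3, "tenure": "freehold"}

print(likely_duplicates(source_a, source_b))   # True
print(likely_duplicates(source_a, neighbour))  # False
```

Even in this toy version the trade-off is visible: loosen the rules and neighbouring houses get merged, tighten them and genuine duplicates slip through.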
One possible solution you might propose is to analyze the photos of the property. This occasionally works, but even if they are the same original photos they may differ slightly in size, cropping, sharpness, red-eye reduction (just kidding), or image quality. All of this makes the images look the same to the human eye, but different to a computer. Here are some examples we found recently of duplicates with slightly different photos of the same house:



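
For the curious, one well-known way to compare photos that are "the same to the eye" is a perceptual hash. The sketch below is a minimal average hash in Python, shown purely as an illustration and not as the technique Nestoria uses. It assumes the photo has already been decoded into a grid of grayscale values (0 to 255) that is at least 8 pixels wide and tall; the function names and the synthetic test images are made up for this example. It tolerates resizing and uniform brightness changes, but cropping shifts the content and can still defeat it, which is part of why photo matching only occasionally works.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint of a grayscale image.

    `pixels` is a list of rows of brightness values. The image is shrunk
    to a size x size grid by block averaging, and each cell contributes
    one bit: 1 if it is brighter than the mean of all cells, else 0.
    Assumes the image is at least `size` pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(size):
        top, bottom = row * height // size, (row + 1) * height // size
        for col in range(size):
            left, right = col * width // size, (col + 1) * width // size
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            cells.append(sum(block) / len(block))

    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two fingerprints differ."""
    return bin(hash_a ^ hash_b).count("1")


# Synthetic stand-ins for photos: a left-to-right gradient, the same
# gradient at double resolution, a brighter copy, and a top-to-bottom one.
original = [[x * 16 for x in range(16)] for _ in range(16)]
resized = [[(x // 2) * 16 for x in range(32)] for _ in range(32)]
brighter = [[value + 10 for value in row] for row in original]
different = [[y * 16 for _ in range(16)] for y in range(16)]

print(hamming_distance(average_hash(original), average_hash(resized)))
print(hamming_distance(average_hash(original), average_hash(different)))
```

A small distance suggests the two photos show the same scene; where to draw the line between "small" and "large" is a judgment call that would need tuning on real listings.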