Wednesday, March 28, 2007

Ideas from Jeff Jonas

There were some powerful ideas in Jeff Jonas's talk on analytics that I think are quite reasonable to implement in a simple form. One is the idea of "persistent queries"

Most of Jeff's work has been on identifying people - e.g. terrorists and criminals. For example, you get a tip that a criminal is flying into an airport using a certain name etc. So you query the passenger lists but you don't find anything. You're not sure when he's coming in, so you can keep querying every day or hour, but that's not really practical. Instead, you make the query "persistent" so if new data arrives that matches your query you will be notified.

The naive way to implement this is to simply run the query against incoming data. But that doesn't scale. The more persistent queries you have, the slower it will get to enter new data. You can do it as a batch process e.g. nightly - Jeff calls this trying to "boil the ocean" - but it still doesn't scale well and it also doesn't provide the results in real time.

Instead, you turn the problem around. You store the persistent queries as "data", and you treat the incoming data as "queries". So each incoming record requires one "query" regardless of the number of persistent queries. Very cool.

Obviously, there are some issues here. One is that a persistent query probably involves only a few attributes whereas the incoming data will have many attributes. So you're not doing a normal exact match. And you probably will need ways to expire queries.

You might think that this is cool but you don't need to search for terrorists. I think it can be more broadly applied. For example, lets say you're a real estate agent and someone comes in asking for a certain type of property. You do a search but you don't find anything. So you make the query persistent and a few days later you get notified of a new property that's come available. You call your client and make the sale.

No comments: