I’ve been tinkering. I was grumbling about how bad sites’ own search engines generally are, and it occurred to me that it might actually be quite hard to make one work well. So I’ve started writing my own from first principles.
I’ve got as far as something which can index pages it’s fed (it doesn’t spider yet). It works out keyword densities per page and for the whole dataset, then indexes each page’s densities against the dataset’s… the idea being that you can quickly spot when a page has unusual content compared with its peers. Equally, words in common usage like “the”, “and”, “if” etc. are automatically under-indexed as a result.
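To give a flavour of what I mean, here’s a minimal sketch in PHP of the density-versus-dataset idea. The function and variable names are my own illustration rather than the actual code:

```php
<?php
// Minimal sketch of the density-vs-dataset idea: hypothetical names throughout.

function keywordDensities($text) {
    // Lowercase, strip punctuation, split into words
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    if ($total === 0) {
        return [];
    }
    $densities = [];
    foreach (array_count_values($words) as $word => $count) {
        $densities[$word] = $count / $total;   // share of the page this word makes up
    }
    return $densities;
}

// $pages is assumed to be an array of URL => raw page text
function buildIndex($pages) {
    $pageDensities = array_map('keywordDensities', $pages);

    // Dataset-wide density: each word's average density across all pages
    $datasetDensity = [];
    foreach ($pageDensities as $densities) {
        foreach ($densities as $word => $d) {
            $datasetDensity[$word] = ($datasetDensity[$word] ?? 0) + $d / count($pages);
        }
    }

    // Index each page's density against the dataset's: a score well above 1 means
    // the page over-indexes for that word, while ubiquitous words like "the"
    // score close to 1 everywhere and so carry little weight.
    $index = [];
    foreach ($pageDensities as $page => $densities) {
        foreach ($densities as $word => $d) {
            $index[$page][$word] = $d / $datasetDensity[$word];
        }
    }
    return [$index, $datasetDensity];
}
```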
I’ve now developed it to handle multiple search keywords (it ranks the rarity of each keyword searched for and then weights the page rankings accordingly) and to reduce all indexed and searched words to their stems, making the system blind to the differences between “car” and “cars” for example.
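The search side works something along these lines; again, this is an illustration rather than the real code, with `stem()` standing in for whichever stemmer I settle on (a proper Porter stemmer rather than the crude placeholder below), and `$index` and `$datasetDensity` coming from the indexing sketch above, assumed to have been built from stemmed words:

```php
<?php
// Crude stand-in for a real stemmer (e.g. Porter): just drop a trailing "s",
// so "car" and "cars" collapse to the same stem.
function stem($word) {
    return preg_replace('/s$/', '', $word);
}

// Each query word is stemmed, weighted by its rarity across the dataset,
// and pages are scored accordingly.
function searchPages(array $keywords, array $index, array $datasetDensity) {
    $scores = [];
    foreach ($keywords as $keyword) {
        $stem = stem(strtolower($keyword));
        if (!isset($datasetDensity[$stem])) {
            continue;                              // never seen in the dataset
        }
        $rarity = 1 / $datasetDensity[$stem];      // rarer keywords carry more weight
        foreach ($index as $page => $words) {
            if (isset($words[$stem])) {
                $scores[$page] = ($scores[$page] ?? 0) + $rarity * $words[$stem];
            }
        }
    }
    arsort($scores);                               // best-matching pages first
    return $scores;
}
```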
The next challenge is identifying meaningful phrases (e.g. “Prime Minister”), then indexing them and possibly identifying associated words (e.g. “10 Downing Street”) by looking for pages which over-index for particular combinations of phrases. Strictly speaking I’m getting into a world of Bayesian logic and matrices, but I’ll avoid going back to my textbooks for as long as I can; though this excellent article summarising some work by Google gives me nightmares about Finals again.
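For what it’s worth, here’s a rough sketch of the simplest version of phrase-spotting I have in mind: flag adjacent word pairs that turn up far more often than the individual word frequencies would predict. It’s purely my illustration, and a long way short of a proper Bayesian treatment:

```php
<?php
// $words is the page (or dataset) as an ordered list of lowercase words;
// $threshold is how many times more common than chance a pair must be.
function candidatePhrases(array $words, $threshold = 5.0) {
    $total = count($words);
    if ($total < 2) {
        return [];
    }
    $wordFreq = array_count_values($words);

    // Count adjacent pairs ("bigrams"), e.g. "prime minister"
    $pairFreq = [];
    for ($i = 0; $i < $total - 1; $i++) {
        $pair = $words[$i] . ' ' . $words[$i + 1];
        $pairFreq[$pair] = ($pairFreq[$pair] ?? 0) + 1;
    }

    $phrases = [];
    foreach ($pairFreq as $pair => $count) {
        list($a, $b) = explode(' ', $pair);
        // Expected pair count if the two words only ever appeared together by chance
        $expected = ($wordFreq[$a] / $total) * ($wordFreq[$b] / $total) * ($total - 1);
        if ($count >= 3 && $count / $expected > $threshold) {
            $phrases[$pair] = $count / $expected;  // over-representation ratio
        }
    }
    arsort($phrases);
    return $phrases;
}
```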
I’ll publish some PHP code on here at some point, or maybe I’ll start a Sourceforge project and share the love. Having added a comma to a Wikipedia article the other day and been sucked into Facebook, I’m feeling very Web 2.0 this week.