Why do search engines lie?

Here, do a search for Memetrackers (Google, MSN, Yahoo). Now, why are none of their numbers accurate? Google says there are 713 results, but can only display 62. MSN says there are 101, but can only display 100. Yahoo says there are 368, but can only display 44.

Why aren’t there any truth in advertising laws for search engines?

Update: the numbers are changing. Google now says there are 699 results, but can only display 692 (this is after you tell it to display all duplicates).

Oh, and no engine can display more than about 1,000 results, so if they say there are 42,000,000 results, there’s no way to verify whether that number is accurate or not.

Update 2: Yahoo is actually accurate once you tell it to display all duplicates. It says 429 results and displays 429 results. So, Yahoo wins! (although I wish they’d all be a little clearer up front).

  • http://www.the-rodeo.com/ Ian Tyrrell

    Did you read the bottom of that page?

    “In order to show you the most relevant results, we have omitted some entries very similar to the 62 already displayed.
    If you like, you can repeat the search with the omitted results included.”

    There are 713 results, but a bunch of them are very similar, so you might not want to bother looking at them.

    I guess they are just trying to make the results more relevant, and decrease the amount of ‘wading’ you have to do through the results.

  • http://scobleizer.wordpress.com/ scobleizer

    Ian: I just did that and updated my post. Turns out they still aren’t all that accurate.

  • http://jackkonblog.blogspot.com/ Jack Krupansky

    And, different users (or the same users at different times of day) may see different numbers since Google seems to have multiple copies of their database that are *not* in sync.

    But let’s be honest here about “marketing”… are Google, et al. really any less “accurate” than McDonald’s used to be when they told us how many billion hamburgers they had served?

    Or, are you trying to suggest a “Truth in Marketing” fantasy?

    The real problem that you are beginning to stumble into here is that core search engine technology really hasn’t advanced very far over the past five years (or even since 1998 when I was involved in it).

    – Jack Krupansky

  • http://jackkonblog.blogspot.com/ Jack Krupansky

    Oops, I forgot to answer your initial question…

    Why do search engines lie? … Because they can.

    – Jack Krupansky

  • JG

    Earlier last year G began offering a new search option: the wildcard “*”. The idea is that you would use it to match queries in which you did not know what one of the search terms should be, e.g. “* scoble” would match all the different two-word phrases ending with scoble.

    Well, being the search geek that I am, I immediately went to G and typed “*” by itself, to see (1) how many matches it returned (i.e. what was the size of the index) as well as (2) what was the top-ranked page.

    As you can guess, G did not process that query, and simply returned a “no results here” page. So I tried a different query “* *”. In other words, return all the pages to me in which there are (at least) any two words on the page.

    That worked.

    Weird thing was, at the time Google was claiming something like 10 billion web pages, and the results I got said something like “Results 1 – 10 of about 8,500,000,000”.

    So either there were 3 billion web pages in G’s index, each page with only 1 word on that page (so that “* *”, which requires two words, wouldn’t match), or else G was somewhat less than true in its advertising.

    Just five minutes ago I tried this “* *” query again. This time, it returned 11 billion results. Two minutes ago, I tried it yet again. It returned 9.7 billion results. Same exact query. Hmm. Ok, well, I can attribute that to perhaps my query being farmed out to different server clusters, and each cluster having unsynchronized indices.

    So then I went to google.co.uk, and tried the query “* *”. Wanna guess? 18 billion results. Huh? How is it that the UK datacenter has almost twice the index size of the US datacenter?

    Could it be that Google has internet filters not only on China, but on the US, too? Why is it that the UK has twice the index size? :-)

    Anyway, my point is simply that I agree with you. There should be more truth in advertising. Something is definitely screwy with all these indices and numbers.

  • JG

    Whups, in my post above I realize some of my math is wrong. When I talk about there being 3 billion web pages with only a single term, that should actually be 1.5 billion web pages. So I can’t do simple math any better than Google can.

    But go and try this query yourself. I was extremely surprised to see a 9 billion page difference between the google.com and the google.co.uk sites. YMMV, but I’ll bet you also see a huge difference.

  • http://scobleizer.wordpress.com/ scobleizer

    Another one: if Google is really looking at more sites than anyone else, why doesn’t it display the most non-duplicated sites? Yeah, my sample size is small here, but I get similar results every time I try.

  • Mujibur

    Keep on fighting the fights that need fighting, Scoble!

  • Mujibur

    And BTW — still no update on why Microsoft dropped Windows Media Player for the Mac? Fascinating.

    If I didn’t know better, I would have to conclude that you don’t want to offer an answer.

  • http://ninefish.wordpress.com/ ninefish

    It’s another example of why search needs fixing. I realise it’s a big issue and not something that can be coded before afternoon tea, but it is probably the biggest issue that faces us in the future… how will we find things, either on the web or even on our hard drive?

  • http://spaces.msn.com/erickwa Erick Watson

    In a variation on this theme, as an IT Pro Audience Marketing Manager at Microsoft, I’ve created what I call the “Microsoft Sucks” index as a rude (no pun intended) way to track popular sentiment about Microsoft in the online community – my goal being to somehow have a positive impact on this.

    Once a day, I visit MSN, Google, & Yahoo and simply type in the phrase “microsoft sucks” to see how many hits I get. Recently I’ve also begun tracking “linux sucks” so as to form some basis for comparison.

    Try this for a couple of days – you’ll see that the results bear your theory out. Yahoo is the worst offender of all, as their number has remained unchanged at 5,850,000 – they seem to employ some sort of averaging. Google’s number vacillates wildly, with no consistent pattern either up or down. MSN’s number is a steady linear progression which tracks very nicely for the most part, but dips toward the end of each work week for some reason. Maybe there are some background processes going on here that account for the variations?

    I’ll keep you posted if I find a pattern here, but at the end of the day, you’re correct in that they are very inconsistent.

  • domovoi

    “And BTW — still no update on why Microsoft dropped Windows Media Player for the Mac? Fascinating.”

    Because it sucked and now they’re supporting a superior solution? What’s the problem?

  • http://scobleizer.wordpress.com/ scobleizer

    Regarding Media Player for the Mac: I haven’t found who made that decision yet, so don’t want to speak for him or her. I don’t know every one of our 60,000+ employees! :-)

    But, from the rumor mill inside the halls here they decided to invest the developer time on other projects. Sounds sane to me since there’s a whole lot more important things to work on.

  • http://www.lopico.com/ J.D. Amer

    JG, it’s all about the data center that you’re viewing from Google. If you go to the so-called “Big Daddy” data center you’ll get 25,270,000,000 results.

    Big Daddy data center address is 66.249.93.104

  • http://mcmanus.typepad.com/ Jeffrey McManus

    yahoo aer teh winnar!!!1!!

  • http://www.bynkii.com/ John C. Welch

    well that, and the WM team evidently couldn’t code a proper non-windows application on a dare.

    First…the search engines aren’t lying. Lying implies a deliberate effort to deceive and spread falsehoods, and I really doubt any of the major search engines are doing that. They just have different algorithms for searching and displaying results, and that’s why you get the difference in the results.

    But “lying” is *so* not the right term.

  • Tadeu

    I tried the “* *” wildcard search on the Google Brazil site (google.com.br) and my results? 25.270.000.000

    Maybe I hit the “Big Daddy” data center, as J.D. commented.

  • http://www.webclipz.com/ Peggy

    The little engine that could. Yahoo! wins, and will rise to win again, after the Web Two Oh, NO Hoopla is but yesterday’s dream. Yahoo!! :)

  • http://flaco.wordpress.com/ flaco

    Very interesting little research you have done there!

  • http://jameskew.blogspot.com/ James Kew

    Tim Bray explained this a while ago: nobody cares about the long tail of less-relevant results, so why waste huge amounts of time sorting them?

    http://www.tbray.org/ongoing/When/200x/2004/11/26/SearchSort

    The whole of his On Search series is good reading on the gnarly issues underlying searching:

    http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC

  • http://spaces.msn.com/ianmcallister Ian McAllister

    The big search engines are non-deterministic. That’s why two searches for the same keyword phrase may bring back different results, both in terms of count and ranking. Often different mechanisms are used to arrive at search result counts vs. the results themselves.

  • http://helpdesk.wordpress.com/ Michael

    Because nobody cares about the little guys at the end of the list.

  • http://www.inperspective.com/ Mark

    Re paging results: I don’t think it’s anything sinister – just trying to avoid server overload.
    From a technical point of view, it uses more resources to show you results 10,000 to 10,010 than it does to show you results 0 to 10.
    All the docs aren’t sat on one machine. If you are merging results from a server farm, each server needs to return its “top” results for the query in question. If you are merging for results 0-10, then you only need the top 10 results from each server to merge and get the true list of top-scoring results overall.
    If you are attempting to merge scores for results 10,000 to 10,010, you either need each server to return 10,010 results and merge their scores, or have them collaborate on what the score range of 10,000 to 10,010 is likely to be for this query across the farm and then return docs in this range (HARD).

    It’s hard enough responding to millions of queries on billions of docs without trying to be 100% accurate on numbers of results and letting people page through them all.
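
The merging cost described in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real search engine is built: the names `shard_top`, `merged_page` and `hits_transferred` are hypothetical, each "server" is just an in-memory list of (score, doc_id) pairs, and a higher score is assumed to mean a more relevant document.

```python
import heapq
from typing import List, Tuple

# Hypothetical hit type: (score, doc_id). Higher score = more relevant (assumption).
Hit = Tuple[float, str]


def shard_top(hits: List[Hit], k: int) -> List[Hit]:
    """What one server sends back: its k best hits, best first."""
    return heapq.nlargest(k, hits)


def merged_page(shards: List[List[Hit]], offset: int, size: int) -> List[Hit]:
    """Return global results [offset, offset + size) merged across all shards.

    Any of the first offset + size global results could live on a single
    shard, so every shard has to return that many candidates.
    """
    k = offset + size
    candidates = [hit for shard in shards for hit in shard_top(shard, k)]
    # Re-rank the candidates globally, then keep only the requested page.
    return heapq.nlargest(k, candidates)[offset:]


def hits_transferred(num_shards: int, offset: int, size: int) -> int:
    """Number of candidate hits the merger receives for one page."""
    return num_shards * (offset + size)
```

With a made-up farm of 100 servers, `hits_transferred(100, 0, 10)` is 1,000 candidates for the first page, while `hits_transferred(100, 10000, 10)` is 1,001,000 candidates for results 10,000 to 10,010, which is the extra work the comment is pointing at.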

  • http://42quirks.com Shrikant Joshi

    The wildcard search has one interesting pattern. It returns search results for all pages with hyphenation and other punctuation, viz., colons (:) or semicolons (;), etc. What are the chances that a page does not have such punctuation?

    (Almost) None.

    Which means the double wildcard search string is a good approximation of the total number of pages. I would say approx 98-99% LoC.

    As far as duplication of content goes, I believe the algorithm is responsible for it. Don’t ask me how, yet. I am still trying to figure that one out…
