Why do search engines lie?

Posted on February 8, 2006

Here, do a search for Memetrackers (Google, MSN, Yahoo). Now, why are none of their numbers accurate? Google says there are 713 results, but can only display 62. MSN says there are 101, but can only display 100. Yahoo says there are 368, but can only display 44.

Why aren’t there any truth-in-advertising laws for search engines?

Update: the numbers are changing. Google now says there are 699 results, but can only display 692 (this is after you tell it to display all duplicates).

Oh, and no engine can display more than about 1,000 results, so if they say there are 42,000,000 results, there’s no way to verify whether that number is accurate.

Update 2: Yahoo is actually accurate once you tell it to display all duplicates. It says 429 results and displays 429 results. So, Yahoo wins! (although I wish they’d all be a little clearer up front).

  • Did you read the bottom of that page?

    "In order to show you the most relevant results, we have omitted some entries very similar to the 62 already displayed.
    If you like, you can repeat the search with the omitted results included."

    There are 713 results, but a bunch of them are very similar, so you might not want to bother looking at them.

    I guess they are just trying to make the results more relevant, and decrease the amount of 'wading' you have to do through the results.
  • Ian: I just did that and updated my post. Turns out they still aren't all that accurate.
  • And, different users (or the same users at different times of day) may see different numbers since Google seems to have multiple copies of their database that are *not* in sync.

    But let's be honest here about "marketing"... are Google, et al. really any less "accurate" than McDonald's used to be when they told us how many billion hamburgers they had served?

    Or, are you trying to suggest a "Truth in Marketing" fantasy?

    The real problem that you are beginning to stumble into here is that core search engine technology really hasn't advanced very far over the past five years (or even since 1998 when I was involved in it).

    -- Jack Krupansky
  • Oops, I forgot to answer your initial question...

    Why do search engines lie? ... Because they can.

    -- Jack Krupansky
  • JG
    Earlier last year G began offering a new search option: the wildcard "*". The idea is that you would use it to match queries in which you did not know what one of the search terms should be, i.e. "* scoble" would match all the different two-word phrases ending with scoble.

    Well, being the search geek that I am, I immediately went to G and typed "*" by itself, to see (1) how many matches it returned (i.e. what was the size of the index) as well as (2) what was the top-ranked page.

    As you can guess, G did not process that query, and simply returned a "no results here" page. So I tried a different query "* *". In other words, return all the pages to me in which there are (at least) any two words on the page.

    That worked.

    Weird thing was, at the time Google was claiming something like 10 billion web pages, and the results I got said something like "Results 1 - 10 of about 8,500,000,000".

    So either there were 3 billion web pages in G's index, each page with only 1 word on that page (so that "* *", which requires two words, wouldn't match), or else G was somewhat less than true in its advertising.

    Just five minutes ago I tried this "* *" query again. This time, it returned 11 billion results. Two minutes ago, I tried it yet again. It returned 9.7 billion results. Same exact query. Hmm. Ok, well, I can attribute that to perhaps my query being farmed out to different server clusters, and each cluster having unsynchronized indices.

    So then I went to google.co.uk, and tried the query "* *". Wanna guess? 18 billion results. Huh? How is it that the UK datacenter has almost twice the index size of the US datacenter?

    Could it be that Google has internet filters not only on China, but on the US, too? Why is it that the UK has twice the index size? :-)

    Anyway, my point is simply that I agree with you. There should be more truth in advertising. Something is definitely screwy with all these indices and numbers.
  • JG
    Whups, in my post above I realize some of my math is wrong. When I talk about there being 3 billion web pages with only a single term, that should actually be 1.5 billion web pages. So I can't do simple math any better than Google can.

    But go and try this query yourself. I was extremely surprised to see a 9 billion page difference between the google.com and the google.co.uk sites. YMMV, but I'll bet you also see a huge difference.
  • Another one: if Google is really looking at more sites than anyone else, why doesn't it display the most non-duplicated sites? Yeah, my sample size is small here, but I get similar results every time I try.
  • Mujibur
    Keep on fighting the fights that need fighting Scoble!
  • Mujibur
    And BTW -- still no update on why Microsoft dropped Windows Media Player for the Mac? Fascinating.

    If I didn't know better, I would have to conclude that you don't want to offer an answer.
  • It's another example of why search needs fixing. I realise it's a big issue and not something that can be coded before afternoon tea, but it is probably the biggest issue that faces us in the future... how will we find things, either on the web or even on our hard drive?
  • In a variation on this theme, as an IT Pro Audience Marketing Manager at Microsoft, I've created what I call the "Microsoft Sucks" index as a rude (no pun intended) way to track popular sentiment about Microsoft in the online community - my goal being to somehow have a positive impact on this.

    Once a day, I visit MSN, Google, & Yahoo and simply type in the phrase "microsoft sucks" to see how many hits I get. Recently I've also begun tracking "linux sucks" so as to form some basis for comparison.

    Try this for a couple of days - you'll see that the results bear your theory out. Yahoo is the worst offender of all, as their number has remained unchanged at 5,850,000 - they seem to employ some sort of averaging. Google's number vacillates wildly, with no consistent pattern either up or down. MSN's number is a steady linear progression which tracks very nicely for the most part, but dips toward the end of each work week for some reason. Maybe there are some background processes going on here that account for the variations?

    I'll keep you posted if I find a pattern here, but at the end of the day, you're correct in that they are very inconsistent.
  • domovoi
    "And BTW — still no update on why Microsoft dropped Windows Media Player for the Mac? Fascinating."

    Because it sucked and now they're supporting a superior solution? What's the problem?
  • Regarding Media Player for the Mac: I haven't found who made that decision yet, so don't want to speak for him or her. I don't know every one of our 60,000+ employees! :-)

    But, from the rumor mill inside the halls here they decided to invest the developer time on other projects. Sounds sane to me since there's a whole lot more important things to work on.
  • JG, it's all about the data center that you're viewing Google from. If you go to the so-called "Big Daddy" data center you'll get 25,270,000,000 results.

    Big Daddy data center address is 66.249.93.104
  • yahoo aer teh winnar!!!1!!
  • well that, and the WM team evidently couldn't code a proper non-windows application on a dare.

    First...the search engines aren't lying. Lying implies a deliberate effort to deceive and spread falsehoods, and I really doubt any of the major search engines are doing that. They just have different algorithms for searching and displaying results, and that's why you get the difference in the results.

    But "lying" is *so* not the right term.
  • Tadeu
    I tried the "* *" wildcard search on Google Brazil site (google.com.br) and my results ? 25.270.000.000

    Maybe I hit the "big daddy" data center as J.D commented.
  • The little engine that could. Yahoo! wins, and will rise to win again, after the Web Two Oh, NO Hoopla is but yesterday's dream. Yahoo!! :)
  • Very interesting little research you have done there!
  • Tim Bray explained this a while ago: nobody cares about the long tail of less-relevant results, so why waste huge amounts of time sorting them?

    http://www.tbray.org/ongoing/When/200x/2004/11/...

    The whole of his On Search series is good reading on the gnarly issues underlying searching:

    http://www.tbray.org/ongoing/When/200x/2003/07/...
  • The big search engines are non-deterministic. That's why two searches for the same keyword phrase may bring back different results, both in terms of count and ranking. Often different mechanisms are used to arrive at search result counts vs. the results themselves.
  • Because nobody cares about the little guys at the end of the list.
  • Re paging results: I don't think it's anything sinister - just trying to avoid server overload.
    From a technical point of view it uses more resources to show you results 10,000 to 10,010 than it does to show you results 0 to 10.
    All the docs aren't sat on one machine. If you are merging results from a server farm, each server needs to return its "top" results for the query in question. If you are merging for results 0-10 then you only need the top 10 results from each server to merge and get the true list of top scoring results overall.
    If you are attempting to merge scores for results 10,000 to 10,010 you either need each server to return 10,010 results and merge their scores, or have them collaborate on what the score range of results 10,000 to 10,010 is likely to be for this query across the farm and then return docs in this range (HARD).

    It's hard enough responding to millions of queries on billions of docs without trying to be 100% accurate on numbers of results and letting people page through them all.
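The merging cost described in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `shard_top` and `merged_page` are hypothetical, plain in-memory lists stand in for the servers, and nothing here is meant to reflect how Google or any other engine actually implements paging. It shows that when each shard can rank only its own documents, serving global results 10,000 to 10,010 means every shard has to send its top 10,010 entries, while the first page needs only 10 per shard.

```python
import heapq
from itertools import islice


def shard_top(shard_scores, n):
    """Return the n best (score, doc_id) pairs held by one shard, best first."""
    return heapq.nlargest(n, shard_scores)


def merged_page(shards, offset, page_size):
    """Return global results [offset, offset + page_size) plus the number
    of entries the shards had to send to the merging node."""
    need = offset + page_size
    # A shard cannot know where its documents rank globally, so it must
    # send its top `need` entries: any of them might fall on this page.
    partials = [shard_top(shard, need) for shard in shards]
    # Each partial list is sorted best first, so a k-way merge yields the
    # global ranking without re-sorting everything.
    merged = heapq.merge(*partials, reverse=True)
    page = list(islice(merged, offset, need))
    transferred = sum(len(p) for p in partials)
    return page, transferred


# Hypothetical data: 60,000 documents spread over 3 shards, score = doc number.
shards = [[(i, f"doc{i}") for i in range(s, 60_000, 3)] for s in range(3)]

first_page, sent_first = merged_page(shards, 0, 10)
deep_page, sent_deep = merged_page(shards, 10_000, 10)
print(sent_first, sent_deep)
```

With this toy data the first page costs 30 transferred entries and the page starting at result 10,000 costs 30,030, which is the asymmetry the comment describes.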
  • The wildcard search has one interesting pattern. It returns search results for all sites with hyphenation and such other punctuations, viz., colons (:) or semi-colons (;), etc. What are the chances that a page does not have such punctuation?

    (Almost) None.

    Which means the double wildcard search string is a good approximation of the total number of pages. I would say approx 98-99% LoC.

    As far as duplication of content goes, I believe the algorithm is responsible for it. Don't ask me how, yet. I am still trying to figure that one out...
  • @JG: >Could it be that Google has internet filters not only on China, but on the US, too? Why is it that the UK has twice the index size?

    Maybe G UK has all those Scientology pages it made Google remove in the US. :)
  • Yet studies show out of Google, MSN and Yahoo!, Google is the best.
  • The major problem of searching is that the result that I look for is often not on the first page! From the user's viewpoint, why would it matter if a search engine returns 1 million or 1 billion results?
  • And that's why AltaVista died - number of results is meaningless. It's the quality of the first page of results that ultimately matters. That's why Google is on top of search right now - because a year ago, their first page of results was vastly better than Yahoo, MSN, etc. Even though the other guys have largely caught up, Google got there early enough to establish themselves as "the" search provider.
  • John: totally agreed.
  • JG
    To all who answered my "* *" wildcard query post: Yes, I buy the BigDaddy explanation. But... that still begs the question: Why do I only see 8-11 billion results in the U.S.? I've done this wildcard query from home, from work, from the east coast, from the west coast. Never in the past 8-9 months since I started doing the query have I come across even half of the 25 billion web pages supposedly in the Google index. Why does the UK get to see 18 billion? Why does Brasil get to see all 25 billion?

    Some of you above have said that size does not matter, that nobody cares about the long tail, so it doesn't matter if Google's index only shows me 8 billion or 18 billion or whatever. Well, I could debate that, but I don't want to go too far off topic. Let's just say that, even if that long-tail argument is true, for any one query (i.e. nobody scrolls past 30, much less 10 documents, anyway), it is probably not true, for the entire set of possible queries.

    What I mean is that for your query, you might never need to look at document #10,575. But for somebody else's differently-worded query, that same document will be ranked 3rd. And if that document is not in the index, because Google U.S. is only showing us 8 billion of the 25 billion web pages, then for this latter query, someone is not getting the information they need.

    So there are two long tails here: (1) is the long tail for one query. (2) is the long tail of all possible queries. I would argue that (2) is much more important, if not vital. By only showing 8+ billion pages in the U.S., Google is robbing us of the top-10 results to all those queries.

    The final problem with this whole "non-truth in advertising" thing is that, when I do the wildcard search, is there no notice at the bottom of the page saying "Because of the DMCA, pages have been removed from your search"? I mean, because, after all, if my wildcard search really is returning all 8 billion pages from the non-bigdaddy server, then some of those will have already been removed, right? So Google has removed pages, and has not told me that it has removed them.

    That's a big problem. That's a big trust issue.
  • google
    Results 1 - 10 of about 830 for memetrackers. (0.19 seconds)
  • Google
    Results 1 - 10 of about 192 for memetrackers. (0.06 seconds)

    As stated above, it's all about which datacenter you're at. You can compare results between the different datacenters here
  • Tetra
    I think I just got dumber reading this blog entry. Who cares about results past the first couple of pages?
  • Innocent Bystander
    Kinda depends on the goal of the search.

    It would make no sense for the search on ebay or amazon to show you something they can't sell you - so they show you something kind of similar that lots of people buy.

    It's not hard to argue that this is better than vanilla dumb search.
  • Google ranks my memetrackers post first, hence Google wins. Simple. :)
  • anonymous hero
    scoble,

    you are not an engineer (it appears), stop with this inane babble. it's like fielding support calls from users - 'why dont my intarweb work like i want to?!'. there's nothing wrong with being non-technical, but i'd have thought that you would develop some self restraint to prevent embarrassing yourself in public.

    depending on the implementation, there is a big difference in cost between returning a result containing first X matches ordered by Y (eg relevancy) and returning the exactly correct number of matches. the latter may have a cost equivalent to actually finding *all* of the matches. generally, when this is the case and you don't *have to* be absolutely precise, you do statistical extrapolation, which will by definition be imprecise. that is what you are seeing.
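The trade-off in the comment above, an exact count versus a statistical extrapolation, can be illustrated with a small Python sketch. The assumptions are loud ones: `exact_count` and `estimated_count` are hypothetical names, the "index" is just a list of word sets, and uniform random sampling is one simple way to extrapolate, not a description of what any real search engine does. The point is only that the estimate touches a fraction of the documents and comes back close to, but not equal to, the true number.

```python
import random


def exact_count(docs, term):
    """Count matches by scanning every document: precise, but the cost
    grows with the size of the whole index."""
    return sum(1 for words in docs if term in words)


def estimated_count(docs, term, sample_size, seed=0):
    """Extrapolate the match count from a uniform random sample.
    Only `sample_size` documents are inspected, so the answer is
    cheap to compute but imprecise."""
    rng = random.Random(seed)
    sample = rng.sample(docs, sample_size)
    hits = sum(1 for words in sample if term in words)
    return round(hits / sample_size * len(docs))


# Hypothetical index: 100,000 documents, one in ten mentions the term.
docs = [{"memetrackers"} if i % 10 == 0 else {"other"} for i in range(100_000)]

print(exact_count(docs, "memetrackers"))
for seed in range(3):
    # Different samples give different estimates for the same query.
    print(estimated_count(docs, "memetrackers", sample_size=1_000, seed=seed))
```

Changing the seed plays the role of hitting a different server or a different moment in time: the same query produces a different "about N results" figure even though the underlying data has not changed.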
  • Sam Dipiazza
    You need to make sure you have all result-limiting preferences turned off. When I set Google to return results in any language, with safe search turned off and personalized results turned off, I get 25,270,000,000 hits for * * at www.google.com.
  • Vinay
    You can see (and verify) only the first 1000 results.
    If you try to change the url (HTTP GET variables) like this

    http://www.google.com/search?q=Scobleizer&n...

    and try to get results over 1000, it says

    "Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 10000.)"

    So, it actually doesn't matter. I don't even look at that number.
  • gudipudi
    "42,000,000 results"


    These refer to web pages... so I don't find any problem with the kind of results provided by Google.
  • toby33
    Google is often too complex to understand, and that is a problem. The results do not normally add up, and how do we know there are 44m sites? SEO solutions are becoming a necessity even for small businesses, which is interesting, to say the least. With businesses spending less time on their business and its customers and more time on understanding the complexity of SEs, something is going to have to break apart. We will see what it is.