Moogle1

Because Mike has too many answers and not enough questions.

Web Usage Data has Value to Search - Anyone for “TrafficRank”?

Posted by Mike Bijon December 13, 2005

To a certain extent Google must understand that web usage data has value. The Google Analytics service isn’t free because it helps Google’s ad-driven business model, but because the aggregate usage data gathered (referrers, click-outs, overall traffic, etc.) on thousands of sites is far more valuable than the servers and bandwidth spent to run it. Web usage data comes directly from the behavior of web users, the chief decision-makers on what is relevant, and can be used to find the context of the pages it tracks.

Search is the new distribution channel.
With the democratization of distribution created by the web, no company can produce even a large fraction of available content, but it is still possible to index and search the majority of content. Thus, better search engines actually serve as improved distribution channels and, like all media channels before, the profits are in advertising. Google’s income is in the intelligence of their advertising engine (AdSense), but without an intelligent search engine there would not be enough traffic for AdSense to be more than another also-ran. With Yahoo, MSN, and others quickly eroding Google’s lead in search relevance, there is a risk that Google’s traffic growth might slow or cease, reducing their revenue growth and related stock price.

Links built Google’s massive traffic.
Returning contextual search results is difficult because it’s relatively easy to create relevant-looking variables to feed to a relevancy-seeking program, like a search engine algorithm. It was the use of hyperlinks as a variable in search relevancy that allowed Google to return dramatically improved search results versus competitors. Inbound links aren’t contained in the actual page being indexed, so they are harder to fake than page variables or content, which can be manipulated to skew the context of a page. Now, all major search engines are using hyperlinks in their algorithms, and content providers & spammers alike focus on link-building to get listed in search engines. This greatly decreases the value of hyperlinks. Just like keywords, meta tags, and contextual language filters before them, links increasingly fail to build relevant context for search results or to differentiate one search engine from another. Without “better” search results Google may lose its key differentiator and its ability to grow search as its primary source of traffic and ad revenue.

Web usage data is the next indexing variable.
A pure search provider has no traffic data available to collect other than referrers and click data on its own results; there’s just nothing else to collect. Referrals just point out who the most popular sites on the web are, and click-throughs are largely based on the position or order of search results - it’s bad data to base the relevance of search results on. Broad web usage data, like the click-out and viewing-time data in the aforementioned Google Analytics, is the next variable that can be used to return relevant search results. That data is one step more removed from the control of website builders and one step closer to the uses of web surfers, making it even harder than links to bend and create fake relevance with.
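To make the point about click-throughs concrete, here is a rough Python sketch of why raw click counts mostly measure where a result sat on the page. The baseline click-through rates per position, the function name, and the traffic numbers are all made up for illustration; nothing here comes from real search-engine data.

```python
# Hypothetical baseline click-through rates by result position.
# These numbers are assumptions for illustration only.
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10}


def position_adjusted_score(clicks, impressions, position):
    """Compare the observed click-through rate to the rate expected at that position.

    A score above 1.0 means the result drew more clicks than its slot alone
    would predict; below 1.0 means fewer.
    """
    observed = clicks / impressions
    return observed / EXPECTED_CTR[position]


# Two made-up results shown 1000 times each.
top_result = position_adjusted_score(clicks=300, impressions=1000, position=1)
third_result = position_adjusted_score(clicks=200, impressions=1000, position=3)

print(top_result, third_result)
```

On raw clicks the first result wins (300 vs 200), but once position is factored out the third result looks like the one users actually preferred. Even this correction depends on a baseline that has to be estimated from the same biased clicks, which is part of why this data is a weak foundation.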

Does Google realize the value yet?
Google is in an excellent position to collect a large amount of web usage data from their own web services, 3rd-party sites displaying ads, and sites using their Analytics service. Increasingly, their suite of products appears to be tuned toward gathering data and not just to drive search traffic. I am convinced that Google has already realized the value of usage data in determining context. What I do not think Google understands yet is how to properly use the collected data to determine context.

How to discover the “TrafficRank” algorithm?
Trying to extract an algorithm from user click-out and viewing-time data will be harder than it was to build PageRank. PageRank was built with only two variables (# links in and # links out) while the traffic data is statistical in nature - making curve-fitting and “inexact” statistics necessary. The missing data is also more complex; PageRank is normalized, so inbound links missing from an index don’t heavily skew it, but the users who never touch Google’s websites do create a skew. I suspect that the web users Google does have data on are heavily skewed toward the technical and bandwidth-hungry elite, since many low-bandwidth and unskilled users start their web surfing from the default AOL or MSN home pages.
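For readers who haven’t seen it, here is a minimal Python sketch of the textbook, simplified form of PageRank, just to show how little input it needs: who links to a page, and how many links each linking page sends out. This is not Google’s production algorithm. The damping factor of 0.85 is the commonly cited value, and the tiny four-page link graph is invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links out to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly across its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Invented example graph: "c" receives the most inbound links, "d" receives none.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_links)

for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Note that the ranks always sum to 1, which is the normalization mentioned above: a link missing from the index shifts a little rank around but doesn’t change the scale. Traffic data has no such built-in property.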

Have they already started building it?
Remember the furor over eval.google.com and Google’s use of human testers? The data built from that testing are likely to be very relevant when compared to real web usage data. I’m educated as an engineer and one of the things engineers loved most about university lab courses was the ease of getting “great” results. It’s easy to curve-fit and statistics are fungible when there’s only one side to the data. The best thing Google can do to test the traffic data will be to build a small, neutral dataset of user behavior, or several small datasets with known skews that can be cross-referenced to the large traffic dataset. By using eval.google.com to mirror the link neighborhoods of a typical blog network and an AOL network, the “skew” I suggested earlier could be measured - and eliminated.
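As a rough idea of what “measuring and eliminating” a skew could look like, here is a small Python sketch of one standard statistical approach: reweighting each user segment by its known share of the real population (often called post-stratification). I have no idea whether Google does anything like this. The segment names, population shares, and viewing times below are all hypothetical.

```python
def reweighted_mean(samples, population_share):
    """Average a metric after correcting for over- or under-sampled segments.

    samples: list of (segment, value) pairs from the skewed dataset.
    population_share: segment -> share of the real population (assumed known,
    e.g. from a small neutral reference dataset).
    """
    totals = {}
    counts = {}
    for segment, value in samples:
        totals[segment] = totals.get(segment, 0.0) + value
        counts[segment] = counts.get(segment, 0) + 1

    # Weight each segment's own average by how common that segment really is.
    covered = sum(population_share[s] for s in counts)
    weighted = sum(population_share[s] * totals[s] / counts[s] for s in counts)
    return weighted / covered


# Hypothetical viewing times (seconds) for one page. The collected data
# over-represents broadband users: 8 of 10 samples.
samples = [("broadband", 30.0)] * 8 + [("dialup", 90.0)] * 2
raw_mean = sum(value for _, value in samples) / len(samples)

# Assumed real-world mix, as it might be measured from a small neutral dataset.
shares = {"broadband": 0.4, "dialup": 0.6}
adjusted_mean = reweighted_mean(samples, shares)

print(raw_mean, adjusted_mean)
```

With these made-up numbers the raw average is 42 seconds, while the reweighted average is 66 seconds. The correction is only as good as the estimate of the real mix, which is exactly why the small reference datasets would matter.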

…BTW, if anyone from any of the GAMY search engines is still reading (yeah, I’ve even bored myself by now), drop me an email “myname” @ gmail.com (my name is mikebijon). I did some studies a while back that cross referenced total area population and population density using the Poisson distribution (fun with statistics again…) that just might be relevant to what analyzing web usage might yield. I’d love to hear if that’s true.


One Response to “Web Usage Data has Value to Search - Anyone for “TrafficRank”?”

Comments

  1. Peter Oliver Dec 15 2005 / 2pm

    There was something similar to this at 2.0Ventures. It was more about centralizing stats data, something Google could do with their Analytics. http://2.0ventures.com
