Moogle1

Because Mike has too many answers and not enough questions.

Web Usage Data has Value to Search - Anyone for “TrafficRank”?

Posted by Mike Bijon December 13, 2005

To a certain extent Google must understand that web usage data has value. The free Google Analytics service isn’t free because it helps Google’s ad-driven business model, but because the aggregate usage data gathered (referrers, click-outs, overall traffic, etc.) on thousands of sites is far more valuable than the servers and bandwidth spent to run it. Web usage data comes directly from the behavior of web users, the chief decision-makers on what is relevant, and can be used to find the context of the pages it tracks.

Search is the new distribution channel.
With the democratization of distribution created by the web, no company can produce even a large fraction of available content, but it is still possible to index and search the majority of content. Thus, better search engines actually serve as improved distribution channels and, like all media channels before, the profits are in advertising. Google’s income is in the intelligence of their advertising engine (AdSense), but without an intelligent search engine there would not be enough traffic for AdSense to be more than another also-ran. With Yahoo, MSN, and others quickly eroding Google’s lead in search relevance, there is a risk that Google’s traffic growth might slow or cease, reducing their revenue growth and related stock price.

Links built Google’s massive traffic.
Returning contextual search results is difficult because it’s relatively easy to create relevant-looking variables to feed to a relevancy-seeking program, like a search engine algorithm. It was the use of hyperlinks as a variable in search relevancy that allowed Google to return dramatically improved search results versus competitors. Links aren’t contained in the actual page being indexed, so they are harder to fake than page variables or content, which can be manipulated to skew the context of a page. Now all major search engines use hyperlinks in their algorithms, and content providers and spammers alike focus on link-building to get listed in search engines. This greatly decreases the value of hyperlinks. Just as keywords, meta tags, and contextual language filters did before, links increasingly fail to build relevant context for search results or to differentiate one search engine from another. Without “better” search results Google may lose its key differentiator and its ability to grow search as its primary source of traffic and ad revenue.

Web usage data is the next indexing variable.
A pure search provider has no traffic data available to collect other than referrers and click data on its own results; there’s just nothing else to collect. Referrals just point out which sites on the web are most popular, and click-throughs are largely determined by the position or order of search results, so both are bad data on which to base the relevance of search results. Broad web usage data, like the click-out and viewing-time data in the aforementioned Google Analytics, is the next variable that can be used to return relevant search results. That data is one step further removed from the control of website builders and one step closer to the uses of web surfers, making it even harder than links to bend into fake relevance.

Does Google realize the value yet?
Google is in an excellent position to collect a large amount of web usage data from their own web services, 3rd-party sites displaying ads, and sites using their Analytics service. Increasingly, their suite of products appears to be tuned toward gathering data and not just toward driving search traffic. I am convinced that Google has already realized the value of usage data in determining context. What I do not think Google understands yet is how to properly use the collected data to determine context.

How to discover the “TrafficRank” algorithm?
Trying to extract an algorithm from user click-out and viewing-time data will be harder than it was to build PageRank. PageRank was built with only two variables (# links in and # links out) while the traffic data is statistical in nature, making curve-fitting and “inexact” statistics necessary. The missing data is also more complex; PageRank is normalized so links missing from an index don’t heavily skew it, but the absence of users who are not on Google’s websites does create a skew. I suspect that the web users Google does have data on are heavily skewed toward the technical and bandwidth-hungry elite, since many low-bandwidth and unskilled users start their web surfing from the default AOL or MSN home pages.
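
For comparison, here is roughly what the link-only calculation looks like. This is a minimal sketch of the textbook power-iteration form of PageRank with the commonly cited damping factor of 0.85, not Google’s production code. The function name, the toy graph, and the handling of pages without outbound links are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links out to.
    Returns a dict of page -> rank, with ranks summing to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly over its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outbound links spreads its
                # rank evenly over every page so the total stays at 1.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a and b both link to c, c links back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_rank = pagerank(toy_links)
```

Every input here is a link that either exists or doesn’t. Click-out and viewing-time data would have no such clean input, which is why I expect the statistics to be the hard part.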

Have they already started building it?
Remember the furor over eval.google.com and Google’s use of human testers? The data built from that testing are likely to be very relevant when compared to real web usage data. I’m educated as an engineer and one of the things engineers loved most about university lab courses was the ease of getting “great” results. It’s easy to curve-fit and statistics are fungible when there’s only one side to the data. The best thing Google can do to test the traffic data will be to build a small, neutral dataset of user behavior, or several small datasets with known skews that can be cross-referenced to the large traffic dataset. By using eval.google.com to mirror the link neighborhoods of a typical blog network and an AOL network the “skew” I suggested earlier could be measured - and eliminated.
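
One way to picture “measured - and eliminated” is plain post-stratification reweighting: compare each user segment’s share of the big traffic dataset with its share in a small reference dataset, then weight observations accordingly. This is a standard survey-statistics technique that I’m swapping in as an illustration, not anything Google is known to do. The segment names, counts, and viewing times below are all made up.

```python
def segment_weights(sample_counts, reference_shares):
    """Weight per segment = reference share / share in the skewed sample."""
    total = sum(sample_counts.values())
    weights = {}
    for segment, reference_share in reference_shares.items():
        sample_share = sample_counts.get(segment, 0) / total
        if sample_share == 0:
            # A segment we never observed cannot be reweighted.
            raise ValueError(f"no observations for segment {segment!r}")
        weights[segment] = reference_share / sample_share
    return weights


def reweighted_mean(observations, weights):
    """Weighted mean of (segment, value) pairs."""
    numerator = sum(weights[segment] * value for segment, value in observations)
    denominator = sum(weights[segment] for segment, _ in observations)
    return numerator / denominator


# Hypothetical skewed traffic sample: 80 technical users, 20 casual users.
sample_counts = {"technical": 80, "casual": 20}
# Hypothetical neutral reference dataset: an even split.
reference_shares = {"technical": 0.5, "casual": 0.5}

# Made-up viewing times in seconds for one page.
observations = [("technical", 10.0)] * 80 + [("casual", 30.0)] * 20

weights = segment_weights(sample_counts, reference_shares)
raw_mean = sum(value for _, value in observations) / len(observations)
corrected_mean = reweighted_mean(observations, weights)
```

The raw average is dominated by the over-represented segment. The corrected average is what the same page would score if the traffic sample matched the reference mix.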

…BTW, if anyone from any of the GAMY search engines is still reading (yeah, I’ve even bored myself by now), drop me an email “myname” @ gmail.com (my name is mikebijon). I did some studies a while back that cross referenced total area population and population density using the Poisson distribution (fun with statistics again…) that just might be relevant to what analyzing web usage might yield. I’d love to hear if that’s true.

Missing the Boat: Microsoft’s Online Services Shouldn’t be Clones

Posted by Mike Bijon December 09, 2005

I’m an IT manager by profession, and while Google has become “the ultimate IT answer guide” it irritates me that so few IT staffers actually document their fixes publicly so Google can make them available. Of course, several of my staff blog about video games and online poker on a daily basis, so maybe the problem is just a passion/work thing. When I have time I’ve been posting tech tips that should save anyone else in IT some time.

In a recent article I wrote about Windows Mobile 5.0 I’ve criticized the new OS and ActiveSync heavily for their problems - maybe I just have unrealistic expectations of the biggest software company in the world… I’ve also been playing with the Windows Live services and am disappointed - maybe I just have unrealistic expectations of the biggest software company in the world…

Microsoft really needs to sell to the needs of their current software users before they embrace and extend to the entertainment markets. For another $20 a year I’d love to do away with ActiveSync on the desktop and connect the PDA to a Windows Live account - though it will probably take another 2 years even though ActiveSync 4 already speaks TCP/IP. In the meantime, Microsoft is trying to figure out how to deliver enough value in their pseudo-Google services to bring in viewers to click their ads - and failing to deliver versions of Windows Mobile with all the functionality many users want.

Maybe a trackback to Scoble will at least help someone at Microsoft realize that they can actually make some money from this services thing if they stay focused on small features that will sell to current Microsoft users. Then, if Microsoft still wants in on “the media thing” just buy out Sling in two years, assuming Apple hasn’t already done it better.

Google Print Hack: View and Read ALL Pages of ANY Book

Posted by Mike Bijon December 06, 2005

Since Google’s search supports using OR, it’s possible to find (almost) every page in a book with a simple search string. Find the book you would like to read, click on the book to pull up some result pages, then “search inside the book” with the string below. Just about every page will be in the search results:

and | but | of | a | on | or | the | from | I | you | it

There should be enough natural language that the page gaps are less than 5 pages long - letting you “next page” through the gaps. I’ve already found one exception, Combinatorial Optimization: Algorithms and Complexity. Anyone have suggestions to improve on my search string above?
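
For anyone who wants to tinker with the word list, here is a small sketch that builds the search string and checks a set of result pages for gaps of 5 pages or more. It doesn’t talk to Google at all: the function names are mine, and the page numbers in the example are invented.

```python
# The word list from the search string above.
COMMON_WORDS = ["and", "but", "of", "a", "on", "or", "the", "from", "I", "you", "it"]


def build_query(words):
    """Join words with the | (OR) operator."""
    return " | ".join(words)


def long_gaps(hit_pages, total_pages, min_length=5):
    """Return (first, last) ranges of missing pages at least min_length long."""
    gaps = []
    previous = 0
    # The sentinel total_pages + 1 lets the loop check the gap after the last hit.
    for page in sorted(set(hit_pages)) + [total_pages + 1]:
        missing = page - previous - 1
        if missing >= min_length:
            gaps.append((previous + 1, page - 1))
        previous = page
    return gaps


query = build_query(COMMON_WORDS)

# Invented example: a 12-page book where the search hit pages 1-3 and 10-11.
example_gaps = long_gaps([1, 2, 3, 10, 11], total_pages=12)
```

Any range it reports is a stretch where “next page” alone wouldn’t have bridged the gap, so those are the pages where extra words in the list would help most.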

Update: Just one day later, they didn’t take long… Google Print now places a limit on the number of pages you can view in any single book.

Microsoft Patents Real-Time Software License Enforcement (Will they be the next ‘Big Brother’?)

Posted by Mike Bijon December 01, 2005

Is it still spyware if it’s the OS double-checking its licensing at startup?

That’s not to say that Microsoft will enforce the scheme documented in their new patent app, or that the application will be approved rather than restricted in light of prior art. Nonetheless, I wonder if Big Brother’ish enforcement of licensing will really be more profitable for Microsoft. As a major software publisher trying to stay ahead of free, open source competitors, I wouldn’t risk alienating customers by shutting down their systems. System shutdown is included in Microsoft’s real-time enforcement patent app:

The system receives digitized licenses associated with computer applications in a secure license store. The licenses are then monitored and compared with the actual use by users to determine compliance with licenses. If users employ an application in violation of licensing terms then corrective action can be taken such as providing warnings and/or shutting down or denying access to a licensed application.

With malware snooping on every 2nd PC and Google “anonymously” recording every page viewed, can anyone say they didn’t expect a major software vendor to enforce licensing over the internet? After all, even the auto companies are investigating ways to shut off your car when you stop paying the bills. Then again, a patent doesn’t mean anyone can actually pull it off. Odds are both Microsoft and an auto major will bungle their implementations and lock half their customers out of their cars AND computers…

Thanks to the IP news at dave’s district for catching recent Microsoft patent apps.
