Posted by Mike Bijon December 28, 2005
Steve Rubel suggests you can “Read Most of O’Reilly’s Hacks Books for Free Using Google” via the Google Print book-search system. His method only works well on books that have been “opened up” by Google, presumably with the publisher’s permission. Books that have not been opened up are much more restricted, and even books with publisher permission have some restricted passages (examples below).
Google Print has implemented protections for copyright holders (and to protect itself from copyright holders), so Steve’s technique doesn’t work well on most books. Not long after the idea came out of reading entire books by searching Google Print with terms like “and | but | of | a | on | or | the | from | I | you | it”, Google began showing only about two pages of results before informing you that “Your search is too general. Please try again with a more specific query.” Google Print also requires users to be signed in to a Google account before viewing whole pages of books at all. Once someone is signed in, Google tracks page views on a per-book basis and restricts viewing to just 10-20 total pages in most books. Books with more pages available for viewing are either copyright-free or presented with the permission of the copyright holder, as the notice “Provided by O’Reilly through the Google Books Partner Program” on pages of both Podcasting Hacks and Google Hacks indicates.
Example of restrictions in an un/semi-restricted book, Podcasting Hacks:
The book Podcasting Hacks, which Steve Rubel uses as an example, is unrestricted enough that O’Reilly has probably granted permission for it to be almost entirely available. Nevertheless, some content cannot be accessed regardless of the search terms or method of viewing: try to access page #173 in Podcasting Hacks and let me know if you’re successful. I wasn’t able to view page #173 with Steve’s method, the list-of-all-pages method, or any other search for page-specific terms.
Example of a restricted book with more typical restrictions, Google Hacks:
Using my own and two other Google accounts, I found that I can’t read more than 12 pages from Google Hacks before I see the message:
You have reached your viewing limit for this book (why?). You may continue browsing to view unrestricted or already viewed pages, or visit the About this Book page.
Additionally, page #7 is restricted altogether from viewing - and if I had more accounts to test with I’m sure I would find other restricted pages.
It’s likely that smart publishers, a group I think includes Tim O’Reilly, are finding a balance of promotion and sales by releasing slower/older content via Google Print. Judging from the layers of page-count restrictions and even page-specific limitations, I suspect that Google Print is really becoming a tool for promoting publications, rather than a tool for reading printed content for free.
Posted by Mike Bijon December 15, 2005
Anyone remember that Yahoo bought del.icio.us this week?
Well, if you also haven’t forgotten that Yahoo bought blo.gs in mid-June this year, and that Yahoo promptly “broke” the ability to add new links to blo.gs favorites for almost 6 months, then the current del.icio.us “one hour” maintenance outage (or see image below) won’t strike you as surprising. Unless my internet connection is stuck in a hole and my DNS isn’t updating right, the current del.icio.us outage started around 6pm PST and is still going 6 hours later:

Is Yahoo doing maintenance, rebuilding these sites from scratch, or trying to completely devalue their new investments?
Update: Sometime in the past 6 hours (it’s 6am PST) del.icio.us has been brought back online. I haven’t noticed any obvious changes. Has anyone else? Maybe the greatly extended blo.gs outage is actually due to major architectural or security issues…
Posted by Mike Bijon December 13, 2005
To a certain extent Google must understand that web usage data has value. The free Google Analytics service isn’t free because it helps Google’s ad-driven business model, but because the aggregate usage data gathered (referrers, click-outs, overall traffic, etc.) on thousands of sites is far more valuable than the servers and bandwidth spent to run it. Web usage data comes directly from the behavior of web users, the chief decision-makers on what is relevant, and can be used to find the context of the pages it tracks.
Search is the new distribution channel.
With the democratization of distribution created by the web, no company can produce even a large fraction of available content, but it is still possible to index and search the majority of content. Thus, better search engines actually serve as improved distribution channels and, like all media channels before, the profits are in advertising. Google’s income is in the intelligence of their advertising engine (AdSense), but without an intelligent search engine there would not be enough traffic for AdSense to be more than another also-ran. With Yahoo, MSN, and others quickly eroding Google’s lead in search relevance, there is a risk that Google’s traffic growth might slow or cease, reducing their revenue growth and related stock price.
Links built Google’s massive traffic.
Returning contextual search results is difficult because it’s relatively easy to create relevant-looking variables to feed to a relevancy-seeking program, like a search engine algorithm. It was the use of hyperlinks as a variable in search relevancy that allowed Google to return dramatically improved search results versus competitors. Inbound links aren’t contained in the actual page being indexed, so they are harder to fake than page variables or content, which can be manipulated to skew the context of a page. Now all major search engines use hyperlinks in their algorithms, and content providers and spammers alike focus on link-building to get listed in search engines. This greatly decreases the value of hyperlinks. Just as keywords, meta tags, and contextual language filters did before, links increasingly fail to build relevant context for search results or to differentiate one search engine from another. Without “better” search results Google may lose its key differentiator and its ability to grow search as its primary source of traffic and ad revenue.
Web usage data is the next indexing variable.
A pure search provider has no traffic data available to collect other than referrers and click data on its own results; there’s just nothing else to collect. Referrals just point out which sites on the web are the most popular, and click-throughs are largely based on the position or order of search results - it’s bad data to base the relevance of search results on. Broad web usage data, like the click-out and viewing-time data in the aforementioned Google Analytics, is the next variable that can be used to return relevant search results. That data is one step further removed from the control of website builders and one step closer to the uses of web surfers, making it even harder than links to bend and create fake relevance with.
Does Google realize the value yet?
Google is in an excellent position to collect a large amount of web usage data from their own web services, 3rd-party sites displaying ads, and sites using their Analytics service. Increasingly, their suite of products appears to be tuned toward gathering data and not just toward driving search traffic. I am convinced that Google has already realized the value of usage data in determining context. What I do not think Google understands yet is how to properly use the collected data to determine context.
How to discover the “TrafficRank” algorithm?
Trying to extract an algorithm from user click-out and viewing-time data will be harder than it was to build PageRank. PageRank was built with only two variables (# links in and # links out) while the traffic data is statistical in nature - making curve-fitting and “inexact” statistics necessary. The missing data is also more complex; PageRank is normalized, so inbound links missing from an index don’t heavily skew it, but the absence of users who are not on Google’s websites does create a skew. I suspect that the web users Google does have data on are heavily skewed toward the technical and bandwidth-hungry elite, since many low-bandwidth and unskilled users start their web surfing from the default AOL or MSN home pages.
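To make the “two variables” point concrete, here is a minimal sketch of the textbook PageRank iteration in Python. It is a simplification, not Google’s production algorithm: the function name, the tiny example graph, and the even handling of pages with no outbound links are my assumptions, and the damping factor of 0.85 is the value commonly quoted from the original PageRank paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank sketch (illustrative, not Google's real code).

    links maps each page to the list of pages it links out to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly, so the ranks always sum to 1.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # "# links out": a page splits its rank among its targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    # "# links in": a target sums the shares it receives.
                    new_rank[target] += share
            else:
                # Assumption: a page with no outbound links spreads evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "d" links in to "a" but nothing links to "d".
example_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(example_links)
top_page = max(ranks, key=ranks.get)
```

Because the ranks are normalized to sum to 1, dropping a link from the index shifts rank around without breaking the scale, which is the property the traffic data lacks.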
Have they already started building it?
Remember the furor over eval.google.com and Google’s use of human testers? The data built from that testing are likely to be very relevant when compared to real web usage data. I’m educated as an engineer and one of the things engineers loved most about university lab courses was the ease of getting “great” results. It’s easy to curve-fit and statistics are fungible when there’s only one side to the data. The best thing Google can do to test the traffic data will be to build a small, neutral dataset of user behavior, or several small datasets with known skews that can be cross-referenced to the large traffic dataset. By using eval.google.com to mirror the link neighborhoods of a typical blog network and an AOL network, the “skew” I suggested earlier could be measured - and eliminated.
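One standard way to “measure and eliminate” a known skew is to reweight the big dataset so its mix of users matches a small reference dataset. The Python sketch below shows the idea under loud assumptions: the segment labels (“tech”, “casual”), the viewing times, and the 50/50 reference split are all made up for illustration, and nothing here is known to be what Google does.

```python
from collections import Counter


def segment_weights(observed_segments, reference_shares):
    """Weight per segment = share in the reference / share observed."""
    counts = Counter(observed_segments)
    total = len(observed_segments)
    weights = {}
    for segment, reference_share in reference_shares.items():
        observed_share = counts.get(segment, 0) / total
        if observed_share > 0:
            weights[segment] = reference_share / observed_share
    return weights


def weighted_mean(values, segments, weights):
    """Average of values, with each one scaled by its segment's weight."""
    numerator = sum(v * weights[s] for v, s in zip(values, segments))
    denominator = sum(weights[s] for s in segments)
    return numerator / denominator


# Hypothetical traffic sample: 80% technical users, 20% casual users.
segments = ["tech"] * 8 + ["casual"] * 2
# Hypothetical viewing times (seconds) on one page.
viewing_times = [30] * 8 + [90] * 2

# Hypothetical "neutral" reference dataset says the real mix is 50/50.
reference = {"tech": 0.5, "casual": 0.5}

weights = segment_weights(segments, reference)
raw_average = sum(viewing_times) / len(viewing_times)
corrected_average = weighted_mean(viewing_times, segments, weights)
```

In this toy sample the raw average viewing time is 42 seconds, while the reweighted average is 60 seconds, so the over-represented technical users were dragging the number down. The sketch only works if every observed segment also appears in the reference dataset.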
…BTW, if anyone from any of the GAMY search engines is still reading (yeah, I’ve even bored myself by now), drop me an email “myname” @ gmail.com (my name is mikebijon). I did some studies a while back that cross-referenced total area population and population density using the Poisson distribution (fun with statistics again…) that just might be relevant to what analyzing web usage might yield. I’d love to hear if that’s true.
Posted by Mike Bijon December 09, 2005
I’m an IT manager by profession, and while Google has become “the ultimate IT answer guide” it irritates me that so few IT staffers actually document their fixes publicly so Google can make them available. Of course, several of my staff blog about video games and online poker on a daily basis, so maybe the problem is just a passion/work thing. When I have time I’ve been posting tech tips that should save anyone else in IT some time.
In a recent article I wrote about Windows Mobile 5.0, I criticized the new OS and ActiveSync heavily for their problems - maybe I just have unrealistic expectations of the biggest software company in the world… I’ve also been playing with the Windows Live services and am disappointed - maybe I just have unrealistic expectations of the biggest software company in the world…
Microsoft really needs to sell to the needs of their current software users before they embrace and extend to the entertainment markets. For another $20 a year I’d love to do away with ActiveSync on the desktop and connect the PDA to a Windows Live account - though it will probably take another 2 years even though ActiveSync 4 already speaks TCP/IP. In the meantime, Microsoft is trying to figure out how to deliver enough value in their pseudo-Google services to bring in viewers to click their ads - and failing to deliver versions of Windows Mobile with all the functionality many users want.
Maybe a trackback to Scoble will at least help someone at Microsoft realize that they can actually make some money from this services thing if they stay focused on small features that will sell to current Microsoft users. Then, if Microsoft still wants in on “the media thing”, they can just buy out Sling in two years, assuming Apple hasn’t already done it better.