Several years ago, my morning ritual involved launching a web browser and loading a dozen or more web pages. First as separate windows, and eventually as tabs, once they became part and parcel of the browser. Sites like cnn.com and slashdot.org were my launching point for broad swathes of news. I pored over arstechnica.com and anandtech.com for more in-depth information to sate my technothirst. Enthusiast sites like nvnews.net and tomshardware.com kept me up to date on the minutiae of gaming hardware.
I would spend my morning catching up on news from all of these sites and more. As the day wore on, I would periodically hit refresh on each of these pages, then swap to a new page, hit refresh, and continue. By the time I got through the list, the content on the first page would be done loading, and I could start reading. I would consume a site until I hit an overlap point from my previous viewing, then continue on to the next. God forbid I skipped reading a high-volume site for a day; I'd have too much to catch up on. Taking a break from this ritual was not an option; eventually I'd have to accept the fact that I had missed some news, and start anew with today's.
An Easier Way
I’m still a bit crazy about missing news, but thankfully RSS has at least made this process manageable. It amuses me to no end that the average user has no idea this technology exists, despite the fact that it is so widely used. That said, not every page puts out a feed. Sometimes they do publish one, but it’s riddled with information you don’t want: occasional advertisements, rants about off-topic issues, and other items that don’t interest you. While stumbling my way to a solution for this problem, I discovered Yahoo Pipes.
Imagine if you could take a source of information, anything from a webpage to a CSV file, and filter out what you don’t want. The resulting data can then be merged with another data source, and then republished as a single new data source. With Yahoo Pipes, you can do exactly this.
Tor doesn’t offer feeds for its authors, so my first pipe took their master feed and filtered it by the author I wanted. I created a pipe to filter a friend’s blog posts out of a collaborative site. I made a pipe to watch the price-change feed for appshopper.com and filter it down to only the applications I was interested in buying. I even made a pipe that takes a calendar and strips out events I don’t want to attend. You can take almost any published data as input, transform and combine it at will, and output it in whatever format you desire. It isn’t exactly the most user-friendly of tools, but it’s got enough power to make up for that.
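To make the "filter by author" pipe concrete, here is a minimal sketch of the same idea in Python. It assumes a plain RSS 2.0 feed where each item carries an author element; the feed content, titles, and author names below are invented for illustration, and a real pipe would fetch the feed from a URL rather than inline it.

```python
# Sketch of an author-filter pipe: keep only the items written by one author.
import xml.etree.ElementTree as ET

# A stand-in for a site's master feed (hypothetical content).
FEED = """<rss version="2.0"><channel>
  <title>Master Feed</title>
  <item><title>Post A</title><author>Jane Doe</author></item>
  <item><title>Post B</title><author>Someone Else</author></item>
  <item><title>Post C</title><author>Jane Doe</author></item>
</channel></rss>"""

def filter_by_author(feed_xml: str, author: str) -> list[str]:
    """Return titles of items by `author`, discarding everything else."""
    channel = ET.fromstring(feed_xml).find("channel")
    kept = []
    for item in channel.findall("item"):
        if item.findtext("author") == author:
            kept.append(item.findtext("title"))
    return kept

print(filter_by_author(FEED, "Jane Doe"))  # → ['Post A', 'Post C']
```

The republishing half of a pipe would serialize the surviving items back out as a new feed, but the filtering step is the heart of it.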
There’s a serious amount of overlap in my feeds; I’ll read about news from its original source, from someone reporting on that source, from someone reporting twice removed, and then it’ll hit mainstream media and I’ll read about it 10x over the next few weeks as it gets repeated. I like knowing about bleeding-edge technology news before the mainstream media does. Often I’ll hear about things days or weeks before they hit CNN. Rather than follow a pundit and read the news he puts out, I’ll often go straight to his sources and follow them instead. I have to constantly prune my sources, lest I become overwhelmed. Do I want to read information from sources that are among the first to report it, or would I rather read a more critical analysis from a site that reports it thirdhand a few days later? Often the best choice still involves some level of overlap.
I’ve tried to create pipes to handle this overlap, but it’s not easy. Assuming two sites publish a story and use the same title, the stories are easily matched and one can be discarded. This rarely happens. Even using lots of metadata (author, source link, rough date/time, categories), it’s a beast of a matching problem and difficult to do consistently. Also, there’s the problem of comments, which offer actual useful information from time to time. If you discard an article, you lose the comments of all the other duplicate articles.
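Here is a rough sketch of the metadata-matching approach, assuming feed entries have already been parsed into dictionaries. The field names, the normalization, and the one-day tolerance window are all my own assumptions for illustration, and they show exactly why this is a beast: two entries count as duplicates only when normalized titles match and publication times are close, so any reworded title slips through.

```python
# Naive duplicate matching on title + publication time.
from datetime import datetime, timedelta

def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(title.lower().split())

def is_duplicate(a: dict, b: dict, window: timedelta = timedelta(days=1)) -> bool:
    return (normalize(a["title"]) == normalize(b["title"])
            and abs(a["published"] - b["published"]) <= window)

def dedupe(entries: list[dict]) -> list[dict]:
    """Keep the first copy of each story; discard later near-identical ones."""
    kept: list[dict] = []
    for entry in entries:
        if not any(is_duplicate(entry, k) for k in kept):
            kept.append(entry)
    return kept

entries = [
    {"title": "Big GPU Launch", "published": datetime(2012, 3, 1, 9)},
    {"title": "big  gpu launch", "published": datetime(2012, 3, 1, 15)},
    {"title": "Big GPU Launch", "published": datetime(2012, 3, 20)},
]
print(len(dedupe(entries)))  # → 2 (same-day variant dropped, weeks-later repost kept)
```

Note that discarding the second entry also discards any comments attached to it, which is the loss described above.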
Clearly this issue needs to be handled at a higher level than pipes was designed for. The underlying data to solve this problem is there. Most reputable sites link back to their sources. Brute-force comparison of articles can easily establish whether they’re discussing the same topic. I want an engine that sees all news from all sources and groups it together by common threads. For critical topics, an actual person could serve to quickly make links between related stories.
As a story develops, additional sources and comments can be added to the aggregate news source. If I read about a product, and it is mentioned a dozen times over the course of the next month, with each source offering no new information, I shouldn’t have to see any of the duplicate articles. If these new sources offer actual new information, or a decent amount of analysis, I’d like the information to be merged with the original story, and the story marked as updated with new information. No matter what site you visit to learn about a news story, you should be able to easily link to the master source of all news on that story.
Someday soon I hope to have my overlap handled automatically and seamlessly. Perhaps then I’ll free up a few more minutes each day to do something besides reading. Then again, perhaps I’ll just broaden my horizons and start reading new topics.
A tool far less powerful than pipes, but quite interesting, has cropped up recently: ifttt.com. You can use recipes which allow you to have triggered actions happen on pre-specified events. It has only a small overlap with pipes in terms of functional uses, but you can do some interesting things with it. Feeds can be created from every post you star on Twitter. Instagram pictures can auto-save to your Dropbox. Checking in on Foursquare can trigger a Facebook status change. A text message can be sent if the weather report says it’s going to rain. If someone tags you in a picture on Facebook, you can have the picture emailed to your mom. I’ve found ways to use this service to complement pipes, by creating quick, easy-to-use data sources that pipes can then work with.