Diffbot is a geeky and incredibly interesting technology that uses bots, algorithms, computer vision and artificial intelligence to process the content on the Web the way a human being can. “The entire Internet can be broken down into 30 different page types,” explains co-founder Mike Tung, also known as “Diffbot Mike,” and “Diffbot can identify them all.” Diffbot knows the difference between a social network profile, a blog post, a site’s front page, a product page, an event page and dozens more.
Today, Diffbot is releasing its first set of APIs, now open to all developers for free. The launch has the potential to dramatically impact the types of applications developers can build, and for consumers, it means a whole host of intelligent applications are about to emerge.
The New APIs: On-Demand & Follow
With the two APIs available now, developers can build apps that automatically extract meaning from pages, apps that understand what’s trending and who’s talking about it, apps that provide RSS feeds where none were available before and apps that read just the relevant parts of webpages aloud, ignoring ads, header and footer copy.
And that’s just for starters. Future APIs will enable developers to automatically turn event pages into calendar appointments, social network profiles into vCards or automatically extract shipping prices or reviews from product pages, among other things. While Diffbot doesn’t have a set roadmap, it expects to launch these additional APIs over the next few months.
Today, the first two APIs available are:
- On-Demand API: This API is divided into page types “Frontpage” and “Article.” The former is used to analyze site homepages and index pages using common layout markers like headlines, bylines, images, articles, ads, etc. The Article API extracts clean article text, pictures and tags. (For example, see Readably.)
- Follow API: This is used to track the changes or updates made to any webpage. Diffbot automatically determines the part of the page that the developer wants to follow and extracts metadata like title, images, text summary and more, then segments the page into meaningful sections (See above photo).
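To make the idea behind the Follow API a little more concrete, here is a minimal Python sketch of section-level change tracking. It is not Diffbot's implementation or its API: it assumes a page has already been segmented into named sections, and the function names (`fingerprint`, `changed_sections`) and the sample section names are hypothetical, chosen only for illustration.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a section's text after collapsing whitespace (hypothetical helper)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_sections(old: dict, new: dict) -> dict:
    """Compare two snapshots of a page, each a mapping of section name -> text.

    Returns a mapping of section name -> "added", "updated" or "removed".
    Assumes the segmentation into sections has been done elsewhere.
    """
    old_prints = {name: fingerprint(text) for name, text in old.items()}
    changes = {}
    for name, text in new.items():
        if name not in old_prints:
            changes[name] = "added"
        elif old_prints[name] != fingerprint(text):
            changes[name] = "updated"
    for name in old:
        if name not in new:
            changes[name] = "removed"
    return changes


if __name__ == "__main__":
    before = {"headline": "City council meets", "body": "Agenda  posted."}
    after = {"headline": "City council votes", "body": "Agenda posted.", "sidebar": "New links"}
    print(changed_sections(before, after))
```

In this sketch, whitespace-only differences are ignored, so a section counts as updated only when its visible text changes.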
What Can Diffbot Actually Do?
These same APIs are already being used by companies like speech recognition system maker Nuance, AOL (disclaimer: TechCrunch is owned by AOL), social media monitoring firm SocMetrics, and others.
AOL uses Diffbot to extract the title, author, image, text, videos, topics and other metadata for its new iPad mag, AOL Editions. Nuance uses the technology to improve its natural language processing in a product for doctors, which requires comprehension of complex medical terminology. SocMetrics sends bit.ly shortened links to Diffbot to get the full article text and topics, so it can determine which social media users are talking about which topics the most.
These are just a few big-name examples. There are smaller but just as innovative use cases out there, too. Hacker News Radio, for example, reads Hacker News and comments to you. FeedBeater makes it easy to turn any URL into an RSS feed automatically (one of Diffbot’s first creations). And this Diffbot-generated Twitter feed tracks changes to the webpage for the city of São Paulo, Brazil (as it lacks RSS), and tweets the updates.
The new self-serve platform for developers is free up to 50,000 API calls per month. The Cloud plan provides 100,000 calls for free, then costs $0.002 per call afterwards. The Managed plan for Enterprise requires custom pricing.
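For a rough sense of what the cloud pricing means in practice, the arithmetic can be sketched in a few lines of Python. This assumes the 100,000 free calls and the $0.002 rate apply within a single billing period, which the pricing above does not spell out; the function name `cloud_plan_cost` is made up for this example.

```python
FREE_CALLS = 100_000          # free allowance on the cloud plan (from the article)
PRICE_PER_1000_CALLS = 2.0    # $0.002 per call, expressed per 1,000 calls


def cloud_plan_cost(calls: int) -> float:
    """Estimate the cloud plan bill in dollars for one billing period.

    Assumption: only calls beyond the free allowance are billed.
    """
    if calls < 0:
        raise ValueError("calls must be non-negative")
    billable = max(0, calls - FREE_CALLS)
    return billable * PRICE_PER_1000_CALLS / 1000


if __name__ == "__main__":
    for calls in (50_000, 100_000, 250_000):
        print(calls, cloud_plan_cost(calls))
```

Under these assumptions, an app making 250,000 calls would pay for 150,000 of them, or $300.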
Diffbot was founded by Mike Tung and Leith Abdulla, both Stanford PhD students on a leave of absence to build the company. The idea sprang from Tung’s desire to automatically track new assignments on the class website. Diffbot was also the first startup funded by Stanford’s incubator program, now called StartX (formerly SSE Labs).
Diffbot, founded in 2008 by two Stanford students, applies computer vision techniques in order to extract the semantic structure of webpages.
Diffbot analyzes documents much like a human would, using…