
Essentially, MetaRSS comprises the following components:
- A cron job
- An indexer
- Metafilter content
- An HTML cache
- The user interface
The process is initiated by a user request for an RSS feed, currently made with a Greasemonkey script or a bookmarklet. The request simply feeds the user interface a URL. The user interface caches the HTML and stores the URL for later use by the indexer, then parses the HTML and returns the RSS to the user. At fixed intervals, a cron job kicks off the indexer, which simply re-caches the HTML; the cached copy is parsed only when a user requests it.
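To make the flow concrete, here is a minimal sketch in Python (not the language MetaRSS itself is written in). The names `html_cache`, `tracked_urls`, `handle_request`, `run_indexer` and `parse_to_rss` are all made up for illustration, the cache is just an in-memory dict, and the parser is a stand-in that only pulls out the page title. It also assumes a repeat request is served from the cache instead of fetching again.

```python
import re
import xml.etree.ElementTree as ET

html_cache = {}       # URL -> most recently fetched HTML (hypothetical in-memory cache)
tracked_urls = set()  # URLs the indexer should refresh on each cron run


def parse_to_rss(url, html):
    """Stand-in parser: builds a feed whose channel title is the page <title>."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else url
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = url
    ET.SubElement(channel, "description").text = "Feed generated from " + url
    return ET.tostring(rss, encoding="unicode")


def handle_request(url, fetch):
    """User interface: cache the HTML, remember the URL, parse on demand."""
    if url not in html_cache:      # assumption: repeat requests reuse the cache
        html_cache[url] = fetch(url)
    tracked_urls.add(url)          # stored for later use by the indexer
    return parse_to_rss(url, html_cache[url])


def run_indexer(fetch):
    """Cron-triggered indexer: refresh the cache only, no parsing here."""
    for url in sorted(tracked_urls):
        html_cache[url] = fetch(url)
```

Here `fetch` is any function that takes a URL and returns its HTML, so the same sketch works with a real HTTP client or a fake one for testing.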
It's pretty simple, and open to a lot of improvement.
For example, the parser is very rigid. Essentially, it’s hard-coded, using some weird combo of HTML::Parser and regular expressions. If the parser could scrape more flexibly, MetaRSS could return more types of RSS feeds.
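One way a more flexible scraper might look is a parser driven by rules instead of hard-coded patterns. The sketch below is in Python with the standard `html.parser` module, not the Perl HTML::Parser that MetaRSS uses, and `RuleScraper`, `scrape` and the `post` / `comment` class names are hypothetical. Metafilter's real markup may look nothing like this.

```python
from html.parser import HTMLParser


class RuleScraper(HTMLParser):
    """Collect the text of every element matching a (tag, class) rule."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0    # > 0 while inside a matching element
        self.items = []   # one text string per matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside a match: track nesting of the same tag name.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1
                self.items.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data


def scrape(html, tag, css_class):
    """Return the stripped text of each element matching the rule."""
    scraper = RuleScraper(tag, css_class)
    scraper.feed(html)
    scraper.close()
    return [item.strip() for item in scraper.items]
```

With something like this, a different kind of feed would only need a different rule, for example `scrape(html, "div", "comment")` instead of `scrape(html, "div", "post")`, with no new parsing code.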
Additionally, the algorithm that indexes the requested Metafilter pages could be optimized by using last-modified times to determine whether to grab the whole page.
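One reading of that idea is an HTTP conditional GET: send the previous `Last-Modified` value back as `If-Modified-Since`, and skip the download when the server answers 304. The Python sketch below assumes the server actually sends a `Last-Modified` header, which is not confirmed for Metafilter. The function name `fetch_if_modified` and its `opener` parameter are made up for illustration.

```python
import urllib.error
import urllib.request


def fetch_if_modified(url, last_modified=None, opener=urllib.request.urlopen):
    """Return (html, last_modified); html is None if the page is unchanged."""
    request = urllib.request.Request(url)
    if last_modified:
        # Ask the server to send the body only if it changed since this time.
        request.add_header("If-Modified-Since", last_modified)
    try:
        with opener(request) as response:
            body = response.read().decode("utf-8", errors="replace")
            return body, response.headers.get("Last-Modified", last_modified)
    except urllib.error.HTTPError as err:
        if err.code == 304:   # Not Modified: keep using the cached copy
            return None, last_modified
        raise
```

The indexer would store the returned `last_modified` string next to each cached page and pass it back in on the next run, overwriting the cache only when `html` is not `None`.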