We have a set of distributed parsers that are responsible for fetching and parsing feeds in a given order. These parsers have the following characteristics:
Feed URLs are stored in the database, along with a time of next fetch. The feeds are fetched in groups from this database, based on this time of next fetch. The next fetch time is recalculated after we parse each feed, based on the HTTP result code, the number of new entries detected, the number of subscribers to the feed, etc. (this is probably our most secret ingredient!)
When new entries are detected, the parsers will send the entries to our firehosers.
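The fetch loop described above can be sketched as follows. This is only an illustration, not our actual formula: the function names (`next_fetch_delay`, `due_feeds`), the base interval, the multipliers and the bounds are all assumptions made up for the example. It only shows the shape of the idea: pull a group of due feeds ordered by next fetch time, then compute a new delay from the HTTP result code, the number of new entries and the number of subscribers.

```python
import heapq
import math
from dataclasses import dataclass, field


@dataclass(order=True)
class Feed:
    # Only next_fetch takes part in ordering, so the heap is sorted by due time.
    next_fetch: float
    url: str = field(compare=False)
    subscribers: int = field(compare=False, default=0)


def next_fetch_delay(status, new_entries, subscribers, base=900.0):
    """Hypothetical heuristic: returns a delay in seconds until the next fetch."""
    delay = base
    if status >= 400:
        delay *= 4                      # failing feed: back off strongly
    elif status == 304 or new_entries == 0:
        delay *= 2                      # nothing new: slow down
    else:
        delay /= 1 + new_entries        # active feed: come back sooner
    delay /= 1 + math.log10(1 + subscribers)  # popular feeds are checked more often
    return max(60.0, min(delay, 86400.0))     # assumed bounds: 1 minute to 1 day


def due_feeds(queue, now, group_size):
    """Pop up to group_size feeds whose next fetch time has passed."""
    batch = []
    while queue and queue[0].next_fetch <= now and len(batch) < group_size:
        batch.append(heapq.heappop(queue))
    return batch


def reschedule(queue, feed, now, status, new_entries):
    """Push the feed back with a freshly computed next fetch time."""
    feed.next_fetch = now + next_fetch_delay(status, new_entries, feed.subscribers)
    heapq.heappush(queue, feed)
```

In this sketch the database is replaced by an in-memory heap; in a real deployment the same "due before now, limited to a group size" selection would be a query against the stored next fetch time.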
The firehosers are the XMPP components in charge of dispatching new entries to the right accounts. They also handle subscription and unsubscription requests. They are load-balanced by an ejabberd XMPP server.
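The firehosers' bookkeeping can be pictured with a small sketch. It leaves out XMPP entirely and only models the mapping from feeds to subscribed accounts; the class name `SubscriptionTable`, its methods and the example account names are hypothetical and not part of our API.

```python
from collections import defaultdict


class SubscriptionTable:
    """Hypothetical in-memory model of which accounts follow which feeds."""

    def __init__(self):
        self._by_feed = defaultdict(set)

    def subscribe(self, account, feed_url):
        self._by_feed[feed_url].add(account)

    def unsubscribe(self, account, feed_url):
        # discard() is a no-op when the account was not subscribed.
        self._by_feed[feed_url].discard(account)

    def dispatch(self, feed_url, entries):
        """Return one (account, entry) pair per notification to send."""
        accounts = sorted(self._by_feed.get(feed_url, ()))
        return [(account, entry) for account in accounts for entry in entries]
```

For example, after two accounts subscribe to the same feed, `dispatch` for one new entry yields two notifications, one per account; after one of them unsubscribes, it yields a single notification.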
Determining the best time of next fetch is the trickiest part of Superfeedr. We employ various Helpers, or independent daemons, which constantly monitor various streams on the web. When one of these helpers detects a feed we are parsing in one of these streams, we immediately fetch it, parse it, and send notifications.
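A helper's core check can be sketched like this. The sketch assumes a stream is simply an iterable of URLs and that tracked feeds are stored in a normalized form; the function names `normalize` and `urls_to_fetch_now` and the normalization rules are illustrative assumptions, not a description of our daemons.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Assumed normalization: lower-case scheme and host, drop trailing slash and fragment."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        parts.query,
        "",
    ))


def urls_to_fetch_now(stream_urls, tracked):
    """Return tracked feeds seen in the stream, each at most once, in order of appearance."""
    seen = set()
    hits = []
    for url in stream_urls:
        key = normalize(url)
        if key in tracked and key not in seen:
            seen.add(key)
            hits.append(key)
    return hits
```

Each URL returned would then be fetched right away instead of waiting for its scheduled next fetch time.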
Some of the streams we use are:
If your service publishes feeds (RSS or Atom) and you have a similar firehose, please let us know.
As described above, our architecture is fully distributed. This allows us to duplicate/mirror each of the system's components (database, XMPP servers, etc.). Duplicates are synchronized across different data centers around the world, which allows Superfeedr to guarantee 99% uptime.
We have selected our providers based on their ability to provide on-demand hosting, which allows Superfeedr to expand to meet demand and keep up with real-time parsing needs.