
podinfra.net

Best practices for podcasting's open infrastructure

Ensure your RSS crawler is identifiable

It is important to indicate to a server where its traffic is coming from. It helps podcast hosting companies understand who is consuming their RSS feeds, and can also help podcasters know where their podcast is being consumed.

You can see your current scraper’s user agent by looking at the episode notes in Season 1, Episode 1 of the PodClock podcast, which is a podcast testing feed. Here’s the RSS feed, though it is also listed in all the usual directories.

Here’s how to set it in Go, in Ruby, in Python’s feedparser module, using Python feed requests, and in the Axios library.

A best-practice user-agent for your RSS scraper would be something like:

MyAppNameCrawler/1.2 +http://mypodcastapp.example.com

There’s no evidence that any podcast host is banning any podcast app based on its user agent, so there is no need to include Mozilla/5.0 or any similar obfuscation.
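As a minimal sketch of setting such a user agent, here is one way to do it with Python's standard-library `urllib.request`. The app name, version and URL are the placeholder values from the example above, and `build_feed_request` is a hypothetical helper name, not part of any library.

```python
import urllib.request

# Placeholder values for illustration; substitute your own app name, version and URL.
USER_AGENT = "MyAppNameCrawler/1.2 +http://mypodcastapp.example.com"


def build_feed_request(feed_url: str) -> urllib.request.Request:
    """Return a GET request for feed_url that carries an identifiable User-Agent."""
    return urllib.request.Request(feed_url, headers={"User-Agent": USER_AGENT})


# Usage (performs a network call, so it is left commented out):
# request = build_feed_request("https://example.com/feed.xml")
# with urllib.request.urlopen(request, timeout=30) as response:
#     body = response.read()
```

Keeping the string in one constant makes it easy to send the same value on every request your crawler makes, including image and audio fetches.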

Use ETag for caching

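When you retrieve an RSS feed, store the value of the ETag response header in your database. The ETag value changes when the content changes.

When you next poll the feed, send the stored ETag value back in an If-None-Match request header. If the feed has not changed, the server can answer 304 Not Modified with no body, which saves bandwidth for both you and the host.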
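The store-and-resend cycle could be sketched as follows, again using only Python's standard library. The function names, the `opener` parameter (there only so the network call can be swapped out in tests) and the user agent string are illustrative assumptions. Storing the returned ETag in your database is left to the caller.

```python
import urllib.error
import urllib.request

# Placeholder user agent for illustration; substitute your own.
USER_AGENT = "MyAppNameCrawler/1.2 +http://mypodcastapp.example.com"


def build_conditional_request(feed_url, stored_etag=None):
    """Build a GET request that sends If-None-Match when an ETag is stored."""
    headers = {"User-Agent": USER_AGENT}
    if stored_etag:
        # Send the ETag back exactly as received, quotes included.
        headers["If-None-Match"] = stored_etag
    return urllib.request.Request(feed_url, headers=headers)


def fetch_feed(feed_url, stored_etag=None, opener=urllib.request.urlopen):
    """Return (body, etag); body is None when the feed is unchanged (304)."""
    request = build_conditional_request(feed_url, stored_etag)
    try:
        with opener(request, timeout=30) as response:
            # Changed (or first fetch): return the new body and its ETag.
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports 304 Not Modified as an HTTPError; keep the old ETag.
            return None, stored_etag
        raise
```

A caller would typically skip re-parsing when the returned body is `None`, and otherwise parse the body and save the new ETag alongside the feed record.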

Consider…

Overcast/1.0 Podcast Sync (123 subscribers; feed-id=456789; +http://overcast.fm/)

Page authors: jamescridland

Site authors: James Cridland, Evo Terra