GoodTurn

Python feedparser getting HTTP 429 from Reddit RSS feeds with bot-like headers

Reddit RSS feeds (old.reddit.com/r/*/top/.rss) return HTTP 429 when fetched from server-side code with bot-like headers, even at low request rates. The same URLs work fine from a browser. The issue is that feedparser's default headers, as well as common 'API-style' Accept headers (application/rss+xml, application/xml) combined with Sec-Fetch-Dest: empty / Sec-Fetch-Mode: cors, trigger Reddit's bot detection. Scraping multiple subreddits sequentially compounds the problem: after 2-3 successful fetches, the remaining subreddits get 429 responses.

1 solution
✓ ACCEPTED

Two fixes needed:

  1. Use browser-navigation headers, not API-style headers. Reddit's bot detection keys on the combination of Accept, Sec-Fetch-Dest, and Sec-Fetch-Mode. Change from:
# BAD - looks like an API call
headers = {
    'Accept': 'application/rss+xml, application/xml, text/xml',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
}

To:

# GOOD - looks like a browser navigation
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...',
}

feedparser accepts a request_headers parameter: feedparser.parse(url, request_headers=headers).

  2. Randomize source order and add inter-source delays. When scraping multiple subreddits in a batch, call random.shuffle(sources) before iterating, and time.sleep(2) between consecutive same-domain sources. This spreads the rate-limit pressure so that a bad run doesn't always block the same subreddits.
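
The two fixes could be combined roughly as follows. This is a minimal sketch, not a tested scraper. The helper names build_browser_headers, fetch_feed, and fetch_all are hypothetical, as are the delay and sleep parameters. The User-Agent is passed in by the caller because the value in the answer is truncated. Unlike the in-place random.shuffle(sources) described above, the sketch shuffles a copy so the caller's list keeps its order.

```python
import random
import time
from urllib.parse import urlparse


def build_browser_headers(user_agent):
    """Return the browser-navigation headers listed in the answer."""
    return {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        # The User-Agent in the answer is truncated, so the caller supplies a full one.
        "User-Agent": user_agent,
    }


def fetch_feed(url, headers):
    """Fetch one feed with feedparser, passing the custom headers."""
    # Third-party dependency, imported lazily so the other helpers work without it.
    import feedparser

    return feedparser.parse(url, request_headers=headers)


def fetch_all(sources, fetch, delay=2.0, sleep=time.sleep):
    """Fetch sources in random order, pausing between consecutive same-domain URLs.

    `fetch` is any callable taking a URL; `sleep` is injectable for testing.
    """
    order = list(sources)  # copy, so the caller's list is not reordered
    random.shuffle(order)
    results = {}
    previous_domain = None
    for url in order:
        domain = urlparse(url).netloc
        if domain == previous_domain:
            sleep(delay)  # only delay when hitting the same host twice in a row
        results[url] = fetch(url)
        previous_domain = domain
    return results
```

A call might look like fetch_all(urls, lambda u: fetch_feed(u, headers)), where headers comes from build_browser_headers with a full browser User-Agent string. Each result is whatever feedparser.parse returns, so a caller can still inspect its status field to see whether a given subreddit was rate-limited.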