How to download RSS feeds with a simple script

Background

RSS is a wonderful way to get headlines from many independent online news sources and browse them as quickly as possible, without subscribing to any website, giving away personal information, or depending on a third-party website to aggregate everything for you.


To save time and avoid depending on any RSS reader, I have written two simple scripts. One downloads all the RSS feeds I want to read and saves them in a format suitable for further processing. The other reads that temporary file and generates one single HTML page with all the news titles and links. The format of the temporary file is very simple. It’s just plain text with three fields per line, separated by a “|” (pipe) character: feed name, article title and article URL. Here’s an example of that file:

  Repubblica|Pisani, difesa a oltranza "Non ho dato notizie a Iorio"|http://example.com/url_1.html
  Repubblica|Caffè, sigaretta, persino l'email così la pausa diventa un privilegio|http://example.com/url_2.html
  Repubblica|L'ultimo trucco "ad aziendam" di Berlusconi il 'padrone' del paese corrompe la democrazia|http://example.com/url_3.html
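
To make the format concrete, here is a minimal sketch (Python 3) of how a record in that file could be split back into its three fields. The names `Headline` and `parse_line` are made up for this illustration and are not part of the scripts described in this post. It assumes that neither the feed name nor the URL contains a pipe, so a pipe inside a title does not break the record.

```python
from typing import NamedTuple


class Headline(NamedTuple):
    """One record of the temporary file (hypothetical helper type)."""
    feed: str
    title: str
    url: str


def parse_line(line: str) -> Headline:
    """Split one 'feed|title|url' record into its three fields."""
    line = line.rstrip("\n")
    # The feed name is everything before the first pipe and the URL is
    # everything after the last one, so a title that happens to contain
    # a pipe is kept intact.
    feed, rest = line.split("|", 1)
    title, url = rest.rsplit("|", 1)
    return Headline(feed, title, url)


sample = 'Repubblica|Pisani, difesa a oltranza "Non ho dato notizie a Iorio"|http://example.com/url_1.html'
record = parse_line(sample)
print(record.feed)
print(record.title)
print(record.url)
```

Splitting from both ends, rather than calling `split("|")` once, is a design choice made only for this sketch; a line with fewer than two pipes raises `ValueError`.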

The real problem is how to generate that file, that is, how to download, parse and reformat RSS from the command line. Here’s how I do it. It works almost perfectly, with one exception explained below, for which I ask for your help.

RSS downloader script

The simplest way I’ve found to download and parse RSS feeds is the Python feedparser module. Once it is installed, it only takes 15 lines of code to generate the list shown above:


   1 #!/usr/bin/python
   2
   3 import sys
   4 import feedparser
   5 import socket
   6
   7 timeout = 120
   8 socket.setdefaulttimeout(timeout)
   9
  10 feed_name = sys.argv[1]
  11 feed_url  = sys.argv[2]
  12 d         = feedparser.parse(feed_url)
  13
  14 for s in d.entries:
  15     print feed_name + "|" + unicode(s.title).encode("utf-8") + "|" + unicode(s.link).encode("utf-8") + "\n"

The script takes as arguments the feed name and the RSS URL (lines 10 and 11). Line 12 is the one that actually downloads the feed and saves all its content in an object named “d”. The timeout in lines 7 and 8 keeps the script from freezing when some website is unreachable. The last two lines loop over the entries of the RSS object and print, together with the feed name, the title (s.title) and URL (s.link) of each entry. That’s it, really.
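
For readers on Python 3, where the `print` statement and `unicode()` no longer exist, the same steps could be sketched roughly as follows. This is an untested adaptation, not the script used for this post: the function names `format_entries` and `download_feed` are invented here, and collapsing whitespace in titles is an extra precaution added in this sketch to keep each record on one line.

```python
import socket


def format_entries(feed_name, entries):
    """Return one 'feed|title|url' line per entry."""
    lines = []
    for entry in entries:
        # Collapse runs of whitespace (including stray newlines) so that
        # each record stays on a single line. This is an addition of the
        # sketch, not something the original script does.
        title = " ".join(entry["title"].split())
        lines.append(f"{feed_name}|{title}|{entry['link']}")
    return lines


def download_feed(feed_name, feed_url, timeout=120):
    """Fetch a feed with feedparser and return its formatted lines."""
    # Third-party module; imported here so that format_entries can be
    # used and tested without feedparser installed.
    import feedparser

    # Same purpose as in the original script: do not hang forever
    # when a website is unreachable.
    socket.setdefaulttimeout(timeout)
    parsed = feedparser.parse(feed_url)
    return format_entries(feed_name, parsed.entries)
```

Calling `download_feed(name, url)` and printing each returned line would reproduce the output format shown earlier. In Python 3 the strings are already Unicode, so the encoding is chosen where the text is written out, for example with `open(path, "w", encoding="utf-8")`, instead of by calling `.encode("utf-8")` on every field.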


One little problem: encoding

As I said, the script works almost perfectly as is, and I hope you’ll find it useful. The only problem I haven’t solved yet is how to handle non-ASCII characters in URLs and, especially, news titles. As an example of what I mean, here’s what I get when I convert to HTML the three lines shown above.

(In case it matters, this happens on Fedora 14 x86_64.) As you can see, the accented letters are messed up. Similar things happen with quotes and other non-ASCII characters. How do I fix this? Before I added the encode("utf-8") call it looked even worse (**), but something is still missing here. I have tried to figure out what, but I must say the relevant Python documentation isn’t simple or easy to find (or at least to recognize), so your feedback is very welcome. Thanks!


(**) This is why I believe that the problem is in the Python script itself, and should be fixed there, and not in the other script that creates the HTML page, but I may be wrong. Regardless of this, I want to understand better how encoding is handled in Python.
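
As a general illustration of how accented letters can end up garbled, the sketch below (Python 3) shows what happens when UTF-8 bytes are read as if they were ISO-8859-1. Whether this is the mechanism at work in the case described above is only an assumption; the HTML-generating script is not shown in this post.

```python
title = "Caffè, sigaretta, persino l'email"

# What a script produces when it encodes the title as UTF-8:
# the letter "è" becomes the two bytes 0xC3 0xA8.
utf8_bytes = title.encode("utf-8")

# What a reader displays if it assumes those bytes are ISO-8859-1:
# each of the two bytes is shown as a separate character.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# Decoding with the matching charset recovers the original text.
recovered = utf8_bytes.decode("utf-8")
print(recovered)
```

The point of the sketch is that the bytes themselves can be perfectly valid UTF-8 and still look wrong if any later step (another script, the web server, the browser) interprets them with a different charset.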

About marco

Author of the Digital Citizens Basics online course. Freelance writer and trainer specialized in digital rights issues

12 Responses to How to download RSS feeds with a simple script

  1. Grant Wagner says:

    I appreciate what you’re doing, but I’m a little confused as to why a proper RSS feed reader such as Gnome’s Liferea won’t suffice for you? The added benefit of a fully automated tracking makes it a lot easier to work with.

    • marco says:

      Hi Grant,

      Fair question! I explicitly wrote in the post that I wrote this script because (besides wishing to learn!) I do not want to depend on any Rss reader. The reason for that is being able to read all my Rss feeds in one window even when I’m away from my computer, without using Google Reader or similar services. I run this on the same VPS that handles my email.

  2. WorBlux says:

    I think it may be a font problem.

    Terminus looks like it supports a lot of those special characters.
    http://kmandla.wordpress.com/2009/08/24/noteworthy-linux-console-fonts/

    • marco says:

      But I see the “bad” characters in Firefox, not in a terminal. I’m not sure I see the link

  3. Pingback: Links 6/7/2011: AMD Gets More Linux Devs, AriOS 3.0 Released | Techrights

  4. David Kemp says:

    The problem is not with your feed reader, but that you’re creating ASCII HTML. You’ve not posted the script that you use to create the HTML, but try putting <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> (apologies if that gets mangled) in the HEAD of your HTML and see if you still get the problem.

    • marco says:

      Hi David,
      Thanks for the tip. I just checked and the HTML I generate has the header:

      <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

      changing the charset value to UTF-8 doesn’t change anything.

  5. Pingback: Linux News » How to download RSS feeds with a simple script

  6. Pingback: How to write your own pipe menu scripts | TechRepublic

  7. Hi,
    You could try using the feed’s own encoding, as detected by feedparser, to encode your strings instead of simply choosing UTF-8. You can use something like this:

    print feed_name + " | " + s.title.encode(d['encoding']) + " | " + unicode(s.link).encode(d['encoding']) + "\n"

    see you :-)

  8. Pingback: How and why to use KAlarm from the command line | TechRepublic