How to download RSS feeds with a simple script

Background

RSS is a wonderful way to get headlines of online news from many independent sources and browse them as quickly as possible, without subscribing to any website, giving away personal information or depending on a third-party website to aggregate everything for you.

To save time and avoid depending on any RSS reader, I have written two simple scripts. One downloads all the RSS feeds I want to read and saves them in a format suitable for further processing. The other reads that temporary file and generates one single HTML page with all the news titles and links. The format of the temporary file is very simple. It’s just plain text with three fields per line, separated by a “|” (pipe) character: feed name, article title and article URL. Here’s an example of that file:


  Repubblica|Pisani, difesa a oltranza "Non ho dato notizie a Iorio"|http://example.com/url_1.html
  Repubblica|Caffe, sigaretta, persino l'email cosi' la pausa diventa un privilegio|http://example.com/url_2.html
  Repubblica|L'ultimo trucco "ad aziendam" di Berlusconi il 'padrone' del paese corrompe la democrazia|http://example.com/url_3.html
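
To make the format concrete, here is a minimal Python 3 sketch of how a file like this could be read back and turned into an HTML list. It is not my actual second script: the function names parse_line and to_html are made up for this illustration, and the assumption that a title may itself contain a pipe is mine.

```python
import html


def parse_line(line):
    """Split one 'feed|title|url' line into its three fields.

    Assumption: the title may contain a pipe, so the feed name is taken
    from the left and the URL from the right.
    """
    feed, rest = line.rstrip("\n").split("|", 1)
    title, url = rest.rsplit("|", 1)
    return feed.strip(), title.strip(), url.strip()


def to_html(lines):
    """Build a simple HTML list with one linked title per input line."""
    items = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        feed, title, url = parse_line(line)
        items.append('<li>%s: <a href="%s">%s</a></li>'
                     % (html.escape(feed), html.escape(url), html.escape(title)))
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Escaping the fields with html.escape keeps quotes and ampersands in titles from breaking the generated page.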


The real problem is how to generate that file, that is, how to download, parse and reformat RSS from the command line. Here’s how I do it. It works almost perfectly, with one exception explained below, for which I ask for your help.

RSS downloader script

The simplest way I’ve found to download and parse RSS feeds is the Python feedparser module. Once it is installed, it only takes 15 lines of code to generate the list shown above:


   1 #! /usr/bin/python
   2
   3 import sys
   4 import feedparser
   5 import socket
   6
   7 timeout = 120                          # seconds
   8 socket.setdefaulttimeout(timeout)
   9
  10 feed_name = sys.argv[1]
  11 feed_url = sys.argv[2]
  12 d = feedparser.parse(feed_url)         # download and parse the feed
  13
  14 for s in d.entries:
  15    print feed_name + "|" + unicode(s.title).encode("utf-8") + "|" + unicode(s.link).encode("utf-8") + "\n"


The script takes as arguments the feed name and the RSS URL (lines 10 and 11). Line 12 is the one that actually downloads the feed and saves all its content in an object named “d”. The timeout in lines 7 and 8 keeps the script from freezing when some website is unreachable. The last two lines loop over each entry of the RSS object and print (together with the feed name) the title (s.title) and URL (s.link) of each entry. That’s it, really.
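
For readers on Python 3, where print is a function and unicode() no longer exists, the same idea might look like the sketch below. This is an adaptation, not the script I actually run. The helper name format_entries and the whitespace normalization of titles are assumptions added for illustration; feedparser is still the only third-party module needed.

```python
import socket
import sys


def format_entries(feed_name, entries):
    """Yield one 'feed|title|url' line per feed entry.

    Assumption: whitespace in titles is collapsed so that a title with an
    embedded newline cannot break the one-entry-per-line format.
    """
    for entry in entries:
        title = " ".join(entry.get("title", "").split())
        link = entry.get("link", "").strip()
        yield feed_name + "|" + title + "|" + link


def main(argv):
    import feedparser  # third-party module, imported only when really needed

    socket.setdefaulttimeout(120)  # do not hang on unreachable websites
    feed_name, feed_url = argv[1], argv[2]
    d = feedparser.parse(feed_url)
    for line in format_entries(feed_name, d.entries):
        print(line)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Keeping the formatting in its own function makes it possible to try it on plain dictionaries without downloading anything.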

One little problem: encoding

As I said, the script works almost perfectly as is, and I hope you’ll find it useful. The only problem I haven’t solved yet is how to handle non-ASCII characters in URLs and, especially, news titles. As an example of what I mean, here’s what I get when I convert the three lines shown above to HTML.

[Screenshot "python_encodingproblems": the three titles rendered in HTML, with garbled accented letters]

(In case it matters, this happens on Fedora 14 x86_64.) As you can see, the accented letters are messed up. Similar things happen with quotes and other non-ASCII characters. How do I fix this? Before I added the encode("utf-8") call it looked even worse (**), but something is still missing here. I have tried to figure out what, but I must say the relevant Python documentation isn’t simple, or easy to find (or at least to recognize), so your feedback is very welcome. Thanks!

(**) This is why I believe that the problem lies, and should be fixed, in the Python script itself and not in the other script that creates the HTML page, but I may be wrong. Regardless of this, I want to understand better how encoding is handled in Python.
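
One possible explanation, offered only as a guess, is that the UTF-8 bytes are fine but the HTML page is displayed as if it were Latin-1, because nothing declares its encoding. The Python 3 sketch below reproduces that kind of garbling with a made-up title and shows one way to write a page that declares UTF-8. The names wrap_page and write_page are hypothetical.

```python
title = "perch\u00e9"                   # "perché", a made-up sample title
utf8_bytes = title.encode("utf-8")      # what the downloader writes out
misread = utf8_bytes.decode("latin-1")  # what a viewer assuming Latin-1 shows
# misread is now "perchÃ©": one accented letter became two wrong characters


def wrap_page(body):
    """Wrap an HTML fragment in a page that declares its encoding."""
    return ('<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head>\n'
            "<body>\n" + body + "\n</body></html>\n")


def write_page(path, body):
    """Write the page as UTF-8, matching the declared charset."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(wrap_page(body))
```

If the garbled characters on your page look like the value of misread, the fix probably belongs in the HTML generation step rather than in the downloader.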

How to post content to a WordPress blog from the command line

WordPress is a great publishing system, but managing it manually can be a very time-consuming process. This is especially true when you want to upload lots of posts, or if you would like to write content in your preferred, full-blown text editor and then have it “magically” appear online.

WordPress takes care of these needs by allowing remote posting via email or via the WordPress XML-RPC interface (if you enable the WordPress, Movable Type, MetaWeblog and Blogger XML-RPC checkbox under Settings > Writing > Remote Publishing). The first method is explained here and in other places, but requires setting up a dedicated email account. For several reasons I preferred not to do it that way, so I looked at the other system.
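
To give an idea of what travels over that XML-RPC interface, here is a small Python 3 sketch that builds a metaWeblog.newPost request with the standard xmlrpc.client module. It is not one of the tools discussed on this page. The function name new_post_request is made up, and the blog id "1" is a placeholder assumption.

```python
import xmlrpc.client


def new_post_request(username, password, title, html_body, category):
    """Return the XML payload of a metaWeblog.newPost call."""
    post = {
        "title": title,
        "description": html_body,   # MetaWeblog calls the post body "description"
        "categories": [category],
    }
    # Parameters: blog id (placeholder), user name, password, post, publish flag
    params = ("1", username, password, post, True)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")
```

To actually send such a request, xmlrpc.client.ServerProxy(url).metaWeblog.newPost(...) takes the same five parameters, where url is the address of the blog's xmlrpc.php.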

PHP scripts for this purpose are here and here. The first uses the cURL capabilities of PHP to send the data over SSL. The second uses IXR, the Incutio XML-RPC Library for PHP, to “incorporate both client and server classes, as it is designed to hide as much of the workings of XML-RPC from the user as possible”.

Both of those scripts work, and the second can also be used to edit existing posts or get lists of the latest published articles. However, I was looking for something that didn’t require PHP (personal preference, really). Eventually, I found the WordPress-CLI utility by Leo Charre and have already used it successfully to upload hundreds of posts to several of my WordPress websites (see bottom of this page for examples). Here’s how I did it.

WordPress-CLI installs like any other Perl module; see the instructions in the README file. In order for it to work, however, I also had to download and install from CPAN the Perl modules Getopt-Std-Strict-1.01 and LEOCHARRE-Debug-1.03.

Once everything is installed, you’ll have in your path a script called wordpress-upload-post. Run it at the command prompt or in a script, giving as options the name of the HTML file containing the post, plus its required publication date, title and category, as well as your user name and password, and you’ll have your post online.

To make things faster, I use wordpress-upload-post inside this script:


  #! /bin/bash
  # usage: post2wp.sh postfile blogname
  # postfile: text file containing post content in txt2tags format
  # blogname: name of blog

  POST=$1
  BLOGNAME=$2

  HTML=/tmp/tmp_wordpress_post.html
  ACCOUNTS_DIR="$HOME/.blog_accounts"

  #########################################################################
  # extract title, category and publication date of the post
  TITLE=`grep '%TITLE: ' "$POST" | cut -c9-`
  CATEGORY=`grep '%CATEGORY: ' "$POST" | cut -c12-`
  DATE=`grep '%DATE: ' "$POST" | cut -c8-`

  # DATE is YYYYMMDDHHMM; rearrange it as "HH:MM YYYY/MM/DD"
  YEAR=`echo $DATE | cut -c1-4`
  MONTH=`echo $DATE | cut -c5-6`
  DAY=`echo $DATE | cut -c7-8`
  HOUR=`echo $DATE | cut -c9-10`
  MIN=`echo $DATE | cut -c11-12`
  DATE="$HOUR:$MIN $YEAR/$MONTH/$DAY"

  rm -f "$HTML".tmp*
  txt2tags -t xhtml --no-headers -i "$POST" -o "$HTML"

  ###########################################################################
  # source blog parameters: user name, password, XML_RPC url

  if [ -e "$ACCOUNTS_DIR/$BLOGNAME" ]
  then
    source "$ACCOUNTS_DIR/$BLOGNAME"
  else
    echo "Error! $BLOGNAME account file doesn't exist!"
    exit 1
  fi

  ###########################################################################
  # upload post to blog
  # note: a title, category or password containing a single quote breaks
  # the quoting below
  WP_OPTS="-D '$DATE' -t '$TITLE'  -c '$CATEGORY'  -u $USER -p '$PW' -x $XMLRPC_URL"
  WP_CMD="wordpress-upload-post $WP_OPTS $HTML"
  eval "$WP_CMD"
  exit


I use txt2tags as the source format for most things I publish online. It is a very simple ASCII markup format, which I customize by adding comments like these to each article to be published on WordPress:


  %TITLE: Is it OK for a School or Charity to accept software donations?
  %CATEGORY: Digiworld
  %DATE: 200707010900


The script extracts these variables from the source text and reformats them in the way required by wordpress-upload-post.
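
For comparison, the same extraction and date reformatting can be sketched in Python 3. This is only an illustration, not part of my script. The function names read_metadata and upload_date are made up, and unlike the grep calls it only matches markers at the very start of a line.

```python
from datetime import datetime


def read_metadata(lines):
    """Collect the %TITLE:, %CATEGORY: and %DATE: comment lines of a post."""
    meta = {}
    for line in lines:
        for key in ("TITLE", "CATEGORY", "DATE"):
            prefix = "%" + key + ": "
            if line.startswith(prefix):
                meta[key] = line[len(prefix):].strip()
    return meta


def upload_date(stamp):
    """Turn 'YYYYMMDDHHMM' into the 'HH:MM YYYY/MM/DD' form used above."""
    return datetime.strptime(stamp, "%Y%m%d%H%M").strftime("%H:%M %Y/%m/%d")


sample = [
    "%TITLE: Is it OK for a School or Charity to accept software donations?",
    "%CATEGORY: Digiworld",
    "%DATE: 200707010900",
]
meta = read_metadata(sample)
print(upload_date(meta["DATE"]))  # prints: 09:00 2007/07/01
```

A side benefit of strptime is that a malformed date raises an error instead of silently producing a wrong publication date.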

Next, it converts the article from the ASCII markup to HTML by calling the txt2tags Python script. The password, user name and URL of the WordPress blog where the HTML post will be uploaded are in a separate file ($ACCOUNTS_DIR/$BLOGNAME) that has this format:


  USER='your user name'
  PW='your password'
  XMLRPC_URL='http://your.blog.home.page/xmlrpc.php'


and is read right after generating the HTML version of the post. The last part of the script, “upload post to blog”, concatenates all the parameters in one option string WP_OPTS, builds the publishing command WP_CMD and evaluates it, thus publishing your post online. Enjoy! Of course, if you only post once in a while you won’t save much time, but if you ever need, as I did, to post lots of stuff, try this!
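
If titles with apostrophes ever become a problem with the eval approach, one alternative is to pass the arguments as a list, so that no shell quoting is involved. The Python 3 sketch below illustrates the idea under that assumption. The function names read_account and build_command are made up, while the options are the same ones used in the script above.

```python
import shlex


def read_account(text):
    """Parse the KEY='value' lines of an account file into a dict."""
    account = {}
    for line in text.splitlines():
        # shlex handles the shell-style quotes and skips # comments
        for token in shlex.split(line, comments=True):
            key, sep, value = token.partition("=")
            if sep:
                account[key] = value
    return account


def build_command(html_file, date, title, category, account):
    """Return the wordpress-upload-post command as an argument list."""
    return [
        "wordpress-upload-post",
        "-D", date,
        "-t", title,
        "-c", category,
        "-u", account["USER"],
        "-p", account["PW"],
        "-x", account["XMLRPC_URL"],
        html_file,
    ]
```

The resulting list can be handed to subprocess.run(cmd, check=True), which starts the program directly without going through a shell.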

What’s missing

wordpress-upload-post and this script work great (see bottom of page), but they aren’t perfect. The biggest limitation of WordPress-CLI right now is that you can’t specify WordPress tags or, if you have a multilingual blog, the language of the current post. I’d also like to use it to add comments to existing posts, but that’s not really essential. I discussed these things with Leo. His answer is that “some of these things simply won’t work. For example- adding tags- because xmlrpc.php does not implement a call to add a tag. I’ve made some hacks to be able to do so- but this works on a local/server level. [also] I can’t recall right now if they have a comment call”. Leo also asked me to forward this invitation to all readers of this page:

  I'd be open to actually share in maintenance of things like WordPress::XMLRPC and a WordPress::CLI revised version. If you want to do this level of changes/additions, we could set up a branch off some cvs server... and... implement bugzilla on some server- to keep track of changes and todos.

Another thing I’d like to figure out is a way to upload images to WordPress so that they are associated with the post and get thumbnails. If that were possible, it would be easy to figure out in advance the right URLs for both the images and their thumbnails, and add them to the txt2tags source. As it is today, in the rare cases where I need to upload a post with images using this script, I add them by hand afterwards. Suggestions?

How I’ve used this script

If you are curious about what real posts written in txt2tags format, then converted to HTML and automatically uploaded to a WordPress blog, look like, and don’t mind a bit of self-promotion, have a look at the following links. I have already used the script above to: