Parsing Atom with Erlang

2007-08-28T17:23:21Z

A simple program for parsing memes.atom. Below is an annotated version.

-module(memes).
-export([scan/0]).
-include_lib("xmerl/include/xmerl.hrl").

Define a module named memes that exports a single function named scan which takes zero parameters. Include the headers for xmerl, a library for processing XML.

memes_url() ->
  "http://planet.intertwingly.net/memes.atom".

Define a simple function that returns a constant string. Some people prefer to use macros for things like this.
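For comparison, the macro form might look like the sketch below. The macro name MEMES_URL and the module name memes_macro are illustrative choices, not part of the original program.

```erlang
-module(memes_macro).
-export([memes_url/0]).

%% Hypothetical macro name; the preprocessor substitutes the string
%% wherever ?MEMES_URL appears in this module.
-define(MEMES_URL, "http://planet.intertwingly.net/memes.atom").

%% Thin wrapper so that code outside the module can still obtain the URL.
memes_url() ->
  ?MEMES_URL.
```

A macro is only visible at compile time and inside the file (or header) that defines it, whereas the function version can be called from other modules and from the shell.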

scan() ->
  application:start(inets),
  { ok, {_Status, _Headers, Body }} = httpc:request(memes_url()),
  { Xml, _Rest } = xmerl_scan:string(Body),
  format_entries(xmerl_xpath:string("//entry",Xml)),
  init:stop().

Main program

Start inets. Erlang is all about loadable modules and long-running processes.
Fetch memes_url() over HTTP. The result is pattern matched against a tuple of length two, the first element of which must be the atom (think symbol or interned String) ok. The second element must itself be a tuple of length three, of which the first two elements are discarded and the final element is bound to the variable named Body. If the match fails, a badmatch error is raised instead of execution continuing with bad data. Erlang is designed to deal with components that fail, and each of these assertions is a part of that philosophy.
Parse the Body using xmerl_scan. The resulting structure is bound to a variable named Xml, and the remainder of the string is discarded.
Evaluate an XPath expression against the Xml using xmerl_xpath. The result is a list, which is passed directly to a function named format_entries/1.
Call init:stop/0 to shut down the runtime system gracefully, taking down all running applications and processes.
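The assertion style used in the fetch step can be seen in isolation with a small sketch. The module name match_demo, the functions body/1 and safe_body/1, and the sample tuples are all hypothetical; only the shape of the success tuple mirrors the one matched in scan/0.

```erlang
-module(match_demo).
-export([body/1, safe_body/1]).

%% Assert the shape of a response, as scan/0 does. Anything other than
%% {ok, {_, _, Body}} raises a badmatch error in the calling process.
body(Response) ->
  {ok, {_Status, _Headers, Body}} = Response,
  Body.

%% The same check written with case, for callers that would rather
%% handle the failure than crash.
safe_body(Response) ->
  case Response of
    {ok, {_Status, _Headers, Body}} -> {ok, Body};
    Other -> {error, Other}
  end.
```

Calling match_demo:body({ok, {status, [], "<feed/>"}}) returns "<feed/>", while match_demo:body({error, timeout}) fails with {badmatch, {error, timeout}}. The first form is the one the program above relies on: let the process fail and leave recovery to whatever is supervising it.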

format_entries([]) -> done;
format_entries([Node|Rest]) ->
  [ #xmlText{value=Title} ] = xmerl_xpath:string("title/text()", Node),
  [ #xmlAttribute{value=Link} ] = xmerl_xpath:string("link/@href", Node),
  Message = xmerl:export_simple_content([{a,[{href,Link}],[Title]}],xmerl_xml),
  io:format('~s~n', [xmerl_ucs:to_utf8(Message)]),
  format_entries(Rest).

In lieu of looping constructs, Erlang programs tend to use recursion and pattern matching.

When format_entries/1 is called with an empty list, done is returned.
Otherwise, when format_entries/1 is called with a non-empty list, the first node is bound to the variable Node, and the remainder of the list is bound to a variable named Rest, at which point:

Another XPath expression is used to extract the title. Assertions are made that the result is a list of length one and that the first and only item in that list is a record of type #xmlText; the field named value in that record is bound to a variable named Title.
Yet another XPath expression is used to extract the href attribute of the link element. Here the single result must be a record of type #xmlAttribute, whose value field is bound to the variable Link.
The Link and Title are combined to form an XHTML anchor element, which is exported into a string and bound to the variable Message.
The Message is then converted to UTF-8 (from a list of unbounded integers, each representing a Unicode character) and output using io:format.
The function format_entries/1 is again called, this time with the remainder of the list.
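The same steps can be tried without a network connection. The following is a minimal sketch, assuming a made-up two-entry feed in place of memes.atom; the module name entries_demo and the functions sample/0 and links/0,1 are hypothetical. It follows the same two-clause recursion as format_entries/1, but returns the extracted values rather than printing them.

```erlang
-module(entries_demo).
-export([links/0, links/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Made-up feed standing in for memes.atom (adjacent string literals
%% are concatenated by the compiler).
sample() ->
  "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
  "<entry><title>First</title><link href=\"http://example.com/1\"/></entry>"
  "<entry><title>Second</title><link href=\"http://example.com/2\"/></entry>"
  "</feed>".

%% Parse the sample and hand the list of entry nodes to links/1.
links() ->
  {Xml, _Rest} = xmerl_scan:string(sample()),
  links(xmerl_xpath:string("//entry", Xml)).

%% Empty list: nothing left to do. Non-empty list: handle the head,
%% then recurse on the tail, collecting {Title, Link} pairs.
links([]) -> [];
links([Node|Rest]) ->
  [#xmlText{value=Title}] = xmerl_xpath:string("title/text()", Node),
  [#xmlAttribute{value=Link}] = xmerl_xpath:string("link/@href", Node),
  [{Title, Link} | links(Rest)].
```

With this sample, entries_demo:links() should yield [{"First","http://example.com/1"},{"Second","http://example.com/2"}]. An entry with no title, or with more than one link, would trip the single-element list patterns and raise a badmatch.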

Clearly dumping XHTML fragments to stdout isn’t ideal (perhaps XHTML-IM instead?), and you wouldn’t want to dump every meme on every run, but those are problems for another day.