Official Google Reader Blog - News, Tips and Tricks from the Reader team

XML Errors in Feeds

12/23/2005 09:50:00 AM
Posted by Mihai Parparita, Software Engineer

Dealing with the millions of RSS and Atom feeds out there is hard work. We're not trying to make you feel sorry for the Reader team, but as anyone who has attempted to implement a feed parser knows, there are many subtle deviations from the spec that you have to handle if you want to have any hope of satisfying the needs of your users (who shouldn't have to care about such things).

The feed generating and parsing world has debated Postel's Law, as it applies to XML and feeds, several times. We are not here to weigh in on either side of the argument. Instead, we hope to provide some data so that such discussions can rest on more than philosophical grounds. Without further ado, here are the top XML errors that we have encountered when parsing all of the feeds that our users have added to Reader (and there are a lot of them):

% of errors  Error description
15.6%        Input claims to be UTF-8 but contains invalid characters
14.9%        Opening and ending tags mismatch
13.9%        An undefined entity is used (e.g. &nbsp; in an XML document without importing the HTML set)
7.8%         Document expected to begin with a start tag, but no < was found
5.7%         Disallowed control characters present
5.5%         Extra content at the end of the document
4.2%         Unterminated entity reference (missing semi-colon)
4.2%         Unquoted attribute value
3.8%         Premature end of data in tag (truncated feed)
3.3%         Naked ampersand (should be represented as &amp;)
2.1%         XML declaration allowed only at the start of the document
1.8%         Namespace prefix is used but not defined
0.75%        Comment not terminated
0.64%        Attribute without value
0.17%        Unescaped < not allowed in attribute values
0.11%        Malformed numerical entity reference
0.11%        Unsupported/invalid encoding
0.10%        Comment must not contain '--'
0.10%        Attribute defined more than once
0.07%        Char out of allowed range
0.03%        Comment not terminated
0.02%        Sequence ]]> not allowed in content

As a whole, about seven percent of all feeds that we know about have some of these errors (this data is based on a one-day snapshot, so transient errors may be present). Note that these are all XML errors, meaning that the feed is not well-formed. We are not talking about complying with and validating against the RSS or Atom specs - that is an even higher bar than we have set here. In general, our recommendation to feed producers is to use the work that the community has put into the feed validator.
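
To make "not well-formed" concrete, here is a minimal sketch of how a strict XML parser rejects feeds like the ones counted above, and how such failures could be tallied by error type. It uses Python's standard library (which wraps the expat parser), so the error messages are expat's, not the wording in our table. The function names and the sample feeds are made up for illustration; this is not how Reader itself parses feeds.

```python
from collections import Counter
import xml.etree.ElementTree as ET
from xml.parsers.expat import errors


def xml_error(feed_bytes):
    """Return None if feed_bytes is well-formed XML, else expat's error message."""
    try:
        ET.fromstring(feed_bytes)
    except ET.ParseError as err:
        # err.code is the numeric expat error code; map it to its message.
        return errors.messages.get(err.code, str(err))
    return None


def tally_errors(feeds):
    """Count well-formedness errors by message across an iterable of feed bodies."""
    counts = Counter()
    for body in feeds:
        message = xml_error(body)
        if message is not None:
            counts[message] += 1
    return counts


# Hypothetical sample feeds, one per kind of problem.
samples = [
    b"<rss><channel><title>ok</title></channel></rss>",  # well-formed
    b"<rss><channel><title>x</channel></rss>",           # mismatched tags
    b"<rss><title>a&nbsp;b</title></rss>",               # undefined entity
    b"<rss><title>Q&A</title></rss>",                    # naked ampersand
]

for message, count in tally_errors(samples).most_common():
    print(count, message)
```

Note that a strict parser stops at the first error, so a feed with several problems is counted only once in a tally like this.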

On a related note, we're aware that Reader has some issues with titles. It's great that there are test cases, and we will add this bug to our to-do list.