Okay, first thing: big ups to @IconFactory@iconfactory.world for releasing the first beta of Tapestry. This promises to improve my quality of life and I’m thrilled to be a supporter. As requested, I shall not be posting screenshots.
After installing the beta, I immediately fired up Vienna on my laptop and started adding all the feeds I check. Super awesome. This is great. That got me thinking about feeds I’ve stopped following, most notably those for the Washington Post. Some time ago, I forget exactly when, the Post did some kind of website redesign that meant that most of their stories were getting “published” multiple times per day. I noticed this when I opened up Vienna one morning and, instead of the normal daily 20 to 30 stories, there were well over 100; then, later that day, another 100, all the same stories.
What this says to me is that the RSS specification doesn’t require a unique identifier for the items in the feed (and, lo!, this is true: the guid element is an optional part of an item). Even if it did, the Post would probably generate a new unique ID every time it “republished” a story, since the point of republishing is, presumably, to make their garbage JS website reload the data. The story I tell myself about this is that some genius product marketer wanted to have the website look like a physical newspaper, but with flashing, shifting images and a self-updating scroll of “breaking news”. Oh, and the link that each item had, the part that’s supposed to tell a browser where to go and get the whole story as opposed to the two-line summary that’s part of the RSS feed, was just the link to the Post’s home page. No deep linking at all anymore.
This got me thinking about how to solve this problem so I might be able to add the Post to my timeline in Tapestry. First, of course, would be figuring out how to deep link again. Assuming I could solve that problem, what else would need to happen? Well, I’d want to construct a local database to keep track of “seen” items, so that if the same item came up again, it’d only get added to the timeline if it were “new”. And maybe the “seen” property would have an expiration, so that if there really were some kind of update (e.g.: no, wait, that tornado we reported on yesterday turns out to have actually been aliens) then it could get through.
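For what it’s worth, here is roughly what I have in mind for the “seen” database, as a minimal Python sketch and not anything I’ve actually built. Everything in it is hypothetical: the names `item_key` and `SeenCache` are made up, the store is just an in-memory dict rather than a real database, and I’m assuming the headline is the only thing stable enough to use as a key, since the guid and the link can’t be trusted here.

```python
import hashlib
import time


def item_key(title: str) -> str:
    """Fingerprint an item by its normalized headline.

    Assumption: guid and link are useless for this feed, so the title is
    the only stable thing to key on.
    """
    normalized = " ".join(title.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenCache:
    """Remembers items for ttl_seconds; after that they count as new again."""

    def __init__(self, ttl_seconds: float, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry can be tested
        self._seen = {}     # key -> time the item was first seen

    def is_new(self, title: str) -> bool:
        now = self.clock()
        # Drop entries whose "seen" status has expired.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl}
        key = item_key(title)
        if key in self._seen:
            return False  # a repeat inside the window: keep it off the timeline
        self._seen[key] = now
        return True
```

An adapter would call `is_new(...)` for each item in the feed and pass along only the ones that return `True`. The expiration is the knob: set it short and the duplicates leak back in, set it long and a genuine update under the same headline has to wait.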
This, of course, is a recurring problem in stream processing. At some point, the volume of data coming over the stream is so great that the local database of “seen” items overflows, the oldest entries get dropped, and “old” items coming in are treated as “new”. The mechanisms differ for specific cases, but it’s really an unsolvable problem when at least one of the publishers doesn’t care about anyone else. And this is my periodic reminder that, even though our household pays money to subscribe to the Washington Post, it is clear that the Post’s tech team does not care about anyone else. So no, I shan’t be writing a custom adapter to read the Post’s feeds, de-dupe them, and provide them to Tapestry. And this is because I love myself too much to make myself work on that.