Custom newsfeed with web scraping
· 5 min ·
TL;DR
My flat-file CMS (Yellow) generates its RSS feed directly from the Markdown source files, so anything added by layout templates never reaches the feed, and the default feed URL was unpleasantly long. To work around this, I built a dedicated page layout that renders every feed article with all the data an RSS 2.0 feed needs. A PHP script then fetches that page with cURL, parses it with DOMDocument and DOMXPath, and transforms the extracted elements into an RSS XML document. A simple file cache keeps the scraping frequency low. The result is a personalized, aesthetically pleasing feed, and a nice demonstration of web scraping as a tool for content customization.
The problem
My content management system (Yellow) is a flat-file CMS: it saves all content in Markdown files, from which it also generates the newsfeed. Unfortunately, this means that content I add through the layout files (e.g. the "comment by e-mail" link or the date of the last edit) never makes it into the feed. Apart from that, I could not reconcile the feed URL (/feed/page:feed.xml), which in my eyes is far too long, with my aesthetic conscience. Since these things are important to me, I had to find another solution.
The solution
As I'm not an experienced programmer, I quickly discarded the idea of hacking the feed generation itself and instead remembered an older project of mine, in which I had created a newsfeed for a site without RSS support by using web scraping.
The preparation
Yellow creates a page[^1] with the yellow-feed extension on which all articles for the feed are displayed. By default, only the titles are listed there as links in a <ul>. I therefore had to redesign this page so that all the data required for an RSS 2.0 feed is displayed, and I added classes to these elements, which makes the subsequent web scraping much easier. Now that the data was available, it was time to turn the HTML output into an rss.xml that common feed readers accept.
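Based on the class names the scraping script queries later (feedDate, feedTitle, feedLink, articleContent), each article on the feed page ends up rendered roughly like this; the surrounding markup is a sketch, not my actual layout file:

```html
<article>
    <h2 class="feedTitle"><a class="feedLink" href="https://frittiert.es/example-article">Example article</a></h2>
    <p class="feedDate">Tue, 02 Jan 2024 10:00:00 +0000</p>
    <div class="articleContent">
        <p>The full article body, including the layout-added details.</p>
    </div>
</article>
```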
The implementation
So I wrote a PHP script that scrapes the "https://frittiert.es/feed/" page and generates the custom RSS feed for my website from the content of certain HTML elements.
Fetching and Parsing HTML
I use cURL to fetch the content from the specified feed URL and then utilize the DOMDocument and DOMXPath classes to extract relevant information.
$feedUrl = "https://frittiert.es/feed/";

// Fetch the rendered feed page
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $feedUrl);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
curl_close($ch);
if ($data === false) {
    die("Error fetching feed page");
}

// Parse the HTML; @ suppresses warnings about imperfect markup
$dom = new DOMDocument();
$loadSuccess = @$dom->loadHTML($data);
if (!$loadSuccess) {
    die("Error loading HTML document");
}
XPath Queries
XPath queries are employed to select specific elements from the HTML structure, such as article dates, links, titles, and content.
$xpath = new DOMXPath($dom);
$newsfeedDate = $xpath->query("//article/p[@class='feedDate']");
$newsfeedLink = $xpath->query("//article/h2[@class='feedTitle']/a[@class='feedLink']/@href");
$newsfeedTitle = $xpath->query("//article/h2[@class='feedTitle']");
$newsfeedContent = $xpath->query("//article/div[@class='articleContent']");
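As a quick sanity check, the same XPath expressions can be run against a small hand-written snippet instead of the live page (the snippet is illustrative, not my actual markup):

```php
<?php
// Sanity check: run the same XPath expressions against a tiny
// hand-written snippet instead of the live page.
$html = '<article>'
      . '<p class="feedDate">Mon, 01 Jan 2024 12:00:00 +0000</p>'
      . '<h2 class="feedTitle"><a class="feedLink" href="https://frittiert.es/example">Hello</a></h2>'
      . '<div class="articleContent"><p>Body</p></div>'
      . '</article>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$titles = $xpath->query("//article/h2[@class='feedTitle']");
$links  = $xpath->query("//article/h2[@class='feedTitle']/a[@class='feedLink']/@href");

echo $titles->item(0)->nodeValue . "\n"; // Hello
echo $links->item(0)->nodeValue . "\n";  // https://frittiert.es/example
```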
Creating the RSS Feed
The script then constructs an XML document representing an RSS feed using DOMDocument.
$xml = new DOMDocument("1.0", "utf-8");
$xml->formatOutput = true;
$domStyle = $xml->createProcessingInstruction("xml-stylesheet", "href='/pretty-feed-v3.xsl' type='text/xsl'");
$xml->appendChild($domStyle);
$rss = $xml->createElement("rss");
$rss->setAttribute("version", "2.0");
$rss->setAttribute("xmlns:atom", "http://www.w3.org/2005/Atom");
$rss->setAttribute("xmlns:dc", "http://purl.org/dc/elements/1.1/");
$xml->appendChild($rss);
$channel = $xml->createElement("channel");
$rss->appendChild($channel);
// Channel metadata
$head = $xml->createElement("title", "frittiert.es");
$channel->appendChild($head);
$head = $xml->createElement("description", "It's all about the web here, at least that's what I'm trying to do. From opinions and practical guides to development projects and web technologies.");
$channel->appendChild($head);
$head = $xml->createElement("language", "en");
$channel->appendChild($head);
$head = $xml->createElement("atom:link");
$head->setAttribute("href", "https://frittiert.es/rss.xml");
$head->setAttribute("rel", "self");
$head->setAttribute("type", "application/rss+xml");
$channel->appendChild($head);
$head = $xml->createElement("link", "https://frittiert.es");
$channel->appendChild($head);
// lastBuildDate must be an RFC 822 date
$build_date = gmdate(DATE_RSS);
$head = $xml->createElement("lastBuildDate", $build_date);
$channel->appendChild($head);
The fetched data is looped through, and individual items (posts) are added to the RSS feed.
for ($i = 0; $i < $newsfeedTitle->length; $i++) {
    $item = $xml->createElement('item');
    $channel->appendChild($item);

    $data = $xml->createElement('title', $newsfeedTitle->item($i)->nodeValue);
    $item->appendChild($data);

    // The article body goes into a CDATA section so the HTML survives as-is
    $data = $xml->createElement('description');
    $cdata = $xml->createCDATASection($dom->saveHTML($newsfeedContent->item($i)));
    $data->appendChild($cdata);
    $item->appendChild($data);

    $data = $xml->createElement('link', $newsfeedLink->item($i)->nodeValue);
    $item->appendChild($data);

    $data = $xml->createElement('pubDate', $newsfeedDate->item($i)->nodeValue);
    $item->appendChild($data);

    $data = $xml->createElement('guid', $newsfeedLink->item($i)->nodeValue);
    $item->appendChild($data);
}
Caching
To improve performance, I implement caching. If a cached XML file exists and is not expired, I send it directly to the client.
$cacheFile = 'rss.xml';
$cacheTime = 1200; // Cache for 20 minutes

if (file_exists($cacheFile) && time() - filemtime($cacheFile) < $cacheTime) {
    header('Content-Type: application/rss+xml');
    readfile($cacheFile);
    exit();
}
If the cache is not valid or does not exist, the script proceeds with the web scraping process, generates a new XML file, and saves it for future use.
// Save the newly generated XML file in the cache
$xml->save($cacheFile);
// Send the XML file to the client
header('Content-Type: application/rss+xml');
readfile($cacheFile);
By employing caching, the script reduces the frequency of web scraping, improving the overall efficiency of the RSS feed generation.
The summary
In this PHP script, I use web scraping on "https://frittiert.es/feed/" to tailor the RSS feed to my preferences. With cURL, DOMDocument, and DOMXPath, I extract the dates, links, titles, and content from the HTML structure and assemble them into an RSS 2.0 document, while a simple caching mechanism keeps the scraping frequency low. This approach not only gives the feed a personalized presentation at a URL I can live with, it also shows how effective web scraping can be as a tool for content customization.