Tue Dec 27 2022
Every website should have an XML sitemap that lists all of its pages for search engines like Google. However, there are other sitemaps and feeds that should be created so that blog pages and articles can appear in other places. For example, an RSS feed can be submitted to Google Publisher Center so these pages can appear on Google News.
This guide will show you how to create 3 different XML feeds using PHP.
The PHP classes that create these feeds are heavily abstracted, and there are other approaches that could be used here. But these classes give a good example of why using built-in classes such as DOMDocument can remove a lot of complexity when building XML, and how you can use closures effectively to chain methods.
See the complete tutorial files in the PHP XML sitemaps and feed example repository
The base XML service class is an abstract class responsible for creating the DOM document instance and has methods to create the root node, child nodes, and add attributes to nodes. It contains the common logic across the 3 XML classes that can be overridden if needed.
The first thing to do here is set up the DOM document instance. This can be done in the constructor, which makes it available to all child classes through the dom property.
We add a createEntry abstract method which each child class implements. This method is responsible for transforming a page object into an XML entry and appending it to a parent node.
use DOMDocument;
use DOMNode;

abstract class XMLService
{
    protected $dom;

    public function __construct()
    {
        // Shared DOM document used by every child class
        $this->dom = new DOMDocument('1.0', "UTF-8");
        $this->dom->formatOutput = true;
    }

    abstract public function createEntry(DOMNode $parent, object $page): void;
}
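As a minimal standalone sketch of what the constructor sets up, the snippet below builds a tiny document directly with DOMDocument. The tag names are placeholders chosen for illustration and the code is not part of the tutorial classes. It also shows that saveXML returns the document with the XML declaration at the top, which is where the version and encoding passed to the constructor end up.

```php
<?php

// Standalone illustration; 'urlset' and 'url' are placeholder tag names.
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->formatOutput = true; // indent nested nodes when serializing

$root = $dom->createElement('urlset');
$root->appendChild($dom->createElement('url'));
$dom->appendChild($root);

// saveXML() returns the whole document as a string,
// beginning with the XML declaration.
$xml = $dom->saveXML();
echo $xml;
```

Without the final appendChild call on the document itself, the root node would exist but would not be part of the serialized output.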
This class contains two public methods.
The setPages method sets the array of page data objects from which the entry values are read.
The getXML method is where all the logic runs to create the XML data and return it. Here the array of page objects is looped through one by one, the XML data created for each, and this XML data is appended to the root DOM node. The root node is then appended to the DOMDocument and converted to a string and returned as the final rendered XML through the saveXML method.
abstract class XMLService
{
    protected $root;

    protected $pages = [];

    public function setPages(array $pages): static
    {
        $this->pages = $pages;

        return $this;
    }

    public function getXML(): string
    {
        // Append one entry per page to the root node
        array_map(
            fn (object $page) => $this->createEntry($this->root, $page),
            $this->pages
        );

        $this->dom->appendChild($this->root);

        return $this->dom->saveXML();
    }
}
The final part of this class is 3 protected methods used by child classes to simplify the process of creating DOM nodes and adding attributes. These are essentially wrappers around the DOMDocument methods.
The root method simply creates the root DOMNode and appends attributes to it.
The element method creates a DOMNode with an optional value. The callback here allows the DOMNode to be manipulated, for example by adding attributes and child nodes, without breaking method chaining.
The attributes method creates each attribute with createAttribute and attaches it to the given node.
use Closure;
use DOMNode;

abstract class XMLService
{
    protected function root(string $tag, array $attributes): DOMNode
    {
        $this->root = $this->dom->createElement($tag);

        return $this->attributes($this->root, $attributes);
    }

    // The callback is explicitly nullable; an implicit "Closure $callback = null" is deprecated in newer PHP versions
    protected function element(DOMNode $parent, string $name, string $value, ?Closure $callback = null): static
    {
        $parent->appendChild(
            $node = $this->dom->createElement($name, htmlspecialchars($value))
        );

        if ($callback) {
            // Hand the new node to the callback so attributes and children can be added
            $callback($node);
        }

        return $this;
    }

    protected function attributes(DOMNode $node, array $attributes): DOMNode
    {
        foreach ($attributes as $name => $value) {
            $attribute = $this->dom->createAttribute($name);
            $attribute->value = $value;
            $node->appendChild($attribute);
        }

        return $node;
    }
}
It is important to note that the value is passed to htmlspecialchars in the element method. DOMDocument::createElement does not escape the value it is given, so characters such as an unescaped & can break the rendering of the document.
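The escaping point can be sketched in isolation. The snippet below is a standalone illustration, not part of the tutorial classes, and the title value is a made-up example. It shows the htmlspecialchars approach used by the element method next to an alternative that appends a text node, which DOM escapes on its own.

```php
<?php

$dom = new DOMDocument('1.0', 'UTF-8');
$title = 'Tom & Jerry'; // hypothetical value containing a reserved character

// Option 1: escape first, as the element method does.
$escaped = $dom->createElement('title', htmlspecialchars($title));

// Option 2: create an empty element and append a text node,
// letting DOM handle the escaping when the node is serialized.
$viaTextNode = $dom->createElement('title');
$viaTextNode->appendChild($dom->createTextNode($title));

// Both serialize the ampersand as an entity.
$a = $dom->saveXML($escaped);
$b = $dom->saveXML($viaTextNode);

echo $a, PHP_EOL; // <title>Tom &amp; Jerry</title>
echo $b, PHP_EOL; // <title>Tom &amp; Jerry</title>
```

Either option keeps the document valid. The tutorial classes use the first one because it keeps the element method to a single createElement call.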
Most of the logic required to create the XML nodes is contained in the parent XML service class and in turn the code required to generate the XML sitemap page is minimal.
The getXML method in the SitemapService simply sets the root node to a urlset tag with a couple of attributes.
The createEntry method here creates a url node representing the page and appends the page URL, last modified date, change frequency, and priority as child elements. Other data like the page images can be added if needed.
You can see a closure being passed to the first element call. It receives the newly created url node, which allows child nodes to be added to it.
use DOMNode;

class SitemapService extends XMLService
{
    public function getXML(): string
    {
        $this->root('urlset', [
            'xmlns' => 'http://www.sitemaps.org/schemas/sitemap/0.9',
            'xmlns:image' => 'http://www.google.com/schemas/sitemap-image/1.1'
        ]);

        return parent::getXML();
    }

    public function createEntry(DOMNode $node, object $page): void
    {
        $this
            ->element($node, 'url', '', fn($node) => $this
                ->element($node, 'loc', $page->absolute_url)
                ->element($node, 'lastmod', date('c', strtotime($page->updated_at)))
                ->element($node, 'changefreq', $page->change_frequency)
                ->element($node, 'priority', number_format($page->priority, 1))
            );
    }
}
Example XML output for the Google Search Console page sitemap:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://php.fyi/</loc>
    <lastmod>2022-12-26T15:25:37+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
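The sitemap declares the image namespace on its root node, so page images could be added to each url entry. The following is only a rough standalone sketch of that idea using plain DOMDocument calls. The image_url property is an assumed name that does not appear in the sitemap page data, and the image:image and image:loc tags come from the Google image sitemap extension.

```php
<?php

// Hypothetical page object; image_url is an assumed property name.
$page = (object) [
    'absolute_url' => 'https://php.fyi/',
    'image_url' => 'https://php.fyi/img/articles/php-pagination/summary.jpg',
];

$dom = new DOMDocument('1.0', 'UTF-8');

// Build a single url entry with a nested image entry.
$url = $dom->createElement('url');
$url->appendChild(
    $dom->createElement('loc', htmlspecialchars($page->absolute_url))
);

$image = $url->appendChild($dom->createElement('image:image'));
$image->appendChild(
    $dom->createElement('image:loc', htmlspecialchars($page->image_url))
);

// Serialize just this entry to inspect it.
$xml = $dom->saveXML($url);
echo $xml, PHP_EOL;
```

Inside createEntry the same structure would presumably be one more nested element call with a closure, in the same style as the url node itself.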
The SitemapNewsService is very similar to the SitemapService. The only real difference is there is more data in the page entry and more complex nesting. Again the parent XMLService takes care of most of the complex logic.
use DOMNode;

class SitemapNewsService extends XMLService
{
    public function getXML(): string
    {
        $this->root('urlset', [
            'xmlns' => 'http://www.sitemaps.org/schemas/sitemap/0.9',
            'xmlns:news' => 'http://www.google.com/schemas/sitemap-news/0.9'
        ]);

        return parent::getXML();
    }

    public function createEntry(DOMNode $node, object $page): void
    {
        $this
            ->element($node, 'url', '', fn($node) => $this
                ->element($node, 'loc', $page->absolute_url)
                ->element($node, 'news:news', '', fn($node) => $this
                    ->element($node, 'news:publication', '', fn($node) => $this
                        ->element($node, 'news:name', $page->website)
                        ->element($node, 'news:language', $page->language)
                    )
                    ->element($node, 'news:publication_date', date('c', strtotime($page->published_at)))
                    ->element($node, 'news:title', $page->meta_title)
                )
            );
    }
}
Example XML output for the Google Search Console news sitemap:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://php.fyi/articles/php-pagination</loc>
    <news:news>
      <news:publication>
        <news:name>PHP.FYI</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2016-05-18T12:00:00+00:00</news:publication_date>
      <news:title>How to create PHP Pagination</news:title>
      <news:keywords/>
    </news:news>
  </url>
</urlset>
The Google Publisher Center XML feed is much more complex in terms of data and nesting than the previous XML feeds.
The getXML method here sets the root rss node with the required attributes. Rather than appending the page items to this root node as in the previous sitemaps, a channel node is created here with various nodes and attributes containing data about the website.
The DOMCdataSection class is used to wrap data in CDATA tags so these values can be parsed correctly.
The default looping contained in the parent getXML is not used here. It is overridden because the page nodes are appended to the channel node rather than to the root rss node.
use DOMCdataSection;

class NewsRssService extends XMLService
{
    public function getXML(): string
    {
        $this->root('rss', [
            'version' => '2.0',
            'xmlns:atom' => 'http://www.w3.org/2005/Atom',
            'xmlns:content' => 'http://purl.org/rss/1.0/modules/content/',
            'xmlns:media' => 'http://search.yahoo.com/mrss/'
        ]);

        $this
            ->element($this->root, 'channel', '', fn($node) => $this
                ->element($node, 'title', 'Articles')
                ->element($node, 'link', '', fn($node) =>
                    $node->appendChild(new DOMCdataSection('https://php.fyi'))
                )
                ->element($node, 'description', 'Engineering and marketing articles')
                ->element($node, 'language', 'en')
                ->element($node, 'atom:link', '', fn($node) =>
                    $this->attributes($node, [
                        'href' => 'https://api.phpfyi.local/rss/news',
                        'rel' => "self",
                        'type' => "application/rss+xml"
                    ])
                )
            );

        // The channel node is the first child of the root rss node
        array_map(
            fn (object $page) => $this->createEntry($this->root->firstChild, $page),
            $this->pages
        );

        $this->dom->appendChild($this->root);

        return $this->dom->saveXML();
    }
}
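To see what DOMCdataSection does on its own, here is a small standalone sketch. The HTML string is a made-up example value. The markup and the ampersand are written out literally inside the CDATA section instead of being converted to entities.

```php
<?php

$dom = new DOMDocument('1.0', 'UTF-8');
$html = '<p>Fish & chips</p>'; // hypothetical article HTML

$node = $dom->createElement('content:encoded');

// A CDATA section keeps markup and ampersands as literal text.
$node->appendChild(new DOMCdataSection($html));

$xml = $dom->saveXML($node);
echo $xml, PHP_EOL;
// <content:encoded><![CDATA[<p>Fish & chips</p>]]></content:encoded>
```

This is why the element call for such nodes passes an empty string as the value and adds the real content through the callback.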
The createEntry method is again more complex than in the previous sitemaps.
It's important to note the guid value here. This is used as the primary identifier by Google for each entry in the feed. Once this value is set it should not be changed unless you unpublish and resubmit your feed.
PHP contains a handy DATE_RSS format that can be passed to the date function to output the correct format for a date in an RSS feed.
The DOMCdataSection class is used to wrap the page HTML in CDATA tags so that it won't break the XML feed.
The enclosure image here is what is used to specify the image that will appear in Google News with your article.
use DOMCdataSection;
use DOMNode;

class NewsRssService extends XMLService
{
    public function createEntry(DOMNode $node, object $page): void
    {
        $this
            ->element($node, 'item', '', fn($node) => $this
                ->element($node, 'guid', $page->slug, fn($node) =>
                    $this->attributes($node, [ 'isPermaLink' => 'false'])
                )
                ->element($node, 'title', $page->meta_title)
                ->element($node, 'link', $page->absolute_url)
                ->element($node, 'author', $page->author)
                ->element($node, 'category', $page->category)
                ->element($node, 'description', $page->meta_description)
                ->element($node, 'pubDate', date(DATE_RSS, strtotime($page->published_at)))
                ->element($node, 'content:encoded', '', fn($node) =>
                    $node->appendChild(new DOMCdataSection($page->html))
                )
                ->element($node, 'enclosure', '', fn($node) =>
                    $this->attributes($node, [
                        'url' => $page->image_url,
                        'type' => 'image/jpeg',
                        'length' => '0'
                    ])
                )
            );
    }
}
Example XML output for the Google Publisher Center RSS feed:
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <title>Web Development and SEO Guides</title>
    <link>
      <![CDATA[ https://php.fyi ]]>
    </link>
    <description>Articles, guides, and tutorials for design, development, SEO, and marketing</description>
    <language>en</language>
    <atom:link href="https://api.php.fyi/rss/news" rel="self" type="application/rss+xml"/>
    <item>
      <guid isPermaLink="false">php-pagination</guid>
      <title>PHP Pagination</title>
      <link>https://php.fyi/articles/php-pagination</link>
      <author>test@php.fyi (Andrew Mc Cormack)</author>
      <category>Site News</category>
      <description>Learn about search engine crawling and indexing and how to avoid duplicate pages.</description>
      <pubDate>Sat, 24 Dec 2022 01:00:00 +0000</pubDate>
      <content:encoded>
        <![CDATA[ <p>Indexing and the crawling of a website</p> ]]>
      </content:encoded>
      <enclosure url="https://php.fyi/img/articles/php-pagination/summary.jpg" type="image/jpeg" length="0"/>
    </item>
  </channel>
</rss>
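The two date formats used across the feeds can be compared side by side. This sketch swaps the date and strtotime calls from the tutorial for DateTimeImmutable with an explicit UTC timezone, purely so the result does not depend on the server's default timezone. The timestamp is an example value.

```php
<?php

// Fixed UTC timestamp so the result does not depend on the server timezone.
$publishedAt = new DateTimeImmutable(
    '2022-12-24 01:00:00',
    new DateTimeZone('UTC')
);

$rssDate = $publishedAt->format(DATE_RSS); // format used for the RSS pubDate
$sitemapDate = $publishedAt->format('c');  // ISO 8601, used for lastmod and news:publication_date

echo $rssDate, PHP_EOL;     // Sat, 24 Dec 2022 01:00:00 +0000
echo $sitemapDate, PHP_EOL; // 2022-12-24T01:00:00+00:00
```

With date and strtotime the same format strings apply, but the offset in the output follows whatever default timezone PHP is configured with.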
All the XML classes use the same interface and can easily be swapped out in IoC containers within frameworks if needed.
They just need to be initialized, given the page data, and have the getXML method called to return the rendered XML.
The only difference is the page data that needs to be passed in, as some of the classes use different page properties.
$pages = [
    (object) [
        'absolute_url' => 'https://php.fyi',
        'updated_at' => '2022-12-26 21:05:11',
        'change_frequency' => 'weekly',
        'priority' => 0.9
    ]
];

(new SitemapService())->setPages($pages)->getXML();

$pages = [
    (object) [
        'absolute_url' => 'https://php.fyi',
        'published_at' => '2022-12-26 21:05:11',
        'meta_title' => 'Test title',
        'website' => 'PHP.FYI',
        'language' => 'en'
    ]
];

(new SitemapNewsService())->setPages($pages)->getXML();

$pages = [
    (object) [
        'slug' => 'test-slug',
        'meta_title' => 'Test title',
        'meta_description' => 'Test description',
        'published_at' => '2022-12-26 21:05:11',
        'absolute_url' => 'https://php.fyi',
        'image_url' => 'https://php.fyi/img/articles/php-pagination/summary.jpg',
        'html' => '<p>Content</p>',
        'author' => 'test@php.fyi (Andrew Mc Cormack)',
        'category' => 'Site News'
    ]
];

(new NewsRssService())->setPages($pages)->getXML();
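Before submitting a feed it can be worth checking that the rendered string parses as XML. The helper below is a hypothetical addition, not part of the tutorial repository, and the two sample strings are hand-written stand-ins for the value returned by getXML.

```php
<?php

/**
 * Hypothetical helper: returns true when the string parses as XML.
 */
function isWellFormedXml(string $xml): bool
{
    // Collect parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // loadXML() rejects an empty string, so guard against it first.
    $ok = $xml !== '' && $dom->loadXML($xml);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

$good = '<urlset><url><loc>https://php.fyi/</loc></url></urlset>';
// The unescaped ampersand makes this string malformed.
$bad = '<urlset><url><loc>https://php.fyi/?a=1&b=2</loc></url></urlset>';

var_dump(isWellFormedXml($good)); // bool(true)
var_dump(isWellFormedXml($bad));  // bool(false)
```

This only checks that the document is well formed. It does not validate the output against the sitemap or RSS schemas.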