PHP blog SEO: JSON-LD, server-side ToC and crawlable pagination without a framework

Google Search Console showed zero indexed articles after three weeks of publishing. The articles existed. The URLs returned 200. The sitemap had been submitted. The problem was elsewhere: the blog listing loaded posts.json via fetch(), then rendered all the HTML in JavaScript. As far as Googlebot was concerned, the page was an empty shell with a spinner.

What follows is the set of fixes applied to the shared PHP template — no framework, no build step, plain Apache. Each point has a concrete reason, not just "SEO best practices say so".

The JS-first problem

Googlebot can execute JavaScript. Google has been saying this for years, and it's true. But "can" doesn't mean "systematically" or "immediately". JS crawling goes through a secondary rendering queue — plain HTML pages are indexed first. The result: a fully JS-rendered blog can take weeks to get indexed, and even then, only if Googlebot decides the page is worth the rendering cost.

The second problem is structural: if pagination is handled entirely in JS (click "Page 2" → client-side re-render), page 2 doesn't exist from a crawler's perspective. There's only one URL, and it always shows the first 10 articles. Googlebot doesn't click JavaScript buttons to see more content. Articles on page 3 or 4 will never be crawled.

The fix is straightforward in principle: the server must render the HTML for articles on the current page. JavaScript can then take over for interactive navigation — but the first render must be visible in the page source.

JSON-LD and articleBody: what Google actually wants

Schema.org BlogPosting is the baseline structured data for a blog article. It lets Google display the date, author, and potentially the breadcrumb in rich results. But the most useful field is articleBody: the plain text of the article, which gives Google a clean, markup-free copy of the content to draw on for snippets.

The problem: when JSON-LD is generated inside blog_header() (at the top of the page), the article content hasn't been rendered yet. You can't inject articleBody at that point.

The solution is PHP output buffering. In blog_header(), start a buffer and put a placeholder in the JSON-LD:

ob_start();
// JSON-LD with placeholder
$json_ld = '{ "@type": "BlogPosting", "articleBody": "ARTICLE_BODY_PLACEHOLDER", ... }';

Then in blog_footer(), retrieve the full rendered HTML, extract the content of the .article-content element, strip the tags, and inject the text in place of the placeholder before sending everything to the browser:

// In blog_footer() — capture the HTML rendered by the article
$article_html = ob_get_clean();

if (preg_match('~<article class="article-content">(.*?)</article>~s', $article_html, $body_match)) {
    $article_body = strip_tags($body_match[1]);
    $article_body = preg_replace('/\s+/', ' ', trim($article_body));
    $article_body = mb_substr($article_body, 0, 5000); // plain substr() can split a multibyte char and make json_encode() return false
    $json_ld = str_replace('"ARTICLE_BODY_PLACEHOLDER"', json_encode($article_body), $json_ld);
}
echo $article_html;

The complete JSON-LD is injected into <head> just before the buffer is sent. Google sees a real articleBody, not a placeholder. The 5000-character limit is arbitrary — Google truncates anyway.
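For reference, the final tag injected into <head> ends up looking something like this. All field values below are illustrative, not the template's actual data:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "PHP blog SEO without a framework",
  "datePublished": "2025-01-10",
  "author": { "@type": "Person", "name": "Jane Doe" },
  "articleBody": "Google Search Console showed zero indexed articles after three weeks of publishing. The articles existed. ..."
}
</script>
```

Because json_encode() produces a valid JSON string literal (quotes and backslashes escaped), the str_replace() substitution cannot break the surrounding JSON.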

Server-side ToC and the iconv trap

A server-side table of contents has two advantages: it's visible in the page source (therefore crawlable), and it has no JavaScript dependency. The implementation is straightforward — parse the <h2> and <h3> tags from the article HTML, generate anchors, and inject the ToC at the beginning of .article-content.

The classic trap is generating anchor IDs for headings with accented characters. The standard reflex is to use iconv for transliteration:

// ❌ Depends on system locale — returns empty string on servers without fr_FR.UTF-8
$id = preg_replace('/[^a-z0-9]+/', '-', strtolower(iconv('UTF-8', 'ASCII//TRANSLIT', $heading_text)));

On a server where the fr_FR.UTF-8 locale isn't installed, iconv with //TRANSLIT returns an empty string for accented characters. The generated anchor becomes #--- instead of #les-3-pieges. The ToC link points to nothing.

The fix is an explicit mapping with strtr():

// ✅ Explicit mapping, portable
$id = preg_replace('/[^a-z0-9]+/', '-', strtolower(strtr($heading_text, [
    'à'=>'a','â'=>'a','é'=>'e','è'=>'e','ê'=>'e','ë'=>'e',
    'î'=>'i','ï'=>'i','ô'=>'o','ù'=>'u','û'=>'u','ü'=>'u',
    'ç'=>'c','æ'=>'ae','œ'=>'oe',
])));

No locale dependency. No unpredictable behavior between dev and production. The same heading always generates the same ID. The same pattern is applied both to the <h2> IDs in the article body and to the links in the ToC — consistency guaranteed.
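Wrapped in a helper (the function name is hypothetical, and the trailing trim() is a small addition to avoid dashes from leading or trailing punctuation), the mapping is easy to sanity-check:

```php
<?php
// Hypothetical wrapper around the strtr() mapping above.
function heading_to_id(string $heading_text): string {
    $id = preg_replace('/[^a-z0-9]+/', '-', strtolower(strtr($heading_text, [
        'à'=>'a','â'=>'a','é'=>'e','è'=>'e','ê'=>'e','ë'=>'e',
        'î'=>'i','ï'=>'i','ô'=>'o','ù'=>'u','û'=>'u','ü'=>'u',
        'ç'=>'c','æ'=>'ae','œ'=>'oe',
    ])));
    return trim($id, '-'); // drop dashes left by leading/trailing punctuation
}

echo heading_to_id('Les 3 pièges à éviter'); // les-3-pieges-a-eviter
```

Note that strtr() runs before strtolower(), so only lowercase accented characters need mapping as long as headings follow normal French capitalization.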

Crawlable pagination: the hybrid approach

The goal is for Googlebot to reach every article, not just those on the first page. The constraint: don't break the existing interactive navigation (search, category filters, click-through pagination).

The hybrid solution: PHP loads posts.json and renders the cards for the current page (determined by the ?page=N query parameter). JavaScript keeps its search and filter functions, but detects whether PHP has already rendered content on initial load — and if so, skips the re-render.

<?php foreach ($page_posts as $post):
    $meta = $post[$lang];
    $url = $base_url . $post['slug'];
?>
<article class="blog-card" data-slug="<?= htmlspecialchars($post['slug']) ?>">
    <h2 class="blog-card-title">
        <a href="<?= htmlspecialchars($url) ?>"><?= htmlspecialchars($meta['title']) ?></a>
    </h2>
    ...
</article>
<?php endforeach; ?>
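The computation of $page_posts isn't shown above; a minimal sketch, where the helper name and the $per_page value of 10 are assumptions:

```php
<?php
// Hypothetical helper: clamp a requested page number and slice the posts array.
function paginate(array $posts, int $page, int $per_page = 10): array {
    $total_pages = max(1, (int) ceil(count($posts) / $per_page));
    $page = min(max(1, $page), $total_pages); // clamp ?page=0 or ?page=999
    return array_slice($posts, ($page - 1) * $per_page, $per_page);
}

// In the listing template:
// $posts = json_decode(file_get_contents(__DIR__ . '/posts.json'), true);
// $page_posts = paginate($posts, (int) ($_GET['page'] ?? 1));
```

Clamping matters for crawling too: an out-of-range ?page=999 should show the last real page rather than an empty list, so every paginated URL Googlebot discovers contains content.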

Pagination links are real <a href="?page=2"> elements, with a data-page attribute so JavaScript can intercept them:

document.querySelectorAll('a[data-page]').forEach(function(link) {
    link.addEventListener('click', function(e) {
        e.preventDefault();
        currentPage = parseInt(this.dataset.page, 10);
        render(filterPosts());
        window.history.pushState({}, '', '?page=' + currentPage);
    });
});
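One caveat with the interception above: pushState() changes the URL, but nothing restores the state when the user presses the back button. A small popstate handler fixes that; a sketch, assuming the same currentPage, render() and filterPosts() from the listing script:

```javascript
// Parse a page number from a query string; default to 1 on anything invalid.
function pageFromSearch(search) {
    var n = parseInt(new URLSearchParams(search).get('page') || '1', 10);
    return Number.isNaN(n) || n < 1 ? 1 : n;
}

// In the browser, restore the right page on back/forward navigation.
if (typeof window !== 'undefined') {
    window.addEventListener('popstate', function () {
        currentPage = pageFromSearch(window.location.search);
        render(filterPosts());
    });
}
```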

To prevent JS from re-rendering cards that PHP just rendered, add a guard at initial load:

// If no active filter and PHP already rendered articles, skip re-render
var hasPhpContent = document.querySelector('#posts-container article') !== null;
if (searchTerm || (activeCategory !== 'Tous' && activeCategory !== 'All') || !hasPhpContent) {
    render(filterPosts());
}

Googlebot crawls /blog/?page=2, sees the cards in plain HTML, follows the links to individual articles. JavaScript is no longer a prerequisite for indexation. For the user, nothing changes — navigation stays instant.

hreflang, canonical, og:type: the details that matter

These three are often treated as copy-paste boilerplate. Each has a specific purpose.

The canonical must point to the URL without a query string. Without it, /blog/my-article?page=1 and /blog/my-article are two distinct URLs as far as Google is concerned, and it doesn't know which one to index. One line is enough:

$canonical = strtok($current_url, '?');
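In context, that one-liner plus the echo look like this (the $current_url value is illustrative):

```php
<?php
// strtok() returns everything before the first '?', or the whole
// string when there is no query string at all.
$current_url = 'https://www.web-developpeur.com/blog/my-article?page=1';
$canonical = strtok($current_url, '?');
echo '<link rel="canonical" href="' . htmlspecialchars($canonical) . '" />';
```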

The hreflang links tell Google that French and English versions of the same content exist. Without them, Google may decide to show the wrong version based on the visitor's language. The x-default entry is strongly recommended: it specifies which version to serve when no locale matches (typically a Japanese user on a FR/EN blog):

<link rel="alternate" hreflang="fr" href="https://www.web-developpeur.com/blog/" />
<link rel="alternate" hreflang="en" href="https://www.web-developpeur.com/en/blog/" />
<link rel="alternate" hreflang="x-default" href="https://www.web-developpeur.com/blog/" />

The og:type should be article on article pages, and website on the listing and homepage. The distinction affects how Facebook and LinkedIn generate the share preview. og:locale is complementary — fr_FR vs en_US depending on the article language.
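Concretely, a French article page would carry tags along these lines (values illustrative):

```html
<meta property="og:type" content="article" />
<meta property="og:locale" content="fr_FR" />
<meta property="og:locale:alternate" content="en_US" />
```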

Conclusion

None of these changes required rethinking the blog's architecture. Everything lives in the shared PHP template, with a minor adjustment to the listing page JavaScript. No SSR framework, no build step, no dedicated CDN.

The most counter-intuitive part is the output buffering for articleBody: it's a technique from the early 2000s that neatly solves a sequencing problem between a template's header and footer. Same for the strtr() mapping — it's less "elegant" than iconv, but it's what actually works in production.

Two weeks after deploying these changes, Search Console was showing the first indexed articles. Not because Google had suddenly decided to crawl JavaScript better — but because the HTML was finally there, visible from the very first byte of the response.
