The Table of Contents script – my old nemesis
One thing I like about – let me rephrase that – one of the amazingly few things that I like about Microsoft Word is that you can generate a Table of Contents from a document. Word goes through the headings and creates a nested TOC from them for you.
Now, I always like to do that for documents I write in HTML, too, but maintaining them by hand is a pain. I normally write my document outline first:
<h1 id="cute">Cute things on the Interwebs</h1>
<h2 id="rabbits">Rabbits</h2>
<h2 id="puppies">Puppies</h2>
<h3 id="labs">Labradors</h3>
<h3 id="alsatians">Alsatians</h3>
<h3 id="corgies">Corgies</h3>
<h3 id="retrievers">Retrievers</h3>
<h2 id="kittens">Kittens</h2>
<h2 id="gerbils">Gerbils</h2>
<h2 id="ducklings">Ducklings</h2>
I then collect those, copy and paste them and use search and replace to turn all the hn to <li><a and the id=" to href="#:
<li><a href="#cute">Cute things on the Interwebs</a></li>
<li><a href="#rabbits">Rabbits</a></li>
<li><a href="#puppies">Puppies</a></li>
<li><a href="#labs">Labradors</a></li>
<li><a href="#alsatians">Alsatians</a></li>
<li><a href="#corgies">Corgies</a></li>
<li><a href="#retrievers">Retrievers</a></li>
<li><a href="#kittens">Kittens</a></li>
<li><a href="#gerbils">Gerbils</a></li>
<li><a href="#ducklings">Ducklings</a></li>
<h1 id="cute">Cute things on the Interwebs</h1>
<h2 id="rabbits">Rabbits</h2>
<h2 id="puppies">Puppies</h2>
<h3 id="labs">Labradors</h3>
<h3 id="alsatians">Alsatians</h3>
<h3 id="corgies">Corgies</h3>
<h3 id="retrievers">Retrievers</h3>
<h2 id="kittens">Kittens</h2>
<h2 id="gerbils">Gerbils</h2>
<h2 id="ducklings">Ducklings</h2>
Then I need to look at the weight and order of the headings and add the nesting of the TOC list accordingly.
<ul>
<li><a href="#cute">Cute things on the Interwebs</a>
<ul>
<li><a href="#rabbits">Rabbits</a></li>
<li><a href="#puppies">Puppies</a>
<ul>
<li><a href="#labs">Labradors</a></li>
<li><a href="#alsatians">Alsatians</a></li>
<li><a href="#corgies">Corgies</a></li>
<li><a href="#retrievers">Retrievers</a></li>
</ul>
</li>
<li><a href="#kittens">Kittens</a></li>
<li><a href="#gerbils">Gerbils</a></li>
<li><a href="#ducklings">Ducklings</a></li>
</ul>
</li>
</ul>
<h1 id="cute">Cute things on the Interwebs</h1>
<h2 id="rabbits">Rabbits</h2>
<h2 id="puppies">Puppies</h2>
<h3 id="labs">Labradors</h3>
<h3 id="alsatians">Alsatians</h3>
<h3 id="corgies">Corgies</h3>
<h3 id="retrievers">Retrievers</h3>
<h2 id="kittens">Kittens</h2>
<h2 id="gerbils">Gerbils</h2>
<h2 id="ducklings">Ducklings</h2>
Now, wouldn’t it be nice to have that done automatically for me? The way to do that in JavaScript and DOM is actually a much trickier problem than it looks like at first sight (I always love to ask this as an interview question or in DOM scripting workshops).
Here are some of the issues to consider:
- You can easily get elements with getElementsByTagName(), but sadly you can’t do a getElementsByTagName('h*').
- Headings in XHTML and HTML 4 do not have the elements they apply to as child elements (XHTML2 was proposing that and HTML5 has it to a degree; Bruce Lawson wrote a nice post about this, and there’s also a pretty nifty HTML5 outliner available at http://code.google.com/p/h5o/).
- You can do a getElementsByTagName() for each of the heading levels and then concatenate a collection of all of them. However, that does not give you their order in the source of the document.
- To this end PPK wrote an infamous TOC script used on his site a long time ago using his getElementsByTagNames() function, which works with things not every browser supports. Therefore it doesn’t quite do the job either. He also “cheats” at the assembly of the TOC list as he adds classes to indent the entries visually rather than really nesting lists.
- It seems that the only way to achieve this for all the browsers using the DOM is painful: do a getElementsByTagName('*') and walk the whole DOM tree, comparing nodeName and getting the headings that way.
- Another solution I thought of reads the innerHTML of the document body and then uses regular expressions to match the headings.
- As you cannot assume that every heading has an ID, we need to add one if needed.
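The last point, adding missing IDs, does not depend on how the headings were collected, so it can be sketched on its own. This is only a rough sketch: the name ensureIds and the 'head' prefix are assumptions for illustration, and it only avoids clashes with the IDs of the headings it is given, not with other elements in the document.

```javascript
// Hypothetical helper: gives every heading without an ID a generated one.
// Works on DOM elements as well as plain objects with an `id` property.
function ensureIds(headings, prefix) {
  prefix = prefix || 'head';
  var used = {};
  var i;
  // remember the IDs that are already taken by the given headings
  for (i = 0; i < headings.length; i++) {
    if (headings[i].id) { used[headings[i].id] = true; }
  }
  var count = 0;
  for (i = 0; i < headings.length; i++) {
    if (!headings[i].id) {
      // skip generated names that clash with an existing ID
      while (used[prefix + count]) { count++; }
      headings[i].id = prefix + count;
      used[prefix + count] = true;
      count++;
    }
  }
  return headings;
}
```

Because it only reads and writes the id property, the same function should work on the array of heading elements collected by any of the approaches listed above.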
So here are some solutions to that problem:
Using the DOM:
(function(){
  var headings = [];
  var herxp = /^h[1-6]$/i; // only real heading elements
  var count = 0;
  var elms = document.getElementsByTagName('*');
  // walk all elements in source order and collect the headings
  for(var i=0,j=elms.length;i<j;i++){
    var cur = elms[i];
    if(herxp.test(cur.nodeName)){
      // add an ID where the heading has none
      if(cur.id===''){
        cur.id = 'head'+count;
        count++;
      }
      headings.push(cur);
    }
  }
  var out = '<ul>';
  var oldweight; // weight of the previous heading, undefined at first
  for(i=0,j=headings.length;i<j;i++){
    var weight = headings[i].nodeName.substr(1,1);
    // deeper than the previous heading: open a nested list
    if(weight > oldweight){
      out += '<ul>';
    }
    out += '<li><a href="#'+headings[i].id+'">'+
           headings[i].innerHTML+'</a>';
    if(headings[i+1]){
      var nextweight = headings[i+1].nodeName.substr(1,1);
      // next heading is higher up: close the nested list
      if(weight > nextweight){
        out+='</li></ul></li>';
      }
      if(weight == nextweight){
        out+='</li>';
      }
    }
    oldweight = weight;
  }
  out += '</li></ul>';
  document.getElementById('toc').innerHTML = out;
})();
You can see the DOM solution in action at http://isithackday.com/demos/tocit/toc_dom.html. The problem with it is that it can become very slow on large documents and in MSIE6.
The regular expressions solution
To work around the need to traverse the whole DOM, I thought it might be a good idea to use regular expressions on the innerHTML of the DOM and write it back once I added the IDs and assembled the TOC:
(function(){
  var bd = document.body,
      x = bd.innerHTML,
      // fall back to an empty list when the document has no headings
      headings = x.match(/<h\d[^>]*>[\S\s]*?<\/h\d>$/mg) || [],
      r1 = />/,
      r2 = /<(\/)?h(\d)/g,
      toc = document.createElement('div'),
      out = '<ul>',
      i = 0,
      j = headings.length,
      cur = '',
      weight = 0,
      nextweight = 0,
      oldweight = 2,
      container = bd;
  for(i=0;i<j;i++){
    // add an ID to headings that lack one, in the markup string
    if(headings[i].indexOf('id=')==-1){
      cur = headings[i].replace(r1,' id="h'+i+'">');
      x = x.replace(headings[i],cur);
    } else {
      cur = headings[i];
    }
    weight = cur.substr(2,1);
    if(i<j-1){
      nextweight = headings[i+1].substr(2,1);
    }
    // turn the heading into a link pointing at its ID
    var a = cur.replace(r2,'<$1a');
    a = a.replace('id="','href="#');
    if(weight>oldweight){ out+='<ul>'; }
    out+='<li>'+a;
    if(nextweight<weight){ out+='</li></ul></li>'; }
    if(nextweight==weight){ out+='</li>'; }
    oldweight = weight;
  }
  // write the markup with the added IDs back before looking up #toc
  bd.innerHTML = x;
  toc.innerHTML = out +'</li></ul>';
  container = document.getElementById('toc') || bd;
  container.appendChild(toc);
})();
You can see the regular expressions solution in action at http://isithackday.com/demos/tocit/toc_js.html. The problem with it is that reading innerHTML and then writing it out might be expensive (this needs testing), and if you have event handling attached to elements it might leak memory, as my colleague Matt Jones pointed out (again, this needs testing). Ara Pehlivanian also mentioned that a mix of both approaches might be better: match the headings, but instead of writing back the innerHTML, use the DOM to add the IDs.
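The matching step can also be tried on its own, on a plain string. The sketch below is not the script from the demo: extractHeadings is a hypothetical name, and the pattern differs from the one above in that it drops the end-of-line anchor and uses a backreference, so that a closing tag has to be of the same level as the opening one.

```javascript
// Hypothetical helper: finds headings in an HTML string and returns
// {level, id, text} objects in source order.
function extractHeadings(html) {
  // \1 makes the closing tag match the level of the opening tag
  var rx = /<h([1-6])([^>]*)>([\s\S]*?)<\/h\1>/gi;
  // looks for an id attribute inside the opening tag
  var idRx = /\sid\s*=\s*["']([^"']*)["']/i;
  var found = [];
  var m;
  while ((m = rx.exec(html)) !== null) {
    var idMatch = idRx.exec(m[2]);
    found.push({
      level: parseInt(m[1], 10),
      id: idMatch ? idMatch[1] : '',
      text: m[3]
    });
  }
  return found;
}
```

As it only reads the string, nothing is written back to innerHTML; adding the missing IDs would still have to happen through the DOM.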
Libraries to the rescue – a YUI3 example
Talking to another colleague – Dav Glass – about the TOC problem he pointed out that the YUI3 selector engine happily takes a list of elements and returns them in the right order. This makes things very easy:
<script type="text/javascript" src="http://yui.yahooapis.com/3.0.0/build/yui/yui-min.js"></script>
<script>
YUI({combine: true, timeout: 10000}).use("node", function(Y) {
  // the selector engine returns the headings in source order
  var nodes = Y.all('h1,h2,h3,h4,h5,h6');
  var out = '<ul>';
  var weight = 0,nextweight = 0,oldweight;
  nodes.each(function(o,k){
    // add an ID where the heading has none
    var id = o.get('id');
    if(id === ''){
      id = 'head' + k;
      o.set('id',id);
    }
    weight = o.get('nodeName').substr(1,1);
    if(weight > oldweight){ out+='<ul>'; }
    out+='<li><a href="#'+o.get('id')+'">'+o.get('innerHTML')+'</a>';
    if(nodes.item(k+1)){
      nextweight = nodes.item(k+1).get('nodeName').substr(1,1);
      if(weight > nextweight){ out+='</li></ul></li>'; }
      if(weight == nextweight){ out+='</li>'; }
    }
    oldweight = weight;
  });
  out+='</li></ul>';
  Y.one('#toc').set('innerHTML',out);
});
</script>
There is probably a cleaner way to assemble the TOC list.
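One possibly cleaner way is to keep a stack of the list levels that are currently open, so that every list that gets opened is also closed, even when the headings jump several levels at once. The following is a sketch under assumptions, not part of any of the demos: buildToc is a hypothetical name, and it expects a flat, source-ordered array of objects with a numeric level, an id and a text.

```javascript
// Hypothetical helper: turns a flat, source-ordered list of headings
// into nested <ul> markup. Each heading is {level, id, text}.
function buildToc(headings) {
  var out = '';
  var open = []; // levels of the lists that are currently open
  for (var i = 0; i < headings.length; i++) {
    var h = headings[i];
    // close lists that are deeper than the current heading
    while (open.length && open[open.length - 1] > h.level) {
      out += '</li></ul>';
      open.pop();
    }
    if (open.length && open[open.length - 1] === h.level) {
      out += '</li>';   // sibling: close the previous item
    } else {
      out += '<ul>';    // deeper (or first) heading: open a list
      open.push(h.level);
    }
    out += '<li><a href="#' + h.id + '">' + h.text + '</a>';
  }
  // close whatever is still open
  while (open.length) {
    out += '</li></ul>';
    open.pop();
  }
  return out;
}
```

With the headings collected by any of the approaches above, the input could be built by reading the digit from nodeName, the id and the innerHTML of each heading. A document that starts on a lower heading level than it uses later ends up with more than one top-level list, but the markup stays balanced.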
Performance considerations
There is more to life than simply increasing its speed. – Gandhi
Some of the code above can be very slow. That said, whenever we talk about performance and JavaScript, it is important to consider the context of the implementation: a table of contents script would normally be used on a text-heavy, but simple, document. There is no point in measuring and judging these scripts by running them over Gmail or the Yahoo homepage. Faster and less memory-consuming is always better, but I am always a bit sceptical about performance tests that consider edge cases rather than the case the solution was meant to be applied to.
Moving server side
The other thing I am getting more and more sceptical about is client-side solutions for things that also make sense on the server. Therefore I thought I could use the regular expressions approach above and move it server side.
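For a rough comparison on a document of realistic size, a very simple timing helper may be enough. This is a sketch only; timeIt is a hypothetical name and not part of the demos, and it measures nothing but wall-clock time.

```javascript
// Hypothetical helper: runs `fn` a number of times and returns the
// average duration of one run in milliseconds.
function timeIt(fn, runs) {
  runs = runs || 10;
  var start = new Date().getTime();
  for (var i = 0; i < runs; i++) {
    fn();
  }
  return (new Date().getTime() - start) / runs;
}
```

Each TOC function would be wrapped in a function and passed in with the same document loaded. Date only has millisecond resolution, so on small documents a higher number of runs is needed to see any difference.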
The first version is a PHP script you can loop an HTML document through. You can see the outcome of tocit.php at http://isithackday.com/demos/tocit/tocit.php?file=plain.html:
<?php
$file = isset($_GET['file']) ? $_GET['file'] : '';
// only accept plain file names such as text.html
if(preg_match('/^[a-z0-9\-_\.]+$/i',$file)){
  $content = file_get_contents($file);
  // $headlines[0] holds the full headings, $headlines[1] their levels
  preg_match_all("/<h([1-6])[^>]*>.*<\/h.>/Us",$content,$headlines);
  $out = '<ul>';
  foreach($headlines[0] as $k=>$h){
    // add an ID to headings that do not have one
    if(strpos($h,'id="')===false){
      $x = preg_replace('/>/',' id="head'.$k.'">',$h,1);
      $content = str_replace($h,$x,$content);
      $h = $x;
    }
    // turn the heading into a link pointing at its ID
    $link = preg_replace('/<(\/)?h\d/','<$1a',$h);
    $link = str_replace('id="','href="#',$link);
    if($k>0 && $headlines[1][$k-1]<$headlines[1][$k]){
      $out.='<ul>';
    }
    $out .= '<li>'.$link;
    // isset() avoids a notice when there is no next heading
    if(isset($headlines[1][$k+1]) && $headlines[1][$k+1]<$headlines[1][$k]){
      $out.='</li></ul></li>';
    }
    if(isset($headlines[1][$k+1]) && $headlines[1][$k+1] == $headlines[1][$k]){
      $out.='</li>';
    }
  }
  $out.='</li></ul>';
  echo str_replace('<div id="toc"></div>',$out,$content);
}else{
  die('only files like text.html please!');
}
?>
This is nice, but instead of having another file to loop through, we can also use the output buffer of PHP (http://isithackday.com/demos/tocit/toc_ob.php):
<?php
// output buffer callback: receives the whole page and returns it with the TOC
function tocit($content){
  // $headlines[0] holds the full headings, $headlines[1] their levels
  preg_match_all("/<h([1-6])[^>]*>.*<\/h.>/Us",$content,$headlines);
  $out = '<ul>';
  foreach($headlines[0] as $k=>$h){
    // add an ID to headings that do not have one
    if(strpos($h,'id="')===false){
      $x = preg_replace('/>/',' id="head'.$k.'">',$h,1);
      $content = str_replace($h,$x,$content);
      $h = $x;
    }
    // turn the heading into a link pointing at its ID
    $link = preg_replace('/<(\/)?h\d/','<$1a',$h);
    $link = str_replace('id="','href="#',$link);
    if($k>0 && $headlines[1][$k-1]<$headlines[1][$k]){
      $out.='<ul>';
    }
    $out .= '<li>'.$link;
    // isset() avoids a notice when there is no next heading
    if(isset($headlines[1][$k+1]) && $headlines[1][$k+1]<$headlines[1][$k]){
      $out.='</li></ul></li>';
    }
    if(isset($headlines[1][$k+1]) && $headlines[1][$k+1] == $headlines[1][$k]){
      $out.='</li>';
    }
  }
  $out.='</li></ul>';
  return str_replace('<div id="toc"></div>',$out,$content);
}
ob_start("tocit");
?>
[... the document ...]
<?php ob_end_flush();?>
The server side solutions have a few benefits: they always work, and you can also cache the result if needed for a while. I am sure the PHP can be sped up, though.
See all the solutions and get the source code
I showed you mine, now show me yours!
All of these solutions are pretty much rough and ready. How do you think they can be improved? How about doing a version for different libraries? Go ahead, fork the project on GitHub and show me what you can do.
Tags: dom, generator, headings, HTML, javascript, outline, php, tableofcontents, toc, word, YUI3




January 6th, 2010 at 3:33 pm
@codepo8 http://www.kryogenix.org/code/browser/generated-toc/
– Posted using Chat Catcher
January 6th, 2010 at 4:05 pm
Here’s a python toc generator script I wrote a while back for a wiki plugin I ended up not using.
https://code.edge.launchpad.net/~muffinresearch/+junk/toc
January 6th, 2010 at 4:32 pm
Good post! The javascript solutions would create a nice bookmarklet ;-)
Anyway, wouldn’t OL be a better choice than UL? (just nitpicking…)
January 6th, 2010 at 5:18 pm
Wouldn’t jQuery selectors also return the headings in the order they are found?
A quick (one liner) test seems to show it does:
http://alastairc.ac/testing/headings-test.html
That was with clone, and with append.
January 6th, 2010 at 5:54 pm
I’m pretty sure the innerHTML writing will actually be much faster on the majority of browsers. Great stuff. And thanks for the Python solution Stuart. May come in very handy for me.
January 6th, 2010 at 8:36 pm
You can easily use the HTML5 algorithm to TOCify HTML4, as HTML5 is supposed to be backwards compatible. The only catch is that in such a case you do have to be “strict” about your heading tags, i.e. no jumps like H2 to H4 (the HTML4 spec says heading rank only means importance within the context of the page).
As for the performance of the algorithm from the HTML5 spec, I can’t say yet. It does have to walk the full tree anyway (and touches every element twice); however, I suppose that Y.all('h1,h2,h3,h4,h5,h6') does getElementsByTagName('*') and then loops through everything anyway. I smell a performance study.
However, it does seem very correct in what it does (as is to be expected) and it is future proof, because the SECTION tag (and the like) resets heading “rank”, so basically “H1 + ARTICLE H1” is essentially the same as “H1 + H2” with virtually unlimited nesting depth.
And thanks for the backlink to the outliner :) I now may just go and do the navigation part for it…
January 6th, 2010 at 8:44 pm
Christian, I wanted to suggest the use of OL instead of UL. The html has been removed in the comment.
January 7th, 2010 at 8:03 pm
I’d change the first regex:
/<h\d[^>]*>[\S\s]*?<\/h\d>$/mg
to this:
/<h(\d)[^>]*>[\S\s]*?<\/h\1>$/mg
This guarantees that the closing heading matches the opening heading.
Also, it’s Gandhi, not Ghandi ;)
January 8th, 2010 at 11:40 pm
Here’s my take: http://github.com/joshwnj/toc/
I was curious to see how it could be done by DOM traversal, rather than fetching everything up-front.
Uses jQuery.
January 9th, 2010 at 1:36 am
I posted on my website a script that does this but using jQuery framework: http://lab.diogovincenzi.com/blog/view/table-contents-jquery
January 23rd, 2010 at 3:43 pm
Great code, thanks! I noticed a few problems with the PHP version. I’ve modified it to handle this:
http://c55srj2.eps.manchester.ac.uk/~web1_mbasssma/projects/toc/
PHP source at http://c55srj2.eps.manchester.ac.uk/~web1_mbasssma/projects/toc/toc.php.txt
The fixes I made I think solve the following issues I came across:
1. Allows for headers being nested in any order e.g. h4>h1>h1>h2>h6>h3…
2. Correctly handles the case when a silly user puts images or any other elements inside an h tag (particularly images), i.e. it strips them out to prevent them breaking the contents table.
I’m by no means a seasoned PHP programmer, but I hope what I have done is useful to someone. Many thanks!
January 23rd, 2010 at 3:45 pm
oh, and it also handles the situation where two headers have the same level (e.g. h2) and identical text. To fix this I included the str_replace_once function. Without this the duplicates end up with the same anchor tag.
–Stuart