• Who will read your Semantic HTML?

    Jan 3 2007

    I've talked about Semantic HTML before, and many other people have. But the one thing I find missing in these discussions is an explanation of why we should use Semantic HTML, or more specifically, who or what it will be that later reads your Semantic HTML to extract meaning from it.

    Semantic HTML is all about adding meaning to your document by using appropriate HTML elements. It's a great concept. Why waste time and space with unnecessary <div> and <span> elements when other more meaningful elements are available like <h1> or <label>?

    Certainly, Semantic HTML has many practical benefits. <label> has usability and accessibility benefits, like allowing people to click some text to check a checkbox, or allowing people with screen readers to understand what a text input is for. Headers like <h1> give a document structure by allowing hierarchical naming of the sections of a document. <title> lets you give a document a title. And of course, <a> lets you create hyperlinks which tie web pages together.

    All these are very clear and understandable benefits and ways of using Semantic HTML. I find there are other benefits to using a variety of elements, like being able to work with the HTML and CSS of a document more easily. If a document was made completely with <div>s, you'd spend a lot of time giving things class names and IDs unnecessarily, and you'd spend a lot of time trying to figure out which </div> goes with which <div>.

    Now what really bugs me is when people start arguing about the semantic meaning of a certain element. Recently on Snook there was a discussion of the Use of ADDRESS Element. Now, I understand where such discussions are coming from. The W3C HTML specs do try to define what each element is supposed to be used for. But if such a definition isn't totally clear, then you really can't use the element for anything. And you know a definition is vague when a dozen semantic and standards enthusiasts can't quite agree. And if these people can't agree, then what about the millions of other people who make web pages and haven't even heard of the W3C?

    I believe it comes down to practicalities. For example, with the <address> element, you could argue that the element should only be used for web page contact information because the specs imply this. But what will this allow us to do? Is someone going to make a tool that looks for <address> elements on a page and uses them to let you contact the owner of the page? I doubt it, except maybe spam email harvesters. And even if there were one, the tool would find itself to be nearly useless since so many web pages are likely using <address> for a wide variety of purposes that have little to do with web page contact info.

    And what about other elements that don't even have a specific use like ordered, unordered and definition lists? I find it hard to imagine a scenario where the semantic meaning implicit in a list of items can be utilized. It's possible Google Sets uses lists this way, but chances are it mostly uses comma-separated words. And maybe the definitions feature of Google could use definition lists, but most of the results come from sites that don't use definition lists at all.

    Well this brings me to the point I wanted to make. We need to think about who or what it is that will actually be extracting the meaning we're adding to our documents by using Semantic HTML. And basically I can think of three groups:

    1. Web developers

      Yourself or others that will actually be reading and working with the HTML you produce. For this purpose, class names, IDs and elements all add semantic meaning or at least readability to a document. This makes it easier to work with the HTML, and understand what each element represents structurally.

    2. Search engine spiders and other bots

      These are tools that read a large number of web pages and try to extract some meaning from them. Search engines understand that text in titles, meta tags, links and headers is special. Technorati's Microformats Search is another great example of semantics being utilized.

    3. Web browsers, screen readers and other clients

      These understand what many of the different elements are for and allow the visitor to interact with these elements in a unique way, like with a checkbox or link. Also, the client can communicate semantics to a visitor by displaying elements a certain way, like numbering the items of an ordered list. However, this semantic communication can be messed with by using CSS. An ordered list with list-style: none and list items floated will communicate no semantics to a user of a visual web browser.
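
    To make the second group a little more concrete, here is a minimal sketch of how a bot might pull out the pieces of a page that get special treatment: the title, the headers and the link targets. It uses only Python's standard-library html.parser. The names OutlineParser and outline are made up for this illustration, and real search engine spiders are certainly far more involved than this.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collects the title, headings and link targets of a page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []    # (tag, text) pairs in document order
        self.links = []       # href values of <a> elements
        self._current = None  # title/heading tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Text only counts while we are inside a title or heading.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = "".join(self._buffer).strip()
        if tag == "title":
            self.title = text
        else:
            self.headings.append((tag, text))
        self._current = None


def outline(html):
    """Hypothetical helper: parse a page and return the filled-in parser."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser
```

    Text inside a <div> never shows up in the result, no matter what class name it carries, which is the practical difference between a header element and a styled <div> as far as a tool like this is concerned.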

    In terms of these three groups of web page users, try to think of what difference it will make if a <dd> gets used for a blog post body instead of strictly a word definition. If the semantics can't be used or even considered to really mean anything, then can they even be considered semantic?

  • Comments

    1. Jonathan Snook at 9:35pm on January 3, 2007

    The reason why much of this debate occurs is because we want (need?) a consensus. It's the chicken vs egg thing. We need to establish a base from which quality tools can be built on top of. This is why microformats are taking off. Who cares if I use a class called "telephone" or "tel" or "classA"? They all do the same thing ... until tools can extract reliable data, but it's not reliable until there's a consensus.

    So, web standards establish a baseline. Microformats establish a baseline. Then, tools can take advantage of them. Then, you can automatically book an event in Upcoming.org from a date in another web page. Then, you can add a contact to Outlook with the vcard data embedded in the page.

    I think address is particularly maligned because the element's name seems to evoke so much meaning that one would think it obvious what it should do.

    2. Jason Barnabe at 12:06am on January 4, 2007

    Another advantage is sane rendering when a stylesheet is not applied. Headers look like headers, lists look like lists, etc.

    I think you're being a bit close-minded when it comes to possible uses of semantic HTML to bots. For example, a theoretical "Contact the webmaster" bot or extension doesn't need to have set data to make use of an address element - a possible algorithm could be
    1. Look for anchors with a mailto: href in an address element
    2. Look for text like *@*.* in an address element
    3. Look for anchors with a mailto: href elsewhere
    4. Look for text like *@*.* elsewhere
    So for this bot/extension, the address element is certainly useful for it to reduce "false positives", but address elements used for other purposes and a lack of address elements don't trip it up. I'd find it surprising if Google Sets and Google Definitions *didn't* make use of semantic data in this manner. Even if W3C came out and said "put e-mail addresses in the address element", you'd still have to deal with all the same issues.
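
    The four steps above can be sketched roughly as follows, again with Python's standard-library html.parser. ContactParser, find_contact and the simplified EMAIL_RE pattern are assumptions made for this illustration, not part of any existing tool, and the pattern is only a loose stand-in for "text like *@*.*".

```python
import re
from html.parser import HTMLParser

# Deliberately loose approximation of "text like *@*.*".
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class ContactParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._address_depth = 0
        # candidates[i] holds the matches for step i + 1 of the algorithm.
        self.candidates = [[], [], [], []]

    def handle_starttag(self, tag, attrs):
        if tag == "address":
            self._address_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("mailto:"):
                # Strip the scheme and any "?subject=..." query part.
                email = href[len("mailto:"):].split("?", 1)[0]
                step = 0 if self._address_depth else 2
                self.candidates[step].append(email)

    def handle_endtag(self, tag):
        if tag == "address" and self._address_depth:
            self._address_depth -= 1

    def handle_data(self, data):
        step = 1 if self._address_depth else 3
        self.candidates[step].extend(EMAIL_RE.findall(data))


def find_contact(html):
    """Return the best e-mail candidate, or None if the page has none."""
    parser = ContactParser()
    parser.feed(html)
    parser.close()
    # Earlier steps win: an <address> match beats a match elsewhere.
    for matches in parser.candidates:
        if matches:
            return matches[0]
    return None
```

    An <address> element that holds a street address, or a page with no <address> element at all, simply leaves the first two lists empty, so the lookup falls through to the later steps.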

    I don't see any downsides to being as semantic as possible other than possibly having to override the default CSS. With so many upsides, both theoretical and practical, why wouldn't you be as semantic as possible?

    3. Emil Stenström at 5:18pm on January 4, 2007

    I tend to look at things a little differently. I believe websites should be written for humans not robots. Robots can be given info in other ways, link to a .vcard file instead of pushing it in with strange classnames.

    As a web developer I think the biggest reason to use semantic HTML is to "do things the right way". In programming you don't repeat yourself in your code, you extract methods and call them. With CSS you don't define the style over and over again in the HTML, you extract it to a separate file and link it. Semantic HTML is a lot like (declarative) programming, and I think it should be compared to that.

    4. Jonathan Snook at 8:40pm on January 4, 2007

    Emil: Using vCard as an example specifically, the problem is that a browser can't render a vCard and they may not have an application that understands vCard. It can, however, render HTML just fine. So, a microformat still creates something that is flexible and usable by browsers and users but adds a consistent layer that allows applications to make use of it, too. And it saves you from duplicating contact information in the page AND in a vCard (D.R.Y.!).

    5. Keith Alexander at 8:04am on May 8, 2007

    Jonathan: Emil has a valid point. In the excitement over the possibilities of aggregating microformats, it often seems to be forgotten that vcard and ical files are already published in large numbers on the web, and have very good tools for creating them.

    In many cases, it makes more sense to use a script to generate html from the vcard or ical, and link to the original file, than it does to try to start with the html and generate a machine readable format.  You/your client can use existing calendar and address book apps, and you don't have to compromise between accessibility and information loss (ie: the abbr[@title] hack).
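
    As a rough illustration of the vcard-first approach described above, the sketch below reads a simple vCard and generates HTML that links back to the original file. It is an assumption-heavy toy: parse_vcard and vcard_to_html are hypothetical names, only the FN, TEL and EMAIL properties are handled, and folded lines and escaped characters in the vCard are ignored.

```python
from html import escape


def parse_vcard(text):
    """Return {PROPERTY: [values]} for a simple, unfolded vCard."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        # Drop parameters such as ";TYPE=work" and normalise the case.
        prop = name.split(";", 1)[0].strip().upper()
        if prop in ("BEGIN", "END", "VERSION"):
            continue
        fields.setdefault(prop, []).append(value.strip())
    return fields


def vcard_to_html(text, vcard_url):
    """Build an HTML fragment from the vCard and link to the source file."""
    fields = parse_vcard(text)
    parts = ["<address>"]
    for name in fields.get("FN", []):
        parts.append(f"  <strong>{escape(name)}</strong>")
    for tel in fields.get("TEL", []):
        parts.append(f"  <span>{escape(tel)}</span>")
    for email in fields.get("EMAIL", []):
        safe = escape(email)
        parts.append(f'  <a href="mailto:{safe}">{safe}</a>')
    # vcard_url is wherever the original file is published (assumed).
    parts.append(f'  <a href="{escape(vcard_url)}">Download vCard</a>')
    parts.append("</address>")
    return "\n".join(parts)
```

    The vCard stays the single source of the contact data, and the HTML is regenerated from it whenever the file changes.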

    That said, Emil, I don't understand the humans vs. machines dichotomy. Web pages are necessarily processed by machines for the value of humans, so what's wrong with increasing the value to humans by making it easier for machines where you can?

    6. lewis litanzios at 8:16pm on March 27, 2008

    slightly off topic considering the way the comments, albeit very interesting, are going i know, but i always wondered whether it mattered how you name your classes/IDs?

    is 'camelCase' any different to using an 'under_score' in terms of how machines will interpret your semantic conventions? i feel more comfortable using camelCase these days, but do get slightly jealous of under_scores when i see them sometimes, for some strange reason (don't ask me why)? i've gone off using hyphens since learning XML best practices.

    i did think about blogging this myself, but it did occur to me it would be rather a short post. i think there's already been enough written on semantics recently to be jumping on the wagon.

    thanks for raising this issue jesse.

    ps. first i've heard of google sets - could this be used for generating meta keywords?

    pps. do you 'ping' to technorati (http://technorati.com/ping) out of interest?

    7. Jesse Skinner at 2:43pm on March 29, 2008

    @lewis - Class names are only semantic in terms of communicating with other designers/developers working on the code. The only time machines/bots really care about class names is when dealing with microformats or other pre-defined meanings, and in that case the format is also unimportant as long as it's documented. Google, for example, doesn't search/index class names.

    ps. sure, try it out. Google also has a keyword selector tool.

    pps. I used to manually, but I get so little traffic from technorati (a few hits a month max) that I don't usually bother. My blog code is handrolled and I haven't bothered to build a ping tool.

    8. lewis litanzios at 3:20pm on March 29, 2008

    safe jesse, thanks for the heads up :)

    9. Cooper at 10:48pm on September 2, 2008

    While on the topic of microformats, I think they are going in the right direction but only apply to certain types of data such as address, geo, and so forth... With this in mind I think class/id naming needs to be standardized as well. For example, <div id="wrapper"> means absolutely nothing. Maybe, if wrapper was changed to <div id="page-content"> this would be more semantic. We need to entice a new movement in regards to a new semantic web.
    @Snook - What are your thoughts about the semantics of naming conventions?

    10. brian at 5:47pm on February 9, 2009

    is the div really out of place on a page? It means there is a logical division in the page and therefore if you have two columns for instance, each column should be inside a div because they are separated from each other.

    11. Lewis Litanzios at 6:33pm on February 9, 2009

    2009 and now semantic naming is VERY important I feel.

    I have a list on my wall now of semantic naming conventions I put together from a number of articles around the web the other day. The best and most comprehensive being this: http://www.stuffandnonsense.co.uk/archives/whats_in_a_name_pt2.html by Andy Clarke.

    CSS signatures, Microformats, and even clients asking for this s**t now, so I'm rolling with it. Plus if you use SNCs it comes off like you're an organised f**ker too, not to mention it's great for human readability when you start pilfering jQuery functions (Jesse will agree with me here no doubt).

    Excuse my swearing, this is practically the last sentence I will write before bed today :|

    12. Aaron at 10:43am on March 11, 2010

    One thing that always bugs me is when I see people using block-level elements wrapped inside of DIV tags.

    Something like this:
    <div class="heading">
    <h2>The heading</h2>
    </div>

    or worse yet:
    <div class="heading">
    <img src="theheadingimg.jpg" />
    </div>

    While we can all nitpick about microformatting and argue about the merits of what should go inside of an address tag, can we at least all agree that inventing new CSS classes to duplicate the function of perfectly usable HTML tags is ridiculous?
