THOSE WHO WILL NOT FOLLOW ARE DOOMED TO LEAD

Introduction to XML

For centuries before computers were invented, book publishers would "mark up" the margins of paper manuscripts with handwritten printing instructions and proofreading symbols. With the advent of computers, machine-readable markup languages evolved and have since been widely used by the publishing industry to streamline many of these processes.

With the dawn of the internet a new use for markup languages was found, namely, encoding the text of web pages. Just as a publisher needs additional information in order to correctly typeset printed text, computer software needs additional information in order to correctly structure information from the internet.

XML, or Extensible Markup Language, is a simple metalanguage for markup languages. Just what is a metalanguage? I'm glad you asked. A metalanguage is a language used to make statements about other languages. In other words, you can use XML and one of its schema languages to define your very own markup language!

Popular applications of XML include XHTML, RSS, MathML, GraphML, and SVG, among many others. After you complete this tutorial you should be able to create your own application of XML, in other words, your very own markup language! It may not be as useful or as popular as XHTML, RSS, or SVG but hey, you've got to start somewhere.

Before we go about creating our own XML application, though, we need to learn some basic concepts and XML's relatively simple syntax. Just what does an XML document consist of? The list is remarkably short:

Declarations
Elements & Attributes
Comments
Entity References
Processing Instructions
Character Data (essentially, everything else)

Elements are the primary method used to define the structure of a document (data) and attributes are the primary method used to provide additional information (metadata) about those elements. If you've ever seen an XML or even HTML document you already have an idea of just how easy these elements and attributes are to spot. Let's take a look at an example:

<?xml version="1.0" encoding="UTF-8"?>
<book id="55" genre='action'>
  <!-- This is a comment. -->
  <chapter>
    <title>Introduction</title>
    <picture reference="sexytime"/>
    <text>...</text>
  </chapter>
  <chapter>
    <title>A Quiet Morning</title>
    <text>...</text>
  </chapter>
  <chapter>
    <title>Into the Fray</title>
    <text>...</text>
  </chapter>
</book>


book, chapter, title, picture, and text are all XML elements, while id, genre, and reference are all XML attributes. book is referred to as the "root" element because it's the first element to appear in the document and all other elements are part of its content. The word "root" is used here because the root element is like the base of a tree: all of the other elements "branch" off of it or, to put it another way, are "nested" inside of it. In fact, every XML document can be thought of as an upside-down tree of sorts, at least as far as the analogy goes.

The three chapter elements are called "children" of the root because, like a father or mother in a family tree, the root can be thought of as their parent (ignoring the fact that people have two parents). Similarly, elements with the same parent (like the three chapter elements above) are referred to as siblings and, if you're really family tree happy, you can even refer to grandparents or ancestors. Though the term isn't commonly used outside of computer science, any element without a child is called a "leaf" (back to the tree analogy again), and any element with lots of children is called a "pimp" (okay, that last one isn't in any spec). The root element, for example, is often but not always a big pimp.

As you can see, this allows us to define a kind of tree-like hierarchy for our documents which, though you may not realize it if you've never studied computer science, is very convenient for software to process. In any case, in the above example it should be obvious that we're dealing with some sort of book and that it consists of three chapters. This illustrates why some people refer to XML as "self-documenting," although it should be pointed out that it's trivial to come up with counter-examples. For instance, if you didn't already know what a book is, then you wouldn't necessarily find it intuitive for it to consist of chapters.
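To make the family tree a little more concrete, here is a minimal sketch that loads the book example with Python's standard xml.etree.ElementTree module and pokes at the root, its children, and its leaves. The helper name leaves is made up for illustration, and Python is just one convenient choice; any XML library exposes the same tree.

```python
import xml.etree.ElementTree as ET

# The book example from above, copied as-is.
BOOK_XML = """<?xml version="1.0" encoding="UTF-8"?>
<book id="55" genre='action'>
  <!-- This is a comment. -->
  <chapter>
    <title>Introduction</title>
    <picture reference="sexytime"/>
    <text>...</text>
  </chapter>
  <chapter>
    <title>A Quiet Morning</title>
    <text>...</text>
  </chapter>
  <chapter>
    <title>Into the Fray</title>
    <text>...</text>
  </chapter>
</book>"""

root = ET.fromstring(BOOK_XML)


def leaves(element):
    """Hypothetical helper: tag names of every descendant with no child elements."""
    return [e.tag for e in element.iter() if len(e) == 0]


# Iterating over an element yields its direct children.
children = [child.tag for child in root]

# The three chapters are siblings; grab the title of each one.
chapter_titles = [chapter.find("title").text for chapter in root]

leaf_tags = leaves(root)

print(root.tag)        # the root element
print(children)        # its children
print(chapter_titles)  # one title per sibling chapter
print(leaf_tags)       # elements with no children of their own
```

Note that the comment does not show up among the children: by default this parser throws comments away while it builds the tree.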

Anyway, elements are very easy to find in an XML document because their names are always surrounded with '<' and '>', just like they are in HTML and (sometimes) in XML's predecessor, SGML. An element is denoted with "tags" - a term you may be familiar with if you've written any HTML. Specifically, there are two kinds of elements: elements with content and "empty" elements, or elements with no content. Elements with content must have a start tag (like <chapter>) and an end tag (like </chapter>). Apart from any attributes, which only ever appear in the start tag, the only difference between the start tag and the end tag is the '/' after the '<' in the end tag. Everything between the start tag and the end tag is part of the element's content.

Empty elements, on the other hand, need no end tag because they have no content to enclose. In other words, they have no children and nothing is nested inside of them :(. Instead, the '/' that you would normally see after the '<' in the end tag is placed directly before the '>' in the start tag (like <picture reference="sexytime"/> above). Writing a start tag immediately followed by its end tag, as in <picture reference="sexytime"></picture>, means exactly the same thing; it's just more typing.
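If you'd like to convince yourself that the short form really is just an element with nothing in it, here is a small sketch using Python's standard xml.etree.ElementTree module. It parses the self-closing picture tag next to a start tag followed directly by an end tag, which XML treats as equivalent. The helper name is_empty is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# The empty-element form used in the book example...
short_form = ET.fromstring('<picture reference="sexytime"/>')
# ...and the equivalent start tag + end tag with nothing in between.
long_form = ET.fromstring('<picture reference="sexytime"></picture>')


def is_empty(element):
    """Hypothetical helper: no child elements and no text content."""
    return len(element) == 0 and not element.text


print(is_empty(short_form))
print(is_empty(long_form))

# Both spellings serialize to the same bytes once parsed.
same_output = ET.tostring(short_form) == ET.tostring(long_form)
print(same_output)
```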

Attributes are equally simple to identify and can be associated with elements with content or empty elements. They are always placed inside the start tag of an element, after the '<' and element name, and are separated from the element's name and each other by whitespace. They take the form of the attribute name followed by '=' followed by the attribute's value enclosed in single or double quotes (like <book id="55" genre='action'>). In the above example "id" is an attribute of the root element "book" and its value is "55". Similarly, "genre" is also an attribute of the root element "book" and its value is "action". Finally, "reference" is an attribute of the empty element "picture" and its value is "sexytime" (lol). Attributes, unlike elements, are not used to define structure. Instead they're intended to provide information about the elements - sometimes called "metadata" (there's that "meta" prefix again, see what I did there?). People argue all the time about what should be considered data and what should be considered metadata, so don't hurt yourself attempting to figure it out. If you're trying to decide whether something should be an element or an attribute later on, just do your best to pick the one that makes the most sense.
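As a quick illustration of how software sees attributes, the sketch below reads them off the book start tag with Python's standard xml.etree.ElementTree module. The trimmed-down document and the "isbn" lookup are assumptions made up for the example. Notice that the kind of quote used in the source makes no difference, and that every value comes back as a string.

```python
import xml.etree.ElementTree as ET

# A cut-down version of the book example; only the start tag matters here.
book = ET.fromstring("""<book id="55" genre='action'><chapter/></book>""")

# All attributes of an element are available as a plain dictionary.
attributes = dict(book.attrib)

# Individual lookups; a missing attribute gives None instead of an error.
book_id = book.get("id")
genre = book.get("genre")
missing = book.get("isbn")  # "isbn" is a made-up name that is not in the document

print(attributes)
print(book_id, genre, missing)
```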

Comments, like elements and attributes, are easy to spot. They always begin with "<!--" (without the quotes) and end with "-->" (again without the quotes), just like they do in HTML. In the above example there is one comment, specifically "<!-- This is a comment. -->" (again, without the quotes dammit). Comments are intended to provide, well, comments for use by the author(s) of the document or anyone viewing its source. Comments are a great way to document your... documents, lol.
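Because comments are meant for people rather than programs, many parsers simply skip them. The sketch below shows this with Python's standard xml.etree.ElementTree module: the default parser drops the comment, while a parser built with insert_comments=True (available in Python 3.8 and later) keeps it in the tree. The one-line document is an assumption made up for the example.

```python
import xml.etree.ElementTree as ET

SNIPPET = "<book><!-- This is a comment. --><chapter/></book>"

# Default behaviour: the comment is discarded while parsing.
plain_root = ET.fromstring(SNIPPET)
plain_tags = [child.tag for child in plain_root]

# Opting in to comments (Python 3.8+): they become special nodes in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept_root = ET.fromstring(SNIPPET, parser=parser)
comments = [node.text for node in kept_root if node.tag is ET.Comment]

print(plain_tags)
print(comments)
```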

As you've already seen, there are certain special characters used in XML, for example, the '<' for the beginning of element tags and (as you'll soon see) the '&' for entity references. There are five easy-to-remember "entity references" that can (and in some cases, should) be used in place of these characters in your documents. They are as follows:

&lt; < (less than)
&gt; > (greater than)
&amp; & (ampersand or and)
&apos; ' (apostrophe)
&quot; " (quotation mark)

You should always, and I mean always, use &lt; in place of '<' and &amp; in place of '&' in your text and attribute values. Why? Because I said so, that's why. No, but seriously, these are the characters that denote the beginning of tags and entity references, respectively, so a parser that hits a stray one will assume markup is starting and reject the document as not well-formed. Don't make the internet sad. Your interwebs will become tangled, the tubes will get clogged, and Internet Explorer will have a meltdown, though not necessarily in that order.
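Here is a small sketch of what goes wrong, and how to avoid it, using Python's standard library. The sample text and the helper name is_well_formed are assumptions made up for the example. The escape function from xml.sax.saxutils swaps in the entity references for '&', '<', and '>', and the parser turns them back into the original characters when it reads the document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "Fish & Chips < $5"


def is_well_formed(document):
    """Hypothetical helper: True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Dropping the raw text straight into an element breaks the document.
broken = f"<title>{raw_text}</title>"

# Escaping first replaces '&' with &amp; and '<' with &lt;.
escaped = escape(raw_text)
fixed = f"<title>{escaped}</title>"

print(is_well_formed(broken))
print(escaped)
print(ET.fromstring(fixed).text)  # the parser restores the original characters
```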



POSTED BY Robert Otlowski

president@alum.rpi.edu | Robert Otlowski | rpipresident@gmail.com