Beginning XML

Well-Formed XML

We've discussed some of the reasons why XML makes sense for communicating data, so now let's get our hands dirty and learn how to create our own XML documents. This chapter will cover all you need to know to create "well-formed" XML.

Well-formed XML is XML that meets certain grammatical rules outlined in the XML 1.0 specification.

You will learn:

 

How to create XML elements using start- and end-tags

How to further describe elements with attributes

How to declare your document as being XML

How to send instructions to applications that are processing the XML document

Which characters aren't allowed in XML, and how to put them in anyway

 

Because XML and HTML appear so similar, and because you're probably already familiar with HTML, we'll be making comparisons between the two languages in this chapter. However, if you don't have any knowledge of HTML, you shouldn't find it too hard to follow along.

 

If you have Internet Explorer 5, you may find it useful to save some of the examples in this chapter on your hard drive, and view the results in the browser. If you don't have IE5, some of the examples will have screenshots to show what the end results look like.

 

Tags and Text and Elements, Oh My!

It's time to stop calling things just "items" and "text"; we need some names for the pieces that make up an XML document. To get cracking, let's break down the simple <name> document we created in Chapter 1:

 

<name>

  <first>John</first>

  <middle>Fitzgerald Johansen</middle>

  <last>Doe</last>

</name>

 

The words between the < and > characters are XML tags. The information in our document (our data) is contained within the various tags that constitute the markup of the document. This makes it easy to distinguish the information in the document from the markup.

 

As you can see, the tags are paired together, so that any opening tag also has a closing tag. In XML parlance, these are called start-tags and end-tags. The end-tags are the same as the start-tags, except that they have a "/" right after the opening < character.

 

In this regard, XML tags work the same as start-tags and end-tags do in HTML. For example, you would create an HTML paragraph like this:

 

<P>This is a paragraph.</P>

 

As you can see, there is a <P> start-tag, and a </P> end-tag, just like we use for XML.

 

All of the information from the start of a start-tag to the end of an end-tag, and including everything in between, is called an element. So:

 

<first> is a start-tag

</first> is an end-tag

<first>John</first> is an element

 

The text between the start-tag and end-tag of an element is called the element content. The content between our tags will often just be data (as opposed to other elements). In this case, the element content is referred to as Parsed Character DATA, which is almost always referred to using its acronym, PCDATA.

 

Whenever you come across a strange-looking term like PCDATA, it's usually a good bet the term is inherited from SGML. Because XML is a subset of SGML, there are a lot of these inherited terms.

 

The whole document, starting at <name> and ending at </name>, is also an element, which happens to include other elements. (And, in this case, the element is called the root element, which we'll be talking about later.)
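To see these terms from a program's point of view, here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree module as the parser (any XML parser would do), and the variable names are ours, chosen for illustration. It loads the <name> document and reports the root element, then each child element's name and PCDATA content.

```python
import xml.etree.ElementTree as ET

# The <name> document from Chapter 1
doc = """<name>
  <first>John</first>
  <middle>Fitzgerald Johansen</middle>
  <last>Doe</last>
</name>"""

# fromstring() parses the text and returns the root element
root = ET.fromstring(doc)
print("root element:", root.tag)

# Iterating over an element visits its child elements in document order;
# .tag is the element name and .text is the PCDATA before any child element
children = [(child.tag, child.text) for child in root]
for tag, text in children:
    print(f"<{tag}> contains the PCDATA {text!r}")
```

Notice that the program never sees the tags themselves, only the elements they delimit.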

 

To put this new-found knowledge into action, let's create an example that contains more information than just a name.

Try It Out – Describing Weirdness

We're going to build an XML document to describe one of the greatest CDs ever produced, Dare to be Stupid, by Weird Al Yankovic. But before we break out Notepad and start typing, we need to know what information we're capturing.

 

In Chapter 1, we learned that XML is hierarchical in nature; information is structured like a tree, with parent/child relationships. This means that we'll have to arrange our CD information in a tree structure as well.

 

Since this is a CD, we'll need to capture information like the artist, title, and date released, as well as the genre of music. We'll also need information about each song on the CD, such as the title and length. And, since Weird Al is famous for his parodies, we'll include information about what song (if any) this one is a parody of.

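Here's the hierarchy we'll be creating:

CD
  artist
  title
  genre
  date-released
  song (one for each song)
    title
    length
      minutes
      seconds
    parody
      title
      artist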

Some of these elements, like <artist>, will appear only once; others, like <song>, will appear multiple times in the document. Also, some will have PCDATA only, while some will include their information as child elements instead. For example, the <artist> element will contain PCDATA for the artist's name, whereas the <song> element won't contain any PCDATA of its own, but will contain child elements that further break down the information.

With this in mind, we're now ready to start entering XML. If you have Internet Explorer 5 installed on your machine, type the following into Notepad, and save it to your hard drive as cd.xml:

<CD>

  <artist>"Weird Al" Yankovic</artist>

  <title>Dare to be Stupid</title>

  <genre>parody</genre>

  <date-released>1985</date-released>

  <song>

    <title>Like A Surgeon</title>

    <length>

      <minutes>3</minutes>

      <seconds>33</seconds>

    </length>

    <parody>

      <title>Like A Virgin</title>

      <artist>Madonna</artist>

    </parody>

  </song>

  <song>

    <title>Dare to be Stupid</title>

    <length>

      <minutes>3</minutes>

      <seconds>25</seconds>

    </length>

    <parody></parody>

  </song>

</CD>

 

For the sake of brevity, we'll only enter two of the songs on the CD, but the idea is there nonetheless.

Now, open the file in IE5. (Navigate to the file in Explorer and double click on it, or open up the browser and type the path in the URL bar.) If you have typed in the tags exactly as shown, the cd.xml file will look something like this:

 

 

How It Works

Here we've created a hierarchy of information about a CD, so we've named the root element accordingly.

 

The <CD> element has children for the artist, title, genre, and date, as well as one child for each song on the disc. The <song> element has children for the title, length, and, since this is Weird Al we're talking about, what song (if any) this is a parody of. Again, for the sake of this example, the <length> element was broken down still further, to have children for minutes and seconds, and the <parody> element broken down to have the title and artist of the parodied song.

 

You may have noticed that the IE5 browser changed <parody></parody> into <parody/>. We'll talk about this shorthand syntax a little bit later, but don't worry: it's perfectly legal.
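If you'd like to convince yourself that the two forms really are equivalent, here is a small sketch in Python, assuming the standard library's xml.etree.ElementTree module as the parser. The two one-line documents are made up for illustration. It parses both spellings and compares the results.

```python
import xml.etree.ElementTree as ET

# The same empty element, written with a start-tag/end-tag pair
# and with the shorthand syntax
long_form = ET.fromstring("<song><parody></parody></song>")
short_form = ET.fromstring("<song><parody/></song>")

for doc in (long_form, short_form):
    parody = doc.find("parody")
    # len() of an element is its number of child elements
    print(parody.tag, "has", len(parody), "children and text", repr(parody.text))

# Serializing both trees again gives identical markup
same = ET.tostring(long_form) == ET.tostring(short_form)
print("identical after parsing:", same)
```

Once parsed, there is nothing left to tell the two spellings apart, which is why a tool like IE5 is free to display whichever one it prefers.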

 

If we were to write a CD Player application, we could make use of this information to create a play-list for our CD. It could read the information under our <song> element to get the name and length of each song to display to the user, display the genre of the CD in the title bar, etc. Basically, it could make use of any information contained in our XML document.
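As a rough idea of what such an application might do, here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree module, and the build_playlist function is hypothetical, invented for this illustration. It works on a trimmed copy of cd.xml.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of cd.xml, kept inline so the sketch is self-contained
CD_XML = """<CD>
  <artist>"Weird Al" Yankovic</artist>
  <title>Dare to be Stupid</title>
  <genre>parody</genre>
  <song>
    <title>Like A Surgeon</title>
    <length><minutes>3</minutes><seconds>33</seconds></length>
    <parody><title>Like A Virgin</title><artist>Madonna</artist></parody>
  </song>
  <song>
    <title>Dare to be Stupid</title>
    <length><minutes>3</minutes><seconds>25</seconds></length>
    <parody></parody>
  </song>
</CD>"""

def build_playlist(xml_text):
    """Return a (title, length) pair for each <song> child of the root."""
    root = ET.fromstring(xml_text)
    playlist = []
    for song in root.findall("song"):
        # "title" matches only the song's own child, not <parody>/<title>
        title = song.findtext("title")
        minutes = int(song.findtext("length/minutes"))
        seconds = int(song.findtext("length/seconds"))
        playlist.append((title, f"{minutes}:{seconds:02d}"))
    return playlist

playlist = build_playlist(CD_XML)
genre = ET.fromstring(CD_XML).findtext("genre")

print("Genre:", genre)
for title, length in playlist:
    print(f"{title} ({length})")
```

Because the hierarchy tells us which <title> belongs to the song and which belongs to the parodied song, the program can pick out the right one without any guesswork.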

Rules for Elements

Obviously, if we could just create elements in any old way we wanted, we wouldn't be any further along than our text file examples from the previous chapter. There must be some rules for elements, which are fundamental to the understanding of XML.

XML documents must adhere to these rules to be well-formed.

We'll list them, briefly, before getting down to details:

 

Every start-tag must have a matching end-tag

Tags can't overlap

XML documents can have only one root element

Element names must obey XML naming conventions

XML is case-sensitive

XML will keep white space in your text

Every Start-tag Must Have an End-tag

One of the problems with parsing SGML documents is that not every element requires a start-tag and an end-tag. Take the following HTML for example:

 

<HTML>

<BODY>

<P>Here is some text in an HTML paragraph.

<BR>

Here is some more text in the same paragraph.

<P>And here is some text in another HTML paragraph.</p>

</BODY>

</HTML>

 

Notice that the first <P> tag has no closing </P> tag. This is allowed – and sometimes even encouraged – in HTML, because most web browsers can detect automatically where the end of the paragraph should be. In this case, when the browser comes across the second <P> tag, it knows to end the first paragraph. Then there's the <BR> tag (line break), which by definition has no closing tag.

 

Also, notice that the second <P> start-tag is matched by a </p> end-tag, in lower case. HTML browsers have to be smart enough to realize that both of these tags delimit the same element, but as we'll see soon, this would cause a problem for an XML parser.

 

The problem is that this makes HTML parsers much harder to write. Code has to be included to take into account all of these factors, which often makes the parsers much larger, and much harder to debug. What's more, the way that files are parsed is not standardized – different browsers do it differently, leading to incompatibilities.

 

For now, just remember that in XML the end-tag is required, and has to exactly match the start-tag.
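To watch a parser enforce this rule, here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module. The is_well_formed helper is our own, written for illustration; it simply reports whether the parser accepted the text.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The first <P> is never closed, as HTML allows
missing_end = "<BODY><P>First paragraph.<P>Second paragraph.</P></BODY>"
# <BR> has no end-tag at all
unclosed_br = "<P>Line one.<BR>Line two.</P>"
# The end-tag differs from the start-tag in case
case_mismatch = "<P>Some text.</p>"
# Every start-tag has an exactly matching end-tag
fixed = "<BODY><P>First paragraph.</P><P>Second paragraph.</P></BODY>"

for label, text in [("missing end-tag", missing_end),
                    ("unclosed <BR>", unclosed_br),
                    ("case mismatch", case_mismatch),
                    ("fixed", fixed)]:
    print(label, "->", is_well_formed(text))
```

Only the last document gets through; the other three are rejected outright rather than being quietly repaired the way a web browser would repair them.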

Tags Cannot Overlap

Because XML is strictly hierarchical, you have to be careful to close your child elements before you close your parents. (This is called properly nesting your tags.) Let's look at another HTML example to demonstrate this:

 

<P>Some <STRONG>formatted <EM>text</STRONG>, but</EM> no grammar no good!</P>

 

This would produce the following output on a web browser:

 

Some formatted text, but no grammar no good!

 

As you can see, the <STRONG> tags cover the text "formatted text", while the <EM> tags cover the text "text, but".

 

But is <EM> a child of <STRONG>, or is <STRONG> a child of <EM>? Or are they both siblings, and children of <P>? According to our stricter XML rules, the answer is none of the above. The HTML code, as written, can't be arranged as a proper hierarchy, and could therefore not be well-formed XML.

 

If ever you're in doubt as to whether your XML tags are overlapping, try to rearrange them visually to be hierarchical. If the tree makes sense, then you're okay. Otherwise, you'll have to rework your markup.

 

For example, we could get the same effect as above by doing the following:

 

<P>Some <STRONG>formatted <EM>text</EM></STRONG><EM>, but</EM> no grammar no good!</P>

 

Which can be properly formatted in a tree, like this:

 

<P>

  Some

  <STRONG>

    formatted

    <EM>

      text

    </EM>

  </STRONG>

  <EM>

    , but

  </EM>

  no grammar no good!

</P>
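The same check can be left to a parser. The sketch below is in Python and assumes the standard library's xml.etree.ElementTree module; the is_well_formed helper is our own, written for illustration. It feeds both versions of the paragraph to the parser and, for the one that is accepted, lists the elements in document order.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# <STRONG> is closed while <EM> is still open: the tags overlap
overlapping = ("<P>Some <STRONG>formatted <EM>text</STRONG>, but</EM> "
               "no grammar no good!</P>")
# Each child is closed before its parent: the tags nest properly
nested = ("<P>Some <STRONG>formatted <EM>text</EM></STRONG><EM>, but</EM> "
          "no grammar no good!</P>")

print("overlapping:", is_well_formed(overlapping))
print("nested:", is_well_formed(nested))

# iter() walks the tree from the root downwards, in document order
tags = [element.tag for element in ET.fromstring(nested).iter()]
print(tags)
```

The list of elements for the nested version matches the tree shown above: one <P>, containing a <STRONG> (which contains an <EM>) followed by a second <EM>.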

An XML Document Can Have Only One Root Element

In our <name> document, the <name> element is called the root element. This is the top-level element in the document, and all the other elements are its children or descendants. An XML document must have one and only one root element: in fact, it must have a root element even if it has no content.

 

For example, the following XML is not well-formed, because it has two root elements:

 

<name>John</name>

<name>Jane</name>

 

To make this well-formed, we'd need to add a top-level element, like this:

 

<names>

  <name>John</name>

  <name>Jane</name>

</names>

 

So while it may seem a bit of an inconvenience, it turns out that it's incredibly easy to follow this rule. If you have a document structure with multiple root-like elements, simply create a higher-level element to contain them.
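Here is how a parser reacts to each case, as a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree module, and the is_well_formed helper is our own, written for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Two top-level elements: the parser gives up after the first one ends
two_roots = "<name>John</name>\n<name>Jane</name>"
# The same data wrapped in a single higher-level element
wrapped = "<names><name>John</name><name>Jane</name></names>"
# No root element at all
empty = ""

print("two roots:", is_well_formed(two_roots))
print("wrapped:", is_well_formed(wrapped))
print("empty document:", is_well_formed(empty))
```

Only the wrapped version is accepted. The empty string fails too, which shows the other half of the rule: a document with no root element isn't well-formed either.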

Element Names

If we're going to be creating elements we're going to have to give them names, and XML is very generous in the names we're allowed to use. For example, there aren't any reserved words to avoid in XML, as there are in most programming languages, so we have a lot of flexibility in this regard.

However, there are some rules that we must follow:

 

Names can start with letters (including non-Latin characters) or the "_" character, but not digits or other punctuation characters.

After the first character, digits are allowed, as are the characters "-" and ".".

Names can't contain spaces.

Names shouldn't contain the ":" character. Strictly speaking, the XML specification allows this character, but reserves it for use with namespaces. You should avoid using it in your documents, unless you are working with namespaces (which are covered in Chapter 8).

Names can't start with the letters "xml", in uppercase, lowercase, or mixed – you can't start a name with "xml", "XML", "XmL", or any other combination.

There can't be a space after the opening "<" character; the name of the element must come immediately after it. However, there can be space before the closing ">" character, if desired.
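Several of these rules can be tried out by handing candidate names to a parser. The sketch below is in Python and assumes the standard library's xml.etree.ElementTree module; the helpers is_well_formed and is_accepted_name, and the sample names, are our own inventions for illustration. Note what it leaves out: this parser doesn't enforce the "xml" prefix rule, and it treats ":" according to the namespace rules, so neither is tested here.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def is_accepted_name(name):
    """Try the name out as the name of an empty element.

    A rough check only: a name containing a space fails because the
    parser reads the second word as a malformed attribute.
    """
    return is_well_formed(f"<{name}/>")

accepted = ["first", "_first", "first-name", "first.name", "name2", "имя"]
rejected = ["2name", "-name", ".name", "first name"]

for name in accepted + rejected:
    print(repr(name), "->", is_accepted_name(name))

# No space is allowed after "<", but space before ">" is fine
print(is_well_formed("< first></first>"))
print(is_well_formed("<first ></first >"))
```

Every name in the first list is accepted and every name in the second is rejected; of the last two documents, only the one with the space before ">" is well-formed.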

Here are some examples of valid names: