Thursday, March 3, 2011

XML In .NET

One of the most exciting recent advances in computing has been XML. Designed as a stricter and simpler document format than SGML, XML is now used everywhere to produce cross-platform interoperable file formats.
It's also core to .NET, and is something every .NET developer will need to come to grips with. I've tried to keep this introduction to XML as broad as possible, so it should be of use to users of all developmental persuasions.
XML 101: Learning to Crawl
Before we look into the specifics of XML, it is important to know why XML exists and where it can be used. A proper understanding will allow you to use it effectively in your projects.
Where HTML was designed to display data and specify how that data should look, XML was designed to describe and structure data. In this way, an XML file itself doesn't actually do anything. It doesn't say how to display the data or what to do with data, just as a text file doesn't.
But XML crucially differs from plain text in that it allows you to structure your data in a standard manner. This is important: it means that other systems can interpret your XML, which is not as easily achievable with plain text. This is what is meant by an "interoperable file format": once you produce an XML file, it is open to everyone, because the data and all the information required to understand its structure are included in the file.
Let's take an example. Here's a text file and an XML file that both store the same information:
mymusic.txt
The Bends,Radiohead, Street Spirit
Is This It?,The Strokes, Last Nite

mymusic.xml
<catalog>
  <cd>
    <title>The Bends</title>
    <artist>Radiohead</artist>
    <tracks>
      <track name="Street Spirit"/>
    </tracks>
  </cd>
  <cd>
    <title>Is This It?</title>
    <artist>The Strokes</artist>
    <tracks>
      <track name="Last Nite"/>
    </tracks>
  </cd>
</catalog>

Notice how the subject of our data is defined in the XML file. We can see clearly that there is a catalogue containing CDs, each of which contains some tracks (music aficionados will notice that I have cut down the track listings for space!). You can also see that XML is more verbose than some other file formats. Yet, in many cases, the cost of the larger file is offset by how easily a well-defined XML file can be processed, as parsers (programs that read XML) can rely on a predictable structure.
The way we'd interpret the plain text file would be dependent on how we designed our own format. No information exists to tell others what the actual data means, its order, or how to parse (read) it in other projects. By contrast, the XML file shows clearly what each piece of information represents and where it belongs in the data hierarchy. This "data-describing data" is known as metadata, and is a great strength of XML in that you can create your own specifications and structure your data to be interpreted by any other system.
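To make the contrast concrete, here is a minimal C# sketch that pulls the artist out of each format. It is an illustration only: the class and method names are made up for this example, the field order assumed for the text file is my own guess at its convention, and the XML side uses XmlDocument, a DOM-style class this article does not otherwise cover.

```csharp
using System;
using System.Xml;

class CatalogDemo
{
    // Assumed convention for mymusic.txt: title, artist, track, comma-separated.
    // Nothing in the file says so; the order is baked into this code.
    public static string ArtistFromText(string line)
    {
        string[] fields = line.Split(',');
        return fields[1].Trim();
    }

    // The XML version asks for the element by name, so it does not
    // depend on where <artist> sits among the children of <cd>.
    public static string ArtistFromXml(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode artist = doc.SelectSingleNode("/catalog/cd/artist");
        return artist.InnerText;
    }

    static void Main()
    {
        string line = "The Bends,Radiohead, Street Spirit";
        string xml =
            "<catalog><cd><title>The Bends</title>" +
            "<artist>Radiohead</artist>" +
            "<tracks><track name=\"Street Spirit\"/></tracks></cd></catalog>";

        Console.WriteLine(ArtistFromText(line)); // Radiohead
        Console.WriteLine(ArtistFromXml(xml));   // Radiohead
    }
}
```

If the text file's columns were ever reordered, ArtistFromText would silently return the wrong field, whereas ArtistFromXml would keep working.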
Terminology
To use XML effectively, you need a sound knowledge of its terminology and file structure. Let's return to a cut-down version of our catalogue:
<catalog>
  <cd>
    <title>The Bends</title>
    <artist>Radiohead</artist>
    <tracks>
      <track name="Street Spirit"/>
    </tracks>
  </cd>
</catalog>

XML files are hierarchical, with each tag defining an element. Every element needs both an opening and a closing tag (<catalog> being an opening tag, </catalog> being its closing tag). Elements that enclose no content are the exception: they can be written as a single self-closing tag by adding "/>" to the end of the opening tag, as with the track element above.
The structure of the catalogue is such that it contains CDs, which in turn contain tracks. This is our hierarchy, and will be important later, when we need to parse the document. For example, the track "Street Spirit" corresponds to the CD "The Bends," just as the track "Last Nite" corresponds to the CD "Is This It?" If we didn't use a suitable hierarchy, we wouldn't be able to ascertain this during parsing.
Sometimes, it doesn't make sense for information to appear between opening and closing tags. For example, if we need more than one piece of information to describe an element, we might like to include those multiple pieces of information within a single tag. We therefore define attributes of the element in the form attribute="value".
Once you have produced your own set of elements and structures, these formats can be referred to as dialects. For example, RSS is an XML dialect.
Namespaces
With so many different dialects floating around, conflicts of meaning can easily arise. For example, take the following XML files, both of which describe some data:
<film-types>
 <film-type>Action</film-type>
 <film-type>Adventure</film-type>
</film-types>

<film-types>
 <film-type>black and white</film-type>
 <film-type>colour</film-type>
</film-types>

The first file specifies genres of movies, while the second specifies different types of camera film. But, as the consumers of these files, how can we differentiate between them?
Namespaces provide the answer. An XML namespace allows us to qualify an element in the same way as telephone area codes qualify phone numbers. There might be thousands of telephone numbers of 545-321. When we add an area code and, perhaps, an international code, we make the number unique: +44 020 545-321.
The "area code" for XML namespaces is a URI, which is associated with a prefix for the namespace. We define a namespace using an xmlns declaration, followed by a colon and the prefix, and set it equal to a URI that uniquely identifies the namespace:
xmlns:movie="http://www.sitepoint.com/movies"
By adding this namespace definition as an attribute to a tag, we can use the prefix movie in that tag, and any tags it contains, to fully qualify our elements:
<movie:film-types xmlns:movie="http://www.sitepoint.com/movies">
 <movie:film-type>Action</movie:film-type>
 <movie:film-type>Adventure</movie:film-type>
</movie:film-types>

Similarly, with the second, we can choose a different namespace "camera":
<camera:film-types xmlns:camera="http://www.sitepoint.com/camera">
 <camera:film-type>black and white</camera:film-type>
 <camera:film-type>colour</camera:film-type>
</camera:film-types>

Parsers can now recognise both meanings of "film type" and handle them accordingly.
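As a rough sketch of what "handle them accordingly" might look like in C#, the example below inspects the namespace URI of the root element. The class name, method name, and the labels it returns are invented for illustration; NamespaceURI and LocalName are standard properties of the reader.

```csharp
using System;
using System.IO;
using System.Xml;

class NamespaceDemo
{
    const string MovieNs = "http://www.sitepoint.com/movies";
    const string CameraNs = "http://www.sitepoint.com/camera";

    // Classifies a <film-types> document by the namespace URI of its
    // root element. The prefix itself is irrelevant; only the URI counts.
    public static string Classify(string xml)
    {
        using (XmlTextReader reader = new XmlTextReader(new StringReader(xml)))
        {
            reader.MoveToContent(); // skip ahead to the root element
            if (reader.LocalName != "film-types") return "unknown";
            if (reader.NamespaceURI == MovieNs) return "genre";
            if (reader.NamespaceURI == CameraNs) return "film stock";
            return "unknown";
        }
    }

    static void Main()
    {
        Console.WriteLine(Classify(
            "<movie:film-types xmlns:movie=\"http://www.sitepoint.com/movies\"/>"));   // genre
        Console.WriteLine(Classify(
            "<camera:film-types xmlns:camera=\"http://www.sitepoint.com/camera\"/>")); // film stock
        Console.WriteLine(Classify("<film-types/>"));                                  // unknown
    }
}
```

Because the comparison is made on the URI, a document that binds the same namespace to a different prefix (say, m instead of movie) is classified identically.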
Valid XML
Before an XML file can be called valid, it needs at the very least to be well-formed, that is, to conform to the XML specification, version 1.0. This standardises exactly how your XML file is formed so that other systems can understand it. For example, XML 1.0 requires that all XML files consist of one root element; that is, a single element contains all other elements. In our music library example above, catalog is our root element, as it contains all our other elements.
The full XML specification is published by the W3C, although, as we'll see shortly, .NET gives you the tools to write well-formed XML automatically.
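A quick way to see the single-root rule in action is to ask .NET to load a few strings. This is only a sketch: IsWellFormed is a helper name I've made up, and it uses XmlDocument.LoadXml, which throws an XmlException when the input breaks the rules of the specification.

```csharp
using System;
using System.Xml;

class WellFormedDemo
{
    // Returns true if the string parses as an XML document,
    // false if the parser rejects it.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsWellFormed("<catalog><cd/><cd/></catalog>")); // True
        Console.WriteLine(IsWellFormed("<cd/><cd/>"));                    // False: two root elements
        Console.WriteLine(IsWellFormed("<catalog><cd></catalog>"));       // False: <cd> is never closed
    }
}
```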
XML Schemas
While an XML file might conform to the XML specification, it might not be a valid form of a particular dialect. An XML schema lets you verify that certain elements are present, while making sure that the values presented are of the correct type.
There are a few different specifications of schemas: XSD, DTD, and XDR. Though DTD (Document Type Definition) is the most common schema used today, XSD (XML Schema Definition) is a newer standard that's gaining acceptance, as it provides the finest-grained control for XML validation.
As XSD has many features, this introduction will concentrate on some of the simpler features you'll be able to employ.
The first line of a schema file usually looks something like the following:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
The above defines that this file is a schema and that all the elements we're going to use for validation belong to the XML Schema namespace (to which we assign the prefix xs). You can set up additional namespaces and a number of XSD options in this tag. Check the full specification for more information.
One of the principal validations we may want to check is whether elements are of the correct type (e.g. when we're expecting a number, we don't want to receive a string of text). This check is performed in XSD using the element tag:
<xs:element name="foo" type="xs:integer"/>
This means that any elements named foo must contain an integer.
Hence, the following would validate.
<foo>10</foo>
However, the below code would not.
<foo>This is some text</foo>
You can check for a range of different types; again, see the specification for more detail.
We can also check for a set of valid values using an enumeration. For example, our element foo may only be able to take the values "Apple," "Orange," and "Grape" (the comparison is case-sensitive). Here, we wish to define our own type, as it isn't purely a string we're after. XSD provides the simpleType to let us do this.
<xs:element name="foo">
 <xs:simpleType>
   <xs:restriction base="xs:string">
     <xs:enumeration value="Apple"/>
     <xs:enumeration value="Orange"/>
     <xs:enumeration value="Grape"/>
   </xs:restriction>
 </xs:simpleType>
</xs:element>

Notice the restriction element, which gives us a base type from which to work. As we have a list of text strings, this is set as the string type.
So, we have basic validation on values for our elements, but what about the attributes of those elements? In a similar way, we can define attributes for an element. XSD requires that they sit inside a complexType within the element tag:
<xs:element name="foo">
  <xs:complexType>
    <xs:attribute name="colour" type="xs:string"/>
  </xs:complexType>
</xs:element>

Again, attributes can have their own types, using the simpleType element we used for foo above.
By default, attributes are optional. If we always require an attribute to be placed on an element, we can override this default using the use="required" attribute on the attribute element:
<xs:attribute name="colour" type="xs:string" use="required"/>
A complex element is one that contains other elements, such as the "CD" element in the catalogue example given earlier:
<cd>
  <title>The Bends</title>
  <artist>Radiohead</artist>
  <tracks>
    <track name="Street Spirit"/>
  </tracks>
</cd>

Here, we need to make sure CD elements contain a title, an artist, and a tracks element. We do so using a sequence:
<xs:element name="cd">
 <xs:complexType>
   <xs:sequence>
     <xs:element name="title" type="xs:string"/>
     <xs:element name="artist" type="xs:string"/>
     <xs:element name="tracks" type="xs:string"/>
   </xs:sequence>
 </xs:complexType>
</xs:element>

We can create custom types or define attributes within other elements, to help us create our hierarchical structure.
You should notice that "tracks" is itself a complex type, as it is made up of other elements. Thus, we need to define another complexType, this time inside our tracks element. The track element needs one too, because attributes must be declared inside a complexType. Our schema will then look like this:
<xs:element name="cd">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="artist" type="xs:string"/>
      <xs:element name="tracks">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="track">
              <xs:complexType>
                <xs:attribute name="name" type="xs:string"/>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Of course, this just touches the surface of XSD's abilities, but it should give you a good understanding from which to begin to write your own. Then again, you can always cheat, thanks to .NET and Microsoft's XSD Inference tool, which builds a "best guess" XSD schema file from any given XML file.
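If you do write a schema by hand, it can be worth checking that the schema itself is legal before validating anything against it. The sketch below does this with XmlSchemaSet, a class available from .NET 2.0 onwards that this article does not otherwise use. The class name, method name, and embedded schema string are mine; note that the schema wraps the name attribute in its own complexType, as XSD requires.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

class SchemaCheckDemo
{
    // Single quotes inside the XML keep the C# string literal readable.
    public const string CdSchema = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='cd'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='title' type='xs:string'/>
        <xs:element name='artist' type='xs:string'/>
        <xs:element name='tracks'>
          <xs:complexType>
            <xs:sequence>
              <xs:element name='track'>
                <xs:complexType>
                  <xs:attribute name='name' type='xs:string'/>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Returns true if the schema text loads and compiles without
    // schema errors. Malformed XML would surface as an XmlException instead.
    public static bool Compiles(string xsd)
    {
        try
        {
            XmlSchemaSet schemas = new XmlSchemaSet();
            using (XmlReader reader = XmlReader.Create(new StringReader(xsd)))
            {
                schemas.Add(null, reader); // null: use the schema's own targetNamespace
            }
            schemas.Compile();
            return true;
        }
        catch (XmlSchemaException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(Compiles(CdSchema)); // True
    }
}
```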
Learning to Read and Write
Before we start getting involved with the more exciting aspects of XML, we need to know how to produce and consume XML files in .NET. .NET organises its XML classes under the System.Xml namespace, so it's worth familiarising yourself with the key classes we need to use:
  • XmlTextReader: The XmlTextReader class is one of several ways of reading XML files. It approaches an XML file in a similar way to a DataReader, in that you step through the document node by node, making decisions as you go. It's by far the easiest class to use to parse an XML file quickly.
  • XmlTextWriter: Similarly, the XmlTextWriter class provides a means of writing XML files node by node.
  • XmlValidatingReader: XmlValidatingReader is used to validate an XML file against a schema file.
Reading an XML File In .NET
Let's get acquainted with the XmlTextReader. XmlTextReader is based upon the XmlReader class, but has been specially designed to read byte streams, making it suitable for XML files located on disk, on a network, or in a stream.
As with any class, the first step is to create a new instance. The constructor takes the location of the XML file that it will read, or an object to read it from. Here are three alternative ways to do it in C# (use whichever suits your source):
// file
XmlTextReader reader = new XmlTextReader("foo.xml");

// url
XmlTextReader reader = new XmlTextReader("http://www.sitepoint.com/foo.xml");

// TextReader (here, a StringReader s)
XmlTextReader reader = new XmlTextReader(s);

Once it's loaded, we can only move through the file in a forward direction. This means that you need to structure your parsing routines so that they're order-independent. If you cannot be sure of the order of elements, your code must be able to handle any order.
We move through the file using the Read method:
while (reader.Read())
 {
   // parse our file
 }

This loop will continue until we reach the end of our file, or we formally break the loop. We need to inspect each node, ascertain its type, and take the information we need. The NodeType property exposes the current type of node that's being read, and this is where things get a little complicated!
An XmlReader will see the following element as 3 different nodes:
<foo>text</foo>
The <foo> part of the element is recognised as an XmlNodeType.Element node. The text part is recognised as an XmlNodeType.Text node, and the closing tag </foo> is seen as an XmlNodeType.EndElement node.
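These three node types are enough to write a parser that doesn't care about element order. The following is a minimal sketch rather than a recommended pattern: the class and method names are hypothetical, and it reads from a StringReader so that it runs without any files on disk. It remembers the name of the element it is inside and picks up the text node that follows.

```csharp
using System;
using System.IO;
using System.Xml;

class ForwardOnlyDemo
{
    // Returns "artist - title" for a single <cd> element, whatever
    // order the <title> and <artist> children arrive in.
    public static string Describe(string xml)
    {
        string title = null;
        string artist = null;
        string current = null; // name of the element we are currently inside

        using (XmlTextReader reader = new XmlTextReader(new StringReader(xml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        current = reader.Name;
                        break;
                    case XmlNodeType.Text:
                        if (current == "title") title = reader.Value;
                        else if (current == "artist") artist = reader.Value;
                        break;
                    case XmlNodeType.EndElement:
                        current = null;
                        break;
                }
            }
        }
        return artist + " - " + title;
    }

    static void Main()
    {
        Console.WriteLine(Describe("<cd><title>The Bends</title><artist>Radiohead</artist></cd>"));
        Console.WriteLine(Describe("<cd><artist>Radiohead</artist><title>The Bends</title></cd>"));
        // Both lines print: Radiohead - The Bends
    }
}
```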
The code below shows how we can output the XML tag through the reader object that we created earlier:
while (reader.Read())
{
  switch (reader.NodeType)
  {
    case XmlNodeType.Element:
      Console.Write("<" + reader.Name + ">");
      break;

    case XmlNodeType.Text:
      Console.Write(reader.Value);
      break;

    case XmlNodeType.EndElement:
      Console.Write("</" + reader.Name + ">");
      break;
  }
}

Here's the output of this code:
<foo>text</foo>
So, what about the attributes? Read never stops on an attribute by itself, so you will not see XmlNodeType.Attribute in the loop above; attributes are reached from their element. When we hit an XmlNodeType.Element node, we can iterate through its attributes using XmlTextReader.MoveToNextAttribute:
case XmlNodeType.Element:
  Console.Write("<" + reader.Name);
  while (reader.MoveToNextAttribute())
  {
    // write each attribute as name="value"
    Console.Write(" " + reader.Name + "=\"" + reader.Value + "\"");
  }
  Console.Write(">");
  break;

Now, if we fed in this XML:
<foo first="1" second="2">text</foo>
we would receive this output:
<foo first="1" second="2">text</foo>
Earlier, we had ignored the attributes, and would have output:
<foo>text</foo>
Note that, if an element does not contain any attributes, the loop is never entered. This means that we don't first have to check whether we have attributes. That said, the number of attributes on a node can be found using the AttributeCount property.
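Putting the pieces together, here is a self-contained sketch of an "echo" routine that copes with attributes and with self-closing elements. The class and method names are invented for the example, and it builds a string instead of writing to the console so the result is easy to check. One detail worth knowing: IsEmptyElement has to be read before stepping onto the attributes, because it reports false once the reader is positioned on an attribute.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

class AttributeEchoDemo
{
    public static string Echo(string xml)
    {
        StringBuilder output = new StringBuilder();
        using (XmlTextReader reader = new XmlTextReader(new StringReader(xml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Record this first: it is only meaningful while
                        // the reader is still positioned on the element.
                        bool isEmpty = reader.IsEmptyElement;
                        output.Append("<" + reader.Name);
                        while (reader.MoveToNextAttribute())
                        {
                            output.Append(" " + reader.Name + "=\"" + reader.Value + "\"");
                        }
                        output.Append(isEmpty ? "/>" : ">");
                        break;
                    case XmlNodeType.Text:
                        output.Append(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        output.Append("</" + reader.Name + ">");
                        break;
                }
            }
        }
        return output.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Echo("<foo first=\"1\" second=\"2\">text</foo>"));
        Console.WriteLine(Echo("<tracks><track name=\"Last Nite\"/></tracks>"));
    }
}
```

This sketch does not re-escape special characters such as & or quotes in the values it copies, so it is a teaching aid and not a general-purpose serialiser.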

Validating XML in .NET
Remember those schema files? Using the XmlValidatingReader class, we can easily validate an XML file against a schema file. However, there are a few other classes we have to use first.
We can point the XmlValidatingReader object towards the schema files we wish to use, by filling an XmlSchemaCollection with our schema files:
XmlSchemaCollection xsdCollection = new XmlSchemaCollection();
// The first argument is the namespace URI of the schema; passing null
// uses the targetNamespace declared in the schema file itself.
xsdCollection.Add(null, "schema.xsd");

In this example, we're using the schema contained in the file schema.xsd.
The actual XmlValidatingReader class takes an XmlReader in its constructor. Consequently, we need to create and fill an XmlTextReader with the XML file we wish to validate:
XmlTextReader reader;
XmlValidatingReader validatingReader;

reader = new XmlTextReader("foo.xml");
validatingReader = new XmlValidatingReader(reader);

Then, we add our xsdCollection to the schemas we wish the validating reader to use:
validatingReader.Schemas.Add(xsdCollection);
If an error is found by the validator, an event is fired. We need to catch this event by creating an event handler, and code our response to the errors:
validatingReader.ValidationEventHandler += new ValidationEventHandler(validationCallBack);  
       
public void validationCallBack (object sender, ValidationEventArgs args)  
{  
 Response.Write("Error:" + args.Message);  
}

Finally, we can validate our file using the Read method on the validating reader. We could actually read and process the nodes in the file as we did before, but, for the purposes of this example, we'll just use an empty while loop to step through the file, validating it as we go:
while (validatingReader.Read()){}
As errors are found during file processing, the event we catch and handle through the validationCallBack method will fire.
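One caveat: XmlValidatingReader and XmlSchemaCollection have been marked obsolete since .NET Framework 2.0, where the same job is done through XmlReaderSettings and XmlReader.Create. The sketch below shows that route under a few assumptions of my own: the class and method names are hypothetical, the schema is the one-line foo schema from earlier held in a string, and errors are collected in a list in place of Response.Write.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

class ValidationDemo
{
    // The integer-typed foo element from the schema section.
    public const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='foo' type='xs:integer'/>
</xs:schema>";

    // Validates the document against Xsd and returns any error messages.
    public static List<string> Validate(string xml)
    {
        List<string> errors = new List<string>();

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas.Add(null, XmlReader.Create(new StringReader(Xsd)));
        settings.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs args) { errors.Add(args.Message); };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            // Stepping through the document is what triggers validation.
            while (reader.Read()) { }
        }
        return errors;
    }

    static void Main()
    {
        Console.WriteLine(Validate("<foo>10</foo>").Count == 0);                // True
        Console.WriteLine(Validate("<foo>This is some text</foo>").Count == 0); // False
    }
}
```

The overall shape is the same as before: register the schema, attach a handler, then read to the end of the file.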
Writing XML in .NET
Now that we have a good understanding of how to read and validate an XML file, we can move on to writing XML.
Writing XML is made very painless by the XmlTextWriter class. Again, taking a forward-only approach, we can build our XML files from different node types, which are output in order.
To begin, we need to create an instance of the class:
XmlTextWriter writer = new XmlTextWriter("newfoo.xml", null);
The second parameter, which, here, is set to null, allows us to specify the encoding format to use on our XML file. Setting it to null produces a standard UTF-8 encoded XML file without an encoding attribute in the XML declaration.
We can now start writing elements and attributes using the methods exposed by the XmlTextWriter. Let us write our earlier CD catalogue example:
<catalog>
  <cd>
    <title>The Bends</title>
    <artist>Radiohead</artist>
    <tracks>
      <track name="Street Spirit"/>
    </tracks>
  </cd>
</catalog>

Our first line is the element catalog. We use the WriteStartElement method to write this to the writer:
writer.WriteStartElement("catalog");
Notice that we don't want to close this element until we've written our CD elements. Thus, the next node we wish to add is an opening CD element:
writer.WriteStartElement("cd");
We now have a title element (<title>The Bends</title>). This time, rather than writing a start element, a value, and an end element separately, we can write all three in one call. As we have no attributes on the title element, we can use the WriteElementString method, and do the same for artist before opening tracks:
writer.WriteElementString("title", "The Bends");  
writer.WriteElementString("artist", "Radiohead");  
 
writer.WriteStartElement("tracks");

Now we've reached an element, track, with an attribute, name. We first need to write the start of the element, then write the attributes, and finally, close our element:
writer.WriteStartElement("track");
writer.WriteAttributeString("name", "Street Spirit");
writer.WriteEndElement();

Notice how we don't need to say which element we wish to close, because the writer will automatically write the closing tag for the last-opened element, which, in this case, is track.
We can now close our remaining elements (tracks, cd, and catalog) in the same fashion:
writer.WriteEndElement(); // tracks
writer.WriteEndElement(); // cd
writer.WriteEndElement(); // catalog

Finally, we need to "flush" the writer, or, in other words, to output the information we've requested to our XML file, then close the writer to free our file and resources:
writer.Flush();
writer.Close();
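For reference, here are the steps above gathered into one runnable sketch. To keep it self-contained I've made two changes that are assumptions of mine and not part of the walkthrough: the writer targets a StringWriter instead of newfoo.xml, so the result can be printed, and the output is loaded back into an XmlDocument as a quick sanity check. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

class WriterDemo
{
    public static string BuildCatalog()
    {
        StringWriter buffer = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(buffer);

        writer.WriteStartElement("catalog");
        writer.WriteStartElement("cd");
        writer.WriteElementString("title", "The Bends");
        writer.WriteElementString("artist", "Radiohead");
        writer.WriteStartElement("tracks");
        writer.WriteStartElement("track");
        writer.WriteAttributeString("name", "Street Spirit");
        writer.WriteEndElement(); // track
        writer.WriteEndElement(); // tracks
        writer.WriteEndElement(); // cd
        writer.WriteEndElement(); // catalog

        writer.Flush();
        writer.Close();
        return buffer.ToString();
    }

    static void Main()
    {
        string xml = BuildCatalog();
        Console.WriteLine(xml);

        // Round-trip check: the output should load as a well-formed document.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        Console.WriteLine(doc.SelectSingleNode("/catalog/cd/tracks/track/@name").Value); // Street Spirit
    }
}
```

By default the writer emits everything on one line; setting writer.Formatting = Formatting.Indented before writing produces indented output like the listing above.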

Summary
So completes this introduction to XML and its use within .NET. Hopefully, the world of XML no longer seems as intimidating as you might have originally thought, and you're now ready to harness its power for your next project.
