Monday, April 11, 2011

Generate SQL Database Schema from XML

Introduction

This is the second in a series of three articles describing a project that I undertook to solve a specific problem in my working environment. The first article explained how I was dissatisfied with SQL's Data Definition Language, and decided to create an alternative based on XML. By the end of the article, I had presented an XSD for the new format.
I knew I would need some way of transforming my XML into SQL, so I did some Googling, and in the end I considered two alternatives: XSLT or a bespoke program. Tools for code generation usually feature a template file with some sort of 'special' markup that demarcates areas of embedded code, e.g. in an ASP.NET aspx file:
<%=SomeProperty%> 
This markup is replaced by the value of the public property SomeProperty of the page class when the page is viewed. XSLT is more extreme, in that it contains fragments of output surrounded by large volumes of XML query language. Consider this XSLT, which is designed to do just what my project does: convert an object model into an SQL schema definition. My complaint with this approach to code generation is that it is essentially write-only code. It is hard to visualize the intended output, because it is obscured by the embedded code. Visual Studio can't recognize the fragments of code as being SQL, so it won't syntax colour them, and if you want to generate code that includes any of XML's special characters, you have to escape them (e.g. < has to be written as &lt;).
So for these reasons, I decided to write a bespoke, command-line program, which performs the following steps:
  • Read command line
  • Deserialize database schema
  • Build code model
  • Generate SQL based on model
I will leave everything else for the final article, and for now just discuss what a code model is, why one is a good idea, and how I decided to build mine. A rough sketch of the four steps follows.
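Purely as an illustration of those steps, the program's skeleton might look something like this. This is not the actual program (which the final article covers); SqlGenerator is a placeholder name for this sketch, not a class from the project:

using System.IO;
using System.Xml.Serialization;
using DatabaseSchemaModel;

static class Program
{
 static void Main(string[] args)
 {
  // 1. Read command line: the first argument is the schema file path
  string schemaPath = args[0];

  // 2. Deserialize database schema. 3. The code model is built inside
  // Database.ReadXml as part of deserialization, as described below.
  XmlSerializer serializer = new XmlSerializer(typeof(Database));
  Database database;
  using (StreamReader reader = new StreamReader(schemaPath))
  {
   database = (Database)serializer.Deserialize(reader);
  }

  // 4. Generate SQL based on the model. SqlGenerator is a placeholder.
  SqlGenerator generator = new SqlGenerator(database);
  File.WriteAllText(database.SQLServerOutputPath, generator.Generate());
 }
}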

Why Have a Code Model?

In the sense that I am using it, 'code model' just means a representation of a data file format written in some programming language. For example, scripts on a web page use the Document Object Model to inspect and modify web pages dynamically. Of course, C# already has a generic XML code model that I could use to read my file format, and I can even get it to validate the XML against my schema. A couple of lines of code and I have my whole file in an XmlDocument ready to use.
XmlDocument mySchema = new XmlDocument();
mySchema.Load("mySchema.xml");
However, there are several problems with this simplistic approach. When writing code to use my data, I have to know the structure of my XML document. This may not always be obvious because of quirks and limitations in XML or XSD. I also have to know the names of elements and attributes in my XML document, and put them somewhere in my code as constant strings. For example:
// Find an nchar column with a length > 10
foreach (XmlNode tableNode in mySchema.DocumentElement["tables"].ChildNodes)
{
 XmlElement tableElement = tableNode as XmlElement;
 if (tableElement != null)
 {
  foreach (XmlNode columnNode in tableElement["columns"].ChildNodes)
  {
   XmlElement columnElement = columnNode as XmlElement;
   if (columnElement != null)
   {
     // Remember that the type of a column is encoded
     // in a child element rather than an attribute
     if (columnElement.FirstChild.Name == "nchar")
    {
     XmlElement ncharElement = 
      columnElement["nchar"];
     XmlAttribute ncharLengthAttribute = 
     ncharElement.Attributes["length"];
     int ncharLength = 1;
     if (ncharLengthAttribute != null)
     {
      ncharLength = int.Parse
      (ncharLengthAttribute.Value);
     }
     if (ncharLength > 10)
     {
       // Success
      }
    }
   }
  }
 }
}
The next problem illustrated by this example is type conversion. The XML code model can't possibly know the type of attributes at compile time, so everything is returned as a string and has to be converted manually. The end result is verbose, obfuscated code littered with assumed knowledge. It will also be hard to make any changes to the file format later, because there is no compile-time checking of these assumptions.
A code model solves these problems, essentially by hiding all the assumed knowledge and keeping it in one place. It can smooth out structural artifacts from the XML and convert all the strings to their proper types. Later, changes to the file format can either be hidden from client code by keeping the code model the same and processing the new format on load, or exposed to the client code as changes that break at compile time and can be confidently fixed.
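To make the contrast concrete, a hand-written, strongly typed model class might look something like this. This is purely an illustration of the idea; the next section describes how the project's real model classes are actually produced:

public class Column
{
 // The XML element and attribute names live in the loading code,
 // in one place, instead of being scattered through client code
 public string Name { get; set; }
 public string Type { get; set; }

 // Strings from the XML are converted to their proper types once, on load
 public int Length { get; set; }
 public bool AllowNulls { get; set; }
}

With classes like this, the nchar search from the example above collapses into a simple typed comparison, as the final section of this article shows.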

Code Model Implementation

Microsoft provides a handy tool called XSD.exe that generates a code model from an XSD file; the generated classes can be deserialized directly from any XML file that conforms to the XSD. The XSD for my file format is embedded as a resource in an assembly (see the project DatabaseSchemaModelResource in the sample code), for reasons which will become clear later on. The post-build event of this project runs XSD.exe and generates the file DatabaseSchemaModel.cs, which is included in another project, DatabaseSchemaModel. DatabaseSchemaModelResource is set as a dependency of DatabaseSchemaModel, so that the generated code file is always up to date before DatabaseSchemaModel is compiled.
Note that XSD.exe may be in different places depending on your setup, so if you have problems building, that may be why. To fix it, you'll have to edit the post-build event of the DatabaseSchemaModelResource project.
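For reference, the command in the post-build event looks something like this. The xsd.exe path, the schema file name (assumed here to match the generated DatabaseSchemaModel.cs), and the output directory are illustrative and will depend on your installation; /classes, /namespace: and /outputdir: are standard XSD.exe options:

"%ProgramFiles%\Microsoft SDKs\Windows\v7.0A\Bin\xsd.exe" "$(ProjectDir)DatabaseSchemaModel.xsd" /classes /namespace:DatabaseSchemaModel.Raw /outputdir:"$(ProjectDir)..\DatabaseSchemaModel"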
The generated code is type-safe where possible and provides a great intermediate set of data to work with; much nicer than poking around directly in the XML DOM. But it was still not good enough for me. For one thing, it still corresponded directly to the structure of the XML, and for another, it didn't handle xs:choice very nicely, just treating it as an object. The clincher was the fact that in the file format, objects refer to each other by name (e.g. a primary key refers to one or more columns). In the code model, it would be much nicer if the objects were linked directly (e.g. if a primary key object had a direct reference to one or more column objects). So I decided to call this generated code my 'raw' code model. I would hide it from the client, but it would still save me a lot of work in parsing and validating, and provide a type-safe foundation.
The classes of the public facing code model are all in the DatabaseSchemaModel namespace in the project of the same name. The root class of the model is called Database. When thinking about how the model would be loaded from the file, it seemed natural for this class to be XML serializable directly to and from the file format. As I wanted to use my raw code model to do the heavy lifting, it was clear I would have to implement IXmlSerializable. Here is the declaration of the Database class:
...
[XmlRootAttribute(Namespace = "http://olduwan.com/DatabaseSchemaModel.xsd", IsNullable = false)]
[XmlSchemaProvider("MySchema")]
public class Database : IXmlSerializable
{
...
The XmlRootAttribute tells the serializer how to write the root node of the file, in this case telling it to include the namespace declaration. The XmlSchemaProvider attribute is needed to tell the serializer which schema to expect when reading the XML. The schema is actually provided to the framework by a public static method, referenced in the attribute. This is where I use the XSD file that is embedded as a resource in the DatabaseSchemaModelResource assembly:
/// <summary>
/// This is the method named by the XmlSchemaProviderAttribute applied to the type.
/// </summary>
public static XmlQualifiedName MySchema(XmlSchemaSet xs)
{
 // This method is called by the framework to get the schema for this type.
 // The schema is embedded in an assembly as a resource.
 XmlSerializer schemaSerializer = new XmlSerializer(typeof(XmlSchema));
 XmlSchema s = (XmlSchema)schemaSerializer.Deserialize
  (new XmlTextReader(new StringReader
  (DatabaseSchemaModelResource.Resources.Schema)), null);
 xs.XmlResolver = new XmlUrlResolver();
 xs.Add(s);

 return new XmlQualifiedName("database", xmlNameSpace);
}
The implementation of ReadXml deserializes the entire raw model (the code generated by XSD.exe) then uses the objects from the raw model to construct the code model. The deserialization looks like this:
public void ReadXml(System.Xml.XmlReader reader)
{
 // This is not very efficient but I don't expect performance to matter.
 // The idea here is to use the code that was generated from the xml schema
 // to actually read the xml, then extract the data into the hopefully more
 // usable data structures in this namespace.
 string outerXml = reader.ReadOuterXml();
 XmlSerializer serializer = new XmlSerializer(typeof(database));
 // Note this is a database object from the 'raw' code model
 database db = (database)serializer.Deserialize(new StringReader(outerXml));

 // Copy database attributes
 Name = db.name;
 if (db.ExampleFolderPath != null)
 {
  ExampleFolderPath = db.ExampleFolderPath;
 }
 if (db.SQLServerOutputPath != null)
 {
  SQLServerOutputPath = db.SQLServerOutputPath;
 }
 ...
Notice that the first line reads the entire file as a string, which is then passed to a new serializer instance. This is definitely not very efficient, but I think it is unlikely to matter in real use cases. A bigger issue with this approach is that the deserialization process discards any information about the source file, such as the line number where an element occurs. If any errors occur later during the construction of the model, it is not possible to describe the location of the problem in the exception. I'll explain shortly how I deal with this problem, but to continue with the ReadXml method, the next step is to walk the raw code model and construct the objects for the code model:
...
    
// Walk raw code model and construct code model
ReadContext context = new ReadContext();

Tables = new IndexedList<string,Table>();
foreach (table rawTable in db.tables)
{
 Tables.Add(rawTable.name, new Table(context, rawTable));
}
Procedures = new IndexedList<string, Procedure>();
foreach (procedure rawProcedure in db.procedures)
{
 Procedures.Add(rawProcedure.name, new Procedure(rawProcedure));
}
...
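The tables and procedures are stored in an IndexedList, which is an ordered container with dictionary-style lookup by name. IndexedList comes with the sample code; purely to illustrate the idea, a minimal version might look like this (the real implementation may differ in details such as the interfaces it implements):

using System.Collections.Generic;

public class IndexedList<TKey, TValue>
{
 // Values in insertion order...
 private readonly List<TValue> ordered = new List<TValue>();
 // ...plus dictionary-style lookup by key
 private readonly Dictionary<TKey, TValue> index = new Dictionary<TKey, TValue>();

 public void Add(TKey key, TValue value)
 {
  index.Add(key, value); // throws on a duplicate name, which is what we want
  ordered.Add(value);
 }

 public TValue this[TKey key]
 {
  get { return index[key]; }
 }

 public IEnumerable<TValue> Values
 {
  get { return ordered; }
 }
}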
Each class in the code model has an internal constructor that takes the corresponding raw model class as a parameter. For example, here is part of the internal constructor for Table:
internal Table(
 ...
 table raw
 )
{
 ...
 Name = raw.name;
 foreach (column rawColumn in raw.columns)
 {
  Columns.Add(rawColumn.name, new Column(..., rawColumn));
 }
 ...
The constructor copies its own description from the raw object, then walks the raw columns and creates code model columns by passing on the raw data. This pattern repeats in all the code model classes. To see another pattern that is shared by all the internal constructors, I will expand the same example:
internal Table(
 ReadContext context,
 table raw
 )
{
 context.ReadingTable = this;
 context.Stack.Push("Reading Table: " + raw.name);
 Name = raw.name;
 foreach (column rawColumn in raw.columns)
 {
  Columns.Add(rawColumn.name, new Column(context, rawColumn));
 }
 ...
 context.ReadingTable = null;
 context.Stack.Pop();
} 
The ReadContext class is passed down the stack through all the nested constructors. By setting the property ReadingTable for the duration of the constructor, all the nested constructors can discover which Table they are part of. Similarly, the ReadContext has a stack for reporting progress, and each nested constructor adds some detail to the stack. If an error is found in the data, the current state of this stack provides a way to report the location of the error to the user, e.g.:
Column does not exist: XXXXXXX
Reading Constraint: Primary
Reading Table: Categories
At the end of the constructor, the ReadingTable property is nulled and the detail on the stack is popped, ready to read the next Table.
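The ReadContext class itself needs surprisingly little to support all of this. Here is a minimal sketch of what it might contain; this is an assumption on my part, and the real class in the sample code may carry more state or use a different delegate type (MultiDictionary is another helper from the sample code, sketched at the end of this section):

using System;
using System.Collections.Generic;

internal class ReadContext
{
 // The Table currently being constructed, so that nested
 // constructors can discover which Table they are part of
 public Table ReadingTable { get; set; }

 // Progress detail, used to report the location of any error in the data
 public Stack<string> Stack = new Stack<string>();

 // Table references to be resolved once the whole model has loaded;
 // this is discussed next
 public MultiDictionary<string, Action<Table>> ResolveTableReferences =
  new MultiDictionary<string, Action<Table>>();
}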
To finish off this section, I'll return to the problem of fixing up internal references. The problem is that a relationship from one Table may refer to another Table that has not been loaded yet. One way of solving this would be to store the name of the referenced table in the class. Once the whole model is loaded, it is easy to look up the table by name as needed, and hide the lookup in a property getter. I prefer not to do this because I don't like the extra data members hanging around, and I don't like the slightly boilerplate code you end up with. Instead, the ReadContext class is responsible for remembering each Table reference that needs to be resolved during deserialization. For example, the Relationship class has a reference to the 'primary key table' that needs to be resolved. The following code is called during its constructor:
internal Relationship(
 ReadContext context,
 relationship raw
 )
{
 ...
 context.ResolveTableReferences.Add(raw.primaryKeyTable.name, 
  delegate(Table table)
 {
  PrimaryKeyTable = table;
  foreach (relationshipColumn rawRelationshipColumn in 
   rawPrimaryKeyTableColumns)
  {
   PrimaryKeyColumns.Add(table.ResolveColumn
    (stackTrace, rawRelationshipColumn.name));
  }
 });
 ...
The ResolveTableReferences property is a MultiDictionary, which means it can have multiple values for a single key. The key is the name of the table to be resolved, and the value being added here is a delegate that will assign a reference to the table when it is called. The advantage of using an anonymous method is that the compiler automatically handles the fact that I've used properties of the Relationship class, by storing a hidden pointer to the Relationship instance (it creates a closure). Later, after all the tables have been deserialized, the following code is called at the end of Database.ReadXml:
...
 foreach (var resolveTableReferenceContainer in context.ResolveTableReferences)
 {
  Table table = Tables[resolveTableReferenceContainer.Key];
  foreach (var resolveTableReference in 
   resolveTableReferenceContainer.Value)
  {
   resolveTableReference(table);
  }
 }
}
Each key is a table name to be resolved, and the corresponding value is a container full of delegates. Once the table has been looked up, each delegate in the container is called and the table is passed as a parameter.
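The MultiDictionary itself is another helper from the sample code. In case it helps, here is a minimal sketch of the idea, assuming the only operations needed here are Add and enumeration (the real class may offer more):

using System.Collections;
using System.Collections.Generic;

public class MultiDictionary<TKey, TValue>
 : IEnumerable<KeyValuePair<TKey, List<TValue>>>
{
 private readonly Dictionary<TKey, List<TValue>> inner =
  new Dictionary<TKey, List<TValue>>();

 // Unlike a plain Dictionary, Add never throws on a repeated key;
 // it accumulates multiple values under the same key instead
 public void Add(TKey key, TValue value)
 {
  List<TValue> values;
  if (!inner.TryGetValue(key, out values))
  {
   values = new List<TValue>();
   inner.Add(key, values);
  }
  values.Add(value);
 }

 public IEnumerator<KeyValuePair<TKey, List<TValue>>> GetEnumerator()
 {
  return inner.GetEnumerator();
 }

 IEnumerator IEnumerable.GetEnumerator()
 {
  return GetEnumerator();
 }
}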
With one exception, all the classes in the code model map to types in the XSD, which ultimately map to elements in a database schema, so I won't go into a lot of detail about them. The exception is the Column class. In the XSD, there is a nested element describing the type-specific attributes of the column (e.g. length). In the code model, I have chosen to 'flatten' these properties into the Column class. This is a debatable decision, because it does mean there are several irrelevant properties in any given Column instance. However, my feeling is that at the point where you are using a column's type-specific properties, you already know what they are (e.g. that an nchar column has a length), and so it is a bearable compromise. Here is the code of the internal constructor for Column:
internal Column(
 ReadContext context,
 column raw
 )
{
 Name = raw.name;
 Type = raw.ItemElementName.ToString();
 if (raw.allowNullsSpecified)
 {
  AllowNulls = raw.allowNulls;
 }
 switch (raw.Item.GetType().FullName)
 {
  case "DatabaseSchemaModel.Raw.bigint":
   bigint rawBigInt = raw.Item as bigint;
   if (rawBigInt.defaultSpecified)
   {
    Default = rawBigInt.@default.ToString();
   }
   else
   {
    Default = rawBigInt.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.int":
   @int rawInt = raw.Item as @int;
   if (rawInt.defaultSpecified)
   {
    Default = rawInt.@default.ToString();
   }
   else
   {
    Default = rawInt.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.smallint":
   smallint rawSmallInt = raw.Item as smallint;
   if (rawSmallInt.defaultSpecified)
   {
    Default = rawSmallInt.@default.ToString();
   }
   else
   {
    Default = rawSmallInt.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.tinyint":
   tinyint rawTinyInt = raw.Item as tinyint;
   if (rawTinyInt.defaultSpecified)
   {
    Default = rawTinyInt.@default.ToString();
   }
   else
   {
    Default = rawTinyInt.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.decimal":
   @decimal rawDecimal = raw.Item as @decimal;
   if (rawDecimal.precision != null)
   {
    Precision = int.Parse(rawDecimal.precision);
   }
   if (rawDecimal.scale != null)
   {
    Scale = int.Parse(rawDecimal.scale);
   }
   if (rawDecimal.defaultSpecified)
   {
    Default = rawDecimal.@default.ToString();
   }
   else
   {
    Default = rawDecimal.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.decimalScale0":
   decimalScale0 rawDecimalScale0 = raw.Item as decimalScale0;
   Scale = 0;
   if (rawDecimalScale0.precision != null)
   {
    Precision = int.Parse(rawDecimalScale0.precision);
   }
   if (rawDecimalScale0.defaultSpecified)
   {
    Default = rawDecimalScale0.@default.ToString();
   }
   else
   {
    Default = rawDecimalScale0.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.float":
   @float rawFloat = raw.Item as @float;
   if (rawFloat.mantissaBits != null)
   {
    MantissaBits = int.Parse(rawFloat.mantissaBits);
   }
   if (rawFloat.defaultSpecified)
   {
    Default = rawFloat.@default.ToString();
   }
   else
   {
    Default = rawFloat.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.real":
   @real rawReal = raw.Item as @real;
   if (rawReal.defaultSpecified)
   {
    Default = rawReal.@default.ToString();
   }
   else
   {
    Default = rawReal.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.variablePrecisionTime":
   variablePrecisionTime rawVariablePrecisionTime = 
    raw.Item as variablePrecisionTime;
   if (rawVariablePrecisionTime.fractionalSecondsPrecision != null)
   {
    FractionalSecondsPrecision = int.Parse
    (rawVariablePrecisionTime.fractionalSecondsPrecision);
   }
   if (rawVariablePrecisionTime.@default != null)
   {
    Default = Quote(rawVariablePrecisionTime.@default);
   }
   else
   {
    Default = rawVariablePrecisionTime.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.char":
   @char rawChar = raw.Item as @char;
   if (rawChar.length != null)
   {
    if (rawChar.length == "max")
    {
     MaxLength = true;
    }
    else
    {
     Length = int.Parse(rawChar.length);
    }
   }
   if (rawChar.@default != null)
   {
    Default = Quote(rawChar.@default);
   }
   else
   {
    Default = rawChar.defaultExpression;
   }
   }
   break;

  case "DatabaseSchemaModel.Raw.nchar":
   nchar rawNChar = raw.Item as nchar;
   if (rawNChar.length != null)
   {
    if (rawNChar.length == "max")
    {
     MaxLength = true;
    }
    else
    {
     Length = int.Parse(rawNChar.length);
    }
   }
   if (rawNChar.@default != null)
   {
    Default = Quote(rawNChar.@default);
   }
   else
   {
    Default = rawNChar.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.bit":
   bit rawBit = raw.Item as bit;
   if (rawBit.defaultSpecified)
   {
    Default = rawBit.@default ? "1" : "0";
   }
   else
   {
    Default = rawBit.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.smallmoney":
   smallmoney rawSmallMoney = raw.Item as smallmoney;
   if (rawSmallMoney.defaultSpecified)
   {
    Default = rawSmallMoney.@default.ToString();
   }
   else
   {
    Default = rawSmallMoney.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.money":
   money rawMoney = raw.Item as money;
   if (rawMoney.defaultSpecified)
   {
    Default = rawMoney.@default.ToString();
   }
   else
   {
    Default = rawMoney.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.parameterlessStringType":
   parameterlessStringType rawParameterlessStringType = 
     raw.Item as parameterlessStringType;
   if (rawParameterlessStringType.@default != null)
   {
    Default = Quote(rawParameterlessStringType.@default);
   }
   else
   {
    Default = 
        rawParameterlessStringType.defaultExpression;
   }
   break;

  case "DatabaseSchemaModel.Raw.uniqueidentifier":
   uniqueidentifier rawUniqueIdentifier = 
     raw.Item as uniqueidentifier;
   if (rawUniqueIdentifier.@default != null)
   {
    Default = Quote(rawUniqueIdentifier.@default);
   }
   else
   {
    Default = rawUniqueIdentifier.defaultExpression;
   }
   break;
 }

 withIdentity rawWithIdentity = raw.Item as withIdentity;
 if (rawWithIdentity != null && rawWithIdentity.identity != null)
 {
  Identity = new Identity(rawWithIdentity.identity);
 }
}
As you can see, the switch is used to identify the actual type of the nested element in the raw code model. Type-specific properties are then copied into the Column, and the irrelevant properties are left uninitialised. There is one more thing worth noting at the end of this method: withIdentity is an interface that corresponds to the withIdentity XSD group. XSD.exe does not generate this interface, or acknowledge the relationship between types that use the same group in an XSD. However, it does generate partial classes, which makes it possible to add the interface by hand. This is done in 'DatabaseSchemaModelExtended.cs':
namespace DatabaseSchemaModel.Raw
{
 /// <summary>
 /// This interface corresponds to the 'withIdentity' group in the XSD
 /// </summary>
 public interface withIdentity
 {
  identity identity
  {
   get;
  }
 }

 public partial class bigint : withIdentity
 { }

 public partial class tinyint : withIdentity
 { }

 public partial class @int : withIdentity
 { }

 public partial class smallint : withIdentity
 { }

 public partial class decimalScale0 : withIdentity
 { }
}

Using the Code Model

In contrast to the raw XML example, loading a database schema with the code model looks like this:
XmlSerializer serializer = new XmlSerializer(typeof(Database));
Database database = (Database)serializer.Deserialize(new StreamReader("mySchema.xml"));
And the (admittedly rather artificial) example of finding an nchar with length > 10 looks like this:
foreach (Table table in database.Tables.Values)
{
 foreach (Column column in table.Columns.Values)
 {
  if (column.Type == "nchar" && column.Length > 10)
  {
    // Success
   }
 }
}
