How to extract data from Microsoft Excel XML Files

Today I want to show how to extract data from Microsoft Excel Files. I needed it myself the time I looked for a solution to create some testdata for my application. Instead of putting the testdata into a database – which would be very hard to handle – I decided to put them into an Excel file.

As you can see the Excel file contains multiple worksheets for Users, Projects, Tasks and more. And every worksheet has a single header row.

excelread.jpg

The key solution to extract data from it is XML. Therefore the Excel file has to be stored in XML. You can also see this in the screenshot, when you look at the header of Microsoft Excel.

Knowing this presumtions, I wrote a method to extract a single worksheet from the excel file. The result is a list of rows that contains the data as a string array.

private List<string[]> ExtractWorksheet(string excelXmlFile, string worksheet, out string[] header)
{
    const string SpreadSheetSchema = "urn:schemas-microsoft-com:office:spreadsheet";
    
    XmlNamespaceManager nsMgr = new XmlNamespaceManager(new NameTable());
    nsMgr.AddNamespace("ss", SpreadSheetSchema);
    
    /*
     * Load XPath Document
     */
    XPathDocument testData = new XPathDocument(excelXmlFile);
    XPathNavigator navigator = testData.CreateNavigator();
    
    /*
     * Select amount of columns
     */
    XPathExpression expression = navigator.Compile(string.Format("/ss:Workbook/ss:Worksheet[@ss:Name='{0}']/*", worksheet));
    expression.SetContext(nsMgr);
    XPathNodeIterator expIterator = navigator.Select(expression);
    int columns = 0;
    while (expIterator.MoveNext())
    if (expIterator.Current.Name == "Table")
    {
        string attrib = expIterator.Current.GetAttribute("ExpandedColumnCount", SpreadSheetSchema);
        columns = int.Parse(attrib);
        break;
    }
    
    /*
     * Select rows
     */
    expression = navigator.Compile(string.Format("/ss:Workbook/ss:Worksheet[@ss:Name='{0}']/ss:Table/*", worksheet));
    expression.SetContext(nsMgr);
    expIterator = navigator.Select(expression);
    List<string[]> rows = new List<string[]>();
    header = null;
    while (expIterator.MoveNext())
    {
        if (expIterator.Current.HasChildren == false)
        continue;
        
        string[] row = new string[columns];
        int currentColumn = 0;
        
        /*
         * Select columns
         */
        XPathNodeIterator colIterator = expIterator.Current.SelectChildren(XPathNodeType.Element);
        while (colIterator.MoveNext())
        row[currentColumn++] = colIterator.Current.Value;
        
        if (header == null)
        header = row;
        else
        rows.Add(row);
    }
    
    return rows;
}

As you can imagine, it’s know easy to read the testdata from the Excel file.

/*
* Create Users
*/
string[] userColumns;
List<string[]> usersSheet = ExtractWorksheet("TestData.xml", "Users", out userColumns);
foreach (string[] userRow in usersSheet)
{
    User user = Factory.UserService.CreateUser(connection, userRow[0]);
    user.UserName = userRow[1];
    user.Rights = (UserRights) Enum.Parse(typeof (UserRights), userRow[2]);
    
    Factory.UserService.SaveUser(connection, user);
}

Hope you enjoyed my explations.
Cheers

- Gerhard

kick it on DotNetKicks.com

About these ads
Posted in .NET, C#. 3 Comments »

3 Responses to “How to extract data from Microsoft Excel XML Files”

  1. Jonathan Says:

    Hey this is a great example and a very powerful tool thanks.

    However, the Xml/excel file I was given did not have a ExpandedColumnCount attritbute. As a result the columns count is a 0, and then very bad things happen. What do I do if ExpandedColumnCount is not in the xml file?

  2. Gerhard Says:

    Hi Jonathan,

    sry. that I respond that late, but I checked some excel files and exported it to XML. But in all that files, the ExpandedColumnCount attribute did exist.

    Could it be, that you’re using another version of Excel that I do? I used Excel 2007 in order to do the XML export.

    • Jonathan Says:

      alas I didn’t generate the xml file, it was handed to me.

      however I have a nice solution. compare my version of extracworksheet vs yours:

      /*********************************************/
      private List ExtractWorksheet(string excelXmlFile, string worksheet, out string[] header)
      {
      const string SpreadSheetSchema = “urn:schemas-microsoft-com:office:spreadsheet”;

      XmlNamespaceManager nsMgr = new XmlNamespaceManager(new NameTable());
      nsMgr.AddNamespace(“ss”, SpreadSheetSchema);

      /*
      * Load XPath Document
      */
      XPathDocument testData = new XPathDocument(excelXmlFile);
      XPathNavigator navigator = testData.CreateNavigator();

      /*
      * Select amount of columns
      */
      XPathExpression expression = navigator.Compile(string.Format(
      “/ss:Workbook/ss:Worksheet[@ss:Name='{0}’]/*”, worksheet));
      expression.SetContext(nsMgr);
      XPathNodeIterator expIterator = navigator.Select(expression);
      //int columns = 0;
      while (expIterator.MoveNext())
      if (expIterator.Current.Name == “Table”) {
      string attrib = expIterator.Current.GetAttribute(“ExpandedColumnCount”, SpreadSheetSchema);

      //had to add this attribute to the data. not good!!!!
      //columns = int.Parse(attrib);
      break;
      }//then

      /*
      * Select rows
      */
      expression = navigator.Compile(string.Format(
      “/ss:Workbook/ss:Worksheet[@ss:Name='{0}’]/ss:Table/*”, worksheet));
      expression.SetContext(nsMgr);
      expIterator = navigator.Select(expression);
      List rows = new List();
      header = null;
      while (expIterator.MoveNext()) {
      if (expIterator.Current.HasChildren == false)
      continue;

      string[] row;

      List row_list_form =new List();

      //int currentColumn = 0;

      /*
      * Select columns
      */
      XPathNodeIterator colIterator = expIterator.Current.SelectChildren(XPathNodeType.Element);
      while (colIterator.MoveNext()) {
      //row[currentColumn++] = colIterator.Current.Value;
      row_list_form.Add(colIterator.Current.Value);
      }//while
      row = row_list_form.ToArray();
      if (header == null)
      header = row;
      else
      rows.Add(row);
      }//while

      return rows;
      }//extractworksheet

      /*****************************************/
      /*****
      Note the use of the List variable “row_list_form” it allows the data to be loaded dynamically so I don’t have to pre-declare the size. Afther is all done, I convert it back to a normal array to work with the rest of the program.
      ******/
      Now as a follow-up, I now have an issue where I don’t know the name of the worksheet. How can I get a list of all the worksheets of an xml document???


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 106 other followers

%d bloggers like this: