How to convert Word (.docx) to XML in C#

Written by M.L on Tuesday, 15 November 2011

I am developing a Windows Forms application in C# and need to convert a Word (.docx) file to XML and save it to a SQL Server database.
I take the Word file from a file upload and need to convert it to XML at run time. I don't need to save the XML file.

Can you please share your ideas and/or code on this?



ANSWER:-
In the past I built a similar utility. It converted each Word document to a Word XML file, kept those files temporarily on a standalone machine while loading them into a SQL database, and deleted them programmatically once final processing was done.

If you do not want to save the Word XML files at all, then conceptually you should use Office automation to open and manipulate the Word document in memory and process it as needed. Unfortunately, I never managed to implement that (because of time constraints and other limitations).
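One way to avoid both the temporary file and Office automation is to use the fact that a .docx file is a ZIP package whose main content is already XML. The sketch below is not the approach used in this answer; it is a swapped-in technique that reads the main part straight from a stream. It assumes .NET 4.5 or later (a reference to System.IO.Compression), and that the main part sits at the conventional path word/document.xml, as Word writes it. The class and method names are made up for illustration. Note that the XML obtained this way is Open XML markup (namespace http://schemas.openxmlformats.org/wordprocessingml/2006/main), not the Word 2003 XML that SaveAs with wdFormatXML produces.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class DocxReader
{
    // Returns the main document part of a .docx package as a string.
    // Everything is read from the supplied stream; nothing is written to disk.
    public static string ReadDocumentXml(Stream docxStream)
    {
        // leaveOpen: true, so the caller stays responsible for the stream
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, true))
        {
            // Assumption: the main part is at the conventional location
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

In a Windows Forms application this could be called with the stream of the selected file, for example File.OpenRead(path), and the returned string loaded into an XmlDocument with LoadXml.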

Please note: the code here uses the Interop.Word object, which requires MS Office to be installed on the machine. If you plan to run this utility on a server, be aware that installing Office on a server is NOT RECOMMENDED, so before going further with this solution, check with your TechArch or deployment team about server installation details.

For reference, here is my sample code, which converts Word documents into Word XML. The xmlDocObject object holds the Word XML, which you can then write to your database (I don't think saving XML into a database will be difficult for you :)). Also make sure you release the Interop.Word object once final processing is complete.

Hope this will help you!!!

Code for converting Word Doc into Word XML

using System.IO;
using System.Runtime.InteropServices;
using System.Xml;
using Microsoft.Office.Interop.Word;

private static void ConvertWordtoXML()
{
    // Create the Word automation object (requires MS Office on this machine)
    Microsoft.Office.Interop.Word.Application myWordAppObject = new Microsoft.Office.Interop.Word.Application();
    object oMissing = System.Reflection.Missing.Value;

    // Folder that holds the Word documents to transform into XML
    DirectoryInfo dirInfo = new DirectoryInfo("C:\\MyWordDoc\\");
    FileInfo[] wordFiles = dirInfo.GetFiles("*.docx");
    myWordAppObject.Visible = false;
    myWordAppObject.ScreenUpdating = false;

    XmlDocument xmlDocObject = new XmlDocument();

    try
    {
        // Iterate through each Word file and transform it into Word XML
        foreach (FileInfo wordFile in wordFiles)
        {
            object filename = wordFile.FullName;

            // Open the Word document
            Document doc = myWordAppObject.Documents.Open(ref filename, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing,
                ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing);
            doc.Activate();

            // Same path, with the extension changed from .docx to .xml
            object outputFileName = System.IO.Path.ChangeExtension(wordFile.FullName, ".xml");
            object fileFormat = WdSaveFormat.wdFormatXML;

            doc.SaveAs(ref outputFileName, ref fileFormat, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing,
                ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing);

            object saveChanges = WdSaveOptions.wdDoNotSaveChanges;
            ((_Document)doc).Close(ref saveChanges, ref oMissing, ref oMissing);
            doc = null;

            // Load the saved Word XML; xmlDocObject now holds the document
            xmlDocObject.Load(outputFileName.ToString());

            // The root element of Word 2003 XML is w:wordDocument
            XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDocObject.NameTable);
            nsmgr.AddNamespace("w", "http://schemas.microsoft.com/office/word/2003/wordml");
            XmlNodeList nodes = xmlDocObject.SelectNodes("//w:wordDocument/descendant::w:t|//w:wordDocument/descendant::w:p|//w:wordDocument/descendant::w:tab", nsmgr);
        }
    }
    finally
    {
        // Always shut Word down and release the COM object, even on error
        ((_Application)myWordAppObject).Quit(ref oMissing, ref oMissing, ref oMissing);
        Marshal.ReleaseComObject(myWordAppObject);
        myWordAppObject = null;
    }
}
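To store the XML held in an XmlDocument such as xmlDocObject, a parameterized INSERT into an xml column is usually enough. The following is only a sketch under stated assumptions: the table dbo.WordDocuments and its columns FileName and Content are hypothetical, as are the class and method names, and it uses System.Data.SqlClient (built into the .NET Framework; a separate package on newer .NET versions).

```csharp
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.Xml;

public static class WordXmlStore
{
    // Builds the INSERT command. Table and column names are hypothetical.
    public static SqlCommand BuildInsertCommand(string fileName, XmlDocument wordXml)
    {
        var cmd = new SqlCommand(
            "INSERT INTO dbo.WordDocuments (FileName, Content) VALUES (@fileName, @content)");
        cmd.Parameters.Add("@fileName", SqlDbType.NVarChar, 260).Value = fileName;
        // SqlXml is fed from an XmlReader, so the document is passed
        // as XML rather than as a plain string.
        cmd.Parameters.Add("@content", SqlDbType.Xml).Value =
            new SqlXml(new XmlNodeReader(wordXml));
        return cmd;
    }

    // Opens a connection and runs the INSERT for one document.
    public static void Save(string connectionString, string fileName, XmlDocument wordXml)
    {
        using (var connection = new SqlConnection(connectionString))
        using (SqlCommand cmd = BuildInsertCommand(fileName, wordXml))
        {
            cmd.Connection = connection;
            connection.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```

Inside the loop this could be called as WordXmlStore.Save(connectionString, wordFile.Name, xmlDocObject), where connectionString points at your own database.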


Sample OUTPUT – of Word XML

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:aml="http://schemas.microsoft.com/aml/2001/core" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
                xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office"
                xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word"
                xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
                xmlns:wsp="http://schemas.microsoft.com/office/word/2003/wordml/sp2" xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
                w:macrosPresent="no" w:embeddedObjPresent="no" w:ocxPresent="no" xml:space="preserve">
  <w:ignoreSubtree w:val="http://schemas.microsoft.com/office/word/2003/wordml/sp2" />
  <w:body>
    <w:p wsp:rsidR="00621F81" wsp:rsidRPr="006D196D" wsp:rsidRDefault="00621F81" wsp:rsidP="00621F81">
      <w:r>
        <w:t>Hi,</w:t>
      </w:r>
    </w:p>
    <w:p wsp:rsidR="00621F81" wsp:rsidRPr="006D196D" wsp:rsidRDefault="00621F81" wsp:rsidP="00621F81">
      <w:r>
        <w:t>Hi Prashant.</w:t>
      </w:r>
    </w:p>
    <w:p wsp:rsidR="00621F81" wsp:rsidRPr="006D196D" wsp:rsidRDefault="00621F81" wsp:rsidP="00621F81">
      <w:r>
        <w:t>I am developing one Windows Forms Application in C# and I require to covert the Word file to XML and saved it to SQL Server Database.</w:t>
      </w:r>
    </w:p>
    <w:p wsp:rsidR="00621F81" wsp:rsidRPr="006D196D" wsp:rsidRDefault="00621F81" wsp:rsidP="00621F81">
      <w:r>
        <w:t>Awaiting for your response.</w:t>
      </w:r>
    </w:p>
    <w:sectPr wsp:rsidR="00022807" wsp:rsidRPr="006D196D" wsp:rsidSect="00022807">
      <w:pgSz w:w="12240" w:h="15840" />
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0" />
      <w:cols w:space="720" />
      <w:docGrid w:line-pitch="360" />
    </w:sectPr>
  </w:body>
</w:wordDocument>
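As the output shows, the visible text lives in w:t elements inside each w:p paragraph. If you need the plain text per paragraph, a sketch along these lines could pull it out of the loaded document. It assumes Word 2003 XML as produced by wdFormatXML; the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml;

public static class WordXmlText
{
    private const string WordMlNs = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Returns one string per w:p paragraph, built by joining its w:t text runs.
    public static List<string> GetParagraphTexts(XmlDocument wordXml)
    {
        var nsmgr = new XmlNamespaceManager(wordXml.NameTable);
        nsmgr.AddNamespace("w", WordMlNs);

        var paragraphs = new List<string>();
        foreach (XmlNode p in wordXml.SelectNodes("/w:wordDocument/w:body//w:p", nsmgr))
        {
            var text = new StringBuilder();
            foreach (XmlNode t in p.SelectNodes(".//w:t", nsmgr))
            {
                text.Append(t.InnerText);
            }
            paragraphs.Add(text.ToString());
        }
        return paragraphs;
    }
}
```

Calling WordXmlText.GetParagraphTexts(xmlDocObject) after the Load call would give a list with one entry per paragraph of the source document.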
