Convert MS Word to XML/HTML on Linux

http://stackoverflow.com – I need to convert an MS Word file into XML or HTML while preserving the structure of the file (mainly tables). I found Tika, which is quite powerful at extracting text from MS Word files (and other files), as follows:

    curl www.vit.org/downloads/doc/tariff.doc \
      | java -jar tika-app-1.3.jar --text

I can also pick an option to save the output as HTML/XML, as follows:

    curl www.vit.org/downloads/doc/tariff.doc \
      | java -jar tika-app-1.3.jar --html

But the output is basically plain text wrapped in HTML, so it is not possible to get the table structure and other document ...
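A quick way to check whether a given output mode kept any table markup is to count the opening table tags in what Tika prints. The sketch below is only an illustration: `count_tables` is a hypothetical helper name (not part of Tika), and it assumes a grep that supports `-o`, as GNU and BSD grep do.

```shell
#!/bin/sh
# count_tables: hypothetical helper, not part of Tika.
# Reads HTML/XHTML on stdin and prints the number of opening <table> tags.
count_tables() {
    # -o prints each match on its own line, -i ignores case.
    # "|| true" keeps the pipeline from failing when nothing matches.
    { grep -o -i '<table[ >]' || true; } | wc -l | tr -d ' '
}

# Assumed usage, with the jar and URL taken from the question:
#   curl www.vit.org/downloads/doc/tariff.doc \
#     | java -jar tika-app-1.3.jar --html \
#     | count_tables
```

A result of 0 would confirm that the chosen mode flattened the tables into plain paragraphs; any other value means at least some table structure survived and is worth inspecting by hand.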