Frequently Asked Questions
- Can it convert any word processor file?
-
Generally yes. Docvert deals with documents at a structural level and because of this it's able to use many mature conversion libraries.
Note: It is possible to embed obscure formats in MS Word documents, and these may not convert successfully. However, common formats such as JPEG, GIF, WMF, SVG, BMP and many others will convert without problems.
- Can it produce any type of HTML or XML?
- It uses XSLT which can produce any XML or HTML, so yes it can produce any tags or attributes.
- Is there a public demo?
-
Here's a public demo that does OpenDocument to HTML.
Although Docvert also supports MS Word, this feature isn't enabled in the demo due to limited server resources. You'll have to upload .ODT files.
- How does it compare to other software?
-
There are several families of HTML converters:
- (a) .DOC + Manual Config → HTML;
- (b) .DOC + Ruleset → DocBook or HTML;
- (c) .DOC → OpenDocument → DocBook → HTML via XSLT.
Family (a) can't be automated, which means it's error prone and complex rulesets aren't feasible. This family of software includes MS Word itself, whose HTML is so terrible that it has spawned a market for software whose sole job is to clean it up. Typically the conversion process isn't open.
Family (b) is better. However, MS Word's .DOC isn't a structured format and it doesn't deal in chapters and sections. It's just a flat list of paragraphs and headings (without any useful hierarchy). The source format wasn't designed for programmers, and any ruleset based on it will be a pain in the ass to use.
There's also Microsoft's OOXML/WordML, which is the same list of paragraphs and headings that Microsoft Word has always had, but now in XML. It's easier to read but still a hassle to use because it lacks structure and hierarchy. The format is poorly designed, and it is currently unclear which uses of the format are legally permitted.
In family (c), .DOC is mapped to OpenDocument XML, which is then converted to a structured format, making it easy to write HTML themes. Each stage is open and verifiable, so the code can be more easily improved in the future. The conversions are done using a W3C standard, not a proprietary one.
Docvert uses the last process, which I believe is the easiest to code for. It provides an appropriate level of abstraction without the inflexibility of cramming too many stages together.
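To illustrate why a flat list of headings and paragraphs is harder to work with than a structured format, here is a minimal Python sketch that nests a flat list into sections. The `Section` class, the `nest` function and the tuple layout are hypothetical names made up for this illustration; this is not Docvert's code, which does its conversions with XSLT.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical container for one section of a structured document."""
    title: str
    level: int
    paragraphs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def nest(flat_items):
    """Turn a flat list of ("h", level, text) and ("p", text) items into a tree."""
    root = Section(title="(document)", level=0)
    stack = [root]  # stack[-1] is the section currently being filled
    for item in flat_items:
        if item[0] == "h":
            _, level, text = item
            # Close any open sections at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(title=text, level=level)
            stack[-1].children.append(section)
            stack.append(section)
        else:
            # A paragraph belongs to whichever section is currently open.
            stack[-1].paragraphs.append(item[1])
    return root


# What a .DOC-style source gives you: headings and paragraphs, one after another.
flat = [
    ("h", 1, "Introduction"),
    ("p", "Some opening text."),
    ("h", 2, "Background"),
    ("p", "More detail."),
    ("h", 1, "Conclusion"),
    ("p", "Closing text."),
]
tree = nest(flat)
```

Once the document is a tree, a theme can ask for "each chapter" or "each subsection of this chapter" directly, which is the kind of question a flat list makes awkward.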
- Is this just a wrapper around other conversion libraries that make HTML?
-
No. Docvert does use external software to convert Microsoft Word to OpenDocument. However, the subsequent conversion from OpenDocument to DocBook and HTML is custom, and we've been improving it for years now.
In fact, if you upload OpenDocument files then Docvert does not use any external software. This conversion process involves using XML Pipelines, DocBook and XSLT.
- Should I structure my documents with Word Styles?
-
Yes! For all but trivial examples you'll need to use Word Styles for any conversion software so that it knows how to section your document and format everything correctly. Like most conversion software, Docvert ignores font sizes and background colours and instead makes decisions based on structural Word Styles that describe paragraphs, headings, lists, tables, etc.
When a document lacks styles, Docvert attempts to infer structure, but this involves guessing and there are obvious limits to it.
Learning how to use Word Styles (that is, how to use Word correctly) is outside the scope of this FAQ, but there are online tutorials that teach this (Microsoft's one is pretty good).
There has been some interest in maintaining these font sizes and colours throughout, and I don't have a problem with this, so if someone wants to code up some XSLT to do that...
- My document doesn't convert
- Send it in and I'll have a look.
- OpenOffice.org... Abiword... for server software?
-
It has taken word processors years to mature their reverse-engineered libraries for dealing with .DOC. As far as I know, other converter libraries haven't reached this level of compatibility.
Docvert builds upon desktop word processors such as OpenOffice.org and Abiword because they have the best chance of dealing with the vagaries of the MS Word format.
As of Docvert 3.2 we use OpenOffice.org or Abiword in an efficient server-like mode and stream documents to them.
If there are any other converters that produce OpenDocument I'd be happy to include them, so contact me if you know of any (better yet, send in a patch).
- XML Pipelines?
-
XML Pipelines are a simple way of describing a sequence of stages. For example, one stage could split the document into chapters, and the next could style it for your site. It looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<pipeline>
<stage process="TransformToDocBook"/>
<stage process="Transform" withFile="webstyle.xsl"/>
<stage process="Serialize" toFile="index.html"/>
</pipeline>
Each Docvert theme gets its own pipeline to customise and can have as many stages as it wants.
There are more complex examples in the download.
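As a rough illustration of the "sequence of stages" idea, the Python sketch below reads the pipeline shown above and passes a document through each stage in order. The `read_stages` and `run_pipeline` functions and the stub handlers are hypothetical and only record what they were asked to do; Docvert's own pipeline runner is not shown in this FAQ and may work quite differently.

```python
import xml.etree.ElementTree as ET

# The example pipeline from this FAQ, unchanged.
PIPELINE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<pipeline>
<stage process="TransformToDocBook"/>
<stage process="Transform" withFile="webstyle.xsl"/>
<stage process="Serialize" toFile="index.html"/>
</pipeline>"""


def read_stages(xml_text):
    """Return [(process_name, options_dict), ...] in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    stages = []
    for stage in root.findall("stage"):
        options = {k: v for k, v in stage.attrib.items() if k != "process"}
        stages.append((stage.get("process"), options))
    return stages


def run_pipeline(stages, handlers, document):
    """Feed the output of each stage into the next one."""
    for process, options in stages:
        document = handlers[process](document, options)
    return document


# Stub handlers (assumption): each just logs the call and returns the input.
log = []


def make_stub(name):
    def stub(document, options):
        log.append((name, options))
        return document
    return stub


handlers = {
    name: make_stub(name)
    for name in ("TransformToDocBook", "Transform", "Serialize")
}

stages = read_stages(PIPELINE_XML)
result = run_pipeline(stages, handlers, "<office:document/>")
```

Because each stage only sees the previous stage's output, stages can be added, removed or reordered by editing the XML alone.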
- REST Web Services
-
REST is a simple Web Service where you send a request and receive a response via HTTP. In the case of Docvert, it responds to HTTP POSTs of .DOC files with a .ZIP of the converted documents.
The Docvert download contains an example application demonstrating this API.
REST is comparable to W3C SOAP, Hessian, etc.
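A client for this kind of service might look like the Python sketch below, which POSTs a .DOC file and saves the .ZIP that comes back. The URL, the form field name `upload` and the use of a multipart form upload are all assumptions made for illustration; check the example application in the Docvert download for the real request format.

```python
import os
import urllib.request
import uuid

# Hypothetical endpoint; replace with the address of your own Docvert install.
DOCVERT_URL = "http://docvert.example/convert"


def build_upload_request(url, filename, file_bytes, field_name="upload"):
    """Build (but do not send) an HTTP POST carrying one file.

    The field name and multipart encoding are assumptions, not Docvert's
    documented API.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/msword\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=head + file_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def convert(url, doc_path, zip_path):
    """Send doc_path to the service and write the response body to zip_path."""
    with open(doc_path, "rb") as source:
        request = build_upload_request(
            url, os.path.basename(doc_path), source.read()
        )
    with urllib.request.urlopen(request) as response:
        with open(zip_path, "wb") as target:
            target.write(response.read())


# Example call (needs a running server, so it is left commented out):
# convert(DOCVERT_URL, "report.doc", "report.zip")
```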
- Does it run on Windows and Linux?
- Yes. It probably runs on Mac OS X too, but I haven't tested it.