Need document filter API in C++

Post by Rufus DeDu » Mon, 02 Feb 2004 10:49:36

I am looking for a (preferably freeware) document filter API that covers a
wide range of document formats, e.g. HTML, Excel, Word DOC, Access MDB, etc.

To clarify, I would get a set of libraries housing functions that look like

HANDLE FindFirstWord(char * szDocumentName) //opens a search
BOOL FindNextWord(HANDLE h, char * sWord, int nChars) //copies the next
//content word (i.e., with formatting stripped out) into sWord
void Close(HANDLE h) //closes the search

To change docs that look like:

<TEXT>The quick red fox etc.</TEXT>

Into a series like:

And it needs to work for a bunch of formats.
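
For the tagged-text case only, the idea can be sketched in a few lines of C++. This is a minimal illustration, not a real document filter: the function name ExtractWords is hypothetical, it assumes that everything between '<' and '>' is markup, and it does nothing for binary formats such as DOC, XLS or MDB.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: returns the content words of a tagged string,
// skipping everything between '<' and '>' (assumed to be markup).
std::vector<std::string> ExtractWords(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    bool inTag = false;
    for (unsigned char c : text) {
        if (c == '<') inTag = true;
        else if (c == '>') inTag = false;

        if (!inTag && std::isalnum(c)) {
            // Letters and digits outside a tag extend the current word.
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            // Any other character ends the word that was being built.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

Calling ExtractWords on the <TEXT> sample above would yield the words one at a time, which is roughly what repeated FindNextWord calls would return.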

Verity has a commercial product - is there anything else out there?

I am on Windows BTW.


Post by johnquigle » Tue, 24 Feb 2004 06:47:41

This may not be exactly what you're looking for, but it's a start.
Check out wvWare. I've used it to convert MS
Word docs into straight ASCII text. It has an HTML-to-ASCII converter
as well (just like your example). It may handle Excel -- you'll have
to check. As far as Word is concerned, it converts it into an HTML
document and then into a straight text document.

It's mostly written in C but there are binaries that you can call on
the command line. You'd have to compile for Windows -- I've only used
it on Unix so I can't comment on that.

In my experience with document filtering, I'd say you'd be hard
pressed to find one freeware tool that would convert the wide range of
file types you mentioned. Probably you'll have to piece together
different tools to cover that range of file types, especially with PDF
and DB files.
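
One way to piece tools together is to pick an external converter by file extension and call it on the command line. The sketch below only builds the command string. The function names are hypothetical, and the tool names are assumptions: it presumes wvWare's wvText and Xpdf's pdftotext are installed and on the PATH, each taking an input file followed by an output file.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns the lower-cased extension of a path (without the dot), or "" if none.
std::string LowerExtension(const std::string& path) {
    const std::string::size_type dot = path.find_last_of('.');
    const std::string::size_type sep = path.find_last_of("/\\");
    if (dot == std::string::npos ||
        (sep != std::string::npos && dot < sep)) {
        return "";  // no dot, or the last dot belongs to a directory name
    }
    std::string ext = path.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext;
}

// Hypothetical dispatcher: builds a command line for an external converter,
// or returns an empty string when no converter is known for the file type.
std::string BuildConvertCommand(const std::string& input,
                                const std::string& output) {
    const std::string ext = LowerExtension(input);
    const std::string args = " \"" + input + "\" \"" + output + "\"";
    if (ext == "doc") return "wvText" + args;     // assumed: wvWare text converter
    if (ext == "pdf") return "pdftotext" + args;  // assumed: Xpdf text extractor
    return "";                                    // e.g. XLS, MDB: no tool chosen here
}
```

The resulting string could be handed to std::system; an empty result means another tool still has to be found for that format.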

Hope that helps.


Post by eDoc » Wed, 25 Feb 2004 01:10:52


I looked at this library briefly. It looks great, but does it compile to a
DLL, or is there a DLL available?
I need something to dump Word and Excel files to straight ASCII; PDF
would be nice too. I use VB and have no knowledge of C++.


