Lots of people do some kind of information extraction from web pages.
There are plenty of packages available, but it is likely that none of
them will do anything like what you want without substantial work.

Look for GATE from the University of Sheffield. They document their
work well and have good pointers to alternative systems.

Look also for "owl, rdf, web" for a plethora of literature based on the
idea of the semantic web. For the most part, I consider this trend
completely hopeless with respect to the likelihood of producing
anything useful, but there are likely to be a few gems in the
literature.

On a very pragmatic level, there are also HTML parsing libraries such
as htmlparser that can help you build your own extraction systems.
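
To give a feel for the build-your-own route, here is a minimal sketch of
an extractor built on an HTML parsing library. It is only an
illustration under some assumptions: it uses Python's standard-library
html.parser module as a stand-in (not the htmlparser package mentioned
above), and the LinkExtractor class and extract_links function are
hypothetical names made up for this example. It pulls out each link's
target and anchor text, which is often a first step before anything
smarter.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Only anchors that carry an href are treated as links.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Accumulate text while inside a link, including nested tags.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, anchor text) pairs found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p>See <a href="http://example.org/">the docs</a>.</p>'
    for href, text in extract_links(sample):
        print(href, text)
```

Real pages are messier than this (broken markup, scripts, links built
at run time), so expect to add your own rules on top of whatever the
parser hands you.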