Information Extraction in NLP for Web pages!!??

Post by barclay_ji » Fri, 02 Sep 2005 00:24:04


Hi all!

Could anyone out there tell me about the latest developments in
information extraction for web pages? Are there any software packages
(especially public-domain packages) that extract information from web
pages, with multilingual support and preferably with support for
configurable industry domains?

Thanks in advance.

Barclay Jiang

Post by Ted Dunnin » Sat, 03 Sep 2005 07:21:06

LOTS of people do some kind of information extraction for web pages.
There are lots of packages available, but it is likely that none of
them will do anything like what you want without substantial
modification.

Look for GATE (General Architecture for Text Engineering) from the
University of Sheffield. They document their work well and have good
pointers to alternative systems.

Also search for "owl, rdf, web" to find a plethora of literature based
on the idea of the semantic web. For the most part, I consider this
trend completely hopeless as far as producing anything useful goes, but
there are likely to be a few gems in the literature.

On a very pragmatic level, there are also HTML parsing libraries such
as htmlparser that can help you build your own extraction systems.
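
To give a rough idea of the build-your-own route, here is a minimal
sketch. It uses Python's standard-library html.parser module as a
stand-in, not the htmlparser library named above, and the class name,
the extract() helper, the sample page and the example.com URL are all
made up for illustration. It only pulls out link targets and visible
text; a real extraction system would layer domain-specific rules on top
of something like this.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Hypothetical helper: collects link targets and visible text."""

    # Text inside these tags is never shown to the reader, so skip it.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0:
            stripped = data.strip()
            if stripped:
                self.text_chunks.append(stripped)


def extract(html):
    """Return (list of hrefs, visible text joined by single spaces)."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_chunks)


# Made-up sample page; the URL is a placeholder.
sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<p>GATE is from <a href='http://example.com/gate'>Sheffield</a></p>"
    "<script>var x = 1;</script></body></html>"
)
links, text = extract(sample)
print(links)
print(text)
```

The parser is event-driven: it never builds a document tree, so it copes
with large or sloppy pages, but anything that depends on page structure
(tables, nested lists) needs extra state tracking of your own.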

Good luck!