Multilingual content indexing

Post by UmFkcml6em » Sat, 25 Mar 2006 23:58:02


Hi there,
We have a Windows SharePoint Services installation and are indexing its
content. I have a question regarding the index, so I thought it might be
better to post here instead of the WSS newsgroups.

Anyway, here we go:
We have thousands of documents with content in multiple languages (e.g.
English and German, English and Portuguese, etc.).
Of course there are also documents which contain only one language.
Now when trying a search for, let's say, "muito", which is in the list of
Portuguese noise words, the index returns some documents, but not all that
contain this word.
So I was wondering how exactly noise words work. I can understand that
the index wouldn't return Portuguese-only documents because the word is in that
language's noise word list. But what about multi-language documents? Does
SharePoint/Index Server determine what language a document is in and then
apply the correct noise word file? Or does it always apply the English
noise words (the server is an English installation)?
How is it possible that some documents which contain a given word are
returned but not all?

Hope I was reasonably clear and that someone can help me.

Regards

Gilles
 
 
 

Multilingual content indexing

Post by Hilary Cot » Sun, 26 Mar 2006 00:33:03

Basically, some IFilters will respect embedded language tags for some
document types (Word, XML, HTML). These documents may be broken by a word
breaker for a language other than the default one for your server.

The words will be broken according to language rules and stored in your
catalog as such.

Then, when you search, the default language rules will be applied at query
time (or overridden if you use the language predicate).
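
Applied to the noise-word question, the index-time/query-time split can be sketched in a few lines of Python. This is only a toy model, not Indexing Service code: the noise lists are tiny made-up stand-ins, the function names are hypothetical, and it assumes that noise words are dropped at index time using the list for the language the document was broken under. Under that assumption, a document tagged as Portuguese loses "muito" at index time, while untagged or English-tagged documents keep it, which would explain getting some hits but not all.

```python
# Toy model only. Noise lists and function names are hypothetical stand-ins.
NOISE = {
    "en": {"the", "a", "of"},
    "pt": {"muito", "de", "o"},
}

def index_tokens(text, lang):
    """Tokens kept for a document indexed under the given language's rules."""
    return {w for w in text.lower().split() if w not in NOISE[lang]}

def build_index(docs, default_lang="en"):
    """docs maps doc_id -> (language tag or None, text)."""
    index = {}
    for doc_id, (lang_tag, text) in docs.items():
        # Untagged documents fall back to the server default language.
        lang = lang_tag or default_lang
        for token in index_tokens(text, lang):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, term, query_lang="en"):
    """Apply the query language's noise list, then look the term up."""
    term = term.lower()
    if term in NOISE[query_lang]:
        return set()
    return index.get(term, set())

docs = {
    "tagged_pt.doc": ("pt", "muito obrigado"),
    "untagged.txt": (None, "muito obrigado"),
    "mixed_en.doc": ("en", "thanks muito obrigado"),
}
index = build_index(docs)

print(sorted(search(index, "muito")))                   # ['mixed_en.doc', 'untagged.txt']
print(sorted(search(index, "muito", query_lang="pt")))  # []
```

All three documents contain "muito", but only the two indexed under English rules come back for an English query, and a Portuguese query drops the term before it ever reaches the index.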

Consider a Word doc tagged as German. The words will be broken according to
German language rules, so wanderlust would be broken and stored in your
catalog as wanderlust, wandern, and lust.

If you search on wanderlust using the English language options you will only
get hits to this document. If you search on lust using the English language
options you will get hits to this document. If you search on wanderlust
using the German language options you will get hits to documents in a
variety of languages containing wanderlust, wandern, and lust.
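
The wanderlust case can be mimicked with a small Python sketch. It is a toy model under stated assumptions: the "breakers" are hypothetical functions, the German compound table holds a single hand-written entry, and real word breakers do far more than this. It only shows how breaking at index time and again at query time produces the three outcomes described above.

```python
# Toy model only. Breakers and the compound table are hypothetical.
GERMAN_COMPOUNDS = {"wanderlust": {"wandern", "lust"}}

def break_english(word):
    # Assumed English behaviour: the word stays a single token.
    return {word.lower()}

def break_german(word):
    # Assumed German behaviour: keep the compound and add its parts.
    w = word.lower()
    return {w} | GERMAN_COMPOUNDS.get(w, set())

BREAKERS = {"en": break_english, "de": break_german}

def build_index(docs, default_lang="en"):
    """docs maps doc_id -> (language tag or None, text)."""
    index = {}
    for doc_id, (lang_tag, text) in docs.items():
        breaker = BREAKERS[lang_tag or default_lang]
        for word in text.split():
            for token in breaker(word):
                index.setdefault(token, set()).add(doc_id)
    return index

def search(index, term, query_lang="en"):
    """Break the query term with the query language, then union the hits."""
    hits = set()
    for token in BREAKERS[query_lang](term):
        hits |= index.get(token, set())
    return hits

docs = {
    "german.doc": ("de", "Wanderlust"),
    "english.doc": ("en", "lust for life"),
    "untagged.txt": (None, "wandern"),
}
index = build_index(docs)

print(sorted(search(index, "wanderlust", "en")))  # ['german.doc']
print(sorted(search(index, "lust", "en")))        # ['english.doc', 'german.doc']
print(sorted(search(index, "wanderlust", "de")))  # ['english.doc', 'german.doc', 'untagged.txt']
```

The German-language query expands to three tokens, so it pulls in documents that were indexed under English rules as well.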

Watch out for false friends/false cognates and wander words (Wanderwörter).
--
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.

This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.




"Radrizzi Gilles" < XXXX@XXXXX.COM > wrote in