Translating phrases to compounds

Post by Marcin Jun » Mon, 26 Jun 2006 18:33:50


I have noticed that a lot of work has been done on translating
compounds from the source language to (preferably) compounds or phrases
of the target language, but it seems to me that there are no efforts
to translate phrases to compounds in cases where the target language
might prefer a compound instead of a perhaps unlikely or unnatural
phrase.

Polish for instance is a language where composition is not very
productive. When translating a Polish sentence to German we should
often have compounds in German where we had phrases in Polish. A
simple example would be "bookcase".

A phrase to phrase translation:

"półka do książek" -> "Regal für Bücher"

This is a perfectly correct translation, but the following would be
much preferable:

"półka do książek" -> "Bücherregal"

This example is a very simple one and would probably be covered at the
dictionary level, but the possibilities for building compounds in German
are practically unlimited, and native speakers use them often, especially
in written text. So many Polish phrases should probably translate to
compounds. I suppose this problem also appears with more popular
language pairs. I do not know any Spanish, but from what I have seen,
it is also a language without compounds; is this correct? If so, we would
have a similar problem with Spanish-German (maybe Spanish-English).

Do you know of any research on this kind of problem, for example
statistical approaches to determining which phrases are likely to be
translated to compounds in a compounding target language?

Do you think there is much sense in such research?


Post by Ian Parke » Tue, 27 Jun 2006 20:54:43

If you were making a balanced effort at language translation in
general, this would come up in the statistical analysis. The analysis
might thereby condense a number of words into a single concept.


Post by Marcin Jun » Wed, 28 Jun 2006 01:26:26

> If you were making a balanced effort at language translation - in

Would that work for non-lexicalized ad hoc compounds?

Post by Ian Parke » Wed, 28 Jun 2006 20:15:09

What you are asking is essentially the same as "could a lexicon be
constructed statistically?". Yes: if two or more words were frequently
associated, this would condense them into one concept, which would then
be included in the lexicon.

Post by Ted Dunnin » Thu, 29 Jun 2006 07:29:57

No, Ian's suggestion wouldn't work all that well.

In this specific case, however, statistical techniques might work very,
very well, if you can produce alternative forms (noun-noun compound,
noun-noun phrase and prepositionally modified noun). If you can
generate the alternatives reliably, then you could use a fairly simple
statistical language model to pick the alternative that "sounds" right.
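A minimal sketch of that selection step, assuming the alternatives have already been generated. The toy corpus, the add-one smoothed bigram model, and the helper names (`train_bigram`, `log_prob`, `alternatives`, `pick_best`) are all made up for illustration; a real system would train on a large German corpus and use a better model.

```python
import math
from collections import Counter

# Toy German corpus (made up for illustration); a real model needs far more text.
CORPUS = [
    "das Bücherregal steht im Zimmer",
    "ein neues Bücherregal aus Holz",
    "das Regal für Bücher ist voll",
    "das Bücherregal ist voll",
]

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(tokens, model):
    """Add-one smoothed bigram log-probability of a token sequence."""
    unigrams, bigrams = model
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

def alternatives(head, modifier_plural):
    """Hypothetical generator: the compound and the prepositional phrase."""
    return [
        [modifier_plural + head.lower()],   # e.g. Bücherregal
        [head, "für", modifier_plural],     # e.g. Regal für Bücher
    ]

def pick_best(left, options, right, model):
    """Score each alternative inside its sentence context and keep the best."""
    return max(options, key=lambda alt: log_prob(left + alt + right, model))

MODEL = train_bigram(CORPUS)
best = pick_best(["das"], alternatives("Regal", "Bücher"), ["ist", "voll"], MODEL)
print(" ".join(best))
```

Note that an unnormalized sequence probability favors shorter alternatives, so a one-token compound has a built-in advantage over a three-token phrase; whether that bias is desirable would have to be checked on real data.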

Even in German, where compounds are fairly productive, or in English, where
noun-noun phrases are very common, a decent language model would be
able to rate the alternatives pretty well. This is especially true if
you are able to use a class-based model which generalizes lexical

This is a special case of the larger instance of statistical
translation where a mono-lingual language model for the target
utterance is used to repair defects in the translation model. For
full-on statistical MT, the translation model is often as weak as a bag
of words model, but in your case, you have a much stronger translation
model so the requirements on your language model would be much less
stringent and your search algorithm needn't be nearly as clever.
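For reference, the decomposition described here is usually written in the noisy-channel form. The symbols are my own notation, not from the thread: p is the Polish source phrase, g a German candidate, and C(p) the set of alternatives generated for p.

```latex
\hat{g} \;=\; \arg\max_{g} P(g \mid p)
        \;=\; \arg\max_{g} \underbrace{P(p \mid g)}_{\text{translation model}}
                           \;\underbrace{P(g)}_{\text{language model}}

% If the translation model only proposes a small candidate set C(p) and
% treats its members as roughly equally likely, the choice reduces to
\hat{g} \;\approx\; \arg\max_{g \in C(p)} P(g)
```

The second line is the simplification suggested above: with a strong translation model the search is over a handful of candidates, and the language model alone does the ranking.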

Post by Marcin Jun » Fri, 30 Jun 2006 07:09:07

> No, Ian's suggestion wouldn't work all that well.

Thought so. The semantic complexity of some German compounds can be
compared to that of short sentences, with the difference that the
relations between the segments of the compound are expressed
implicitly. Treating them as lexicon entries would, in my opinion, be
equivalent to building a database of static sentence translations,
which I think would not be very useful.

So, let's assume we are trying to build a set of templates that map
Polish phrases with explicit function words like prepositions and
particles to German compounds that express the same relations
implicitly. Let's further assume that we have a corpus of aligned German
and Polish language data. Most probably every complex non-lexicalized
German compound will have some kind of phrase representing it in Polish,
and this phrase will not be a compound itself (since compounds are not
really productive in Polish).

By analyzing this kind of data, we can build those templates, treating
the Polish phrases as an attempt at paraphrasing the compound in a
language other than German. I think an attempt to translate this Polish
structure back to German (guided perhaps by a previous brute-force
analysis of the German compound) could give some interesting clues
about possible semantic relations within those compounds. That way we
could build the templates mentioned above. Those in turn could, among
other things, perhaps be used for German monolingual paraphrasing.

The basic problem would be to assign the correct probabilities to those
templates according to their degree of "sounding right", as you put it.
As I understand it, such templates could be lexicalized, for instance by
emphasizing the head of the phrase and the compound's head or its
base. It might be interesting to see how such an approach would do when
encountering more complex phrases and compounds.
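As a rough sketch of the probability assignment, one could start from plain relative frequencies over the aligned data. Everything here is an assumption for illustration: the pattern notation ("N1 do N2", "N2+N1"), the example counts, and the function name `template_probabilities`. The patterns are presumed to come from a parser and aligner that are not shown.

```python
from collections import Counter, defaultdict

# Hypothetical aligned observations: (Polish phrase pattern, German realization).
# "N2+N1" stands for a compound built from the two nouns in reversed order.
ALIGNED = [
    ("N1 do N2", "N2+N1"),      # e.g. półka do książek -> Bücherregal
    ("N1 do N2", "N2+N1"),
    ("N1 do N2", "N1 für N2"),  # e.g. półka do książek -> Regal für Bücher
    ("N1 na N2", "N2+N1"),
    ("N1 na N2", "N1 für N2"),
]

def template_probabilities(pairs):
    """Estimate P(German realization | Polish pattern) by relative frequency."""
    counts = defaultdict(Counter)
    for source_pattern, target_pattern in pairs:
        counts[source_pattern][target_pattern] += 1
    return {
        source: {target: n / sum(targets.values()) for target, n in targets.items()}
        for source, targets in counts.items()
    }

PROBS = template_probabilities(ALIGNED)
for source, targets in PROBS.items():
    print(source, targets)
```

A lexicalized variant could key the counts on the pattern together with the head noun, backing off to the bare pattern when the head has not been seen.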

Post by Ian Parke » Fri, 30 Jun 2006 23:20:26

Yes, but the actual German compounds tend to be frequently used

Post by Marcin Jun » Sat, 01 Jul 2006 00:24:50

> Yes, but the actual German compounds tend to be frequently used

Not necessarily.

For illustration you can take the first article on the home page of
the German news magazine "Der Spiegel": ,1518,424328,00.html

In this short article of only 227 words there are eight compounds that
are not listed in the Duden-Universalwörterbuch (14 compounds in total), a
dictionary that contains about 90,000 nouns, of which about 65,000 are
Not listed:

al-Qaida- und Taliban-Kämpfer (= al-Qaida-Kämpfer &

Post by Ian Parke » Sat, 01 Jul 2006 01:55:06

Yes but nobody is held in Hawaii so that is unlikely.

Post by Ted Dunnin » Sat, 01 Jul 2006 06:26:04

This is not a particularly trenchant response.

The point is that you can't depend on finding examples of compounds in
German. The example article indicates that (very roughly) half of the
compounds could be found in Duden.

Presumably some of the missing ones would also be found in existing
large corpora.

Some would still be missing.

The problem here is to estimate plausibility of a putative compound or
compound/prepositional hybrid.

It should be noted that German compounds are very similar to English
noun-noun phrases except for the missing spaces. It should also be
noted that existing German compounds can be segmented relatively
reliably using statistical techniques, especially if you have a high
quality dictionary to start with.
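To make the segmentation idea concrete, here is a small sketch that is purely dictionary-driven rather than statistical. The lexicon, the list of linking elements, and the function name `split_compound` are assumptions for illustration; a statistical splitter would additionally rank competing segmentations, for example by the corpus frequency of the parts.

```python
# Tiny illustrative lexicon (lowercased); a real one would be a full dictionary.
LEXICON = {"bücher", "regal", "wörter", "buch", "wirtschaft", "gemeinschaft"}

# Common German linking elements (Fugenelemente); the empty string means none.
LINKERS = ("", "s", "es", "n", "en")

def split_compound(word, lexicon=LEXICON, min_len=3):
    """Return the first segmentation of `word` into lexicon entries, or None."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try every split point that leaves at least min_len characters on each side.
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = split_compound(rest[len(linker):], lexicon, min_len)
            if tail is not None:
                return [head] + tail
    return None

print(split_compound("Bücherregal"))
print(split_compound("Wirtschaftsgemeinschaft"))
```

The recursion returns the first analysis it finds, which is fine for a toy lexicon but would be ambiguous with a large one.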

That means that you can consider compounds to be essentially the same
as noun-noun phrases except that the punctuation differs a bit (i.e.
they involve some very thin space characters). You should also be able
to build a pretty reasonable probabilistic language model that
recognizes and evaluates compounds it has never before seen based on
generalizations of what it has seen. Thus the existence of
EU-USA-Gipfel makes it plausible that you would see EU-Japan-Gipfel,
but does not make it much more plausible that you would see
EU-Kaffee-Gipfel. Similarly, the now largely historical term
Europaischewirtschaftgemeinschaft would make
Nordamericanischewirtschaftgemeinschaft plausible.
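One way to picture this kind of generalization is a class-based score, sketched below. The word classes, the smoothing constants, and the function names (`class_pattern`, `train`, `plausibility`) are invented for the example; a real model would induce the classes from data and would be trained on many observed compounds, not one.

```python
from collections import Counter

# Hypothetical word classes; a real model would learn these from a corpus.
WORD_CLASS = {
    "EU": "POLITY", "USA": "POLITY", "Japan": "POLITY",
    "Kaffee": "COMMODITY", "Gipfel": "EVENT",
}

# The only observed compound here is the one mentioned in the post.
OBSERVED = ["EU-USA-Gipfel"]

def class_pattern(compound):
    """Map a hyphenated compound to its sequence of word classes."""
    return tuple(WORD_CLASS.get(part, "UNK") for part in compound.split("-"))

def train(observed):
    """Count how often each class pattern occurs in the observed compounds."""
    return Counter(class_pattern(c) for c in observed)

def plausibility(compound, pattern_counts, alpha=0.1, n_patterns=1000):
    """Additively smoothed relative frequency of the compound's class pattern.

    alpha and n_patterns are arbitrary smoothing assumptions.
    """
    total = sum(pattern_counts.values())
    count = pattern_counts[class_pattern(compound)]
    return (count + alpha) / (total + alpha * n_patterns)

COUNTS = train(OBSERVED)
for candidate in ("EU-Japan-Gipfel", "EU-Kaffee-Gipfel"):
    print(candidate, plausibility(candidate, COUNTS))
```

Because EU-Japan-Gipfel shares its class pattern with the observed EU-USA-Gipfel, it scores higher than EU-Kaffee-Gipfel even though neither has been seen.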

Obviously observing the exact compound that you are wondering about
would make it very plausible, but even for the compounds that have
never been observed, it should be possible to assign probabilities.