
Apertium-transfer-tool info request

Jimmy O'Regan [joregan at gmail.com]


Fri, 11 Jul 2008 17:47:01 +0100

[I'm assuming Arky's previous permission grant still stands; also cc'ing the apertium list for further comment]

---------- Forwarded message ----------

From: Rakesh 'arky' Ambati <rakesh_ambati@yahoo.com>
Date: 2008/7/11
Subject: Apertium-transfer-tool info request
To: joregan@gmail.com

Hi,

I am trying to use apertium-transfer-tools on Ubuntu Hardy; can you kindly point to a working example/tutorial where transfer rules are generated from alignment templates?

Cheers

--arky

Rakesh 'arky' Ambati Blog [ http://playingwithsid.blogspot.com ]




Jimmy O'Regan [joregan at gmail.com]


Fri, 11 Jul 2008 17:48:18 +0100

2008/7/11 Jimmy O'Regan <joregan@gmail.com>:

> Hi,
>
> I am trying to use apertium-transfer-tools on Ubuntu Hardy; can you
> kindly point to a working example/tutorial where transfer rules are
> generated from alignment templates?
>

Sure - there's an 'example' directory in the source package :) (http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-transfer-tools/example/)

The alignment templates we use are similar to Moses' 'Factored Models', if you have experience with those (http://www.statmt.org/moses/?n=Moses.FactoredModels)

I should come clean about something: there's a lot of work involved before you can use a-t-t.

First, you need a bilingual corpus: sentence aligned, one language per file, one sentence per line. I assume that you have those.

It's good, at this point, to make sure you have a clear understanding of Apertium's whole architecture.

Here's my example sentence, run through the 'Alpha testing' section (http://www.apertium.org/testing/) with 'Print intermediate representation' checked:

Esta es Gloria, mi amiga argentina
 
lt-proc (morphological analysis mode):
^Esta/Este<prn><tn><f><sg>/Este<det><dem><f><sg>$
^es/ser<vbser><pri><p3><sg>$
^Gloria/Gloria<n><f><sg>/Gloria<np><ant><f><sg>$^,/,<cm>$
^mi/mío<det><pos><mf><sg>$ ^amiga/amigo<adj><f><sg>/amigo<n><f><sg>$
^argentina/argentino<adj><f><sg>/argentino<n><f><sg>$
 
apertium-tagger:
^Este<prn><tn><f><sg>$ ^ser<vbser><pri><p3><sg>$
^Gloria<np><ant><f><sg>$^,<cm>$ ^mío<det><pos><mf><sg>$
^amigo<n><f><sg>$ ^argentino<adj><f><sg>$
 
apertium-pretransfer:
^Este<prn><tn><f><sg>$ ^ser<vbser><pri><p3><sg>$
^Gloria<np><ant><f><sg>$^,<cm>$ ^mío<det><pos><mf><sg>$
^amigo<n><f><sg>$ ^argentino<adj><f><sg>$
 
apertium-transfer:
^Prn<SN><tn><mf><sg>{^this<prn><tn><3><4>$}$
^verbcj<SV><vbser><pri><p3><sg>{^be<vbser><pri><p3><sg>$}$
^ant<SN><f><sg>{^Gloria<np><ant><f><sg>$}$^coma<cm>{^,<cm>$}$
^det_nom_adj<SN><f><sg>{^my<det><pos><sg>$ ^Argentinian<adj>$
^friend<n><3>$}$
 
apertium-interchunk:
^Prn<SN><tn><mf><sg>{^this<prn><tn><3><4>$}$
^verbcj<SV><vbser><pri><p3><sg>{^be<vbser><pri><p3><sg>$}$
^ant<SN><f><sg>{^Gloria<np><ant><f><sg>$}$^coma<cm>{^,<cm>$}$
^det_nom_adj<SN><f><sg>{^my<det><pos><sg>$ ^Argentinian<adj>$
^friend<n><3>$}$
 
apertium-postchunk:
^This<prn><tn><mf><sg>$ ^be<vbser><pri><p3><sg>$
^Gloria<np><ant><f><sg>$^,<cm>$ ^my<det><pos><sg>$ ^Argentinian<adj>$
^friend<n><sg>$
 
lt-proc (generation mode):
This is Gloria, my Argentinian friend
 
lt-proc (orthographic correction mode - unused in this example):
This is Gloria, my Argentinian friend

a-t-t only generates the rules used by 'apertium-transfer' - everything before that point (and after it) needs to be provided first: you need morphological analysers for each language involved - I assume that you're going to use a pair of analysers that we already have.

Also, the rules that a-t-t generates are for the 'transfer only' mode of apertium-transfer, while this example uses chunk mode - most language pairs, unless the languages are very closely related, are really best served by chunk mode. Converting a-t-t to support chunking is on my todo list; doing it properly may take a while, but I can probably get a crufty, hacked version together fairly quickly. With a couple of sed scripts and an extra run of GIZA++ etc., we can also generate rules for the interchunk module.

At around this point, I think it would be best if you told me what languages you're interested in using, as I can give you a much clearer picture of what's necessary. In some cases, some minor changes to the source may be necessary. The file 'TransferRule.C' has hardcoded assumptions about gender and number:

  head+=L"  <def-attr n=\"gen\">\n";
  head+=L"    <attr-item tags=\"m\"/>\n";
  head+=L"    <attr-item tags=\"f\"/>\n";
  head+=L"    <attr-item tags=\"mf\"/>\n";
  head+=L"    <attr-item tags=\"GD\"/>\n";
  head+=L"  </def-attr>\n";
 
  head+=L"  <def-attr n=\"num\">\n";
  head+=L"    <attr-item tags=\"sg\"/>\n";
  head+=L"    <attr-item tags=\"pl\"/>\n";
  head+=L"    <attr-item tags=\"sp\"/>\n";
  head+=L"    <attr-item tags=\"ND\"/>\n";
  head+=L"  </def-attr>\n";

Russian, for example, has four gender tags (masculine is split into animate and inanimate):

  head+=L"  <def-attr n=\"gen\">\n";
  head+=L"    <attr-item tags=\"ma\"/>\n";
  head+=L"    <attr-item tags=\"mi\"/>\n";
  head+=L"    <attr-item tags=\"f\"/>\n";
  head+=L"    <attr-item tags=\"nt\"/>\n";
  head+=L"    <attr-item tags=\"mf\"/>\n";
  head+=L"    <attr-item tags=\"GD\"/>\n";
  head+=L"  </def-attr>\n";

Slovenian has three numbers, adding the dual (I think the 'sp' - singular/plural - tag could safely be removed, but it's best to keep it):

  head+=L"  <def-attr n=\"num\">\n";
  head+=L"    <attr-item tags=\"sg\"/>\n";
  head+=L"    <attr-item tags=\"du\"/>\n";
  head+=L"    <attr-item tags=\"pl\"/>\n";
  head+=L"    <attr-item tags=\"sp\"/>\n";
  head+=L"    <attr-item tags=\"ND\"/>\n";
  head+=L"  </def-attr>\n";

Next, you need probability files for the part-of-speech taggers. This is where we hit our first snag, as we don't have those for any Indian languages.

We can cheat around this, but it's better to work on those first. We have information on the wiki: http://wiki.apertium.org/wiki/Tagger_training http://wiki.apertium.org/wiki/TSX_format
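
To give a feel for the format, here's a minimal, made-up TSX skeleton (the label names and tag patterns are just placeholders; the wiki pages above have real, complete examples):

<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal, made-up TSX skeleton: coarse tagger labels defined over the
     analyser's fine-grained tags, plus one forbidden label sequence. -->
<tagger name="xx">
  <tagset>
    <def-label name="NOM">
      <tags-item tags="n.*"/>
    </def-label>
    <def-label name="VBLEX">
      <tags-item tags="vblex.*"/>
    </def-label>
    <def-label name="DET">
      <tags-item tags="det.*"/>
    </def-label>
  </tagset>
  <forbid>
    <label-sequence>
      <label-item label="DET"/>
      <label-item label="VBLEX"/>
    </label-sequence>
  </forbid>
</tagger>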

Newer releases of CG (http://beta.visl.sdu.dk/cg3.html) have (partial) support for Apertium's stream format. CG is a much better general purpose tagger than Apertium's, but Apertium's is much faster. Again, the Apertium wiki has some information.

http://wiki.apertium.org/wiki/Constraint_Grammar http://wiki.apertium.org/wiki/Apertium_and_Constraint_Grammar

We also have some instructions for converting CG to TSX, for tagger training. With a good enough CG grammar, it should be possible to use the 'supervised training' mode of the tagger.

http://wiki.apertium.org/wiki/Constructing_a_TSX_file_with_a_Constraint_Grammar

We also need a bilingual dictionary. If one isn't available, we have tools to help construct it automatically: 'crossdics' (http://wiki.apertium.org/wiki/Crossdics), as I mentioned in my article, and ReTraTos (http://sourceforge.net/projects/retratos), which can build Apertium-format dictionaries from the same alignments generated by GIZA++ - the output of the latter should be manually checked, however, as it can produce many questionable entries, particularly for multiword expressions.

The need for the bilingual dictionary seemed a little strange to me at first, but Mikel, Apertium's BDFL, explained that it really helps to reduce bad alignments. This probably means that a-t-t can't generate rules for things like the Polish to English 'coraz piękniejsza' -> 'prettier and prettier', but I haven't checked that yet.
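
For what it's worth, Apertium bilingual dictionary entries are quite simple; a hand-written (and entirely made-up) fragment might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Made-up bilingual dictionary fragment: Spanish 'perro' (noun, masculine)
     paired with English 'dog' (noun). -->
<dictionary>
  <sdefs>
    <sdef n="n"/>
    <sdef n="m"/>
  </sdefs>
  <section id="main" type="standard">
    <e><p><l>perro<s n="n"/><s n="m"/></l><r>dog<s n="n"/></r></p></e>
  </section>
</dictionary>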

So far, these are all things that are necessary for the translator anyway. Next, there are two specific types of files that are required by a-t-t: an 'atx' file, which specifies lexicalised words, and two 'ptx' files. It should be possible to use the example .atx file that comes with a-t-t after just changing the language identifiers.

The .ptx files are used to specify 'mlu's - multiple lexical units. For Spanish, these are verbs with enclitic pronouns ('Dímelo' - 'Say it to me' - is analysed as: '^Dímelo/Decir<vblex><imp><p2><sg>+prpers<prn><enc><p1><mf><sg>+lo<prn><enc><p3><nt>/Decir<vblex><imp><p2><sg>+prpers<prn><enc><p1><mf><sg>+prpers<prn><enc><p3><m><sg>$'); in the other direction, "John's dog"[1] becomes "el perro de John". A simple ptx for Spanish would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<posttransfer>
<mlu>
  <lu tags="vblex.*"/>
  <lu tags="prn.enc.*"/>
  <lu tags="prn.enc.*"/>
</mlu>
</posttransfer>

and for English, like this:

<?xml version="1.0" encoding="UTF-8"?>
<posttransfer>
<mlu>
  <lu tags="n.*"/>
  <lu tags="gen.*"/>
</mlu>
</posttransfer>

Generally speaking[1], you can find the relevant tags for mlus by grepping for '<j/>' in the morphological analysers.
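
For example, a (schematic, heavily simplified) analyser entry for a form with enclitic pronouns might look like the fragment below; each <j/> shows up as a '+' in the lt-proc output:

<!-- Schematic monodix fragment (not copied from the real es analyser): a
     verb form with two enclitic pronouns attached via <j/>, which appear
     as '+' in the analysis. Real dictionaries do this with pardefs. -->
<e lm="decir">
  <p>
    <l>dímelo</l>
    <r>decir<s n="vblex"/><s n="imp"/><s n="p2"/><s n="sg"/><j/>prpers<s n="prn"/><s n="enc"/><s n="p1"/><s n="mf"/><s n="sg"/><j/>lo<s n="prn"/><s n="enc"/><s n="p3"/><s n="nt"/></r>
  </p>
</e>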

Finally(!), you need a modes file; the sample modes file can be used, substituting language abbreviations.
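
For reference, the modes file is just an XML description of the pipeline shown earlier; a chunk-mode entry looks roughly like this (the file names follow the es-en pair's conventions - substitute your own):

<?xml version="1.0" encoding="UTF-8"?>
<!-- Rough sketch of a modes.xml entry describing the es-en chunk-mode
     pipeline; file names follow the usual pair conventions. -->
<modes>
  <mode name="es-en" install="yes">
    <pipeline>
      <program name="lt-proc">
        <file name="es-en.automorf.bin"/>
      </program>
      <program name="apertium-tagger -g">
        <file name="es-en.prob"/>
      </program>
      <program name="apertium-pretransfer"/>
      <program name="apertium-transfer">
        <file name="apertium-es-en.es-en.t1x"/>
        <file name="es-en.t1x.bin"/>
        <file name="es-en.autobil.bin"/>
      </program>
      <program name="apertium-interchunk">
        <file name="apertium-es-en.es-en.t2x"/>
        <file name="es-en.t2x.bin"/>
      </program>
      <program name="apertium-postchunk">
        <file name="apertium-es-en.es-en.t3x"/>
        <file name="es-en.t3x.bin"/>
      </program>
      <program name="lt-proc -g">
        <file name="es-en.autogen.bin"/>
      </program>
      <!-- post-generation (the 'orthographic correction' step above) -->
      <program name="lt-proc -p">
        <file name="es-en.autopgen.bin"/>
      </program>
    </pipeline>
  </mode>
</modes>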

[1] The analysis of this is "^John/John<np><ant><m><sg>$^'s/'s<gen>$ ^dog/dog<n><sg>$" - the '+' is missing here because the analysis broke off at the non-alphabet character ("'").

> Cheers
>
> --arky
>
> Rakesh 'arky' Ambati
> Blog [ http://playingwithsid.blogspot.com ]




Jimmy O'Regan [joregan at gmail.com]


Fri, 11 Jul 2008 18:25:50 +0100

2008/7/11 Jimmy O'Regan <joregan@gmail.com>:

> So far, these are all things that are necessary for the translator
> anyway. Next, there are two specific types of files that are required
> by a-t-t: an 'atx' file, which specifies lexicalised words, and two
> 'ptx' files. It should be possible to use the example .atx file that
> comes with a-t-t after just changing the language identifiers. The
> .ptx files are used to specify 'mlu's - multiple lexical units.

Just to be clear: we consider this kind of 'multiword' to be different to things like 'once and for all' and 'Act of God'/'Acts of God', which are both handled inside the morphological analyser. a-t-t comes with a script to convert the spaces in these kinds of multiwords to underscores, so GIZA++ will treat them as single words (as we do).

Another thing I forgot to mention is that you need to make a couple of tweaks to GIZA++ before you can use it. It's best to use the version available here: http://code.google.com/p/giza-pp/ (because that actually compiles :)

In GIZA++-v2/Makefile change:

CFLAGS_OPT = $(CFLAGS) -O3 -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE

to:

CFLAGS_OPT = $(CFLAGS) -O3 -DNDEBUG -DWORDINDEX_WITH_4_BYTE

so that you can use the 'trainGIZA++.sh' script. In that script, you also need to change the following line (to be able to use it with Debian's csh, at least):

if( $# != 3 )

to:

if( $#argv != 3 )

