...making Linux just a little more fun!

Transliterating Arabic

Ben Okopnik [ben at linuxgazette.net]


Mon, 21 Jan 2008 15:26:13 -0500

[[[ Discussion of UTF-8 problems in this thread has been split off to http://linuxgazette.net/147/misc/lg/problems_with_utf_8_over_smtp.html -- Kat ]]]

On Fri, Jan 18, 2008 at 08:43:12PM +0200, MNZ wrote:

> On Dec 30, 2007 5:50 PM, Ben Okopnik <ben@linuxgazette.net> wrote:
> > Hi, all -
> >
> > We've got somebody that just volunteered to translate bits and pieces of
> > LG into Arabic. Since he's working by himself, and is not a native
> > speaker (and since I can't read Arabic myself), does anyone here have
> > the ability to vet the stuff? It's at 'http://arlinux.110mb.com/lgazet/'.
> 
> Hi,
> I'm a native Arabic speaker. I can go through the translated text and
> check it but I'm terrible at actually typing Arabic and I'm not that
> good at it anyway. 

A week or two ago, I hacked up a cute little Latin-Russian (UTF8) converter (faking a few bits along the way, since the Russian alphabet is longer than the English one), so I thought "heck, I'll just adjust it so it can do Arabic - that'll give MNZ an easy way to do it." [laugh] I knew that it was written right-to-left - I could handle that bit - but having looked at the character set, as well as the whole initial/medial/final/isolated thing, I've concluded that I'd be crazy to even try.

> I'll help with the translation as much as I can. I'll
> start in a few days though because I have some exams right now.

That's great - just contact the project coordinator, and let me know if you guys need any help. Other than converters, of course. ;)

> PS: Is anyone doing an Esperanto translation? just wondering.....

LG's former editor, Mike Orr, is a one-man walking advert for the language - although he's not translating LG into it, AFAIK. You could always poke him about spreading the idea among his friends.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *



MNZ [mnzaki at gmail.com]


Tue, 22 Jan 2008 20:59:27 +0200

On Jan 21, 2008 10:26 PM, Ben Okopnik <ben@linuxgazette.net> wrote:

> A week or two ago, I hacked up a cute little Latin-Russian (UTF8)
> converter (faking a few bits along the way, since the Russian alphabet
> is longer than the English one), so I thought "heck, I'll just adjust it
> so it can do Arabic - that'll give MNZ an easy way to do it." [laugh] I

Ummm I don't understand what a Latin-Russian converter is exactly. Do you mean transliterate from Latin to Russian?

> knew that it was written right-to-left - I could handle that bit - but
> having looked at the character set, as well as the whole
> initial/medial/final/isolated thing, I've concluded that I'd be crazy to
> even try.

It's a crazy language...... did you know that the written Arabic has very little to do with spoken Arabic? Worse, spoken Arabic is almost completely different in different countries/regions. I actually wouldn't understand someone from the Gulf very well.

> > I'll help with the translation as much as I can. I'll
> > start in a few days though because I have some exams right now.
>
> That's great - just contact the project coordinator, and let me know if
> you guys need any help. Other than converters, of course. ;)

Ok, I'll start soon.

-- 
//MNZ\\



Ben Okopnik [ben at linuxgazette.net]


Tue, 22 Jan 2008 14:39:21 -0500

On Tue, Jan 22, 2008 at 08:59:27PM +0200, MNZ wrote:

> On Jan 21, 2008 10:26 PM, Ben Okopnik <ben@linuxgazette.net> wrote:
> > A week or two ago, I hacked up a cute little Latin-Russian (UTF8)
> > converter (faking a few bits along the way, since the Russian alphabet
> > is longer than the English one), so I thought "heck, I'll just adjust it
> > so it can do Arabic - that'll give MNZ an easy way to do it." [laugh] I
> 
> Ummm I don't understand what a Latin-Russian converter is exactly.
> Do you mean transliterate from Latin to Russian?

Latin character set (ISO-8859-1 and such) to Russian, yes.

Eh... I'll send this example, and hope the 8-bit stuff makes it through the mail. Basically, it's a list of character mappings, from an English keyboard to Russian; the script takes transliterated English text (that is, Russian words written using those English characters) as input and produces the "re-mapped" output. It can do this with a file that's specified as input, or with lines typed at the console if no file is provided. I find it very handy whenever I just need to type a quick bit of Russian - e.g., Google searches, a quote in a text file, etc.

ben@Tyr:~$ tsl2utf8 -h
Mappings:
 
A|А     B|Б     V|В     G|Г     D|Д     E|Е     J|Ж     Z|З
I|И     Y|Й     K|К     L|Л     M|М     N|Н     O|О     P|П
R|Р     S|С     T|Т     U|У     F|Ф     H|Х     C|Ц     X|Ч
1|Ш     2|Щ     3|Ъ     4|Ы     5|Ь     6|Э     7|Ю     8|Я
a|а     b|б     v|в     g|г     d|д     e|е     j|ж     z|з
i|и     y|й     k|к     l|л     m|м     n|н     o|о     p|п
r|р     s|с     t|т     u|у     f|ф     h|х     c|ц     x|ч
!|ш     @|щ     #|ъ     $|ы     %|ь     ^|э     &|ю     *|я
+|ё
 
ben@Tyr:~$ tsl2utf8
samovar
самовар
babu!ka
бабушка
7jno-^fiopskiy grax uv+l m$!% za hobot na s#ezd *@eric.
Южно-эфиопский [...]
(The last is a Russian pangram - i.e., a phrase that uses every letter of the language.)
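
The character-for-character replacement described above can be sketched in a few lines of Python. This is only an illustration, not the tsl2utf8 script itself; the names LATIN, CYRILLIC and transliterate are made up for the example, and the table is copied from the "-h" listing above.

```python
# A minimal sketch of 1-to-1 transliteration; NOT the original tsl2utf8 script.
# Keys and values follow the "tsl2utf8 -h" table shown in the message.
LATIN = (
    "ABVGDEJZIYKLMNOPRSTUFHCX12345678"
    "abvgdejziyklmnoprstufhcx!@#$%^&*"
    "+"
)
CYRILLIC = (
    "АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
    "абвгдежзийклмнопрстуфхцчшщъыьэюя"
    "ё"
)

# str.maketrans needs both strings to be the same length: one output
# character per input character.
TABLE = str.maketrans(LATIN, CYRILLIC)


def transliterate(text):
    """Replace every mapped character; leave spaces, '-', '.' etc. alone."""
    return text.translate(TABLE)


print(transliterate("samovar"))
print(transliterate("babu!ka"))
```

To mimic the "file or console" behaviour, one could loop over an opened file (or over sys.stdin when no file name is given) and write transliterate(line) for each line.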

> > knew that it was written right-to-left - I could handle that bit - but
> > having looked at the character set, as well as the whole
> > initial/medial/final/isolated thing, I've concluded that I'd be crazy to
> > even try.
> 
> It's a crazy language...... did you know that the written Arabic has
> very little to do with spoken Arabic? Worse, spoken Arabic is
> almost completely different in different countries/regions. I actually
> wouldn't understand someone from the Gulf very well.

Wow. I've been learning a little Chinese, and while the written ideographs ('hanzi') are the same all over (well, except for the difference between the simplified and the original versions), the pronunciation can vary so much that it's common for people to draw the word they're talking about with a finger, either in the air or on the palm of their hand.

Kat just mentioned that there's supposed to be a Modern Standard Arabic (MSA) that's a bit more, well, standardized. I can only imagine that it's greeted with a mixture of horror (from those who have to learn *yet another* version in addition to the ones they already know) and relief (from anyone who has to learn it from scratch.)

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




MNZ [mnzaki at gmail.com]


Tue, 22 Jan 2008 22:06:01 +0200

On Jan 22, 2008 9:39 PM, Ben Okopnik <ben@linuxgazette.net> wrote:

> > Ummm I don't understand what a Latin-Russian converter is exactly.
> > Do you mean transliterate from Latin to Russian?
>
> Latin character set (ISO-8859-1 and such) to Russian, yes.
>
> Eh... I'll send this example, and hope the 8-bit stuff makes it through
> the mail. Basically, it's a list of character mappings, from an English
> keyboard to Russian; the script takes transliterated English text (that
> is, Russian words written using those English characters) as input and
> produces the "re-mapped" output. It can do this with a file that's
> specified as input, or with lines typed at the console if no file is
> provided. I find it very handy whenever I just need to type a quick bit
> of Russian - e.g., Google searches, a quote in a text file, etc.
[ snip ]

Oh ok. I can imagine how useful that would be! No more hunting for the Arabic letters around the keyboard. Why is it difficult to do it with Arabic? The initial/medial/final thing is just about the shape of the letter (handled by the font, I think) but it's the same letter (same byte).
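
The point about shapes can be checked with Python's standard unicodedata module. The sketch below uses the letter Beh as an example; the helper name base_letter is made up for the illustration. Ordinary Arabic text stores one code point per letter (two bytes in UTF-8, strictly speaking, rather than one), and the renderer picks the shape. Unicode does contain separate legacy "presentation form" code points for each shape, but compatibility normalization (NFKC) folds them back to the base letter.

```python
import unicodedata

# The code point that is normally typed and stored for the letter Beh.
BEH = "\u0628"

# Legacy presentation-form code points for the four contextual shapes.
# Ordinary text does not need these; the font and shaping engine choose
# the right glyph for the plain letter.
BEH_SHAPES = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def base_letter(ch):
    """Fold a presentation form back to its base letter via NFKC."""
    return unicodedata.normalize("NFKC", ch)


for shape, form in BEH_SHAPES.items():
    print(shape, unicodedata.name(form), base_letter(form) == BEH)

# One letter, two bytes in UTF-8.
print(BEH.encode("utf-8"))
```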

> Wow. I've been learning a little Chinese, and while the written
> ideographs ('hanzi') are the same all over (well, except for the
> difference between the simplified and the original versions), the
> pronunciation can vary so much that it's common for people to draw the
> word they're talking with a finger, either in the air or on the palm of
> their hand.

Spoken Arabic is not just different in pronunciation! The words are different, completely! A whole set of vocab for each country/region.

> Kat just mentioned that there's supposed to be a Modern Standard Arabic
> (MSA) that's a bit more, well, standardized. I can only imagine that
> it's greeted with a mixture of horror (from those who have to learn *yet
> another* version in addition to the ones they already know) and relief
> (from anyone who has to learn it from scratch.)

Well I haven't heard about anything like that and I don't think people will be rushing to learn it :)

-- 
//MNZ\\



Ben Okopnik [ben at linuxgazette.net]


Thu, 24 Jan 2008 15:02:55 -0500

On Tue, Jan 22, 2008 at 10:06:01PM +0200, MNZ wrote:

> On Jan 22, 2008 9:39 PM, Ben Okopnik <ben@linuxgazette.net> wrote:
> > > Ummm I don't understand what a Latin-Russian converter is exactly.
> > > Do you mean transliterate from Latin to Russian?
> >
> > Latin character set (ISO-8859-1 and such) to Russian, yes.
> 
> Oh ok. I can imagine how useful that would be! No more hunting for
> the Arabic letters around the keyboard. Why is it difficult to do it
> with Arabic? The initial/medial/final thing is just about the shape of
> the letter (handled by the font, I think) but it's the same letter (same
> byte).

To get back to this - I'd be glad to do it, given your help. Can you come up with a reasonable 1-to-1 mapping? I.e., a parallel set of English characters and the hex values for the Arabic ones that would (at least mostly) make sense? I've looked at some of the existing ones, and most of them do a mix of one- and two-character keys, like

a/b/t/_t/j/7/.7/d/.d/r/z/s/^s/9/9./6/6./3/3./f/q/k/l/m/n/h/w/y/_a/T
and notes about handling, e.g., Alf, Waw, and Yaa in special ways when inside a word, and T at the end. That's a pain to program for if you don't understand a language; character-for-character replacement just requires a mapping.
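
To show why mixed one- and two-character keys are more work than a plain mapping, here is a rough Python sketch of the usual approach: at each position, try the longest key first. The four mappings are an assumption read off the alphabetical order of the key list above (t, _t, d, .d as Teh, Theh, Dal, Thal); they are not taken from any particular published scheme, and the names SCHEME and transliterate_multi are made up. The position-dependent rules for Alf, Waw, Yaa and final T are not handled at all.

```python
import unicodedata

# Hypothetical fragment of a mixed one/two-character scheme (an assumption,
# see the note above). Letters are looked up by their Unicode names.
SCHEME = {
    "t": unicodedata.lookup("ARABIC LETTER TEH"),
    "_t": unicodedata.lookup("ARABIC LETTER THEH"),
    "d": unicodedata.lookup("ARABIC LETTER DAL"),
    ".d": unicodedata.lookup("ARABIC LETTER THAL"),
}


def transliterate_multi(text, scheme=SCHEME):
    """Greedy longest-match replacement; unmapped characters pass through."""
    longest = max(len(key) for key in scheme)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible key at this position first, so that
        # "_t" wins over a lone "t" and ".d" over a lone "d".
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in scheme:
                out.append(scheme[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With single-character keys only, all of this collapses to one table lookup per character, which is why a 1-to-1 mapping is so much easier to program.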

In fact, that gives me an idea. I'm going to generalize that script so it can be used with any language - as long as you provide a map file that looks something like this:

# This is a mapfile for English to UTF-8 Russian.
'A' = "А"
'B' = "Б"
'V' = "В"
'G' = "Г"
'D' = "Д"
 
[etc.]
 
The format is very flexible; a line can be anything from "Z3" to "'Z' = '3'", and the script will happily chew it up. The documentation is built in, and can be read with "perldoc translit".
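
For readers who want a feel for how such a forgiving map file might be read, here is a small Python sketch. It is not the translit script (which is documented via perldoc) and its rules are assumptions: lines starting with '#' are treated as comments, the key is a single character, quotes around key or value are optional, and so is the '='. The name parse_mapfile is made up for the example.

```python
import re

# Assumed line shape: optional quote, ONE key character, matching quote,
# optional "=", optional quote, the replacement text, matching quote.
LINE = re.compile(r"""(['"]?)(.)\1\s*=?\s*(['"]?)(.+?)\3""")


def parse_mapfile(lines):
    """Build a {key: replacement} dict from loosely formatted map lines."""
    mapping = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' are skipped,
        # so a literal '#' key would have to be written in quotes.
        if not line or line.startswith("#"):
            continue
        match = LINE.fullmatch(line)
        if match is None:
            raise ValueError(f"line {number}: cannot parse {raw!r}")
        mapping[match.group(2)] = match.group(4)
    return mapping


sample = [
    "# This is a mapfile for English to UTF-8 Russian.",
    "'A' = \"А\"",
    "'B' = \"Б\"",
    "V В",
    "GГ",
]
table = str.maketrans(parse_mapfile(sample))
print("BAG".translate(table))
```

The resulting dictionary can be passed straight to str.maketrans, so the same per-character replacement works for any language that has a map file.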

I've just uploaded the tarball to my web server; it's available at

http://okopnik.com/temp/translit.tgz

and, once we publish the upcoming issue, will be permanently available from

http://linuxgazette.net/147/misc/tag/translit.tgz

Hopefully, that'll be useful to a lot of folks - and I'd really appreciate it if they sent me their mapfiles.

> > Kat just mentioned that there's supposed to be a Modern Standard Arabic
> > (MSA) that's a bit more, well, standardized. I can only imagine that
> > it's greeted with a mixture of horror (from those who have to learn *yet
> > another* version in addition to the ones they already know) and relief
> > (from anyone who has to learn it from scratch.)
> 
> Well I haven't heard about anything like that and I don't think people will
> be rushing to learn it :)

Supposedly, out of the 246 million speakers of Arabic worldwide, 206 million speak and understand it. :) I suspect that you know it, but not by that name.

  Modern Standard Arabic (MSA) is the literary standard across the Middle
  East and North Africa, and one of the official six languages of the
  United Nations. Most printed matter - including most books, newspapers,
  magazines, official documents, and reading primers for small children - is
  written in MSA [citation needed].
http://en.wikipedia.org/wiki/Literary_Arabic#Modern_Standard_Arabic

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
