[ssac-gnso-irdwg] FW: Questions about translation / transliteration of international addresses
- To: Ird <ssac-gnso-irdwg@xxxxxxxxx>
- Subject: [ssac-gnso-irdwg] FW: Questions about translation / transliteration of international addresses
- From: Steve Sheng <steve.sheng@xxxxxxxxx>
- Date: Mon, 21 Feb 2011 08:35:41 -0800
Dear IRD-WG,
Through an SSAC contact, staff reached out to Google's
internationalization team about the accuracy and cost of translation versus
transliteration of addresses, and for their general thoughts on the different
models. As you can imagine, Google does internationalization for all sorts of
applications and purposes. I have summarized Lara Rennie's comments below and
attached her full email.
* The first thing to consider is whether it makes sense to translate or
transliterate at all. That depends on how the address is used. At Google, the
assumption is that if something is in a script that is unknown to the user, it
makes sense to show it in Latin script as a suitable middle ground.
* For most cases, you probably don't want to translate an address at all. It
doesn't make much sense to change "Rue Arc-en-ciel" to "Rainbow Rd". The only
possibilities where this would be useful would be: a) for bilingual cities, but
even then respecting the language in which the user added it is probably a good
thing, or b) for cases where the script is hard to read for the user. But in
that case - transliteration may be sufficient, and a better idea than
translation. Or some combination of both of them - common words like "road"
translated, but the name only transliterated, even if it does mean something
like "flower".
* One way is to allow the user to enter the address in their language, [the
system] detect the script, and if it's not in Latin script already (note I say
Latin here, and not ASCII), then [the system can] transliterate and save a
(perhaps editable) Latin-script version too. (Perhaps allow the user to edit
that version - they might know better than auto-systems.)
* In terms of how to transliterate: ICU has freely available transliteration
tools (C++ and Java), based on CLDR, that you can use to auto-transliterate.
A demo is at http://demo.icu-project.org/icu-bin/translit.
* Alternatively, Google Translate would do a combination of
translate/transliterate for you - if there is a common name for the street in
English then it would use that, otherwise it would transliterate to Latin
script.
* Regarding transliteration engines: Yes, there are a few different systems,
because of things like French people developing one that makes more sense for
them and English-speakers finding something else more intuitive, or Chinese
characters having multiple pronunciations, and these changing.
http://cldr.unicode.org/index/cldr-spec/transliteration-guidelines has some
examples of different systems, and the different outputs. The default that CLDR
uses is probably fine for most cases.
Warm regards,
Steve
------ Forwarded Message
From: Lara Rennie <XXXXXX@xxxxxxxxxx>
Date: Wed, 16 Feb 2011 02:47:02 -0800
To: Steve Sheng <steve.sheng@xxxxxxxxx>
Subject: Re: Questions about translation / transliteration of international
addresses
Hi Steve,
Nice to hear from you! Warren did indeed warn me that I'd get an email from you.
First a few thoughts.
There are indeed some complications with viewing addresses in
different languages. The first thing is whether it makes sense to
translate or transliterate the language at all, and this depends on
the purpose of the address. In your case, the addresses will be either
postal or physical addresses. I imagine that they would be used for
things like: geocoding perhaps, sending postal mail.... perhaps other
uses?
Most people will know their address in one language, or perhaps in two
(in bilingual towns, for example, in parts of Sweden or Switzerland,
they would definitely be able to recite their address in both
languages.) Perhaps people in
China/Japan/Korea/Thailand/Russia/Bulgaria/etc will know the Latin
variant of their address, perhaps not. Indeed, most people can read
the Latin script - we make the assumption that if something is in a
script that is unknown to the user, it makes sense to show it in Latin
script as a suitable middle ground. This is also recognised by postal
systems. But even here - do we count Polish as being in Latin script,
since it has strange "l" characters with slashes through them? It's
definitely not ASCII. Is it the language, or the script that is most
important here? Should a Polish address use Warsaw, or Warszawa? I
would argue for the latter, since keeping the language of the address
consistent is probably worthwhile, and the Polish post system is
probably more accustomed to dealing with town-names in Polish
(although English would be ok, writing it in say Spanish might cause
problems).
So for most cases, you probably don't want to translate an address at
all. It doesn't make much sense to change "Rue Arc-en-ciel" to
"Rainbow Rd". The only possibilities where this would be useful would
be:
a) for bilingual cities, but even then respecting the language in
which the user added it is probably a good thing, or
b) for cases where the script is hard to read for the user. But in
that case - transliteration may be sufficient, and a better idea than
translation. Or some combination of both of them - common words like
"road" translated, but the name only transliterated, even if it does
mean something like "flower".
With this in mind, I would allow the user to enter the address in
their language, detect the script, and if it's not in Latin script
already (note I say Latin here, and not ASCII), then you can
transliterate and save a (perhaps editable) Latin-script version too.
(Perhaps allow the user to edit that version - they might know better
than auto-systems.)
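The flow described above can be sketched roughly as follows. This is a minimal illustration, not part of the original email: it assumes Python, approximates script detection by checking whether each letter's Unicode character name begins with "LATIN" (a heuristic, since Python's standard library does not expose the Unicode Script property), and uses hypothetical names (`is_latin_script`, `AddressRecord`, `make_record`). The transliteration step is passed in as a function so that any engine can be plugged in.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, Optional


def is_latin_script(text: str) -> bool:
    """Heuristic: True if every letter's Unicode name starts with 'LATIN'.

    Digits, spaces and punctuation are ignored. Note this tests for Latin
    script, not ASCII, so Polish letters such as 'Ł' count as Latin.
    """
    return all(
        unicodedata.name(ch, "").startswith("LATIN")
        for ch in text
        if ch.isalpha()
    )


@dataclass
class AddressRecord:
    original: str          # the address exactly as the user entered it
    latin: Optional[str]   # Latin-script version, or None if not needed
    latin_is_auto: bool    # True until the user has reviewed or edited it


def make_record(address: str,
                transliterate: Callable[[str], str]) -> AddressRecord:
    """Keep the original and add a Latin-script version only when needed."""
    if is_latin_script(address):
        return AddressRecord(address, None, False)
    return AddressRecord(address, transliterate(address), True)
```

Keeping `latin_is_auto` alongside the stored value is one way to let the user correct the automatic result later, as suggested above.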
In terms of how to transliterate (q2):
ICU has transliteration tools freely available (C++ and Java), that
you can use to auto-transliterate, based on CLDR -
http://demo.icu-project.org/icu-bin/translit has a demo.
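As a rough illustration of calling ICU for this, here is a minimal sketch. It assumes Python with the third-party PyICU binding installed (the email itself only mentions the C++ and Java libraries), and it uses ICU's "Any-Latin" transform ID. The helper name `to_latin` is hypothetical, and the function simply returns None when PyICU is not available.

```python
from typing import Optional

try:
    # PyICU is a third-party binding to the ICU C++ library (an assumption
    # here; it is not part of the Python standard library).
    from icu import Transliterator
except ImportError:
    Transliterator = None


def to_latin(text: str) -> Optional[str]:
    """Transliterate text to Latin script with ICU's 'Any-Latin' transform.

    Returns None when PyICU is not installed.
    """
    if Transliterator is None:
        return None
    return Transliterator.createInstance("Any-Latin").transliterate(text)
```

The result may contain diacritics, which fits the Latin-versus-ASCII distinction made earlier in this message.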
Alternately Google Translate would do a combination of
translate/transliterate for you - if there is a common name for the
street in English then it would use that, otherwise it would
transliterate to Latin script.
E.g. take this Chinese address: 中国北京市东城区前门东大街3号首都大酒店A座20层 and paste
into Google Translate. It gets a bit confused since it's all in one
line, but when you split it up like this:
中国
北京市
东城区前门东大街3号首都大酒店A座20层
it does a pretty good job and gives this:
China
Beijing
Qianmen East Street, Dongcheng District, Capital Hotel No. 3, Block A 20-storey
Note this is a combo of translate/transliterate. I would label this
"en" rather than saying it's Chinese in Latin script.
See here for info on using this through an AJAX API:
http://translate.google.com/about/intl/en_ALL/tour.html#professional
There's also the possibility of using the Maps API and querying maps
for the address in English. This will fail for cases where maps can't
find the address though. See
http://code.google.com/intl/en/apis/maps/index.html
All these are free :) So neither translation nor transliteration of an
address needs to be costly.
q3) Regarding transliteration engines: Yes, there are a few different
systems, because of things like French people developing one that
makes more sense for them and English-speakers finding something else
more intuitive, or Chinese characters having multiple pronunciations,
and these changing.
http://cldr.unicode.org/index/cldr-spec/transliteration-guidelines has
some examples of different systems, and the different outputs. The
default that CLDR uses is probably fine for most cases.
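To make the point about differing systems concrete, here is a small sketch that is not part of the original email. It assumes Python and uses three hand-written, deliberately tiny mapping tables covering only the letters of one Russian surname; the table names are hypothetical labels for an English-oriented, a French-oriented and an ISO 9 style convention, and they are nowhere near complete implementations of any standard.

```python
# Toy tables: only the five letters needed for the example word.
TABLES = {
    "english_style": {"ч": "ch", "е": "e", "х": "kh", "о": "o", "в": "v"},
    "french_style": {"ч": "tch", "е": "e", "х": "kh", "о": "o", "в": "v"},
    "iso9_style": {"ч": "č", "е": "e", "х": "h", "о": "o", "в": "v"},
}


def transliterate(text: str, table: dict) -> str:
    """Replace each character via the table, preserving initial capitals.

    Characters missing from the table are passed through unchanged.
    """
    out = []
    for ch in text:
        repl = table.get(ch.lower(), ch)
        out.append(repl.capitalize() if ch.isupper() else repl)
    return "".join(out)


for name, table in TABLES.items():
    print(name, transliterate("Чехов", table))
```

The same input yields "Chekhov", "Tchekhov" and "Čehov" respectively, so none of them is more "accurate" in the abstract; what matters is which convention the intended readers expect.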
Hope this makes sense! Feel free to reach out if you have any more questions :)
Lara
2011/2/15 Steve Sheng <steve.sheng@xxxxxxxxx>:
> Hi Lara,
>
> Greetings! My name is Steve Sheng, and I work for the Internet Corporation for
> Assigned Names and Numbers (ICANN). Warren Kumari forwarded your contact
> information.
>
> We are engaged in a policy development effort regarding the
> internationalization of domain registration data. One core piece of that
> information is the name and postal address of the person or organization
> responsible for a domain name.
>
> As you can imagine, traditionally all this information has been in US-ASCII,
> as Internet users have predominantly understood English. Going forward, that
> might not be a safe assumption, and there may be a need to allow different
> scripts in the composition of this data.
>
> When considering different models to internationalize the contact
> information, we are guided by three principles:
>
> 1) Avoid the “Tower of Babel” effect. We want this information to be
> accessible and useful to people around the world even though they do not
> speak the same language.
>
> 2) Balance between cost and usability of the registration data. Translation
> would have the greatest usability, but it may also be costly.
>
> 3) Consider the needs of human users as well as legitimate automation
> (applications that parse and analyze registration data).
>
> With these in mind, I want to ask you a few questions:
>
> 1) Is it true that translation of an address is more costly than the
> transliteration of an address?
>
> 2) Are there tools or systems available that registrars could use to do the
> translation or transliteration of the addresses for them? (Note, this is
> more like a universal transliteration engine or translation engine.)
>
> 3) As I understand it, for the same address, different transliteration /
> translation engines could give different results. Is this the case? If so,
> how does one know which result is more accurate than the other?
>
> These are some of my questions. I would appreciate it if you could advise us
> with your expertise on this important Internet policy matter. For your
> background, I have attached our latest slides as well.
>
>
> Warm regards,
> Steve
------ End of Forwarded Message