Machine Translation, SystranPRO, and Localization

Dateline: January 18, 1998

I was turned on to machine translation shortly before Christmas, when I discovered that AltaVista had added free, instantaneous translation of the foreign-language web pages its searches turn up. I dashed off a quick review of a related service from Systran Software, maker of the machine translation software used by AltaVista.

Systran provided a trial copy of its flagship product, SystranPRO, so I could do a more thorough review. As I started to write it, I realized it would be helpful and appropriate, and perhaps even necessary, to provide some notes about machine translation in general, and to add a section on the topic of "localization" (or "localisation," as the British spell it), a context within which machine translation is increasingly being deployed.

Machine Translation

Machine translation is the translation by computer of texts from one human language into another. Translation itself, whether performed by machine or by human, is not an exact science. There can be no such thing as a perfect translation. One of the chief reasons is that to translate the exact meaning of a source text one would need, like Borges' Pierre Menard, to have exactly the same experience, worldview, and mood--in short, the same mind--as the author had at the time(s) he or she wrote the original, and that is impossible. All translations are but approximations of the meaning of the original. Lesson number one, therefore, is not to expect perfection in translation.

Translation that fails to meet your expectations may be useless, or worse--it may be misleading and damaging. If your expectations are not based on an understanding of the translation process (and not merely of the translator's--human or machine's--part in that process), then they are probably too high and you are doomed to disappointment. Understanding the process lets you intervene in it to help ensure that your expectations are, in fact, met. Translation is not a black box, with untouched source material going in at one end and emerging as polished translation at the other.

The first step in the process is that someone produces a source document that needs translating. If it is a new document and the author or an editor is available to make changes to it, then the translator has a better chance of meeting expectations for a high-quality translation. The more difficult, complex, ungrammatical, obscure, idiomatic, and ambiguous the language of the original, the harder it is for the translator to do a good job. An author or editor can, however, intervene in the process by revising the original to make it less complex, obscure, etc., and thus improve the output from the translator. The GIGO Factor (garbage in, garbage out) looms large in machine translation, and for the time being not only Shakespeare and Joyce but even your own natural literary efforts are garbage to a machine.

The second step is to decide what sort of translation it is to be, based on who the audience is and what its needs are. Is it a technical document aimed at technicians, or a technical document aimed at a lay audience (a help file for a computer program, for instance)? Is technical terminology in the original intended to be used consistently with other occurrences of the same terminology in previous or intended future documents? Is the source document a work of literature in which the mood and nuances of the original must be translated along with the surface meaning of the words? Is it an international peace treaty, requiring translation into a dignified, formal, unequivocal expression in the target language, or is it a web page for teenagers, requiring an irreverent, colloquial, informal, libido-laden form of the target language? Is the translation intended to give the audience the complete, detailed meaning of the original, or just the gist? If just the gist, is that the end of it or is the intention to let the audience decide from reading the gist whether they want to see a more detailed translation?

The third step, particularly in the case of translations of a scientific or technical nature, is to create a dictionary, or a thesaurus, of terms for which you want to ensure consistent translation. This is known as terminology management. For even more accurate and faithful translation, the third step can go further, to include the creation of a sublanguage, a simplified language in which grammar, syntax and terminology are pre-specified so as to be easily handled by the machine. Software in the form of translation memory programs is available to help you create a sublanguage. By writing source documents in the sublanguage, the difficulties of dealing with the intricacies, anomalies, and above all the ambiguities of natural language are sidestepped. It will not result in great literature, but who (except, perhaps, me) would want their computer repair manual, jet engine installation instructions, or space shuttle re-entry procedures to read like Shakespeare?
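To make the idea of a sublanguage concrete, here is a minimal sketch in Python of a checker that flags sentences falling outside a pre-specified vocabulary and length limit. The word list, the 12-word limit, and the function name are all invented for illustration; a real sublanguage would also pin down grammar and syntax, which this sketch does not attempt.

```python
import re

# Hypothetical sublanguage rules, invented for illustration only.
APPROVED_WORDS = {
    "turn", "off", "the", "power", "remove", "cover",
    "before", "you", "replace", "fan",
}
MAX_WORDS_PER_SENTENCE = 12


def check_sentence(sentence, approved=APPROVED_WORDS, max_words=MAX_WORDS_PER_SENTENCE):
    """Return a list of problems that put the sentence outside the sublanguage."""
    # Lowercase the sentence and pull out the words, ignoring punctuation.
    words = re.findall(r"[a-z']+", sentence.lower())
    problems = []
    if len(words) > max_words:
        problems.append(f"too long: {len(words)} words (limit {max_words})")
    for word in words:
        if word not in approved:
            problems.append(f"unapproved word: {word}")
    return problems
```

An author would run each sentence of a draft through check_sentence() and revise until the list of problems comes back empty, which is the kind of intervention in the source document described above.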

As you can see, there are lots of issues to consider before you begin to translate, and if at all possible before you even begin to prepare a source document. It does not take long to consider the issues, and time spent preparing a special glossary of terms for the translator is time well spent. There are other things you can do both to the source document and to the output translation to handle such issues and ensure a result that meets reasonable expectations, but if you have not thought about the issues and have not specified your expectations and your plan for meeting them, you may be disappointed by MT and give up on it too soon. Even a human translator needs some direction from the client.

What will giving up too soon cost you? That depends on your expectations for (1) quality and (2) quantity, and as always, it's a tradeoff. The higher the literary quality you demand, the more you must employ the talents of human translators alongside or in place of the machine, and people are expensive and slow. If the quantity is not much, that may not matter. If, on the other hand, you produce voluminous technical or marketing documentation in support of products you sell abroad, then not availing yourself of a fast and inexpensive translator--a machine (more specifically, MT software to run on your existing PC)--will increase your costs.

At anything above the most basic "gist" or "indicative" level, where you just want a translation to give a rough idea or indication of the topic and content of the source document, machine translation must be accompanied by some level of human translator involvement. This will remain the case for at least several years if not decades. The algorithms we have devised for machine translation are nowhere near as sophisticated as the algorithms nature has devised for human translation. No matter how much you refine and train your MT program--adding to its custom dictionary and translation memory, or its list of grammatical rules, for example--current MT algorithms reach a sort of quality barrier beyond which they cannot go. With faster processors they may translate faster, but not better.

The use of sublanguages is a temporary cop-out, useful in restricted domains but by definition unsuitable for full-blown natural language. To handle the ambiguities and inconsistencies of the latter will require full-blown artificial intelligence--a Machina sapiens that is as aware as you and me of the hidden context and subconscious assumptions we use all the time in natural speech and writing, and as capable of dealing with them.

SystranPRO (Japanese/English module)

Installing SystranPRO from CD-ROM is not too hard. Setup asks you to select from a list of all the available language pairs (English/French, English/Spanish, English/Japanese, etc.), then it installs them on your machine. Since each pair takes up about 15MB of disk space, you should not load pairs you are not going to use and are not going to license. (You do have the choice of loading the language pairs from CD-ROM each time you use the program, to save disk space.)

Having loaded the appropriate language pair(s), you must then call Systran and tell them the serial number on the SystranPRO package to obtain a "license key," a number that will "unlock" the program for the particular language pair(s) you decide to purchase. A menu option for "License" lets you enter the number Systran gives you. In my case, since Systran only sent me a 30-day trial version, the software will lock up again after 30 days.

To translate between English and Japanese, you need software that will display Japanese and Chinese characters. (A Japanese document can make use of as many as four scripts: "hiragana," the native Japanese syllabary; "katakana," a second syllabary used mainly for foreign words adopted into the Japanese language; "kanji," Chinese characters used in place of hiragana for many words; and "romaji," the Roman alphabet, which is used, for example, for western company names.)
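As a rough illustration of how the four scripts mix in a single document, the following Python sketch sorts each character into hiragana, katakana, kanji, or romaji. It assumes the text is held as Unicode (not Shift-JIS bytes) and uses only the main Unicode block for each script; rarer characters outside those blocks fall into "other." The function names are my own.

```python
from collections import Counter


def script_of(ch):
    """Classify one character by the Unicode block it falls in."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block (includes the long-vowel mark)
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (main block only)
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "romaji"
    return "other"


def count_scripts(text):
    """Count how many characters of each script appear in text."""
    return Counter(script_of(ch) for ch in text)
```

Calling count_scripts() on a short phrase that names a western company in the middle of Japanese text would typically report all four scripts at once.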

There are several programs available for displaying Japanese (and Chinese and Korean) characters on a Windows PC. Systran provides a trial copy of UnionWay on the SystranPRO CD-ROM. Having installed it, you must then connect to UnionWay's website for its 30-day trial license key. If this all sounds like it's getting to be a bit of a pain, that's because it is.

None of the Japanese display programs I've set up in the past (including Twinbridge's AsianViewer and Japanese Partner) is very intuitive. UnionWay is no exception. But once you've figured out or (like me) stumbled blindly upon the correct settings, then they work pretty well. Once invoked, UnionWay sits in memory and intercepts any text that contains the particular language code you have selected in setup (in the case of Japanese, there are several choices of code--I use one called "Shift-JIS" since it seems to be the most common). Having intercepted the code, UnionWay will then display the text on screen (you can print it, too) in its own Japanese font, which it loads into your computer during setup. (If you want more choices of font style you have to pay for them.) A most annoying bug is that in MS Word at least (but not in Netscape Navigator Gold editor) UnionWay intercepts some punctuation characters such as quotation marks and em and en dashes and displays them as Chinese characters, so if you pull up English text in Word with UnionWay running in memory it is messed up. It doesn't do any permanent harm--things return to normal as soon as you close UnionWay--but it doesn't do any good, either, and needs to be fixed by someone--Microsoft or UnionWay. (Umm, actually, there is a deeper problem, and again it's unclear whether it's Microsoft or UnionWay at fault: When running Word and UnionWay together, my entire machine has locked up on several occasions. I am running Windows NT and not the notoriously crash-prone Windows 95, and until I started messing with UnionWay and Word together, my machine never locked up.)
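One plausible mechanism for the punctuation problem, offered as an assumption and not as a diagnosis of UnionWay or Word, is that curly quotes and dashes are stored as single bytes above 0x7F in the Western Windows-1252 code page, while Shift-JIS treats many of those same bytes as the first half of a two-byte Japanese character. The Python sketch below reproduces that effect; the function names are hypothetical.

```python
def misread_as_shift_jis(text):
    """Encode text as Windows-1252, then decode the bytes as if they were Shift-JIS."""
    raw = text.encode("cp1252")
    # errors="replace" keeps going when a byte pair is not valid Shift-JIS.
    return raw.decode("shift_jis", errors="replace")


def contains_kanji(text):
    """True if any character lies in the main CJK Unified Ideographs block."""
    return any(0x4E00 <= ord(ch) <= 0x9FFF for ch in text)


# A left curly quote (byte 0x93 in Windows-1252) followed by "Hi".
sample = "\u201cHi"
garbled = misread_as_shift_jis(sample)

# The quote byte and the "H" that follows are read together as one
# two-byte character, so the opening quote and the "H" both disappear.
print(len(sample), len(garbled), contains_kanji(garbled))
```

Plain ASCII letters pass through unchanged, which would be consistent with ordinary English text looking fine while only the "smart" punctuation is mangled.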

If I were Systran, I would move Heaven and Earth to have Shift-JIS and Japanese/Chinese fonts built into SystranPRO. It would make setup and use much easier for the majority of its customers, who just want documents translated, not an obstacle course in an obscure area of computer science.

We got there in the end, and it was time to put SystranPRO through its paces.

Review method

After getting the software up and running, I translated several English-language files, including the preface from my book manuscript, into Japanese, and had my wife look over the results. I also visited several Japanese newspapers on the Web, cut-n-pasted the text of various stories into SystranPRO, and hit the Translate button. Yep, that's all it takes. I can't show you all the translations I did (well, OK, I could, but it would make this a huge document), so what follows is only a sample. Please be assured that my conclusions are based on a wider test than that described below.

For the English to Japanese test, I selected text from Japanese for Today (1973, Gakken Co. Ltd., Tokyo). This is a "teach-yourself-Japanese" book for English-speaking beginners, so while being perfectly grammatical in both languages, the text is more straightforward and less complex than you would find in a typical book, newspaper, technical document, business document, or website.

For the Japanese to English test, I selected text from the Yomiuri Shinbun (Japan's biggest daily newspaper). I could have entered simple Japanese text, using UnionWay, from Japanese for Today, but this would have been a real pain, for reasons this is not the place to go into, and in any case I judged it important to give you a sense of how SystranPRO works in the real world of natural writing and not just in kindergarten.

My Japanese wife helped me review the translations. (My other wives brought me beer and mopped my fevered brow.)

Note: To view the Japanese and Chinese characters in the following translations, you must have UnionWay or some other Asian character display program running and set to Shift-JIS. Without that, you'll just see blocks of garbage characters where the Japanese should be. As it is, because I first wrote the article in MS Word, and because of the display problems mentioned above, the Japanese text you see here contains errors that were not present during the translation and did not affect it.

English-Japanese

Here's what I put in (adapted slightly from Japanese for Today, page 241):

And here's what came out:

To check the accuracy of the translation, I showed my wife the Japanese text and asked for her extempore translation. It came out quite close to the original English for the first section, but was off base on the second. She said the Japanese was very poor, but understandable.

Japanese-English

Here's what I put in (from the Yomiuri Shinbun, January 18, 1998):

And here's what came out:

If you say this is close to gibberish, I would understand, but I would not agree. There is the gist of a story in it. I probably have considerable advantages over other readers: I am used to the Japanese way of expressing things in English (I have a Japanese wife), and I once managed a team of Chinese investigators and a translator. The investigators wrote reports in Chinese, and the translator scribbled out translations in what we called "Chinglish," a form of English not much different from what you see above. I had to make sense of it all and put it into good English. So even a human translator, unless s/he is genuinely bilingual, is unlikely to do much better than the machine.

To me, the ability to browse the Yomiuri and get the gist of Japanese news is amazing and useful, and I know the quality will improve over time.

SystranPRO's Japanese translation does not seem nearly as good as its French translation (see previous article). This may be in part a reflection of the much greater fundamental difference between English and Japanese than between English and French, which share common philological roots. But I suspect it is mostly a reflection of the fact that Systran has not worked as long on the Japanese version, which is therefore less "mature." If so, it is only a matter of time and resources before the translations improve.

For gist or indicative translation of raw natural language, unedited to assist in the translation process, SystranPRO in my subjective opinion is good enough to be useful. I could read Japanese newspapers and magazines on the Internet and at least get an idea of the articles therein, and my wife could do likewise for English text. But what I have demonstrated above is what I advocate strongly against when it comes to professional translation: the black box approach--sticking something in one end and seeing what comes out the other.

For professional translation of business or technical documents, it clearly would be essential to use SystranPRO in conjunction with translation memory software and with careful attention to source document preparation. Systran evidently understands its own industry, and has equipped SystranPRO with the tools for terminology management and translation memory.

Systran is a major player in the MT business, as its adoption by AltaVista, the European Commission, Ford, and other big customers shows, but it is not the only game in town. I have not sampled any other MT program, so readers who are seriously interested in using MT software might want to do some research. Systran provided me with a reprint of a comparative review, published in French magazine InfoPC in March 1997, pitting SystranPRO against Globalink Telegraph. SystranPRO rated best overall and better on accuracy and speed, while Telegraph was ahead on interface (you can use it directly within Microsoft Word, for example) and on "personalization," meaning that it would allow you not only to add and modify special dictionaries (both programs do that) but also to edit the rules of grammar. It seems to me this ability would be essential if you want to go the sublanguage route.

Localization: Think Global, Act Local

Localization is the translation and adaptation of software, documentation, multimedia and web sites to other languages. Notice the word "adaptation." It is a complete process whereby a product such as an interactive computer game is not only translated with high technical and idiomatic fidelity into a foreign language, but also given the packaging, documentation, license agreements, warranties, and other cultural and legal equivalents appropriate to the target country's commercial and cultural environment.

With the North American software market beginning to mature and with huge opportunities opening up in virgin markets of Asia and South America, it is imperative for software publishers to localize their products if they want to win market share overseas. Localization company SDL International, which boasts Computer Associates, Corel, Seagate, Disney Interactive, Sony Computer Entertainment, and Hewlett-Packard among its clients, claims there is 30% annual growth in the localization business. SDL itself claims to be growing by about 50% per annum, and projects that the market will be worth $1.5 billion by the year 2000. To meet its growth targets and to position itself for a public offering or perhaps a merger or buyout, the five-year-old company has been on a venture capital-financed purchasing spree.

Its acquisitions include Polylang Ltd., Amphion Solutions Ltd., and Amptran, a translation memory software system. Polylang added experience in the localization of Internet services to SDL's suite of services. The resulting complement of translation, website development, software engineering, audio/video recording and post production, graphic design, and desktop publishing services for creating localized software, documentation, multimedia and web sites is typical of what a full-blown localizer needs to offer these days. Just being a translation house is part of the solution, but it is not the whole solution for businesses wanting to globalize.

Yahoo lists 69 localization companies as of January 1998. From their brief descriptions it looks as though many of them may simply be translation houses, but they are still a part of--and their inclusion indicates the size and growth in--the localization business. And since Yahoo cannot be relied upon to list all resources correctly or in some cases at all, its listing is just part of the story. For example, a major but unlisted player in localization is Bowne Internet Solutions, a subsidiary of the world's biggest printer of financial documents. At 220 years of age and with 56 offices all over the world plus an awesome customer base of large companies for which it provides not just financial printing but also systems integration and Internet/intranet/extranet services, it is not surprising that localization is one of the Bowne family of companies' key strengths and that this strength is somewhat buried in the diversity of its services.

Translation Memory

I've mentioned "translation memory" several times. What exactly is it?

On the one hand, it is jargon sprinkled (with many other regrettable samples, like "conversion of extended characters into appropriate escape sequences, automatic substitution of Translation Memory segments," and "fully double-byte enabled") liberally around the SDL website with little in the way of explanation for the non-technical visitor. (The website itself is a poor advertisement for their service, and it could use some attention to localization for native, English-speaking visitors. It makes use of cumbersome and passé frames, and has garish blinking icons that went out of style 200 years ago in the dog years by which the development of the Internet is measured.) On the other hand, Amptran is actually a rather pragmatic, useful tool that helps ensure "perfect" and consistent translation of words, phrases, and even whole sentences in a context where such words, phrases, or sentences are used consistently to mean exactly the same thing every time. In other words, it helps you create a sublanguage.

To illustrate, let's suppose you are the CEO of Short Ass Inc., a miniature horse stud farm, and the catchy slogan that forms part of your trademark is "Don't get on your high horse!" Your trademark appears on every Web page, every print brochure, every TV and radio spot, every newspaper ad, every price list, every quotation etc. etc. you prepare. And you're about to launch a major export drive into Hong Kong, which is too crowded for big horses, an opportunity no one else has spotted.

Now, I happen to know, because I've tried it, that if you say "Don't get on your high horse!" literally in Cantonese to a Cantonese native, said native will quickly assume one of two expressions: blank or puzzled. But your machine translation program is not (yet) smart enough to recognize this sentence as a metaphor and will translate it literally. So you turn first to a bilingual English/Cantonese human translator and get the correct idiomatic translation, X, for the metaphor. Then you turn on Amptran's translation memory and tell it to remember that henceforth "Don't get on your high horse!" is to be translated as X, not as "Don't get on your high horse!" Get it?

The chances are you would never use "Don't get on your high horse!" in any other context or with its literal meaning in your Short Ass documents, and you are not James Joyce, so you are not losing anything by forcing a rigid translation for the phrase. And this happens to be true for many business and technical domains: there are many phrases and sentences that might be incomprehensible to most MT programs, so you save an awful lot of trouble by committing the preferred translation to memory. "Using Amptran," says SDL's blurb, "means that you never need to translate the same text more than once." It also means a saving on human translators, since you don't need to pay over and over for translation work already done and paid for.
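The Short Ass example boils down to "look in the memory first, and only bother the machine if nothing is stored." Here is a minimal Python sketch of that idea. It is not Amptran's interface, which I have not seen described in any detail; the class, its methods, and the stand-in fake_mt function are all hypothetical, and the matching is exact-phrase only (ignoring case and extra spaces).

```python
class TranslationMemory:
    """A toy translation memory: exact-match lookup with a fallback translator."""

    def __init__(self):
        self._segments = {}

    @staticmethod
    def _key(segment):
        # Normalize whitespace and case so trivial differences still match.
        return " ".join(segment.split()).lower()

    def remember(self, source, target):
        """Store the approved translation for a source segment."""
        self._segments[self._key(source)] = target

    def translate(self, segment, fallback):
        """Return the stored translation if there is one, else call fallback."""
        key = self._key(segment)
        if key in self._segments:
            return self._segments[key]
        return fallback(segment)


def fake_mt(segment):
    """Stand-in for a real MT engine; it only tags the text."""
    return f"[MT] {segment}"
```

You would call remember() once with the slogan and the human translator's X, and from then on translate() hands back X every time the slogan turns up, passing everything else to the machine.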

Closing Remarks

This site is about artificial intelligence, not about translation or localization. However, AI is already playing a part in machine translation and in translation memory/sublanguage creation, and MT has truly global impacts, so I hope you have borne with me thus far. AI will in due course have a lot more to contribute toward the goal of high-quality translation of natural language.

There is a deeply intriguing related topic, machine interpretation, or machine translation added to automatic speech recognition software, which I hope to cover at a future date. Meantime, if you have any comments on this article, I'd be delighted to hear from you.

Until next week,

NEXT WEEK: This article took a lot of work, and I haven't had time to think about next week! To be determined.
