One challenge for enterprises that handle confidential information in cloud-based systems is that they must exercise due diligence to ensure that the information remains confidential. The steps are beyond the scope of a technical blog, but generally they involve making sure that everyone processing the confidential information understands that it is sensitive and has agreed to protect it.
For cloud services like Enterprise Resource Planning (ERP), Human Resources, Video Conferencing and so on, the confidentiality issues are very well understood, but there are exceptions like machine translation. When seeking to prevent confidential data from leaving a network, we rightly look primarily to malicious software (worms, viruses, customized zero-days from Advanced Persistent Threats (APTs), etc.).
Machine translation tools are an interesting member of the “other” category of legitimate tools that can result in confidential data leaks without malicious intent from user or developer. Machine translation tools range from simple web sites like “youdao” pictured above or Google Translate, where it is pretty clear that information is leaving, up to integrated desktop applications, where the movement of data is not nearly as obvious.
The Youdao Dictionary application is installed like any other and operates like any other, except that the translation engine is remote and the application sends its lookups in plain text via insecure HTTP GET requests. Because the translation tool is an application running on the user’s PC, the person using it is less likely to realize that they are leaking information: it appears that their own computer is doing the translation, not a web site.
In the above dissection of a URL retrieved by the tool, we see the word “information” being queried in the “q” field, but it could just as well be that someone isn't entirely sure what “Лечение герпеса Боба Джонса не будет хорошо” means, and would highlight it and click translate. That act results in the application generating something similar to the plaintext query above, except with that chunk of Russian. The user will then learn that the string translates to “Bob Jones' herpes treatment is not going well.” Unfortunately, both the request and the translation are transferred in plaintext, so anyone passively intercepting the traffic can read them.
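As a minimal sketch of why a GET-based lookup is readable by an observer, the Python below builds a URL that carries the highlighted text in a “q” field and then recovers it from the URL alone. The endpoint `translate.example.com/lookup` and both function names are hypothetical placeholders, not the host, path, or parameters the tool actually uses; only the “q” field name comes from the dissection above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint used for illustration only; the real host and
# path contacted by the tool are not reproduced here.
ENDPOINT = "http://translate.example.com/lookup"


def build_lookup_url(text: str) -> str:
    """Build a GET URL that carries the text to translate in the "q" field."""
    return ENDPOINT + "?" + urlencode({"q": text})


def recover_query(url: str) -> str:
    """Do what a passive observer can do: parse the URL and read "q" back."""
    return parse_qs(urlsplit(url).query)["q"][0]


if __name__ == "__main__":
    url = build_lookup_url("Лечение герпеса Боба Джонса не будет хорошо")
    print(url)                 # percent-encoded, but not encrypted
    print(recover_query(url))  # the original sentence, fully recovered
```

Percent-encoding only makes the text safe to place in a URL; it is not a form of protection, which is why the observer needs nothing beyond a standard URL parser.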
The application that we use as an example is from Youdao (有道), a major Chinese Internet company that, according to Wikipedia (http://en.wikipedia.org/wiki/Youdao), ships an offline and free online version of their translation tool. In some limited experimentation, Youdao's site did seem to support the same functionality over the more secure, encrypted HTTPS protocol. We have observed insecure communication in the wild for versions ranging from 2.2.16 to 5.4.43, but it would be unfair to discuss the tool without looking at the latest version. The latest version of the Youdao tool we could find, version 188.8.131.527, was downloaded from http://codown.youdao.com/cidian/static/6.3/20141203/YoudaoDict.exe and tested on a Windows 7 machine, and there was no significant difference in behavior.
Our test version also makes use of plaintext (HTTP) communication by default and appears to automatically translate whatever word is near the mouse pointer, whenever it stops moving, between Chinese and English. It also has an option where a small button appears that you can click (or hover over) to translate a highlighted piece of text. Having used the program, it is easy to imagine why this tool is popular with users who need to translate between Chinese and English. In addition to the translation features, it also keeps users from being bored by providing extra advertisements.
What the tool provides in features, it definitely does not provide in security – while it works as intended and does not appear to be up to anything overtly nefarious, it still sends all the translation requests via the insecure HTTP protocol to a back-end server where the translation takes place.
The conclusion for customers is simple: translation software might send data to networks / systems outside your realm of control – if it does, then exactly as would be the case for a cloud-based ERP or Human Resources system, it is important to know where it goes, how it gets there, and that the third parties processing the information do so in a manner that is compatible with your organization's policies and contractual obligations. Given that the messages to be translated are sent in clear text, anyone on the same network could easily intercept the communication by sniffing network traffic. Translated content could range from benign phrases to highly sensitive information.
The following experiment was performed to verify whether traffic is still passed in plaintext HTTP GET requests, as it was in previous versions. The setup is a fake letter being written in Notepad by an associate at the law firm of Nerd, Geek, and Spaz, LLP, who are defending a client who is being sued for some reason…
When the two lines were highlighted, a little blue book popped up and hovering over the book results in a translation being executed. That translation is actually performed on a remote server and the following URL is visited by the software:
For convenience, we look at the same URL after decoding it and converting to pretty-printed JSON:
We can see the variables broken apart more easily in the JSON version, and the sentence in our screenshot is clearly visible with “%20” replacing the spaces and “%0A%0D” replacing the end of line. When decoded, the following is the result:
Bill Jones is getting sued for some really embarassing
porn that was found on his work computer. Please advise
This is the exact content of the highlighted region of the Notepad application. Clearly, the fact that the firm cannot spell “embarassing” correctly could put some egg on their face, making this a potentially very damaging leak. The tool also passes information about the application where the translated text came from (which is indeed “notepad.exe”), version numbers, affiliate identifiers (presumably so that companies distributing the program can share in ad revenues), and other miscellaneous information.
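The decoding step described above takes one standard-library call. The sketch below is illustrative only: the `encoded` string is reconstructed from the decoded letter using the “%20” and “%0A%0D” substitutions mentioned in the text, and is not the captured query itself.

```python
from urllib.parse import unquote

# Reconstructed for illustration from the decoded letter; this is an
# assumption about the encoded form, not a copy of the captured request.
encoded = (
    "Bill%20Jones%20is%20getting%20sued%20for%20some%20really%20embarassing"
    "%0A%0D"
    "porn%20that%20was%20found%20on%20his%20work%20computer.%20Please%20advise"
)

# Percent-decoding turns %20 back into spaces and %0A%0D into "\n\r".
decoded = unquote(encoded)

# Split on the line-ending characters and drop the empty piece that the
# two-character line ending leaves between the lines.
lines = [line for line in decoded.splitlines() if line]

if __name__ == "__main__":
    for line in lines:
        print(line)
```

Anyone who captures the request can run the same decoding, since it needs no key or secret of any kind.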