ChatGPT and Privacy: What Data Is Used to Train ChatGPT?

The old adage ‘If it’s free – you’re the product’ comes to mind immediately.

At VEQTA Translations, we coordinate a large number of projects and languages, and over the years we have accumulated a wealth of localization industry knowledge. When a client enquired whether we could source translators who speak ‘Northern Pwo’, I decided to research the language and its script using ChatGPT 4, as this is not a language we translate every day; we rarely receive localization requests for minority languages.

As expected, my exchange with ChatGPT put an encyclopedia of knowledge at my fingertips.

We exchanged specific information about the challenges of localizing the language, and ChatGPT commented ‘excellent point’ before offering a new piece of information. When our chat was done, ChatGPT signed off with “Thank you, I have learnt a lot chatting with you”.

Of course, I knew that general-purpose large language models (LLMs) such as ChatGPT are trained on web data, but this comment forced me to think about privacy. While I hadn’t exactly revealed Coca-Cola’s secret recipe, ChatGPT had evidently learned something from our approach and from what we felt worked and didn’t work, judging by its shopping list of bulleted answers. The information I shared included my ideas on deploying Subject Matter Experts in the localization project life cycle and strategies pertaining to Quality Assurance. Now, much of what constitutes localization best practice is certainly no secret; on the contrary, it’s our ambition to educate clients on best practices so they can avoid common localization pitfalls. Nevertheless, ChatGPT’s parting comment felt eerie.

Let’s therefore delve further into this hotly debated topic.

It’s a contested subject because the datasets used to train large LLMs are scraped from the web. As a result, LLM technologies such as the always-online ChatGPT cannot be used by enterprises where data security and privacy are of paramount importance.

Intellectual Property

It was perhaps inevitable that the issue of intellectual property would come to the fore, given how the OpenAI models are trained. OpenAI will continue to test the legal boundaries of intellectual property and copyright law, as one can read here: OpenAI is being sued for allegedly violating copyright law in a class action lawsuit. The claim is that AI reproducing open-source code scraped from the web constitutes copyright infringement on a massive scale. As a Language Service Provider, VEQTA has several clients operating in the cyber intelligence field, where proprietary information is at stake and where no chances can be taken with copyright and intellectual property.

Privacy Concerns, Data Integrity and Non-Disclosure

Non-disclosure of content is of paramount importance for most major commercial companies.
There is a good reason why NDAs often read ‘The parties acknowledge that monetary damages may not be a sufficient remedy for unauthorized disclosure of Confidential Information’: disclosure is a principal concern. As a language service provider, we are frequently asked to sign Non-Disclosure Agreements and to share our data policies, especially when dealing with our European-based clients, who must comply with the General Data Protection Regulation (GDPR).

Even for clients where no NDA exists, or who are outside Europe, the understanding is always that we do not release any content shared with us into the open domain. Yet that is exactly what ChatGPT does, even if you personally can’t access the content afterwards.

Take, for example, the competitive nature of tender proposals and marketing pitches for multibillion-dollar land development plans. The data may be sensitive enough to affect stock prices if leaked, and if it is leaked and scooped up by the competition, it can undermine a proposal at the last minute.
Would you trust uploading such content to ChatGPT for a real-time online translation?

Sharing information with any party outside the company violates most NDAs. Doing so therefore risks jeopardizing data integrity and confidentiality, leading to potential breaches of non-disclosure agreements and compromising the trust between clients and service providers.

Deployment Security

Deploying LLMs entails giving up control over data, with all the risks inherent in centralized models shared among multiple users.

Mihai Vlad, an industry expert in linguistic AI technologies and General Manager of Language Weaver, the technology driving RWS’s machine translation solutions, argues that large language models (LLMs) can’t be deployed securely and segregated from other users: “There is just one single ChatGPT model owned by OpenAI, shared with all customers.” (Source)

Therefore, even if ChatGPT could be used directly as a translation tool, large enterprises where data security and privacy are of paramount importance would not allow it.

The privacy implications of using LLMs such as ChatGPT directly for translation are non-transparent and complex to assess. ChatGPT may pose privacy risks such as data leakage, unauthorized access to sensitive information, or breaches of confidentiality agreements.

Navigating these privacy implications is difficult. So while ChatGPT is a brilliant creative tool in certain contexts, the case for Neural Machine Translation (NMT) becomes stronger. With NMT, data resides in a controlled environment and the output is refined with human post-editing, making it a robust, purpose-built solution for large-scale translation needs. This approach prioritizes privacy and data security, especially in industries where proprietary information is at stake, as the sketch below illustrates.
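To make the ‘controlled environment’ point concrete, here is a minimal sketch of running an open-source NMT model entirely on local hardware, so the source text never leaves the company’s own infrastructure. It assumes the Hugging Face transformers library and the publicly available Helsinki-NLP/opus-mt-en-fr model; both are illustrative choices for this sketch, not the specific stack VEQTA or any particular vendor uses.

```python
# Minimal sketch: translating sensitive text with a locally hosted NMT model,
# so nothing is sent to a third-party API.
# Assumes: pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-fr"  # downloaded once, then cached locally

tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def translate(text: str) -> str:
    """Translate English to French entirely on the local machine."""
    batch = tokenizer([text], return_tensors="pt", padding=True)
    generated = model.generate(**batch)  # inference runs on local hardware
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(translate("The tender documents must remain confidential."))
```

The raw machine output would then be handed to a human post-editor, which is the NMT workflow described above.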

Read related posts:

  • Comparing ChatGPT (LLM) versus Neural Machine Translation (NMT)
  • Will GPT replace your human translator
  • ChatGPT and culturally diverse data sets