Fortune | FORTUNE, July 25, 23:00
AI chatbots struggle to function beyond English: ‘They know a lot…but they miss the culture’

 

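The world’s leading AI chatbots excel at generating content in English, but their performance slips in other languages. The main reason is that most large language models lack a deep understanding of local culture: they resemble a Fulbright scholar who studies Asia, knowing a lot about the subject but missing the culture. For example, a math problem involving a “pie” falls flat in a country like India, where most people do not know what a pie is, let alone a key lime pie. Fixing this requires more data in local languages, but high-quality data is not easy to obtain. Developers are working to improve AI performance in non-English settings through manual curation, synthetic data generation, and careful handling of cultural sensitivities.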

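🧐 **A lack of cultural understanding causes AI to underperform in non-English settings**: When large language models handle non-English content, they often lack a deep understanding of the local cultural context. Even when they can convert the language, they struggle to capture cultural nuance. For example, in a math problem about a “pie,” a direct translation misses the mark because pie is not a familiar concept in every culture, so the result does not fit local realities.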

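💪 **Obtaining high-quality non-English training data is a challenge**: Unlike English, which has vast resources on the web, many other languages have relatively little high-quality data. Developers must invest substantial human effort in filtering data to ensure its quality and suitability. In Thailand, for example, native speakers had to sort through large datasets by hand to separate high-quality Thai data sources from poor ones, which is slow and laborious work.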

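🤝 **Synthetic data and private-data partnerships are emerging as solutions**: To make up for the shortage of high-quality data, developers are turning to synthetic data generation. Machine-translating large volumes of English content into other languages can supplement limited datasets, especially for regional dialects with little digital presence. Working with owners of private datasets to obtain protected data is another important way around limits on data sources.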

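⚖️ **Handling cultural sensitivities is key to data collection**: When collecting data in local languages, researchers must handle cultural sensitivities with care. Certain topics may be taboo for local communities, even when they are widely known in the region. In Malaysia, for example, “race, religion, and royalty” are sensitive subjects. AI developers need to respect and protect local customs and privacy while meeting their data needs, for instance by releasing a “safety model.”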

The world’s leading AI chatbots can now generate everything from emails to research papers—in English. But shift to a different language, and AI’s performance begins to slip.

Most large language models are “a bit like a Fulbright scholar who is interested in Asia as their area of study,” said Kalika Bali, a senior principal researcher at Microsoft Research India, at the Fortune Brainstorm AI Singapore conference on Wednesday. “They know a lot about the [subject], but they miss the culture. It’s an outsider’s gaze into the culture of a country.”

Bali pointed to a classic math question, “John and Mary have a key lime pie which they need to divide into five parts,” to show the trouble with using a culturally clueless AI.

Generic AI models will translate the prompt directly. But as Bali pointed out, “in a country like India, most people don’t know what a pie is, [let alone] a key lime pie.” 

To develop models that better understand local culture, more data is needed in local languages. But getting that data is not always simple. 

Roughly half of all web content is in English, meaning there’s no shortage of high-quality digital resources for LLMs to learn English from. For other languages that do not enjoy this same abundance, developers have to explore different methods of getting training data. 

Kasima Tharnpipitchai, head of AI strategy at SCB 10X, highlighted the foundational work by native speakers needed to build a training dataset. 

Tharnpipitchai led SCB 10X’s project to launch the Thai LLM Typhoon. To build a dataset in Thai, Tharnpipitchai said, native speakers had to sift through large open datasets by hand, sorting the high-quality Thai data sources from the low-quality ones.

“There are no tricks here, you really have to do the work,” he said. “It really is just effort. It’s almost brute force.” 

SCB 10X launched Typhoon a year and a half ago. Tharnpipitchai said Typhoon was able to outperform GPT-3.5 in Thai, a fact which “says more about how poorly GPT-3.5 was performing in Thai” than their own work. 

Yet scraping non-English web data is beginning to raise legal concerns.  

Khalil Nooh, founder and CEO of Malaysian startup Mesolitica, which is developing a Malay LLM, said that data owners have asked for their sources to be removed from the company’s training dataset, which is available online because the model is open source.

This has further limited the already small pool of high-quality data they have in Malay. To solve this, “the challenge for us is to work with private data set owners,” Nooh said. 

Both Nooh and Bali are exploring synthetic data generation to help create more high-quality data in their target languages. Machines can translate the abundant English content online into other languages to supplement their limited datasets. This is especially useful for LLMs trying to work in regional dialects that have almost no digital presence otherwise. 

“How we are able to capture all the 16 dialects in Malaysia is through synthetic [data],” said Nooh. 

But there are some obstacles to getting data that neither “brute force” nor machine generation can overcome. In many communities, researchers must balance getting a full picture with managing cultural sensitivities when collecting data in local languages. 

While “on the whole, India is very tech positive,” Bali noted that “there are things that you would not ask” when doing on-the-ground data collection. Local communities may not want to share information about certain topics, even if those topics are widely known among people in the region.

Nooh added that in Malaysia, the three Rs—“race, religion, and royalty”—are all subjects of regional sensitivity. 

Although there are currently no regulations on what LLMs can ‘say’ in Malaysia, Nooh said that Mesolitica has “gone ahead to prepare the components that are needed if ever that is required to be implemented.” 

To tackle cultural sensitivities in Thailand, Tharnpipitchai similarly explained that SCB 10X released a “safety model” for public sector use, in addition to their regular Typhoon model. 
