Info

This document is a work in progress and will be updated regularly.

Introduction

This document is meant to be an informal, ever-changing collection of interesting papers and resources related to Large Language Models (LLMs) and language revitalization.

LLMs have been shown to be remarkably capable at a wide variety of natural language tasks including machine translation, summarization, question answering, auto-completion, dialog, and more [1]. State-of-the-art LLMs are trained on vast amounts of natural language data from the internet [2] and, as a result, do not perform as well on tasks that involve low/no-resource languages [3, 4].

We refer to languages with very few publicly available bilingual or monolingual corpora as “low-resource” languages and those with no publicly available corpora as “no-resource” languages.

Research Questions

In exploring how LLMs might be used for endangered language preservation and revitalization, we have identified the following research questions as some of the most interesting and important:

  • How do models “know” language? This is important for understanding how they might be taught new languages from scratch. By taught, we don’t necessarily mean fine-tuned or trained (in the ML sense of the word “train”). We are also interested in how pre-trained LLMs might be taught the way a human is taught language: through dialog, question answering, context, and experience (using prompt engineering, retrieval-augmented generation, etc.).

    • Black-box experimentation: Since the internal processes of the human brain are largely opaque to us, many of the advances in linguistics have come through creative black-box experiments [5, 6, 7]. Can these be recreated with LLMs? How might the results differ and what might that tell us about how LLMs “know” language?
    • Linguistic Probing: Probing techniques have been used to explore what linguistic features LLMs are learning [8, 9, 10, 11]. How can these techniques be used to understand how LLMs learn/know language?
    • We care less about whether or not LLMs learn like humans and more about understanding how LLMs learn so that we can leverage the knowledge to build useful tools for low/no-resource languages.
  • How can we use popular LLM tool-building techniques to create tools for the documentation, preservation, and revitalization of endangered languages?

    • In the context window: few-shot learning, prompt engineering, function calling, etc. We proposed a new approach for low/no-resource language machine translation using a combination of these techniques [12].
    • Tokenization: Can adding tokens for target language words help with natural language tasks?
    • Fine-tuning: Fine-tuning is difficult with limited data, but there may still be ways to leverage it for low-resource languages.
  • How can LLMs be used for foreign language education?

    • Ultimately, the goal of endangered language revitalization is to create new human speakers.
    • How can LLMs be used effectively in language education? We proposed a new approach for using LLMs as practice partners and tutors for language learning [13].
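To make the in-context techniques mentioned above more concrete, here is a minimal sketch of how a few-shot, dictionary-augmented translation prompt might be assembled for a no-resource language. This is not the method of the cited paper, just one plausible way to pack bilingual evidence into the context window; the sentence pairs and lexicon entries below are invented placeholders, not real data for any language.

```python
# Sketch: building a few-shot translation prompt from example pairs plus
# dictionary entries retrieved for the input sentence (a tiny stand-in
# for retrieval-augmented generation). All data below is invented.

EXAMPLE_PAIRS = [
    ("the dog sleeps", "toki moku"),
    ("the dog eats", "toki nami"),
]

LEXICON = {
    "dog": "toki",
    "sleeps": "moku",
    "eats": "nami",
    "bird": "siwa",
}

def build_prompt(source_sentence: str) -> str:
    """Pack relevant dictionary entries and example pairs into one prompt."""
    # Retrieve only the lexicon entries that appear in the input sentence.
    words = source_sentence.lower().split()
    entries = [f"{en} = {xx}" for en, xx in LEXICON.items() if en in words]

    lines = ["Translate English to the target language.", "", "Dictionary:"]
    lines += entries
    lines += ["", "Examples:"]
    lines += [f"English: {en}\nTarget: {xx}" for en, xx in EXAMPLE_PAIRS]
    lines += ["", f"English: {source_sentence}", "Target:"]
    return "\n".join(lines)

prompt = build_prompt("the bird sleeps")
print(prompt)
```

The resulting string would be sent to an LLM as-is; the model's completion after "Target:" is taken as the candidate translation.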

Useful Tools Enabled by Research

Pursuing the above research questions will guide and enable the development of many practically useful tools for endangered language revitalization. Some of these include:

  • Parsing linguistic literature for grammar, vocabulary, etc.
  • Summarizing/explaining content for language learners
  • Grammar induction
  • Auto-completion
  • Data sanitization/standardization
  • Adaptive data collection: using an LLM to help adjust the questions or queries made to native speakers during data collection to gather the most relevant and useful information.

Special Concerns for Indigenous Communities

When working on language revitalization efforts with indigenous communities, history and context matter. Genocide and forced assimilation [14] have led to the endangerment of many indigenous cultures and languages throughout the United States. At Indian boarding schools, which were established to turn the surviving indigenous population into a servile class, children were forced to abandon their native languages and cultures [15].

Even the more modern and well-intentioned efforts to document and revitalize indigenous languages are not without their own ethical concerns. My tribe, for example, prohibits telling some traditional stories except during the winter. To document these stories and make them publicly available throughout the year would undermine this culturally important tradition. Different indigenous communities have different boundaries and rules for what is appropriate to share and what is not. It is important to respect these boundaries and to work with communities to ensure that the work being done is culturally appropriate and respectful.

Finally, it is imperative that indigenous communities benefit from the work being done to document and revitalize their languages. This means that the tools and resources developed should be made available to the communities in a way that is accessible and useful to them. Another personal example: my great-grandmother was a fluent speaker of our language and so was the subject of a study by University of California, San Diego Ph.D. student Evan Norris. His thesis, “A Grammar Sketch And Comparative Study Of Eastern Mono” [16], an invaluable resource for our critically endangered language, is locked behind a ProQuest academic paywall and is almost impossible for my family and other tribal members to access.

In our research on using LLMs for endangered language revitalization, we commit to respecting the boundaries and rules of the communities we work with and to making the research output accessible and useful to those communities.

Notes on Papers

This section contains notes, summaries, and thoughts on some of the papers in the bibliography below.

Linguistic Probing

“Does string-based neural MT learn source syntax?” [11] is a very nice introduction to linguistic probing. The following excerpt, in particular, is very helpful in understanding the probing technique in general:

As a simple example, we train an English-French NMT system on 110M tokens of bilingual data (English side). We then take 10K separate English sentences and label their voice as active or passive. We use the learned NMT encoder to convert these sentences into 10k corresponding 1000-dimension encoding vectors. We use 9000 sentences to train a logistic regression model to predict voice using the encoding cell states, and test on the other 1000 sentences. We achieve 92.8% accuracy (Table 2), far above the majority class baseline (82.8%). This means that in reducing the source sentence to a fixed-length vector, the NMT system has decided to store the voice of English sentences in an easily accessible way. When we carry out the same experiment on an English-English (auto-encoder) system, we find that English voice information is no longer easily accessed from the encoding vector. We can only predict it with 82.7% accuracy, no better than chance. Thus, in learning to reproduce input English sentences, the seq2seq model decides to use the fixed-length encoding vector for other purposes.

So, in general, the idea behind probing is to see whether a model learns to encode certain known linguistic features as a byproduct of learning to perform a given task. In this example, a logistic regression classifier is trained to predict the voice of a sentence (active or passive) from the encoder’s fixed-length encoding vector. Note that this approach can only detect features that are linearly accessible in the encoding; features encoded in a non-linear way would be missed. Still, it is a good start.
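The excerpt's experiment can be sketched end-to-end on synthetic data. Here the NMT encoder is simulated by planting a "voice" signal along one random direction of otherwise-random vectors (an assumption standing in for real encoder states), and a logistic-regression probe is trained by plain gradient descent on a 9,000/1,000 split, mirroring the setup in the excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for NMT encoder outputs: 10,000 "sentences" -> 64-dim vectors.
# We plant a voice signal along one random direction, mimicking a model that
# stores active/passive voice linearly in its encoding.
n, d = 10_000, 64
labels = (rng.random(n) < 0.2).astype(float)   # ~20% passive, a skewed class
direction = rng.normal(size=d)
X = rng.normal(size=(n, d)) + np.outer(labels * 2 - 1, direction)

# Split: 9,000 train / 1,000 test, as in the excerpt.
Xtr, ytr, Xte, yte = X[:9000], labels[:9000], X[9000:], labels[9000:]

# Linear probe: logistic regression trained with gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(300):
    z = np.clip(Xtr @ w + b, -30, 30)          # clip to avoid exp overflow
    p = 1 / (1 + np.exp(-z))                   # predicted P(passive)
    w -= 0.5 * (Xtr.T @ (p - ytr)) / len(ytr)
    b -= 0.5 * np.mean(p - ytr)

acc = np.mean(((Xte @ w + b) > 0) == yte.astype(bool))
baseline = max(np.mean(yte), 1 - np.mean(yte))  # majority-class baseline
print(f"probe accuracy: {acc:.3f}  (majority baseline: {baseline:.3f})")
```

Because the signal is planted linearly, the probe recovers it easily and beats the majority baseline; a real experiment replaces the synthetic vectors with encoder states, and a probe that fails to beat the baseline (as in the auto-encoder case above) suggests the feature is not stored in a linearly accessible way.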

The paper “What Does BERT Look at? An Analysis of BERT’s Attention” [9] is a very interesting follow-up to the previous paper that applies the probing technique to BERT and explores how different attention heads encode different linguistic features.

The paper “Emergent linguistic structure in artificial neural networks trained by self-supervision” [10] uses probing techniques to perform experiments that suggest the BERT encoder is learning to encode parse tree distances in its hidden states.

“Probing Classifiers: Promises, Shortcomings, and Advances” [8] is a nice survey of many different linguistic probing techniques and their limitations.

Footnotes

  1. Sébastien Bubeck et al. 2023. “Sparks of Artificial General Intelligence: Early experiments with GPT-4.” arXiv:2303.12712. DOI: 10.48550/arXiv.2303.12712

  2. OpenAI. 2023. “GPT-4 Technical Report.” arXiv:2303.08774. DOI: 10.48550/arXiv.2303.08774

  3. Aakanksha Chowdhery et al. 2022. “PaLM: Scaling Language Modeling with Pathways.” arXiv:2204.02311. DOI: 10.48550/arXiv.2204.02311

  4. Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, and Graham Neubig. 2023. “ChatGPT MT: Competitive for High- (but not Low-) Resource Languages.” arXiv:2309.07423. DOI: 10.48550/arXiv.2309.07423

  5. Mark C. Baker. 2008. The atoms of language: The mind’s hidden rules of grammar. Basic books.

  6. Guy Deutscher. 2010. Through the language glass: Why the world looks different in other languages. Metropolitan books.

  7. Steven Pinker. 2003. The language instinct: How the mind creates language. Penguin UK.

  8. Yonatan Belinkov. 2022. “Probing Classifiers: Promises, Shortcomings, and Advances.” Computational Linguistics 48(1): 207-219. DOI: 10.1162/COLI_A_00422

  9. Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. “What Does BERT Look at? An Analysis of BERT’s Attention.” In Proceedings of BlackboxNLP@ACL 2019, 276-286. DOI: 10.18653/V1/W19-4828

  10. Christopher D. Manning et al. 2020. “Emergent linguistic structure in artificial neural networks trained by self-supervision.” Proceedings of the National Academy of Sciences 117(48): 30046-30054. DOI: 10.1073/PNAS.1907367117

  11. Xing Shi, Inkit Padhi, and Kevin Knight. 2016. “Does string-based neural MT learn source syntax?” In Proceedings of EMNLP 2016, 1526-1534. DOI: 10.18653/V1/D16-1159

  12. Jared Coleman, Bhaskar Krishnamachari, Khalil Iskarous, and Ruben Rosales. 2024. “LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages.” arXiv:2405.08997. DOI: 10.48550/arXiv.2405.08997

  13. Sheng Yu, Jared Coleman, and Bhaskar Krishnamachari. 2023. “Chatlang: A Two-Window Approach to Chatbots for Language Learning.” https://anrg.usc.edu/www/papers/chatlang.pdf

  14. Benjamin Madley. 2016. An American Genocide: The United States and the California Indian Catastrophe, 1846-1873. Yale University Press.

  15. K. Tsianina Lomawaima and Teresa L. McCarty. 2006. “To remain an Indian”: Lessons in democracy from a century of Native American education. Teachers College Press.

  16. Evan J. Norris. 1986. “A Grammar Sketch And Comparative Study Of Eastern Mono.” Ph.D. dissertation, University of California, San Diego. ProQuest. ISBN: 979-8-206-18923-0