Lost in the Middle
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638
Category
What type of paper is this?
- Explains and characterises the "lost in the middle" (LITM) phenomenon: how LLMs use long input contexts
- Analyses how effectively models access relevant information placed at different positions within a long context
Context
What other papers is this one related to?
Correctness
What are the assumptions and are they valid?
Contributions
Who are the paper's main contributors?
Clarity
Second Pass
- RAG methods, which augment inputs with retrieved documents or other external information, are actually more susceptible to the LITM problem
- In investigating multi-document Q&A, they control the input-context length by varying the number of documents ("retrieving more or fewer documents") and the position of the relevant information within the input context (start, middle, or end)
- Primacy and recency bias
- With relevant info in the middle, performance can be worse than with no documents at all (closed-book)
- Trade-off with longer input contexts: more information may help the model perform the downstream task, but at the cost of increasing the amount of content it must reason over
- Model performance saturates long before retriever recall does: models don't effectively use additional retrieved documents
- Retriever recall is the proportion of relevant documents successfully retrieved from a corpus in response to a query, i.e. the number of relevant documents retrieved divided by the total number of relevant documents
- "50 vs 20 only marginally improves" - but should we still prioritise the 1.5% increase? What is bad about giving it more documents?
- Memory and compute increase quadratically with sequence length (self-attention)
- Decoder-only models (GPT, LLaMA) vs encoder-decoder models: model architecture affects the sequence length at which performance starts degrading as more documents are added
- To claim that a model uses information robustly within long input contexts, one must show that performance is minimally affected by the position of the relevant info in the context
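To make the recall definition above concrete, a minimal sketch (document IDs are illustrative, not from the paper):

```python
def retriever_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant documents that the retriever returned."""
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(relevant)

# 2 of the 4 relevant documents appear in the retrieved set -> recall 0.5
print(retriever_recall(["d1", "d7", "d3"], ["d1", "d3", "d9", "d5"]))  # 0.5
```

Note that recall keeps rising as more documents are retrieved, while the paper's point is that answer accuracy plateaus well before that.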
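The core experimental manipulation (varying where the answer-bearing document sits among distractors) can be sketched as follows; the helper name and prompt format are my own, not the paper's code:

```python
def build_context(gold_doc, distractor_docs, gold_position):
    """Insert the answer-bearing document at a chosen index among
    distractors, then join everything into one input context."""
    docs = list(distractor_docs)
    docs.insert(gold_position, gold_doc)
    numbered = [f"Document [{i + 1}]: {d}" for i, d in enumerate(docs)]
    return "\n".join(numbered)

distractors = [f"distractor passage {i}" for i in range(9)]
gold = "GOLD: the passage that actually answers the question"

# Place the gold document at the start, middle, and end of a
# 10-document context; only gold_position changes between runs.
for pos in (0, 5, 9):
    ctx = build_context(gold, distractors, pos)
    print(ctx.splitlines()[pos])
```

Holding the document count fixed while sliding `gold_position` isolates the positional effect that produces the U-shaped accuracy curve.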
Notes originally written for my thesis slides under a looming deadline, added here:
- Fine-tuning models on tasks requiring long-context comprehension can slightly reduce positional biases, but degradation still occurs when information is positioned mid-context
- Overall, without addressing the inherent positional biases, models may still struggle with mid-context information, even with RAG
- Section 6.2 (related work) cites many interesting papers on how models use long contexts
- The bias is likely due to how human-written texts are often constructed: the beginning and the end of a long text tend to matter the most