Lost in the Middle
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638
Category
What type of paper is this?
- Explains and characterises the "lost in the middle" (LITM) phenomenon: how LLMs use long input contexts
- Analyses how effectively models access relevant information placed at different positions within a long context
Context
What other papers is this one related to?
Correctness
What are the assumptions and are they valid?
Contributions
Who are the paper's main contributors?
Clarity
Second Pass
- RAG methods, which augment inputs with retrieved documents or other external information, are actually more susceptible to the LITM problem
- In investigating multi-document Q&A, they control the input-context length by varying the number of documents ("retrieving more or fewer documents") and the position of the relevant information within the input context (start, middle, or end)
- Primacy and recency bias
- With relevant info in the middle, performance can be worse than with no documents at all (closed-book)
- Trade-off with longer input contexts: more information may help the model perform the downstream task, but at the cost of increasing the amount of content it must reason over
- Model performance saturates long before retriever recall does: models don't effectively use additional retrieved documents
- Retriever recall is the proportion of relevant documents successfully retrieved from a corpus in response to a query, i.e. the number of relevant documents retrieved divided by the total number of relevant documents
- "50 vs 20 only marginally improves" - but should we still prioritise the 1.5% increase? What is bad about giving it more documents?
- Memory and compute increase quadratically with sequence length (self-attention)
- Decoder-only models (GPT, LLaMA) vs encoder-decoder models: model architecture affects the sequence length at which performance starts degrading as more documents are added
- To claim that a model uses information robustly within long input contexts, one must show that performance is minimally affected by the position of the relevant info in the context
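To make the recall definition above concrete, a minimal sketch (document IDs are illustrative, not from the paper):

```python
def retriever_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant documents that the retriever returned."""
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(relevant)

# 2 of the 4 relevant documents appear in the retrieved set -> recall 0.5
print(retriever_recall(["d1", "d7", "d3"], ["d1", "d3", "d9", "d5"]))  # 0.5
```

Note that recall keeps rising as more documents are retrieved, while the paper's point is that answer accuracy plateaus well before that.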
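The core experimental manipulation (varying where the answer-bearing document sits among distractors) can be sketched as follows; the helper name and prompt format are my own, not the paper's code:

```python
def build_context(gold_doc, distractor_docs, gold_position):
    """Insert the answer-bearing document at a chosen index among
    distractors, then join everything into one input context."""
    docs = list(distractor_docs)
    docs.insert(gold_position, gold_doc)
    numbered = [f"Document [{i + 1}]: {d}" for i, d in enumerate(docs)]
    return "\n".join(numbered)

distractors = [f"distractor passage {i}" for i in range(9)]
gold = "GOLD: the passage that actually answers the question"

# Place the gold document at the start, middle, and end of a
# 10-document context; only gold_position changes between runs.
for pos in (0, 5, 9):
    ctx = build_context(gold, distractors, pos)
    print(ctx.splitlines()[pos])
```

Holding the document count fixed while sliding `gold_position` isolates the positional effect that produces the U-shaped accuracy curve.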
Notes originally written for my thesis slides under a looming deadline, added here:
- Fine-tuning models on tasks requiring long-context comprehension can slightly reduce positional biases, but degradation still occurs when information is positioned mid-context
- Overall, without addressing the inherent positional biases, models may still struggle with mid-context information, even with RAG
- Section 6.2 (related work) cites many interesting papers on how models use long contexts
- The bias is likely due to how human-written texts are often constructed: the beginning and the end of a long text tend to matter the most