Designing a Flexible Document Comparison API for Haystack Evaluators
Adding a document_comparison_field parameter to MRR, MAP, and Recall evaluators — enabling comparison by ID or metadata instead of hardcoded content matching.
The Problem
Haystack provides three retrieval evaluators: DocumentMRREvaluator (Mean Reciprocal Rank), DocumentMAPEvaluator (Mean Average Precision), and DocumentRecallEvaluator. All three need to compare ground truth documents against retrieved documents to calculate scores.
The problem: all three evaluators hardcoded doc.content for this comparison:
# Hardcoded in the evaluator
if retrieved_doc.content == ground_truth_doc.content:
# match foundThis breaks in two common RAG scenarios:
- Document chunking: When documents are split into chunks, the chunk content differs from the original ground truth content. So
chunk.content != original.content, even when the chunk came from the correct document. - Metadata-based identification: In production systems, documents are often identified by composite keys (e.g.,
file_id,url, orpage_number) stored in metadata, not by their text content.
API Design
I needed to add flexibility while keeping the default behavior unchanged. I designed a document_comparison_field parameter:
evaluator = DocumentMRREvaluator(document_comparison_field="content") # default
evaluator = DocumentMRREvaluator(document_comparison_field="id") # by document ID
evaluator = DocumentMRREvaluator(document_comparison_field="meta.file_id") # by metadataThe "meta.<key>" syntax was a deliberate design choice — it's readable, unambiguous (the meta. prefix clearly indicates metadata access), and extensible (any metadata key works without enum expansion).
Implementation
The core logic is a shared helper method that extracts the comparison value from a document:
def _get_comparison_value(self, doc: Document) -> Any:
if self.document_comparison_field == "content":
return doc.content
elif self.document_comparison_field == "id":
return doc.id
elif self.document_comparison_field.startswith("meta."):
meta_key = self.document_comparison_field[len("meta."):]
return doc.meta.get(meta_key)
else:
raise ValueError(f"Unsupported comparison field: {self.document_comparison_field}")I added this to all three evaluators. One structural change was required for DocumentRecallEvaluator: its _recall_single_hit and _recall_multi_hit methods were @staticmethods. To use self._get_comparison_value(), I needed to change them to regular instance methods.
Serialization
Haystack components use to_dict/from_dict for serialization. I updated to_dict in each evaluator to include the new parameter:
def to_dict(self) -> dict[str, Any]:
return default_to_dict(self, document_comparison_field=self.document_comparison_field)No from_dict changes were needed — Haystack's default_from_dict handles new keyword arguments automatically.
Key Takeaways
- New parameters should always have backwards-compatible defaults. Using
document_comparison_field="content"as default means zero behavior change for existing users. - When the same logic appears in 3+ classes, extract a shared helper.
_get_comparison_valuewas identical across all three evaluators. - Think about edge cases in the API design. The
"meta.<key>"syntax is clean, but I also needed to handle nested metadata keys and the case where the metadata key doesn't exist (returnsNone, which correctly means "no match").
Impact & Reflection
Impact: This feature unblocked a fundamental limitation in Haystack's RAG evaluation pipeline. Previously, users who chunked documents (the standard practice in production RAG) could not accurately evaluate retrieval quality because the evaluators only compared by content. This PR enables comparison by document ID or any metadata field, making MRR, MAP, and Recall metrics actually meaningful for chunked document workflows. The issue had been open for over a year (#9331).
What I learned about API design: This was my first time designing a user-facing API parameter for a framework used by thousands of developers. The key insight: the "meta.file_id" dot-notation syntax was a small design decision, but it determined whether the API felt intuitive or clunky. I spent more time thinking about the parameter interface than writing the implementation. Good API design is about making the common case obvious and the edge cases possible.
A mistake I almost made: My first draft used an enum ("content" | "id" | "meta") with a separate meta_key parameter. During code review, I realized this split the configuration unnecessarily. Combining it into a single document_comparison_field string with dot notation was simpler and more extensible. This taught me to always ask: "Can two parameters become one?"