# Extracting Hyperlinks from PPTX and XLSX Files in Haystack
How pandas.read_excel() silently discards hyperlinks, and the openpyxl workaround to preserve them. Plus python-pptx run-level hyperlink extraction.
## The Problem
Haystack's document converters (`PPTXToDocument` and `XLSXToDocument`) extracted text content from files but silently discarded hyperlinks. If a PowerPoint slide contained "Visit our website" or an Excel cell had a hyperlinked URL, the converter only captured the display text — the URL was lost.
This matters for RAG applications because hyperlinks often contain important context: references, citations, source URLs. Losing them degrades retrieval quality.
The `DOCXToDocument` converter already supported a `link_format` parameter for extracting hyperlinks. The issue asked for the same feature in the PPTX and XLSX converters.
## PPTX Implementation
For PowerPoint files, I used python-pptx's API to access hyperlinks at the run level. A "run" is a text segment within a paragraph — each run can have its own formatting and hyperlink.
The original code simply used `shape.text` to get all text from a shape:
```python
# Before: loses hyperlinks
text = shape.text
```

The new code iterates through paragraphs and runs to detect hyperlinks:
```python
# After: preserves hyperlinks
parts = []
for paragraph in shape.text_frame.paragraphs:
    for run in paragraph.runs:
        text = run.text
        url = run.hyperlink.address  # None when the run has no hyperlink
        if url and link_format == "markdown":
            parts.append(f"[{text}]({url})")
        elif url and link_format == "plain":
            parts.append(f"{text} ({url})")
        else:
            parts.append(text)
```

## XLSX Implementation: The pandas Problem
The XLSX case was trickier. Haystack's XLSXToDocument uses pandas.read_excel() to read spreadsheets. The problem: `pandas.read_excel()` completely discards hyperlinks. There's no option to preserve them — pandas only reads cell values.
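A minimal round-trip makes the limitation concrete. This is a standalone sketch, assuming pandas and openpyxl are installed; the sheet contents are invented for illustration:

```python
from io import BytesIO

import openpyxl
import pandas as pd

# Build a tiny workbook with one header row and one hyperlinked cell.
wb = openpyxl.Workbook()
ws = wb.active
ws["A1"] = "link"              # header row
ws["A2"] = "our website"       # display text
ws["A2"].hyperlink = "https://example.com"

buf = BytesIO()
wb.save(buf)
buf.seek(0)

# pandas reads only the cell value; the URL is gone.
df = pd.read_excel(buf, engine="openpyxl")
print(df.iat[0, 0])  # our website

# openpyxl, by contrast, still sees the target.
buf.seek(0)
ws2 = openpyxl.load_workbook(buf).active
print(ws2["A2"].hyperlink.target)  # https://example.com
```

The same spreadsheet yields only display text through pandas but the full hyperlink target through openpyxl, which is exactly the gap the workaround below exploits.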
My solution was to use openpyxl alongside pandas. When `link_format` is enabled, I:

- Load the workbook with `openpyxl.load_workbook()` to access hyperlinks
- Read the data normally with `pandas.read_excel()` to preserve the table structure
- Walk through the `openpyxl` worksheet to find cells with hyperlinks
- Replace the corresponding values in the pandas DataFrame with formatted links
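The last step hides a coordinate mismatch worth spelling out: openpyxl cell coordinates are 1-based and include the header row, while pandas positional indexing is 0-based and excludes it. A tiny hypothetical helper (`excel_to_df_position` is my name for illustration, not part of the PR) captures the offset:

```python
def excel_to_df_position(excel_row: int, excel_col: int) -> tuple[int, int]:
    """Map a 1-based Excel cell (row 1 being the header) to the
    0-based (row, column) positions used by DataFrame.iat."""
    if excel_row < 2:
        raise ValueError("row 1 is the header; it has no DataFrame position")
    return excel_row - 2, excel_col - 1

# Excel cell B2 (first data row, second column) -> DataFrame position (0, 1)
print(excel_to_df_position(2, 2))  # (0, 1)
```

This is the arithmetic behind the `cell.row - 2` / `cell.column - 1` indices in the snippet below.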
```python
if link_format != "none":
    wb = openpyxl.load_workbook(file_path)
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        for row in ws.iter_rows():
            for cell in row:
                if cell.row == 1:
                    continue  # skip the header row
                if cell.hyperlink and cell.hyperlink.target:
                    display = str(cell.value) if cell.value else cell.hyperlink.target
                    if link_format == "markdown":
                        formatted = f"[{display}]({cell.hyperlink.target})"
                    else:
                        formatted = f"{display} ({cell.hyperlink.target})"
                    # openpyxl coordinates are 1-based and include the header,
                    # so shift by 2 rows and 1 column; use positional .iat,
                    # since .at would look up column *labels*, not positions.
                    df.iat[cell.row - 2, cell.column - 1] = formatted
```

## Supported Formats
Following the existing `DOCXToDocument` pattern, I supported three formats:
| `link_format` | Output | Example |
| --- | --- | --- |
| `"none"` (default) | Text only | our website |
| `"markdown"` | Markdown links | `[our website](https://example.com)` |
| `"plain"` | Parenthesized URLs | our website (https://example.com) |
## Key Takeaways
- **Library limitations require creative workarounds.** `pandas.read_excel()` can't read hyperlinks, but `openpyxl` can. Using both together gives us the best of both worlds: pandas for structured data, openpyxl for hyperlinks.
- **Follow existing patterns.** The `DOCXToDocument` converter already established the `link_format` parameter convention. Matching it in PPTX and XLSX keeps the API consistent across converters.
- **Default to no behavior change.** Using `link_format="none"` as the default means existing users see no difference unless they opt in.
## Impact & Reflection
Impact: This PR completed Haystack's document converter suite — all three file converters (DOCX, PPTX, XLSX) now support hyperlink extraction with a consistent link_format API. For RAG applications processing corporate documents (slide decks with reference URLs, spreadsheets with linked resources), this means the retrieval index now captures information that was previously silently discarded.
What I learned about working around library limitations: The pandas.read_excel() limitation was the most interesting engineering challenge. My instinct was to replace pandas entirely with openpyxl, but that would have broken the existing table structure handling. Using both libraries together — pandas for data, openpyxl for hyperlinks — was a pragmatic compromise. This taught me that the best solution isn't always the cleanest architecture; sometimes it's about combining existing tools in creative ways.
How this contribution fit into a bigger picture: Looking back at my four Haystack PRs together (document comparison, hyperlink extraction, Anthropic reasoning, Ollama reasoning), I see a pattern: each one addressed a gap between "what the framework promises" and "what actually works in production." The document converters promise to extract content, but silently dropping hyperlinks violates that promise. Recognizing these promise-vs-reality gaps is now how I find high-impact issues to work on.