Feb 2026 · 8 min

Extracting Hyperlinks from PPTX and XLSX Files in Haystack

How pandas.read_excel() silently discards hyperlinks, and the openpyxl workaround to preserve them. Plus python-pptx run-level hyperlink extraction.

Haystack · Python · RAG

The Problem

Haystack's document converters (PPTXToDocument and XLSXToDocument) extracted text content from files but silently discarded hyperlinks. If a PowerPoint slide contained "Visit our website" or an Excel cell had a hyperlinked URL, the converter only captured the display text — the URL was lost.

This matters for RAG applications because hyperlinks often contain important context: references, citations, source URLs. Losing them degrades retrieval quality.

The DOCXToDocument converter already supported a link_format parameter for extracting hyperlinks. The issue asked for the same feature in PPTX and XLSX converters.

PPTX Implementation

For PowerPoint files, I used python-pptx's API to access hyperlinks at the run level. A "run" is a text segment within a paragraph — each run can have its own formatting and hyperlink.

The original code simply used shape.text to get all text from a shape:

```python
# Before: loses hyperlinks
text = shape.text
```
The new code iterates through paragraphs and runs to detect hyperlinks:

```python
# After: preserves hyperlinks
for paragraph in shape.text_frame.paragraphs:
    for run in paragraph.runs:
        text = run.text
        # hyperlink.address is None when the run has no link
        url = run.hyperlink.address
        if url and link_format == "markdown":
            parts.append(f"[{text}]({url})")
        elif url and link_format == "plain":
            parts.append(f"{text} ({url})")
        else:
            parts.append(text)
```

XLSX Implementation: The pandas Problem

The XLSX case was trickier. Haystack's XLSXToDocument uses pandas.read_excel() to read spreadsheets. The problem: `pandas.read_excel()` completely discards hyperlinks. There's no option to preserve them — pandas only reads cell values.
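
The gap is easy to demonstrate end-to-end. A minimal sketch, assuming openpyxl and pandas are both installed (pandas uses openpyxl as its xlsx engine anyway): pandas returns only the display value, while the openpyxl cell object still carries the URL.

```python
import io

import openpyxl
import pandas as pd

# Build a one-column workbook with a hyperlinked cell, entirely in memory.
wb = openpyxl.Workbook()
ws = wb.active
ws["A1"] = "name"
ws["A2"] = "our website"
ws["A2"].hyperlink = "https://example.com"
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)

# pandas sees only the display text; the URL is gone.
df = pd.read_excel(buf)
print(df.iat[0, 0])  # → our website

# openpyxl still exposes the hyperlink target.
buf.seek(0)
ws2 = openpyxl.load_workbook(buf).active
print(ws2["A2"].hyperlink.target)  # → https://example.com
```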

My solution was to use openpyxl alongside pandas. When link_format is enabled, I:

  1. Load the workbook with openpyxl.load_workbook() to access hyperlinks
  2. Read the data normally with pandas.read_excel() to preserve the table structure
  3. Walk through the openpyxl worksheet to find cells with hyperlinks
  4. Replace the corresponding values in the pandas DataFrame with formatted links
```python
if link_format != "none":
    wb = openpyxl.load_workbook(file_path)
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        for row in ws.iter_rows():
            for cell in row:
                if cell.hyperlink and cell.hyperlink.target:
                    display = str(cell.value) if cell.value else cell.hyperlink.target
                    if link_format == "markdown":
                        formatted = f"[{display}]({cell.hyperlink.target})"
                    else:
                        formatted = f"{display} ({cell.hyperlink.target})"
                    # Map the 1-indexed sheet coordinate to the 0-indexed
                    # DataFrame position (sheet row 1 became the header),
                    # writing positionally with .iat, not label-based .at
                    df.iat[cell.row - 2, cell.column - 1] = formatted
```
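
The `cell.row - 2` / `cell.column - 1` arithmetic deserves a note: openpyxl coordinates are 1-indexed, and with pandas' default `header=0` read, sheet row 1 is consumed as the header. A small sketch of that mapping (the helper name `sheet_to_frame` is hypothetical):

```python
import pandas as pd

# The DataFrame as pandas would read it: sheet row 1 became the header.
df = pd.DataFrame({"name": ["our website"], "notes": ["see link"]})

def sheet_to_frame(row: int, col: int) -> tuple[int, int]:
    """Map a 1-indexed worksheet coordinate to a 0-indexed DataFrame
    position, assuming sheet row 1 is the header row."""
    return row - 2, col - 1

# Worksheet cell B2 (row=2, col=2) corresponds to DataFrame position (0, 1).
r, c = sheet_to_frame(2, 2)
df.iat[r, c] = "[see link](https://example.com)"
print(df.iat[0, 1])  # → [see link](https://example.com)
```

Writing with positional `df.iat` rather than label-based `df.at` matters here: the computed indices are positions, and column labels in a real spreadsheet are strings, not integers.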

Supported Formats

Following the existing DOCXToDocument pattern, I supported three formats:

| `link_format` | Output | Example |
| --- | --- | --- |
| `"none"` (default) | Text only | our website |
| `"markdown"` | Markdown links | `[our website](https://example.com)` |
| `"plain"` | Parenthesized URLs | our website (https://example.com) |

Key Takeaways

  1. Library limitations require creative workarounds. pandas.read_excel() can't read hyperlinks, but openpyxl can. Using both together gives us the best of both worlds — pandas for structured data, openpyxl for hyperlinks.
  2. Follow existing patterns. The DOCXToDocument converter already established the link_format parameter convention. Matching it in PPTX and XLSX keeps the API consistent across converters.
  3. Default to no behavior change. Using link_format="none" as default means existing users see no difference unless they opt in.

Impact & Reflection

Impact: This PR completed Haystack's document converter suite — all three file converters (DOCX, PPTX, XLSX) now support hyperlink extraction with a consistent link_format API. For RAG applications processing corporate documents (slide decks with reference URLs, spreadsheets with linked resources), this means the retrieval index now captures information that was previously silently discarded.

What I learned about working around library limitations: The pandas.read_excel() limitation was the most interesting engineering challenge. My instinct was to replace pandas entirely with openpyxl, but that would have broken the existing table structure handling. Using both libraries together — pandas for data, openpyxl for hyperlinks — was a pragmatic compromise. This taught me that the best solution isn't always the cleanest architecture; sometimes it's about combining existing tools in creative ways.

How this contribution fit into a bigger picture: Looking back at my four Haystack PRs together (document comparison, hyperlink extraction, Anthropic reasoning, Ollama reasoning), I see a pattern: each one addressed a gap between "what the framework promises" and "what actually works in production." The document converters promise to extract content, but silently dropping hyperlinks violates that promise. Recognizing these promise-vs-reality gaps is now how I find high-impact issues to work on.