graph-builder-instructions

I’m building a “graph builder” script for my Markdown-based digital garden. Another AI helped me design the architecture in pieces; now I want you to produce ONE complete, working Python script from this spec.

Please read everything below (requirements + examples + partial code), then:

  1. Output a single Python script file.
  2. Keep comments and in-code instructions so the script is very readable and self-explanatory.
  3. Focus only on this scope: build a graph + diagnostics from Markdown, no HTML, no moving files, no quote parsing.
  4. Prefer clear, straightforward code over cleverness.

type: ai-instructions
id: graph-builder-instructions

High-Level Goal

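Scan all .md files in the project, build a node registry and a graph of connections between them (edges), then produce a Markdown report with:

  • missing nodes (links that point to nodes that do not exist yet),
  • orphan nodes,
  • nodes without a type,
  • summary stats.

The graph must incorporate both:

  • inline links found in the Markdown body ([...](#)),
  • structured relationships declared in frontmatter metadata (starting with author).

All relationships should be represented as edges with a type.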


Scope (what this script DOES and DOES NOT do)

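This script DOES:

  • scan every .md file and index nodes by id and alias,
  • resolve inline links and metadata relationships into typed edges,
  • validate the result and write a Markdown diagnostics report.

This script DOES NOT:

  • generate HTML,
  • move files,
  • parse quotes.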

Think of it as: index + resolver + validator.


Data Model

Each Markdown file has frontmatter like:

---
id: the-odyssey
aliases:
  - odyssey
  - a-odisseia
type: book
metadata:
  title: The Odyssey
  author: homer
  year: -800
---

Another example:

---
id: caderno-do-fim-do-mundo--cleyton-cabral
aliases: []
type: book
metadata:
  title: Caderno do Fim do Mundo
  author: cleyton-cabral
  year: 2025
---

Later I’ll also have nodes of type author, concept, note, etc., but for now just handle whatever type appears (including missing).


Step 1 — Discover Files

Use pathlib and recursive glob from project root:

from pathlib import Path

ROOT = Path(".")

markdown_files = list(ROOT.rglob("*.md"))

The script should treat the whole tree as the “garden” (not just /books).
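
One thing to watch when scanning the whole tree: the report written in Step 8 is itself a .md file and would be picked up on the next run. The sketch below is one possible way to handle that. The function name discover_markdown and the exclude_dirs parameter are hypothetical, and skipping reports/ is an assumption rather than part of the spec.

```python
from pathlib import Path


def discover_markdown(root: Path, exclude_dirs: tuple[str, ...] = ("reports",)) -> list[Path]:
    """Return all .md files under root, sorted, skipping excluded directory names."""
    found = []
    for path in root.rglob("*.md"):
        # Directory names between root and the file (the file name itself is dropped).
        rel_dirs = path.relative_to(root).parts[:-1]
        if any(part in exclude_dirs for part in rel_dirs):
            continue  # ASSUMPTION: generated output lives in reports/ and is not a node
        found.append(path)
    # Sorting keeps the run deterministic, so reports diff cleanly between runs.
    return sorted(found)
```

Calling discover_markdown(Path(".")) would replace the one-line rglob above; passing exclude_dirs=() restores the original scan-everything behavior.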


Step 2 — Extract Node Identity (frontmatter)

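For each .md file, parse the YAML frontmatter and record the node’s id, aliases, type, and metadata, together with the file path.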

Example in-memory representation (you can model this as a simple dict or a small Node dataclass):

nodes = {
    "the-odyssey": {
        "id": "the-odyssey",
        "file": Path("books/finished/the-odyssey.md"),
        "aliases": ["odyssey", "a-odisseia"],
        "type": "book",
        "metadata": {
            "title": "The Odyssey",
            "author": "homer",
            "year": -800,
        },
    },
    # ...
}
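
Before the frontmatter can be turned into a dict like the one above, it has to be separated from the body. A minimal sketch using only the standard library follows. The name split_frontmatter is hypothetical, and the code assumes the frontmatter starts on the first line and is delimited by bare --- lines, as in the examples. The returned YAML text would still need a YAML parser (for instance PyYAML's yaml.safe_load, an assumed third-party dependency).

```python
from __future__ import annotations


def split_frontmatter(text: str) -> tuple[str | None, str]:
    """Return (yaml_text, body). yaml_text is None when no frontmatter is found."""
    lines = text.splitlines()
    # Frontmatter must open on the very first line with a bare '---'.
    if not lines or lines[0].strip() != "---":
        return None, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            yaml_text = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return yaml_text, body
    # Opening delimiter without a closing one: treat the whole file as body.
    return None, text


sample = "---\nid: the-odyssey\ntype: book\n---\nSome notes.\n"
yaml_text, body = split_frontmatter(sample)
```

Returning None instead of raising lets the caller report files without frontmatter and keep going.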

Step 3 — Build ID & Alias Registry

You need two maps:

id_map = {
    "the-odyssey": nodes["the-odyssey"],  # or a Node instance
    # ...
}

alias_map = {
    "odyssey": "the-odyssey",
    "a-odisseia": "the-odyssey",
    # ...
}

Resolution function:

def resolve(name: str) -> str | None:
    """
    Given a link target like 'a-odisseia' or 'the-odyssey',
    return the canonical node id, or None if not found.
    """
    if name in id_map:
        return name
    if name in alias_map:
        return alias_map[name]
    return None

ID and alias collisions should be detected and reported (e.g. two files claim the same id, or the same alias points to two different ids).
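
Collision detection could look roughly like the sketch below. build_registry and the wording of the messages are hypothetical, and the "first file wins" policy is an assumption; the spec only asks that collisions be detected and reported.

```python
from __future__ import annotations


def build_registry(nodes_list: list[dict]) -> tuple[dict, dict, list[str]]:
    """Build id_map and alias_map, collecting collision messages instead of raising."""
    id_map: dict[str, dict] = {}
    alias_map: dict[str, str] = {}
    collisions: list[str] = []

    # Pass 1: ids. ASSUMPTION: the first file that claims an id keeps it.
    for node in nodes_list:
        node_id = node["id"]
        if node_id in id_map:
            collisions.append(
                f"duplicate id '{node_id}': {id_map[node_id]['file']} and {node['file']}"
            )
            continue
        id_map[node_id] = node

    # Pass 2: aliases, checked against all ids and against earlier aliases.
    for node in id_map.values():
        for alias in node.get("aliases") or []:
            if alias in id_map and alias != node["id"]:
                collisions.append(f"alias '{alias}' of '{node['id']}' shadows an existing id")
            elif alias in alias_map and alias_map[alias] != node["id"]:
                collisions.append(
                    f"alias '{alias}' claimed by '{alias_map[alias]}' and '{node['id']}'"
                )
            else:
                alias_map[alias] = node["id"]

    return id_map, alias_map, collisions


nodes_list = [
    {"id": "the-odyssey", "file": "books/the-odyssey.md", "aliases": ["odyssey"]},
    {"id": "ulysses", "file": "books/ulysses.md", "aliases": ["odyssey"]},
    {"id": "the-odyssey", "file": "drafts/the-odyssey.md", "aliases": []},
]
id_map, alias_map, collisions = build_registry(nodes_list)
```

Running ids in a first pass means an alias can be checked against every id in the garden, regardless of file order.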


Step 4 — Parse Inline Links ([...](#))

For each file’s body (the Markdown content after the frontmatter), find every inline link of the form [...](#).

Each found link should be recorded as a raw edge candidate:

{
    "source": "the-bell-jar",   # source node id
    "raw": "a-odisseia",        # raw target name before resolution
    "label": "A Odisseia",      # optional, None if not present
    "kind": "inline",           # distinguish from metadata edges
}

You may assume:

Use a regex to find [...](#), then split on | if present.
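
A possible shape for the regex step is sketched here. parse_inline_links and INLINE_LINK_RE are hypothetical names, and the order inside the brackets (target before the pipe, label after it) is an assumption inferred from the example record above.

```python
import re

# Matches [text](#) only; group 1 is the bracket text. Links with a real URL
# such as [a site](https://example.com) are ignored.
INLINE_LINK_RE = re.compile(r"\[([^\[\]]+)\]\(#\)")


def parse_inline_links(source_id: str, body: str) -> list[dict]:
    """Return one raw edge candidate per [...](#) link found in body."""
    candidates = []
    for match in INLINE_LINK_RE.finditer(body):
        inner = match.group(1)
        # ASSUMPTION: 'target|Label' order; without a pipe there is no label.
        target, sep, label = inner.partition("|")
        candidates.append({
            "source": source_id,
            "raw": target.strip(),
            "label": label.strip() if sep else None,
            "kind": "inline",
        })
    return candidates


links = parse_inline_links(
    "the-bell-jar",
    "Compare with [a-odisseia|A Odisseia](#) and [homer](#), not [a site](https://example.com).",
)
```

str.partition splits on the first pipe only, so a label that itself contains a pipe stays intact.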


Step 5 — Extract Structured Edges from Metadata (starting with author)

Some relationships come from YAML metadata, not inline links. Start with author and make it easy to extend later.

Frontmatter example:

metadata:
  author: cleyton-cabral
  # or
  authors:
    - cleyton-cabral
    - another-author

Design a mapping layer so you don’t hardcode “author” everywhere:

RELATION_FIELDS = {
    "author": "author",      # field name -> edge type
    "authors": "author",
    # later:
    # "translator": "translator",
    # "inspired_by": "inspired-by",
}

Normalize metadata values to a list:

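def ensure_list(value):
    """Pass lists through, wrap scalars, and treat None (an empty YAML field) as no values."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]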

Then, for each node:

def extract_metadata_edges(node_id: str, data: dict) -> list[dict]:
    """
    Given a node's frontmatter data, extract edges defined by metadata fields
    like 'author', 'authors', etc.
    """
    edges: list[dict] = []
    metadata = data.get("metadata", {}) or {}

    for field, relation_type in RELATION_FIELDS.items():
        if field not in metadata:
            continue

        values = ensure_list(metadata[field])

        for v in values:
            edges.append({
                "from": node_id,
                "to_raw": v,             # raw target id/alias
                "type": relation_type,   # e.g. 'author'
                "source": "metadata",
            })

    return edges

We will resolve to_raw via resolve() in the next step.


Step 6 — Resolve All Edges

Combine both sources into one raw_edges list: the inline link candidates from Step 4 and the metadata edges from Step 5.

Then resolve:

all_edges: list[dict] = []

for edge in raw_edges:
    # Inline candidates (Step 4) carry "source"/"raw"; metadata edges (Step 5)
    # carry "from"/"to_raw". Read both shapes with .get() to avoid a KeyError.
    source_id = edge.get("from") or edge.get("source")
    raw_target = edge.get("to_raw") or edge.get("raw")
    edge_type = edge.get("type", "link")

    target_id = resolve(raw_target)
    if target_id is None:
        # unresolved / missing node: keep the raw name for the report
        all_edges.append({
            "from": source_id,
            "raw": raw_target,
            "type": edge_type,
            "resolved": False,
        })
    else:
        all_edges.append({
            "from": source_id,
            "to": target_id,
            "type": edge_type,
            "resolved": True,
        })

(Feel free to design a cleaner internal structure, but keep the idea: edges include from, either to or raw, and type.)
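
One candidate for that cleaner structure is a small dataclass that both edge sources are converted into before resolution. RawEdge, normalize and the origin field are hypothetical names used for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RawEdge:
    source_id: str   # node the edge starts from
    raw_target: str  # id or alias, before resolution
    type: str        # 'link', 'author', ...
    origin: str      # 'inline' or 'metadata'


def normalize(edge: dict) -> RawEdge:
    """Convert either dict shape (Step 4 or Step 5) into one RawEdge."""
    if "to_raw" in edge:
        # Metadata edge from Step 5; str() guards against non-string YAML values.
        return RawEdge(edge["from"], str(edge["to_raw"]), edge["type"], "metadata")
    # Inline candidate from Step 4; it has no explicit type, so default to 'link'.
    return RawEdge(edge["source"], edge["raw"], edge.get("type", "link"), "inline")


inline = normalize({"source": "the-bell-jar", "raw": "a-odisseia", "label": None, "kind": "inline"})
meta = normalize({"from": "the-odyssey", "to_raw": "homer", "type": "author", "source": "metadata"})
```

With a single shape, the resolution loop no longer needs to guess which keys are present, and the ambiguous "source" key (node id in one shape, the string "metadata" in the other) disappears.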


Step 7 — Build Backlinks and Detect Missing / Orphans

From all_edges:

  1. Backlinks index (incoming edges per node):

    backlinks: dict[str, list[dict]] = {
        # node_id -> list of {from, type, source_file}
    }
    
  2. Missing nodes (links pointing to non-existent nodes):

    Aggregate by raw name and keep a list of where they were referenced:

    missing = {
        "certain-thing": [
            {"from": "book-a", "file": "books/book-a.md", "type": "link"},
            {"from": "note-b", "file": "notes/note-b.md", "type": "author"},
        ],
        # ...
    }
    
  3. Orphans:

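    A node is an orphan if it has no incoming and no outgoing edges. Here outgoing_map is the outgoing counterpart of backlinks (node_id -> list of outgoing edges), and both maps only contain nodes that have at least one edge: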

    orphans = [node_id for node_id in nodes if node_id not in backlinks and node_id not in outgoing_map]
    

    You can also optionally distinguish:

    • “no incoming” (no backlinks),
    • “no outgoing” (dead-ends).

  4. Nodes without type:

    List any node that has no type field in its frontmatter:

    nodes_without_type = [node_id for node_id, node in nodes.items() if not node.get("type")]
    
  5. Basic summary stats:

    • total nodes,
    • total edges,
    • number of missing nodes,
    • number of orphans.
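
The five items above can be derived in a single pass over all_edges. The sketch below is a rough outline, not a definitive implementation: analyze is a hypothetical name, and counting an unresolved link as an outgoing edge (so a node whose only link is broken is not an orphan) is an assumption. The "file" field shown in the missing example is left out here for brevity.

```python
from collections import defaultdict


def analyze(nodes: dict, all_edges: list[dict]) -> dict:
    """Derive backlinks, missing targets, orphans and summary stats from resolved edges."""
    backlinks = defaultdict(list)     # target id -> incoming references
    outgoing_map = defaultdict(list)  # source id -> outgoing edges
    missing = defaultdict(list)       # raw name -> where it was referenced

    for edge in all_edges:
        # ASSUMPTION: an unresolved link still counts as an outgoing edge.
        outgoing_map[edge["from"]].append(edge)
        ref = {"from": edge["from"], "type": edge["type"]}
        if edge["resolved"]:
            backlinks[edge["to"]].append(ref)
        else:
            missing[edge["raw"]].append(ref)

    # 'in' on a defaultdict does not create keys, so these checks are safe.
    orphans = [n for n in nodes if n not in backlinks and n not in outgoing_map]
    nodes_without_type = [n for n, node in nodes.items() if not node.get("type")]
    return {
        "backlinks": dict(backlinks),
        "missing": dict(missing),
        "orphans": orphans,
        "nodes_without_type": nodes_without_type,
        "summary": {
            "Total nodes": len(nodes),
            "Total edges": len(all_edges),
            "Missing nodes": len(missing),
            "Orphans": len(orphans),
        },
    }


nodes = {"book-a": {"type": "book"}, "homer": {"type": "author"},
         "lonely-book": {"type": "book"}, "identity": {}}
all_edges = [
    {"from": "book-a", "to": "homer", "type": "author", "resolved": True},
    {"from": "book-a", "raw": "certain-thing", "type": "link", "resolved": False},
]
result = analyze(nodes, all_edges)
```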

Step 8 — Output Markdown Diagnostics

Write a Markdown file (e.g. reports/graph-report.md). Create the reports/ directory if needed.

Structure should look roughly like:

# Graph Report

## Missing Nodes

These links point to nodes that do not exist yet.

- [certain-thing](#)
  - referenced in:
    - books/book-a.md (as: link)
    - notes/note-b.md (as: author)

---

## Orphan Nodes

Nodes with no connections.

- [lonely-book](#)
- [random-note](#)

---

## Nodes Without Type

- [identity](#)
- [modernism](#)

---

## Summary

- Total nodes: 312
- Total edges: 1240
- Missing nodes: 23
- Orphans: 17

Use the node id when rendering [id](#). If you want, you can also include type in parentheses next to each orphan or missing node.
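
Rendering can stay a plain string-building function, with the file write kept separate so the text is easy to inspect. In this sketch render_report and write_report are hypothetical names, and the argument shapes follow the missing dict and the lists from Step 7.

```python
from pathlib import Path


def render_report(missing: dict, orphans: list, untyped: list, stats: dict) -> str:
    """Build the Markdown report text in the layout shown above."""
    lines = ["# Graph Report", "", "## Missing Nodes", "",
             "These links point to nodes that do not exist yet.", ""]
    for raw in sorted(missing):
        lines.append(f"- [{raw}](#)")
        lines.append("  - referenced in:")
        for ref in missing[raw]:
            lines.append(f"    - {ref['file']} (as: {ref['type']})")
    lines += ["", "---", "", "## Orphan Nodes", "", "Nodes with no connections.", ""]
    lines += [f"- [{node_id}](#)" for node_id in sorted(orphans)]
    lines += ["", "---", "", "## Nodes Without Type", ""]
    lines += [f"- [{node_id}](#)" for node_id in sorted(untyped)]
    lines += ["", "---", "", "## Summary", ""]
    lines += [f"- {label}: {value}" for label, value in stats.items()]
    return "\n".join(lines) + "\n"


def write_report(text: str, path: Path = Path("reports/graph-report.md")) -> None:
    """Create the parent directory if needed, then write the report as UTF-8."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")


report = render_report(
    {"certain-thing": [{"from": "book-a", "file": "books/book-a.md", "type": "link"}]},
    ["lonely-book"],
    ["identity"],
    {"Total nodes": 4, "Total edges": 2, "Missing nodes": 1, "Orphans": 2},
)
```

Sorting the ids keeps the report stable between runs, which makes changes easy to spot in version control.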


Error Handling & Edge Cases

Please:


Implementation Requirements

Please keep all these comments/instructions in the script so that future-me can open this file and understand the pipeline without re-reading this prompt.

Finally, return only the Python code (no extra prose around it).
