graph-builder-instructions

I’m building a “graph builder” script for my Markdown-based digital garden. Another AI helped me design the architecture in pieces; now I want you to produce ONE complete, working Python script from this spec.

Please read everything below (requirements + examples + partial code), then:

  1. Output a single Python script file.
  2. Keep comments and in-code instructions so the script is very readable and self-explanatory.
  3. Stay within this scope: build a graph and diagnostics from Markdown. No HTML generation, no moving files, no quote parsing.
  4. Prefer clear, straightforward code over cleverness.

type: ai-instructions
id: graph-builder-instructions

Scan all .md files in the project, build a node registry and a graph of connections between them (edges), then produce a Markdown report with:

  • missing nodes (links to things that don’t exist yet),
  • orphan nodes (no connections),
  • nodes without type,
  • and some basic summary counts.

The graph must incorporate both:

  • inline links: [[id]] and [[id|label]]
  • structured links from YAML frontmatter metadata (starting with author)

All relationships should be represented as edges with a type.


This script DOES:

  • Discover all .md files in the repo recursively.
  • Parse frontmatter YAML to get id, aliases, type, and metadata.
  • Build an ID and alias registry.
  • Parse [[...]] wikilinks from Markdown body.
  • Extract edges from both inline links and YAML metadata (starting with authorship).
  • Resolve link targets via id and aliases.
  • Track unresolved references and orphans.
  • Write a single Markdown report file (e.g. reports/graph-report.md).

This script DOES NOT:

  • Move files.
  • Generate HTML.
  • Parse quotes or quote metadata.
  • Modify any .md files.

Think of it as: index + resolver + validator.


Each Markdown file has frontmatter like:

---
id: the-odyssey
aliases:
  - odyssey
  - a-odisseia
type: book
metadata:
  title: The Odyssey
  author: homer
  year: -800
---

Another example:

---
id: caderno-do-fim-do-mundo--cleyton-cabral
aliases: []
type: book
metadata:
  title: Caderno do Fim do Mundo
  author: cleyton-cabral
  year: 2025
---

Later I’ll also have nodes of type author, concept, note, etc., but for now just handle whatever type appears (including missing).


Use pathlib and recursive glob from project root:

from pathlib import Path

ROOT = Path(".")

markdown_files = list(ROOT.rglob("*.md"))

The script should treat the whole tree as the “garden” (not just /books).


For each .md file:

  • Extract the YAML frontmatter between the first pair of --- lines.
  • Parse it with yaml.safe_load.
  • Pull out:
    • id (required for a valid node; if missing, treat as a node without id and report it),
    • aliases (optional; can be absent, a string, or list),
    • type (optional; we will flag missing type in the report),
    • metadata (dict; we’ll use some of its fields to create edges).
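The extraction step above could be sketched roughly as follows. This is a minimal, stdlib-only sketch under stated assumptions: `split_frontmatter` and `normalize_aliases` are hypothetical helper names, and the frontmatter block is assumed to open on the very first line of the file. The returned YAML text would then be handed to `yaml.safe_load`.

```python
from __future__ import annotations


def split_frontmatter(text: str) -> tuple[str | None, str]:
    """Return (yaml_text, body); yaml_text is None if there is no valid block."""
    lines = text.splitlines()
    # Assumption: the opening '---' must be the first line of the file.
    if not lines or lines[0].strip() != "---":
        return None, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    # Opening delimiter without a closing one: treat as invalid frontmatter.
    return None, text


def normalize_aliases(value) -> list[str]:
    """Accept an absent value, a single string, or a list; always return a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(v) for v in value]
```

Returning `None` instead of raising keeps the caller free to log the file under “Files without valid frontmatter” and move on.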

Example in-memory representation (you can model this as a simple dict or a small Node dataclass):

nodes = {
    "the-odyssey": {
        "id": "the-odyssey",
        "file": Path("books/finished/the-odyssey.md"),
        "aliases": ["odyssey", "a-odisseia"],
        "type": "book",
        "metadata": {
            "title": "The Odyssey",
            "author": "homer",
            "year": -800,
        },
    },
    # ...
}

You need two maps:

id_map = {
    "the-odyssey": nodes["the-odyssey"],  # or a Node instance
    # ...
}

alias_map = {
    "odyssey": "the-odyssey",
    "a-odisseia": "the-odyssey",
    # ...
}

Resolution function:

def resolve(name: str) -> str | None:
    """
    Given a link target like 'a-odisseia' or 'the-odyssey',
    return the canonical node id, or None if not found.
    """
    if name in id_map:
        return name
    if name in alias_map:
        return alias_map[name]
    return None

ID and alias collisions should be detected and reported (e.g. two files claim the same id, or the same alias points to two different ids).
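One possible way to detect these collisions while building the maps is sketched below. It assumes the loader yields a list of node dicts (a dict keyed by id could not hold duplicates), that the first file to claim an id or alias keeps it, and that collisions are collected as plain strings for the report. The function name matches the suggested `build_registries`, but the return shape is an assumption.

```python
from __future__ import annotations


def build_registries(nodes: list[dict]) -> tuple[dict, dict, list[str]]:
    """Build id_map and alias_map; collect collisions instead of raising."""
    id_map: dict[str, dict] = {}
    alias_map: dict[str, str] = {}
    errors: list[str] = []

    for node in nodes:
        node_id = node["id"]
        if node_id in id_map:
            errors.append(
                f"duplicate id '{node_id}': "
                f"{id_map[node_id]['file']} and {node['file']}"
            )
            continue  # assumption: the first definition wins
        id_map[node_id] = node

    for node in id_map.values():
        for alias in node.get("aliases", []):
            owner = alias_map.get(alias)
            if owner is not None and owner != node["id"]:
                errors.append(
                    f"alias collision '{alias}': {owner} and {node['id']}"
                )
                continue  # keep the first owner of the alias
            alias_map[alias] = node["id"]

    return id_map, alias_map, errors
```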


For each file’s body (Markdown content after frontmatter):

  • Find all wikilinks of the forms:
    • [[something]]
    • [[something|Label to display]]

Each found link should be recorded as a raw edge candidate:

{
    "source": "the-bell-jar",   # source node id
    "raw": "a-odisseia",        # raw target name before resolution
    "label": "A Odisseia",      # optional, None if not present
    "kind": "inline",           # distinguish from metadata edges
}

You may assume:

  • IDs and aliases are lowercase, kebab-case (e.g., the-odyssey, cleyton-cabral).
  • Labels can be arbitrary text.

Use a regex to find [[...]], then split on | if present.
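A minimal sketch of that regex step, assuming wikilinks are written as `[[target]]` or `[[target|Label]]` and that neither part contains square brackets or newlines. `WIKILINK_RE` and `parse_links` are illustrative names, not part of the spec.

```python
from __future__ import annotations

import re

# Assumption: targets and labels contain no '[' , ']' or newline characters.
WIKILINK_RE = re.compile(r"\[\[([^\[\]\n]+)\]\]")


def parse_links(source_id: str, body: str) -> list[dict]:
    """Return one raw edge candidate per wikilink found in the body."""
    edges: list[dict] = []
    for match in WIKILINK_RE.finditer(body):
        # partition() splits on the first '|' only; sep is '' when absent.
        target, sep, label = match.group(1).partition("|")
        target = target.strip()
        if not target:
            continue  # skip malformed links such as [[|label]]
        edges.append({
            "source": source_id,
            "raw": target,
            "label": (label.strip() or None) if sep else None,
            "kind": "inline",
        })
    return edges
```

For example, `parse_links("the-bell-jar", "See [[a-odisseia|A Odisseia]].")` yields a single candidate with `raw` set to `a-odisseia` and `label` set to `A Odisseia`.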


Some relationships come from YAML metadata, not inline links. Start with author and make it easy to extend later.

Frontmatter example:

metadata:
  author: cleyton-cabral
  # or
  authors:
    - cleyton-cabral
    - another-author

Design a mapping layer so you don’t hardcode “author” everywhere:

RELATION_FIELDS = {
    "author": "author",      # field name -> edge type
    "authors": "author",
    # later:
    # "translator": "translator",
    # "inspired_by": "inspired-by",
}

Normalize metadata values to a list:

def ensure_list(value):
    """Wrap a scalar in a list; treat a missing (None) value as empty."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

Then, for each node:

def extract_metadata_edges(node_id: str, data: dict) -> list[dict]:
    """
    Given a node's frontmatter data, extract edges defined by metadata fields
    like 'author', 'authors', etc.
    """
    edges: list[dict] = []
    metadata = data.get("metadata", {}) or {}

    for field, relation_type in RELATION_FIELDS.items():
        if field not in metadata:
            continue

        values = ensure_list(metadata[field])

        for v in values:
            edges.append({
                "from": node_id,
                "to_raw": v,             # raw target id/alias
                "type": relation_type,   # e.g. 'author'
                "source": "metadata",
            })

    return edges

We will resolve to_raw via resolve() in the next step.


Combine both sources:

  • edges from inline links
  • edges from metadata (extract_metadata_edges)

Then resolve:

all_edges: list[dict] = []

for edge in raw_edges:
    # Inline edges carry "source"/"raw"; metadata edges carry "from"/"to_raw".
    # Use .get() for the metadata keys so inline edges don't raise KeyError.
    source_id = edge.get("from") or edge["source"]
    raw_target = edge.get("to_raw") or edge["raw"]
    edge_type = edge.get("type", "link")

    target_id = resolve(raw_target)
    if target_id is None:
        # unresolved / missing node: keep the raw name for the report
        edge_record = {
            "from": source_id,
            "raw": raw_target,
            "type": edge_type,
            "resolved": False,
        }
    else:
        edge_record = {
            "from": source_id,
            "to": target_id,
            "type": edge_type,
            "resolved": True,
        }
    all_edges.append(edge_record)

(Feel free to design a cleaner internal structure, but keep the idea: edges include from, either to or raw, and type.)


From all_edges:

  1. Backlinks index (incoming edges per node):

    backlinks: dict[str, list[dict]] = {
        # node_id -> list of {from, type, source_file}
    }
    
  2. Missing nodes (links pointing to non-existent nodes):

    Aggregate by raw name and keep a list of where they were referenced:

    missing = {
        "certain-thing": [
            {"from": "book-a", "file": "books/book-a.md", "type": "link"},
            {"from": "note-b", "file": "notes/note-b.md", "type": "author"},
        ],
        # ...
    }
    
  3. Orphans:

    A node is an orphan if it has no incoming and no outgoing edges:

    orphans = [node_id for node_id in nodes if node_id not in backlinks and node_id not in outgoing_map]
    

    You can also optionally distinguish:

    • “no incoming” (no backlinks),
    • “no outgoing” (dead-ends).

  4. Nodes without type:

    List any node that has no type field in its frontmatter:

    nodes_without_type = [node_id for node_id, node in nodes.items() if not node.get("type")]
    
  5. Basic summary stats:

    • total nodes,
    • total edges,
    • number of missing nodes,
    • number of orphans.
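The derived structures listed above could be computed in one pass over `all_edges`, roughly as sketched here. The function name `compute_indexes` is hypothetical, and the sketch makes three assumptions that the spec leaves open: self-links are ignored for connectivity, unresolved edges do not count as outgoing connections, and the `file` field is omitted because the edge records shown earlier do not carry it.

```python
from __future__ import annotations

from collections import defaultdict


def compute_indexes(node_ids: list[str], all_edges: list[dict]):
    """Derive backlinks, outgoing edges, missing targets, and orphans."""
    backlinks: dict[str, list[dict]] = defaultdict(list)
    outgoing_map: dict[str, list[dict]] = defaultdict(list)
    missing: dict[str, list[dict]] = defaultdict(list)

    for edge in all_edges:
        ref = {"from": edge["from"], "type": edge["type"]}
        if not edge["resolved"]:
            # Aggregate unresolved references by their raw target name.
            missing[edge["raw"]].append(ref)
            continue
        if edge["to"] == edge["from"]:
            continue  # assumption: self-links are ignored
        outgoing_map[edge["from"]].append(edge)
        backlinks[edge["to"]].append(ref)

    # A node is an orphan if it has neither incoming nor outgoing edges.
    orphans = sorted(
        n for n in node_ids if n not in backlinks and n not in outgoing_map
    )
    return dict(backlinks), dict(outgoing_map), dict(missing), orphans
```

The summary counts then follow directly: `len(node_ids)`, `len(all_edges)`, `len(missing)`, and `len(orphans)`.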

Write a Markdown file (e.g. reports/graph-report.md). Create the reports/ directory if needed.

Structure should look roughly like:

```markdown
# Graph Report

## Missing nodes

These links point to nodes that do not exist yet.

  • certain-thing
    • referenced in:
      • books/book-a.md (as: link)
      • notes/note-b.md (as: author)

## Orphans

Nodes with no connections.

## Summary

  • Total nodes: 312
  • Total edges: 1240
  • Missing nodes: 23
  • Orphans: 17
```

Use the node id when rendering [[id]]. If you want, you can also include type in parentheses next to each orphan or missing node.
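A rough sketch of the report step, kept as a pure rendering function plus a small writer so the text is easy to inspect. The section headings (“Missing nodes”, “Orphans”, “Summary”), the function names, and the `[[id]]` rendering are assumptions for illustration; the real report would also include the error and no-type sections.

```python
from __future__ import annotations

from pathlib import Path


def render_report(missing: dict[str, list[dict]], orphans: list[str],
                  total_nodes: int, total_edges: int) -> str:
    """Build the Markdown report text without touching the filesystem."""
    lines = ["# Graph Report", "", "## Missing nodes", ""]
    for raw in sorted(missing):
        lines.append(f"- {raw}")
        for ref in missing[raw]:
            lines.append(f"  - referenced in [[{ref['from']}]] (as: {ref['type']})")
    lines += ["", "## Orphans", ""]
    lines += [f"- [[{node_id}]]" for node_id in orphans]
    lines += [
        "", "## Summary", "",
        f"- Total nodes: {total_nodes}",
        f"- Total edges: {total_edges}",
        f"- Missing nodes: {len(missing)}",
        f"- Orphans: {len(orphans)}",
    ]
    return "\n".join(lines) + "\n"


def write_report(text: str, path: Path = Path("reports/graph-report.md")) -> None:
    """Create the reports/ directory if needed, then write the file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
```

Separating rendering from writing means the report text can be checked or printed without creating any files.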


Please:

  • Skip files without valid frontmatter, but log them in the report under a small “Files without valid frontmatter” section.
  • Detect the following and list them in the report under an “Errors” section:
    • duplicate ids (two files defining the same id),
    • alias collisions (same alias mapped to different ids).
  • Treat self-links ([[same-id]] inside its own file) as either:
    • ignored, or
    • tracked separately, but they should not prevent the node from being considered connected.

  • Use Python 3.10+.
  • Use pathlib.Path for filesystem paths.
  • Use yaml.safe_load for YAML.
  • Organize code into clear functions, e.g.:
    • load_nodes()
    • build_registries(nodes)
    • parse_links_from_file(...)
    • extract_metadata_edges(...)
    • build_edges(...)
    • compute_backlinks_and_orphans(...)
    • write_report(...)
    • main()
  • Add comments at the top of the file explaining the purpose and scope.
  • Inside the code, add short but clear comments explaining:
    • what each function does,
    • why key decisions are made (e.g., why we don’t auto-create missing nodes, why relationships get a type).

Please keep all these comments/instructions in the script so that future-me can open this file and understand the pipeline without re-reading this prompt.

Finally, return only the Python code (no extra prose around it).
