graph-builder-instructions

High-Level Goal

Scan all .md files in the project, build a node registry and a graph of connections between them (edges), then produce a Markdown report with:

- missing nodes (links to things that don’t exist yet),
- orphan nodes (no connections),
- nodes without type,
- and some basic summary counts.

The graph must incorporate both:

- inline links: [[id]] and [[id|label]]
- structured links from YAML frontmatter metadata (starting with author)

All relationships should be represented as edges with a type.

Scope (what this script DOES and DOES NOT do)

This script DOES:

- Discover all .md files in the repo recursively.
- Parse frontmatter YAML to get id, aliases, type, and metadata.
- Build an ID and alias registry.
- Parse [[...]] wikilinks from the Markdown body.
- Extract edges from both inline links and YAML metadata (starting with authorship).
- Resolve link targets via id and aliases.
- Track unresolved references and orphans.
- Write a single Markdown report file (e.g. reports/graph-report.md).

This script DOES NOT:

- Move files.
- Generate HTML.
- Parse quotes or quote metadata.
- Modify any .md files.

Think of it as: index + resolver + validator.

Data Model

Each Markdown file has frontmatter like:

---
id: the-odyssey
aliases:
  - odyssey
  - a-odisseia
type: book
metadata:
  title: The Odyssey
  author: homer
  year: -800
---

Another example:

---
id: caderno-do-fim-do-mundo--cleyton-cabral
aliases: []
type: book
metadata:
  title: Caderno do Fim do Mundo
  author: cleyton-cabral
  year: 2025
---

Later I’ll also have nodes of type author, concept, note, etc., but for now just handle whatever type appears (including missing).

Step 1 — Discover Files

Use pathlib and recursive glob from project root:

from pathlib import Path

ROOT = Path(".")

markdown_files = list(ROOT.rglob("*.md"))

The script should treat the whole tree as the “garden” (not just /books).

Step 2 — Extract Node Identity (frontmatter)

For each .md file:

Extract the YAML frontmatter between the first pair of --- lines.
Parse it with yaml.safe_load.
Pull out:
- id (required for a valid node; if missing, treat as a node without id and report it),
- aliases (optional; can be absent, a string, or list),
- type (optional; we will flag missing type in the report),
- metadata (dict; we’ll use some of its fields to create edges).

Example in-memory representation (you can model this as a simple dict or a small Node dataclass):

nodes = {
    "the-odyssey": {
        "id": "the-odyssey",
        "file": Path("books/finished/the-odyssey.md"),
        "aliases": ["odyssey", "a-odisseia"],
        "type": "book",
        "metadata": {
            "title": "The Odyssey",
            "author": "homer",
            "year": -800,
        },
    },
    # ...
}

Step 3 — Build ID & Alias Registry

You need two maps:

id_map = {
    "the-odyssey": nodes["the-odyssey"],  # or a Node instance
    # ...
}

alias_map = {
    "odyssey": "the-odyssey",
    "a-odisseia": "the-odyssey",
    # ...
}

Resolution function:

def resolve(name: str) -> str | None:
    """
    Given a link target like 'a-odisseia' or 'the-odyssey',
    return the canonical node id, or None if not found.
    """
    if name in id_map:
        return name
    if name in alias_map:
        return alias_map[name]
    return None

ID and alias collisions should be detected and reported (e.g. two files claim the same id, or the same alias points to two different ids).
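
One possible shape for build_registries with collision detection is sketched below. It assumes each node is a dict with id, file, and an already-normalized aliases list, that the first file to claim an id or alias wins, and that collisions are collected as plain strings rather than raised. All three are assumptions, not requirements.

```python
def build_registries(node_list: list[dict]):
    """Build id_map and alias_map, collecting collisions instead of raising."""
    id_map: dict[str, dict] = {}
    alias_map: dict[str, str] = {}
    errors: list[str] = []

    for node in node_list:
        node_id = node["id"]
        if node_id in id_map:
            errors.append(
                f"duplicate id '{node_id}': {id_map[node_id]['file']} and {node['file']}"
            )
            continue  # assumption: keep the first definition
        id_map[node_id] = node

    for node_id, node in id_map.items():
        for alias in node.get("aliases", []):
            if alias in id_map and alias != node_id:
                # An alias that equals another node's id would be ambiguous.
                errors.append(f"alias '{alias}' of '{node_id}' shadows an existing id")
            elif alias_map.get(alias, node_id) != node_id:
                errors.append(
                    f"alias '{alias}' claimed by '{alias_map[alias]}' and '{node_id}'"
                )
            else:
                alias_map[alias] = node_id

    return id_map, alias_map, errors
```

The errors list can be written straight into the report's “Errors” section.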

Step 4 — Parse Inline Links (`[[...]]`)

For each file’s body (Markdown content after frontmatter):

Find all wikilinks of the forms:
- [[something]]
- [[something|Label to display]]

Each found link should be recorded as a raw edge candidate:

{
    "source": "the-bell-jar",   # source node id
    "raw": "a-odisseia",        # raw target name before resolution
    "label": "A Odisseia",      # optional, None if not present
    "kind": "inline",           # distinguish from metadata edges
}

You may assume:

IDs and aliases are lowercase, kebab-case (e.g., the-odyssey, cleyton-cabral).
Labels can be arbitrary text.

Use a regex to find [[...]], then split on | if present: the part before the | is the raw target, the part after it is the label.
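
A rough sketch of that regex step, assuming wikilinks are written as [[target]] or [[target|Label]]. The pattern and the function name parse_inline_links are illustrative assumptions; the output matches the raw edge candidate shape shown above.

```python
import re

# Group 1 is the raw target, group 2 the optional label after '|'.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def parse_inline_links(source_id: str, body: str) -> list[dict]:
    """Return one raw edge candidate per wikilink found in `body`."""
    edges: list[dict] = []
    for match in WIKILINK_RE.finditer(body):
        raw = match.group(1).strip()
        if not raw:
            continue  # skip links with a blank target such as '[[ ]]'
        label = (match.group(2) or "").strip() or None
        edges.append({
            "source": source_id,
            "raw": raw,
            "label": label,
            "kind": "inline",
        })
    return edges
```

For example, parse_inline_links("the-bell-jar", "See [[a-odisseia|A Odisseia]].") yields a single candidate with raw "a-odisseia" and label "A Odisseia".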

Step 5 — Extract Structured Edges from Metadata (starting with `author`)

Some relationships come from YAML metadata, not inline links. Start with author and make it easy to extend later.

Frontmatter example:

metadata:
  author: cleyton-cabral
  # or
  authors:
    - cleyton-cabral
    - another-author

Design a mapping layer so you don’t hardcode “author” everywhere:

RELATION_FIELDS = {
    "author": "author",      # field name -> edge type
    "authors": "author",
    # later:
    # "translator": "translator",
    # "inspired_by": "inspired-by",
}

Normalize metadata values to a list:

def ensure_list(value):
    # Treat an empty YAML value as "no targets" rather than [None].
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

Then, for each node:

def extract_metadata_edges(node_id: str, data: dict) -> list[dict]:
    """
    Given a node's frontmatter data, extract edges defined by metadata fields
    like 'author', 'authors', etc.
    """
    edges: list[dict] = []
    metadata = data.get("metadata", {}) or {}

    for field, relation_type in RELATION_FIELDS.items():
        if field not in metadata:
            continue

        values = ensure_list(metadata[field])

        for v in values:
            edges.append({
                "from": node_id,
                "to_raw": v,             # raw target id/alias
                "type": relation_type,   # e.g. 'author'
                "source": "metadata",
            })

    return edges

We will resolve to_raw via resolve() in the next step.

Step 6 — Resolve All Edges

Combine both sources:

edges from inline links
edges from metadata (extract_metadata_edges)

Then resolve:

all_edges: list[dict] = []

# raw_edges = inline edge candidates + metadata edges.
for edge in raw_edges:
    # Inline edges use "source"/"raw"; metadata edges use "from"/"to_raw".
    # .get() avoids a KeyError when a key belongs to the other edge shape.
    source_id = edge.get("from") or edge.get("source")
    raw_target = edge.get("to_raw") or edge.get("raw")
    target_id = resolve(raw_target)

    edge_record = {
        "from": source_id,
        "type": edge.get("type", "link"),  # inline links default to "link"
        "resolved": target_id is not None,
    }
    if target_id is None:
        edge_record["raw"] = raw_target  # unresolved / missing node
    else:
        edge_record["to"] = target_id
    all_edges.append(edge_record)

(Feel free to design a cleaner internal structure, but keep the idea: edges include from, either to or raw, and type.)

Step 7 — Build Backlinks and Detect Missing / Orphans

From all_edges:

Backlinks index (incoming edges per node):

backlinks: dict[str, list[dict]] = {
    # node_id -> list of {from, type, source_file}
}

Missing nodes (links pointing to non-existent nodes):

Aggregate by raw name and keep a list of where they were referenced:

missing = {
    "certain-thing": [
        {"from": "book-a", "file": "books/book-a.md", "type": "link"},
        {"from": "note-b", "file": "notes/note-b.md", "type": "author"},
    ],
    # ...
}
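
The aggregation could look roughly like this. It assumes the edge records from Step 6 (from, raw, type, resolved) and looks up the referencing file through the nodes dict; collect_missing is a hypothetical helper name.

```python
from collections import defaultdict


def collect_missing(all_edges: list[dict], nodes: dict[str, dict]) -> dict[str, list[dict]]:
    """Group unresolved edges by their raw target name."""
    missing: dict[str, list[dict]] = defaultdict(list)
    for edge in all_edges:
        if edge["resolved"]:
            continue
        # Assumption: every edge source is a known node; fall back to "" if not.
        source = nodes.get(edge["from"], {})
        missing[edge["raw"]].append({
            "from": edge["from"],
            "file": str(source.get("file", "")),
            "type": edge["type"],
        })
    return dict(missing)
```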

Orphans:

A node is an orphan if it has no incoming and no outgoing edges:
```
orphans = [node_id for node_id in nodes if node_id not in backlinks and node_id not in outgoing_map]
```
Here outgoing_map is the mirror of backlinks (node_id -> list of outgoing edges). You can also optionally distinguish:
- “no incoming” (no backlinks),
- “no outgoing” (dead-ends).
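
A sketch of how backlinks, outgoing_map, and orphans might be computed together. It assumes only resolved edges count as connections and does not special-case self-links; both choices are assumptions you may want to change.

```python
from collections import defaultdict


def compute_backlinks_and_orphans(all_edges: list[dict], nodes: dict[str, dict]):
    """Index resolved edges in both directions and list unconnected nodes."""
    backlinks: dict[str, list[dict]] = defaultdict(list)
    outgoing_map: dict[str, list[dict]] = defaultdict(list)

    for edge in all_edges:
        if not edge["resolved"]:
            continue  # assumption: links to missing nodes are not connections
        backlinks[edge["to"]].append({"from": edge["from"], "type": edge["type"]})
        outgoing_map[edge["from"]].append({"to": edge["to"], "type": edge["type"]})

    # Orphan: no incoming and no outgoing resolved edges.
    orphans = sorted(
        node_id for node_id in nodes
        if node_id not in backlinks and node_id not in outgoing_map
    )
    return dict(backlinks), dict(outgoing_map), orphans
```

With both indexes available, “no incoming” is any node absent from backlinks and “no outgoing” is any node absent from outgoing_map.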

Nodes without type:

List any node that has no type field in its frontmatter:

nodes_without_type = [node_id for node_id, node in nodes.items() if not node.get("type")]

Basic summary stats:
- total nodes,
- total edges,
- number of missing nodes,
- number of orphans.

Step 8 — Output Markdown Diagnostics

Write a Markdown file (e.g. reports/graph-report.md). Create the reports/ directory if needed.

Structure should look roughly like:

# Graph Report

## Missing Nodes

These links point to nodes that do not exist yet.

- [[certain-thing]]
  - referenced in:
    - books/book-a.md (as: link)
    - notes/note-b.md (as: author)

---

## Orphan Nodes

Nodes with no connections.

- [[lonely-book]]
- [[random-note]]

---

## Nodes Without Type

- [[identity]]
- [[modernism]]

---

## Summary

- Total nodes: 312
- Total edges: 1240
- Missing nodes: 23
- Orphans: 17

Use the node id when rendering [[id]]. If you want, you can also include type in parentheses next to each orphan or missing node.
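
A possible rendering of that structure is sketched below. It assumes ids are rendered as [[id]] wikilinks and that the missing dict has the shape from Step 7. render_report is a hypothetical helper, and the write_report signature and default path are only suggestions.

```python
from pathlib import Path


def render_report(missing, orphans, nodes_without_type, total_nodes, total_edges) -> str:
    """Build the report text; writing to disk is kept separate for testability."""
    lines = ["# Graph Report", "", "## Missing Nodes", ""]
    lines += ["These links point to nodes that do not exist yet.", ""]
    for raw, refs in sorted(missing.items()):
        lines.append(f"- [[{raw}]]")
        lines.append("  - referenced in:")
        for ref in refs:
            lines.append(f"    - {ref['file']} (as: {ref['type']})")
    lines += ["", "---", "", "## Orphan Nodes", "", "Nodes with no connections.", ""]
    lines += [f"- [[{node_id}]]" for node_id in sorted(orphans)]
    lines += ["", "---", "", "## Nodes Without Type", ""]
    lines += [f"- [[{node_id}]]" for node_id in sorted(nodes_without_type)]
    lines += ["", "---", "", "## Summary", ""]
    lines += [
        f"- Total nodes: {total_nodes}",
        f"- Total edges: {total_edges}",
        f"- Missing nodes: {len(missing)}",
        f"- Orphans: {len(orphans)}",
    ]
    return "\n".join(lines) + "\n"


def write_report(text: str, path: Path = Path("reports/graph-report.md")) -> None:
    """Create the parent directory if needed, then write the report."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
```

Keeping the rendering separate from the file write makes the report easy to check without touching the filesystem.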

Error Handling & Edge Cases

Please:

- Skip files without valid frontmatter, but log them in the report under a small “Files without valid frontmatter” section.
- Detect the following and list them in the report under an “Errors” section:
  - duplicate ids (two files defining the same id),
  - alias collisions (same alias mapped to different ids).
- Treat self-links ([[same-id]] inside its own file) as either:
  - ignored, or
  - tracked separately, but they should not prevent the node from being considered connected.

Implementation Requirements

Use Python 3.10+.
Use pathlib.Path for filesystem paths.
Use yaml.safe_load for YAML.
Organize code into clear functions, e.g.:
- load_nodes()
- build_registries(nodes)
- parse_links_from_file(...)
- extract_metadata_edges(...)
- build_edges(...)
- compute_backlinks_and_orphans(...)
- write_report(...)
- main()
Add comments at the top of the file explaining the purpose and scope.
Inside the code, add short but clear comments explaining:
- what each function does,
- why key decisions are made (e.g., why we don’t auto-create missing nodes, why relationships get a type).

Please keep all these comments/instructions in the script so that future-me can open this file and understand the pipeline without re-reading this prompt.

Finally, return only the Python code (no extra prose around it).