graph-builder-improvements-instructions

I already have a working Python script that builds a graph for my Markdown-based knowledge system. It lives at system/graph/build.py.

Right now, the script does roughly this:

  • Discovers all .md files.

  • Parses YAML frontmatter to build a node registry:

    ▫ each node has at least: id, optional type, file, aliases, metadata.

  • Parses inline links [id](#) / [Label](#) from Markdown bodies.

  • Extracts structured edges from YAML metadata (starting with metadata.author / metadata.authors via a RELATION_FIELDS mapping).

  • Resolves edges using id_map / alias_map.

  • Writes JSON artifacts under generated/graph/ (e.g. nodes.json, edges.json, unresolved.json).

  • Writes a Markdown diagnostics report under reports/graph-report.md (missing nodes, orphans, nodes without type, etc.).
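For orientation, here is a minimal sketch of the inline-link and resolution steps listed above. It is not the code in build.py: the function names, the regular expression, and the assumed shapes of id_map (id to node) and alias_map (alias to id) are illustrative guesses.

```python
import re

# Matches [id](#) and [Label](#); links with any other target are ignored.
INLINE_LINK = re.compile(r"\[([^\]]+)\]\(#\)")


def extract_inline_links(body):
    """Return the link texts of all [text](#) links in a Markdown body."""
    return INLINE_LINK.findall(body)


def resolve_target(label, id_map, alias_map):
    """Return the node id a link text points to, or None if unresolved.

    Assumption: id_map maps id -> node and alias_map maps alias -> id.
    """
    if label in id_map:
        return label
    return alias_map.get(label)
```

A link text that matches neither map comes back as None, which is the case the existing unresolved.json output is meant to record.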

I do NOT want you to rewrite this script from scratch.

I want you to extend the existing script to add a graph enrichment step that creates “type index” nodes and edges. The idea:

  • Nodes normally declare a type field in their frontmatter (e.g. book, author, movie, concept); it is optional, so some nodes may lack one.

  • For each distinct type, I want a virtual “index node” whose id is <type>-list. Examples:

    ▫ type: author → index node author-list

    ▫ type: book → index node book-list

  • These index nodes do not correspond to .md files; they are generated by the system.

  • Each index node should:

    ▫ appear in the nodes output (e.g. nodes.json) with something like:

      ⁃ id: "author-list"

      ⁃ type: "index"

      ⁃ optionally index_of_type: "author"

    ▫ have edges from the index node to all nodes of that type, e.g.:

      ⁃ { "from": "author-list", "to": "cleyton-cabral", "type": "contains" }

      ⁃ { "from": "book-list", "to": "the-idiot", "type": "contains" }

Key constraints and requirements:

  1. Do not remove or change existing behavior.

     Keep the current node/edge building, unresolved detection, and report generation intact. You are adding an enrichment layer on top.

  2. Add a clear enrichment step after the base graph is built.

     Conceptually the pipeline should be:

    ▫ parse Markdown → build base nodes

    ▫ extract inline + metadata edges

    ▫ resolve edges

    ▫ enrich graph with type index nodes + edges

    ▫ write JSON outputs + report
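The ordering above can be sketched as a small driver. Every name and signature here is a hypothetical stand-in for whatever build.py already does at each stage; the only point is where the enrichment call sits.

```python
def run_pipeline(parse, extract, resolve, enrich, write):
    """Run the graph build in the order listed above.

    Each argument is a callable standing in for an existing stage of
    build.py (hypothetical signatures, for illustration only).
    """
    nodes = parse()                            # parse Markdown -> base nodes
    edges = extract(nodes)                     # inline + metadata edges
    edges, unresolved = resolve(nodes, edges)  # resolve via id/alias maps
    nodes, edges = enrich(nodes, edges)        # new: type index nodes + edges
    write(nodes, edges, unresolved)            # JSON outputs + report
    return nodes, edges, unresolved
```

Running the enrichment after resolution means the generated contains edges never pass through the resolver, so they cannot end up in the unresolved list by accident.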

  3. Implementation details for the enrichment:

    ▫ Work from the in-memory nodes and edges structures that the script already uses (or whatever the current internal representation is).

    ▫ Group nodes by type. Nodes without a type should simply be ignored by this enrichment.

    ▫ For each distinct type_name:

      ⁃ Create (or reuse if already present) an index node with:

        ▪ id = f"{type_name}-list"

        ▪ type = "index"

        ▪ optional helper field like index_of_type = type_name

      ⁃ For each node of that type, append a new edge from the index node to that node: { "from": f"{type_name}-list", "to": <node id>, "type": "contains" }

    ▫ Make sure you don’t create duplicate index nodes or duplicate contains edges if the enrichment runs more than once on the same in-memory graph.
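A minimal sketch of the enrichment described above, under two assumptions that may not match build.py: nodes is a dict mapping id to a node dict, and edges is a list of dicts with from / to / type keys. Skipping nodes whose type is already "index" is an extra choice made here so that a second run does not produce an index-list node.

```python
def add_type_index_nodes_and_edges(nodes, edges):
    """Add one virtual "<type>-list" index node per distinct type, plus
    "contains" edges from each index node to the nodes of that type.

    Assumed shapes: nodes is {id: node_dict}, edges is a list of
    {"from", "to", "type"} dicts. Both are modified in place.
    """
    # Group node ids by type. Untyped nodes are ignored; index nodes are
    # skipped so that re-running never creates an "index-list" node.
    ids_by_type = {}
    for node_id, node in nodes.items():
        type_name = node.get("type")
        if not type_name or type_name == "index":
            continue
        ids_by_type.setdefault(type_name, []).append(node_id)

    # Remember existing edges so repeated runs add no duplicates.
    seen = {(e.get("from"), e.get("to"), e.get("type")) for e in edges}

    for type_name in sorted(ids_by_type):
        index_id = f"{type_name}-list"
        # Create the index node, or reuse one that is already present.
        nodes.setdefault(
            index_id,
            {"id": index_id, "type": "index", "index_of_type": type_name},
        )
        for node_id in sorted(ids_by_type[type_name]):
            key = (index_id, node_id, "contains")
            if key not in seen:
                seen.add(key)
                edges.append({"from": index_id, "to": node_id, "type": "contains"})

    return nodes, edges
```

One caveat with this sketch: if a real note already uses an id such as book-list, setdefault reuses that node unchanged rather than overwriting it, so such collisions would need a separate decision.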

  4. Integration into existing outputs:

    ▫ Ensure the new index nodes are included in generated/graph/nodes.json.

    ▫ Ensure the new contains edges are included in generated/graph/edges.json using the same schema as other edges (from, to, type, optionally source).

    ▫ These virtual nodes should NOT appear in unresolved.json (they are not unresolved; they are generated).

    ▫ They also shouldn’t be treated as “missing type” in the report (since they have type: "index").
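The last two points can be checked with something like the sketch below. Both helper names are hypothetical, and the shape of an unresolved entry (a dict with a "to" key) is an assumption about what build.py records.

```python
def nodes_missing_type(nodes):
    """Ids of nodes with no type. Index nodes carry type "index", so they
    are never reported here."""
    return sorted(
        node_id for node_id, node in nodes.items() if not node.get("type")
    )


def drop_generated_from_unresolved(unresolved, nodes):
    """Remove unresolved entries whose target is a generated index node.

    Assumption: each unresolved entry is a dict with a "to" key.
    """
    index_ids = {
        node_id for node_id, node in nodes.items() if node.get("type") == "index"
    }
    return [entry for entry in unresolved if entry.get("to") not in index_ids]
```

If the enrichment runs after resolution, the second helper should normally have nothing to remove; it is a safety net rather than a required step.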

  5. Code style and structure:

    ▫ Add one or more helper functions instead of stuffing everything into main:

      ⁃ e.g. def add_type_index_nodes_and_edges(nodes, edges): ...

    ▫ Use clear, descriptive names, and keep the existing style of the file.

    ▫ Add concise comments explaining:

      ⁃ what type index nodes are,

      ⁃ why they are generated,

      ⁃ where in the pipeline the enrichment happens.

  6. No API-breaking changes:

    ▫ Keep current function signatures and external behavior unless a change is absolutely necessary.

    ▫ Any new data fields added to node or edge objects should be additive and safe for downstream consumers (e.g. a future HTML generator reading graph.json).

Please show only the modified/added parts of build.py first (so I can see the diff), and then, if helpful, show the full updated file. Keep comments and inline explanations in the code so it’s easy to understand later.

type: ai-instructions
id: graph-builder-improvements-instructions
