Ground Agents in Typed, Retrievable Knowledge

You need agents that can answer questions about relationships between entities, look up context by identifier, and find semantically related content. Right now, that knowledge is either trapped in untyped files or retrievable only through an external search engine. Four libraries -- @forwardimpact/libresource, @forwardimpact/libgraph, @forwardimpact/libindex, and @forwardimpact/libvector -- give you a self-contained knowledge infrastructure that runs locally without external databases.

The pipeline flows in three stages: ingest HTML into typed resources, extract RDF triples into a graph, and generate vector embeddings for semantic retrieval. Each stage produces a JSONL-backed index that agents can query directly.

Prerequisites

  • Node.js 18+
  • An embedding endpoint (any OpenAI-compatible /v1/embeddings API) for vector indexing
  • HTML files with schema.org microdata markup in a data/knowledge/ directory

Install all four libraries:

npm install @forwardimpact/libresource @forwardimpact/libgraph @forwardimpact/libindex @forwardimpact/libvector

How the pipeline fits together

Each library owns one stage. The output of one stage feeds the next:

data/knowledge/*.html
        |
        v
  libresource          -->  data/resources/*.json     (typed resources)
        |
        +-------+
        |       |
        v       v
  libgraph    libvector
        |       |
        v       v
  data/graphs/  data/vectors/
  index.jsonl   index.jsonl
  ontology.ttl

libindex provides the IndexBase class that both GraphIndex and VectorIndex extend. It handles JSONL persistence, lazy loading, prefix filtering, and token budgeting so the specialized indexes inherit that behavior without reimplementing it.
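
The behaviors libindex centralizes are easy to picture. The sketch below is illustrative only -- the class and method names are simplified stand-ins, not the actual libindex API -- but it shows the same ideas: one JSON record per line, lazy loading on first query, and prefix filtering over identifiers.

```javascript
// Illustrative sketch only -- not the real libindex API. It demonstrates
// JSONL persistence (one JSON record per line), lazy loading, and
// prefix filtering over identifiers.
import fs from "node:fs";
import os from "node:os";
import path from "node:path";

class JsonlIndexSketch {
  constructor(filePath) {
    this.filePath = filePath;
    this.items = null; // lazy: nothing is read until the first query
  }

  load() {
    if (this.items === null) {
      this.items = fs
        .readFileSync(this.filePath, "utf8")
        .split("\n")
        .filter((line) => line.trim() !== "")
        .map((line) => JSON.parse(line)); // one JSON record per line
    }
    return this.items;
  }

  query({ prefix = "", limit = Infinity } = {}) {
    return this.load()
      .filter((item) => item.id.startsWith(prefix))
      .slice(0, limit);
  }
}

// Demo: two records, narrowed by identifier prefix.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "index-"));
const file = path.join(dir, "index.jsonl");
fs.writeFileSync(
  file,
  '{"id":"common.Message.a1b2"}\n{"id":"common.Conversation.c3d4"}\n'
);
const index = new JsonlIndexSketch(file);
console.log(index.query({ prefix: "common.Message" }).map((m) => m.id));
```

Token budgeting follows the same narrowing shape: stop collecting items once an estimated token count would exceed the caller's budget.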

1. Prepare the knowledge directory

Create data/knowledge/ and add HTML files with schema.org microdata. The resource processor extracts typed entities from itemscope / itemtype / itemprop attributes:

<!-- data/knowledge/team.html -->
<!DOCTYPE html>
<html>
<head><base href="https://example.com/team" /></head>
<body>
  <div itemscope itemtype="https://schema.org/Person">
    <span itemprop="name">Alice Chen</span>
    <span itemprop="jobTitle">Senior Engineer</span>
    <link itemprop="worksFor" href="https://example.com/org/acme" />
  </div>
  <div itemscope itemtype="https://schema.org/Organization">
    <meta itemprop="url" content="https://example.com/org/acme" />
    <span itemprop="name">Acme Corp</span>
  </div>
</body>
</html>

The <base href> element sets the IRI for all relative references in the document. Without it, the processor falls back to the --base flag or a default URI.
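
Resolution follows standard WHATWG URL semantics, which Node's built-in URL class implements. You can check how a given reference will resolve against your base before ingesting:

```javascript
// How relative references resolve against the document's <base href>.
const base = "https://example.com/team";

console.log(new URL("/org/acme", base).href); // https://example.com/org/acme
console.log(new URL("#alice", base).href);    // https://example.com/team#alice
```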

2. Ingest HTML into typed resources

Run the resource processor to parse every HTML file in data/knowledge/ and store each entity as a typed Message resource:

npx fit-process-resources --base=https://example.com/

The processor:

  1. Finds all .html files in data/knowledge/
  2. Sanitizes the DOM (normalizes whitespace, encodes stray characters)
  3. Extracts RDF quads from microdata using the streaming parser
  4. Skolemizes blank nodes into content-hashed URIs for cross-document deduplication
  5. Serializes each entity's triples as Turtle RDF
  6. Stores the result in data/resources/ as a JSON file with a content-hashed identifier

When the same entity appears in multiple HTML files, the processor merges triples using RDF union semantics -- new properties are added, existing identical triples are deduplicated.
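
Union semantics amount to set union over canonicalized triples. A minimal sketch (not the processor's actual implementation -- real RDF union operates on parsed quads, and the ex:/schema: terms below are illustrative):

```javascript
// Sketch of RDF union semantics: identical triples deduplicate,
// new triples accumulate. Triples are canonicalized to strings so a
// Map can detect duplicates by key.
const key = (t) => `${t.s} ${t.p} ${t.o}`;

function unionMerge(existing, incoming) {
  const merged = new Map(existing.map((t) => [key(t), t]));
  for (const t of incoming) merged.set(key(t), t); // identical keys collapse
  return [...merged.values()];
}

const fileA = [
  { s: "ex:alice", p: "schema:name", o: '"Alice Chen"' },
  { s: "ex:alice", p: "schema:jobTitle", o: '"Senior Engineer"' },
];
const fileB = [
  { s: "ex:alice", p: "schema:name", o: '"Alice Chen"' }, // duplicate: deduplicated
  { s: "ex:alice", p: "schema:worksFor", o: "ex:acme" },  // new property: added
];
const merged = unionMerge(fileA, fileB);
console.log(merged.length); // 3
```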

After processing, verify the resources exist:

ls data/resources/
common.Message.a1b2c3d4.json
common.Message.e5f6g7h8.json

Each file contains the entity's typed identifier, its role (system), and the RDF content as a Turtle string.

3. Build the RDF graph

With resources in place, extract their RDF content into a graph index and generate the ontology:

npx fit-process-graphs

The graph processor:

  1. Reads all resource identifiers from data/resources/
  2. Filters to common.Message resources (which contain RDF content)
  3. Parses each resource's Turtle content back into quads
  4. Adds quads to the in-memory N3 triple store, keyed by resource identifier
  5. Writes the graph index to data/graphs/index.jsonl
  6. Builds a SHACL ontology from all observed types and predicates
  7. Writes the ontology to data/graphs/ontology.ttl

The ontology file describes the shape of the data -- which types exist, what properties each type has, and how types relate to each other. Agents read this file to understand what questions the graph can answer before writing queries.
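
For the example data above, the generated ontology.ttl might contain a shape along these lines (an illustrative fragment -- the exact prefixes, shape names, and constraint details the processor emits may differ):

```turtle
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix schema: <https://schema.org/> .
@prefix ex:     <https://example.com/shapes/> .

ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass schema:Person ;
    sh:property [ sh:path schema:name ] ;
    sh:property [ sh:path schema:jobTitle ] ;
    sh:property [ sh:path schema:worksFor ] .
```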

Verify the graph was built:

npx fit-subjects
https://example.com/team#alice	https://schema.org/Person
https://example.com/org/acme	https://schema.org/Organization

Each line shows a subject URI and its type. Run a triple-pattern query to test a relationship:

npx fit-query "?" schema:worksFor "?"
common.Message.a1b2c3d4

The output is the identifier of the resource containing the matching triple. The query uses a subject-predicate-object pattern in which ? is a wildcard. Prefixed names like schema:worksFor expand using the standard prefix map (schema: -> https://schema.org/).
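
The matching semantics are straightforward to sketch (illustrative only, not fit-query's implementation): a pattern matches a triple when every non-wildcard term is equal.

```javascript
// Triple-pattern matching: "?" matches anything; other terms must be equal.
const WILD = "?";
const matches = (pattern, triple) =>
  ["s", "p", "o"].every((k) => pattern[k] === WILD || pattern[k] === triple[k]);

const triples = [
  { s: "https://example.com/team#alice", p: "https://schema.org/worksFor", o: "https://example.com/org/acme" },
  { s: "https://example.com/team#alice", p: "https://schema.org/name", o: '"Alice Chen"' },
];

// "? schema:worksFor ?" after prefix expansion:
const pattern = { s: WILD, p: "https://schema.org/worksFor", o: WILD };
const hits = triples.filter((t) => matches(pattern, t));
console.log(hits.length); // 1
```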

4. Generate vector embeddings

The vector processor takes each resource's text content, sends it to an embedding endpoint, and stores the resulting vectors:

npx fit-process-vectors

This requires an OpenAI-compatible embedding endpoint. Configure the endpoint and token through environment variables or config/vectors.yaml:

# config/vectors.yaml
embeddingBaseUrl: http://localhost:8080

The processor:

  1. Reads all resource identifiers from data/resources/
  2. Filters out conversations and tool functions
  3. Batches resource content for efficient embedding API calls
  4. Stores each vector alongside its resource identifier in data/vectors/index.jsonl
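
The batching in step 3 amounts to chunking resource texts so each embeddings call carries several inputs, since OpenAI-compatible /v1/embeddings endpoints accept an array of inputs per request. A sketch (the batch size here is arbitrary; the processor chooses its own):

```javascript
// Chunk a list of texts into fixed-size batches for an embeddings API.
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const texts = ["doc one", "doc two", "doc three", "doc four", "doc five"];
const batches = toBatches(texts, 2);
console.log(batches.length); // 3 batches: sizes 2, 2, 1
```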

After processing, test a semantic search:

npx fit-search "senior engineering role"
common.Message.a1b2c3d4	0.8712
common.Message.e5f6g7h8	0.6543

Results are ranked by dot-product score (cosine similarity for normalized vectors). Higher scores indicate closer semantic matches.
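
The scoring reduces to a plain dot product, which equals cosine similarity when vectors are normalized to unit length. A sketch of the ranking step (the two-dimensional vectors are illustrative; real embeddings have hundreds of dimensions):

```javascript
// Dot product of two equal-length vectors. For unit-length vectors
// this is exactly cosine similarity.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Rank candidate vectors against a query vector, highest score first.
function rank(query, candidates) {
  return candidates
    .map(({ id, vector }) => ({ id, score: dot(query, vector) }))
    .sort((a, b) => b.score - a.score);
}

const query = [1, 0];
const ranked = rank(query, [
  { id: "a", vector: [0.97, 0.24] }, // close to the query direction
  { id: "b", vector: [0.24, 0.97] }, // nearly orthogonal to it
]);
console.log(ranked[0].id); // "a" ranks first
```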

5. Query from code

The CLIs are thin wrappers around the library APIs. For programmatic access, use the libraries directly:

import { createGraphIndex, parseGraphQuery } from "@forwardimpact/libgraph";
import { createResourceIndex } from "@forwardimpact/libresource";

// Query the graph for all Person entities
const graph = createGraphIndex("graphs");
const pattern = parseGraphQuery("? rdf:type schema:Person");
const identifiers = await graph.queryItems(pattern);

// Resolve matched identifiers to full resources
const resources = createResourceIndex("resources");
const items = await resources.get(identifiers.map(String));

for (const item of items) {
  console.log(item.id.type, item.id.name);
  console.log(item.content);   // Turtle RDF string
}

The createGraphIndex("graphs") call reads from data/graphs/; the createResourceIndex("resources") call reads from data/resources/. Both use the data/<prefix>/ convention. Pass a different prefix to point at a different directory.

For vector search from code:

import { VectorIndex } from "@forwardimpact/libvector/index/vector.js";
import { createStorage } from "@forwardimpact/libstorage";

const storage = createStorage("vectors");
const vectorIndex = new VectorIndex(storage);

// Assume you have a query vector from your embedding API
const queryVector = [0.12, -0.34, 0.56, /* ... */];
const results = await vectorIndex.queryItems([queryVector], {
  limit: 5,
  threshold: 0.5,
});

for (const id of results) {
  console.log(String(id), id.score?.toFixed(4));
}

Both queryItems methods accept a filter object with prefix, limit, and max_tokens to scope results by identifier prefix, cap the count, or stay within a token budget.
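
The narrowing those filter fields perform can be sketched as follows (illustrative only; the tokens field is a hypothetical per-item count standing in for whatever estimate the library uses internally):

```javascript
// Sketch of applying a { prefix, limit, max_tokens } filter: narrow by
// identifier prefix, cap the count, and stop adding items once the
// token budget would be exceeded. `tokens` is a hypothetical field.
function applyFilter(items, { prefix = "", limit = Infinity, max_tokens = Infinity } = {}) {
  const selected = [];
  let budget = max_tokens;
  for (const item of items) {
    if (!item.id.startsWith(prefix)) continue;
    if (selected.length >= limit) break;
    if (item.tokens > budget) break;
    budget -= item.tokens;
    selected.push(item);
  }
  return selected;
}

const items = [
  { id: "common.Message.a1", tokens: 400 },
  { id: "common.Message.b2", tokens: 700 },
  { id: "common.Conversation.c3", tokens: 100 },
];
// 400 fits the 800-token budget; adding 700 more would not.
console.log(applyFilter(items, { prefix: "common.Message", max_tokens: 800 }).length); // 1
```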

Verify

After running all three stages, confirm the full pipeline produced the expected artifacts:

ls data/resources/       # Typed resource JSON files
ls data/graphs/          # index.jsonl + ontology.ttl
ls data/vectors/         # index.jsonl with embeddings

npx fit-subjects                           # All subjects and types
npx fit-query "?" rdf:type schema:Person   # Graph query
npx fit-search "team member"               # Semantic search

Each command should return results drawn from the HTML files you ingested. If a command returns nothing, check that the previous stage completed: both the graph and vector stages depend on the resources produced by the ingest stage.

What's next

Each query mode has a dedicated guide for deeper work:

  • Query the Graph -- write triple-pattern queries, filter by type, and traverse relationships in the RDF graph.
  • Look Up Context -- retrieve resources by identifier, apply prefix filters, and manage token budgets with the index API.
  • Resolve a Resource -- load a typed resource by identifier with access control and inspect its content and metadata.
  • Search Semantically -- embed a query, score against the vector index, and rank results by relevance.