Search Pipeline#
Search Index Configuration#
type: list
items:
  type: dict
  schema:
    glob: {type: str, required: true}
    docs_parser: {type: str}
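A minimal sketch of checking a parsed configuration against the schema above, using plain Python rather than any particular validation library. The function name `validate_index_config`, the glob patterns, and the parser names are hypothetical and do not come from the pipeline itself.

```python
from typing import Any


def validate_index_config(config: Any) -> list:
    """Check a parsed config against the list-of-dicts schema above."""
    if not isinstance(config, list):
        raise TypeError("config must be a list")
    for i, item in enumerate(config):
        if not isinstance(item, dict):
            raise TypeError(f"item {i} must be a dict")
        # `glob` is the only required key and must be a string.
        if not isinstance(item.get("glob"), str):
            raise ValueError(f"item {i}: 'glob' is required and must be a str")
        # `docs_parser` is optional, but must be a string when present.
        if "docs_parser" in item and not isinstance(item["docs_parser"], str):
            raise ValueError(f"item {i}: 'docs_parser' must be a str")
    return config


# Hypothetical entries: these patterns and parser names are made up.
example_config = [
    {"glob": "docs/**/*.md", "docs_parser": "markdown"},
    {"glob": "notes/*.txt"},
]
validated = validate_index_config(example_config)
```

An entry without `glob` raises `ValueError`, while an entry without `docs_parser` passes, mirroring `required: true` on `glob` only.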
Search Index Data Structure#
search_index:
  type: dict
  schema:
    path: {type: str}
    versions:
      type: list
      items:
        - type: dict
          schema:
            content: {type: str}
            chunks:
              type: list
              items:
                - type: dict
                  schema:
                    content: {type: str}
                    embedding:
                      type: list
                      items: {type: float}
            facts:
              type: list
              items:
                - type: dict
                  schema:
                    content: {type: str}
                    embedding:
                      type: list
                      items: {type: float}
                    is_good: {type: boolean}
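For illustration, one record shaped like this schema might look as follows. This is a sketch under two assumptions: that `chunks` and `facts` both nest under each entry of `versions`, and that `is_good` is a per-fact flag. The path, texts, and 3-dimensional embeddings are made up, and the helper `iter_embedded` (including its choice to skip facts where `is_good` is false) is hypothetical.

```python
from typing import Iterator, List, Tuple

# Hypothetical record; all values below are invented for illustration.
search_index = {
    "path": "docs/pipeline.md",
    "versions": [
        {
            "content": "Full text of this file version.",
            "chunks": [
                {"content": "First chunk.", "embedding": [0.1, 0.2, 0.3]},
            ],
            "facts": [
                {"content": "A kept fact.", "embedding": [0.3, 0.1, 0.0], "is_good": True},
                {"content": "A rejected fact.", "embedding": [0.0, 0.5, 0.5], "is_good": False},
            ],
        }
    ],
}


def iter_embedded(index: dict) -> Iterator[Tuple[str, str, List[float]]]:
    """Yield (kind, content, embedding) for every chunk and every good fact."""
    for version in index["versions"]:
        for chunk in version.get("chunks", []):
            yield "chunk", chunk["content"], chunk["embedding"]
        for fact in version.get("facts", []):
            # Assumption: facts flagged is_good=False are left out of retrieval.
            if fact.get("is_good", True):
                yield "fact", fact["content"], fact["embedding"]


items = list(iter_embedded(search_index))
```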
Pipeline DAG#
![digraph pipeline {
edge [color="#808080", arrowsize=.6, penwidth=3, minlen=3];
node [shape=box, fontname="DIN Next, sans-serif", style="rounded,filled", penwidth=5, fillcolor="#8010d0", color="#f0f0f0", fontcolor=white, margin="0.35" fontweight=bold]
bgcolor="#f0f0f0";
path -> content [label=" Save version"]
content -> doc [label=" Split file into docs"]
doc -> chunk [label=" Split"]
doc -> fact [label=" Extract knowledge"]
doc -> tfidf [label=" Build TF-IDF"]
chunk [label=" chunk (e)"]
fact [label=" fact (e)"]
chunk -> embedding
fact -> embedding
embedding [label="Embeddings (fact or chunk)"]
tfidf [label=" Term inverse index"]
embedding -> kNN [label=" Collect"]
serve -> kNN [label=" Find top-K embeddings"]
kNN -> RAG [label=" Summarize"]
serve -> tfidf [label=" Find top-N docs"]
tfidf -> RAG [label=" Summarize"]
serve -> LLM [label=" Answer from internal memory"]
LLM -> RAG [label=" Summarize"]
}](../_images/graphviz-a10b18f365232f117c24b0407f6914be23fe547e.png)
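The "Find top-K embeddings" edge in the graph can be sketched as a brute-force nearest-neighbour search. The graph does not say which similarity measure or index the pipeline uses, so cosine similarity, the function names, and the toy 2-dimensional vectors below are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          collected: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query embedding."""
    scored = [(cosine(query, embedding), text) for text, embedding in collected]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical collected embeddings for chunks and facts.
collected = [
    ("chunk about parsing", [1.0, 0.0]),
    ("fact about embeddings", [0.0, 1.0]),
    ("chunk about search", [0.7, 0.7]),
]
hits = top_k([0.1, 0.9], collected, k=2)
```

A linear scan like this is only practical for small collections; it is meant to show the shape of the step, not how the pipeline implements it.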
Knowledge Extraction#
Prompt:
Format the following document as a list of self-sufficient evergreen facts. One per line. Include supporting context in each fact.
{text}
Postprocessing:
[line.strip(' \n-') for line in output.splitlines()]
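A small sketch of how the prompt and the postprocessing expression fit together. The names `EXTRACTION_PROMPT`, `build_extraction_prompt`, and `postprocess`, and the sample model output, are hypothetical. Dropping empty entries is an addition that the source expression does not perform.

```python
from typing import List

EXTRACTION_PROMPT = (
    "Format the following document as a list of self-sufficient evergreen facts. "
    "One per line. Include supporting context in each fact.\n"
    "{text}"
)


def build_extraction_prompt(text: str) -> str:
    """Fill the document text into the extraction prompt."""
    return EXTRACTION_PROMPT.format(text=text)


def postprocess(output: str) -> List[str]:
    """Strip list markers and surrounding whitespace from each output line."""
    facts = [line.strip(' \n-') for line in output.splitlines()]
    # Assumption: blank lines carry no fact, so they are dropped here.
    return [fact for fact in facts if fact]


# Made-up model output with a bullet, a blank line, and stray whitespace.
sample_output = (
    "- The pipeline splits each doc into chunks.\n"
    "\n"
    " - Facts are embedded like chunks. \n"
)
facts = postprocess(sample_output)
```

Note that `strip(' \n-')` removes any run of spaces, newlines, and hyphens from both ends of a line, so a fact that legitimately ends in a hyphen would lose it.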
Retrieval-Augmented Generation#
Prompt:
Answer the following question using only the context below. Only include information specifically discussed. Copy the answer verbatim from the context. Exclude irrelevant sentences. Be concise.
Question: {question} Context: {context}
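A minimal sketch of filling this prompt from retrieved passages. The names `RAG_PROMPT` and `build_rag_prompt` are hypothetical, and joining passages with newlines is an assumption; the source does not say how the context is assembled.

```python
from typing import List

RAG_PROMPT = (
    "Answer the following question using only the context below. "
    "Only include information specifically discussed. "
    "Copy the answer verbatim from the context. "
    "Exclude irrelevant sentences. Be concise.\n"
    "Question: {question} Context: {context}"
)


def build_rag_prompt(question: str, passages: List[str]) -> str:
    """Combine retrieved passages into one context and fill the prompt."""
    # Assumption: one passage per line, in retrieval order.
    context = "\n".join(passages)
    return RAG_PROMPT.format(question=question, context=context)


# Made-up question and passages, standing in for kNN or TF-IDF results.
prompt = build_rag_prompt(
    "How are facts stored?",
    ["Facts are embedded like chunks.", "Each fact has an is_good flag."],
)
```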