Search Pipeline#

Search Index Configuration#

type: list
items:
  type: dict
  schema:
    glob: {type: str, required: true}
    docs_parser: {type: str}

Search Index Data Structure#

search_index:
  type: dict
  schema:
    path: {type: str}
    versions:
      type: list
      items:
        - type: dict
          schema:
            content: {type: str}
            chunks:
              type: list
              items:
                - type: dict
                  schema:
                    content: {type: str}
                    embedding:
                      type: list
                      items: {type: float}
            facts:
              type: list
              items:
                - type: dict
                  schema:
                    content: {type: str}
                    embedding:
                      type: list
                      items: {type: float}
                    is_good: {type: boolean}

Pipeline DAG#

digraph pipeline {
edge [color="#808080", arrowsize=.6, penwidth=3, minlen=3];
node [shape=box, fontname="DIN Next, sans-serif", style="rounded,filled", penwidth=5, fillcolor="#8010d0", color="#f0f0f0", fontcolor=white,  margin="0.35" fontweight=bold]
bgcolor="#f0f0f0";
path -> content [label=" Save version"]
content -> doc [label=" Split file into docs"]
doc -> chunk [label=" Split"]
doc -> fact [label=" Extract knowledge"]
doc -> tfidf [label=" Build TF-IDF"]
chunk [label=" chunk (e)"]
fact [label=" fact (e)"]
chunk -> embedding
fact -> embedding

embedding [label="Embeddings (fact or chunk)"]
tfidf [label=" Term inverse index"]
embedding -> kNN [label=" Collect"]
serve -> kNN [label=" Find top-K embeddings"]
kNN -> RAG [label=" Summarize"]
serve -> tfidf [label=" Find top-N docs"]
tfidf -> RAG  [label=" Summarize"]
serve -> LLM [label=" Answer from internal memory"]
LLM -> RAG  [label=" Summarize"]
}

Knowledge Extraction#

Prompt:

Format the following document as a list of self-sufficient evergreen facts. One per line. Include supporting context in each fact.

{text}

Postprocessing:

[line.strip(' \n-') for line in output.splitlines()]

Retrieval-Augmented Generation#

Prompt:

Answer the following question using only the context below. Only include information specifically discussed. Copy the answer verbatim from the context. Exclude irrelevant sentences. Be concise.

Question: {question} Context: {context}