Text Preprocessing and Cleaning

After uploading content, users can choose different tools for chunking, indexing and segmenting the data.

Glik provides an automatic tool for chunking data, but users can also customize it for added convenience.
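To give a feel for what a customized chunking step might involve, here is a minimal sketch of fixed-size chunking with overlap. It does not describe Glik's actual chunker: the function name `chunk_text` and the `chunk_size` and `overlap` parameters are hypothetical, chosen only for illustration.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows of chunk_size that share `overlap` characters.

    Hypothetical helper for illustration; not Glik's actual chunking tool.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text, so the
        # final chunk is never just a repeat of the previous overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of storing more text.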

Indexing is necessary for accurate data retrieval. Glik offers two indexing modes, each with its own retrieval methods:

High Quality
Economical

High Quality

In this mode, the system first uses a configurable embedding model (which can be switched) to convert chunk text into numerical vectors. This enables efficient compression and persistent storage of large-scale textual data, and improves the accuracy of interactions between the user and the LLM.

This mode allows users to choose from three retrieval methods:

Vector Search: The system vectorizes the user's input query to generate a query vector. It then computes the distance between this query vector and the text vectors in the knowledge base to identify the semantically closest text chunks.
Full-Text Search: The system indexes all terms in the document, so users can query any term and retrieve the text fragments that contain it.
Hybrid: The system performs full-text search and vector search simultaneously, then applies a reordering step to select the results from both searches that best match the user's query.
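The three retrieval methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `embed` is a term-count stand-in for a real embedding model, the similarity measure is cosine similarity, and the reordering step uses reciprocal rank fusion, a common choice that the source does not confirm Glik uses. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text):
    # Assumption: a sparse term-count vector stands in for a real embedding model.
    return Counter(tokenize(text))

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vector_search(query, chunks):
    # Rank chunk indices by similarity between the query vector and each chunk vector.
    q = embed(query)
    scored = [(cosine_similarity(q, embed(c)), i) for i, c in enumerate(chunks)]
    return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]

def full_text_search(query, chunks):
    # Rank chunk indices by how many distinct query terms each chunk contains.
    terms = set(tokenize(query))
    scored = [(len(terms & set(tokenize(c))), i) for i, c in enumerate(chunks)]
    return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]

def hybrid_search(query, chunks, k=60):
    # Assumption: reciprocal rank fusion serves as the reordering step.
    fused = Counter()
    for ranking in (vector_search(query, chunks), full_text_search(query, chunks)):
        for rank, idx in enumerate(ranking):
            fused[idx] += 1.0 / (k + rank + 1)
    return [idx for idx, _ in sorted(fused.items(), key=lambda p: (-p[1], p[0]))]
```

Each function returns chunk indices ordered from best to worst match, so the results of the three methods can be compared directly on the same list of chunks.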

Economical

Economical mode employs an offline vector engine and keyword indexing, which reduces accuracy but eliminates additional token consumption and associated costs. The indexing method is limited to inverted indexing.
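For readers unfamiliar with inverted indexing, the following is a minimal sketch of the idea: each term maps to the set of chunks that contain it, so a keyword query needs no embedding call. The function names and the tokenization rule are assumptions made for illustration and do not reflect Glik's internal implementation.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(chunks):
    # Map each term to the set of chunk ids in which it appears.
    index = defaultdict(set)
    for chunk_id, text in enumerate(chunks):
        for term in tokenize(text):
            index[term].add(chunk_id)
    return index

def keyword_lookup(index, query):
    # Count how many distinct query terms each chunk contains,
    # then return chunk ids ordered by that count (ties broken by id).
    hits = defaultdict(int)
    for term in set(tokenize(query)):
        for chunk_id in index.get(term, ()):
            hits[chunk_id] += 1
    return sorted(hits, key=lambda cid: (-hits[cid], cid))
```

Because lookups only touch the terms present in the query, this kind of index is cheap to query, which is consistent with the trade-off described above: lower accuracy than vector search, but no extra token cost.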

TopK

This parameter selects the text chunks that are most similar to the user's question. The default value is 3; a higher value results in more text segments being retrieved. The system also dynamically adjusts the value of TopK, and therefore the number of snippets returned, according to the context window size (max_tokens) of the selected model.
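As a rough illustration of how TopK and a model's token limit might interact, the sketch below keeps the `top_k` highest-scoring chunks and then drops trailing ones that would not fit a token budget. The exact adjustment rule Glik applies is not documented here, so the budgeting logic, the function name `select_top_k`, and the whitespace-based token estimate are all assumptions.

```python
def select_top_k(scored_chunks, top_k=3, max_tokens=None, count_tokens=None):
    """Return up to top_k chunk texts, highest score first.

    scored_chunks: list of (score, text) pairs.
    max_tokens: optional budget; chunks are kept while they still fit (assumed rule).
    count_tokens: optional callable estimating tokens; defaults to a word count.
    """
    if count_tokens is None:
        # Crude estimate: one token per whitespace-separated word.
        count_tokens = lambda text: len(text.split())
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)[:top_k]
    if max_tokens is None:
        return [text for _, text in ranked]
    selected, used = [], 0
    for _, text in ranked:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # the budget is exhausted, so fewer than top_k chunks are returned
        selected.append(text)
        used += cost
    return selected
```

With the default of 3 and no budget, the three best chunks are returned; with a tight budget, the effective TopK shrinks.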

