© 2023–2025 Rivalz Technologies Ltd.

Text Preprocessing and Cleaning

Last updated 3 months ago

After uploading content, users can choose different tools for chunking, indexing and segmenting the data.

GLIK chunks data automatically by default, but users can also customize the chunking settings to suit their content.

Indexing is necessary for accurate data retrieval. GLIK offers two indexing modes, each with its own retrieval methods:
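As a rough illustration of what chunking means, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_text` and the parameters `max_chars` and `overlap` are hypothetical and chosen for this example only; the platform's own chunking rules and settings are not described on this page and may differ.

```python
def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters.

    Hypothetical helper for illustration; not the platform's actual chunker.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")

    step = max_chars - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window has reached the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighbouring chunks, which is a common reason to customize chunking.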

  • High Quality

  • Economical

High Quality

In this mode, the system first uses a configurable embedding model (which can be switched) to convert chunk text into numerical vectors. This supports efficient compression and persistent storage of large volumes of text, and it improves the accuracy of interactions between the user and the LLM.

This mode allows users to choose from three retrieval methods:

  • Vector Search: The system vectorizes the user's input query to generate a query vector. It then computes the distance between this query vector and the text vectors in the knowledge base to identify the most semantically proximate text chunks.

  • Full-Text Search: The system indexes all terms in the documents, so users can query any term and get back the text fragments that contain it.

  • Hybrid: The system runs full-text search and vector search simultaneously, then applies a reordering (reranking) step that selects the results from both searches that best match the user's query.
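The three retrieval methods above can be sketched in a few lines. This is a toy model under stated assumptions, not the platform's implementation: `embed` is a bag-of-words stand-in for a real embedding model, the weighted score sum in `hybrid_search` is a simple stand-in for the reordering step, and all function names and the `vector_weight` parameter are hypothetical.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]


def embed(text: str) -> Counter:
    # Assumption: a sparse word-count vector stands in for a real embedding model.
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vector_search(query: str, chunks: list[str]) -> list[float]:
    """Score each chunk by similarity between the query vector and the chunk vector."""
    q = embed(query)
    return [cosine(q, embed(c)) for c in chunks]


def full_text_search(query: str, chunks: list[str]) -> list[float]:
    """Score each chunk by the fraction of query terms it contains."""
    terms = set(tokenize(query))
    if not terms:
        return [0.0 for _ in chunks]
    return [len(terms & set(tokenize(c))) / len(terms) for c in chunks]


def hybrid_search(query: str, chunks: list[str], top_k: int = 3,
                  vector_weight: float = 0.5) -> list[str]:
    """Run both searches, then reorder chunks by a weighted sum of the two scores."""
    v_scores = vector_search(query, chunks)
    f_scores = full_text_search(query, chunks)
    scored = [
        (vector_weight * v + (1 - vector_weight) * f, chunk)
        for v, f, chunk in zip(v_scores, f_scores, chunks)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep at most top_k chunks and drop chunks that matched nothing.
    return [chunk for score, chunk in scored[:top_k] if score > 0]
```

A production system would typically replace `embed` with a learned embedding model and the weighted sum with a dedicated rerank model; the control flow stays the same.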

Economical

Economical mode uses an offline vector engine and keyword indexing. It is less accurate than High Quality mode, but it consumes no additional tokens and incurs no associated cost. Indexing in this mode is limited to an inverted index.
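For readers unfamiliar with the term, an inverted index maps each term to the chunks that contain it, so a keyword lookup needs no model call. The sketch below is a generic illustration; `build_inverted_index` and `keyword_search` are hypothetical names, and the platform's actual index format is not documented here.

```python
from collections import defaultdict


def _terms(text: str) -> set[str]:
    """Lowercase, strip basic punctuation, and deduplicate the words in text."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if w}


def build_inverted_index(chunks: list[str]) -> dict[str, set[int]]:
    """Map each term to the set of chunk ids in which it appears."""
    index: dict[str, set[int]] = defaultdict(set)
    for chunk_id, chunk in enumerate(chunks):
        for term in _terms(chunk):
            index[term].add(chunk_id)
    return index


def keyword_search(index: dict[str, set[int]], query: str) -> list[int]:
    """Return chunk ids ordered by how many distinct query terms they contain."""
    hits: dict[int, int] = defaultdict(int)
    for term in _terms(query):
        for chunk_id in index.get(term, ()):
            hits[chunk_id] += 1
    # Most matching terms first; ties broken by chunk id for a stable order.
    return sorted(hits, key=lambda cid: (-hits[cid], cid))
```

Because the lookup is a plain dictionary access, it costs no tokens, which matches the trade-off described above: cheaper, but blind to synonyms and paraphrases.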

TopK

This parameter selects the text chunks that are most similar to the user's question. The default value is 3; a higher value retrieves more text segments. The system also dynamically adjusts the number of chunks returned, based on the selected model's context window size and max_tokens setting.
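One plausible way to combine a TopK limit with a token budget is sketched below. The exact adjustment rule the platform applies is not specified on this page, so treat this as an assumption: `select_top_k`, `reserved_tokens`, and the whitespace-based token estimate are all hypothetical.

```python
def select_top_k(ranked_chunks: list[str], top_k: int = 3,
                 max_tokens: int = 4096, reserved_tokens: int = 1024) -> list[str]:
    """Keep up to top_k chunks, stopping early if the token budget would be exceeded.

    ranked_chunks must already be sorted from most to least similar.
    reserved_tokens is the space set aside for the prompt and the model's answer.
    """
    budget = max_tokens - reserved_tokens
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks[:top_k]:
        cost = len(chunk.split())  # crude estimate: one token per whitespace-separated word
        if used + cost > budget:
            break  # fewer than top_k chunks fit in the model's context
        selected.append(chunk)
        used += cost
    return selected
```

With a large context window the function simply returns the first `top_k` chunks; with a small one it returns fewer, which mirrors the dynamic adjustment described above.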