Scalaris - Document Management Agent

Overview

Scalaris is the agent responsible for document processing and management in the TKM AI Agency Platform. It handles document analysis, text extraction, and summarization, and it coordinates with other agents (Atta, Bala, Orion, and Niger) to complete the processing of each document.

Directory Structure

Scalaris/
├── data/                # Directory for processed documents
├── scalaris.py         # Main agent implementation
├── api_scalaris.py     # FastAPI endpoints
├── tools.py            # Document processing utilities
├── tools_schema.py     # Data models and schemas
├── tools_definitions.py # Constants and definitions
└── data_validation.py  # Document validation utilities

Main Components

ScalarisAgent Class

ScalarisAgent is the main class. It handles document processing and coordination:

  • Document processing pipeline management
  • Integration with other agents
  • Event handling
  • Data path management

Document Processing Pipeline

  1. Document validation
  2. Text extraction
  3. Summary generation
  4. Metadata extraction
  5. Classification and organization
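
The five stages above can be read as a chain of functions that each enrich a shared context. The following is a minimal sketch under that assumption, not the actual scalaris.py implementation: the stage functions, the ALLOWED_SUFFIXES set, and the placeholder summary and classification logic are all hypothetical.

```python
from pathlib import Path

# Hypothetical: the formats Scalaris really accepts are not listed in this page.
ALLOWED_SUFFIXES = {".txt", ".md"}


def validate(ctx: dict) -> dict:
    path = Path(ctx["file_path"])
    if not path.is_file() or path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported or missing document: {path}")
    return ctx


def extract_text(ctx: dict) -> dict:
    ctx["extracted_text"] = Path(ctx["file_path"]).read_text(encoding="utf-8")
    return ctx


def summarize(ctx: dict) -> dict:
    # Placeholder: a real agent would call a language model here.
    ctx["summary"] = ctx["extracted_text"].strip()[:200]
    return ctx


def extract_metadata(ctx: dict) -> dict:
    path = Path(ctx["file_path"])
    ctx["metadata"] = {"file_name": path.name, "size_bytes": path.stat().st_size}
    return ctx


def classify(ctx: dict) -> dict:
    # Placeholder: this page assigns classification to the Orion agent.
    ctx["category"] = "uncategorized"
    return ctx


# Stage order mirrors the numbered list above.
PIPELINE = [validate, extract_text, summarize, extract_metadata, classify]


def run_pipeline(file_path: str) -> dict:
    ctx = {"file_path": file_path}
    for stage in PIPELINE:
        ctx = stage(ctx)  # each stage adds its keys; an exception stops the run
    return ctx
```

Keeping the stages in a list makes the order explicit and lets a failed validation stop the run before any extraction work starts.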

Key Features

Document Processing

  • Text extraction from various document formats
  • Document summarization
  • Metadata extraction
  • Format validation

Integration Features

  • Embedding generation through Bala
  • Message storage with Atta
  • Classification with Orion
  • Data persistence with Niger

Data Management

  • Structured document storage
  • Metadata organization
  • Organization-level isolation
  • User-specific data handling
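
One way to get organization-level isolation and user-specific handling is to derive each storage directory from the request IDs. The sketch below assumes a data/<organization_id>/<user_id>/ layout and a restricted ID alphabet; both are illustrative assumptions, and document_dir is a hypothetical helper rather than part of tools.py.

```python
import re
from pathlib import Path

# Hypothetical rule: IDs may only contain letters, digits, "_" and "-".
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def document_dir(base: Path, organization_id: str, user_id: str) -> Path:
    """Return the per-organization, per-user directory under `base`."""
    for name, value in (("organization_id", organization_id), ("user_id", user_id)):
        if not _SAFE_ID.fullmatch(value):
            # Rejects values such as "../other_org" that could escape the tree.
            raise ValueError(f"invalid {name}: {value!r}")
    return base / organization_id / user_id
```

Validating the IDs before building the path keeps one organization's request from resolving into another organization's directory.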

API Operations

Document Processing

# Processing Request
{
    "file_path": str,
    "user_id": str,
    "session_id": str,
    "conversation_id": str,
    "organization_id": str
}
 
# Processing Response
{
    "success": bool,
    "data": {
        "summary": str,
        "document_info": dict,
        "text_status": str,
        "tokens_info": dict,
        "folder_structure": dict,
        "embedding_id": str
    }
}
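
The request shape above could be enforced with a typed model before any processing starts. This sketch uses standard-library dataclasses to stay dependency-free; in a FastAPI service the same fields would more typically be declared as a Pydantic model. ProcessingRequest and parse_request are hypothetical names, not taken from api_scalaris.py.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProcessingRequest:
    file_path: str
    user_id: str
    session_id: str
    conversation_id: str
    organization_id: str


def parse_request(payload: dict) -> ProcessingRequest:
    """Build a request, rejecting missing or non-string fields."""
    names = [f.name for f in fields(ProcessingRequest)]
    missing = [n for n in names if n not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    wrong = [n for n in names if not isinstance(payload[n], str)]
    if wrong:
        raise ValueError(f"fields must be strings: {', '.join(wrong)}")
    # Extra keys in the payload are ignored in this sketch.
    return ProcessingRequest(**{n: payload[n] for n in names})
```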

Integration

Agent Communication

  • Atta: Message storage and conversation management
  • Bala: Text embedding generation
  • Orion: Document classification and organization
  • Niger: Data persistence

Data Flow

  1. Document reception and validation
  2. Text extraction and processing
  3. Summary generation
  4. Embedding generation
  5. Classification
  6. Storage and organization
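
The hand-offs in steps 3 to 6 can be sketched with the other agents injected as plain callables. This is an illustration only: how Scalaris actually reaches Bala, Orion, and Niger is not described here, and the coordinate function, the callable signatures, the choice to classify the summary rather than the full text, and the stored record layout are all assumptions.

```python
from typing import Callable


def coordinate(
    file_path: str,
    text: str,
    summarize: Callable[[str], str],
    embed: Callable[[str], str],      # stands in for Bala
    classify: Callable[[str], dict],  # stands in for Orion
    store: Callable[[dict], None],    # stands in for Niger
) -> dict:
    """Run steps 3 to 6 of the data flow on already extracted text."""
    summary = summarize(text)             # step 3: summary generation
    embedding_id = embed(text)            # step 4: embedding generation
    folder_structure = classify(summary)  # step 5: classification
    record = {
        "file_path": file_path,
        "summary": summary,
        "embedding_id": embedding_id,
        "folder_structure": folder_structure,
    }
    store(record)                         # step 6: storage and organization
    return record
```

Passing the agents in as arguments keeps the ordering logic testable with simple stand-in functions, without any running services.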

Error Handling

  • Document format validation
  • Processing errors
  • Integration failures
  • Storage issues
  • Comprehensive error logging
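
The failure categories above map naturally onto a small exception hierarchy with a single logging wrapper. The sketch below is one possible arrangement, not the agent's actual code: the exception class names, the safe_process helper, and the "error" field in the failure payload are hypothetical (the documented response only specifies "success" and "data").

```python
import logging

logger = logging.getLogger("scalaris")


class ScalarisError(Exception):
    """Base class for the failure categories listed above."""


class DocumentValidationError(ScalarisError):
    """The document format was rejected."""


class ProcessingError(ScalarisError):
    """Text extraction or summarization failed."""


class IntegrationError(ScalarisError):
    """A call to another agent failed."""


class StorageError(ScalarisError):
    """The document or its metadata could not be stored."""


def safe_process(process, request: dict) -> dict:
    """Run `process(request)` and turn failures into a logged error payload."""
    try:
        return {"success": True, "data": process(request)}
    except ScalarisError as exc:
        kind = type(exc).__name__
        logger.error("%s for %s: %s", kind, request.get("file_path"), exc)
        return {"success": False, "error": {"type": kind, "message": str(exc)}}
    except Exception:
        # Unknown failures are logged with a traceback but not exposed to callers.
        logger.exception("unexpected failure for %s", request.get("file_path"))
        return {
            "success": False,
            "error": {"type": "InternalError", "message": "unexpected failure"},
        }
```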

Performance Features

Document Processing

  • Efficient text extraction
  • Optimized summarization
  • Resource management
  • Processing queue handling

Storage Management

  • Organized file structure
  • Metadata tracking
  • Space optimization
  • Cleanup routines

Data Models

Document Metadata

{
    "file_path": str,
    "metadata": dict,
    "embedding_id": str,
    "tokens_info": {
        "prompt_tokens": int,
        "completion_tokens": int,
        "total_tokens": int
    },
    "folder_structure": {
        "folder_name": str,
        "category": str,
        "subcategory": str,
        "file_name": str,
        "state": str,
        "confidence": float
    }
}
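
For static checking, the metadata shape above could be written as TypedDict classes, which keep the plain-dict representation while naming each field. The class names and the make_tokens_info helper are hypothetical; the helper also assumes that total_tokens is the sum of the other two counts, which this page does not state.

```python
from typing import TypedDict


class TokensInfo(TypedDict):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


class FolderStructure(TypedDict):
    folder_name: str
    category: str
    subcategory: str
    file_name: str
    state: str
    confidence: float


class DocumentMetadata(TypedDict):
    file_path: str
    metadata: dict
    embedding_id: str
    tokens_info: TokensInfo
    folder_structure: FolderStructure


def make_tokens_info(prompt_tokens: int, completion_tokens: int) -> TokensInfo:
    # Assumption: the total is the sum of prompt and completion tokens.
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```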

Processing Result

{
    "success": bool,
    "extracted_text": str,
    "summary": dict,
    "file_path": str,
    "metadata": dict,
    "document_info": dict,
    "tokens_info": dict
}