Pegaso - Web Crawler Agent

Overview

Pegaso is the Web Crawler agent in the TKM AI Agency Platform. It performs intelligent web scraping, analyzes the crawled content, and produces structured insights using large language models for content understanding.

Directory Structure

Backend/CRM/Pegaso/
├── data/                 # Crawl data and analysis results
├── pegaso.py             # Main agent implementation
├── tools.py              # Crawling and analysis utilities
├── tools_schema.py       # Data models and schemas
└── tools_definitions.py  # Constants and definitions

Main Components

PegasoAgent Class

The core component that handles crawling operations:

  • Web crawling management
  • Content analysis
  • LLM-powered insights
  • Crawl data processing

Processing Pipeline

  1. Crawl Preparation

    • URL validation
    • Parameter refinement
    • Depth configuration
    • Context setup
  2. Content Crawling

    • Web scraping
    • Content extraction
    • Data collection
    • Structure preservation
  3. Analysis Processing

    • Content analysis
    • Key findings extraction
    • Insight generation
    • Result summarization
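
The following is a minimal sketch of how these three stages could be chained inside the agent. The method and field names (prepare_crawl, crawl_content, analyze_content, CrawlRequest) are illustrative assumptions, not the actual PegasoAgent API.

# Illustrative pipeline sketch -- names are assumptions, not the real PegasoAgent API.
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlRequest:
    url: str
    depth: int = 2
    max_pages: int = 5
    crawl_type: str = "scrape"
    metadata: dict = field(default_factory=dict)


class PegasoAgent:
    def prepare_crawl(self, request: CrawlRequest) -> CrawlRequest:
        # 1. Crawl Preparation: validate the URL and clamp crawl parameters.
        parsed = urlparse(request.url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Invalid URL: {request.url}")
        request.depth = max(1, request.depth)
        request.max_pages = max(1, request.max_pages)
        return request

    def crawl_content(self, request: CrawlRequest) -> list[dict]:
        # 2. Content Crawling: scrape pages and preserve structure (placeholder).
        raise NotImplementedError("delegates to the crawling backend, e.g. Firecrawl")

    def analyze_content(self, pages: list[dict]) -> dict:
        # 3. Analysis Processing: run LLM analysis and summarize findings (placeholder).
        raise NotImplementedError("delegates to the configured LLM provider")

    def process_crawl(self, request: CrawlRequest) -> dict:
        request = self.prepare_crawl(request)
        pages = self.crawl_content(request)
        return self.analyze_content(pages)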

API Operations

Process Crawl

  • Endpoint: /process_crawl
  • Method: POST
  • Purpose: Performs web crawling and content analysis
  • Request Format:
    {
      "url": "https://example.com",
      "depth": 2,
      "max_pages": 5,
      "crawl_type": "scrape",
      "metadata": {
        "user_id": "user_identifier",
        "organization_id": "org_identifier"
      }
    }
  • Response Format:
    {
      "status": "success",
      "crawl_id": "crawl_identifier",
      "summary": {
        "key_findings": [],
        "content_analysis": {},
        "metadata": {
          "pages_crawled": 5,
          "timestamp": "2024-01-15T12:00:00Z"
        }
      }
    }
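
As a usage illustration, a call to this endpoint could look like the snippet below. The base URL is a placeholder; the actual host and port depend on how the agent is deployed.

import requests

# Placeholder base URL; replace with the deployed agent's address.
PEGASO_BASE_URL = "http://localhost:8000"

payload = {
    "url": "https://example.com",
    "depth": 2,
    "max_pages": 5,
    "crawl_type": "scrape",
    "metadata": {
        "user_id": "user_identifier",
        "organization_id": "org_identifier",
    },
}

response = requests.post(f"{PEGASO_BASE_URL}/process_crawl", json=payload, timeout=120)
response.raise_for_status()

result = response.json()
print(result["status"], result["crawl_id"])
for finding in result["summary"]["key_findings"]:
    print("-", finding)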

Key Features

  1. Web Crawling

    • Intelligent scraping
    • Depth control
    • Content extraction
    • Structure preservation
  2. Content Analysis

    • LLM-powered analysis
    • Key findings extraction
    • Content summarization
    • Insight generation
  3. Data Management

    • Crawl history tracking
    • Result storage
    • Activity logging
    • Data organization

Integration

Platform Integration

  • Interfaces with other CRM agents
  • Event-based communication
  • Data sharing
  • Analysis coordination

External Services

  • Firecrawl integration
  • LLM providers (Groq, OpenAI)
  • URL validation
  • Content processing
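
A minimal sketch of how the Firecrawl integration could be wired up, assuming the firecrawl-py SDK. Parameter names, supported options, and return shapes vary between SDK versions, so treat this as an illustration rather than the agent's actual tools.py code; the environment variable name is also an assumption.

import os

from firecrawl import FirecrawlApp  # assumes the firecrawl-py SDK is installed

# API key read from the environment; variable name is an assumption.
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Single-page scrape; the returned structure differs between SDK versions.
page = app.scrape_url("https://example.com")

# Multi-page crawl; depth and page-limit options depend on the SDK version.
crawl = app.crawl_url("https://example.com")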

Error Handling

  1. Crawl Issues

    • URL validation
    • Connection errors
    • Rate limiting
    • Recovery procedures
  2. Analysis Errors

    • Content processing
    • Model errors
    • Fallback handling
    • Error reporting
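
A hedged sketch of how rate limiting and analysis failures could be handled with retries and a fallback provider. The function names and retry policy below are illustrative, not the agent's actual error-handling code.

import time


class RateLimitError(Exception):
    """Raised when the crawling backend reports a rate limit."""


def crawl_with_retry(crawl_fn, url: str, max_retries: int = 3, backoff_s: float = 2.0):
    # Retry transient crawl failures (rate limits, connection errors) with exponential backoff.
    for attempt in range(max_retries):
        try:
            return crawl_fn(url)
        except (RateLimitError, ConnectionError):
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff_s * (2 ** attempt))


def analyze_with_fallback(primary_analyze, fallback_analyze, content: dict) -> dict:
    # If the primary model call fails, record the error and fall back to a secondary provider.
    try:
        return primary_analyze(content)
    except Exception as exc:  # illustrative catch-all
        return {**fallback_analyze(content), "warnings": [f"primary analysis failed: {exc}"]}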

Performance Features

  1. Optimization

    • Crawl efficiency
    • Resource management
    • Rate limiting
    • Cache utilization
  2. Configuration

    • Crawl parameters
    • Depth settings
    • Model selection
    • Provider settings
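
The tunable parameters above could be grouped into a single configuration object along these lines; the field names and defaults are assumptions for illustration only.

from dataclasses import dataclass


@dataclass
class PegasoConfig:
    # Crawl parameters
    default_depth: int = 2
    max_pages: int = 5
    requests_per_minute: int = 30   # simple rate-limit budget
    cache_results: bool = True
    # Model / provider settings
    llm_provider: str = "groq"      # or "openai"
    llm_model: str = "model-name"   # placeholder model identifier for the chosen provider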

Data Models

Crawl Parameters

{
    "url": str,
    "depth": int,
    "max_pages": int,
    "crawl_type": str,
    "metadata": dict
}

Crawl Results

{
    "crawl_id": str,
    "url": str,
    "content": dict,
    "analysis": dict,
    "timestamp": datetime,
    "metadata": dict
}
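
A minimal sketch of how these models could be expressed as typed schemas, assuming pydantic; tools_schema.py may define them differently.

from datetime import datetime

from pydantic import BaseModel, Field


class CrawlParameters(BaseModel):
    url: str
    depth: int = 2
    max_pages: int = 5
    crawl_type: str = "scrape"
    metadata: dict = Field(default_factory=dict)


class CrawlResults(BaseModel):
    crawl_id: str
    url: str
    content: dict
    analysis: dict
    timestamp: datetime
    metadata: dict = Field(default_factory=dict)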