Pegaso - Web Crawler Agent
Overview
Pegaso is the Web Crawler agent in the TKM AI Agency Platform. It performs intelligent web scraping, analyzes the crawled content, and uses large language models to turn the results into structured insights.
Directory Structure
Backend/CRM/Pegaso/
├── data/ # Crawl data and analysis results
├── pegaso.py # Main agent implementation
├── tools.py # Crawling and analysis utilities
├── tools_schema.py # Data models and schemas
└── tools_definitions.py # Constants and definitions
Main Components
PegasoAgent Class
The core component that handles crawling operations:
- Web crawling management
- Content analysis
- LLM-powered insights
- Crawl data processing
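As a rough illustration of the responsibilities listed above, the class might be shaped like the following minimal sketch. The constructor arguments and method names are assumptions for illustration, not the actual pegaso.py interface.

# Illustrative skeleton only; attribute and method names are assumptions,
# not the real pegaso.py API.
class PegasoAgent:
    def __init__(self, llm_client, crawler, data_dir="Backend/CRM/Pegaso/data"):
        self.llm_client = llm_client   # LLM provider used for content analysis
        self.crawler = crawler         # crawling backend (e.g. a Firecrawl wrapper)
        self.data_dir = data_dir       # where crawl data and results are stored

    def process_crawl(self, url, depth=2, max_pages=5, crawl_type="scrape", metadata=None):
        """Run the full pipeline: prepare, crawl, analyze, persist."""
        params = self.prepare_crawl(url, depth, max_pages, crawl_type, metadata or {})
        pages = self.crawl_content(params)
        return self.analyze_content(pages, params)

    def prepare_crawl(self, url, depth, max_pages, crawl_type, metadata):
        """Validate the URL and refine crawl parameters."""
        ...

    def crawl_content(self, params):
        """Scrape pages and extract structured content."""
        ...

    def analyze_content(self, pages, params):
        """Run LLM-powered analysis over the crawled pages."""
        ...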
Processing Pipeline
- Crawl Preparation
  - URL validation
  - Parameter refinement
  - Depth configuration
  - Context setup
- Content Crawling
  - Web scraping
  - Content extraction
  - Data collection
  - Structure preservation
- Analysis Processing
  - Content analysis
  - Key findings extraction
  - Insight generation
  - Result summarization
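As a concrete illustration of the first stage above, URL validation and parameter refinement could look like this minimal sketch; the clamping limits and function name are assumptions, not the actual tools.py implementation.

# Hypothetical preparation step; the limits shown are assumptions.
from urllib.parse import urlparse

def prepare_crawl(url: str, depth: int = 2, max_pages: int = 5, crawl_type: str = "scrape") -> dict:
    """Validate the URL and refine crawl parameters before crawling."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url}")      # URL validation
    return {
        "url": url,
        "depth": max(1, min(depth, 5)),              # depth configuration
        "max_pages": max(1, min(max_pages, 50)),     # parameter refinement
        "crawl_type": crawl_type,
    }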
API Operations
Process Crawl
- Endpoint: /process_crawl
- Method: POST
- Purpose: Performs web crawling and content analysis
- Request Format:
{
  "url": "https://example.com",
  "depth": 2,
  "max_pages": 5,
  "crawl_type": "scrape",
  "metadata": {
    "user_id": "user_identifier",
    "organization_id": "org_identifier"
  }
}
- Response Format:
{
  "status": "success",
  "crawl_id": "crawl_identifier",
  "summary": {
    "key_findings": [],
    "content_analysis": {},
    "metadata": {
      "pages_crawled": 5,
      "timestamp": "2024-01-15T12:00:00Z"
    }
  }
}
Key Features
- Web Crawling
  - Intelligent scraping
  - Depth control
  - Content extraction
  - Structure preservation
- Content Analysis
  - LLM-powered analysis
  - Key findings extraction
  - Content summarization
  - Insight generation
- Data Management
  - Crawl history tracking
  - Result storage
  - Activity logging
  - Data organization
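One plausible shape for the result storage and activity logging described above is the sketch below; the file layout under data/ and the logger name are assumptions.

# Hedged sketch of result storage; the data/ layout shown here is an assumption.
import json
import logging
from datetime import datetime, timezone
from pathlib import Path

DATA_DIR = Path("Backend/CRM/Pegaso/data")
logger = logging.getLogger("pegaso")

def store_crawl_result(crawl_id: str, result: dict) -> Path:
    """Persist one crawl result as JSON and log the activity."""
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = DATA_DIR / f"{crawl_id}_{timestamp}.json"
    path.write_text(json.dumps(result, indent=2, default=str))
    logger.info("Stored crawl %s at %s", crawl_id, path)
    return path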
Integration
Platform Integration
- Interfaces with other CRM agents
- Event-based communication
- Data sharing
- Analysis coordination
External Services
- Firecrawl integration
- LLM providers (Groq, OpenAI)
- URL validation
- Content processing
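Provider selection for the LLM side could be handled with a small factory like the sketch below. It assumes the official Groq and OpenAI Python clients are used and that API keys come from environment variables; the function name and env-var handling are assumptions about this codebase.

# Provider selection sketch; the function name and env-var usage are assumptions.
import os

def get_llm_client(provider: str = "groq"):
    """Return a chat-completions client for the configured provider."""
    if provider == "groq":
        from groq import Groq                      # Groq's official Python SDK
        return Groq(api_key=os.environ["GROQ_API_KEY"])
    if provider == "openai":
        from openai import OpenAI                  # OpenAI's official Python SDK
        return OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    raise ValueError(f"Unknown LLM provider: {provider}")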
Error Handling
- Crawl Issues
  - URL validation
  - Connection errors
  - Rate limiting
  - Recovery procedures
- Analysis Errors
  - Content processing
  - Model errors
  - Fallback handling
  - Error reporting
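A minimal sketch of how crawl-side errors and rate limits might be retried before being reported; the retry counts, backoff values, and use of plain requests here are assumptions.

# Illustrative retry/recovery handling; the policy values are assumptions.
import time
import requests

def fetch_with_retry(url: str, retries: int = 3, backoff: float = 2.0) -> str:
    """Fetch a page, backing off on connection errors and HTTP 429 rate limits."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code == 429:              # rate limited by the target site
                time.sleep(backoff * attempt)
                continue
            resp.raise_for_status()                  # surface other HTTP errors
            return resp.text
        except requests.ConnectionError:
            if attempt == retries:
                raise                                # recovery exhausted, report error
            time.sleep(backoff * attempt)
    raise RuntimeError(f"{url} still rate limited after {retries} attempts")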
Performance Features
- Optimization
  - Crawl efficiency
  - Resource management
  - Rate limiting
  - Cache utilization
- Configuration
  - Crawl parameters
  - Depth settings
  - Model selection
  - Provider settings
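The configuration knobs above could be grouped into a small settings object such as the following sketch; the field names and defaults (including the model name) are illustrative assumptions.

# Hedged configuration sketch; defaults are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CrawlConfig:
    depth: int = 2                            # depth settings
    max_pages: int = 5                        # crawl parameters
    crawl_type: str = "scrape"
    rate_limit_per_sec: float = 1.0           # rate limiting
    cache_results: bool = True                # cache utilization
    llm_provider: str = "groq"                # provider settings
    llm_model: str = "llama-3.1-8b-instant"   # model selection (assumed default)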
Data Models
Crawl Parameters
{
  "url": str,
  "depth": int,
  "max_pages": int,
  "crawl_type": str,
  "metadata": dict
}
Crawl Results
{
  "crawl_id": str,
  "url": str,
  "content": dict,
  "analysis": dict,
  "timestamp": datetime,
  "metadata": dict
}
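If tools_schema.py defines these models with Pydantic (an assumption), the two schemas above might look roughly like this:

# Hedged sketch of the two schemas as Pydantic models; Pydantic usage and
# field defaults are assumptions about tools_schema.py.
from datetime import datetime
from pydantic import BaseModel, Field

class CrawlParameters(BaseModel):
    url: str
    depth: int = 2
    max_pages: int = 5
    crawl_type: str = "scrape"
    metadata: dict = Field(default_factory=dict)

class CrawlResults(BaseModel):
    crawl_id: str
    url: str
    content: dict
    analysis: dict
    timestamp: datetime
    metadata: dict = Field(default_factory=dict)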