DevOps Architecture
We operate a self-hosted Large Language Model (LLM) ecosystem, selecting models such as DeepSeek R3 and Mistral based on task-specific requirements. These models power:
Conversational interactions
Complex reasoning operations
Advanced classification tasks
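As a rough sketch, this task-specific selection can be thought of as a lookup table; the task labels and model identifiers below are illustrative assumptions, not our exact deployment names.

```python
# Illustrative mapping only: task labels and model identifiers are assumptions,
# not the exact checkpoints we deploy.
TASK_MODEL_MAP = {
    "chat": "mistral",            # conversational interactions
    "reasoning": "deepseek",      # complex reasoning operations
    "classification": "mistral",  # advanced classification tasks
}

def select_model(task: str) -> str:
    """Pick a self-hosted model for the given task type."""
    return TASK_MODEL_MAP.get(task, "mistral")
```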
Our infrastructure runs on RunPod’s scalable GPU resources, within a robust Docker environment. This setup allows us to:
Dynamically adjust GPU instance count in response to real-time demand.
Ensure optimal performance by efficiently allocating resources.
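As a rough illustration of that scaling loop, the sketch below sizes the pod count to the request backlog; the `GPUPoolClient` wrapper, its thresholds, and the per-pod capacity estimate are hypothetical stand-ins, not the RunPod API calls we run in production.

```python
import time

class GPUPoolClient:
    """Hypothetical stand-in for a RunPod API client; a real implementation
    would create and terminate GPU pods instead of tracking a counter."""

    def __init__(self) -> None:
        self.active = 1

    def count_active_pods(self) -> int:
        return self.active

    def pending_requests(self) -> int:
        # In production this figure would come from the request queue.
        return 42

    def start_pod(self) -> None:
        self.active += 1

    def stop_pod(self) -> None:
        self.active -= 1

MAX_PODS = 8
MIN_PODS = 1
REQUESTS_PER_POD = 20  # illustrative capacity estimate

def autoscale(pool: GPUPoolClient) -> None:
    """Adjust the number of GPU pods to match the current request backlog."""
    active = pool.count_active_pods()
    backlog = pool.pending_requests()
    desired = min(MAX_PODS, max(MIN_PODS, -(-backlog // REQUESTS_PER_POD)))

    if desired > active:
        for _ in range(desired - active):
            pool.start_pod()   # scale out under load
    elif desired < active:
        for _ in range(active - desired):
            pool.stop_pod()    # scale in when demand drops

if __name__ == "__main__":
    pool = GPUPoolClient()
    while True:
        autoscale(pool)
        time.sleep(30)  # re-evaluate demand every 30 seconds
```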
For tasks requiring extensive context processing (e.g., analyzing lengthy articles), we route requests to hosted LLMs (OpenAI's o3 and Anthropic's Claude). This path is used specifically for processing the large-scale data collected through our Anatomy of Luigi scraping system.
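The routing decision itself can be sketched as a simple threshold check; the token heuristic and the 32k cutoff below are illustrative assumptions rather than our exact configuration.

```python
SELF_HOSTED_CONTEXT_LIMIT = 32_000  # illustrative cutoff, not our exact figure

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def route_request(document: str) -> str:
    """Send very long inputs (e.g. full scraped articles) to a hosted model."""
    if estimate_tokens(document) <= SELF_HOSTED_CONTEXT_LIMIT:
        return "self-hosted"   # DeepSeek / Mistral pods on RunPod
    return "hosted-api"        # long-context requests go to a hosted provider
```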
We prioritize data privacy and security by running our own suite of embedding models rather than outsourcing user data to external providers like OpenAI. These models include:
Text embeddings
Contextual embeddings
Reranking models
Our primary embedding framework is built on models from Nomic AI, ensuring high-quality vector representations while maintaining full control over data processing.
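A minimal sketch of embedding a document and a query with a public Nomic checkpoint via sentence-transformers follows; the specific model version and task prefixes reflect the open nomic-embed-text releases and are assumptions about our deployment.

```python
from sentence_transformers import SentenceTransformer

# Public Nomic embedding model; this exact checkpoint is an assumption,
# not necessarily the one deployed in our pipeline.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# nomic-embed models expect task prefixes on the input text.
doc_vectors = model.encode(["search_document: TimescaleDB stores our embeddings."])
query_vector = model.encode(["search_query: where are embeddings stored?"])

print(doc_vectors.shape)  # e.g. (1, 768); vectors never leave our infrastructure
```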
To support scalable data retrieval and indexing, we operate a distributed TimescaleDB infrastructure across multiple geographical regions. This architecture ensures:
High availability and redundancy
Optimized performance for AI-driven data queries
Because TimescaleDB runs as an extension of PostgreSQL, our RAG pipeline supports:
Automated document embeddings generation
Advanced reranking model processing
In-database execution of complex LLM queries
This integrated approach significantly enhances data retrieval efficiency and query performance.
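As an illustrative sketch (assuming the pgvector extension is available on the PostgreSQL/TimescaleDB nodes, and using hypothetical table and column names), the retrieval step of such a pipeline can run entirely inside the database:

```python
import psycopg  # PostgreSQL driver; assumes pgvector is installed on the nodes

# Table and column names are hypothetical, shown only to illustrate the
# retrieval step of a Postgres/TimescaleDB-backed RAG pipeline.
RETRIEVE_SQL = """
    SELECT chunk_id, content
    FROM document_chunks
    ORDER BY embedding <=> %s::vector  -- cosine distance operator from pgvector
    LIMIT %s;
"""

def retrieve(conn: psycopg.Connection, query_embedding: list[float], k: int = 5):
    """Return the k stored chunks closest to the query embedding."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(RETRIEVE_SQL, (vector_literal, k))
        return cur.fetchall()
```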
We utilize a tiered data storage approach to balance speed, scalability, and privacy:
Redis Instances – Handle real-time processing of user interactions, ensuring ultra-fast response times.
MongoDB Atlas – Provides optimized long-term storage, supporting efficient indexing and retrieval.
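A minimal sketch of the two tiers follows; the connection details, key names, and TTL are placeholders rather than our production values.

```python
import json
import redis
from pymongo import MongoClient

# Connection details, key names, and TTLs below are illustrative placeholders.
cache = redis.Redis(host="localhost", port=6379)
archive = MongoClient("mongodb://localhost:27017")["chat"]["messages"]

def store_message(session_id: str, message: dict) -> None:
    """Hot path: keep the live conversation in Redis with a short TTL."""
    cache.setex(f"session:{session_id}", 3600, json.dumps(message))

def archive_message(session_id: str, message: dict) -> None:
    """Cold path: persist the message to MongoDB Atlas for long-term retrieval."""
    archive.insert_one({"session_id": session_id, **message})
```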
Privacy-First Approach
User data is stored as heuristic fingerprints: you exist as an encrypted ID within our system.
Conversations remain private unless explicitly shared via our secure link-sharing feature.
This privacy-first architecture ensures robust data protection while maintaining a seamless user experience.
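One way such a fingerprint can be derived is with a keyed hash of the user identifier; the HMAC construction and key handling below are an illustrative assumption, not a description of our exact scheme.

```python
import hashlib
import hmac
import os

# In production the key would come from a secrets manager; the fallback value
# here exists only so the sketch runs.
FINGERPRINT_KEY = os.environ.get("FINGERPRINT_KEY", "replace-me").encode()

def fingerprint(user_identifier: str) -> str:
    """Derive an opaque ID so raw user identifiers never need to be stored."""
    return hmac.new(FINGERPRINT_KEY, user_identifier.encode(), hashlib.sha256).hexdigest()

print(fingerprint("user@example.com"))  # only this opaque value is persisted
```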
Our AI agents are continuously monitored through a self-hosted Elasticsearch cluster, enabling comprehensive real-time analytics on:
System health indicators
Token processing efficiency
Term frequency and usage patterns
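As an illustration, an agent event can be pushed into the cluster with the official Python client roughly as follows; the cluster address, index name, and document fields are hypothetical.

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

# Cluster address, index name, and field names are illustrative placeholders.
es = Elasticsearch("http://localhost:9200")

es.index(
    index="agent-telemetry",
    document={
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": "chat-worker-1",
        "prompt_tokens": 512,
        "completion_tokens": 128,
        "latency_ms": 840,
    },
)
```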
By maintaining detailed agent telemetry, we can:
Identify performance bottlenecks early
Optimize model efficiency and responsiveness
Continuously fine-tune our self-hosted models for long-term improvement
This observability-driven infrastructure ensures our AI ecosystem remains scalable, efficient, and highly reliable.