evaluation

8 repos

langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

Featured

analytics, autogen, evaluation

TypeScript · 27.6K · 3532.8K · 2h ago

mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

Featured

agentops, agents, ai

Python

promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

Featured

ci, ci-cd, cicd

TypeScript

WeKnora

Tencent

LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using the RAG paradigm.

Featured

agent, agentic, ai

Go · 15.3K

coze-loop

coze-dev

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

Featured

agent, agent-evaluation, agent-observability

Go · 5.5K · 127614h ago

lmnr

lmnr-ai

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

agent-observability, agents, ai

TypeScript · 2.9K · 411983h ago

WFGY

onestardao

WFGY is heading toward WFGY 5.0 Polaris Protocol, a major open-source release for AI reasoning, RAG, agents, and real-world workflows. Includes Problem Map, Global Debug Card, WFGY 4.0, and the CFV Easter Egg.

ai-agents, alignment, debugging

Jupyter Notebook · 1.8K · 61623h ago

Observal

BlazeUp-AI

Observal is an observability and evaluation platform for human-in-the-loop agents.

agents, claude-code, cli-tool

Python · 1.3K · 1491592h ago