agent-builder-evals

v0.1.0 · suspicious · risk score 4.0 (Medium Risk)

CLI benchmark suite for evaluating AI agents across providers.

🤖 AI Analysis

Final verdict: SUSPICIOUS

The package receives a moderate risk score because it has no maintainer history and lacks critical metadata (no author name, author email, repository link, or PyPI classifiers), which leaves its origin and intent unverifiable.

  • Lack of maintainer history
  • Missing critical metadata

Per-check LLM notes
  • Obfuscation: No obfuscation patterns detected, indicating low risk.
  • Credentials: No credential harvesting patterns detected, indicating low risk.
  • Metadata: The package shows several red flags including lack of maintainer history and missing critical metadata, suggesting potential low effort or malicious intent.

🔬 Heuristic Checks

Outbound Network Calls

No suspicious network call patterns found

Code Obfuscation

No obfuscation patterns detected

Shell / Subprocess Execution score 6.0

Found 3 shell execution pattern(s)

  • None: try: return subprocess.check_output( ["git", "rev-parse", "HEAD"], stder
  • g="utf-8") proc = subprocess.run( ["python", "-m", "pytest", "-q", str(work /
  • g="utf-8") proc = subprocess.run( ["python", str(path)], cwd=
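The three flagged snippets are truncated, but the visible text passes an argument list to `subprocess` and does not show `shell=True`. A minimal Python sketch of that pattern follows. The function names `git_head` and `run_script`, the `stderr=subprocess.DEVNULL` argument, the timeout, and the use of `sys.executable` in place of the bare `"python"` seen in the snippets are assumptions for illustration, not code taken from the package.

```python
import subprocess
import sys


def git_head(cwd="."):
    """Return the current commit hash, or None if git is unavailable or cwd is not a repo."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],  # argument list: no shell parsing involved
            cwd=cwd,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    return out.decode("utf-8").strip()


def run_script(path, cwd=None, timeout=60):
    """Run a Python file in a child interpreter and capture its output as text."""
    proc = subprocess.run(
        [sys.executable, str(path)],  # same interpreter as the caller
        cwd=cwd,
        capture_output=True,
        encoding="utf-8",
        timeout=timeout,  # hypothetical limit so a hung script cannot block forever
    )
    return proc.returncode, proc.stdout, proc.stderr
```

Calls of this shape are common in benchmark harnesses that execute generated code, so the score here reflects what the package is able to run rather than evidence of misuse. Reviewing what ends up in `path` matters more than the `subprocess` call itself.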

Credential Harvesting

No credential harvesting patterns detected

Typosquatting

No typosquatting candidates detected

Registered Email Domain

No author email provided

Suspicious Page Links

All external links appear legitimate

Git Repository History

No GitHub repository linked

  • No GitHub repository link found

Maintainer History score 8.0

4 maintainer concern(s) found

  • Only one version has ever been released — brand new package
  • Author name is missing or very short
  • Author "" appears to have only 1 package on PyPI (new or inactive account)
  • Package has no PyPI classifiers (low effort / metadata quality)
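Several of the concerns above can be checked from package metadata alone. The sketch below is one possible way to do that in Python, assuming the input dict follows the layout of the PyPI JSON API (`info` and `releases` keys). The function name `maintainer_concerns` and the three-character threshold for a "very short" author name are hypothetical, and this is not the scanner's actual implementation. The "only 1 package on PyPI" check needs a per-author lookup and is left out.

```python
def maintainer_concerns(meta):
    """Return a list of metadata concerns for one package.

    `meta` is assumed to look like a PyPI JSON API response:
    {"info": {...}, "releases": {version: [files, ...]}}.
    """
    info = meta.get("info", {})
    concerns = []

    # A single entry in "releases" means the package has only one version.
    if len(meta.get("releases", {})) <= 1:
        concerns.append("single release")

    # Threshold of 3 characters is an assumption for illustration.
    author = (info.get("author") or "").strip()
    if len(author) < 3:
        concerns.append("author name missing or very short")

    if not (info.get("author_email") or "").strip():
        concerns.append("no author email")

    if not info.get("classifiers"):
        concerns.append("no classifiers")

    return concerns


# Hand-written sample that mirrors the findings in this report.
sample = {
    "info": {"author": "", "author_email": None, "classifiers": []},
    "releases": {"0.1.0": []},
}
print(maintainer_concerns(sample))
```

None of these signals is proof of malicious intent. A first release from a new author will trip all of them, which is why the report pairs them with the obfuscation and credential checks.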

Known CVE Vulnerabilities

No known vulnerabilities found in OSV database.

💡 AI App Starter Prompt

Use this prompt to build a project with agent-builder-evals:
Create a command-line tool named 'AgentBench' that leverages the 'agent-builder-evals' package to benchmark and evaluate various AI agents from different providers. This tool should allow users to easily compare the performance of these agents on specific tasks, such as natural language processing, decision-making, and problem-solving. The application should include the following key features:

1. **Agent Registration**: Users should be able to register new AI agents by specifying their provider (e.g., Anthropic, Google, Microsoft), API endpoint, and any necessary authentication details.
2. **Task Configuration**: Define a set of tasks that each agent will perform. These tasks could range from simple Q&A sessions to more complex scenarios like ethical dilemmas or strategic games.
3. **Evaluation Metrics**: Implement a variety of metrics to assess the performance of the agents, such as response time, accuracy, coherence, and creativity. Each metric should be customizable based on user needs.
4. **Benchmarking Suite**: Utilize the 'agent-builder-evals' package to run the defined tasks against all registered agents, collecting data on their performance according to the specified metrics.
5. **Reporting and Visualization**: After running the benchmarks, generate comprehensive reports detailing the performance of each agent across the different tasks. Include visualizations like graphs and charts to make the data easier to understand.
6. **User Interface**: Although the tool is primarily a CLI, add basic help commands and a simple text-based menu that guides users through registering agents, configuring tasks, and viewing results.
7. **Customization Options**: Allow advanced users to customize the evaluation criteria and task definitions further, ensuring the tool remains flexible and adaptable to various use cases.
8. **Security Measures**: Ensure sensitive information, such as API keys, is handled securely, possibly using environment variables or encrypted storage.

The goal of this project is to provide a robust, user-friendly tool for anyone interested in comparing the capabilities of different AI agents, aiding in both academic research and practical applications.
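As a starting point for the prompt above, here is a minimal Python sketch of agent registration, environment-variable key handling, and a timing metric. This report does not show the API of agent-builder-evals, so the sketch does not import it. `Agent`, `benchmark`, and the `call` parameter are hypothetical names, and `call` stands in for whatever function actually sends a task to a provider.

```python
import os
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Agent:
    name: str
    provider: str
    endpoint: str
    api_key_env: str  # name of the environment variable, never the key itself

    def api_key(self) -> str:
        """Read the key at call time so it is never stored in config or reports."""
        key = os.environ.get(self.api_key_env)
        if not key:
            raise RuntimeError(f"set {self.api_key_env} before running benchmarks")
        return key


def benchmark(
    agents: List[Agent],
    tasks: Dict[str, str],  # task prompt -> expected answer
    call: Callable[[Agent, str], str],  # hypothetical provider adapter
) -> List[dict]:
    """Run every task against every agent, recording correctness and latency."""
    rows = []
    for agent in agents:
        for task, expected in tasks.items():
            start = time.perf_counter()
            answer = call(agent, task)
            elapsed = time.perf_counter() - start
            rows.append({
                "agent": agent.name,
                "task": task,
                "correct": answer.strip() == expected,  # exact match only
                "seconds": elapsed,
            })
    return rows
```

Exact-match accuracy and wall-clock time are the simplest of the metrics the prompt lists. Coherence and creativity would need a separate scoring function, which could be passed in the same way as `call`.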