agent-belt

v0.1.0 | suspicious | risk score 8.0 (High Risk)

Evaluation harness for real headless CLI agents - reproducible multi-turn scenarios, rule + LLM scoring, cross-agent comparison

🤖 AI Analysis

Final verdict: SUSPICIOUS

The package shows high risk in two areas, credential harvesting and metadata, that together suggest it could be part of a supply-chain attack. Although there is no direct evidence of malicious content, the combination of factors warrants caution.

  • High credential risk
  • Suspicious metadata indicators
Per-check LLM notes
  • Obfuscation: No obfuscation patterns detected in the package.
  • Credentials: The observed pattern suggests potential unauthorized access to system files, indicative of credential harvesting.
  • Metadata: The package may be suspicious because it is new, its maintainer has no history, and its git activity is minimal; together these suggest it might be a test run for malicious intent.

🔬 Heuristic Checks

Outbound Network Calls score 3.0

Found 2 network call pattern(s)

  • (base: str) -> None: with httpx.Client(base_url=base, timeout=5.0) as client: health = clie
  • try: r = httpx.get(f"http://127.0.0.1:{port}/api/health", timeout=1.0)
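
The second snippet polls `http://127.0.0.1:{port}/api/health` with a one-second timeout, which is the usual shape of a readiness check for a server the tool has just started on the local machine. A minimal sketch of that pattern follows. It uses the standard library's `urllib` instead of `httpx`, and the name `wait_for_health` and its retry parameters are illustrative assumptions, not agent-belt code.

```python
import time
import urllib.request


def wait_for_health(port: int, path: str = "/api/health",
                    timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Poll a loopback health endpoint until it returns HTTP 200 or time runs out."""
    url = f"http://127.0.0.1:{port}{path}"
    # Ignore proxy settings from the environment: the target is always local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=1.0) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError (e.g. connection refused) and socket timeouts are OSError
            # subclasses; the server is probably still starting, so try again.
            pass
        time.sleep(interval)
    return False
```

Because the probe only ever contacts 127.0.0.1, it makes no outbound connection beyond the local machine.
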
Code Obfuscation

No obfuscation patterns detected

Shell / Subprocess Execution score 10.0

Found 5 shell execution pattern(s)

  • ort = _free_port() proc = subprocess.Popen( [ sys.executable, "-m",
  • _env} try: return subprocess.run( [git, *args], cwd=str(cwd) if cwd i
  • ) process. proc = subprocess.Popen( # noqa: S603 - cmd is a fixed argv list owned by the adapt
  • try: result = subprocess.run([bin_path, "--version"], capture_output=True, text=True, tim
  • try: result = subprocess.run( [bin_path, "--version"], ca
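
The last two snippets run a binary with `--version` through `subprocess.run` using an argument list, and the third carries a `# noqa: S603` note stating that its command is a fixed argv list. Passing a list without `shell=True` means no string is interpreted by a shell. Below is a minimal sketch of such a version probe; `probe_version` is a hypothetical helper, and because the original call is truncated, the timeout value is an assumption.

```python
import shutil
import subprocess
from typing import Optional


def probe_version(binary: str, timeout: float = 10.0) -> Optional[str]:
    """Return the first line of `<binary> --version`, or None if it cannot be run."""
    path = shutil.which(binary)
    if path is None:
        return None  # not installed or not on PATH
    try:
        # Fixed argv list, no shell=True: nothing is passed through a shell.
        result = subprocess.run([path, "--version"], capture_output=True,
                                text=True, timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # Some tools print their version to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None
```
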
Credential Harvesting score 2.5

Found 1 credential access pattern(s)

  • ce-in-depth: ``--bundled ../../etc/passwd`` would otherwise # escape the bundled root. ``resolve(
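
The matched text appears to be a code comment rather than a file read: it describes a defence-in-depth guard that keeps an argument such as ``--bundled ../../etc/passwd`` from escaping the bundled root, and it mentions ``resolve(``. The following is a minimal sketch of that kind of containment check with `pathlib`. It assumes Python 3.9+ for `Path.is_relative_to`, and `resolve_within` is a hypothetical name, not the package's function.

```python
from pathlib import Path


def resolve_within(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`, refusing any result that escapes `root`."""
    base = root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check,
    # so "../../etc/passwd" or an absolute "/etc/passwd" lands outside `base`.
    candidate = (base / relative).resolve()
    if not candidate.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{relative!r} escapes {base}")
    return candidate
```
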
Typosquatting

No typosquatting candidates detected

Registered Email Domain

Email domain looks legitimate: jfrog.com

Suspicious Page Links

All external links appear legitimate

Git Repository History score 5.0

Git history flags: Very few commits: 2 total

  • Very few commits: 2 total
  • Single contributor with only 2 commit(s) — possibly throwaway account
Maintainer History score 6.0

3 maintainer concern(s) found

  • Only one version has ever been released — brand new package
  • Author name is missing or very short
  • Author "" appears to have only 1 package on PyPI (new or inactive account)
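
These concerns can be checked by hand against PyPI's JSON API (`https://pypi.org/pypi/<project>/json`), whose `info` object carries the author fields and whose `releases` object maps each version to its uploaded files. The sketch below uses hypothetical helper names and reports only raw values, since the scanner's own threshold for a "very short" author name is not stated; `fetch_pypi_metadata` needs network access, while `maintainer_summary` works offline on the parsed JSON.

```python
import json
import urllib.request


def fetch_pypi_metadata(project: str) -> dict:
    """Download project metadata from PyPI's JSON API (network access required)."""
    url = f"https://pypi.org/pypi/{project}/json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def maintainer_summary(metadata: dict) -> dict:
    """Extract the raw fields behind the maintainer-history concerns."""
    releases = metadata.get("releases", {})
    # Count only versions that have at least one uploaded file.
    published = [version for version, files in releases.items() if files]
    # `author` may be null or an empty string in PyPI metadata.
    author = (metadata.get("info", {}).get("author") or "").strip()
    return {"release_count": len(published), "author": author,
            "author_missing": not author}
```
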
Known CVE Vulnerabilities

No known vulnerabilities found in OSV database.
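
The same lookup can be reproduced with OSV's query endpoint (`POST https://api.osv.dev/v1/query`), which takes a package name, ecosystem, and version and returns a `vulns` list when anything matches. The following sketch uses hypothetical helper names; `query_osv` needs network access.

```python
import json
import urllib.request


def osv_query_body(name: str, version: str) -> bytes:
    """Build the JSON body for OSV's /v1/query endpoint for a PyPI package."""
    payload = {"package": {"name": name, "ecosystem": "PyPI"}, "version": version}
    return json.dumps(payload).encode("utf-8")


def query_osv(name: str, version: str) -> list:
    """Return the known vulnerabilities for one package version (empty if none)."""
    req = urllib.request.Request(
        "https://api.osv.dev/v1/query",
        data=osv_query_body(name, version),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        # OSV returns an empty object when nothing matches, so default to [].
        return json.load(resp).get("vulns", [])
```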

💡 AI App Starter Prompt

Use this prompt to build a project with agent-belt:
Create a mini-application named 'AgentBench' that uses the 'agent-belt' Python package to evaluate and compare command-line interface (CLI) agents on how well they solve complex tasks. The application will serve as a benchmarking tool for developers and researchers who want to assess the capabilities of different CLI agents across multiple scenarios.

### Project Goals:
- Develop a series of multi-turn scenarios that simulate real-world problems.
- Use 'agent-belt' to execute these scenarios with different CLI agents.
- Implement both rule-based and large language model (LLM) scoring mechanisms to evaluate agent responses.
- Provide a comparative analysis of the agents based on the scores obtained.
- Ensure that all evaluations are reproducible and consistent.
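
One way to make runs reproducible is to treat each scenario as immutable data and store a hash of its full configuration next to every result. The sketch below is a hypothetical data model for the goals above (`Turn`, `Scenario`, and `fingerprint` are illustrative names), not agent-belt's own scenario format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class Turn:
    prompt: str                             # text sent to the agent on this turn
    expect_contains: Tuple[str, ...] = ()   # substrings a rule-based check looks for


@dataclass(frozen=True)
class Scenario:
    name: str
    turns: Tuple[Turn, ...]                 # ordered; order matters in multi-turn runs
    seed: int = 0                           # recorded so reruns can reuse the same setting

    def fingerprint(self) -> str:
        """Stable SHA-256 of the full configuration, stored alongside each result."""
        # sort_keys plus fixed separators give the same JSON text for equal configs.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two scenarios with identical turns produce the same fingerprint, so a result log can be matched to the exact configuration that produced it.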

### Key Features:
1. **Scenario Creation**: Allow users to define custom scenarios with multiple turns, where each turn represents a step in problem-solving.
2. **Agent Integration**: Support integration with various CLI agents through a standardized interface provided by 'agent-belt'.
3. **Scoring Mechanisms**:
   - Rule-Based Scoring: Define rules for what constitutes a correct or optimal response.
   - LLM Scoring: Use an LLM to assess the quality of agent responses for a more nuanced evaluation.
4. **Reproducibility**: Ensure that all evaluations can be reproduced by saving scenario configurations and execution logs.
5. **Comparative Analysis**: Display side-by-side comparisons of different agents' performances across scenarios.
6. **User Interface**: Develop a simple yet effective web UI using Flask or a similar framework to allow users to input scenarios, select agents, and view results.
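
Rule-based scoring and the side-by-side comparison can be prototyped without any framework. The following sketch is hypothetical (`Rule`, `matches`, `rule_score`, and `compare` are not agent-belt APIs) and leaves LLM scoring out entirely.

```python
import re
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[str], bool]  # receives the agent's final output text
    weight: float = 1.0


def matches(pattern: str) -> Callable[[str], bool]:
    """Build a check that passes when the regex `pattern` occurs anywhere in the output."""
    compiled = re.compile(pattern)
    return lambda output: compiled.search(output) is not None


def rule_score(output: str, rules: List[Rule]) -> float:
    """Weighted fraction of rules the output satisfies, between 0.0 and 1.0."""
    total = sum(rule.weight for rule in rules)
    if total == 0:
        return 0.0
    passed = sum(rule.weight for rule in rules if rule.check(output))
    return passed / total


def compare(scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Render {agent: {scenario: score}} as plain-text rows with a mean column."""
    scenarios = sorted({name for per_agent in scores.values() for name in per_agent})
    rows = [" | ".join(["agent", *scenarios, "mean"])]
    for agent in sorted(scores):
        # A scenario the agent never ran counts as 0.0 in this sketch.
        values = [scores[agent].get(name, 0.0) for name in scenarios]
        avg = mean(values) if values else 0.0
        rows.append(" | ".join([agent, *(f"{v:.2f}" for v in values), f"{avg:.2f}"]))
    return rows
```

A weighted fraction keeps every score in the same 0 to 1 range, so scenarios with different numbers of rules stay comparable in the table.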

### Steps to Build the Application:
1. **Setup Environment**: Install necessary packages including 'agent-belt', Flask, and any other dependencies.
2. **Define Scenarios**: Create a few example scenarios covering diverse use cases such as data processing, system administration, and natural language understanding.
3. **Integrate Agents**: Connect your application with at least two different CLI agents.
4. **Implement Scoring Systems**: Develop both rule-based and LLM-based scoring systems using the scoring features of 'agent-belt'.
5. **Develop Web Interface**: Use Flask to create a user-friendly web interface allowing users to interact with the application.
6. **Test and Refine**: Test the application thoroughly and confirm that it meets the requirements above and runs reliably.
7. **Documentation**: Write comprehensive documentation explaining how to use 'AgentBench', how to add new scenarios and agents, and how to interpret the results.
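
For step 3, a thin adapter that invokes an agent's CLI once per turn is enough to start. This sketch assumes the agent reads its prompt from stdin and prints its answer to stdout; `CliAgent` and `run_scenario` are hypothetical names, and agent-belt's own adapter interface may look quite different.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CliAgent:
    """Hypothetical adapter: one headless CLI invocation per scenario turn."""
    name: str
    argv: Sequence[str]     # fixed command: the agent binary plus its non-interactive flags
    timeout: float = 300.0  # seconds allowed per turn (assumed default)

    def run_turn(self, prompt: str) -> str:
        # The prompt is passed on stdin, so it is never interpreted by a shell.
        result = subprocess.run(list(self.argv), input=prompt, capture_output=True,
                                text=True, timeout=self.timeout, check=True)
        return result.stdout


def run_scenario(agent: CliAgent, prompts: List[str]) -> List[str]:
    """Send each turn in order and collect the agent's outputs as a transcript."""
    return [agent.run_turn(prompt) for prompt in prompts]
```

Each turn here starts a fresh process, so no context carries over between turns; a real multi-turn run would need whatever session or resume mechanism the specific agent CLI provides.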

By following these steps and utilizing the 'agent-belt' package effectively, you'll create a valuable tool for evaluating and comparing CLI agents in a structured and reproducible manner.