agent-evaluator

v0.9.4 suspicious
6.0
Medium Risk

Production-ready evaluation framework for AI agents — 58 metrics (25 native + 33 Harness Config) across 7 evaluation gates: goal achievement, behavioral integrity, reliability, performance, security, multi-agent coordination, and observability

🤖 AI Analysis

Final verdict: SUSPICIOUS

The package exhibits multiple high-risk behaviors including shell execution and obfuscation, indicating potential malicious intent. While the network and credential risks are moderate, the incomplete metadata further raises suspicion.

  • High shell risk
  • Significant obfuscation risk
  • Incomplete author metadata

Per-check LLM notes
  • Network: Network calls indicate potential external communication which could be legitimate but should be reviewed to ensure it aligns with the package's intended functionality.
  • Shell: Shell execution is high risk as it can be used for unauthorized actions. This should be carefully examined to confirm its necessity and legitimacy within the package.
  • Obfuscation: The obfuscation pattern indicates potential code tampering to evade detection.
  • Credentials: The regex patterns and environment variable usage suggest possible attempts to harvest credentials or access sensitive files.
  • Metadata: The author's information is incomplete, suggesting a potentially less reputable source.

🔬 Heuristic Checks

Outbound Network Calls score 9.0

Found 6 network call pattern(s)

  • encode("utf-8") req = urllib.request.Request( self.webhook_url, data=data
  • POST", ) with urllib.request.urlopen(req, timeout=10): pass class WebhookHa
  • e(self.headers) req = urllib.request.Request( self.url, data=data,
  • ethod, ) with urllib.request.urlopen(req, timeout=10): pass class EmailHand
  • try: with urllib.request.urlopen(f"{base_url}/v1/projects", timeout=3) as resp:
  • }).encode() req = urllib.request.Request( gql_endpoint, data=
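
The review that the network note asks for can start by listing every call site with its line number rather than relying on truncated excerpts. The following is a minimal sketch, assuming you have the package source as text; `find_call_sites`, `dotted_name`, `TARGETS`, and the sample snippet are hypothetical names written for this illustration and are not part of the scanner or of agent-evaluator.

```python
import ast

# Fully qualified callables to look for (taken from the excerpts above).
TARGETS = {"urllib.request.urlopen", "urllib.request.Request"}


def dotted_name(node):
    """Rebuild 'a.b.c' from nested Attribute nodes; return None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_call_sites(source):
    """Return sorted (line number, callable name) pairs for every targeted call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in TARGETS:
                hits.append((node.lineno, name))
    return sorted(hits)


# Hypothetical sample; the URL is a placeholder, not one used by the package.
SAMPLE = '''import urllib.request
req = urllib.request.Request("https://example.invalid/hook", data=b"{}", method="POST")
with urllib.request.urlopen(req, timeout=10):
    pass
'''

for lineno, name in find_call_sites(SAMPLE):
    print(lineno, name)
```

This only finds calls written with the full `urllib.request.` prefix. Aliased imports such as `from urllib.request import urlopen` would need extra handling.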

Code Obfuscation score 10.0

Found 6 obfuscation pattern(s)

  • onitor @property def eval(self) -> Any: """Internal :class:`EvalDecorator` instance."""
  • -------------- # Direct call: @eval(task_type="qa") form # -----------------------------------
  • y) -> Callable: """Directly creates a decorator in the form ``@eval(task_type=...)``. Harness defaul
  • Usage:: @eval(task_type="qa", score_fn=my_fn) def agent(questi
  • passwd': 'critical', 'eval(': 'high', 'exec(': 'high', } param_str = s
  • = QuickEval("results/") @eval(task_type="reasoning", framework="dspy") def my_cot(ques
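
Every excerpt above contains the text `eval(`, either as a decorator named `eval` or inside a string literal. The scanner's actual rule is not shown, so the following is only a sketch under the assumption that it matches text: it compares a naive regex count with an `ast`-based pass that separates decorator uses from ordinary calls and ignores strings. `naive_hits`, `classify_eval_calls`, and the sample are hypothetical names for this illustration.

```python
import ast
import re

# Assumed text-level rule: any occurrence of "eval(" counts as a hit.
NAIVE_PATTERN = re.compile(r"\beval\(")


def naive_hits(source):
    """Count raw textual matches, including those in strings and decorators."""
    return len(NAIVE_PATTERN.findall(source))


def classify_eval_calls(source):
    """Count calls to a name `eval`, split into decorator uses and other calls."""
    tree = ast.parse(source)
    decorator_ids = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            decorator_ids.update(id(dec) for dec in node.decorator_list)

    counts = {"decorator": 0, "call": 0}
    for node in ast.walk(tree):
        is_eval_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "eval"
        )
        if is_eval_call:
            counts["decorator" if id(node) in decorator_ids else "call"] += 1
    return counts


# Hypothetical sample mixing the three situations.
SAMPLE = '''
@eval(task_type="qa")
def agent(question):
    return question

SEVERITY = {"eval(": "high", "exec(": "high"}

result = eval("1 + 1")
'''

print(naive_hits(SAMPLE))
print(classify_eval_calls(SAMPLE))
```

The `ast` pass cannot tell whether the name `eval` is the builtin or a project-defined decorator that shadows it. It only narrows down which hits need a manual look.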

Shell / Subprocess Execution score 2.0

Found 1 shell execution pattern(s)

  • _phoenix_cmd() proc = subprocess.Popen( cmd, env=env, stdout=su

Credential Harvesting score 10.0

Found 6 credential access pattern(s)

  • er = SlackHandler(webhook_url=os.getenv("SLACK_WEBHOOK")) """ def __init__(self, webhook_url: str,
  • r"(\.\./)", r"(\.\.\\)", r"(/etc/passwd)", r"(/etc/shadow)", r"(C:\\Windows)", r"(/
  • .\.\\)", r"(/etc/passwd)", r"(/etc/shadow)", r"(C:\\Windows)", r"(/root/)", r"(/var/w
  • IGNORECASE), re.compile(r'/etc/passwd', re.IGNORECASE), re.compile(r'\\windows\\system32', re
  • ELETE FROM': 'high', '/etc/passwd': 'critical', 'eval(': 'high', 'exec(': 'hi
  • new_val = getpass.getpass(prompt).strip() except (EOFError, getpass.Ge
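
Several of the excerpts above are regex literals for sensitive paths rather than code that opens those paths. One possible reading, which this report does not confirm, is that they form a deny-list for screening text. The sketch below shows how such a list would behave, using only the patterns visible in the excerpts; `matched_patterns` and `PATH_PATTERNS` are hypothetical names, and the use of `re.IGNORECASE` for every pattern is an assumption.

```python
import re

# Patterns copied from the excerpts above; how the package uses them is an assumption.
PATH_PATTERNS = [
    r"(\.\./)",
    r"(\.\.\\)",
    r"(/etc/passwd)",
    r"(/etc/shadow)",
    r"(C:\\Windows)",
    r"(/root/)",
]
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in PATH_PATTERNS]


def matched_patterns(text):
    """Return the deny-list patterns that occur anywhere in the given text."""
    return [compiled.pattern for compiled in COMPILED if compiled.search(text)]


print(matched_patterns("open ../../etc/passwd"))
print(matched_patterns("summarize this report"))
```

If the package only searches strings with these patterns, the hits are detection rules. If it passes them to file-reading code, the credential finding deserves closer inspection.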

Typosquatting

No typosquatting candidates detected

Registered Email Domain

Email domain looks legitimate: gmail.com

Suspicious Page Links

All external links appear legitimate

Git Repository History

Repository bullpeng72/Agent-Evaluator appears legitimate

Maintainer History score 4.0

2 maintainer concern(s) found

  • Author name is missing or very short
  • Author "" appears to have only 1 package on PyPI (new or inactive account)

Known CVE Vulnerabilities

No known vulnerabilities found in OSV database.

💡 AI App Starter Prompt

Use this prompt to build a project with agent-evaluator
Develop a fully functional mini-application named 'AI-Agent-Inspector' that uses the 'agent-evaluator' package to assess the performance of various AI agents in a simulated environment. The application will serve as a tool for developers and researchers to evaluate different AI agents against a set of predefined criteria, ensuring they meet robust standards of functionality, security, and efficiency.

The application should include the following key features:
1. **Agent Registration**: Users can register new AI agents within the application. Each agent can belong to one or more categories (e.g., chatbots, recommendation systems, autonomous vehicles).
2. **Evaluation Setup**: Users can configure evaluation scenarios based on the seven evaluation gates provided by 'agent-evaluator': goal achievement, behavioral integrity, reliability, performance, security, multi-agent coordination, and observability.
3. **Dynamic Metrics Selection**: Allow users to select specific metrics from the 58 available metrics (25 native + 33 Harness Config) for each evaluation scenario.
4. **Scenario Execution**: Execute the configured scenarios and collect data on how well each agent performs according to the selected metrics.
5. **Reporting**: Provide comprehensive reports that summarize the results of each evaluation, highlighting strengths and weaknesses of the agents.
6. **Visualization**: Implement visual dashboards to display the evaluation results in a user-friendly manner, enabling quick insights into agent performance.
7. **Security and Compliance Checks**: Ensure that the application includes checks to prevent unauthorized access and manipulation of evaluation data.
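
The registration and evaluation-setup features above can be sketched as a small in-memory data model. This is a minimal sketch that deliberately does not call 'agent-evaluator', because its API is not shown here; `Agent`, `Scenario`, `Registry`, and their methods are hypothetical names, and only the seven gate names come from the package description.

```python
from dataclasses import dataclass, field

# The seven evaluation gates named in the package description.
GATES = (
    "goal achievement",
    "behavioral integrity",
    "reliability",
    "performance",
    "security",
    "multi-agent coordination",
    "observability",
)


@dataclass
class Agent:
    name: str
    categories: set[str] = field(default_factory=set)


@dataclass
class Scenario:
    agent_name: str
    gates: tuple[str, ...]
    metrics: tuple[str, ...] = ()


class Registry:
    """In-memory store for registered agents and their evaluation scenarios."""

    def __init__(self):
        self._agents = {}
        self._scenarios = []

    def register(self, name, *categories):
        if name in self._agents:
            raise ValueError(f"agent already registered: {name}")
        agent = Agent(name, set(categories))
        self._agents[name] = agent
        return agent

    def add_scenario(self, agent_name, gates, metrics=()):
        if agent_name not in self._agents:
            raise KeyError(agent_name)
        unknown = [gate for gate in gates if gate not in GATES]
        if unknown:
            raise ValueError(f"unknown gates: {unknown}")
        scenario = Scenario(agent_name, tuple(gates), tuple(metrics))
        self._scenarios.append(scenario)
        return scenario


registry = Registry()
registry.register("helpdesk-bot", "chatbots")
scenario = registry.add_scenario("helpdesk-bot", ["security", "reliability"])
print(scenario)
```

Metric names are left as free-form strings because the list of 58 metrics is not given here. Validating them would require the package's own catalog.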

To utilize the 'agent-evaluator' package, integrate it into your application to handle the complex task of evaluating AI agents against the specified metrics and criteria. Use its comprehensive framework to streamline the process of setting up evaluation scenarios, executing them, and generating detailed reports. Additionally, leverage the package's advanced features for handling multi-agent environments and ensuring robust security measures are in place.