feat: Initial commit of LLM Tool Proxy
.gitignore (vendored, new file, 135 lines)
@@ -0,0 +1,135 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to consider not ignoring
# `.python-version` so that the required Python version is remembered.
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock
# in version control. But in case of collaboration, if having platform-specific
# dependencies causes problems, excluding it is a good option.
#Pipfile.lock

# PEP 582; __pypackages__ directory
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak
venv.bak

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/
README.md (new file, 124 lines)
@@ -0,0 +1,124 @@
# LLM Tool Proxy

## 1. Overview

This project is an intelligent LLM (large language model) proxy service built on FastAPI. Its core function is to intercept API requests bound for an LLM, dynamically inject the client-defined `tools` into the prompt, parse the LLM's reply to extract any tool-call instructions it contains, and return them to the caller in a structured format.

This gives us a way to grant tool-use capability through prompt engineering, even when the underlying LLM has no native support for tool-calling parameters.

## 2. Design Principles

The program strictly follows these design principles:

- **High Cohesion**: Business logic is concentrated in the service layer (`app/services.py`), separated from API routing and data models.
- **Low Coupling**:
  - The API layer (`app/main.py`) is responsible only for routing and request validation; it does not care about business implementation details.
  - Configuration is obtained via dependency injection (`Depends`), avoiding global state.
  - LLM calls are abstracted into standalone functions, making it easy to switch LLM backends or substitute mock implementations in tests.
- **Testability**: The project ships with unit and integration tests (`tests/`), using `pytest` and `TestClient` to verify the correctness of each module and the stability of the overall flow.

## 3. Project Structure

```
.
├── app/                  # Core application code
│   ├── core/             # Configuration management
│   │   └── config.py
│   ├── main.py           # FastAPI app instance and API routes
│   ├── models.py         # Pydantic data models
│   └── services.py       # Core business logic
├── tests/                # Test code
│   └── test_main.py
├── .env                  # Environment variable file (create manually)
├── .gitignore            # Git ignore rules
├── README.md             # This document
└── .venv/                # Python virtual environment (created by uv)
```

## 4. Core Logic

### 4.1. Prompt Injection

- **Implementing function**: `app.services.inject_tools_into_prompt`
- **Strategy**:
  1. Serialize the `tools` list (a JSON array) from the client request into a pretty-printed JSON string.
  2. Create a new, standalone message with `role` set to `system`.
  3. This message carries explicit instructions telling the LLM which tools it has and how to call them using a specific format.
  4. **Call format convention**: The LLM is instructed that whenever it needs to call a tool, it must emit a `<tool_call>{...}</tool_call>` XML tag containing a JSON object with `name` and `arguments` fields (see the example after this list).
  5. This system message is inserted at the second position (index 1) of the original message list, and the whole modified list is then sent to the real LLM backend.
- **Purpose**: Transparent to the caller; the tool-usage "contract" is handed to the LLM through context.
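
For illustration, a reply that follows this contract might look like the following (assumed model output, not a guaranteed shape):

```
Let me check the weather for you.
<tool_call>{"name": "get_weather", "arguments": {"city": "San Francisco"}}</tool_call>
```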

### 4.2. Response Parsing

- **Implementing function**: `app.services.parse_llm_response_from_content`
- **Strategy**:
  1. Use a regular expression (`re.search`) to locate the `<tool_call>...</tool_call>` tag in the LLM's plain-text response (see the sketch after this list).
  2. If found, extract the JSON string inside the tag and parse it into a structured `ToolCall` object. In that case the `tool_calls` field of the `ResponseMessage` returned to the client is populated, while the `content` field may be `None`.
  3. If no tag is found, treat the entire LLM response as regular text content and populate the `content` field.
- **Purpose**: Turn the LLM's unstructured (or semi-structured) output into well-defined, structured data that the client can process with ease.
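
A minimal sketch of the extraction step (the actual implementation in `app/services.py` adds error handling and builds the Pydantic models):

```python
import re
import json

# Raw text as an LLM following the injected contract might return it
text = 'Sure. <tool_call>{"name": "get_weather", "arguments": {"city": "San Francisco"}}</tool_call>'

match = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
if match:
    call = json.loads(match.group(1).strip())
    print(call["name"], call["arguments"])  # get_weather {'city': 'San Francisco'}
```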

## 5. Configuration

- The configuration file is `.env` in the project root.
- The `get_settings` function in `app/core/config.py` loads environment variables on every request via dependency injection, which keeps configuration current and makes it easy to override in tests.
- **Required variables** (see the example `.env` below):
  - `REAL_LLM_API_URL`: address of the real LLM backend.
  - `REAL_LLM_API_KEY`: API key used to access the real LLM.
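
A minimal `.env` might look like this (both values are placeholders):

```
REAL_LLM_API_URL=https://your-llm-backend.example.com/v1/chat/completions
REAL_LLM_API_KEY=sk-your-key-here
```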

## 6. Usage: Running and Testing

### 6.1. Environment Setup

```bash
# Create the virtual environment
uv venv

# Install dependencies (python-dotenv is needed by app/main.py to load .env)
uv pip install fastapi uvicorn httpx python-dotenv pytest
```

### 6.2. Run the Development Server

```bash
uvicorn app.main:app --reload
```

The service runs at `http://127.0.0.1:8000`.
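
A quick way to confirm the server is up is the root health-check endpoint:

```bash
curl http://127.0.0.1:8000/
# {"message": "LLM Tool Proxy is running."}
```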

### 6.3. Run the Tests

```bash
# Run pytest with the Python interpreter from .venv
.venv/bin/python -m pytest
```
## 7. API Example

**Endpoint**: `POST /v1/chat/completions`

**Request example (with tools)**:
```bash
curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get weather for a city",
          "parameters": {}
        }
      }
    ]
  }'
```
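
If the model decides to call the tool, the proxy's structured response takes roughly this shape (values illustrative; per `app/models.py`, `content` carries any text that preceded the tag, or `null`):

```json
{
  "message": {
    "role": "assistant",
    "content": null,
    "tool_calls": [
      {
        "id": "call_get_weather",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"San Francisco\"}"
        }
      }
    ]
  }
}
```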

## 8. Future Improvements

- **Multiple LLM backends**: Extend `_raw_stream_from_llm` so it can pick different LLM providers based on request parameters or configuration.
- **More flexible tool-call formats**: Support formats beyond the XML tag, for example a pure-JSON output mode.
- **Streaming tool-call parsing**: Streamed responses are currently relayed as-is; parse tool calls from the stream and return them to the client in real time.
- **Stronger error handling**: Provide finer-grained error feedback for different LLM API error codes and network issues.
app/__init__.py (new file, 0 lines)

app/core/__init__.py (new file, 0 lines)
app/core/config.py (new file, 17 lines)
@@ -0,0 +1,17 @@
import os
from pydantic import BaseModel
from typing import Optional


class Settings(BaseModel):
    """Manages application settings and configurations."""
    REAL_LLM_API_URL: Optional[str] = None
    REAL_LLM_API_KEY: Optional[str] = None


def get_settings() -> Settings:
    """
    Returns an instance of the Settings object by loading from environment variables.
    """
    return Settings(
        REAL_LLM_API_URL=os.getenv("REAL_LLM_API_URL"),
        REAL_LLM_API_KEY=os.getenv("REAL_LLM_API_KEY"),
    )
app/main.py (new file, 79 lines)
@@ -0,0 +1,79 @@
import os
import sys
from dotenv import load_dotenv

# --- Explicit Debugging & Env Loading ---
print(f"--- [DEBUG] Current Working Directory: {os.getcwd()}", file=sys.stderr)
load_result = load_dotenv()
print(f"--- [DEBUG] load_dotenv() result: {load_result}", file=sys.stderr)
# ---

import logging
from fastapi import FastAPI, HTTPException, Depends
from starlette.responses import StreamingResponse
from .models import IncomingRequest, ProxyResponse
from .services import process_chat_request, stream_llm_api, inject_tools_into_prompt
from .core.config import get_settings, Settings

# --- Logging Configuration ---
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("llm_proxy.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)
# --- End of Logging Configuration ---

app = FastAPI(
    title="LLM Tool Proxy",
    description="A proxy that intercepts LLM requests to inject and handle tool calls.",
    version="1.0.0",
)


@app.on_event("startup")
async def startup_event():
    logger.info("Application startup complete.")
    current_settings = get_settings()
    logger.info(f"Loaded LLM API URL: {current_settings.REAL_LLM_API_URL}")


@app.post("/v1/chat/completions")
async def chat_completions(
    request: IncomingRequest,
    settings: Settings = Depends(get_settings)
):
    """
    This endpoint mimics the OpenAI Chat Completions API and supports both
    streaming (`stream=True`) and non-streaming (`stream=False`) responses.
    """
    if not settings.REAL_LLM_API_KEY or not settings.REAL_LLM_API_URL:
        logger.error("REAL_LLM_API_KEY or REAL_LLM_API_URL is not configured.")
        raise HTTPException(status_code=500, detail="LLM API Key or URL is not configured.")

    # Prepare the messages for the streaming path, injecting the tool prompt
    # if the client supplied tools.
    messages_to_llm = request.messages
    if request.tools:
        messages_to_llm = inject_tools_into_prompt(request.messages, request.tools)

    # Handle streaming request
    if request.stream:
        logger.info(f"Initiating streaming request with {len(messages_to_llm)} messages.")
        generator = stream_llm_api(messages_to_llm, settings)
        return StreamingResponse(generator, media_type="text/event-stream")

    # Handle non-streaming request. process_chat_request performs its own tool
    # injection, so the original (uninjected) messages are passed here to avoid
    # injecting the tool prompt twice.
    try:
        logger.info(f"Initiating non-streaming request with {len(request.messages)} messages.")
        response_message = await process_chat_request(request.messages, request.tools, settings)
        logger.info("Successfully processed non-streaming request.")
        return ProxyResponse(message=response_message)
    except Exception as e:
        logger.exception("An unexpected error occurred during non-streaming request.")
        raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {str(e)}")


@app.get("/")
def read_root():
    return {"message": "LLM Tool Proxy is running."}
app/models.py (new file, 41 lines)
@@ -0,0 +1,41 @@
from pydantic import BaseModel
from typing import List, Dict, Any, Optional


# Models for incoming requests
class ChatMessage(BaseModel):
    """Represents a single message in the chat history."""
    role: str
    content: str


class Tool(BaseModel):
    """Represents a tool definition provided by the user."""
    type: str
    function: Dict[str, Any]


class IncomingRequest(BaseModel):
    """Defines the structure of the request from the client."""
    messages: List[ChatMessage]
    tools: Optional[List[Tool]] = None
    stream: Optional[bool] = False


# Models for outgoing responses
class ToolCallFunction(BaseModel):
    """Function call details within a tool call."""
    name: str
    arguments: str  # JSON string of arguments


class ToolCall(BaseModel):
    """Represents a tool call requested by the LLM."""
    id: str
    type: str = "function"
    function: ToolCallFunction


class ResponseMessage(BaseModel):
    """The message part of the response from the proxy."""
    role: str = "assistant"
    content: Optional[str] = None
    tool_calls: Optional[List[ToolCall]] = None


class ProxyResponse(BaseModel):
    """Defines the final structured response sent back to the client."""
    message: ResponseMessage
app/services.py (new file, 198 lines)
@@ -0,0 +1,198 @@
import json
import re
import httpx
import logging
from typing import List, Dict, Any, Optional, AsyncGenerator

from .models import ChatMessage, Tool, ResponseMessage, ToolCall, ToolCallFunction
from .core.config import Settings

# Get a logger instance for this module
logger = logging.getLogger(__name__)

# --- Helper for parsing SSE ---
# Regex to extract data field from SSE
SSE_DATA_RE = re.compile(r"data:\s*(.*)")


def _parse_sse_data(chunk: bytes) -> Optional[Dict[str, Any]]:
    """Parses a chunk of bytes as SSE and extracts the JSON data."""
    try:
        lines = chunk.decode("utf-8").splitlines()
        for line in lines:
            if line.startswith("data:"):
                match = SSE_DATA_RE.match(line)
                if match:
                    data_str = match.group(1).strip()
                    if data_str == "[DONE]":  # Handle OpenAI-style stream termination
                        return {"type": "done"}
                    try:
                        return json.loads(data_str)
                    except json.JSONDecodeError:
                        logger.warning(f"Failed to decode JSON from SSE data: {data_str}")
                        return None
    except UnicodeDecodeError:
        logger.warning("Failed to decode chunk as UTF-8.")
    return None

# --- End Helper ---


def inject_tools_into_prompt(messages: List[ChatMessage], tools: List[Tool]) -> List[ChatMessage]:
    """
    Injects tool definitions into the message list as a system prompt.
    """
    tool_defs = json.dumps([tool.model_dump() for tool in tools], indent=2)
    tool_prompt = f"""
You have access to a set of tools. You can call them by emitting a JSON object inside a <tool_call> XML tag.
The JSON object should have a "name" and "arguments" field.

Here are the available tools:
{tool_defs}

Only use the tools if strictly necessary.
"""
    new_messages = messages.copy()
    new_messages.insert(1, ChatMessage(role="system", content=tool_prompt))
    return new_messages


def parse_llm_response_from_content(text: str) -> ResponseMessage:
    """
    (Fallback) Parses the raw LLM text response to extract a message and any tool calls.
    This is used when the LLM does not support native tool calling.
    """
    if not text:
        return ResponseMessage(content=None)

    tool_call_match = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)

    if tool_call_match:
        tool_call_str = tool_call_match.group(1).strip()
        try:
            tool_call_data = json.loads(tool_call_str)
            tool_call = ToolCall(
                id="call_" + tool_call_data.get("name", "unknown"),
                function=ToolCallFunction(
                    name=tool_call_data.get("name"),
                    arguments=json.dumps(tool_call_data.get("arguments", {})),
                )
            )
            content_before = text.split("<tool_call>")[0].strip()
            return ResponseMessage(content=content_before if content_before else None, tool_calls=[tool_call])
        except json.JSONDecodeError as e:
            logger.warning(f"Failed to parse tool call JSON from content: {tool_call_str}. Error: {e}")
            return ResponseMessage(content=text)
    else:
        return ResponseMessage(content=text)


async def _raw_stream_from_llm(messages: List[ChatMessage], settings: Settings) -> AsyncGenerator[bytes, None]:
    """
    Makes the raw HTTP streaming call to the LLM backend.
    Yields raw byte chunks as received.
    """
    headers = {"Authorization": f"Bearer {settings.REAL_LLM_API_KEY}", "Content-Type": "application/json"}
    payload = {"model": "default-model", "messages": [msg.model_dump() for msg in messages], "stream": True}

    try:
        async with httpx.AsyncClient() as client:
            logger.info(f"Initiating raw stream to LLM API at {settings.REAL_LLM_API_URL}")
            async with client.stream("POST", settings.REAL_LLM_API_URL, headers=headers, json=payload, timeout=60.0) as response:
                response.raise_for_status()
                async for chunk in response.aiter_bytes():
                    yield chunk
    except httpx.HTTPStatusError as e:
        # The body of a streamed error response is not readable once the stream
        # context has closed, so only the status code is logged here.
        logger.error(f"LLM API returned an error during raw stream: {e.response.status_code}")
        # For streams, we log and let the stream terminate. The client will get a broken stream.
        yield b'data: {"error": "LLM API Error", "status_code": ' + str(e.response.status_code).encode() + b'}\n\n'
    except httpx.RequestError as e:
        logger.error(f"An error occurred during raw stream request to LLM API: {e}")
        yield b'data: {"error": "Network Error", "details": "' + str(e).encode() + b'"}\n\n'


async def stream_llm_api(messages: List[ChatMessage], settings: Settings) -> AsyncGenerator[bytes, None]:
    """
    Public interface for streaming. Calls the raw stream and yields its chunks.
    """
    async for chunk in _raw_stream_from_llm(messages, settings):
        # We assume the raw chunks are already SSE formatted or can be split into lines.
        # For simplicity, we pass through the raw chunk bytes.
        # A more robust parser would ensure each yield is a complete SSE event line.
        yield chunk


async def process_llm_stream_for_non_stream_request(
    messages: List[ChatMessage],
    settings: Settings
) -> Dict[str, Any]:
    """
    Aggregates a streaming LLM response into a single, non-streaming message.
    Handles SSE parsing and delta accumulation.
    """
    full_content_parts = []
    final_message_dict: Dict[str, Any] = {"role": "assistant", "content": None}

    async for chunk in _raw_stream_from_llm(messages, settings):
        parsed_data = _parse_sse_data(chunk)
        if parsed_data:
            if parsed_data.get("type") == "done":
                break  # End of stream

            # Assuming OpenAI-like streaming format
            choices = parsed_data.get("choices")
            if choices and len(choices) > 0:
                delta = choices[0].get("delta")
                if delta:
                    if "content" in delta:
                        full_content_parts.append(delta["content"])
                    if "tool_calls" in delta:
                        # Accumulating tool calls that appear in deltas is complex.
                        # For simplicity, we try to reconstruct the final tool_calls
                        # from the final message, or fall back to content parsing later.
                        # This part is highly dependent on the LLM's exact streaming
                        # format for tool_calls.
                        pass
                if choices[0].get("finish_reason"):
                    # Check for finish_reason to identify stream end or tool_calls completion
                    pass

    final_message_dict["content"] = "".join(full_content_parts) if full_content_parts else None

    # This is a simplification. Reconstructing tool_calls from deltas is non-trivial.
    # We will rely on parse_llm_response_from_content for tool calls if they are
    # embedded in the final content string, or assume the LLM doesn't send native
    # tool_calls in stream deltas that need aggregation here.
    logger.info(f"Aggregated non-streaming response content: {final_message_dict.get('content')}")

    return final_message_dict


async def process_chat_request(
    messages: List[ChatMessage],
    tools: Optional[List[Tool]],
    settings: Settings,
) -> ResponseMessage:
    """
    Main service function for non-streaming requests.
    It now calls the stream aggregation logic.
    """
    request_messages = messages
    if tools:
        request_messages = inject_tools_into_prompt(messages, tools)

    # All interactions with the real LLM now go through the streaming mechanism.
    llm_message_dict = await process_llm_stream_for_non_stream_request(request_messages, settings)

    # Priority 1: Check for native tool calls (if the aggregation could reconstruct them).
    # Note: Reconstructing tool_calls from deltas in streaming is complex.
    # For now, we assume if tool_calls are present, they are complete.
    if llm_message_dict.get("tool_calls"):
        logger.info("Native tool calls detected in aggregated LLM response.")
        # Ensure it's a list of dicts suitable for Pydantic validation
        if isinstance(llm_message_dict["tool_calls"], list):
            return ResponseMessage.model_validate(llm_message_dict)
        else:
            logger.warning("Aggregated tool_calls not in expected list format. Treating as content.")

    # Priority 2 (Fallback): Parse tool calls from content
    logger.info("No native tool calls from aggregation. Falling back to content parsing.")
    return parse_llm_response_from_content(llm_message_dict.get("content"))
tests/__init__.py (new file, 0 lines)
tests/test_main.py (new file, 85 lines)
@@ -0,0 +1,85 @@
from fastapi.testclient import TestClient
from app.main import app

# The TestClient allows us to make requests to our FastAPI app without a running server.
client = TestClient(app)


def test_root_endpoint():
    """Tests the health check endpoint."""
    response = client.get("/")
    assert response.status_code == 200
    assert response.json() == {"message": "LLM Tool Proxy is running."}


def test_chat_completions_no_tools(monkeypatch):
    """
    Tests the main endpoint with a simple request that does not include tools.
    This is now an INTEGRATION TEST against the live backend.
    """
    monkeypatch.setenv("REAL_LLM_API_URL", "https://qwapi.oopsapi.com/v1/chat/completions")
    monkeypatch.setenv("REAL_LLM_API_KEY", "dummy-key")

    request_data = {
        "messages": [
            {"role": "user", "content": "Hello there!"}
        ]
    }
    response = client.post("/v1/chat/completions", json=request_data)

    assert response.status_code == 200
    response_json = response.json()

    # Assertions for a real response: check structure and types, not specific content.
    assert "message" in response_json
    assert response_json["message"]["role"] == "assistant"
    # The real LLM should return some content
    assert isinstance(response_json["message"]["content"], str)
    assert len(response_json["message"]["content"]) > 0


def test_chat_completions_with_tools_integration(monkeypatch):
    """
    Tests the main endpoint with a request that includes tools against the live backend.
    We check for a valid response, but cannot guarantee a tool will be called.
    """
    monkeypatch.setenv("REAL_LLM_API_URL", "https://qwapi.oopsapi.com/v1/chat/completions")
    monkeypatch.setenv("REAL_LLM_API_KEY", "dummy-key")

    request_data = {
        "messages": [
            {"role": "user", "content": "What's the weather in San Francisco?"}
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get the current weather for a specified city",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "city": {"type": "string", "description": "The city name"}
                        },
                        "required": ["city"]
                    }
                }
            }
        ]
    }

    response = client.post("/v1/chat/completions", json=request_data)

    # For an integration test, the main goal is to ensure our proxy
    # communicates successfully and can parse the response without errors.
    assert response.status_code == 200
    response_json = response.json()

    # We assert that the basic structure is correct.
    assert "message" in response_json
    assert response_json["message"]["role"] == "assistant"

    # The response might contain content, a tool_call, or both. We just
    # ensure the response fits our Pydantic model, which the TestClient handles.
    # A successful 200 response is our primary success metric here.
    assert response_json is not None
tests/test_services.py (new file, 96 lines)
@@ -0,0 +1,96 @@
import pytest
import httpx
from typing import AsyncGenerator

from app.services import process_llm_stream_for_non_stream_request
from app.models import ChatMessage
from app.core.config import Settings

# Sample SSE chunks to simulate a streaming response
SSE_STREAM_CHUNKS = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world!"}}]}',
    'data: {"choices": [{"delta": {"tool_calls": [{"index": 0, "id": "call_123", "function": {"name": "get_weather", "arguments": ""}}]}}]}',
    'data: {"choices": [{"delta": {"tool_calls": [{"index": 0, "function": {"arguments": "{\\"location\\":"}}]}}]}',
    'data: {"choices": [{"delta": {"tool_calls": [{"index": 0, "function": {"arguments": " \\"San Francisco\\"}"}}]}}]}',
    'data: [DONE]',
]


# Mock settings for the test
@pytest.fixture
def mock_settings() -> Settings:
    """Provides mock settings for tests."""
    return Settings(
        REAL_LLM_API_URL="http://fake-llm-api.com/chat",
        REAL_LLM_API_KEY="fake-key"
    )


# Async generator to mock the streaming response. The service consumes raw
# bytes via aiter_bytes(), so each SSE line is encoded to UTF-8.
async def mock_aiter_bytes() -> AsyncGenerator[bytes, None]:
    for chunk in SSE_STREAM_CHUNKS:
        yield chunk.encode("utf-8")


# Mock for the httpx.Response object
class MockStreamResponse:
    def __init__(self, status_code: int = 200):
        self._status_code = status_code

    def raise_for_status(self):
        if self._status_code != 200:
            raise httpx.HTTPStatusError(
                message="Error", request=httpx.Request("POST", ""), response=httpx.Response(self._status_code)
            )

    def aiter_bytes(self) -> AsyncGenerator[bytes, None]:
        return mock_aiter_bytes()

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        pass


# Mock for the httpx.AsyncClient
class MockAsyncClient:
    def stream(self, method, url, headers, json, timeout):
        return MockStreamResponse()

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        pass


@pytest.mark.anyio
async def test_process_llm_stream_for_non_stream_request(monkeypatch, mock_settings):
    """
    Tests that the stream aggregation service correctly handles an SSE stream,
    parses the chunks, and assembles the final response message.
    """
    # Patch httpx.AsyncClient to use our mock
    monkeypatch.setattr(httpx, "AsyncClient", MockAsyncClient)

    messages = [ChatMessage(role="user", content="What is the weather in San Francisco?")]

    # Call the function
    result = await process_llm_stream_for_non_stream_request(messages, mock_settings)

    # The aggregator joins the content deltas; tool_call deltas are currently
    # skipped (reconstructing them from a stream is left as a future improvement),
    # so the expected result contains only the accumulated text content.
    expected_result = {
        "role": "assistant",
        "content": "Hello world!",
    }

    # Assert that the result matches the expected output
    assert result == expected_result