fix: filter system prompt content from agent responses
- Add _filter_system_prompt_from_response() function to detect and filter leaked system prompt content
- Apply filtering to all 4 response output points in process_message() and send_user_response()
- Return safe fallback message when system prompt patterns are detected
- Prevents internal agent instructions from being exposed to users during follow-up queries
+    Filter out any system prompt content that might have leaked into agent responses.
+
+    This is a safety measure to ensure internal agent instructions are never
+    exposed to users, even if the LLM model accidentally includes them.
+
+    Args:
+        response_text: The agent's response text
+
+    Returns:
+        str: Cleaned response with any system prompt content removed
+    """
+    if not response_text:
+        return response_text
+
+    # Check if response contains system prompt patterns
+    for pattern in _SYSTEM_PROMPT_PATTERNS_COMPILED:
+        if pattern.search(response_text):
+            logger.warning(f"System prompt content detected in agent response, filtering. Pattern: {pattern.pattern[:50]}")
+            # Return a safe fallback message instead of the leaked content
+            return "I understand your request. Could you please clarify what specific changes you'd like me to make to the marketing content? I'm here to help refine your campaign materials."
+
+    return response_text
+
+
 # Standard RAI refusal message for harmful content
 RAI_HARMFUL_CONTENT_RESPONSE = """I'm a specialized marketing content generation assistant designed exclusively for creating professional marketing materials.

@@ -637,8 +700,9 @@ async def process_message(
             for msg in messages
         ])

-        # Get the last message content
+        # Get the last message content and filter any system prompt leakage
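The hunk above is cut off after the comment, so the actual call site is not visible. As a rough sketch of what filtering at one output point might look like, assuming messages are dicts with a `content` key (an assumption), and using a hypothetical helper `last_message_text`, a single stand-in pattern, and a stand-in fallback string:

```python
import asyncio
import re

# Stand-in pattern and fallback, for illustration only.
_LEAK_PATTERN = re.compile(r"system prompt", re.IGNORECASE)
FALLBACK = "I understand your request. Could you please clarify what you'd like changed?"


def filter_response(text):
    """Replace the whole response with FALLBACK if it looks like a leak."""
    if text and _LEAK_PATTERN.search(text):
        return FALLBACK
    return text


async def last_message_text(messages):
    """Hypothetical output point: take the last message and filter it before returning."""
    # Get the last message content and filter any system prompt leakage.
    last_content = messages[-1]["content"] if messages else ""
    return filter_response(last_content)
```

Filtering at the point where text leaves the function, rather than where it is produced, means each of the output points named in the commit message needs its own call to the filter.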