7.3 User Interaction Data Processing Workflow

To effectively utilize user-AI Agent interaction data, a systematic data processing workflow is required, covering data collection, processing, and model optimization. This ensures data quality and usability, and continuously feeds back into the iterative improvement of the Agent.

7.3.1 Data Collection

Data collection is the foundation of the entire process, divided into implicit and explicit methods, aiming to comprehensively capture user behavior and feedback.

7.3.1.1 Implicit Collection (Automatic Recording)

Implicit collection refers to the system automatically recording interaction data with little or no user awareness. This is typically achieved through logging systems, ensuring data completeness and objectivity.

- Logging System Records: Detailed records of complete dialogue content between users and Agents (user input, Agent output), including metadata such as timestamps, device information, and dialogue length. These logs can be structured as a Dialogue Tree, recording dialogue paths (e.g., user question → Agent response → user follow-up question) and building a dialogue state graph, which helps analyze dialogue flow and user behavior patterns.
- Additional Metadata: Beyond basic dialogue content, additional metadata can be added, such as dialogue topic tags (e.g., automatically classified as “technical issue,” “complaint,” or “product inquiry” using natural language processing techniques) and sentiment labels (automatically tagged using sentiment analysis models, such as BERT-based classifiers), providing richer dimensions for subsequent data analysis and model optimization [12].
- Dialogue Association: Dialogues are associated across sessions through user IDs or Session IDs, building a complete user interaction history. This is crucial for personalized services and the construction of long-term memory.
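As a sketch, the logged turn record and session association described above could be represented as follows. The field names, `DialogueTurn`, and `InteractionLog` are illustrative choices, not a prescribed schema:

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class DialogueTurn:
    """One logged exchange, with the metadata described above."""
    session_id: str
    user_id: str
    user_input: str
    agent_output: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    topic: Optional[str] = None      # e.g. auto-classified "technical issue"
    sentiment: Optional[str] = None  # e.g. tagged by a sentiment model

class InteractionLog:
    """Append-only log that groups turns by session ID, so a user's
    complete interaction history can be reconstructed later."""

    def __init__(self) -> None:
        self._by_session = defaultdict(list)

    def record(self, turn: DialogueTurn) -> None:
        self._by_session[turn.session_id].append(turn)

    def session_history(self, session_id: str) -> List[DialogueTurn]:
        return list(self._by_session[session_id])
```

Keying the log by session ID (rather than storing a flat list) is what makes cross-session association cheap when building long-term memory.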

7.3.1.2 Explicit Collection (User-Provided)

Explicit collection involves users actively providing feedback, which typically offers more direct and subjective evaluations, valuable for understanding user satisfaction and identifying model errors.

- Like/Dislike Buttons and Text Feedback: This is the most common explicit feedback mechanism. Users can quickly provide a binary evaluation (good/bad) of an Agent response and optionally add a text comment explaining the reason for their like or dislike (e.g., “This answer is not detailed enough,” “The information is incorrect; the correct information is…”). Such textual feedback provides direct clues for identifying model errors and directions for improvement.
- Annotation Tools: Provide specialized annotation tools that allow users to highlight erroneous segments in Agent responses and select error types (e.g., “knowledge error,” “logical error,” “format error,” “unclear expression”). This fine-grained annotation is highly beneficial for diagnosing model issues and performing targeted fine-tuning.
- Questionnaires and Surveys: Periodically or occasionally send questionnaires to users to collect their ratings and qualitative feedback on the Agent’s overall performance, service quality, and user experience. Surveys can uncover deeper user needs and satisfaction information, complementing automated metrics.
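A minimal sketch of how such explicit feedback might be recorded, combining the binary rating, free-text comment, and error-type annotation into one structure. The class name, fields, and validation rules are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Error taxonomy matching the annotation-tool examples in the text.
ERROR_TYPES = {"knowledge error", "logical error", "format error", "unclear expression"}

@dataclass
class ExplicitFeedback:
    turn_id: str                      # which Agent response this refers to
    rating: int                       # +1 = like, -1 = dislike
    comment: Optional[str] = None     # free-text explanation
    error_type: Optional[str] = None  # set when the user annotates an error
    error_span: Optional[Tuple[int, int]] = None  # highlighted character offsets

    def __post_init__(self) -> None:
        if self.rating not in (1, -1):
            raise ValueError("rating must be +1 (like) or -1 (dislike)")
        if self.error_type is not None and self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type!r}")
```

Validating against a fixed error taxonomy at collection time keeps downstream annotation and fine-tuning pipelines from having to reconcile free-form labels.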

7.3.2 Data Processing

Collected raw data needs to undergo a series of processing steps to be transformed into high-quality data suitable for model training and optimization.

7.3.2.1 Data Cleaning

Data cleaning is a critical step to ensure data quality, aiming to remove noise, redundancy, and sensitive information.

- Removal of Personally Identifiable Information (PII): Anonymize collected dialogue data by removing sensitive personal information such as names, phone numbers, and addresses, to protect user privacy and comply with data privacy regulations (e.g., GDPR, CCPA).
- Filtering Low-Quality Data: Delete meaningless inputs (e.g., garbled text, repetitive content, spam) and low-quality Agent outputs to prevent negative impacts on model training.
- Sensitive Content Detection: Identify and filter out illegal, ethically problematic, or inappropriate content through keyword filtering, rule matching, or specialized sensitive content detection models, ensuring Agent outputs comply with social norms.
- Text Normalization: Standardize text, for example by automatically expanding abbreviations (e.g., “AI” to “Artificial Intelligence”), correcting typos, and supplementing domain-specific terminology, to improve text consistency and readability.
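A toy sketch of the PII-redaction and normalization steps. The regex patterns and abbreviation table below are deliberately minimal illustrations; production systems typically rely on NER-based PII detectors and curated terminology dictionaries rather than two regexes:

```python
import re

# Hypothetical minimal PII patterns (US-style phone, email) for illustration only.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

# Illustrative abbreviation-expansion table for text normalization.
ABBREVIATIONS = {"AI": "Artificial Intelligence"}

def clean_text(text: str) -> str:
    # Step 1: replace PII spans with neutral placeholders (anonymization).
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    # Step 2: naive whole-word abbreviation expansion (normalization).
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    return text
```

Replacing PII with typed placeholders (rather than deleting it) preserves sentence structure, which matters when the cleaned text is later used as training data.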

7.3.2.2 Data Annotation

Data annotation is the process of transforming raw data into structured information that models can learn from, combining automated and manual annotation.

- Automated Annotation: Use rules, heuristic methods, or pre-trained small models to perform preliminary annotation on data, such as automatically identifying “knowledge errors” or “logical contradictions” in responses. This can significantly improve annotation efficiency and provide a basis for manual annotation.
- Manual Annotation and Active Learning: Prioritize manual annotation for samples where the model is uncertain or has low confidence (e.g., when the model’s confidence in judging an error type is below 0.7). This Active Learning strategy achieves greater model performance improvements at lower annotation cost.
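The uncertainty-based routing described above can be sketched as follows, using the 0.7 confidence threshold from the text (the triple format is an assumption for illustration):

```python
def select_for_manual_review(samples, confidence_threshold=0.7):
    """Route low-confidence automated annotations to human annotators.

    `samples` is a list of (sample_id, predicted_label, confidence) triples
    produced by the automated annotator. Samples below the threshold go to
    manual review, most uncertain first (simple uncertainty sampling).
    """
    needs_review = [s for s in samples if s[2] < confidence_threshold]
    auto_accepted = [s for s in samples if s[2] >= confidence_threshold]
    needs_review.sort(key=lambda s: s[2])  # ascending confidence
    return needs_review, auto_accepted
```

Prioritizing the lowest-confidence samples concentrates the annotation budget where the automated annotator is most likely to be wrong.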

7.3.2.3 Embedding Vector Generation: Deep Semantic Representation Based on Large Language Models (LLMs)

Traditional embedding vector generation techniques, such as Word2Vec and GloVe, primarily focus on capturing local semantic information of words or phrases. However, when an Agent interacts with users in natural language and involves complex scenarios such as likes, dislikes, and comments, a deeper and more context-aware semantic representation is required. Therefore, we have chosen the embedding generation capability of Large Language Models (LLMs) as the core technology to precisely capture user intent and feedback.

Core advantages of LLM-based embedding generation:

- Context Awareness: LLMs are pre-trained on vast amounts of text data, enabling them to understand complex linguistic structures and contextual relationships. Embedding vectors generated by LLMs not only reflect the semantics of the words themselves but also capture their deeper meanings in specific contexts. This is crucial for understanding the subtle emotions and intentions of users during interactions.
- Semantic Richness: LLMs map text into high-dimensional vector spaces whose vectors carry rich semantic information. Similar texts lie closer together in the vector space, which helps the Agent quickly retrieve highly relevant historical dialogues, comments, or feedback from large-scale interaction data.
- Potential for Multimodal Fusion: With the development of multimodal LLMs, future embedding generation will be able to seamlessly integrate information from multiple modalities such as text, images, and audio. For example, images or voice messages uploaded by users in comments can be transformed into unified vector representations through multimodal embedding models, further enriching the Agent’s understanding of user feedback.
- Adaptability and Fine-tuning: As foundation models, LLMs can be fine-tuned with a small amount of domain-specific data so that their embedding vectors adapt to the language habits and domain knowledge of specific Agent application scenarios. This is significant for improving the interaction effectiveness of Agents in specific fields such as customer service, education, and entertainment.
- Seamless Integration with RAG: Embedding vectors generated by LLMs are a key component of the Retrieval-Augmented Generation (RAG) architecture. By transforming user queries and knowledge base content into high-quality embedding vectors, the RAG system can efficiently retrieve relevant information and use it as grounding for LLM-generated responses, thereby enhancing the accuracy and relevance of those responses.

Technical Implementation: We will utilize pre-trained LLMs (such as BERT, RoBERTa, and the GPT series) or LLMs specifically optimized for embedding tasks (such as Sentence-BERT and BGE) to generate text embeddings. Specifically, the natural language interaction content between users and Agents, like/dislike behaviors (which can be converted into textual descriptions), and comment content will all be input into the LLM to generate corresponding embedding vectors. These vectors will be stored in a vector database for subsequent similarity search and analysis.
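The embed-store-search pipeline above can be sketched in miniature. Here `toy_embed` (a character-frequency vector) stands in for a real encoder such as Sentence-BERT's `model.encode()`, and the in-memory `VectorStore` stands in for a real vector database; both names are assumptions for illustration:

```python
import math

def cosine_similarity(a, b):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def toy_embed(text):
    """Placeholder embedding: character-frequency vector over a-z.
    A real system would call an LLM-based encoder (e.g. Sentence-BERT, BGE)."""
    text = text.lower()
    return [text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

class VectorStore:
    """Minimal in-memory vector store supporting similarity search."""

    def __init__(self, embed):
        self.embed = embed
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        self.items.append((text, self.embed(text)))

    def search(self, query, top_k=3):
        qv = self.embed(query)
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(qv, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:top_k]]
```

Swapping `toy_embed` for a genuine LLM encoder changes nothing structurally: storage and search operate on vectors regardless of how they were produced.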

7.3.3 Model Optimization

Collected and processed data is ultimately used for continuous model optimization, forming a closed-loop iterative improvement process.

7.3.3.1 Supervised Fine-Tuning (SFT)

Supervised fine-tuning involves using cleaned and annotated dialogue data to construct (input, ideal_output) sample pairs for supervised training of pre-trained LLMs. This enables the model to learn specific task patterns and behavioral norms. To enhance model robustness, Adversarial Training can be introduced, injecting noisy data (e.g., typos, semantic shifts) to ensure the model maintains good performance even with imperfect inputs.
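A small sketch of the noise-injection idea: each clean (input, ideal_output) pair is kept, and a typo-corrupted copy of the input is added with the target unchanged, so the model learns to produce the correct output despite imperfect inputs. The function names and corruption scheme are illustrative assumptions:

```python
import random

def inject_typos(text, rate=0.1, seed=0):
    """Randomly substitute a fraction of alphabetic characters,
    simulating typos in user input (a simple adversarial augmentation)."""
    rng = random.Random(seed)
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def augment_sft_pairs(pairs, rate=0.1):
    """Return the original (input, ideal_output) pairs plus noisy copies.
    The target output stays unchanged, so training teaches noise recovery."""
    augmented = []
    for inp, out in pairs:
        augmented.append((inp, out))
        augmented.append((inject_typos(inp, rate), out))
    return augmented
```

Semantic-shift perturbations (paraphrases that subtly alter meaning) would require a generation model rather than character noise, but slot into the same augmentation loop.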

7.3.3.2 Continual Learning

As new interaction data is continuously generated, models need to be able to learn new knowledge while avoiding forgetting old knowledge, addressing the problem of Catastrophic Forgetting.

- Incremental Training: Periodically update model parameters with new data. To mitigate catastrophic forgetting, a Replay Buffer mechanism can be employed to store a representative portion of old data and mix it with new data during training, ensuring the model consolidates learned knowledge while acquiring new information.
- Knowledge Distillation: Utilize a large, high-performing model (teacher model) to guide the learning of a smaller model (student model), or distill model capabilities using high-quality small-scale data, thereby reducing model complexity and computational cost while maintaining performance.
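The replay-buffer mechanism can be sketched as follows. Reservoir sampling is one common way to keep the stored subset representative under a fixed memory budget; the class shape and the 50% replay ratio are illustrative assumptions:

```python
import random

class ReplayBuffer:
    """Fixed-capacity store of old training samples, mixed into each new
    batch to mitigate catastrophic forgetting."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Reservoir sampling: every sample seen so far has an equal
        chance of remaining in the buffer."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = sample

    def mixed_batch(self, new_samples, replay_ratio=0.5):
        """Combine new samples with replayed old ones for one training step."""
        k = min(len(self.buffer), int(len(new_samples) * replay_ratio))
        return list(new_samples) + self.rng.sample(self.buffer, k)
```

Tuning `replay_ratio` trades plasticity (learning the new data) against stability (retaining the old).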

7.3.3.3 Retrieval Augmentation

Retrieval augmentation, especially the RAG architecture, significantly improves the accuracy and timeliness of LLMs by storing historical high-quality dialogues or external knowledge in a retrieval database, enabling real-time retrieval of similar cases to assist generation. This allows the model to generate responses by incorporating the latest and most accurate information, rather than solely relying on its training data.
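As a sketch, retrieval-assisted generation reduces to assembling a prompt from the retrieved cases before calling the LLM. The `retrieve` callable (assumed to return the top-k most similar stored texts, e.g. from a vector database) and the prompt wording are illustrative:

```python
def build_rag_prompt(query, retrieve, top_k=3):
    """Fetch similar historical dialogues or knowledge snippets and
    prepend them as grounding context for the generation model."""
    context = retrieve(query, top_k)
    context_block = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using the retrieved context below; "
        "if it is insufficient, say so.\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}"
    )
```

Because the context is fetched at query time, updating the retrieval database immediately changes what the model can ground on, without any retraining.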