Session 12: Workshop - Document Search and Processing System

Hands-on end-to-end implementation

🎯 Workshop goals

Build a complete semantic search system
Integrate Azure Cognitive Search with document processing
Implement hybrid search (keyword + vector)
Create a user interface for the search system
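
Hybrid search has to merge two ranked lists, one from keyword matching and one from vector similarity. Azure's documentation describes Reciprocal Rank Fusion (RRF) for this step. The sketch below is a simplified illustration of the idea, not the service's implementation; the function name and the constant `k=60` are assumptions chosen for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical result lists from the keyword side and the vector side
keyword_hits = ["doc-a", "doc-b", "doc-c"]
vector_hits = ["doc-b", "doc-d", "doc-a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that rank well in both lists rise to the top, which is why `doc-b` and `doc-a` end up ahead of documents found by only one method.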

🏗️ End-to-end system architecture

Complete search platform

[DOCUMENT UPLOAD] → [PROCESSING PIPELINE] →  [SEARCH INDEX]   → [USER INTERFACE]
        ↓                     ↓                     ↓                   ↓
 [BLOB STORAGE]   → [COGNITIVE SERVICES]  → [VECTOR DATABASE] →   [SEARCH API]
        ↓                     ↓                     ↓                   ↓
    [ARCHIVE]     →     [MONITORING]      →    [ANALYTICS]    →  [ADMIN PANEL]

Key features:

Smart upload - automatic type detection, deduplication
Advanced processing - OCR, semantic analysis, categorization
Smart indexing - semantic embeddings, full-text
Smart search - natural language, vector similarity
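
The upload features above (type detection and deduplication) can be sketched with the standard library alone. This is a minimal illustration under assumptions: the extension table, `detect_format`, and `UploadRegistry` are hypothetical names, and a real pipeline would likely also inspect file contents rather than trust the extension.

```python
import hashlib
from pathlib import Path

# Hypothetical mapping from file extension to a coarse format label
EXTENSION_TYPES = {
    ".pdf": "pdf",
    ".docx": "word",
    ".xlsx": "excel",
    ".pptx": "powerpoint",
}

def detect_format(filename):
    """Return a coarse format label based on the file extension."""
    return EXTENSION_TYPES.get(Path(filename).suffix.lower(), "unknown")

class UploadRegistry:
    """Tracks content hashes so byte-identical files are stored only once."""

    def __init__(self):
        self._seen = {}  # sha256 hex digest -> first filename seen

    def register(self, filename, data):
        """Record an upload and report whether its content was seen before."""
        digest = hashlib.sha256(data).hexdigest()
        original = self._seen.get(digest)
        if original is None:
            self._seen[digest] = filename
        return {"duplicate_of": original, "format": detect_format(filename)}

registry = UploadRegistry()
first = registry.register("Report.PDF", b"same bytes")
second = registry.register("copy.docx", b"same bytes")
```

Hashing the raw bytes only catches exact copies; near-duplicates (for example, the same report exported twice) would need a different technique.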

💻 Hands-on workshop

Implementing the complete system

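import asyncio
from datetime import datetime, timezone

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient  # async client: its calls are awaited below
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import *
from azure.search.documents.models import QueryAnswerType, QueryCaptionType, QueryType, VectorizedQuery
from azure.ai.formrecognizer import DocumentAnalysisClient
from sentence_transformers import SentenceTransformer

class IntelligentDocumentSearchSystem:
    def __init__(self, search_endpoint, search_key, form_recognizer_endpoint,
                 form_recognizer_key, storage_connection):
        # Each service is authenticated with its own API key
        search_credential = AzureKeyCredential(search_key)
        # The query client must target the index that create_search_index() builds
        self.search_client = SearchClient(search_endpoint, "intelligent-documents", search_credential)
        self.index_client = SearchIndexClient(search_endpoint, search_credential)
        self.form_recognizer = DocumentAnalysisClient(
            form_recognizer_endpoint, AzureKeyCredential(form_recognizer_key)
        )
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.storage_connection = storage_connection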
        
    def create_search_index(self):
        """Create the search index, including the vector field."""
        
        fields = [
            SimpleField(name="id", type=SearchFieldDataType.String, key=True),
            SearchableField(name="title", type=SearchFieldDataType.String),
            SearchableField(name="content", type=SearchFieldDataType.String),
            SearchableField(name="summary", type=SearchFieldDataType.String),
            SimpleField(name="document_type", type=SearchFieldDataType.String, filterable=True),
            # filterable so that the UI's "date from" filter can be applied
            SimpleField(name="upload_date", type=SearchFieldDataType.DateTimeOffset,
                        filterable=True, sortable=True),
            SimpleField(name="file_size", type=SearchFieldDataType.Int64),
            # Vector fields are declared with SearchField; SearchableField is for string fields
            SearchField(
                name="contentVector", 
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, 
                vector_search_dimensions=384,  # output size of all-MiniLM-L6-v2
                vector_search_profile_name="myHnswProfile"
            )
        ]
        
        # Vector search configuration
        vector_search = VectorSearch(
            profiles=[VectorSearchProfile(
                name="myHnswProfile",
                algorithm_configuration_name="myHnsw"
            )],
            algorithms=[HnswAlgorithmConfiguration(
                name="myHnsw",
                parameters=HnswParameters(
                    metric=VectorSearchAlgorithmMetric.COSINE,
                    m=4,
                    ef_construction=400,
                    ef_search=500
                )
            )]
        )
        
        # Semantic search configuration
        semantic_search = SemanticSearch(
            configurations=[SemanticConfiguration(
                name="default",
                prioritized_fields=SemanticPrioritizedFields(
                    title_field=SemanticField(field_name="title"),
                    content_fields=[
                        SemanticField(field_name="content"),
                        SemanticField(field_name="summary")
                    ]
                )
            )]
        )
        
        index = SearchIndex(
            name="intelligent-documents",
            fields=fields,
            vector_search=vector_search,
            semantic_search=semantic_search
        )
        
        self.index_client.create_or_update_index(index)
        return index
    
    async def process_and_index_document(self, document_url, metadata=None):
        """Analyze one document, enrich it, and upload it to the index.

        The underscore-prefixed helpers called here are not defined in this listing.
        """
        
        try:
            # Step 1: analyze the document with Form Recognizer
            analysis_result = await self._analyze_document_content(document_url)
            
            # Step 2: extract and clean the text
            clean_content = self._clean_extracted_text(analysis_result["content"])
            
            # Step 3: generate a summary
            summary = await self._generate_summary(clean_content)
            
            # Step 4: classify the document type
            document_type = self._classify_document_type(analysis_result)
            
            # Step 5: generate the embedding. The model truncates long inputs,
            # so a long document is represented by its opening passage only.
            content_embedding = self.embedding_model.encode(clean_content).tolist()
            
            # Step 6: build the document to index; every key must match an index field
            search_document = {
                "id": self._generate_document_id(document_url),
                "title": self._extract_title(analysis_result, document_url),
                "content": clean_content,
                "summary": summary,
                "contentVector": content_embedding,
                "document_type": document_type,
                # DateTimeOffset fields need a timestamp with an explicit UTC offset
                "upload_date": datetime.now(timezone.utc).isoformat(),
                # Character count of the extracted text, a stand-in for the real file size
                "file_size": len(clean_content),
            }
            # `metadata` is not uploaded because the index defines no field for it
            
            # Step 7: upload to the index
            result = await self.search_client.upload_documents([search_document])
            
            return {
                "document_id": search_document["id"],
                "status": "indexed",
                "document_type": document_type,
                "content_length": len(clean_content),
                "indexing_result": result
            }
            
        except Exception as e:
            return {
                "document_url": document_url,
                "status": "error",
                "error": str(e)
            }
    
    async def intelligent_search(self, query, search_options=None):
        """Hybrid search: keyword and vector retrieval with semantic reranking."""
        
        options = search_options or {}
        
        # Embed the query with the same model used at indexing time
        query_vector = self.embedding_model.encode(query).tolist()
        
        # Hybrid query: search_text drives keyword retrieval, vector_queries the vector side.
        # No order_by is passed: with semantic ranking the reranker decides the order.
        search_results = await self.search_client.search(
            search_text=query,
            vector_queries=[VectorizedQuery(
                vector=query_vector,
                k_nearest_neighbors=options.get("k", 10),
                fields="contentVector"
            )],
            query_type=QueryType.SEMANTIC,
            semantic_configuration_name="default",
            query_caption=QueryCaptionType.EXTRACTIVE,
            query_answer=QueryAnswerType.EXTRACTIVE,
            filter=options.get("filter"),
            top=options.get("top", 10),
            include_total_count=True
        )
        
        # Format the results as plain, JSON-serializable values
        formatted_results = []
        async for result in search_results:
            captions = result.get("@search.captions") or []
            formatted_results.append({
                "id": result["id"],
                "title": result["title"],
                "content_preview": result["content"][:200] + "...",
                "document_type": result["document_type"],
                "score": result["@search.score"],
                "highlights": result.get("@search.highlights") or {},
                "captions": [caption.text for caption in captions]
            })
        
        # Semantic answers belong to the response as a whole, not to single results
        answers = await search_results.get_answers() or []
        
        return {
            "query": query,
            "total_results": await search_results.get_count(),
            "results": formatted_results,
            "answers": [answer.text for answer in answers],
            "search_type": "hybrid_semantic"
        }
    
    def create_search_interface(self):
        """Return the HTML for a simple search page."""
        
        interface_code = '''
<!DOCTYPE html>
<html>
<head>
    <title>Intelligent Document Search</title>
    <style>
        body { font-family: Arial, sans-serif; max-width: 1200px; margin: 0 auto; padding: 20px; }
        .search-box { width: 100%; padding: 15px; font-size: 16px; border: 2px solid #ddd; border-radius: 8px; }
        .filters { margin: 20px 0; }
        .filter-group { display: inline-block; margin-right: 20px; }
        .results { margin-top: 30px; }
        .result-item { border: 1px solid #eee; padding: 20px; margin-bottom: 15px; border-radius: 8px; }
        .result-title { font-size: 18px; font-weight: bold; color: #0066cc; }
        .result-preview { margin: 10px 0; color: #555; }
        .result-meta { font-size: 12px; color: #888; }
        .highlight { background-color: yellow; }
        .loading { text-align: center; padding: 50px; }
    </style>
</head>
<body>
    <h1>🔍 Intelligent Document Search</h1>
    
    <div class="search-container">
        <input type="text" class="search-box" id="searchQuery" 
               placeholder="Type a query in natural language..." 
               onkeydown="handleKeyPress(event)">
        
        <div class="filters">
            <div class="filter-group">
                <label>Document type:</label>
                <select id="documentType">
                    <option value="">All</option>
                    <option value="invoice">Invoices</option>
                    <option value="contract">Contracts</option>
                    <option value="report">Reports</option>
                    <option value="manual">Manuals</option>
                </select>
            </div>
            
            <div class="filter-group">
                <label>Date from:</label>
                <input type="date" id="dateFrom">
            </div>
            
            <div class="filter-group">
                <button onclick="performSearch()">🔍 Search</button>
            </div>
        </div>
    </div>
    
    <div id="results" class="results"></div>
    
    <script>
        async function performSearch() {
            const query = document.getElementById('searchQuery').value;
            const documentType = document.getElementById('documentType').value;
            const dateFrom = document.getElementById('dateFrom').value;
            
            if (!query.trim()) return;
            
            document.getElementById('results').innerHTML = '<div class="loading">Searching...</div>';
            
            try {
                const response = await fetch('/api/search', {
                    method: 'POST',
                    headers: {
                        'Content-Type': 'application/json',
                    },
                    body: JSON.stringify({
                        query: query,
                        filters: {
                            document_type: documentType,
                            date_from: dateFrom
                        }
                    })
                });
                
                const results = await response.json();
                displayResults(results);
                
            } catch (error) {
                document.getElementById('results').innerHTML = 
                    '<div style="color: red;">Search error: ' + error.message + '</div>';
            }
        }
        
        // Escape text before it is inserted into innerHTML
        function escapeHtml(value) {
            const div = document.createElement('div');
            div.textContent = String(value ?? "");
            return div.innerHTML;
        }
        
        function displayResults(searchResults) {
            const resultsDiv = document.getElementById('results');
            
            if (!searchResults.results || searchResults.results.length === 0) {
                resultsDiv.innerHTML = '<div>No documents found.</div>';
                return;
            }
            
            let html = `<div><strong>Found ${searchResults.total_results} results</strong></div>`;
            
            searchResults.results.forEach(result => {
                // The search score is a ranking score, not a percentage, so show it as is
                html += `
                    <div class="result-item">
                        <div class="result-title">${escapeHtml(result.title)}</div>
                        <div class="result-preview">${escapeHtml(result.content_preview)}</div>
                        <div class="result-meta">
                            Type: ${escapeHtml(result.document_type)} | 
                            Score: ${Number(result.score).toFixed(3)}
                        </div>
                    </div>
                `;
            });
            
            resultsDiv.innerHTML = html;
        }
        
        function handleKeyPress(event) {
            if (event.key === 'Enter') {
                performSearch();
            }
        }
    </script>
</body>
</html>
        '''
        
        return interface_code
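
The class above calls `_generate_document_id` and `_clean_extracted_text` without showing them. One possible shape for these two helpers is sketched below as standalone functions. The behavior is an assumption, not the workshop's reference solution: the id is a SHA-256 hex digest of the URL (hex characters are valid in an index key), and the cleaning rules are a minimal guess at what OCR output needs.

```python
import hashlib
import re

def generate_document_id(document_url):
    """Derive a stable index key from the document URL.

    The same URL always yields the same key, so re-indexing a document
    overwrites its earlier entry instead of creating a second one.
    """
    return hashlib.sha256(document_url.encode("utf-8")).hexdigest()

def clean_extracted_text(text):
    """Normalize whitespace and rejoin words hyphenated across line breaks."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # "docu-\nment" -> "document"
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces and tabs
    text = re.sub(r"\s*\n\s*", "\n", text)         # trim around breaks, drop empty lines
    return text.strip()

# Hypothetical inputs for illustration
doc_id = generate_document_id("https://example.com/docs/report.pdf")
cleaned = clean_extracted_text("Annual  re-\nport\n\n\n  2023  ")
```

The hyphen rule also merges genuine compounds that happen to break at a line end (for example "end-\nto" becomes "endto"), so it may need a dictionary check in practice.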

🎯 Workshop tasks

Main project: Corporate knowledge base (120 min)

Business scenario: A company has 1000+ documents in various formats (PDF, Word, Excel, PowerPoint) and needs an intelligent search system for its teams.

Requirements:

Automatic processing of new documents
Natural-language search
Classification and categorization
Web interface for end users

Step-by-step implementation:

Step 1: Infrastructure setup (30 min)

# Create the Azure resources.
# Search service and storage account names must be globally unique, so adjust the names below.
az group create --name rg-document-search --location eastus

az search service create \
  --name doc-search-service \
  --resource-group rg-document-search \
  --sku standard

az cognitiveservices account create \
  --name form-recognizer-service \
  --resource-group rg-document-search \
  --kind FormRecognizer \
  --sku S0 \
  --location eastus

az storage account create \
  --name docstorage \
  --resource-group rg-document-search \
  --location eastus \
  --sku Standard_LRS

Step 2: Data preparation (30 min)

Upload sample documents to Blob Storage
Organize them into folders by type
Configure access for the services

Step 3: System implementation (45 min)

Create the search index
Implement document processing
Batch-process the existing documents
Test the search
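
For the batch-processing item, one option is to run the per-document coroutine over all existing documents with a cap on concurrency, so the services are not flooded with requests. The sketch below assumes a coroutine with the same contract as `process_and_index_document`; `process_batch`, `fake_process`, and the limit of 4 are hypothetical.

```python
import asyncio

async def process_batch(urls, process_one, max_concurrency=4):
    """Run process_one over all urls, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:
            try:
                return await process_one(url)
            except Exception as exc:
                # One failing document should not abort the whole batch
                return {"document_url": url, "status": "error", "error": str(exc)}

    # gather returns results in the same order as the input urls
    return await asyncio.gather(*(guarded(url) for url in urls))

# Stand-in for the real per-document coroutine (hypothetical)
async def fake_process(url):
    await asyncio.sleep(0)
    if url.endswith(".bad"):
        raise ValueError("unsupported format")
    return {"document_url": url, "status": "indexed"}

results = asyncio.run(process_batch(["a.pdf", "b.bad", "c.docx"], fake_process))
```

In the workshop code, `process_one` would be the bound method `system.process_and_index_document`.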

Step 4: User interface (15 min)

Deploy the simple web interface
Integrate it with the search API
Test end to end
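
When wiring the page to the search API, the `filters` object sent by the browser (`document_type`, `date_from`) has to become an OData filter string for the search call. A minimal sketch follows; `build_odata_filter` is a hypothetical helper, and it assumes both `document_type` and `upload_date` are marked filterable in the index.

```python
from datetime import date

def build_odata_filter(filters):
    """Translate the UI filter dict into an OData filter string, or None."""
    clauses = []

    document_type = filters.get("document_type")
    if document_type:
        # Single quotes inside OData string literals are escaped by doubling them
        escaped = document_type.replace("'", "''")
        clauses.append(f"document_type eq '{escaped}'")

    date_from = filters.get("date_from")
    if date_from:
        # The date input sends YYYY-MM-DD; parsing rejects anything else (ValueError)
        parsed = date.fromisoformat(date_from)
        clauses.append(f"upload_date ge {parsed.isoformat()}T00:00:00Z")

    return " and ".join(clauses) or None

example = build_odata_filter({"document_type": "invoice", "date_from": "2024-01-15"})
```

The returned string can be passed as `search_options["filter"]` to `intelligent_search`; `None` means no filter is applied.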

Additional tasks

Task 1: Search optimization (30 min)

Tune the vector search parameters
Configure semantic search
A/B test different configurations
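
Comparing configurations needs a yardstick. One common approach is to label a handful of queries with their relevant documents and compute recall@k and mean reciprocal rank (MRR) per configuration. The sketch below assumes such labels exist; the function names and the sample data are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def compare_configs(runs, judgments, k=5):
    """Average both metrics per configuration over all judged queries."""
    report = {}
    query_count = len(judgments) or 1  # avoid dividing by zero on empty input
    for config_name, results_by_query in runs.items():
        recall_sum = sum(
            recall_at_k(results_by_query.get(q, []), rel, k) for q, rel in judgments.items()
        )
        rr_sum = sum(
            reciprocal_rank(results_by_query.get(q, []), rel) for q, rel in judgments.items()
        )
        report[config_name] = {"recall@k": recall_sum / query_count, "mrr": rr_sum / query_count}
    return report

# Hypothetical labels and result lists for two configurations
judgments = {"q1": {"d1", "d2"}, "q2": {"d3"}}
runs = {
    "A": {"q1": ["d1", "d9", "d2"], "q2": ["d8", "d3"]},
    "B": {"q1": ["d9", "d1"], "q2": ["d3"]},
}
report = compare_configs(runs, judgments)
```

Here configuration A finds every relevant document while B misses one, even though both have the same MRR, which shows why it helps to track more than one metric.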

Task 2: Monitoring and analytics (20 min)

Implement query logging
Configure Application Insights
Build a dashboard with usage metrics
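
Query logging can start as one structured log line per search, which a telemetry backend can later ingest. The sketch below uses only Python's standard `logging` module and does not touch Application Insights itself; the logger name, the record fields, and `log_search` are assumptions.

```python
import json
import logging
import time

logger = logging.getLogger("search.queries")

def build_query_record(query, result_count, duration_ms):
    """Assemble the fields that are logged for each search."""
    return {
        "event": "search_query",
        "query": query,
        "result_count": result_count,
        "duration_ms": round(duration_ms, 1),
    }

def log_search(search_fn, query):
    """Run search_fn(query), log one JSON line about it, and return its result."""
    started = time.perf_counter()
    response = search_fn(query)
    duration_ms = (time.perf_counter() - started) * 1000
    record = build_query_record(query, len(response["results"]), duration_ms)
    logger.info(json.dumps(record, ensure_ascii=False))
    return response

# Hypothetical search function standing in for the real one
def fake_search(query):
    return {"results": ["doc-1", "doc-2"]}

response = log_search(fake_search, "invoices from 2023")
```

Queries with zero results and slow queries are usually the first two metrics worth charting from these records.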

Task 3: Scaling and performance (20 min)

Configure auto-scaling
Cache frequent queries
Optimize costs
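
A cache for frequent queries can be as simple as an in-memory map with an expiry time. The sketch below is one possible design under assumptions: `TTLCache` and `cache_key` are hypothetical names, the defaults are arbitrary, and a deployment with several instances would more likely use a shared cache.

```python
import time

class TTLCache:
    """Small in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=300, max_entries=1000, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key, value):
        if key not in self._entries and len(self._entries) >= self._max:
            # Evict the oldest inserted entry (dicts preserve insertion order)
            self._entries.pop(next(iter(self._entries)))
        self._entries[key] = (self._clock() + self._ttl, value)

def cache_key(query, filter_expression=None):
    """Normalize case and whitespace so trivially different queries share a key."""
    return (" ".join(query.lower().split()), filter_expression or "")

cache = TTLCache(ttl_seconds=60)
cache.put(cache_key("  Annual  REPORT "), {"results": ["doc-1"]})
hit = cache.get(cache_key("annual report"))
```

The time-to-live bounds how stale a cached answer can be after new documents are indexed, so it should be chosen with the indexing frequency in mind.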

📊 Project grading criteria

Functionality (40 points)

The system indexes different document types (10 pts)
Hybrid search works correctly (15 pts)
The user interface is functional (15 pts)

Technical quality (30 points)

Code is well organized and documented (10 pts)
Error handling and logging (10 pts)
Performance and optimization (10 pts)

Innovation and additional features (20 points)

Additional filters and search options (5 pts)
Result personalization (5 pts)
Additional integrations (5 pts)
UI/UX improvements (5 pts)

Presentation (10 points)

Demo of the working system (5 pts)
Explanation of the architecture and technical decisions (5 pts)

🏆 Workshop outcome

After completing the workshop, participants will have:

A working search system - a complete implementation with real documents
Practical knowledge - experience with Azure Cognitive Search and Form Recognizer
Ready-to-use code - a template for future projects
A portfolio project - a demonstration of skills for employers