Session 9: Workshop - Implementing a transcription and image recognition system
A multimodal AI system in practice
🎯 Workshop goals
- Implement a complete audio/video transcription system
- Integrate image recognition with speech processing
- Build a multimodal AI application
- Deploy and test in a production-like environment
🛠️ Workshop project
Multimodal content analysis system
Business scenario: A company regularly runs meetings, webinars, and presentations. It needs a system that automatically handles:
- Transcription of audio/video recordings
- Recognition of objects and text in presentation slides
- Generation of summaries and insights
- Search across the content archive
System architecture:
[UPLOAD AUDIO/VIDEO] → [CONTENT ROUTER] → [SPECIALIZED PROCESSORS]
                              ↓
 [SPEECH PROCESSOR] ← [ORCHESTRATOR]  → [VISION PROCESSOR]
          ↓                   ↓                 ↓
  [TEXT ANALYTICS]  ← [RESULT FUSION] → [OCR PROCESSOR]
                              ↓
[STRUCTURED OUTPUT] → [API/UI] → [USER RESULTS]
💻 End-to-end implementation
Main orchestrator class
import asyncio
import tempfile
import uuid
from datetime import datetime
from typing import Dict, List, Optional, Any
import json
class MultimodalContentProcessor:
def __init__(self, config):
self.speech_processor = AzureSpeechProcessor(
config["speech_key"],
config["speech_region"]
)
self.vision_analyzer = AzureVisionAnalyzer(
config["vision_key"],
config["vision_endpoint"]
)
self.processing_stats = {
"total_processed": 0,
"successful": 0,
"failed": 0,
"average_processing_time": 0
}
async def process_multimedia_content(self, content_info: Dict) -> Dict:
"""Główna metoda przetwarzania treści multimodalnych"""
processing_id = str(uuid.uuid4())
start_time = datetime.utcnow()
result = {
"processing_id": processing_id,
"content_type": content_info.get("type", "unknown"),
"start_time": start_time.isoformat(),
"status": "processing",
"results": {},
"metadata": content_info.get("metadata", {})
}
try:
content_type = content_info["type"]
content_path = content_info["path"]
if content_type == "video":
result["results"] = await self._process_video_content(content_path)
elif content_type == "audio":
result["results"] = await self._process_audio_content(content_path)
elif content_type == "image":
result["results"] = await self._process_image_content(content_path)
elif content_type == "document_with_images":
result["results"] = await self._process_document_with_images(content_path)
else:
raise ValueError(f"Unsupported content type: {content_type}")
# Finalize the result
result["status"] = "completed"
result["end_time"] = datetime.utcnow().isoformat()
result["processing_duration"] = (
datetime.utcnow() - start_time
).total_seconds()
# Update statistics
self.processing_stats["total_processed"] += 1
self.processing_stats["successful"] += 1
self._update_average_processing_time(result["processing_duration"])
print(f"✅ Processing completed for {processing_id}")
except Exception as e:
result["status"] = "failed"
result["error"] = str(e)
result["end_time"] = datetime.utcnow().isoformat()
self.processing_stats["total_processed"] += 1
self.processing_stats["failed"] += 1
print(f"❌ Processing failed for {processing_id}: {str(e)}")
return result
async def _process_video_content(self, video_path: str) -> Dict:
"""Przetwarzanie treści wideo"""
results = {
"audio_analysis": {},
"visual_analysis": {},
"synchronized_insights": {}
}
# Step 1: Extract the audio track from the video
audio_path = await self._extract_audio_from_video(video_path)
# Step 2: Transcribe the audio
print("🎙️ Transcribing audio...")
transcription_result = await self._transcribe_audio_file(audio_path)
results["audio_analysis"] = transcription_result
# Step 3: Extract key frames
print("🖼️ Extracting key frames...")
key_frames = await self._extract_key_frames(video_path, frame_count=10)
# Step 4: Analyze each frame
print("👁️ Analyzing frames...")
frame_analyses = []
for i, frame_path in enumerate(key_frames):
frame_analysis = self.vision_analyzer.analyze_image_comprehensive(frame_path)
frame_analysis["timestamp"] = i * (transcription_result.get("duration", 60) / len(key_frames))
frame_analyses.append(frame_analysis)
results["visual_analysis"] = {
"frames_analyzed": len(frame_analyses),
"frame_analyses": frame_analyses
}
# Step 5: Synchronize insights
print("🔗 Synchronizing insights...")
synchronized = await self._synchronize_audio_visual_insights(
transcription_result,
frame_analyses
)
results["synchronized_insights"] = synchronized
return results
async def _transcribe_audio_file(self, audio_path: str) -> Dict:
"""Transkrypcja pliku audio z diaryzacją mówców"""
# Konfiguracja dla batch transcription
transcription_config = {
"audio_path": audio_path,
"language": "pl-PL",
"enable_speaker_diarization": True,
"enable_word_timestamps": True,
"enable_automatic_punctuation": True
}
# Simulated batch transcription (a real system would call the Speech REST API)
# This is a simplified implementation for the workshop
transcription_result = {
"transcript_segments": [
{
"speaker_id": "Speaker_0",
"start_time": 0.0,
"end_time": 15.5,
"text": "Witam wszystkich na dzisiejszej prezentacji o sztucznej inteligencji.",
"confidence": 0.95
},
{
"speaker_id": "Speaker_1",
"start_time": 16.0,
"end_time": 28.3,
"text": "Dziękuję za wprowadzenie. Zacznijmy od podstawowych definicji AI.",
"confidence": 0.92
}
],
"speaker_analysis": {
"total_speakers": 2,
"speaking_time": {
"Speaker_0": 45.2, # seconds
"Speaker_1": 78.8
},
"word_counts": {
"Speaker_0": 156,
"Speaker_1": 267
}
},
"full_transcript": "Witam wszystkich na dzisiejszej prezentacji...",
"duration": 124.0, # seconds
"confidence": 0.94
}
return transcription_result
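# NOTE (workshop sketch): the simulated result above stands in for a real batch
# transcription job. Below is a hedged sketch of submitting such a job to the Azure
# Speech batch transcription REST API (v3.1); it assumes the audio is reachable
# through a URL (e.g. a blob SAS URL) and that the key/region are passed in
# explicitly. Polling the returned job until it finishes is omitted here.
async def _submit_batch_transcription(self, audio_content_url: str,
                                       speech_key: str, speech_region: str) -> Dict:
    """Submit a batch transcription job (sketch; not used by the simulation above)."""
    import aiohttp  # assumed to be available in the workshop environment

    endpoint = (
        f"https://{speech_region}.api.cognitive.microsoft.com"
        "/speechtotext/v3.1/transcriptions"
    )
    payload = {
        "displayName": "workshop-transcription",
        "locale": "pl-PL",
        "contentUrls": [audio_content_url],
        "properties": {
            "diarizationEnabled": True,
            "wordLevelTimestampsEnabled": True,
            "punctuationMode": "DictatedAndAutomatic"
        }
    }
    headers = {
        "Ocp-Apim-Subscription-Key": speech_key,
        "Content-Type": "application/json"
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(endpoint, json=payload, headers=headers) as response:
            response.raise_for_status()
            # The response body contains the job's "self" URL, which should be
            # polled until the transcription status becomes "Succeeded".
            return await response.json()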
async def _extract_key_frames(self, video_path: str, frame_count: int = 10) -> List[str]:
"""Ekstrakcja kluczowych klatek z wideo"""
# W rzeczywistości używałby OpenCV lub FFmpeg
# To jest symulacja dla warsztatów
key_frames = []
for i in range(frame_count):
# Simulate frame extraction
frame_path = f"/tmp/frame_{i:03d}.jpg"
# The actual frame-extraction logic would go here
key_frames.append(frame_path)
print(f"📷 Extracted {len(key_frames)} key frames")
return key_frames
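# NOTE (workshop sketch): a real extractor could grab evenly spaced frames with
# OpenCV, roughly as below. Assumes `opencv-python` is installed; evenly spaced
# frames are treated as "key" frames for simplicity.
async def _extract_key_frames_opencv(self, video_path: str, frame_count: int = 10) -> List[str]:
    """Extract evenly spaced frames from a video using OpenCV (sketch)."""
    import cv2  # assumed: pip install opencv-python

    capture = cv2.VideoCapture(video_path)
    total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_paths = []
    for i in range(frame_count):
        # Jump to an evenly spaced frame index and decode it
        frame_index = int(i * total_frames / max(frame_count, 1))
        capture.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
        success, frame = capture.read()
        if not success:
            continue
        frame_path = f"/tmp/frame_{i:03d}.jpg"
        cv2.imwrite(frame_path, frame)
        frame_paths.append(frame_path)
    capture.release()
    return frame_paths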
async def _synchronize_audio_visual_insights(self, audio_data: Dict,
frame_analyses: List[Dict]) -> Dict:
"""Synchronizacja insights z audio i video"""
synchronized_timeline = []
# For each frame, find the matching audio segment
for frame_analysis in frame_analyses:
timestamp = frame_analysis["timestamp"]
# Find the audio segment covering this timestamp
relevant_audio_segment = None
for segment in audio_data["transcript_segments"]:
if segment["start_time"] <= timestamp <= segment["end_time"]:
relevant_audio_segment = segment
break
# Build the synchronized timeline entry
timeline_entry = {
"timestamp": timestamp,
"visual_content": {
"description": frame_analysis.get("description", {}),
"objects": frame_analysis.get("objects", []),
"text_in_image": frame_analysis.get("extracted_text", "")
},
"audio_content": {
"speaker": relevant_audio_segment["speaker_id"] if relevant_audio_segment else None,
"text": relevant_audio_segment["text"] if relevant_audio_segment else "",
"confidence": relevant_audio_segment["confidence"] if relevant_audio_segment else 0
},
"insights": self._generate_multimodal_insights(
frame_analysis, relevant_audio_segment
)
}
synchronized_timeline.append(timeline_entry)
# Generate the summary
summary = self._generate_content_summary(synchronized_timeline)
return {
"timeline": synchronized_timeline,
"summary": summary,
"key_topics": self._extract_key_topics(audio_data, frame_analyses),
"action_items": self._extract_action_items(audio_data)
}
def _generate_multimodal_insights(self, frame_data: Dict, audio_data: Optional[Dict]) -> Dict:
"""Generowanie insights z danych multimodalnych"""
insights = {
"correlation_strength": 0.5,
"content_alignment": "unknown",
"key_observations": []
}
if not audio_data:
return insights
# Check how well the visual content aligns with the audio
visual_elements = [obj["object"] for obj in frame_data.get("objects", [])]
audio_text = audio_data["text"].lower()
# Look for correlations
correlations = []
for visual_element in visual_elements:
if visual_element.lower() in audio_text:
correlations.append({
"type": "direct_mention",
"visual": visual_element,
"audio_context": audio_text
})
if correlations:
insights["correlation_strength"] = 0.8
insights["content_alignment"] = "high"
insights["key_observations"].append("Strong alignment between visual content and speech")
return insights
def _extract_key_topics(self, audio_data: Dict, frame_analyses: List[Dict]) -> List[str]:
"""Wydobywanie kluczowych tematów z całej treści"""
# Kombinacja tematów z audio i video
topics = set()
# Topics from the transcript
full_transcript = audio_data.get("full_transcript", "")
audio_keywords = self._extract_keywords_from_text(full_transcript)
topics.update(audio_keywords)
# Topics from the visual analysis
for frame in frame_analyses:
visual_tags = [tag["name"] for tag in frame.get("tags", []) if tag["confidence"] > 0.7]
topics.update(visual_tags)
# Simplified: take up to 10 topics (a production version would rank them by frequency)
sorted_topics = list(topics)[:10]  # top 10
return sorted_topics
def _extract_keywords_from_text(self, text: str) -> List[str]:
"""Proste wydobywanie słów kluczowych"""
# W rzeczywistości użyłby Azure AI Language lub NLP library
import re
words = re.findall(r'\b\w+\b', text.lower())
# Remove stop words (simplified Polish list)
stop_words = {"i", "a", "to", "jest", "w", "z", "na", "o", "do", "że", "się"}
keywords = [word for word in words if word not in stop_words and len(word) > 3]
# Count frequencies
from collections import Counter
word_counts = Counter(keywords)
# Return the most frequent words
return [word for word, count in word_counts.most_common(10)]
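The orchestrator above also calls several helpers that are not shown (_extract_audio_from_video, _update_average_processing_time, _generate_content_summary, _extract_action_items, and the dedicated audio/image/document processors). A minimal, hedged sketch of some of them follows; it assumes ffmpeg is available on the PATH and that simple heuristics are good enough for the workshop, so treat these as placeholders rather than the reference implementation.

# Minimal sketches of helpers referenced above (workshop placeholders)
async def _extract_audio_from_video(self, video_path: str) -> str:
    """Extract the audio track to a 16 kHz mono WAV file using ffmpeg (assumed on PATH)."""
    audio_path = f"/tmp/{uuid.uuid4()}.wav"
    process = await asyncio.create_subprocess_exec(
        "ffmpeg", "-y", "-i", video_path,
        "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
        audio_path,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    await process.wait()
    if process.returncode != 0:
        raise RuntimeError(f"ffmpeg failed for {video_path}")
    return audio_path

def _update_average_processing_time(self, duration: float) -> None:
    """Maintain a running average; assumes total_processed was already incremented."""
    stats = self.processing_stats
    processed = stats["total_processed"]
    previous_avg = stats["average_processing_time"]
    stats["average_processing_time"] = (
        (previous_avg * (processed - 1) + duration) / processed
    )

def _extract_action_items(self, audio_data: Dict) -> List[str]:
    """Naive heuristic: sentences containing Polish action-like phrases."""
    action_markers = ["musimy", "trzeba", "proszę", "do zrobienia", "następny krok"]
    sentences = audio_data.get("full_transcript", "").split(".")
    return [
        sentence.strip()
        for sentence in sentences
        if any(marker in sentence.lower() for marker in action_markers)
    ]

def _generate_content_summary(self, timeline: List[Dict]) -> Dict:
    """Summarize the synchronized timeline with simple counts (no LLM involved)."""
    return {
        "timeline_entries": len(timeline),
        "entries_with_speech": sum(1 for entry in timeline if entry["audio_content"]["text"]),
        "entries_with_detected_objects": sum(
            1 for entry in timeline if entry["visual_content"]["objects"]
        ),
    }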
🎯 Hands-on workshop (120 min)
Step-by-step implementation
Step 1: Setup and configuration (20 min)
# Environment configuration
workshop_config = {
"speech_key": "YOUR_SPEECH_KEY",
"speech_region": "eastus",
"vision_key": "YOUR_VISION_KEY",
"vision_endpoint": "https://your-vision.cognitiveservices.azure.com/",
"storage_connection": "YOUR_STORAGE_CONNECTION"
}
# Initialize the system
processor = MultimodalContentProcessor(workshop_config)
# Connectivity tests
async def test_services():
"""Test dostępności wszystkich usług"""
results = {}
# Test Speech Service
try:
test_audio = "Hello, this is a test"
audio_data = processor.speech_processor.text_to_speech(test_audio)
results["speech_service"] = "✅ Connected"
except Exception as e:
results["speech_service"] = f"❌ Error: {str(e)}"
# Test Vision Service
try:
# Test with a sample image
test_url = "https://example.com/test-image.jpg"
analysis = processor.vision_analyzer.analyze_image_comprehensive(test_url)
results["vision_service"] = "✅ Connected"
except Exception as e:
results["vision_service"] = f"❌ Error: {str(e)}"
return results
# Run the tests (top-level await assumes a notebook; in a plain script use asyncio.run(test_services()))
test_results = await test_services()
print("🔧 Service connectivity test:")
for service, status in test_results.items():
print(f" {service}: {status}")
Step 2: Processing sample files (40 min)
# Sample test files
test_files = [
{
"type": "video",
"path": "sample_presentation.mp4",
"metadata": {
"title": "AI in Business Presentation",
"duration_estimate": "15 minutes",
"speaker": "John Doe"
}
},
{
"type": "audio",
"path": "interview_recording.wav",
"metadata": {
"title": "Technical Interview",
"participants": ["Interviewer", "Candidate"],
"topic": "Machine Learning Engineer"
}
},
{
"type": "image",
"path": "technical_diagram.png",
"metadata": {
"title": "System Architecture Diagram",
"source": "Technical Documentation"
}
}
]
# Process each file (again assumes an async/notebook context for top-level await)
processing_results = []
for file_info in test_files:
print(f"\n🔄 Processing: {file_info['metadata']['title']}")
result = await processor.process_multimedia_content(file_info)
processing_results.append(result)
# Display key results
if result["status"] == "completed":
print(f"✅ Completed in {result['processing_duration']:.1f}s")
# Show key insights
if "synchronized_insights" in result["results"]:
insights = result["results"]["synchronized_insights"]
print(f"📊 Key topics: {', '.join(insights.get('key_topics', [])[:3])}")
print(f"🎯 Action items: {len(insights.get('action_items', []))}")
else:
print(f"❌ Failed: {result.get('error', 'Unknown error')}")
Step 3: Integration and API (30 min)
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
import aiofiles
import os
from datetime import datetime
app = FastAPI(title="Multimodal Content Processor API")
# Global processor instance
global_processor = None
@app.on_event("startup")
async def startup_event():
"""Inicjalizacja przy starcie aplikacji"""
global global_processor
config = {
"speech_key": os.getenv("AZURE_SPEECH_KEY"),
"speech_region": os.getenv("AZURE_SPEECH_REGION"),
"vision_key": os.getenv("AZURE_VISION_KEY"),
"vision_endpoint": os.getenv("AZURE_VISION_ENDPOINT")
}
global_processor = MultimodalContentProcessor(config)
print("🚀 Multimodal processor initialized")
@app.post("/process-content/")
async def process_uploaded_content(
file: UploadFile = File(...),
content_type: str = "auto-detect"
):
"""Endpoint dla przesyłania i przetwarzania treści"""
if not global_processor:
raise HTTPException(status_code=500, detail="Processor not initialized")
try:
# Save the uploaded file
temp_file_path = f"/tmp/{file.filename}"
async with aiofiles.open(temp_file_path, 'wb') as f:
content = await file.read()
await f.write(content)
# Detect the content type if auto-detect was requested
if content_type == "auto-detect":
content_type = detect_content_type(file.filename)
# Prepare the content info for the processor
content_info = {
"type": content_type,
"path": temp_file_path,
"metadata": {
"filename": file.filename,
"size": len(content),
"upload_time": datetime.utcnow().isoformat()
}
}
# Process the content
result = await global_processor.process_multimedia_content(content_info)
# Cleanup
os.remove(temp_file_path)
return JSONResponse(content=result)
except Exception as e:
raise HTTPException(status_code=500, detail=f"Processing failed: {str(e)}")
@app.get("/processing-stats/")
async def get_processing_stats():
"""Endpoint dla statystyk przetwarzania"""
if not global_processor:
raise HTTPException(status_code=500, detail="Processor not initialized")
stats = global_processor.processing_stats.copy()
# Add derived metrics
if stats["total_processed"] > 0:
stats["success_rate"] = (stats["successful"] / stats["total_processed"]) * 100
stats["failure_rate"] = (stats["failed"] / stats["total_processed"]) * 100
return JSONResponse(content=stats)
def detect_content_type(filename: str) -> str:
"""Auto-detection typu treści na podstawie rozszerzenia"""
extension = filename.lower().split('.')[-1]
video_extensions = ["mp4", "avi", "mov", "mkv"]
audio_extensions = ["wav", "mp3", "m4a", "flac"]
image_extensions = ["jpg", "jpeg", "png", "bmp", "gif"]
if extension in video_extensions:
return "video"
elif extension in audio_extensions:
return "audio"
elif extension in image_extensions:
return "image"
else:
return "unknown"
# Start the server
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
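With the server running, the endpoints can be exercised from a small client script. A sketch assuming the API listens on localhost:8000 and that sample_presentation.mp4 exists locally:

import requests

# Upload a local file to the processing endpoint
with open("sample_presentation.mp4", "rb") as media_file:
    response = requests.post(
        "http://localhost:8000/process-content/",
        params={"content_type": "auto-detect"},
        files={"file": ("sample_presentation.mp4", media_file, "video/mp4")},
        timeout=600,  # multimodal processing can take a while
    )

response.raise_for_status()
result = response.json()
print(result["status"], result.get("processing_duration"))

# Check aggregate processing statistics
stats = requests.get("http://localhost:8000/processing-stats/", timeout=30).json()
print(stats)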
Step 4: Frontend interface (20 min)
<!DOCTYPE html>
<html>
<head>
<title>Multimodal Content Processor</title>
<meta charset="utf-8">
<style>
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
max-width: 1200px;
margin: 0 auto;
padding: 20px;
background-color: #f5f5f5;
}
.container {
background: white;
border-radius: 12px;
padding: 30px;
box-shadow: 0 4px 6px rgba(0,0,0,0.1);
}
.upload-area {
border: 2px dashed #e2e8f0;
border-radius: 8px;
padding: 40px;
text-align: center;
margin: 20px 0;
transition: all 0.3s ease;
}
.upload-area:hover {
border-color: #3b82f6;
background-color: #f8fafc;
}
.upload-area.dragging {
border-color: #10b981;
background-color: #f0fdf4;
}
.results-section {
margin-top: 30px;
padding: 20px;
background-color: #f8fafc;
border-radius: 8px;
border-left: 4px solid #3b82f6;
}
.processing-indicator {
display: none;
text-align: center;
padding: 20px;
}
.spinner {
border: 3px solid #f3f3f3;
border-top: 3px solid #3b82f6;
border-radius: 50%;
width: 30px;
height: 30px;
animation: spin 1s linear infinite;
margin: 0 auto;
}
@keyframes spin {
0% { transform: rotate(0deg); }
100% { transform: rotate(360deg); }
}
</style>
</head>
<body>
<div class="container">
<h1>🎭 Multimodal Content Processor</h1>
<p>Upload audio, video, or images for AI-powered analysis</p>
<div class="upload-area" id="uploadArea">
<p>📁 Drag & drop files here or <strong>click to browse</strong></p>
<p style="color: #6b7280; font-size: 14px;">
Supported: MP4, MP3, WAV, JPG, PNG (max 50MB)
</p>
<input type="file" id="fileInput" style="display: none;"
accept=".mp4,.mp3,.wav,.jpg,.jpeg,.png">
</div>
<div class="processing-indicator" id="processingIndicator">
<div class="spinner"></div>
<p>Processing your content...</p>
</div>
<div class="results-section" id="resultsSection" style="display: none;">
<h3>📊 Analysis Results</h3>
<div id="resultsContent"></div>
</div>
</div>
<script>
const uploadArea = document.getElementById('uploadArea');
const fileInput = document.getElementById('fileInput');
const processingIndicator = document.getElementById('processingIndicator');
const resultsSection = document.getElementById('resultsSection');
const resultsContent = document.getElementById('resultsContent');
// Upload area click handler
uploadArea.addEventListener('click', () => fileInput.click());
// File selection handler
fileInput.addEventListener('change', handleFileSelection);
// Drag and drop handlers
uploadArea.addEventListener('dragover', handleDragOver);
uploadArea.addEventListener('dragleave', handleDragLeave);
uploadArea.addEventListener('drop', handleDrop);
function handleDragOver(e) {
e.preventDefault();
uploadArea.classList.add('dragging');
}
function handleDragLeave(e) {
e.preventDefault();
uploadArea.classList.remove('dragging');
}
function handleDrop(e) {
e.preventDefault();
uploadArea.classList.remove('dragging');
const files = e.dataTransfer.files;
if (files.length > 0) {
processFile(files[0]);
}
}
function handleFileSelection(e) {
const file = e.target.files[0];
if (file) {
processFile(file);
}
}
async function processFile(file) {
// Show processing indicator
processingIndicator.style.display = 'block';
resultsSection.style.display = 'none';
const formData = new FormData();
formData.append('file', file);
formData.append('content_type', 'auto-detect');
try {
const response = await fetch('/process-content/', {
method: 'POST',
body: formData
});
const result = await response.json();
if (response.ok) {
displayResults(result);
} else {
displayError(result.detail || 'Processing failed');
}
} catch (error) {
displayError('Upload failed: ' + error.message);
} finally {
processingIndicator.style.display = 'none';
}
}
function displayResults(result) {
let html = `<h4>✅ Processing Completed</h4>`;
html += `<p><strong>Processing ID:</strong> ${result.processing_id}</p>`;
html += `<p><strong>Duration:</strong> ${result.processing_duration.toFixed(1)}s</p>`;
if (result.results.audio_analysis) {
html += `<h5>🎙️ Audio Analysis</h5>`;
const audio = result.results.audio_analysis;
html += `<p><strong>Duration:</strong> ${audio.duration}s</p>`;
html += `<p><strong>Speakers:</strong> ${audio.speaker_analysis.total_speakers}</p>`;
html += `<p><strong>Transcript:</strong> ${audio.full_transcript.substring(0, 200)}...</p>`;
}
if (result.results.visual_analysis) {
html += `<h5>👁️ Visual Analysis</h5>`;
const visual = result.results.visual_analysis;
html += `<p><strong>Frames analyzed:</strong> ${visual.frames_analyzed}</p>`;
}
if (result.results.synchronized_insights) {
html += `<h5>🔗 Key Insights</h5>`;
const insights = result.results.synchronized_insights;
html += `<p><strong>Key topics:</strong> ${insights.key_topics.join(', ')}</p>`;
html += `<p><strong>Action items:</strong> ${insights.action_items.length}</p>`;
}
resultsContent.innerHTML = html;
resultsSection.style.display = 'block';
}
function displayError(errorMessage) {
resultsContent.innerHTML = `<p style="color: red;">❌ Error: ${errorMessage}</p>`;
resultsSection.style.display = 'block';
}
</script>
</body>
</html>
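One way to serve this page from the same FastAPI application (assuming the HTML above is saved as static/index.html) is sketched below; keeping the UI and API on the same origin also avoids CORS configuration for the fetch('/process-content/') call.

from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles

# Serve static assets; assumes the HTML above is saved as static/index.html
app.mount("/static", StaticFiles(directory="static"), name="static")

@app.get("/")
async def serve_frontend():
    """Return the upload UI from the same origin as the API."""
    return FileResponse("static/index.html")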
Step 5: Testing and deployment (30 min)
class SystemTester:
def __init__(self, processor):
self.processor = processor
async def run_comprehensive_tests(self):
"""Kompleksowe testowanie systemu"""
test_scenarios = [
{
"name": "Single speaker audio",
"file": "single_speaker_test.wav",
"expected_speakers": 1,
"min_confidence": 0.8
},
{
"name": "Multi-speaker conversation",
"file": "conversation_test.wav",
"expected_speakers": 2,
"min_confidence": 0.7
},
{
"name": "Presentation with slides",
"file": "presentation_test.mp4",
"expected_visual_elements": ["text", "diagrams"],
"expected_speakers": 1
},
{
"name": "Technical diagram",
"file": "diagram_test.png",
"expected_text_extraction": True,
"min_ocr_confidence": 0.8
}
]
test_results = []
for scenario in test_scenarios:
print(f"\n🧪 Testing: {scenario['name']}")
try:
content_info = {
"type": self._detect_type_from_filename(scenario["file"]),
"path": f"test_data/{scenario['file']}",
"metadata": {"test_scenario": scenario["name"]}
}
result = await self.processor.process_multimedia_content(content_info)
# Validate the results
validation = self._validate_test_result(result, scenario)
test_results.append({
"scenario": scenario["name"],
"status": "passed" if validation["passed"] else "failed",
"details": validation,
"processing_time": result.get("processing_duration", 0)
})
status_emoji = "✅" if validation["passed"] else "❌"
print(f"{status_emoji} {scenario['name']}: {validation['summary']}")
except Exception as e:
test_results.append({
"scenario": scenario["name"],
"status": "error",
"error": str(e)
})
print(f"❌ {scenario['name']}: Error - {str(e)}")
# Test summary
passed_tests = len([t for t in test_results if t["status"] == "passed"])
total_tests = len(test_results)
print(f"\n📊 Test Summary: {passed_tests}/{total_tests} tests passed")
return test_results
def _validate_test_result(self, result, scenario):
"""Walidacja wyników testu"""
validation = {"passed": True, "issues": [], "summary": ""}
if result["status"] != "completed":
validation["passed"] = False
validation["issues"].append("Processing failed")
validation["summary"] = "Processing failed"
return validation
# Check scenario-specific requirements
if "expected_speakers" in scenario:
audio_analysis = result["results"].get("audio_analysis", {})
speaker_count = audio_analysis.get("speaker_analysis", {}).get("total_speakers", 0)
if speaker_count != scenario["expected_speakers"]:
validation["passed"] = False
validation["issues"].append(f"Expected {scenario['expected_speakers']} speakers, got {speaker_count}")
if "min_confidence" in scenario:
confidence = result["results"].get("audio_analysis", {}).get("confidence", 0)
if confidence < scenario["min_confidence"]:
validation["passed"] = False
validation["issues"].append(f"Confidence {confidence:.2f} below threshold {scenario['min_confidence']}")
# Set summary
if validation["passed"]:
validation["summary"] = "All requirements met"
else:
validation["summary"] = f"{len(validation['issues'])} issues found"
return validation
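SystemTester references a _detect_type_from_filename helper that is not shown above. A minimal sketch that reuses the API module's detect_content_type, plus one way to run the suite (assuming the processor instance from Step 1 is available):

import asyncio

# The helper referenced above but not defined; reuse the API's extension rules
def _detect_type_from_filename(self, filename: str) -> str:
    """Map a test filename to a content type using the same rules as detect_content_type."""
    return detect_content_type(filename)

# Attach the helper so the sketch is runnable as-is
SystemTester._detect_type_from_filename = _detect_type_from_filename

async def run_tests(processor):
    """Run the full test suite against an initialized processor."""
    tester = SystemTester(processor)
    return await tester.run_comprehensive_tests()

# test_report = asyncio.run(run_tests(processor))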
✅ Workshop assignments
Main assignment: Multimodal system (90 min)
- Environment setup (20 min) - configure the Azure services
- Core implementation (40 min) - processor classes and API
- Integration testing (20 min) - testing with sample files
- UI development (10 min) - simple web interface
Additional assignments
Assignment 1: Performance optimization (20 min)
- Parallel processing of multiple files (see the sketch after this list)
- A caching mechanism for results
- Async optimization
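A possible starting point for the parallel-processing task is sketched below; it bounds concurrency with a semaphore so several files can be processed at once without overloading the Azure services (the limit of 3 is an arbitrary assumption):

import asyncio

async def process_many(processor, files, max_concurrent: int = 3):
    """Process several files concurrently, bounding concurrency with a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(file_info):
        async with semaphore:
            return await processor.process_multimedia_content(file_info)

    # gather preserves the input order of the results
    return await asyncio.gather(*(process_one(f) for f in files))

# results = asyncio.run(process_many(processor, test_files))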
Assignment 2: Advanced features (25 min)
- Real-time processing capabilities
- Custom model integration
- Multi-language support
Assignment 3: Production deployment (15 min)
- Docker containerization
- Azure deployment configuration
- Monitoring and alerting setup
📊 Evaluation criteria
Technical implementation (60 points)
- Working transcription system (20 pts)
- Integration with vision services (20 pts)
- API and interface (20 pts)
Code quality (20 points)
- Error handling and logging (10 pts)
- Documentation and code structure (10 pts)
Innovation (20 points)
- Additional features (10 pts)
- UI/UX improvements (10 pts)
🏆 Workshop outcomes
After completing the workshop, participants will have:
- A working multimodal system - a complete implementation
- Practical experience with Azure Cognitive Services
- Production-ready code - ready for deployment
- A portfolio project - a demonstration of their skills