feat(providers): Add vision and document processing capabilities to providers

2025-02-03 15:26:00 +01:00
parent e82c510094
commit eda8ce36df
9 changed files with 212 additions and 6 deletions


@@ -17,8 +17,8 @@ This command installs the package and adds it to your project's dependencies.
@push.rocks/smartai supports multiple AI providers, each with its own unique capabilities:
### OpenAI
- - Models: GPT-4, GPT-3.5-turbo
- - Features: Chat, Streaming, Audio Generation
+ - Models: GPT-4, GPT-3.5-turbo, GPT-4-vision-preview
+ - Features: Chat, Streaming, Audio Generation, Vision, Document Processing
- Configuration:
```typescript
openaiToken: 'your-openai-token'
```
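As a minimal setup sketch (the `SmartAi` class name and the `start()` lifecycle call are assumptions inferred from the `smartAi` instances used in the examples below):
```typescript
import { SmartAi } from '@push.rocks/smartai'; // assumed export name

const smartAi = new SmartAi({
  openaiToken: 'your-openai-token'
});

await smartAi.start(); // assumed initialization step
```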
@@ -49,12 +49,13 @@ This command installs the package and adds it to your project's dependencies.
### Ollama
- - Models: Configurable (default: llama2)
- - Features: Chat, Streaming
+ - Models: Configurable (default: llama2; llava for vision/document tasks)
+ - Features: Chat, Streaming, Vision, Document Processing
- Configuration:
```typescript
baseUrl: 'http://localhost:11434' // Optional
model: 'llama2' // Optional
visionModel: 'llava' // Optional, for vision and document tasks
```
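And a corresponding sketch for a local Ollama instance (how these options nest in the constructor is an assumption, mirroring the OpenAI sketch above):
```typescript
import { SmartAi } from '@push.rocks/smartai'; // assumed export name

const smartAi = new SmartAi({
  ollama: {
    // hypothetical nesting of the options listed above
    baseUrl: 'http://localhost:11434',
    model: 'llama2',
    visionModel: 'llava' // used for vision and document tasks
  }
});
```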
## Usage
@@ -147,15 +148,47 @@ const audioStream = await smartAi.openaiProvider.audio({
### Document Processing
- For providers that support document processing (currently OpenAI):
+ For providers that support document processing (OpenAI and Ollama):
```typescript
// Using OpenAI
const result = await smartAi.openaiProvider.document({
  systemMessage: 'Classify the document type',
  userMessage: 'What type of document is this?',
  messageHistory: [],
  pdfDocuments: [pdfBuffer] // Uint8Array of PDF content
});

// Using Ollama with llava
const analysis = await smartAi.ollamaProvider.document({
  systemMessage: 'You are a document analysis assistant',
  userMessage: 'Extract the key information from this document',
  messageHistory: [],
  pdfDocuments: [pdfBuffer] // Uint8Array of PDF content
});
```
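One way to produce the `pdfBuffer` passed above, using Node's standard `fs` module (the file path is illustrative):
```typescript
import { promises as fs } from 'fs';

// fs.readFile returns a Buffer, which is a Uint8Array subclass,
// so its bytes can be passed directly in pdfDocuments.
const pdfBuffer = new Uint8Array(await fs.readFile('./contract.pdf'));
```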
Both providers will:

1. Convert PDF documents to images
2. Process each page using their vision models
3. Return a comprehensive analysis based on the system message and user query (sketched below)
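A minimal sketch of that flow, with `convertPdfToImages` as a hypothetical stand-in for the provider's internal PDF rasterizer (smartai performs all of this for you; this is illustration only, not part of the public API):
```typescript
// Illustration of the three steps above.
type VisionFn = (opts: { image: Buffer; prompt: string }) => Promise<string>;

async function analyzeDocument(
  pdf: Uint8Array,
  vision: VisionFn, // e.g. a provider's vision() method
  convertPdfToImages: (pdf: Uint8Array) => Promise<Buffer[]>, // hypothetical rasterizer
  userMessage: string
): Promise<string> {
  // 1. Convert the PDF into one image per page
  const pages = await convertPdfToImages(pdf);

  // 2. Run the vision model over each page
  const perPage: string[] = [];
  for (const [index, image] of pages.entries()) {
    perPage.push(await vision({ image, prompt: `Page ${index + 1}: ${userMessage}` }));
  }

  // 3. Combine the per-page results into one analysis
  return perPage.join('\n\n');
}
```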
### Vision Processing
For providers that support vision tasks (OpenAI and Ollama):
```typescript
// Using OpenAI's GPT-4 Vision
const description = await smartAi.openaiProvider.vision({
  image: imageBuffer, // Buffer containing the image data
  prompt: 'What do you see in this image?'
});

// Using Ollama's Llava model
const analysis = await smartAi.ollamaProvider.vision({
  image: imageBuffer,
  prompt: 'Analyze this image in detail'
});
```
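Because both providers expose the same `vision()` call shape, they can be run side by side to compare outputs (the concurrency here is plain `Promise.all`, not a smartai feature):
```typescript
// Run both vision providers on the same image concurrently.
const [openaiView, ollamaView] = await Promise.all([
  smartAi.openaiProvider.vision({ image: imageBuffer, prompt: 'Describe this image' }),
  smartAi.ollamaProvider.vision({ image: imageBuffer, prompt: 'Describe this image' })
]);
```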
## Error Handling