feat(providers): Add vision and document processing capabilities to providers

2025-02-03 15:26:00 +01:00
parent e82c510094
commit eda8ce36df
9 changed files with 212 additions and 6 deletions


@@ -17,8 +17,8 @@ This command installs the package and adds it to your project's dependencies.
@push.rocks/smartai supports multiple AI providers, each with its own unique capabilities:
### OpenAI
- - Models: GPT-4, GPT-3.5-turbo
- - Features: Chat, Streaming, Audio Generation
+ - Models: GPT-4, GPT-3.5-turbo, GPT-4-vision-preview
+ - Features: Chat, Streaming, Audio Generation, Vision, Document Processing
- Configuration:
```typescript
openaiToken: 'your-openai-token'
```
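As a minimal setup sketch (the `SmartAi` class name and the `start()` lifecycle call are assumptions inferred from the `smartAi` instances used in the examples below):
```typescript
import { SmartAi } from '@push.rocks/smartai'; // assumed export name

const smartAi = new SmartAi({
  openaiToken: 'your-openai-token'
});

await smartAi.start(); // assumed initialization step
```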
@@ -49,12 +49,13 @@ This command installs the package and adds it to your project's dependencies.
### Ollama
- - Models: Configurable (default: llama2)
- - Features: Chat, Streaming
+ - Models: Configurable (default: llama2; llava for vision/document tasks)
+ - Features: Chat, Streaming, Vision, Document Processing
- Configuration:
```typescript
baseUrl: 'http://localhost:11434' // Optional
model: 'llama2' // Optional
visionModel: 'llava' // Optional, for vision and document tasks
```
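And a corresponding sketch for a local Ollama instance (how these options nest in the constructor is an assumption, mirroring the OpenAI sketch above):
```typescript
import { SmartAi } from '@push.rocks/smartai'; // assumed export name

const smartAi = new SmartAi({
  ollama: {
    // hypothetical nesting of the options listed above
    baseUrl: 'http://localhost:11434',
    model: 'llama2',
    visionModel: 'llava' // used for vision and document tasks
  }
});
```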
## Usage
@@ -147,15 +148,47 @@ const audioStream = await smartAi.openaiProvider.audio({
### Document Processing
- For providers that support document processing (currently OpenAI):
+ For providers that support document processing (OpenAI and Ollama):
```typescript
// Using OpenAI
const result = await smartAi.openaiProvider.document({
  systemMessage: 'Classify the document type',
  userMessage: 'What type of document is this?',
  messageHistory: [],
  pdfDocuments: [pdfBuffer] // Uint8Array of PDF content
});

// Using Ollama with llava
const analysis = await smartAi.ollamaProvider.document({
  systemMessage: 'You are a document analysis assistant',
  userMessage: 'Extract the key information from this document',
  messageHistory: [],
  pdfDocuments: [pdfBuffer] // Uint8Array of PDF content
});
```
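One way to produce the `pdfBuffer` passed above, using Node's standard `fs` module (the file path is illustrative):
```typescript
import { promises as fs } from 'fs';

// fs.readFile returns a Buffer, which is a Uint8Array subclass,
// so its bytes can be passed directly in pdfDocuments.
const pdfBuffer = new Uint8Array(await fs.readFile('./contract.pdf'));
```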
Both providers will:

1. Convert PDF documents to images
2. Process each page using their vision models
3. Return a comprehensive analysis based on the system message and user query (sketched below)
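A minimal sketch of that flow, with `convertPdfToImages` as a hypothetical stand-in for the provider's internal PDF rasterizer (smartai performs all of this for you; this is illustration only, not part of the public API):
```typescript
// Illustration of the three steps above.
type VisionFn = (opts: { image: Buffer; prompt: string }) => Promise<string>;

async function analyzeDocument(
  pdf: Uint8Array,
  vision: VisionFn, // e.g. a provider's vision() method
  convertPdfToImages: (pdf: Uint8Array) => Promise<Buffer[]>, // hypothetical rasterizer
  userMessage: string
): Promise<string> {
  // 1. Convert the PDF into one image per page
  const pages = await convertPdfToImages(pdf);

  // 2. Run the vision model over each page
  const perPage: string[] = [];
  for (const [index, image] of pages.entries()) {
    perPage.push(await vision({ image, prompt: `Page ${index + 1}: ${userMessage}` }));
  }

  // 3. Combine the per-page results into one analysis
  return perPage.join('\n\n');
}
```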
### Vision Processing
For providers that support vision tasks (OpenAI and Ollama):
```typescript
// Using OpenAI's GPT-4 Vision
const description = await smartAi.openaiProvider.vision({
  image: imageBuffer, // Buffer containing the image data
  prompt: 'What do you see in this image?'
});

// Using Ollama's Llava model
const analysis = await smartAi.ollamaProvider.vision({
  image: imageBuffer,
  prompt: 'Analyze this image in detail'
});
```
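Because both providers expose the same `vision()` call shape, they can be run side by side to compare outputs (the concurrency here is plain `Promise.all`, not a smartai feature):
```typescript
// Run both vision providers on the same image concurrently.
const [openaiView, ollamaView] = await Promise.all([
  smartAi.openaiProvider.vision({ image: imageBuffer, prompt: 'Describe this image' }),
  smartAi.ollamaProvider.vision({ image: imageBuffer, prompt: 'Describe this image' })
]);
```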
## Error Handling