Integrating multiple modalities such as text, images, audio and video has become increasingly important for creating sophisticated and engaging AI applications. LangChain and Google’s Gemini API work well together here, giving developers a toolkit for building advanced multimodal AI solutions.
What Are LangChain and Google’s Gemini API?
LangChain: A resilient framework for building AI applications
LangChain is a robust and flexible framework that can simplify the development of AI applications. It provides a modular and composable approach, allowing technologists to combine various tools, such as language models, knowledge bases and data sources, to create complex AI systems. With LangChain, developers can leverage state-of-the-art natural language processing (NLP) models, integrate external data sources and build custom agents tailored to specific use cases.
Google’s Gemini API: Unleashing the potential of multimodal AI
Google’s Gemini API is a cutting-edge multimodal AI platform that enables developers to build applications that can understand and process multiple modalities simultaneously. This API uses Google’s advanced machine learning models and computer vision capabilities to analyze and interpret text, images, audio and video data. With Gemini, developers can create intelligent applications that can perceive and comprehend the world in a more human-like manner.
To use LangChain with Google’s Gemini API in Python, you install a few packages, set up an API key from Google AI Studio and then call the Gemini model that fits your task.
The following guide is designed to help you take advantage of the multimodal functionalities of these tools, enabling effective text generation and comprehensive image analysis, with detailed code snippets to offer both a theoretical understanding and practical experience.
Setup and Installation
To ensure your Python environment is prepared for working with LangChain and Google’s Gemini, install the necessary packages using pip:
pip install -q langchain-google-genai
pip install --upgrade -q langchain-google-genai
pip show langchain-google-genai
pip install -q google-generativeai
These commands handle installing and upgrading the LangChain package tailored for Google’s Gemini and the Gemini API client library.
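The check done by pip show can also be run from inside Python, which is convenient in a notebook. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and introduced here for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on the two packages installed by the pip commands.
for name in ('langchain-google-genai', 'google-generativeai'):
    version = installed_version(name)
    print(f'{name}: {version or "not installed"}')
```

If either line reports "not installed", rerun the matching pip command in the same environment your script or notebook uses.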
Configuration
To use Google’s Gemini API, you need an API key. Store this key in a .env file as GOOGLE_API_KEY for security and easy access, then load it with the python-dotenv package (pip install python-dotenv):
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)
If the API key is not set in your environment variables, the script below will prompt you to enter it manually:
import getpass
import os
# Prompt for the key only when it is not already in the environment
if 'GOOGLE_API_KEY' not in os.environ:
    os.environ['GOOGLE_API_KEY'] = getpass.getpass('Provide your Google API Key: ')
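In scripts that run unattended, an interactive prompt is not an option, so it can help to fail early with a clear message instead. The sketch below is one way to do that, plus a small helper for logging the key without exposing it. Both function names (get_api_key, mask) are hypothetical, and the sample key is a made-up placeholder.

```python
import os


def get_api_key(env=None, name='GOOGLE_API_KEY'):
    """Return the key from the given mapping, raising a clear error if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get(name, '').strip()
    if not key:
        raise RuntimeError(f'{name} is not set; add it to your .env file or export it.')
    return key


def mask(key, visible=4):
    """Hide all but the last few characters so the key can be logged safely."""
    if len(key) <= visible:
        return '*' * len(key)
    return '*' * (len(key) - visible) + key[-visible:]


# Demonstration with a placeholder mapping instead of the real environment.
demo_env = {'GOOGLE_API_KEY': 'example-key-1234'}
print(mask(get_api_key(demo_env)))
```

Calling get_api_key() with no arguments reads from os.environ, so it slots in after load_dotenv has run.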
Exploring Available Models
Before diving into specific functionalities, it’s useful to know which models are available:
import google.generativeai as genai
for model in genai.list_models():
    print(model.name)
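This snippet lists all models accessible through the Gemini API so you can choose the appropriate one for your task. Available model names change over time, so check this list before hard-coding a name such as 'gemini-pro'.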
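Not every listed model generates text; some only produce embeddings. Each model object returned by the client carries a supported_generation_methods list, so a filter along the following lines can narrow the output. This is a sketch: the helper name content_models is hypothetical, and the sample objects are offline stand-ins rather than real API output.

```python
from types import SimpleNamespace


def content_models(models):
    """Return names of models that list 'generateContent' among their supported methods."""
    return [
        m.name
        for m in models
        if 'generateContent' in getattr(m, 'supported_generation_methods', [])
    ]


# Stand-in objects so the sketch runs offline; with the real client you would
# pass genai.list_models() instead.
sample = [
    SimpleNamespace(
        name='models/gemini-pro',
        supported_generation_methods=['generateContent', 'countTokens'],
    ),
    SimpleNamespace(
        name='models/embedding-001',
        supported_generation_methods=['embedContent'],
    ),
]
print(content_models(sample))
```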
Integrating Gemini With LangChain
Basic Setup
LangChain simplifies the interaction with Gemini models. Here’s how to set up a basic chat interface:
from langchain_google_genai import ChatGoogleGenerativeAI
# Create an instance of the LLM, using the 'gemini-pro' model with a specified creativity level
llm = ChatGoogleGenerativeAI(model='gemini-pro', temperature=0.9)
# Send a creative prompt to the LLM
response = llm.invoke('Write a paragraph about life on Mars in year 2100.')
print(response.content)
This code initializes a LangChain LLM instance using the Gemini-pro model and sends a creative prompt about life on Mars in 2100.
Advanced Use With Templating and Chains
LangChain also supports more advanced templating and chaining mechanisms:
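# Requires the main langchain package: pip install langchain
# (newer LangChain releases deprecate LLMChain in favor of composing prompt | llm)
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
# Set up a prompt template with a {topic} placeholder
prompt = PromptTemplate.from_template('You are a content creator. Write me a tweet about {topic}')
# Create a chain that utilizes both the LLM and the prompt template
chain = LLMChain(llm=llm, prompt=prompt, verbose=True)
topic = 'Why will AI change the world'
# Pass template variables as a dict; the result is a dict whose 'text' key holds the output
response = chain.invoke({'topic': topic})
print(response['text'])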
This setup enables more structured interactions, where the chain constructs and sends prompts dynamically based on the input.
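Conceptually, the templating step is named-placeholder substitution. The standard-library sketch below illustrates that idea only; it is not LangChain's implementation, and the render helper is a hypothetical name.

```python
TEMPLATE = 'You are a content creator. Write me a tweet about {topic}'


def render(template, **variables):
    """Fill the named placeholders in the template, failing loudly if one is missing."""
    try:
        return template.format(**variables)
    except KeyError as missing:
        raise ValueError(f'Missing template variable: {missing}') from None


# The rendered string is what would be sent to the model as the prompt.
print(render(TEMPLATE, topic='Why will AI change the world'))
```

Keeping the template separate from its inputs means the same wording can be reused across many topics, which is the main reason to reach for a template over an ad hoc string.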
System Prompt and Streaming
System Prompt
Handling specific instructions in prompts can be crucial for controlling your AI application’s behavior:
from langchain_core.messages import HumanMessage, SystemMessage
# Setup with system message conversion
llm = ChatGoogleGenerativeAI(model='gemini-pro', convert_system_message_to_human=True)
output = llm.invoke([
    SystemMessage(content='Answer only YES or NO in French.'),
    HumanMessage(content='Is fish a mammal?')
])
print(output.content)
This method is useful for creating structured, controlled dialogues where the AI system adheres strictly to given instructions.
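A model may still add punctuation or extra words, so code that depends on a strict reply format may want to validate it. The following is a minimal sketch of such a check for the French yes/no case; normalize_yes_no is a hypothetical helper, and the sample replies stand in for output.content.

```python
def normalize_yes_no(text):
    """Return 'OUI' or 'NON' if the reply matches one of them, else None."""
    cleaned = text.strip().strip('.!').strip().upper()
    return cleaned if cleaned in {'OUI', 'NON'} else None


# Sample replies standing in for output.content
for reply in ['Non.', ' oui ', 'Je ne sais pas']:
    print(repr(reply), '->', normalize_yes_no(reply))
```

A None result signals that the reply did not follow the instruction, at which point the application could retry or fall back to a default.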
Streaming Responses
For longer outputs, streaming can be essential:
# Send a prompt requiring detailed, continuous output
prompt = 'Write a scientific paper outlining the mathematical foundation of our universe.'
for chunk in llm.stream(prompt):
    print(chunk.content)
    print('-' * 100)
Streaming returns the output in chunks as it is generated, so your application can start displaying text before the full response is complete.
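Applications often need the complete text as well as the live display. One way to get both is sketched below: print each chunk as it arrives and join the pieces at the end. The helper name collect_stream is hypothetical, and the generator is an offline stand-in for llm.stream(prompt).

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Print each chunk as it arrives and return the full concatenated text."""
    parts = []
    for chunk in chunks:
        # end='' keeps the chunks on one line; flush shows them immediately
        print(chunk.content, end='', flush=True)
        parts.append(chunk.content)
    print()
    return ''.join(parts)


# Offline stand-in for llm.stream(prompt): any iterable of objects with .content works.
fake_stream = (
    SimpleNamespace(content=text)
    for text in ['The universe ', 'is ', 'mathematical.']
)
full_text = collect_stream(fake_stream)
```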
Multimodal AI With Gemini Pro Vision
Handling Images
Gemini Pro Vision extends capabilities to image analysis:
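from PIL import Image  # pip install pillow
from langchain_core.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI
# Load the image to analyze; replace 'match.jpg' with the path to your own image
img = Image.open('match.jpg')
# Set up the vision model for image analysis
llm = ChatGoogleGenerativeAI(model='gemini-pro-vision')
prompt = 'What is in this image?'
# A multimodal message combines a text part and an image part
# Note: depending on the installed langchain-google-genai version, 'image_url' may
# need to be a URL or base64 data URI string rather than a PIL image object
message = HumanMessage(
    content=[
        {'type': 'text', 'text': prompt},
        {'type': 'image_url', 'image_url': img}
    ]
)
response = llm.invoke([message])
print(response.content)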
This example shows how to ask the model a question about an image and have it describe the contents.
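An image can also be supplied as a base64 data URI string, which avoids depending on how a given package version handles image objects. The sketch below builds such a string with the standard library only. The helper name image_to_data_uri is hypothetical, and whether your installed langchain-google-genai version accepts data URIs in the 'image_url' field is an assumption to verify against its documentation.

```python
import base64
import mimetypes


def image_to_data_uri(path):
    """Read an image file and return it as a base64 data URI string."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith('image/'):
        raise ValueError(f'Cannot determine an image MIME type for {path!r}')
    with open(path, 'rb') as handle:
        encoded = base64.b64encode(handle.read()).decode('ascii')
    return f'data:{mime_type};base64,{encoded}'


# Example use (not executed here, since it needs a real file):
#   uri = image_to_data_uri('match.jpg')
#   part = {'type': 'image_url', 'image_url': uri}
```

The resulting dictionary would take the place of the image part in the message content list.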
Conclusion
Using the functionalities of LangChain and Gemini, you can generate text, analyze images and implement multimodal AI interactions.
Combining these technologies lets developers build AI systems that respond quickly and handle tasks spanning both text and images.
Whether you aim to enhance user interactions, automate responses or analyze visual content, you can incorporate these robust tools into your projects.
Start experimenting and explore the potential of LangChain and Google’s Gemini to transform your applications into more powerful and innovative platforms.
About the author:
Oladimeji Sowole
Oladimeji Sowole is a member of the Andela Talent Network, a private marketplace for global tech talent. A Data Scientist and Data Analyst with more than 6 years of professional experience building data visualizations with different tools and predictive models for actionable insights, he has hands-on expertise in implementing technologies such as Python, R, and SQL to develop solutions that drive client satisfaction. A collaborative team player, he has a great passion for solving problems.