Showing results for tags 'rag'.
Found 5 results

  1. Discover the basics of using Gemini with Python via VertexAI, creating APIs with FastAPI, data validation with Pydantic and the fundamentals of Retrieval-Augmented Generation (RAG)Photo by Kenny Eliason on UnsplashWithin this article, I share some of the basics to create a LLM-driven web-application, using various technologies, such as: Python, FastAPI, Pydantic, VertexAI and more. You will learn how to create such a project from the very beginning and get an overview of the underlying concepts, including Retrieval-Augmented Generation (RAG).Disclaimer: I am using data from The Movie Database within this project. The API is free to use for non-commercial purposes and complies with the Digital Millennium Copyright Act (DMCA). For further information about TMDB data usage, please read the official FAQ. Table of contents– Inspiration – System Architecture – Understanding Retrieval-Augmented Generation (RAG) – Python projects with Poetry – Create the API with FastAPI – Data validation and quality with Pydantic – TMDB client with httpx – Gemini LLM client with VertexAI – Modular prompt generator with Jinja – Frontend – API examples – Conclusion The best way to share this knowledge is through a practical example. Hence, I’ll use my project Gemini Movie Detectives to cover the various aspects. The project was created as part of the Google AI Hackathon 2024, which is still running while I am writing this. Gemini Movie Detectives (by author)Gemini Movie Detectives is a project aimed at leveraging the power of the Gemini Pro model via VertexAI to create an engaging quiz game using the latest movie data from The Movie Database (TMDB). Part of the project was also to make it deployable with Docker and to create a live version. Try it yourself: movie-detectives.com. Keep in mind that this is a simple prototype, so there might be unexpected issues. Also, I had to add some limitations in order to control costs that might be generated by using GCP and VertexAI. Gemini Movie Detectives (by author)The project is fully open-source and is split into two separate repositories: Github repository for backend: https://github.com/vojay-dev/gemini-movie-detectives-api Github repository for frontend: https://github.com/vojay-dev/gemini-movie-detectives-uiThe focus of the article is the backend project and underlying concepts. It will therefore only briefly explain the frontend and its components. In the following video, I also give an overview over the project and its components: https://medium.com/media/bf4270fa881797cd515421b7bb646d1d/hrefInspirationGrowing up as a passionate gamer and now working as a Data Engineer, I’ve always been drawn to the intersection of gaming and data. With this project, I combined two of my greatest passions: gaming and data. Back in the 90’ I always enjoyed the video game series You Don’t Know Jack, a delightful blend of trivia and comedy that not only entertained but also taught me a thing or two. Generally, the usage of games for educational purposes is another concept that fascinates me. In 2023, I organized a workshop to teach kids and young adults game development. They learned about mathematical concepts behind collision detection, yet they had fun as everything was framed in the context of gaming. It was eye-opening that gaming is not only a huge market but also holds a great potential for knowledge sharing. 
With this project, called Movie Detectives, I aim to showcase the magic of Gemini, and AI in general, in crafting engaging trivia and educational games, but also how game design can profit from these technologies in general. By feeding the Gemini LLM with accurate and up-to-date movie metadata, I could ensure the accuracy of the questions from Gemini. An important aspect, because without this Retrieval-Augmented Generation (RAG) methodology to enrich queries with real-time metadata, there’s a risk of propagating misinformation — a typical pitfall when using AI for this purpose. Another game-changer lies in the modular prompt generation framework I’ve crafted using Jinja templates. It’s like having a Swiss Army knife for game design — effortlessly swapping show master personalities to tailor the game experience. And with the language module, translating the quiz into multiple languages is a breeze, eliminating the need for costly translation processes. Taking that on a business perspective, it can be used to reach a much broader audience of customers, without the need of expensive translation processes. From a business standpoint, this modularization opens doors to a wider customer base, transcending language barriers without breaking a sweat. And personally, I’ve experienced firsthand the transformative power of these modules. Switching from the default quiz master to the dad-joke-quiz-master was a riot — a nostalgic nod to the heyday of You Don’t Know Jack, and a testament to the versatility of this project. Movie Detectives — Example: Santa Claus personality (by author)System ArchitectureBefore we jump into details, let’s get an overview of how the application was built. Tech Stack: Backend Python 3.12 + FastAPI API developmenthttpx for TMDB integrationJinja templating for modular prompt generationPydantic for data modeling and validationPoetry for dependency managementDocker for deploymentTMDB API for movie dataVertexAI and Gemini for generating quiz questions and evaluating answersRuff as linter and code formatter together with pre-commit hooksGithub Actions to automatically run tests and linter on every pushTech Stack: Frontend VueJS 3.4 as the frontend frameworkVite for frontend toolingEssentially, the application fetches up-to-date movie metadata from an external API (TMDB), constructs a prompt based on different modules (personality, language, …), enriches this prompt with the metadata and that way, uses Gemini to initiate a movie quiz in which the user has to guess the correct title. The backend infrastructure is built with FastAPI and Python, employing the Retrieval-Augmented Generation (RAG) methodology to enrich queries with real-time metadata. Utilizing Jinja templating, the backend modularizes prompt generation into base, personality, and data enhancement templates, enabling the generation of accurate and engaging quiz questions. The frontend is powered by Vue 3 and Vite, supported by daisyUI and Tailwind CSS for efficient frontend development. Together, these tools provide users with a sleek and modern interface for seamless interaction with the backend. In Movie Detectives, quiz answers are interpreted by the Language Model (LLM) once again, allowing for dynamic scoring and personalized responses. This showcases the potential of integrating LLM with RAG in game design and development, paving the way for truly individualized gaming experiences. Furthermore, it demonstrates the potential for creating engaging quiz trivia or educational games by involving LLM. 
Adding and changing personalities or languages is as easy as adding more Jinja template modules. With very little effort, this can change the full game experience, reducing the effort for developers. System Overview (by author)As can be seen in the overview, Retrieval-Augmented Generation (RAG) is one of the essential ideas of the backend. Let’s have a closer look at this particular paradigm. Understanding Retrieval-Augmented Generation (RAG)In the realm of Large Language Models (LLM) and AI, one paradigm becoming more and more popular is Retrieval-Augmented Generation (RAG). But what does RAG entail, and how does it influence the landscape of AI development? At its essence, RAG enhances LLM systems by incorporating external data to enrich their predictions. Which means, you pass relevant context to the LLM as an additional part of the prompt, but how do you find relevant context? Usually, this data can be automatically retrieved from a database with vector search or dedicated vector databases. Vector databases are especially useful, since they store data in a way, so that it can be queried for similar data quickly. The LLM then generates the output based on both, the query and the retrieved documents. Picture this: you have an LLM capable of generating text based on a given prompt. RAG takes this a step further by infusing additional context from external sources, like up-to-date movie data, to enhance the relevance and accuracy of the generated text. Let’s break down the key components of RAG: LLMs: LLMs serve as the backbone of RAG workflows. These models, trained on vast amounts of text data, possess the ability to understand and generate human-like text.Vector Indexes for contextual enrichment: A crucial aspect of RAG is the use of vector indexes, which store embeddings of text data in a format understandable by LLMs. These indexes allow for efficient retrieval of relevant information during the generation process. In the context of the project this could be a database of movie metadata.Retrieval process: RAG involves retrieving pertinent documents or information based on the given context or prompt. This retrieved data acts as the additional input for the LLM, supplementing its understanding and enhancing the quality of generated responses. This could be getting all relevant information known and connected to a specific movie.Generative Output: With the combined knowledge from both the LLM and the retrieved context, the system generates text that is not only coherent but also contextually relevant, thanks to the augmented data.RAG architecture (by author)While in the Gemini Movie Detectives project, the prompt is enhanced with external API data from The Movie Database, RAG typically involves the use of vector indexes to streamline this process. It is using much more complex documents as well as a much higher amount of data for enhancement. Thus, these indexes act like signposts, guiding the system to relevant external sources quickly. In this project, it is therefore a mini version of RAG but showing the basic idea at least, demonstrating the power of external data to augment LLM capabilities. In more general terms, RAG is a very important concept, especially when crafting trivia quizzes or educational games using LLMs like Gemini. This concept can avoid the risk of false positives, asking wrong questions, or misinterpreting answers from the users. 
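Here is a minimal, self-contained Python sketch of that retrieve-then-augment flow (an illustration only, not the Movie Detectives implementation): the embed() function, the toy documents and the in-memory index are placeholder assumptions standing in for a real embedding model and a vector database.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(8)

# Toy in-memory "vector index": (embedding, document) pairs. In a real system
# this would live in a vector database or a vector index of a search engine.
documents = [
    "Inception (2010): a thief enters people's dreams to steal secrets.",
    "Godzilla vs. Kong (2021): two giant monsters clash.",
]
index = [(embed(doc), doc) for doc in documents]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    scored = sorted(
        index,
        key=lambda item: float(np.dot(q, item[0]) / (np.linalg.norm(q) * np.linalg.norm(item[0]))),
        reverse=True,
    )
    return [doc for _, doc in scored[:k]]

def build_prompt(question: str) -> str:
    # Augment the prompt with the retrieved context before sending it to the LLM.
    context = "\n".join(retrieve(question, k=1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("Which movie is about stealing secrets from dreams?"))

In a production setup, embed() would call an embedding model and the index would live in one of the vector databases listed below; the resulting prompt would then be sent to the LLM.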
Here are some open-source projects that might be helpful when approaching RAG in one of your projects:

txtai: All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows.
LangChain: LangChain is a framework for developing applications powered by large language models (LLMs).
Qdrant: Vector Search Engine for the next generation of AI applications.
Weaviate: Weaviate is a cloud-native, open source vector database that is robust, fast, and scalable.

Of course, given the potential value of this approach for LLM-based applications, there are many more open- and closed-source alternatives, but with these you should be able to get your research on the topic started.

Python projects with Poetry

Now that the main concepts are clear, let's take a closer look at how the project was created and how dependencies are managed in general. The three main tasks Poetry helps you with are build, publish and track. The idea is to have a deterministic way to manage dependencies, to share your project and to track dependency states.

Photo by Kat von Wood on Unsplash

Poetry also handles the creation of virtual environments for you. By default, those live in a centralized folder on your system. However, if you prefer to keep a project's virtual environment inside the project folder, like I do, it is a simple config change:

poetry config virtualenvs.in-project true

With poetry new you can then create a new Python project. It will create a virtual environment linked to your system's default Python. If you combine this with pyenv, you get a flexible way to create projects using specific Python versions. Alternatively, you can tell Poetry directly which Python version to use: poetry env use /full/path/to/python. Once you have a new project, you can use poetry add to add dependencies to it. With this, I created the project for Gemini Movie Detectives:

poetry config virtualenvs.in-project true
poetry new gemini-movie-detectives-api
cd gemini-movie-detectives-api
poetry add 'uvicorn[standard]'
poetry add fastapi
poetry add pydantic-settings
poetry add httpx
poetry add 'google-cloud-aiplatform>=1.38'
poetry add jinja2

The metadata about your project, including the dependencies with their respective versions, is stored in the pyproject.toml and poetry.lock files. I added more dependencies later, which resulted in the following pyproject.toml for the project:

[tool.poetry]
name = "gemini-movie-detectives-api"
version = "0.1.0"
description = "Use Gemini Pro LLM via VertexAI to create an engaging quiz game incorporating TMDB API data"
authors = ["Volker Janz <volker@janz.sh>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.12"
fastapi = "^0.110.1"
uvicorn = {extras = ["standard"], version = "^0.29.0"}
python-dotenv = "^1.0.1"
httpx = "^0.27.0"
pydantic-settings = "^2.2.1"
google-cloud-aiplatform = ">=1.38"
jinja2 = "^3.1.3"
ruff = "^0.3.5"
pre-commit = "^3.7.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Create the API with FastAPI

FastAPI is a Python framework that allows for rapid API development. Built on open standards, it offers a seamless experience without new syntax to learn. With automatic documentation generation, robust validation, and integrated security, FastAPI streamlines development while ensuring great performance.

Photo by Florian Steciuk on Unsplash

To implement the API for the Gemini Movie Detectives project, I simply started from a Hello World application and extended it from there.
Here is how to get started: from fastapi import FastAPI app = FastAPI() @app.get("/") def read_root(): return {"Hello": "World"}Assuming you also keep the virtual environment within the project folder as .venv/ and use uvicorn, this is how to start the API with the reload feature enabled, in order to test code changes without the need of a restart: source .venv/bin/activate uvicorn gemini_movie_detectives_api.main:app --reload curl -s localhost:8000 | jq .If you have not yet installed jq, I highly recommend doing it now. I might cover this wonderful JSON Swiss Army knife in a future article. This is how the response looks like: Hello FastAPI (by author)From here, you can develop your API endpoints as needed. This is how the API endpoint implementation to start a movie quiz in Gemini Movie Detectives looks like for example: @app.post('/quiz') @rate_limit @retry(max_retries=settings.quiz_max_retries) def start_quiz(quiz_config: QuizConfig = QuizConfig()): movie = tmdb_client.get_random_movie( page_min=_get_page_min(quiz_config.popularity), page_max=_get_page_max(quiz_config.popularity), vote_avg_min=quiz_config.vote_avg_min, vote_count_min=quiz_config.vote_count_min ) if not movie: logger.info('could not find movie with quiz config: %s', quiz_config.dict()) raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail='No movie found with given criteria') try: genres = [genre['name'] for genre in movie['genres']] prompt = prompt_generator.generate_question_prompt( movie_title=movie['title'], language=get_language_by_name(quiz_config.language), personality=get_personality_by_name(quiz_config.personality), tagline=movie['tagline'], overview=movie['overview'], genres=', '.join(genres), budget=movie['budget'], revenue=movie['revenue'], average_rating=movie['vote_average'], rating_count=movie['vote_count'], release_date=movie['release_date'], runtime=movie['runtime'] ) chat = gemini_client.start_chat() logger.debug('starting quiz with generated prompt: %s', prompt) gemini_reply = gemini_client.get_chat_response(chat, prompt) gemini_question = gemini_client.parse_gemini_question(gemini_reply) quiz_id = str(uuid.uuid4()) session_cache[quiz_id] = SessionData( quiz_id=quiz_id, chat=chat, question=gemini_question, movie=movie, started_at=datetime.now() ) return StartQuizResponse(quiz_id=quiz_id, question=gemini_question, movie=movie) except GoogleAPIError as e: raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=f'Google API error: {e}') except Exception as e: raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=f'Internal server error: {e}')Within this code, you can see already three of the main components of the backend: tmdb_client: A client I implemented using httpx to fetch data from The Movie Database (TMDB).prompt_generator: A class that helps to generate modular prompts based on Jinja templates.gemini_client: A client to interact with the Gemini LLM via VertexAI in Google Cloud.We will look at these components in detail later, but first some more helpful insights regarding the usage of FastAPI. FastAPI makes it really easy to define the HTTP method and data to be transferred to the backend. For this particular function, I expect a POST request as this creates a new quiz. This can be done with the post decorator: @app.post('/quiz')Also, I am expecting some data within the request sent as JSON in the body. In this case, I am expecting an instance of QuizConfig as JSON. 
I simply defined QuizConfig as a subclass of BaseModel from Pydantic (covered later) and with that, I can pass it into the API function and FastAPI will do the rest:

class QuizConfig(BaseModel):
    vote_avg_min: float = Field(5.0, ge=0.0, le=9.0)
    vote_count_min: float = Field(1000.0, ge=0.0)
    popularity: int = Field(1, ge=1, le=3)
    personality: str = Personality.DEFAULT.name
    language: str = Language.DEFAULT.name

# ...

def start_quiz(quiz_config: QuizConfig = QuizConfig()):

Furthermore, you might notice two custom decorators:

@rate_limit
@retry(max_retries=settings.quiz_max_retries)

I implemented these to reduce duplicate code. They wrap the API function to retry it in case of errors and to introduce a global rate limit on how many movie quizzes can be started per day.

What I also liked personally is the error handling with FastAPI. You can simply raise an HTTPException, give it the desired status code, and the user will receive a proper response, for example, if no movie could be found with a given configuration:

raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail='No movie found with given criteria')

With this, you should have an overview of creating an API like the one for Gemini Movie Detectives with FastAPI. Keep in mind: all code is open-source, so feel free to have a look at the API repository on Github.

Data validation and quality with Pydantic

One of the main challenges with today's AI/ML projects is data quality. That does not only apply to ETL/ELT pipelines, which prepare datasets to be used in model training or prediction, but also to the AI/ML application itself. Using Python, for example, usually enables Data Engineers and Scientists to get a reasonable result with little code, but being (mostly) dynamically typed, Python lacks data validation when used in a naive way.

That is why in this project I combined FastAPI with Pydantic, a powerful data validation library for Python. The goal was to keep the API lightweight while being strict about data quality and validation. Instead of plain dictionaries, for example, the Movie Detectives API strictly uses custom classes inherited from the BaseModel provided by Pydantic. This is the configuration for a quiz, for example:

class QuizConfig(BaseModel):
    vote_avg_min: float = Field(5.0, ge=0.0, le=9.0)
    vote_count_min: float = Field(1000.0, ge=0.0)
    popularity: int = Field(1, ge=1, le=3)
    personality: str = Personality.DEFAULT.name
    language: str = Language.DEFAULT.name

This example illustrates how not only the correct type is ensured, but also further validation is applied to the actual values. Furthermore, up-to-date Python features, like StrEnum, are used to distinguish certain types, like personalities:

class Personality(StrEnum):
    DEFAULT = 'default.jinja'
    CHRISTMAS = 'christmas.jinja'
    SCIENTIST = 'scientist.jinja'
    DAD = 'dad.jinja'

Also, duplicate code is avoided by defining custom decorators, as shown in the sketch and example that follow.
For example, the following decorator limits the number of quiz sessions today, to have control over GCP costs: call_count = 0 last_reset_time = datetime.now() def rate_limit(func: callable) -> callable: @wraps(func) def wrapper(*args, **kwargs) -> callable: global call_count global last_reset_time # reset call count if the day has changed if datetime.now().date() > last_reset_time.date(): call_count = 0 last_reset_time = datetime.now() if call_count >= settings.quiz_rate_limit: raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail='Daily limit reached') call_count += 1 return func(*args, **kwargs) return wrapperIt is then simply applied to the related API function: @app.post('/quiz') @rate_limit @retry(max_retries=settings.quiz_max_retries) def start_quiz(quiz_config: QuizConfig = QuizConfig()):The combination of up-to-date Python features and libraries, such as FastAPI, Pydantic or Ruff makes the backend less verbose but still very stable and ensures a certain data quality, to ensure the LLM output has the expected quality. TMDB client with httpxThe TMDB Client class is using httpx to perform requests against the TMDB API. httpx is a rising star in the world of Python libraries. While requests has long been the go-to choice for making HTTP requests, httpx offers a valid alternative. One of its key strengths is asynchronous functionality. httpx allows you to write code that can handle multiple requests concurrently, potentially leading to significant performance improvements in applications that deal with a high volume of HTTP interactions. Additionally, httpx aims for broad compatibility with requests, making it easier for developers to pick it up. In case of Gemini Movie Detectives, there are two main requests: get_movies: Get a list of random movies based on specific settings, like average number of votesget_movie_details: Get details for a specific movie to be used in a quizIn order to reduce the amount of external requests, the latter one uses the lru_cache decorator, which stands for “Least Recently Used cache”. It’s used to cache the results of function calls so that if the same inputs occur again, the function doesn’t have to recompute the result. Instead, it returns the cached result, which can significantly improve the performance of the program, especially for functions with expensive computations. In our case, we cache the details for 1024 movies, so if 2 players get the same movie, we do not need to make a request again: @lru_cache(maxsize=1024) def get_movie_details(self, movie_id: int): response = httpx.get(f'https://api.themoviedb.org/3/movie/{movie_id}', headers={ 'Authorization': f'Bearer {self.tmdb_api_key}' }, params={ 'language': 'en-US' }) movie = response.json() movie['poster_url'] = self.get_poster_url(movie['poster_path']) return movieAccessing data from The Movie Database (TMDB) is for free for non-commercial usage, you can simply generate an API key and start making requests. Gemini LLM client with VertexAIBefore Gemini via VertexAI can be used, you need a Google Cloud project with VertexAI enabled and a Service Account with sufficient access together with its JSON key file. Create GCP project (by author)After creating a new project, navigate to APIs & Services –> Enable APIs and service –> search for VertexAI API –> Enable. Enable VertexAI (by author)To create a Service Account, navigate to IAM & Admin –> Service Accounts –> Create service account. Choose a proper name and go to the next step. 
Create Service Account (by author)Now ensure to assign the account the pre-defined role Vertex AI User. Assign correct role (by author)Finally you can generate and download the JSON key file by clicking on the new user –> Keys –> Add Key –> Create new key –> JSON. With this file, you are good to go. Create JSON key file (by author)Using Gemini from Google with Python via VertexAI starts by adding the necessary dependency to the project: poetry add 'google-cloud-aiplatform>=1.38'With that, you can import and initialize vertexai with your JSON key file. Also you can load a model, like the newly released Gemini 1.5 Pro model, and start a chat session like this: import vertexai from google.oauth2.service_account import Credentials from vertexai.generative_models import GenerativeModel project_id = "my-project-id" location = "us-central1" credentials = Credentials.from_service_account_file("credentials.json") model = "gemini-1.0-pro" vertexai.init(project=project_id, location=location, credentials=credentials) model = GenerativeModel(model) chat_session = model.start_chat()You can now use chat.send_message() to send a prompt to the model. However, since you get the response in chunks of data, I recommend using a little helper function, so that you simply get the full response as one String: def get_chat_response(chat: ChatSession, prompt: str) -> str: text_response = [] responses = chat.send_message(prompt, stream=True) for chunk in responses: text_response.append(chunk.text) return ''.join(text_response)A full example can then look like this: import vertexai from google.oauth2.service_account import Credentials from vertexai.generative_models import GenerativeModel, ChatSession project_id = "my-project-id" location = "us-central1" credentials = Credentials.from_service_account_file("credentials.json") model = "gemini-1.0-pro" vertexai.init(project=project_id, location=location, credentials=credentials) model = GenerativeModel(model) chat_session = model.start_chat() def get_chat_response(chat: ChatSession, prompt: str) -> str: text_response = [] responses = chat.send_message(prompt, stream=True) for chunk in responses: text_response.append(chunk.text) return ''.join(text_response) response = get_chat_response( chat_session, "How to say 'you are awesome' in Spanish?" ) print(response)Running this, Gemini gave me the following response: You are awesome (by author)I agree with Gemini: Eres increíbleAnother hint when using this: you can also configure the model generation by passing a configuration to the generation_config parameter as part of the send_message function. For example: generation_config = { 'temperature': 0.5 } responses = chat.send_message( prompt, generation_config=generation_config, stream=True )I am using this in Gemini Movie Detectives to set the temperature to 0.5, which gave me best results. In this context temperature means: how creative are the generated responses by Gemini. The value must be between 0.0 and 1.0, whereas closer to 1.0 means more creativity. One of the main challenges apart from sending a prompt and receive the reply from Gemini is to parse the reply in order to extract the relevant information. One learning from the project is: Specify a format for Gemini, which does not rely on exact words but uses key symbols to separate information elementsFor example, the question prompt for Gemini contains this instruction: Your reply must only consist of three lines! 
You must only reply strictly using the following template for the three lines: Question: <Your question> Hint 1: <The first hint to help the participants> Hint 2: <The second hint to get the title more easily>The naive approach would be, to parse the answer by looking for a line that starts with Question:. However, if we use another language, like German, the reply would look like: Antwort:. Instead, focus on the structure and key symbols. Read the reply like this: It has 3 linesThe first line is the questionSecond line the first hintThird line the second hintKey and value are separated by :With this approach, the reply can be parsed language agnostic, and this is my implementation in the actual client: @staticmethod def parse_gemini_question(gemini_reply: str) -> GeminiQuestion: result = re.findall(r'[^:]+: ([^\n]+)', gemini_reply, re.MULTILINE) if len(result) != 3: msg = f'Gemini replied with an unexpected format. Gemini reply: {gemini_reply}' logger.warning(msg) raise ValueError(msg) question = result[0] hint1 = result[1] hint2 = result[2] return GeminiQuestion(question=question, hint1=hint1, hint2=hint2)In the future, the parsing of responses will become even easier. During the Google Cloud Next ’24 conference, Google announced that Gemini 1.5 Pro is now publicly available and with that, they also announced some features including a JSON mode to have responses in JSON format. Checkout this article for more details. Apart from that, I wrapped the Gemini client into a configurable class. You can find the full implementation open-source on Github. Modular prompt generator with JinjaThe Prompt Generator is a class wich combines and renders Jinja2 template files to create a modular prompt. There are two base templates: one for generating the question and one for evaluating the answer. Apart from that, there is a metadata template to enrich the prompt with up-to-date movie data. Furthermore, there are language and personality templates, organized in separate folders with a template file for each option. Prompt Generator (by author)Using Jinja2 allows to have advanced features like template inheritance, which is used for the metadata. This makes it easy to extend this component, not only with more options for personalities and languages, but also to extract it into its own open-source project to make it available for other Gemini projects. FrontendThe Gemini Movie Detectives frontend is split into four main components and uses vue-router to navigate between them. The Home component simply displays the welcome message. The Quiz component displays the quiz itself and talks to the API via fetch. To create a quiz, it sends a POST request to api/quiz with the desired settings. The backend is then selecting a random movie based on the user settings, creates the prompt with the modular prompt generator, uses Gemini to generate the question and hints and finally returns everything back to the component so that the quiz can be rendered. Additionally, each quiz gets a session ID assigned in the backend and is stored in a limited LRU cache. For debugging purposes, this component fetches data from the api/sessions endpoint. This returns all active sessions from the cache. This component displays statistics about the service. However, so far there is only one category of data displayed, which is the quiz limit. To limit the costs for VertexAI and GCP usage in general, there is a daily limit of quiz sessions, which will reset with the first quiz of the next day. 
Data is retrieved form the api/limit endpoint. Vue components (by author)API examplesOf course using the frontend is a nice way to interact with the application, but it is also possible to just use the API. The following example shows how to start a quiz via the API using the Santa Claus / Christmas personality: curl -s -X POST https://movie-detectives.com/api/quiz \ -H 'Content-Type: application/json' \ -d '{"vote_avg_min": 5.0, "vote_count_min": 1000.0, "popularity": 3, "personality": "christmas"}' | jq .{ "quiz_id": "e1d298c3-fcb0-4ebe-8836-a22a51f87dc6", "question": { "question": "Ho ho ho, this movie takes place in a world of dreams, just like the dreams children have on Christmas Eve after seeing Santa Claus! It's about a team who enters people's dreams to steal their secrets. Can you guess the movie? Merry Christmas!", "hint1": "The main character is like a skilled elf, sneaking into people's minds instead of houses. ", "hint2": "I_c_p_i_n " }, "movie": {...} }Movie Detectives — Example: Santa Claus personality (by author)This example shows how to change the language for a quiz: curl -s -X POST https://movie-detectives.com/api/quiz \ -H 'Content-Type: application/json' \ -d '{"vote_avg_min": 5.0, "vote_count_min": 1000.0, "popularity": 3, "language": "german"}' | jq .{ "quiz_id": "7f5f8cf5-4ded-42d3-a6f0-976e4f096c0e", "question": { "question": "Stellt euch vor, es gäbe riesige Monster, die auf der Erde herumtrampeln, als wäre es ein Spielplatz! Einer ist ein echtes Urviech, eine Art wandelnde Riesenechse mit einem Atem, der so heiß ist, dass er euer Toastbrot in Sekundenschnelle rösten könnte. Der andere ist ein gigantischer Affe, der so stark ist, dass er Bäume ausreißt wie Gänseblümchen. Und jetzt ratet mal, was passiert? Die beiden geraten aneinander, wie zwei Kinder, die sich um das letzte Stück Kuchen streiten! Wer wird wohl gewinnen, die Riesenechse oder der Superaffe? Das ist die Frage, die sich die ganze Welt stellt! ", "hint1": "Der Film spielt in einer Zeit, in der Monster auf der Erde wandeln.", "hint2": "G_dz_ll_ vs. K_ng " }, "movie": {...} }And this is how to answer to a quiz via an API call: curl -s -X POST https://movie-detectives.com/api/quiz/84c19425-c179-4198-9773-a8a1b71c9605/answer \ -H 'Content-Type: application/json' \ -d '{"answer": "Greenland"}' | jq .{ "quiz_id": "84c19425-c179-4198-9773-a8a1b71c9605", "question": {...}, "movie": {...}, "user_answer": "Greenland", "result": { "points": "3", "answer": "Congratulations! You got it! Greenland is the movie we were looking for. You're like a human GPS, always finding the right way!" } }ConclusionAfter I finished the basic project, adding more personalities and languages was so easy with the modular prompt approach, that I was impressed by the possibilities this opens up for game design and development. I could change this game from a pure educational game about movies, into a comedy trivia “You Don’t Know Jack”-like game within a minute by adding another personality. Also, combining up-to-date Python functionality with validation libraries like Pydantic is very powerful and can be used to ensure good data quality for LLM input. And there you have it, folks! You’re now equipped to craft your own LLM-powered web application. Feeling inspired but need a starting point? 
Check out the open-source code for the Gemini Movie Detectives project: Github repository for backend: https://github.com/vojay-dev/gemini-movie-detectives-api Github repository for frontend: https://github.com/vojay-dev/gemini-movie-detectives-uiThe future of AI-powered applications is bright, and you’re holding the paintbrush! Let’s go make something remarkable. And if you need a break, feel free to try https://movie-detectives.com/. Create an AI-Driven Movie Quiz with Gemini LLM, Python, FastAPI, Pydantic, RAG and more was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article
  2. Knowledge Bases for Amazon Bedrock is a fully managed Retrieval-Augmented Generation (RAG) capability that allows you to connect foundation models (FMs) to internal company data sources to deliver relevant and accurate responses. We are excited to add new capabilities for building enterprise-ready RAG. Knowledge Bases now supports AWS CloudFormation and Service Quotas. View the full article
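For a sense of how such a knowledge base is queried programmatically, here is a hedged Python sketch using boto3; the knowledge base ID and question are made up, and the bedrock-agent-runtime call shape reflects my best understanding and should be verified against the AWS documentation.

import boto3

# Hypothetical knowledge base ID, region and question.
client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve(
    knowledgeBaseId="EXAMPLEKBID",
    retrievalQuery={"text": "What is our refund policy?"},
)

# Print the text of each retrieved chunk (response shape is an assumption).
for result in response.get("retrievalResults", []):
    print(result["content"]["text"])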
  3. One of the hottest topics in AI right now is RAG, or retrieval-augmented generation, which is a retrieval method used by some AI tools to improve the quality and relevance of their outputs. Organizations want AI tools that use RAG because it makes those tools aware of proprietary data without the effort and expense of custom model training. RAG also keeps models up to date.  When generating an answer without RAG, models can only draw upon data that existed when they were trained. With RAG, on the other hand, models can leverage a private database of newer information for more informed responses. We talked to GitHub Next’s Senior Director of Research, Idan Gazit, and Software Engineer, Colin Merkel, to learn more about RAG and how it’s used in generative AI tools. Why everyone’s talking about RAG One of the reasons you should always verify outputs from a generative AI tool is because its training data has a knowledge cut-off date. While models are able to produce outputs that are tailored to a request, they can only reference information that existed at the time of their training. But with RAG, an AI tool can use data sources beyond its model’s training data to generate an output. The difference between RAG and fine-tuning Most organizations currently don’t train their own AI models. Instead, they customize pre-trained models to their specific needs, often using RAG or fine-tuning. Here’s a quick breakdown of how these two strategies differ. Fine-tuning requires adjusting a model’s weights, which results in a highly customized model that excels at a specific task. It’s a good option for organizations that rely on codebases written in a specialized language, especially if the language isn’t well-represented in the model’s original training data. RAG, on the other hand, doesn’t require weight adjustment. Instead, it retrieves and gathers information from a variety of data sources to augment a prompt, which results in an AI model generating a more contextually relevant response for the end user. Some organizations start with RAG and then fine-tune their models to accomplish a more specific task. Other organizations find that RAG is a sufficient method for AI customization alone. How AI models use context In order for an AI tool to generate helpful responses, it needs the right context. This is the same dilemma we face as humans when making a decision or solving a problem. It’s hard to do when you don’t have the right information to act on. So, let’s talk more about context in the context () of generative AI: Today’s generative AI applications are powered by large language models (LLMs) that are structured as transformers, and all transformer LLMs have a context window— the amount of data that they can accept in a single prompt. Though context windows are limited in size, they can and will continue to grow larger as more powerful models are released. Input data will vary depending on the AI tool’s capabilities. For instance, when it comes to GitHub Copilot in the IDE, input data comprises all of the code in the file that you’re currently working on. This is made possible because of our Fill-in-the-Middle (FIM) paradigm, which makes GitHub Copilot aware of both the code before your cursor (the prefix) and after your cursor (the suffix). GitHub Copilot also processes code from your other open tabs (a process we call neighboring tabs) to potentially find and add relevant information to the prompt. When there are a lot of open tabs, GitHub Copilot will scan the most recently reviewed ones. 
Because of the context window’s limited size, the challenge of ML engineers is to figure out what input data to add to the prompt and in what order to generate the most relevant suggestion from the AI model. This task is known as prompt engineering. How RAG enhances an AI model’s contextual understanding With RAG, an LLM can go beyond training data and retrieve information from a variety of data sources, including customized ones. When it comes to GitHub Copilot Chat within GitHub.com and in the IDE, input data can include your conversation with the chat assistant, whether it’s code or natural language, through a process called in-context learning. It can also include data from indexed repositories (public or private), a collection of Markdown documentation across repositories (that we refer to as knowledge bases), and results from integrated search engines. From these other sources, RAG will retrieve additional data to augment the initial prompt. As a result, it can generate a more relevant response. The type of input data used by GitHub Copilot will depend on which GitHub Copilot plan you’re using. RAG and semantic search Unlike keyword search or Boolean search operators, an ML-powered semantic search system uses its training data to understand the relationship between your keywords. So, rather than view, for example, “cats” and “kittens” as independent terms as you would in a keyword search, a semantic search system can understand, from its training, that those words are often associated with cute videos of the animal. Because of this, a search for just “cats and kittens” might rank a cute animal video as a top search result. How does semantic search improve the quality of RAG retrievals? When using a customized database or search engine as a RAG data source, semantic search can improve the context added to the prompt and overall relevance of the AI-generated output. The semantic search process is at the heart of retrieval. “It surfaces great examples that often elicit great results,” Gazit says. https://github.blog/wp-content/uploads/2024/02/BLOG2_chat-knowledge-base_002.mp4 Developers can use Copilot Chat on GitHub.com to ask questions and receive answers about a codebase in natural language, or surface relevant documentation and existing solutions. RAG data sources: Where RAG uses semantic search You’ve probably read dozens of articles (including some of our own) that talk about RAG, vector databases, and embeddings. And even if you haven’t, here’s something you should know: RAG doesn’t require embeddings or vector databases. A RAG system can use semantic search to retrieve relevant documents, whether from an embedding-based retrieval system, traditional database, or search engine. The snippets from those documents are then formatted into the model’s prompt. We’ll provide a quick recap of vector databases and then, using GitHub Copilot Enterprise as an example, cover how RAG retrieves data from a variety of sources. Vector databases Vector databases are optimized for storing embeddings of your repository code and documentation. They allow us to use novel search parameters to find matches between similar vectors. To retrieve data from a vector database, code and documentation are converted into embeddings, a type of high-dimensional vector, to make them searchable by a RAG system. Here’s how RAG retrieves data from vector databases: while you code in your IDE, algorithms create embeddings for your code snippets, which are stored in a vector database. 
Then, an AI coding tool can search that database by embedding similarity to find snippets from across your codebase that are related to the code you’re currently writing and generate a coding suggestion. Those snippets are often highly relevant context, enabling an AI coding assistant to generate a more contextually relevant coding suggestion. GitHub Copilot Chat uses embedding similarity in the IDE and on GitHub.com, so it finds code and documentation snippets related to your query. Embedding similarity  is incredibly powerful because it identifies code that has subtle relationships to the code you’re editing. “Embedding similarity might surface code that uses the same APIs, or code that performs a similar task to yours but that lives in another part of the codebase,” Gazit explains. “When those examples are added to a prompt, the model’s primed to produce responses that mimic the idioms and techniques that are native to your codebase—even though the model was not trained on your code.” General text search and search engines With a general text search, any documents that you want to be accessible to the AI model are indexed ahead of time and stored for later retrieval. For instance, RAG in GitHub Copilot Enterprise can retrieve data from files in an indexed repository and Markdown files across repositories. Learn more about GitHub Copilot Enterprise features RAG can also retrieve information from external and internal search engines. When integrated with an external search engine, RAG can search and retrieve information from the entire internet. When integrated with an internal search engine, it can also access information from within your organization, like an internal website or platform. Integrating both kinds of search engines supercharges RAG’s ability to provide relevant responses. For instance, GitHub Copilot Enterprise integrates both Bing, an external search engine, and an internal search engine built by GitHub into Copilot Chat on GitHub.com. Bing integration allows GitHub Copilot Chat to conduct a web search and retrieve up-to-date information, like about the latest Java release. But without a search engine searching internally, ”Copilot Chat on GitHub.com cannot answer questions about your private codebase unless you provide a specific code reference yourself,” explains Merkel, who helped to build GitHub’s internal search engine from scratch. Here’s how this works in practice. When a developer asks a question about a repository to GitHub Copilot Chat in GitHub.com, RAG in Copilot Enterprise uses the internal search engine to find relevant code or text from indexed files to answer that question. To do this, the internal search engine conducts a semantic search by analyzing the content of documents from the indexed repository, and then ranking those documents based on relevance. GitHub Copilot Chat then uses RAG, which also conducts a semantic search, to find and retrieve the most relevant snippets from the top-ranked documents. Those snippets are added to the prompt so GitHub Copilot Chat can generate a relevant response for the developer. Key takeaways about RAG RAG offers an effective way to customize AI models, helping to ensure outputs are up to date with organizational knowledge and best practices, and the latest information on the internet. GitHub Copilot uses a variety of methods to improve the quality of input data and contextualize an initial prompt, and that ability is enhanced with RAG. 
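As a rough illustration of the embedding-similarity retrieval described above, the following sketch uses the sentence-transformers library as a stand-in; the article does not state which embedding model or database GitHub Copilot actually uses, so every name here is illustrative.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy "codebase" of snippets that would normally live in a vector database.
snippets = [
    "def parse_json(path): return json.load(open(path))",
    "def fetch_user(session, user_id): return session.get(User, user_id)",
]
snippet_embeddings = model.encode(snippets, convert_to_tensor=True)

query = "load a record from the database by id"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank snippets by cosine similarity; the best match becomes prompt context.
scores = util.cos_sim(query_embedding, snippet_embeddings)[0]
best = snippets[int(scores.argmax())]
print(best)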
What’s more, the RAG retrieval method in GitHub Copilot Enterprise goes beyond vector databases and includes data sources like general text search and search engine integrations, which provides even more cost-efficient retrievals. Context is everything when it comes to getting the most out of an AI tool. To improve the relevance and quality of a generative AI output, you need to improve the relevance and quality of the input. As Gazit says, “Quality in, quality out.” Looking to bring the power of GitHub Copilot Enterprise to your organization? Learn more about GitHub Copilot Enterprise or get started now. The post What is retrieval-augmented generation, and what does it do for generative AI? appeared first on The GitHub Blog. View the full article
  4. Improving the relevance of your LLM application by leveraging Charmed Opensearch’s vector database Large Language Models (LLMs) fall under the category of Generative AI (GenAI), an artificial intelligence type that produces content based on user-defined context. These models undergo training using an extensive dataset composed of trillions of combinations of words from natural language, enabling them to empower interactive and conversational applications across various scenarios. Renowned LLMs like GPT, BERT, PaLM, and LLaMa can experience performance improvements by gaining access to additional structured and unstructured data. This additional data may include public or internal documents, websites, and various text forms and content. This methodology, termed retrieval-augmented generation (RAG), ensures that your conversational application generates accurate results with contextual relevance and domain-specific knowledge, even in areas where the pertinent facts were not part of the initial training dataset. RAG can drastically improve the accuracy of an LLM’s responses. See the example below: “What is PRO?” response without RAG Pro is a subscription-based service that offers additional features and functionality to users. For example, Pro users can access exclusive content, receive priority customer support, and more. To become a Pro user, you can sign up for a Pro subscription on our website. Once you have signed up, you can access all of the Pro features and benefits. “What is PRO?” response with RAG Ubuntu Pro is an additional stream of security updates and packages that meet compliance requirements, such as FIPS or HIPAA, on top of an Ubuntu LTS. It provides an SLA for security fixes for the entire distribution (‘main and universe’ packages) for ten years, with extensions for industrial use cases. Ubuntu Pro is free for personal use, offering the full suite of Ubuntu Pro capabilities on up to 5 machines. This article guides you on leveraging Charmed OpenSearch to maintain a relevant and up-to-date LLM application. What is OpenSearch? OpenSearch is an open-source search and analytics engine. Users can extend the functionality of OpenSearch with a selection of plugins that enhance search, security, performance analysis, machine learning, and more. This previous article we wrote provides additional details on the comprehensive features of OpenSearch. We discussed the capability of enabling enterprise-grade solutions through Charmed OpenSearch. This blog will emphasise a specific feature pertinent to RAG: utilising OpenSearch as a vector database. What is a vector database? Vector databases allow you to store and index, for example, text documents, rich media, audio, geospatial coordinates, tables, and graphs into vectors. These vectors represent points in N-dimensional spaces, effectively encapsulating the context of an asset. Search tools can look into these spaces using low-latency queries to find similar assets in neighbouring data points. These search tools typically do this by exploiting the efficiency of different methods for obtaining, for example, the k-nearest neighbours (k-NN) from an index of vectors. In particular, OpenSearch enables this feature with the k-NN plugin and augments this functionality by providing your conversational applications with other essential features, such as fault tolerance, resource access controls, and a powerful query engine. 
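To give a feel for what a k-NN query looks like at the API level, here is a minimal sketch using the opensearch-py client; the connection details and the query vector are placeholders, while the index and field names mirror the tutorial that follows.

from opensearchpy import OpenSearch

# Placeholder connection details; the tutorial below shows a full setup.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}],
                    http_auth=("admin", "admin"), use_ssl=True, verify_certs=False)

query = {
    "size": 2,
    "query": {
        "knn": {
            "vector_field": {                # the knn_vector field of the index
                "vector": [0.1, 0.2, 0.3],   # question embedding (toy values; must match the index dimension)
                "k": 2,
            }
        }
    },
}

response = client.search(index="rag-index", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"].get("text"))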
Using the OpenSearch k-NN plugin for RAG

In this section, we provide a practical example of using Charmed OpenSearch as the retrieval tool in a RAG process, with an experiment using a Jupyter notebook on top of Charmed Kubeflow to run inference against an LLM.

1. Deploy Charmed OpenSearch and enable the k-NN plugin. Follow the Charmed OpenSearch tutorial, which is a good starting point. At the end, verify that the plugin is enabled (it is enabled by default):

$ juju config opensearch plugin_opensearch_knn
true

2. Get your credentials. The easiest way to create and retrieve your first administrator credentials is to add a relation between Charmed OpenSearch and the Data Integrator charm, which is also part of the tutorial.

3. Create a vector index. Now we can create a vector index for the additional documents, encoded using the knn_vector data type. For simplicity, we will use the opensearch-py client.

from opensearchpy import OpenSearch

os_host = "10.56.118.209"
os_port = 9200
os_url = "https://10.56.118.209:9200"
os_auth = ("opensearch-client_7", "sqlKjlEK7ldsBxqsOHNcFoSXayDudf30")

os_client = OpenSearch(
    hosts=[{'host': os_host, 'port': os_port}],
    http_compress=True,
    http_auth=os_auth,
    use_ssl=True,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)

os_index_name = "rag-index"

settings = {
    "settings": {
        "index": {
            "knn": True,
            "knn.space_type": "cosinesimil"
        }
    }
}
os_client.indices.create(index=os_index_name, body=settings)

properties = {
    "properties": {
        "vector_field": {
            "type": "knn_vector",
            "dimension": 384
        },
        "text": {
            "type": "keyword"
        }
    }
}
os_client.indices.put_mapping(index=os_index_name, body=properties)

4. Aggregate source documents. In this example, we select a list of web content that we want our application to use as relevant information to provide accurate answers:

content_links = [
    "https://discourse.ubuntu.com/t/ubuntu-pro-faq/34042"
]

5. Load the document contents into memory and split them into chunks. This will allow us to create embeddings from the selected documents and upload them to the index we created.

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter

loader = WebBaseLoader(content_links)
htmls = loader.load()

text_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
    separator="\n")
docs = text_splitter.split_documents(htmls)

6. Create embeddings for the text chunks and store them in the vector index.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import OpenSearchVectorSearch

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L12-v2",
    encode_kwargs={'normalize_embeddings': False})

docsearch = OpenSearchVectorSearch.from_documents(docs, embeddings,
    ef_construction=256,
    engine="faiss",
    space_type="innerproduct",
    m=48,
    opensearch_url=os_url,
    index_name=os_index_name,
    http_auth=os_auth,
    verify_certs=False)

7. Use similarity search to retrieve the documents that provide context for your query. The search engine performs an approximate k-NN search, for example using the cosine similarity formula, and returns the documents relevant to your question.

query = """
What is Pro?
"""

similar_docs = docsearch.similarity_search(query, k=2,
    raw_response=True,
    search_type="approximate_search",
    space_type="cosinesimil")

8. Prepare your LLM.
We use a simple example with a Hugging Face pipeline to load an LLM.

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.llms import HuggingFacePipeline

model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir="model",
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="llm/tokenizer")

pl = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=2048
)

llm = HuggingFacePipeline(pipeline=pl)

9. Create a prompt template. It will define the expectations for the response and specify that we will provide context for an accurate answer.

from langchain import PromptTemplate

question_prompt_template = """
You are a friendly chatbot assistant that responds in a conversational manner to user's questions.
Respond in short but complete answers unless specifically asked by the user to elaborate on something.
Use History and Context to inform your answers.
Context:
---------
{context}
---------
Question: {question}
Helpful Answer:"""

QUESTION_PROMPT = PromptTemplate(
    template=question_prompt_template, input_variables=["context", "question"]
)

10. Query the LLM to answer your question using the context documents retrieved from OpenSearch.

from langchain.chains.question_answering import load_qa_chain

question = "What is Pro?"

chain = load_qa_chain(llm, chain_type="stuff", prompt=QUESTION_PROMPT)
chain.run(input_documents=similar_docs, question=question)

Conclusion

Retrieval-augmented generation (RAG) is a method that enables users to converse with data repositories. It's a tool that can revolutionise how you access and utilise data, as we showed in our tutorial. With RAG, you can improve data retrieval, enhance knowledge sharing, and enrich the results of your LLMs to give more contextually relevant, insightful responses that better reflect the most up-to-date information in your organisation.

The benefits of better LLMs that can access your knowledge base are as obvious as they are alluring: you gain better customer support, employee training and developer productivity. On top of that, you ensure that your teams get LLM answers and results that reflect accurate, up-to-date policy and information rather than generalised or even outright useless answers.

As we showed, Charmed OpenSearch is a simple and robust technology that can enable RAG capabilities. With it (and our helpful tutorial), any business can leverage RAG to transform their technical or policy manuals and logs into comprehensive knowledge bases.

Enterprise-grade and fully supported OpenSearch solution

Charmed OpenSearch is available for the open-source community. Canonical's team of experts can help you get started with it as the vector database to leverage the power of the k-NN search for your LLM applications at any scale. Contact Canonical if you have questions.

Watch the webinar: Future-proof AI applications with OpenSearch as a vector database

View the full article
  5. Because human-machine interaction using natural language is now possible with large language models (LLMs), more data teams and developers can bring AI to their daily workflows. To do this efficiently and securely, teams must decide how to combine the knowledge of pre-trained LLMs with their organization's private enterprise data in order to deal with the hallucinations (that is, incorrect responses) that LLMs can generate because they have only been trained on data available up to a certain date.

To reduce these AI hallucinations, LLMs can be combined with private data sets via processes that either don't require LLM customization (such as prompt engineering or retrieval augmented generation) or that do require customization (such as fine-tuning or retraining). To decide where to start, it is important to weigh the resources and time it takes to customize AI models against the required timelines to show ROI on generative AI investments. While every organization should keep both options on the table, the key to delivering value quickly is to identify and deploy use cases that can be served by prompt engineering and retrieval augmented generation (RAG), as these are fast and cost-effective approaches to getting value from enterprise data with LLMs.

To empower organizations to deliver fast wins with generative AI while keeping data secure when using LLMs, we are excited to announce that Snowflake Cortex LLM functions are now available in public preview for select AWS and Azure regions. With Snowflake Cortex, a fully managed service that runs on NVIDIA GPU-accelerated compute, there is no need to set up integrations, manage infrastructure or move data outside of the Snowflake governance boundary to use the power of industry-leading LLMs from Mistral AI, Meta and more.

So how does Snowflake Cortex make AI easy, whether you are doing prompt engineering or RAG? Let's dive into the details and check out some code along the way.

To prompt or not to prompt

In Snowflake Cortex, there are task-specific functions that work out of the box without the need to define a prompt. Specifically, teams can quickly and cost-effectively execute tasks such as translation, sentiment analysis and summarization. All that an analyst or any other user familiar with SQL needs to do is point the specific function below to a column of a table containing text data, and voila! Snowflake Cortex functions take care of the rest: no manual orchestration, data formatting or infrastructure to manage. This is particularly useful for teams constantly working with product reviews, surveys, call transcripts and other long-text data sources traditionally underutilized within marketing, sales and customer support teams.

SELECT SNOWFLAKE.CORTEX.SUMMARIZE(review_text) FROM reviews_table LIMIT 10;

Of course, there are going to be many use cases where customization via prompts becomes useful. For example:

Custom text summaries in JSON format
Turning email domains into rich data sets
Building data quality agents using LLMs

All of these and more can be accomplished quickly with the power of industry-leading foundation models from Mistral AI (Mistral Large, Mixtral 8x7B, Mistral 7B), Google (Gemma-7b) and Meta (Llama 2 70B). All of these foundation LLMs are accessible via the COMPLETE function, which, just like any other Snowflake Cortex function, can run on a table with multiple rows without any manual orchestration or LLM throughput management.
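As a rough illustration (not part of the original post), the same task-specific functions can also be called from application code through the Snowflake Python connector. The connection parameters, table and column names below are placeholders.

# Hypothetical sketch: calling Cortex task-specific functions from Python.
# Connection parameters, table and column names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="my_password",    # placeholder
    warehouse="my_wh",
    database="my_db",
    schema="public",
)

cur = conn.cursor()

# Sentiment score (roughly -1 to 1) and an English translation for each review.
cur.execute("""
    SELECT
        SNOWFLAKE.CORTEX.SENTIMENT(review_text),
        SNOWFLAKE.CORTEX.TRANSLATE(review_text, 'de', 'en')
    FROM reviews_table
    LIMIT 10
""")

for sentiment, translation in cur:
    print(sentiment, translation)

cur.close()
conn.close()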
Figure 1: Multi-task accuracy of industry-leading LLMs based on the MMLU benchmark. Source

SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT('Summarize this product review in less than 100 words. Put the product name, defect and summary in JSON format: <review>', content, '</review>')
) FROM reviews LIMIT 10;

For use cases such as chatbots on top of documents, it may be costly to put all the documents into the prompt as context. In such a scenario, a different approach can be more cost-effective by minimizing the volume of tokens going into the LLM (as a general rule of thumb, 75 words approximately equals 100 tokens). A popular framework that solves this problem without requiring changes to the LLM is RAG, which is easy to do in Snowflake.

What is RAG?

Let's go over the basics of RAG before jumping into how to do this in Snowflake. RAG is a popular framework in which an LLM gets access to a specific knowledge base with the most up-to-date, accurate information available before generating a response. Because there is no need to retrain the model, this extends the capability of any LLM to specific domains in a cost-effective way. To deploy this retrieval, augmentation and generation framework, teams need a combination of:

Client / app UI: This is where the end user, such as a business decision-maker, interacts with the knowledge base, typically in the form of a chat service.

Context repository: This is where relevant data sources are aggregated, governed and continuously updated as needed to provide an up-to-date knowledge repository. This content needs to be inserted into an automated pipeline that chunks (that is, breaks documents into smaller pieces) and embeds the text into a vector store.

Vector search: This requires the combination of a vector store, which maintains the numerical or vector representation of the knowledge base, and semantic search to provide easy retrieval of the chunks most relevant to the question.

LLM inference: The combination of these enables teams to embed the question and the context to find the most relevant information and generate contextualized responses using a conversational LLM.

Figure 2: Generalized RAG framework from question to contextualized answer.

From RAG to rich LLM apps in minutes with Snowflake Cortex

Now that we understand how RAG works in general, how can we apply it to Snowflake? Using the Snowflake platform's rich foundation for data governance and management, which includes the VECTOR data type (in private preview), developing and deploying an end-to-end AI app using RAG is possible without integrations, infrastructure management or data movement, using three key features:

Figure 3: Key Snowflake features needed to build end-to-end RAG in Snowflake.

Here is how these features map to the key architecture components of a RAG framework:

Client / app UI: Use Streamlit in Snowflake out-of-the-box chat elements to quickly build and share user interfaces, all in Python.

Context repository: The knowledge repository can be easily updated and governed using Snowflake stages. Once documents are loaded, all of your data preparation, including generating chunks (smaller, contextually rich blocks of text), can be done with Snowpark. For the chunking in particular, teams can seamlessly use LangChain as part of a Snowpark user-defined function.

Vector search: Thanks to the native support of VECTOR as a data type in Snowflake, there is no need to integrate and govern a separate store or service. Store VECTOR data in Snowflake tables and execute similarity queries with system-defined similarity functions (L2, cosine or inner-product distance).

LLM inference: Snowflake Cortex completes the workflow with serverless functions for embedding and text completion inference (using Mistral AI, Llama or Gemma LLMs). A rough sketch combining these pieces follows at the end of this post.

Figure 4: End-to-end RAG framework in Snowflake.

Show me the code

Ready to try Snowflake Cortex and its tightly integrated ecosystem of features that enable fast prototyping and agile deployment of AI apps in Snowflake? Get started with one of these resources:

Snowflake Cortex LLM functions documentation
Run 3 useful LLM inference jobs in 10 minutes with Snowflake Cortex
Build a chat-with-your-documents LLM app using RAG with Snowflake Cortex

To watch live demos and ask questions of Snowflake Cortex experts, sign up for one of these events:

Snowflake Cortex Live Ask Me Anything (AMA)
Snowflake Cortex RAG hands-on lab

Want to network with peers and learn from other industry and Snowflake experts about how to use the latest generative AI features? Make sure to join us at Snowflake Data Cloud Summit in San Francisco this June!

The post Easy and Secure LLM Inference and Retrieval Augmented Generation (RAG) Using Snowflake Cortex appeared first on Snowflake.

View the full article
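To make the feature mapping above concrete, here is a minimal, hypothetical sketch of the retrieval and generation step from Python. The doc_chunks table and its columns, the connection parameters, the embedding function and the model names are assumptions for illustration; check the current Cortex documentation for the exact function names available in your region and release.

# Hypothetical sketch of a RAG query in Snowflake: embed the question,
# pick the most similar chunk by cosine similarity, and generate an answer.
# Table name (doc_chunks), columns (chunk, chunk_embedding) and function/model
# names are assumptions, not taken from the original post.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",  # placeholders
    warehouse="my_wh", database="my_db", schema="public",
)

question = "How do I rotate my credentials?"

sql = """
WITH best_chunk AS (
    SELECT chunk
    FROM doc_chunks
    ORDER BY VECTOR_COSINE_SIMILARITY(
        chunk_embedding,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', %(q)s)
    ) DESC
    LIMIT 1
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-7b',
    CONCAT('Answer the question using only this context: ', chunk,
           ' Question: ', %(q)s)
)
FROM best_chunk
"""

cur = conn.cursor()
cur.execute(sql, {"q": question})
print(cur.fetchone()[0])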