Jump to content

Search the Community

Showing results for tags 'llms'.

  • Search By Tags

    Type tags separated by commas.
  • Search By Author

Content Type


Forums

There are no results to display.

There are no results to display.


Find results in...

Find results that contain...


Date Created

  • Start

    End


Last Updated

  • Start

    End


Filter by number of...

Joined

  • Start

    End


Group


Website URL


LinkedIn Profile URL


About Me


Cloud Platforms


Cloud Experience


Development Experience


Current Role


Skills


Certifications


Favourite Tools


Interests

  1. Find out all about Google Cloud's latest learning path, and learn how to use the Gemini language model in the Google Cloud.View the full article
  2. Apple is developing its own large language model (LLM) that runs on-device to prioritize speed and privacy, Bloomberg's Mark Gurman reports. Writing in his "Power On" newsletter, Gurman said that Apple's LLM underpins upcoming generative AI features. "All indications" apparently suggests that it will run entirely on-device, rather than via the cloud like most existing AI services. Since they will run on-device, Apple's AI tools may be less capable in certain instances than its direct cloud-based rivals, but Gurman suggested that the company could "fill in the gaps" by licensing technology from Google and other AI service providers. Last month, Gurman reported that Apple was in discussions with Google to integrate its Gemini AI engine into the iPhone as part of iOS 18. The main advantages of on-device processing will be quicker response times and superior privacy compared to cloud-based solutions. Apple's marketing strategy for its AI technology will apparently be based around how it can be useful to users' daily lives, rather than its power. Apple's broader AI strategy is expected to be revealed alongside previews of its major software updates at WWDC in June.Tags: Bloomberg, Artificial Intelligence, Mark Gurman This article, "Gurman: Apple Working on On-Device LLM for Generative AI Features" first appeared on MacRumors.com Discuss this article in our forums View the full article
  3. Discover the basics of using Gemini with Python via VertexAI, creating APIs with FastAPI, data validation with Pydantic and the fundamentals of Retrieval-Augmented Generation (RAG)Photo by Kenny Eliason on UnsplashWithin this article, I share some of the basics to create a LLM-driven web-application, using various technologies, such as: Python, FastAPI, Pydantic, VertexAI and more. You will learn how to create such a project from the very beginning and get an overview of the underlying concepts, including Retrieval-Augmented Generation (RAG).Disclaimer: I am using data from The Movie Database within this project. The API is free to use for non-commercial purposes and complies with the Digital Millennium Copyright Act (DMCA). For further information about TMDB data usage, please read the official FAQ. Table of contents– Inspiration – System Architecture – Understanding Retrieval-Augmented Generation (RAG) – Python projects with Poetry – Create the API with FastAPI – Data validation and quality with Pydantic – TMDB client with httpx – Gemini LLM client with VertexAI – Modular prompt generator with Jinja – Frontend – API examples – Conclusion The best way to share this knowledge is through a practical example. Hence, I’ll use my project Gemini Movie Detectives to cover the various aspects. The project was created as part of the Google AI Hackathon 2024, which is still running while I am writing this. Gemini Movie Detectives (by author)Gemini Movie Detectives is a project aimed at leveraging the power of the Gemini Pro model via VertexAI to create an engaging quiz game using the latest movie data from The Movie Database (TMDB). Part of the project was also to make it deployable with Docker and to create a live version. Try it yourself: movie-detectives.com. Keep in mind that this is a simple prototype, so there might be unexpected issues. Also, I had to add some limitations in order to control costs that might be generated by using GCP and VertexAI. Gemini Movie Detectives (by author)The project is fully open-source and is split into two separate repositories: Github repository for backend: https://github.com/vojay-dev/gemini-movie-detectives-api Github repository for frontend: https://github.com/vojay-dev/gemini-movie-detectives-uiThe focus of the article is the backend project and underlying concepts. It will therefore only briefly explain the frontend and its components. In the following video, I also give an overview over the project and its components: https://medium.com/media/bf4270fa881797cd515421b7bb646d1d/hrefInspirationGrowing up as a passionate gamer and now working as a Data Engineer, I’ve always been drawn to the intersection of gaming and data. With this project, I combined two of my greatest passions: gaming and data. Back in the 90’ I always enjoyed the video game series You Don’t Know Jack, a delightful blend of trivia and comedy that not only entertained but also taught me a thing or two. Generally, the usage of games for educational purposes is another concept that fascinates me. In 2023, I organized a workshop to teach kids and young adults game development. They learned about mathematical concepts behind collision detection, yet they had fun as everything was framed in the context of gaming. It was eye-opening that gaming is not only a huge market but also holds a great potential for knowledge sharing. With this project, called Movie Detectives, I aim to showcase the magic of Gemini, and AI in general, in crafting engaging trivia and educational games, but also how game design can profit from these technologies in general. By feeding the Gemini LLM with accurate and up-to-date movie metadata, I could ensure the accuracy of the questions from Gemini. An important aspect, because without this Retrieval-Augmented Generation (RAG) methodology to enrich queries with real-time metadata, there’s a risk of propagating misinformation — a typical pitfall when using AI for this purpose. Another game-changer lies in the modular prompt generation framework I’ve crafted using Jinja templates. It’s like having a Swiss Army knife for game design — effortlessly swapping show master personalities to tailor the game experience. And with the language module, translating the quiz into multiple languages is a breeze, eliminating the need for costly translation processes. Taking that on a business perspective, it can be used to reach a much broader audience of customers, without the need of expensive translation processes. From a business standpoint, this modularization opens doors to a wider customer base, transcending language barriers without breaking a sweat. And personally, I’ve experienced firsthand the transformative power of these modules. Switching from the default quiz master to the dad-joke-quiz-master was a riot — a nostalgic nod to the heyday of You Don’t Know Jack, and a testament to the versatility of this project. Movie Detectives — Example: Santa Claus personality (by author)System ArchitectureBefore we jump into details, let’s get an overview of how the application was built. Tech Stack: Backend Python 3.12 + FastAPI API developmenthttpx for TMDB integrationJinja templating for modular prompt generationPydantic for data modeling and validationPoetry for dependency managementDocker for deploymentTMDB API for movie dataVertexAI and Gemini for generating quiz questions and evaluating answersRuff as linter and code formatter together with pre-commit hooksGithub Actions to automatically run tests and linter on every pushTech Stack: Frontend VueJS 3.4 as the frontend frameworkVite for frontend toolingEssentially, the application fetches up-to-date movie metadata from an external API (TMDB), constructs a prompt based on different modules (personality, language, …), enriches this prompt with the metadata and that way, uses Gemini to initiate a movie quiz in which the user has to guess the correct title. The backend infrastructure is built with FastAPI and Python, employing the Retrieval-Augmented Generation (RAG) methodology to enrich queries with real-time metadata. Utilizing Jinja templating, the backend modularizes prompt generation into base, personality, and data enhancement templates, enabling the generation of accurate and engaging quiz questions. The frontend is powered by Vue 3 and Vite, supported by daisyUI and Tailwind CSS for efficient frontend development. Together, these tools provide users with a sleek and modern interface for seamless interaction with the backend. In Movie Detectives, quiz answers are interpreted by the Language Model (LLM) once again, allowing for dynamic scoring and personalized responses. This showcases the potential of integrating LLM with RAG in game design and development, paving the way for truly individualized gaming experiences. Furthermore, it demonstrates the potential for creating engaging quiz trivia or educational games by involving LLM. Adding and changing personalities or languages is as easy as adding more Jinja template modules. With very little effort, this can change the full game experience, reducing the effort for developers. System Overview (by author)As can be seen in the overview, Retrieval-Augmented Generation (RAG) is one of the essential ideas of the backend. Let’s have a closer look at this particular paradigm. Understanding Retrieval-Augmented Generation (RAG)In the realm of Large Language Models (LLM) and AI, one paradigm becoming more and more popular is Retrieval-Augmented Generation (RAG). But what does RAG entail, and how does it influence the landscape of AI development? At its essence, RAG enhances LLM systems by incorporating external data to enrich their predictions. Which means, you pass relevant context to the LLM as an additional part of the prompt, but how do you find relevant context? Usually, this data can be automatically retrieved from a database with vector search or dedicated vector databases. Vector databases are especially useful, since they store data in a way, so that it can be queried for similar data quickly. The LLM then generates the output based on both, the query and the retrieved documents. Picture this: you have an LLM capable of generating text based on a given prompt. RAG takes this a step further by infusing additional context from external sources, like up-to-date movie data, to enhance the relevance and accuracy of the generated text. Let’s break down the key components of RAG: LLMs: LLMs serve as the backbone of RAG workflows. These models, trained on vast amounts of text data, possess the ability to understand and generate human-like text.Vector Indexes for contextual enrichment: A crucial aspect of RAG is the use of vector indexes, which store embeddings of text data in a format understandable by LLMs. These indexes allow for efficient retrieval of relevant information during the generation process. In the context of the project this could be a database of movie metadata.Retrieval process: RAG involves retrieving pertinent documents or information based on the given context or prompt. This retrieved data acts as the additional input for the LLM, supplementing its understanding and enhancing the quality of generated responses. This could be getting all relevant information known and connected to a specific movie.Generative Output: With the combined knowledge from both the LLM and the retrieved context, the system generates text that is not only coherent but also contextually relevant, thanks to the augmented data.RAG architecture (by author)While in the Gemini Movie Detectives project, the prompt is enhanced with external API data from The Movie Database, RAG typically involves the use of vector indexes to streamline this process. It is using much more complex documents as well as a much higher amount of data for enhancement. Thus, these indexes act like signposts, guiding the system to relevant external sources quickly. In this project, it is therefore a mini version of RAG but showing the basic idea at least, demonstrating the power of external data to augment LLM capabilities. In more general terms, RAG is a very important concept, especially when crafting trivia quizzes or educational games using LLMs like Gemini. This concept can avoid the risk of false positives, asking wrong questions, or misinterpreting answers from the users. Here are some open-source projects that might be helpful when approaching RAG in one of your projects: txtai: All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows.LangChain: LangChain is a framework for developing applications powered by large language models (LLMs).Qdrant: Vector Search Engine for the next generation of AI applications.Weaviate: Weaviate is a cloud-native, open source vector database that is robust, fast, and scalable.Of course, with the potential value of this approach for LLM-based applications, there are many more open- and close-source alternatives, but with these, you should be able to get your research on the topic started. Python projects with PoetryNow that the main concepts are clear, let’s have a closer look how the project was created and how dependencies are managed in general. The three main tasks Poetry can help you with are: Build, Publish and Track. The idea is to have a deterministic way to manage dependencies, to share your project and to track dependency states. Photo by Kat von Wood on UnsplashPoetry also handles the creation of virtual environments for you. Per default, those are in a centralized folder within your system. However, if you prefer to have the virtual environment of project in the project folder, like I do, it is a simple config change: poetry config virtualenvs.in-project trueWith poetry new you can then create a new Python project. It will create a virtual environment linking you systems default Python. If you combine this with pyenv, you get a flexible way to create projects using specific versions. Alternatively, you can also tell Poetry directly which Python version to use: poetry env use /full/path/to/python. Once you have a new project, you can use poetry add to add dependencies to it. With this, I created the project for Gemini Movie Detectives: poetry config virtualenvs.in-project true poetry new gemini-movie-detectives-api cd gemini-movie-detectives-api poetry add 'uvicorn[standard]' poetry add fastapi poetry add pydantic-settings poetry add httpx poetry add 'google-cloud-aiplatform>=1.38' poetry add jinja2The metadata about your projects, including the dependencies with the respective versions, are stored in the poetry.toml and poetry.lock files. I added more dependencies later, which resulted in the following poetry.toml for the project: [tool.poetry] name = "gemini-movie-detectives-api" version = "0.1.0" description = "Use Gemini Pro LLM via VertexAI to create an engaging quiz game incorporating TMDB API data" authors = ["Volker Janz <volker@janz.sh>"] readme = "README.md" [tool.poetry.dependencies] python = "^3.12" fastapi = "^0.110.1" uvicorn = {extras = ["standard"], version = "^0.29.0"} python-dotenv = "^1.0.1" httpx = "^0.27.0" pydantic-settings = "^2.2.1" google-cloud-aiplatform = ">=1.38" jinja2 = "^3.1.3" ruff = "^0.3.5" pre-commit = "^3.7.0" [build-system] requires = ["poetry-core"] build-backend = "poetry.core.masonry.api"Create the API with FastAPIFastAPI is a Python framework that allows for rapid API development. Built on open standards, it offers a seamless experience without new syntax to learn. With automatic documentation generation, robust validation, and integrated security, FastAPI streamlines development while ensuring great performance. Photo by Florian Steciuk on UnsplashImplementing the API for the Gemini Movie Detectives projects, I simply started from a Hello World application and extended it from there. Here is how to get started: from fastapi import FastAPI app = FastAPI() @app.get("/") def read_root(): return {"Hello": "World"}Assuming you also keep the virtual environment within the project folder as .venv/ and use uvicorn, this is how to start the API with the reload feature enabled, in order to test code changes without the need of a restart: source .venv/bin/activate uvicorn gemini_movie_detectives_api.main:app --reload curl -s localhost:8000 | jq .If you have not yet installed jq, I highly recommend doing it now. I might cover this wonderful JSON Swiss Army knife in a future article. This is how the response looks like: Hello FastAPI (by author)From here, you can develop your API endpoints as needed. This is how the API endpoint implementation to start a movie quiz in Gemini Movie Detectives looks like for example: @app.post('/quiz') @rate_limit @retry(max_retries=settings.quiz_max_retries) def start_quiz(quiz_config: QuizConfig = QuizConfig()): movie = tmdb_client.get_random_movie( page_min=_get_page_min(quiz_config.popularity), page_max=_get_page_max(quiz_config.popularity), vote_avg_min=quiz_config.vote_avg_min, vote_count_min=quiz_config.vote_count_min ) if not movie: logger.info('could not find movie with quiz config: %s', quiz_config.dict()) raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail='No movie found with given criteria') try: genres = [genre['name'] for genre in movie['genres']] prompt = prompt_generator.generate_question_prompt( movie_title=movie['title'], language=get_language_by_name(quiz_config.language), personality=get_personality_by_name(quiz_config.personality), tagline=movie['tagline'], overview=movie['overview'], genres=', '.join(genres), budget=movie['budget'], revenue=movie['revenue'], average_rating=movie['vote_average'], rating_count=movie['vote_count'], release_date=movie['release_date'], runtime=movie['runtime'] ) chat = gemini_client.start_chat() logger.debug('starting quiz with generated prompt: %s', prompt) gemini_reply = gemini_client.get_chat_response(chat, prompt) gemini_question = gemini_client.parse_gemini_question(gemini_reply) quiz_id = str(uuid.uuid4()) session_cache[quiz_id] = SessionData( quiz_id=quiz_id, chat=chat, question=gemini_question, movie=movie, started_at=datetime.now() ) return StartQuizResponse(quiz_id=quiz_id, question=gemini_question, movie=movie) except GoogleAPIError as e: raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=f'Google API error: {e}') except Exception as e: raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=f'Internal server error: {e}')Within this code, you can see already three of the main components of the backend: tmdb_client: A client I implemented using httpx to fetch data from The Movie Database (TMDB).prompt_generator: A class that helps to generate modular prompts based on Jinja templates.gemini_client: A client to interact with the Gemini LLM via VertexAI in Google Cloud.We will look at these components in detail later, but first some more helpful insights regarding the usage of FastAPI. FastAPI makes it really easy to define the HTTP method and data to be transferred to the backend. For this particular function, I expect a POST request as this creates a new quiz. This can be done with the post decorator: @app.post('/quiz')Also, I am expecting some data within the request sent as JSON in the body. In this case, I am expecting an instance of QuizConfig as JSON. I simply defined QuizConfig as a subclass of BaseModel from Pydantic (will be covered later) and with that, I can pass it in the API function and FastAPI will do the rest: class QuizConfig(BaseModel): vote_avg_min: float = Field(5.0, ge=0.0, le=9.0) vote_count_min: float = Field(1000.0, ge=0.0) popularity: int = Field(1, ge=1, le=3) personality: str = Personality.DEFAULT.name language: str = Language.DEFAULT.name # ... def start_quiz(quiz_config: QuizConfig = QuizConfig()):Furthermore, you might notice two custom decorators: @rate_limit @retry(max_retries=settings.quiz_max_retries)These I implemented to reduce duplicate code. They wrap the API function to retry the function in case of errors and to introduce a global rate limit of how many movie quizzes can be started per day. What I also liked personally is the error handling with FastAPI. You can simply raise a HTTPException, give it the desired status code and the user will then receive a proper response, for example, if no movie could be found with a given configuration: raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail='No movie found with given criteria')With this, you should have an overview of creating an API like the one for Gemini Movie Detectives with FastAPI. Keep in mind: all code is open-source, so feel free to have a look at the API repository on Github. Data validation and quality with PydanticOne of the main challenges with todays AI/ML projects is data quality. But that does not only apply to ETL/ELT pipelines, which prepare datasets to be used in model training or prediction, but also to the AI/ML application itself. Using Python for example usually enables Data Engineers and Scientist to get a reasonable result with little code but being (mostly) dynamically typed, Python lacks of data validation when used in a naive way. That is why in this project, I combined FastAPI with Pydantic, a powerful data validation library for Python. The goal was to make the API lightweight but strict and strong, when it comes to data quality and validation. Instead of plain dictionaries for example, the Movie Detectives API strictly uses custom classes inherited from the BaseModel provided by Pydantic. This is the configuration for a quiz for example: class QuizConfig(BaseModel): vote_avg_min: float = Field(5.0, ge=0.0, le=9.0) vote_count_min: float = Field(1000.0, ge=0.0) popularity: int = Field(1, ge=1, le=3) personality: str = Personality.DEFAULT.name language: str = Language.DEFAULT.nameThis example illustrates, how not only correct type is ensured, but also further validation is applied to the actual values. Furthermore, up-to-date Python features, like StrEnum are used to distinguish certain types, like personalities: class Personality(StrEnum): DEFAULT = 'default.jinja' CHRISTMAS = 'christmas.jinja' SCIENTIST = 'scientist.jinja' DAD = 'dad.jinja'Also, duplicate code is avoided by defining custom decorators. For example, the following decorator limits the number of quiz sessions today, to have control over GCP costs: call_count = 0 last_reset_time = datetime.now() def rate_limit(func: callable) -> callable: @wraps(func) def wrapper(*args, **kwargs) -> callable: global call_count global last_reset_time # reset call count if the day has changed if datetime.now().date() > last_reset_time.date(): call_count = 0 last_reset_time = datetime.now() if call_count >= settings.quiz_rate_limit: raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail='Daily limit reached') call_count += 1 return func(*args, **kwargs) return wrapperIt is then simply applied to the related API function: @app.post('/quiz') @rate_limit @retry(max_retries=settings.quiz_max_retries) def start_quiz(quiz_config: QuizConfig = QuizConfig()):The combination of up-to-date Python features and libraries, such as FastAPI, Pydantic or Ruff makes the backend less verbose but still very stable and ensures a certain data quality, to ensure the LLM output has the expected quality. TMDB client with httpxThe TMDB Client class is using httpx to perform requests against the TMDB API. httpx is a rising star in the world of Python libraries. While requests has long been the go-to choice for making HTTP requests, httpx offers a valid alternative. One of its key strengths is asynchronous functionality. httpx allows you to write code that can handle multiple requests concurrently, potentially leading to significant performance improvements in applications that deal with a high volume of HTTP interactions. Additionally, httpx aims for broad compatibility with requests, making it easier for developers to pick it up. In case of Gemini Movie Detectives, there are two main requests: get_movies: Get a list of random movies based on specific settings, like average number of votesget_movie_details: Get details for a specific movie to be used in a quizIn order to reduce the amount of external requests, the latter one uses the lru_cache decorator, which stands for “Least Recently Used cache”. It’s used to cache the results of function calls so that if the same inputs occur again, the function doesn’t have to recompute the result. Instead, it returns the cached result, which can significantly improve the performance of the program, especially for functions with expensive computations. In our case, we cache the details for 1024 movies, so if 2 players get the same movie, we do not need to make a request again: @lru_cache(maxsize=1024) def get_movie_details(self, movie_id: int): response = httpx.get(f'https://api.themoviedb.org/3/movie/{movie_id}', headers={ 'Authorization': f'Bearer {self.tmdb_api_key}' }, params={ 'language': 'en-US' }) movie = response.json() movie['poster_url'] = self.get_poster_url(movie['poster_path']) return movieAccessing data from The Movie Database (TMDB) is for free for non-commercial usage, you can simply generate an API key and start making requests. Gemini LLM client with VertexAIBefore Gemini via VertexAI can be used, you need a Google Cloud project with VertexAI enabled and a Service Account with sufficient access together with its JSON key file. Create GCP project (by author)After creating a new project, navigate to APIs & Services –> Enable APIs and service –> search for VertexAI API –> Enable. Enable VertexAI (by author)To create a Service Account, navigate to IAM & Admin –> Service Accounts –> Create service account. Choose a proper name and go to the next step. Create Service Account (by author)Now ensure to assign the account the pre-defined role Vertex AI User. Assign correct role (by author)Finally you can generate and download the JSON key file by clicking on the new user –> Keys –> Add Key –> Create new key –> JSON. With this file, you are good to go. Create JSON key file (by author)Using Gemini from Google with Python via VertexAI starts by adding the necessary dependency to the project: poetry add 'google-cloud-aiplatform>=1.38'With that, you can import and initialize vertexai with your JSON key file. Also you can load a model, like the newly released Gemini 1.5 Pro model, and start a chat session like this: import vertexai from google.oauth2.service_account import Credentials from vertexai.generative_models import GenerativeModel project_id = "my-project-id" location = "us-central1" credentials = Credentials.from_service_account_file("credentials.json") model = "gemini-1.0-pro" vertexai.init(project=project_id, location=location, credentials=credentials) model = GenerativeModel(model) chat_session = model.start_chat()You can now use chat.send_message() to send a prompt to the model. However, since you get the response in chunks of data, I recommend using a little helper function, so that you simply get the full response as one String: def get_chat_response(chat: ChatSession, prompt: str) -> str: text_response = [] responses = chat.send_message(prompt, stream=True) for chunk in responses: text_response.append(chunk.text) return ''.join(text_response)A full example can then look like this: import vertexai from google.oauth2.service_account import Credentials from vertexai.generative_models import GenerativeModel, ChatSession project_id = "my-project-id" location = "us-central1" credentials = Credentials.from_service_account_file("credentials.json") model = "gemini-1.0-pro" vertexai.init(project=project_id, location=location, credentials=credentials) model = GenerativeModel(model) chat_session = model.start_chat() def get_chat_response(chat: ChatSession, prompt: str) -> str: text_response = [] responses = chat.send_message(prompt, stream=True) for chunk in responses: text_response.append(chunk.text) return ''.join(text_response) response = get_chat_response( chat_session, "How to say 'you are awesome' in Spanish?" ) print(response)Running this, Gemini gave me the following response: You are awesome (by author)I agree with Gemini: Eres increíbleAnother hint when using this: you can also configure the model generation by passing a configuration to the generation_config parameter as part of the send_message function. For example: generation_config = { 'temperature': 0.5 } responses = chat.send_message( prompt, generation_config=generation_config, stream=True )I am using this in Gemini Movie Detectives to set the temperature to 0.5, which gave me best results. In this context temperature means: how creative are the generated responses by Gemini. The value must be between 0.0 and 1.0, whereas closer to 1.0 means more creativity. One of the main challenges apart from sending a prompt and receive the reply from Gemini is to parse the reply in order to extract the relevant information. One learning from the project is: Specify a format for Gemini, which does not rely on exact words but uses key symbols to separate information elementsFor example, the question prompt for Gemini contains this instruction: Your reply must only consist of three lines! You must only reply strictly using the following template for the three lines: Question: <Your question> Hint 1: <The first hint to help the participants> Hint 2: <The second hint to get the title more easily>The naive approach would be, to parse the answer by looking for a line that starts with Question:. However, if we use another language, like German, the reply would look like: Antwort:. Instead, focus on the structure and key symbols. Read the reply like this: It has 3 linesThe first line is the questionSecond line the first hintThird line the second hintKey and value are separated by :With this approach, the reply can be parsed language agnostic, and this is my implementation in the actual client: @staticmethod def parse_gemini_question(gemini_reply: str) -> GeminiQuestion: result = re.findall(r'[^:]+: ([^\n]+)', gemini_reply, re.MULTILINE) if len(result) != 3: msg = f'Gemini replied with an unexpected format. Gemini reply: {gemini_reply}' logger.warning(msg) raise ValueError(msg) question = result[0] hint1 = result[1] hint2 = result[2] return GeminiQuestion(question=question, hint1=hint1, hint2=hint2)In the future, the parsing of responses will become even easier. During the Google Cloud Next ’24 conference, Google announced that Gemini 1.5 Pro is now publicly available and with that, they also announced some features including a JSON mode to have responses in JSON format. Checkout this article for more details. Apart from that, I wrapped the Gemini client into a configurable class. You can find the full implementation open-source on Github. Modular prompt generator with JinjaThe Prompt Generator is a class wich combines and renders Jinja2 template files to create a modular prompt. There are two base templates: one for generating the question and one for evaluating the answer. Apart from that, there is a metadata template to enrich the prompt with up-to-date movie data. Furthermore, there are language and personality templates, organized in separate folders with a template file for each option. Prompt Generator (by author)Using Jinja2 allows to have advanced features like template inheritance, which is used for the metadata. This makes it easy to extend this component, not only with more options for personalities and languages, but also to extract it into its own open-source project to make it available for other Gemini projects. FrontendThe Gemini Movie Detectives frontend is split into four main components and uses vue-router to navigate between them. The Home component simply displays the welcome message. The Quiz component displays the quiz itself and talks to the API via fetch. To create a quiz, it sends a POST request to api/quiz with the desired settings. The backend is then selecting a random movie based on the user settings, creates the prompt with the modular prompt generator, uses Gemini to generate the question and hints and finally returns everything back to the component so that the quiz can be rendered. Additionally, each quiz gets a session ID assigned in the backend and is stored in a limited LRU cache. For debugging purposes, this component fetches data from the api/sessions endpoint. This returns all active sessions from the cache. This component displays statistics about the service. However, so far there is only one category of data displayed, which is the quiz limit. To limit the costs for VertexAI and GCP usage in general, there is a daily limit of quiz sessions, which will reset with the first quiz of the next day. Data is retrieved form the api/limit endpoint. Vue components (by author)API examplesOf course using the frontend is a nice way to interact with the application, but it is also possible to just use the API. The following example shows how to start a quiz via the API using the Santa Claus / Christmas personality: curl -s -X POST https://movie-detectives.com/api/quiz \ -H 'Content-Type: application/json' \ -d '{"vote_avg_min": 5.0, "vote_count_min": 1000.0, "popularity": 3, "personality": "christmas"}' | jq .{ "quiz_id": "e1d298c3-fcb0-4ebe-8836-a22a51f87dc6", "question": { "question": "Ho ho ho, this movie takes place in a world of dreams, just like the dreams children have on Christmas Eve after seeing Santa Claus! It's about a team who enters people's dreams to steal their secrets. Can you guess the movie? Merry Christmas!", "hint1": "The main character is like a skilled elf, sneaking into people's minds instead of houses. ", "hint2": "I_c_p_i_n " }, "movie": {...} }Movie Detectives — Example: Santa Claus personality (by author)This example shows how to change the language for a quiz: curl -s -X POST https://movie-detectives.com/api/quiz \ -H 'Content-Type: application/json' \ -d '{"vote_avg_min": 5.0, "vote_count_min": 1000.0, "popularity": 3, "language": "german"}' | jq .{ "quiz_id": "7f5f8cf5-4ded-42d3-a6f0-976e4f096c0e", "question": { "question": "Stellt euch vor, es gäbe riesige Monster, die auf der Erde herumtrampeln, als wäre es ein Spielplatz! Einer ist ein echtes Urviech, eine Art wandelnde Riesenechse mit einem Atem, der so heiß ist, dass er euer Toastbrot in Sekundenschnelle rösten könnte. Der andere ist ein gigantischer Affe, der so stark ist, dass er Bäume ausreißt wie Gänseblümchen. Und jetzt ratet mal, was passiert? Die beiden geraten aneinander, wie zwei Kinder, die sich um das letzte Stück Kuchen streiten! Wer wird wohl gewinnen, die Riesenechse oder der Superaffe? Das ist die Frage, die sich die ganze Welt stellt! ", "hint1": "Der Film spielt in einer Zeit, in der Monster auf der Erde wandeln.", "hint2": "G_dz_ll_ vs. K_ng " }, "movie": {...} }And this is how to answer to a quiz via an API call: curl -s -X POST https://movie-detectives.com/api/quiz/84c19425-c179-4198-9773-a8a1b71c9605/answer \ -H 'Content-Type: application/json' \ -d '{"answer": "Greenland"}' | jq .{ "quiz_id": "84c19425-c179-4198-9773-a8a1b71c9605", "question": {...}, "movie": {...}, "user_answer": "Greenland", "result": { "points": "3", "answer": "Congratulations! You got it! Greenland is the movie we were looking for. You're like a human GPS, always finding the right way!" } }ConclusionAfter I finished the basic project, adding more personalities and languages was so easy with the modular prompt approach, that I was impressed by the possibilities this opens up for game design and development. I could change this game from a pure educational game about movies, into a comedy trivia “You Don’t Know Jack”-like game within a minute by adding another personality. Also, combining up-to-date Python functionality with validation libraries like Pydantic is very powerful and can be used to ensure good data quality for LLM input. And there you have it, folks! You’re now equipped to craft your own LLM-powered web application. Feeling inspired but need a starting point? Check out the open-source code for the Gemini Movie Detectives project: Github repository for backend: https://github.com/vojay-dev/gemini-movie-detectives-api Github repository for frontend: https://github.com/vojay-dev/gemini-movie-detectives-uiThe future of AI-powered applications is bright, and you’re holding the paintbrush! Let’s go make something remarkable. And if you need a break, feel free to try https://movie-detectives.com/. Create an AI-Driven Movie Quiz with Gemini LLM, Python, FastAPI, Pydantic, RAG and more was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article
  4. In March, Snowflake announced exciting releases, including advances in AI and ML with new features in Snowflake Cortex, new governance and privacy features in Snowflake Horizon, and broader developer support with the Snowflake CLI. Read on to learn more about everything we announced last month. Snowflake Cortex LLM Functions – in public preview Snowflake Cortex is an intelligent, fully managed service that delivers state-of-the-art large language models (LLMs) as serverless SQL/Python functions; there are no integrations to set up, data to move or GPUs to provision. In Snowflake Cortex, there are task-specific functions that teams can use to quickly and cost-effectively execute complex tasks, such as translation, sentiment analysis and summarization. Additionally, to build custom apps, teams can use the complete function to run custom prompts using LLMs from Mistral AI, Meta and Google. Learn more. Streamlit Streamlit 1.26 – in public preview We’re excited to announce support for Streamlit version 1.26 within Snowflake. This update, in preview, expands your options for building data apps directly in Snowflake’s secure environment. Now you can leverage the latest features and functionalities available in Streamlit 1.26.0 — including st.chat_input and st.chat_message, two powerful primitives for creating conversational interfaces within your data apps. This addition allows users to interact with your data applications using natural language, making them more accessible and user-friendly. You can also utilize the new features of Streamlit 1.26.0 to create even more interactive and informative data visualizations and dashboards. To learn more and get started, head over to the Snowflake documentation. Snowflake Horizon Sensitive Data Custom Classification – in public preview In addition to using standard classifiers in Snowflake, customers can now also write their own classifiers using SQL with custom logic to define what data is sensitive to their organization. This is an important enhancement to data classification and provides the necessary extensibility that customers need to detect and classify more of their data. Learn more. Data Quality Monitoring – in public preview Data Quality Monitoring is a built-in solution with out-of-the-box metrics, like null counts, time since the object was last updated and count of rows inserted into an object. Customers can even create custom metrics to monitor the quality of data. They can then effectively monitor and report on data quality by defining the frequency it is automatically measured and configure alerts to receive email notifications when quality thresholds are violated. Learn more. Snowflake Data Clean Rooms – generally available in select regions Snowflake Data Clean Rooms allow customers to unlock insights and value through secure data collaboration. Launched as a Snowflake Native App on Snowflake Marketplace, Snowflake Data Clean Rooms are now generally available to customers in AWS East, AWS West and Azure West. Snowflake Data Clean Rooms make it easy to build and use data clean rooms for both technical and non-technical users, with no additional access fees set by Snowflake. Find out more in this blog. DevOps on Snowflake Snowflake CLI – public preview The new Snowflake CLI is an open source tool that empowers developers with a flexible and extensible interface for managing the end-to-end lifecycle of applications across various workloads (Snowpark, Snowpark Container Services, Snowflake Native Applications and Streamlit in Snowflake). It offers features such as user-defined functions, stored procedures, Streamlit integration and direct SQL execution. Learn more. Snowflake Marketplace Snowflake customers can tap into Snowflake Marketplace for access to more than 2,500 live and ready-to-query third-party data, apps and AI products all in one place (as of April 10, 2024). Here are all the providers who launched on Marketplace in March: AI/ML Products Brillersys – Time Series Data Generator Atscale, Inc. – Semantic Modeling Data paretos GmbH – Demand Forecasting App Connectors/SaaS Data HALitics – eCommerce Platform Connector Developer Tools DataOps.live – CI/CD, Automation and DataOps Data Governance, Quality and Cost Optimization Select Labs US Inc. – Snowflake Performance & Cost Optimization Foreground Data Solutions Inc – PII Data Detector CareEvolution – Data Format Transformation Merse, Inc – Snowflake Performance & Cost Optimization Qbrainx – Snowflake Performance & Cost Optimization Yuki – Snowflake Performance Optimization DATAN3RD LLC – Data Quality App Third-Party Data Providers Upper Hand – Sports Facilities & Athletes Data Sporting Group – Sportsbook Data Quiet Data – UK Company Data Manifold Data Mining – Demographics Data in Canada SESAMm – ESG Controversy Data KASPR Datahaus – Internet Quality & Anomaly Data Blitzscaling – Blockchain Data Starlitics – ETF and Mutual Fund Data SFR Analytics – Geographic Data SignalRank – Startup Data GfK SE – Purchasing Power Data —- ​​Forward-Looking Statement This post contains express and implied forward-looking statements, including statements regarding (i) Snowflake’s business strategy, (ii) Snowflake’s products, services, and technology offerings, including those that are under development or not generally available, (iii) market growth, trends, and competitive considerations, and (iv) the integration, interoperability, and availability of Snowflake’s products with and on third-party platforms. These forward-looking statements are subject to a number of risks, uncertainties, and assumptions, including those described under the heading “Risk Factors” and elsewhere in the Quarterly Reports on Form 10-Q and Annual Reports of Form 10-K that Snowflake files with the Securities and Exchange Commission. In light of these risks, uncertainties, and assumptions, actual results could differ materially and adversely from those anticipated or implied in the forward-looking statements. As a result, you should not rely on any forward-looking statements as predictions of future events. © 2024 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature, and service names mentioned herein are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be sponsored or endorsed by, any such holder(s). The post New Snowflake Features Released in March 2024 appeared first on Snowflake. View the full article
  5. BigQuery allows you to analyze your data using a range of large language models (LLMs) hosted in Vertex AI including Gemini 1.0 Pro, Gemini 1.0 Pro Vision and text-bison. These models work well for several tasks such as text summarization, sentiment analysis, etc. using only prompt engineering. However, in some scenarios, additional customization via model fine-tuning is needed, such as when the expected behavior of the model is hard to concisely define in a prompt, or when prompts do not produce expected results consistently enough. Fine-tuning also helps the model learn specific response styles (e.g., terse or verbose), new behaviors (e.g., answering as a specific persona), or to update itself with new information. Today, we are announcing support for customizing LLMs in BigQuery with supervised fine-tuning. Supervised fine-tuning via BigQuery uses a dataset which has examples of input text (the prompt) and the expected ideal output text (the label), and fine-tunes the model to mimic the behavior or task implied from these examples.Let’s see how this works. Feature walkthrough To illustrate model fine-tuning, let’s look at a classification problem using text data. We’ll use a medical transcription dataset and ask our model to classify a given transcript into one of 17 categories, e.g. ‘Allergy/Immunology’, ‘Dentistry’, ‘Cardiovascular/ Pulmonary’, etc. Dataset Our dataset is from mtsamples.com as provided on Kaggle. To fine-tune and evaluate our model, we first create an evaluation table and a training table in BigQuery using a subset of this data available in Cloud Storage as follows: code_block <ListValue: [StructValue([('code', "-- Create a eval table\r\n\r\nLOAD DATA INTO\r\n bqml_tutorial.medical_transcript_eval\r\nFROM FILES( format='NEWLINE_DELIMITED_JSON',\r\n uris = ['gs://cloud-samples-data/vertex-ai/model-evaluation/peft_eval_sample.jsonl'] )\r\n\r\n-- Create a train table\r\n\r\nLOAD DATA INTO\r\n bqml_tutorial.medical_transcript_train\r\nFROM FILES( format='NEWLINE_DELIMITED_JSON',\r\n uris = ['gs://cloud-samples-data/vertex-ai/model-evaluation/peft_train_sample.jsonl'] )"), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3eb42d770ee0>)])]> The training and evaluation dataset has an ‘input_text’ column that contains the transcript, and a ‘output_text’ column that contains the label, or ground truth. Baseline performance of text-bison model First, let’s establish a performance baseline for the text-bison model. You can create a remote text-bison model in BigQuery using a SQL statement like the one below. For more details on creating a connection and remote models refer to the documentation (1,2). code_block <ListValue: [StructValue([('code', "CREATE OR REPLACE MODEL\r\n `bqml_tutorial.text_bison_001` REMOTE\r\nWITH CONNECTION `LOCATION. ConnectionID`\r\nOPTIONS (ENDPOINT ='text-bison@001')"), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3eb42d770be0>)])]> For inference on the model, we first construct a prompt by concatenating the task description for our model and the transcript from the tables we created. We then use the ML.GENERATE_TEXT function to get the output. While the model gets many classifications correct out of the box, it classifies some transcripts erroneously. Here’s a sample response where it classifies incorrectly. code_block <ListValue: [StructValue([('code', 'Prompt\r\n\r\nPlease assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. TRANSCRIPT: \r\nINDICATIONS FOR PROCEDURE:, The patient has presented with atypical type right arm discomfort and neck discomfort. She had noninvasive vascular imaging demonstrating suspected right subclavian stenosis. Of note, there was bidirectional flow in the right vertebral artery, as well as 250 cm per second velocities in the right subclavian. Duplex ultrasound showed at least a 50% stenosis.,APPROACH:, Right common femoral artery.,ANESTHESIA:, IV sedation with cardiac catheterization protocol. Local infiltration with 1% Xylocaine.,COMPLICATIONS:, None.,ESTIMATED BLOOD LOSS:, Less than 10 ml.,ESTIMATED CONTRAST:, Less than 250 ml.,PROCEDURE PERFORMED:, Right brachiocephalic angiography, right subclavian angiography, selective catheterization of the right subclavian, selective aortic arch angiogram, right iliofemoral angiogram, 6 French Angio-Seal placement.,DESCRIPTION OF PROCEDURE:, The patient was brought to the cardiac catheterization lab in the usual fasting state. She was laid supine on the cardiac catheterization table, and the right groin was prepped and draped in the usual sterile fashion. 1% Xylocaine was infiltrated into the right femoral vessels. Next, a #6 French sheath was introduced into the right femoral artery via the modified Seldinger technique.,AORTIC ARCH ANGIOGRAM:, Next, a pigtail catheter was advanced to the aortic arch. Aortic arch angiogram was then performed with injection of 45 ml of contrast, rate of 20 ml per second, maximum pressure 750 PSI in the 4 degree LAO view.,SELECTIVE SUBCLAVIAN ANGIOGRAPHY:, Next, the right subclavian was selectively cannulated. It was injected in the standard AP, as well as the RAO view. Next pull back pressures were measured across the right subclavian stenosis. No significant gradient was measured.,ANGIOGRAPHIC DETAILS:, The right brachiocephalic artery was patent. The proximal portion of the right carotid was patent. The proximal portion of the right subclavian prior to the origin of the vertebral and the internal mammary showed 50% stenosis.,IMPRESSION:,1. Moderate grade stenosis in the right subclavian artery.,2. Patent proximal edge of the right carotid.\r\n\r\nResponse\r\nRadiology'), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3eb42d770d90>)])]> In the above case the correct classification should have been ‘Cardiovascular/ Pulmonary’. Metrics-based evaluation for base modelTo perform a more robust evaluation of the model’s performance, you can use BigQuery’s ML.EVALUATE function to compute metrics on how the model responses compare against the ideal responses from a test/eval dataset. You can do so as follows: code_block <ListValue: [StructValue([('code', '-- Evaluate base model\r\n\r\nSELECT\r\n *\r\nFROM\r\n ml.evaluate(MODEL bqml_tutorial.text_bison_001,\r\n (\r\n SELECT\r\n CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS input_text,\r\n output_text\r\n FROM\r\n `bqml_tutorial.medical_transcript_eval` ),\r\n STRUCT("classification" AS task_type))'), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3eb42d7708e0>)])]> In the above code we provided an evaluation table as input and chose ‘classification‘ as the task type on which we evaluate the model. We left other inference parameters at their defaults but they can be modified for the evaluation. The evaluation metrics that are returned are computed for each class (label). The results look like following: Focusing on the F1 score (harmonic mean of precision and recall), you can see that the model performance varies between classes. For example, the baseline model performs well for ‘Autopsy’, ‘Diets and Nutritions’, and ‘Dentistry’, but performs poorly for ‘Consult - History and Phy.’, ‘Chiropractic’, and ‘Cardiovascular / Pulmonary’ classes. Now let’s fine-tune our model and see if we can improve on this baseline performance. Creating a fine-tuned model Creating a fine-tuned model in BigQuery is simple. You can perform fine-tuning by specifying the training data with ‘prompt’ and ‘label’ columns in it in the Create Model statement. We use the same prompt for fine-tuning that we used in the evaluation earlier. Create a fine-tuned model as follows: code_block <ListValue: [StructValue([('code', '-- Fine tune a textbison model\r\n\r\nCREATE OR REPLACE MODEL\r\n `bqml_tutorial.text_bison_001_medical_transcript_finetuned` REMOTE\r\nWITH CONNECTION `LOCATION. ConnectionID`\r\nOPTIONS (endpoint="text-bison@001",\r\n max_iterations=300,\r\n data_split_method="no_split") AS\r\nSELECT\r\n CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS prompt,\r\n output_text AS label\r\nFROM\r\n `bqml_tutorial.medical_transcript_train`'), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3eb42d770cd0>)])]> The CONNECTION you use to create the fine-tuned model should have (a) Storage Object User and (b) Vertex AI Service Agent roles attached. In addition, your Compute Engine (GCE) default service account should have an editor access to the project. Refer to the documentation for guidance on working with BigQuery connections. BigQuery performs model fine-tuning using a technique known as Low-Rank Adaptation (LoRA. LoRA tuning is a parameter efficient tuning (PET) method that freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture to reduce the number of trainable parameters. The model fine-tuning itself happens on a Vertex AI compute and you have the option to choose GPUs or TPUs as accelerators. You are billed by BigQuery for the data scanned or slots used, as well as by Vertex AI for the Vertex AI resources consumed. The fine-tuning job creates a new model endpoint that represents the learned weights. The Vertex AI inference charges you incur when querying the fine-tuned model are the same as for the baseline model. This fine-tuning job may take a couple of hours to complete, varying based on training options such as ‘max_iterations’. Once completed, you can find the details of your fine-tuned model in the BigQuery UI, where you will see a different remote endpoint for the fine-tuned model. Endpoint for the baseline model vs a fine tuned model. Currently, BigQuery supports fine-tuning of text-bison-001 and text-bison-002 models. Evaluating performance of fine-tuned model You can now generate predictions from the fine-tuned model using code such as following: code_block <ListValue: [StructValue([('code', 'SELECT\r\n ml_generate_text_llm_result,\r\n label,\r\n prompt\r\nFROM\r\n ml.generate_text(MODEL bqml_tutorial.text_bison_001_medical_transcript_finetuned,\r\n (\r\n SELECT\r\n CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS prompt,\r\n output_text as label\r\n FROM\r\n `bqml_tutorial.medical_transcript_eval`\r\n ),\r\n STRUCT(TRUE AS flatten_json_output))'), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3eb42d770e20>)])]> Let us look at the response to the sample prompt we evaluated earlier. Using the same prompt, the model now classifies the transcript as ‘Cardiovascular / Pulmonary’ — the correct response. Metrics based evaluation for fine tuned model Now, we will compute metrics on the fine-tuned model using the same evaluation data and the same prompt we previously used for evaluating the base model. code_block <ListValue: [StructValue([('code', '-- Evaluate fine tuned model\r\n\r\n\r\nSELECT\r\n *\r\nFROM\r\n ml.evaluate(MODEL bqml_tutorial.text_bison_001_medical_transcript_finetuned,\r\n (\r\n SELECT\r\n CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS prompt,\r\n output_text as label\r\n FROM\r\n `bqml_tutorial.medical_transcript_eval`), STRUCT("classification" AS task_type))'), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3eb42d770820>)])]> The metrics from the fine-tuned model are below. Even though the fine-tuning (training) dataset we used for this blog contained only 519 examples, we already see a marked improvement in performance. F1 scores on the labels, where the model had performed poorly earlier, have improved, with the “macro” F1 score (a simple average of F1 score across all labels) jumping from 0.54 to 0.66. Ready for inference The fine-tuned model can now be used for inference using the ML.GENERATE_TEXT function, which we used in the previous steps to get the sample responses. You don’t need to manage any additional infrastructure for your fine-tuned model and you are charged the same inference price as you would have incurred for the base model. To try fine-tuning for text-bison models in BigQuery, check out the documentation. Have feedback or need fine-tuning support for additional models? Let us know at bqml-feedback@google.com>. Special thanks to Tianxiang Gao for his contributions to this blog. View the full article
  6. An overview of the top vulnerabilities affecting large language model (LLM) applications. The post OWASP Top 10 for LLM Applications: A Quick Guide appeared first on Mend. The post OWASP Top 10 for LLM Applications: A Quick Guide appeared first on Security Boulevard. View the full article
  7. Recent developments in building large language models (LLMs) to boost generative AI in local languages have caught everyone’s attention. This post focuses on the needs and challenges of homegrown LLMs amid the fast-evolving technology landscape.View the full article
  8. AMD has published performance slides of its Ryzen 7 7840U dominating Intel’s latest Core Ultra 7 155H processor in a plethora of large language model benchmarks. View the full article
  9. A new artificial intelligence (AI) benchmark based on classic arcade title Street Fighter III was devised at the Mistral AI hackathon in San Francisco last week. View the full article
  10. Do you want to know how to run LLMs on your computer without installing a lot of dependencies or writing code? Well, you're in luck! By the end of this tutorial, you will have successfully run an LLM using llamafile and interacted with it through a user-friendly interface.View the full article
  11. Picus Security today added an artificial intelligence (AI) capability to enable cybersecurity teams to automate tasks via a natural language interface. The capability, enabled by OpenAI, leverages the existing knowledge graph technologies from Picus Security. Dubbed Picus Numi AI, the company is making use of a large language model (LLM) developed by Open AI to.. The post Picus Security Melds Security Knowledge Graph with Open AI LLM appeared first on Security Boulevard. View the full article
  12. How this open source LLM chatbot runner hit the gas on x86, Arm CPUsView the full article
  13. PLEASE POST AT 9:00AM EST ON TUESDAY APRIL 2nd Tabnine today revealed that it is giving DevOps teams the ability to switch between multiple large language models (LLMs) when they use its generative artificial intelligence (AI) platform to write code. Application development teams can continue to use the LLM that Tabine originally developed (in collaboration […]View the full article
  14. Databricks’ mission is to deliver data intelligence to every enterprise by allowing organizations to understand and use their unique data to build their... View the full article
  15. Today, we are excited to introduce DBRX, an open, general-purpose LLM created by Databricks. Across a range of standard benchmarks, DBRX sets a... View the full article
  16. From theory to practice, learn how to enhance your NLP projects with these 7 simple steps.View the full article
  17. MemVerge, a provider of software designed to accelerate and optimize data-intensive applications, has partnered with Micron to boost the performance of LLMs using Compute Express Link (CXL) technology. The company's Memory Machine software uses CXL to reduce idle time in GPUs caused by memory loading. The technology was demonstrated at Micron’s booth at Nvidia GTC 2024 and Charles Fan, CEO and Co-founder of MemVerge said, “Scaling LLM performance cost-effectively means keeping the GPUs fed with data. Our demo at GTC demonstrates that pools of tiered memory not only drive performance higher but also maximize the utilization of precious GPU resources.” Impressive results The demo utilized a high-throughput FlexGen generation engine and an OPT-66B large language model. This was performed on a Supermicro Petascale Server, equipped with an AMD Genoa CPU, Nvidia A10 GPU, Micron DDR5-4800 DIMMs, CZ120 CXL memory modules, and MemVerge Memory Machine X intelligent tiering software. The demo contrasted the performance of a job running on an A10 GPU with 24GB of GDDR6 memory, and data fed from 8x 32GB Micron DRAM, against the same job running on the Supermicro server fitted with Micron CZ120 CXL 24GB memory expander and the MemVerge software. The FlexGen benchmark, using tiered memory, completed tasks in under half the time of traditional NVMe storage methods. Additionally, GPU utilization jumped from 51.8% to 91.8%, reportedly as a result of MemVerge Memory Machine X software's transparent data tiering across GPU, CPU, and CXL memory. Raj Narasimhan, senior vice president and general manager of Micron’s Compute and Networking Business Unit, said “Through our collaboration with MemVerge, Micron is able to demonstrate the substantial benefits of CXL memory modules to improve effective GPU throughput for AI applications resulting in faster time to insights for customers. Micron’s innovations across the memory portfolio provide compute with the necessary memory capacity and bandwidth to scale AI use cases from cloud to the edge.” However, experts remain skeptical about the claims. Blocks and Files pointed out that the Nvidia A10 GPU uses GDDR6 memory, which is not HBM. A MemVerge spokesperson responded to this point, and others that the site raised, stating, “Our solution does have the same effect on the other GPUs with HBM. Between Flexgen’s memory offloading capabilities and Memory Machine X’s memory tiering capabilities, the solution is managing the entire memory hierarchy that includes GPU, CPU and CXL memory modules.” (Image credit: MemVerge) More from TechRadar Pro Are we exaggerating AI capabilities?'The fastest AI chip in the world': Gigantic AI CPU has almost one million coresAI chip built using ancient Samsung tech is claimed to be as fast as Nvidia A100 GPU View the full article
  18. Improving the relevance of your LLM application by leveraging Charmed Opensearch’s vector database Large Language Models (LLMs) fall under the category of Generative AI (GenAI), an artificial intelligence type that produces content based on user-defined context. These models undergo training using an extensive dataset composed of trillions of combinations of words from natural language, enabling them to empower interactive and conversational applications across various scenarios. Renowned LLMs like GPT, BERT, PaLM, and LLaMa can experience performance improvements by gaining access to additional structured and unstructured data. This additional data may include public or internal documents, websites, and various text forms and content. This methodology, termed retrieval-augmented generation (RAG), ensures that your conversational application generates accurate results with contextual relevance and domain-specific knowledge, even in areas where the pertinent facts were not part of the initial training dataset. RAG can drastically improve the accuracy of an LLM’s responses. See the example below: “What is PRO?” response without RAG Pro is a subscription-based service that offers additional features and functionality to users. For example, Pro users can access exclusive content, receive priority customer support, and more. To become a Pro user, you can sign up for a Pro subscription on our website. Once you have signed up, you can access all of the Pro features and benefits. “What is PRO?” response with RAG Ubuntu Pro is an additional stream of security updates and packages that meet compliance requirements, such as FIPS or HIPAA, on top of an Ubuntu LTS. It provides an SLA for security fixes for the entire distribution (‘main and universe’ packages) for ten years, with extensions for industrial use cases. Ubuntu Pro is free for personal use, offering the full suite of Ubuntu Pro capabilities on up to 5 machines. This article guides you on leveraging Charmed OpenSearch to maintain a relevant and up-to-date LLM application. What is OpenSearch? OpenSearch is an open-source search and analytics engine. Users can extend the functionality of OpenSearch with a selection of plugins that enhance search, security, performance analysis, machine learning, and more. This previous article we wrote provides additional details on the comprehensive features of OpenSearch. We discussed the capability of enabling enterprise-grade solutions through Charmed OpenSearch. This blog will emphasise a specific feature pertinent to RAG: utilising OpenSearch as a vector database. What is a vector database? Vector databases allow you to store and index, for example, text documents, rich media, audio, geospatial coordinates, tables, and graphs into vectors. These vectors represent points in N-dimensional spaces, effectively encapsulating the context of an asset. Search tools can look into these spaces using low-latency queries to find similar assets in neighbouring data points. These search tools typically do this by exploiting the efficiency of different methods for obtaining, for example, the k-nearest neighbours (k-NN) from an index of vectors. In particular, OpenSearch enables this feature with the k-NN plugin and augments this functionality by providing your conversational applications with other essential features, such as fault tolerance, resource access controls, and a powerful query engine. Using the OpenSearch k-NN plugin for RAG IIn this section, we provide a practical example of using Charmed OpenSearch in the RAG process as a retrieval tool with an experiment using a Jupyter notebook on top of Charmed Kubeflow to infer an LLM. 1. Deploy Charmed OpenSearch and enable the k-NN plugin. Follow the Charmed OpenSearch tutorial, which is a good starting point. At the end, verify if the plugin is enabled, which is enabled by default: $ juju config opensearch plugin_opensearch_knn true 2. Get your credentials. The easiest way to create and retrieve your first administrator credentials is to add a relation between Charmed Opensearch and the Data Integrator Charm, which is also part of the tutorial. 3. Create a vector index for your k-NN index. Now, we can create a vector index for your additional documents encoded into the knn_vectors data type. For simplicity, we will use the opensearch-py client. from opensearchpy import OpenSearch os_host = 10.56.118.209 os_port = 9200 os_url = "https://10.56.118.209:9200" os_auth = ("opensearch-client_7","sqlKjlEK7ldsBxqsOHNcFoSXayDudf30") os_client = OpenSearch( hosts = [{'host': os_host, 'port': os_port}], http_compress = True, http_auth = os_auth, use_ssl = True, verify_certs = False, ssl_assert_hostname = False, ssl_show_warn = False ) os_index_name = "rag-index" settings = { "settings": { "index": { "knn": True, "knn.space_type": "cosinesimil" } } } opensearch_client.indices.create(index=os_index_name, body=settings) properties={ "properties": { "vector_field": { "type": "knn_vector", "dimension": 384 }, "text": { "type": "keyword" } } } opensearch_client.indices.put_mapping(index=os_index_name, body=properties) 4. Aggregate source documents. In this example, we will select a list of web content that we want our application to use as relevant information to provide accurate answers: content_links = [ https://discourse.ubuntu.com/t/ubuntu-pro-faq/34042 ] 5. Load document contents into memory and split the content into chunks. It will allow us to create the embeddings from the selected documents and upload them to the index we created. from langchain.document_loaders import WebBaseLoader loader = WebBaseLoader(content_links) htmls = loader.load() from langchain.text_splitter import CharacterTextSplitter text_splitter = CharacterTextSplitter( chunk_size=500, chunk_overlap=0, separator="\n") docs = text_splitter.split_documents(htmls) 6. Create embeddings for text chunks and store embeddings in the vector index. It will allow us to create the embeddings from the selected documents and upload them to the index we created. from langchain.embeddings import HuggingFaceEmbeddings embeddings = HuggingFaceEmbeddings( model_name="sentence-transformers/all-MiniLM-L12-v2", encode_kwargs={'normalize_embeddings': False}) from langchain.vectorstores import OpenSearchVectorSearch docsearch = OpenSearchVectorSearch.from_documents(docs, embeddings, ef_construction=256, engine="faiss", space_type="innerproduct", m=48, opensearch_url=os_url, index_name=os_index_name, http_auth=os_auth, verify_certs=False) 7. Use the similarity search to retrieve the documents that provide context to your query. The search engine will perform the Approximate k-NN Search, for example, using the cosine similarity formula, and return the relevant documents in the context of your question. query = """ What is Pro? """ similar_docs = docsearch.similarity_search(query, k=2, raw_response=True, search_type="approximate_search", space_type="cosinesimil") 8. Prepare you LLM. We used a simple example using a HugginFace pipeline to load an LLM. from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline from langchain.llms import HuggingFacePipeline model_name="TheBloke/Llama-2-7B-Chat-GPTQ" model = AutoModelForCausalLM.from_pretrained( model_name, cache_dir="model", device_map='auto' ) tokenizer = AutoTokenizer.from_pretrained(model_name,cache_dir="llm/tokenizer") pl = pipeline( "text-generation", model=model, tokenizer=tokenizer, max_length = 2048. ) llm = HuggingFacePipeline(pipeline=pl) 9. Create a prompt template. It will define the expectations of the response and specify that we will provide context for an accurate answer. from langchain import PromptTemplate question_prompt_template = """ You are a friendly chatbot assistant that responds in a conversational manner to user's questions. Respond in short but complete answers unless specifically asked by the user to elaborate on something. Use History and Context to inform your answers. Context: --------- {context} --------- Question: {question} Helpful Answer:""" QUESTION_PROMPT = PromptTemplate( template=question_prompt_template, input_variables=["context", "question"] ) 10. Infer the LLM to answer your question using the context documents retrieved from OpenSearch. from langchain.chains.question_answering import load_qa_chain question = "What is Pro?" chain = load_qa_chain(llm, chain_type="stuff", prompt=QUESTION_PROMPT) chain.run(input_documents=similar_docs, question=query) Conclusion Retrieval-augmented generation (RAG) is a method that enables users to converse with data repositories. It’s a tool that can revolutionise how you access and utilise data, as we showed in our tutorial. With RAG, you can improve data retrieval, enhance knowledge sharing, and enrich the results of your LLMs to give more contextually relevant, insightful responses that better reflect the most up-to-date information in your organisation. The benefits of better LLMs that can access your knowledge base are as obvious as they are alluring: you gain better customer support, employee training and developer productivity. On top of that, you ensure that your teams get LLM answers and results that reflect accurate, up-to-date policy and information rather than generalised or even outright useless answers. As we showed, Charmed OpenSearch is a simple and robust technology that can enable RAG capabilities. With it (and our helpful tutorial), any business can leverage RAG to transform their technical or policy manuals and logs into comprehensive knowledge bases. Enterprise-grade and fully supported OpenSearch solution Charmed OpenSearch is available for the open-source community. Canonical’s team of experts can help you get started with it as the vector database to leverage the power of the k-NN search for your LLM applications at any scale. Contact Canonical if you have questions. Watch the webinar: Future-proof AI applications with OpenSearch as a vector database View the full article
  19. An in-depth overview of extractive text summarization, how state-of-the-art NLP models like BERT can enhance it, and a coding tutorial for using BERT to generate extractive summaries.View the full article
  20. Check out our LLM Solution Accelerators for Retail for more details and to download the notebooks. Product recommendations are a core feature of... View the full article
  21. Because human-machine interaction using natural language is now possible with large language models (LLMs), more data teams and developers can bring AI to their daily workflows. To do this efficiently and securely, teams must decide how they want to combine the knowledge of pre-trained LLMs with their organization’s private enterprise data in order to deal with the hallucinations (that is, incorrect responses) that LLMs can generate due to the fact that they’ve only been trained on data available up to a certain date. To reduce these AI hallucinations, LLMs can be combined with private data sets via processes that either don’t require LLM customization (such as prompt engineering or retrieval augmented generation) or that do require customization (like fine-tuning or retraining). To decide where to start, it is important to make trade-offs between the resources and time it takes to customize AI models and the required timelines to show ROI on generative AI investments. While every organization should keep both options on the table, to quickly deliver value, the key is to identify and deploy use cases that can deliver value using prompt engineering and retrieval augmented generation (RAG), as these can be fast and cost-effective approaches to get value from enterprise data with LLMs. To empower organizations to deliver fast wins with generative AI while keeping data secure when using LLMs, we are excited to announce Snowflake Cortex LLM functions are now available in public preview for select AWS and Azure regions. With Snowflake Cortex, a fully managed service that runs on NVIDIA GPU-accelerated compute, there is no need to set up integrations, manage infrastructure or move data outside of the Snowflake governance boundary to use the power of industry-leading LLMs from Mistral AI, Meta and more. So how does Snowflake Cortex make AI easy, whether you are doing prompt engineering or RAG? Let’s dive into the details and check out some code along the way. To prompt or not to prompt In Snowflake Cortex, there are task-specific functions that work out of the box without the need to define a prompt. Specifically, teams can quickly and cost-effectively execute tasks such as translation, sentiment analysis and summarization. All that an analyst or any other user familiar with SQL needs to do is point the specific function below to a column of a table containing text data and voila! Snowflake Cortex functions take care of the rest — no manual orchestration, data formatting or infrastructure to manage. This is particularly useful for teams constantly working with product reviews, surveys, call transcripts and other long-text data sources traditionally underutilized within marketing, sales and customer support teams. SELECT SNOWFLAKE.CORTEX.SUMMARIZE(review_text) FROM reviews_table LIMIT 10; Of course, there are going to be many use cases where customization via prompts becomes useful. For example: Custom text summaries in JSON format Turning email domains into rich data sets Building data quality agents using LLMs All of these and more can quickly be accomplished with the power of industry-leading foundation models from Mistral AI (Mistral Large, Mistral 8x7B, Mistral 7B), Google (Gemma-7b) and Meta (Llama2 70B). All of these foundation LLMs are accessible via the complete function, which just like any other Snowflake Cortex function can run on a table with multiple rows without any manual orchestration or LLM throughput management. Figure 1: Multi-task accuracy of industry-leading LLMs based on MLLU benchmark. Source SELECT SNOWFLAKE.CORTEX.COMPLETE( 'mistral-large', CONCAT('Summarize this product review in less than 100 words. Put the product name, defect and summary in JSON format: <review>', content, '</review>') ) FROM reviews LIMIT 10; For use cases such as chatbots on top of documents, it may be costly to put all the documents as context in the prompt. In such a scenario, a different approach may be more cost effective by minimizing the volume of tokens (a general rule of thumb is that 75 words approximately equals 100 tokens) going into the LLM. A popular framework to solve this problem without having to make changes to the LLM is RAG, which is easy to do in Snowflake. What is RAG? Let’s go over the basics of RAG before jumping into how to do this in Snowflake. RAG is a popular framework in which an LLM gets access to a specific knowledge base with the most up-to-date, accurate information available before generating a response. Because there is no need to retrain the model, this extends the capability of any LLM to specific domains in a cost-effective way. To deploy this retrieval, augmentation and generation framework teams need a combination of: Client / app UI: This is where the end user, such as a business decision-maker, is able to interact with the knowledge base, typically in the form of a chat service. Context repository: This is where relevant data sources are aggregated, governed and continuously updated as needed to provide an up-to-date knowledge repository. This content needs to be inserted into an automated pipeline that chunks (that is, breaks documents into smaller pieces) and embeds the text into a vector store. Vector search: This requires the combination of a vector store, which maintains the numerical or vector representation of the knowledge base, and semantic search to provide easy retrieval of the chunks most relevant to the question. LLM inference: The combination of these enables teams to embed the question and the context to find the most relevant information and generate contextualized responses using a conversational LLM. Figure 2: Generalized RAG framework from question to contextualized answer. From RAG to rich LLM apps in minutes with Snowflake Cortex Now that we understand how RAG works in general, how can we apply it to Snowflake? Using the Snowflake platform’s rich foundation for data governance and management, which includes vector data type (in private preview), developing and deploying an end-to-end AI app using RAG is possible without integrations, infrastructure management or data movement using three key features: Figure 3: Key Snowflake features needed to build end-to-end RAG in Snowflake. Here is how these features map to the key architecture components of a RAG framework: Client / app UI: Use Streamlit in Snowflake out-of-the box chat elements to quickly build and share user interfaces all in Python. Context repository: The knowledge repository can be easily updated and governed using Snowflake stages. Once documents are loaded, all of your data preparation, including generating chunks (smaller, contextually rich blocks of text), can be done with Snowpark. For the chunking in particular, teams can seamlessly use LangChain as part of a Snowpark User Defined Function. Vector search: Thanks to the native support of VECTOR as a data type in Snowflake, there is no need to integrate and govern a separate store or service. Store VECTOR data in Snowflake tables and execute similarity queries with system-defined similarity functions (L2, cosine, or inner-product distance). LLM inference: Snowflake Cortex completes the workflow with serverless functions for embedding and text completion inference (using either Mistral AI, Llama or Gemma LLMs). Figure 4: End-to-end RAG framework in Snowflake. Show me the code Ready to try Snowflake Cortex and its tightly integrated ecosystem of features that enable fast prototyping and agile deployment of AI apps in Snowflake? Get started with one of these resources: Snowflake Cortex LLM functions documentation Run 3 useful LLM inference jobs in 10 minutes with Snowflake Cortex Build a chat-with-your-documents LLM app using RAG with Snowflake Cortex To watch live demos and ask questions of Snowflake Cortex experts, sign up for one of these events: Snowflake Cortex Live Ask Me Anything (AMA) Snowflake Cortex RAG hands-on lab Want to network with peers and learn from other industry and Snowflake experts about how to use the latest generative AI features? Make sure to join us at Snowflake Data Cloud Summit in San Francisco this June! The post Easy and Secure LLM Inference and Retrieval Augmented Generation (RAG) Using Snowflake Cortex appeared first on Snowflake. View the full article
  22. Karpathy's talk provides a comprehensive yet accessible introduction to large language models, explaining their capabilities, future potential, and associated security risks in an engaging manner.View the full article
  23. The future world is full of LLM, and you don’t want to miss this most sought skill.View the full article
  24. How to write function in Python to reverse a string How to write SQL query to select users from a database by age How to implement binary search in Java How often do you have to break the flow, leave your IDE, and search for answers to questions (that are maybe similar to the ones above)? And how often do you end up getting distracted and end up watching cat videos instead of getting back to work? (This happens to the best of them, even to GitHub’s VP of Developer Relations, Martin Woodward.) It doesn’t have to be that way. A developer’s ability to get AI coding assistance directly in a workspace was found to reduce context switching and conserve a developer’s mental energy. When directly integrated into workspaces, these tools become familiar enough with a developer’s code to quickly provide tailored suggestions. Now, without getting sidetracked, developers can get customized answers to coding questions like: Can you suggest a better way to structure my code for scalability? Can you help me debug this function? It's not returning the expected results. Can you help me understand this piece of code in this repository? But how do AI coding assistants provide customized answers? What can organizations and developers do to receive more tailored solutions? And how, ultimately, do customized AI coding assistants benefit organizations as a whole? We talked to Alireza Goudarzi, a senior machine learning researcher at GitHub, to get the answers. How AI coding assistants provide customized answers When it comes to problem solving, context is everything. Business decision makers use information gathered from internal metrics, customer meetings, employee feedback, and more to make decisions about what resources their companies need. Meanwhile, developers use details from pull requests, a folder in a project, open issues, and more to solve coding problems. Large language models, or LLMs, do something similar: Generative AI coding tools are powered by LLMs, which are sets of algorithms trained on large amounts of code and human language. Today’s LLMs are structured as transformers, a kind of architecture that makes the model good at connecting the dots between data. Following the transformer architecture is what enables today’s LLMs to generate responses that are more contextually relevant than previous AI models. Though transformer LLMs are good at connecting the dots, they need to learn what data to process and in what order. A generative AI coding assistant in the IDE can be instructed to use data from open files or code written before and after the cursor to understand the context around the current line of code and suggest a relevant code completion. As a chatbot in an IDE or on a website, a generative AI coding assistant can provide guidance by using data from indexed repositories, customized knowledge bases, developer-provided input in a prompt or query, and even search engine integrations. All input data—the code, query, and additional context—passes through something called a context window, which is present in all transformer-based LLMs. The size of the context window represents the capacity of data an LLM can process. Though it can’t process an infinite amount of data, it can grow larger. But because that window is limited, prompt engineers have to figure out what data, and in what order, to feed the model so it generates the most useful, contextually relevant responses for the developer. How to customize your LLM Customizing an LLM is not the same as training it. Training an LLM means building the scaffolding and neural networks to enable deep learning. Customizing an LLM means adapting a pre-trained LLM to specific tasks, such as generating information about a specific repository or updating your organization’s legacy code into a different language. There are a few approaches to customizing your LLM: retrieval augmented generation, in-context learning, and fine-tuning. We broke these down in this post about the architecture of today’s LLM applications and how GitHub Copilot is getting better at understanding your code. Here’s a recap. Retrieval-augmented generation (RAG) RAG typically uses something called embeddings to retrieve information from a vector database. Vector databases are a big deal because they transform your source code into retrievable data while maintaining the code’s semantic complexity and nuance. In practice, that means an LLM-based coding assistant using RAG can generate relevant answers to questions about a private repository or proprietary source code. It also means that LLMs can use information from external search engines to generate their responses. If you’re wondering what a vector database is, we have you covered: Vector databases store embeddings of your repository code and documentation. The embeddings are what make your code and documentation readable by an LLM. (This is similar to the way programming languages are converted into a binary system language for a computer to understand.) As developers code in an IDE, algorithms transform code snippets in the IDE into embeddings. Algorithms then make approximate matches between the embeddings that are created for those IDE snippets and the embeddings already stored in the vector database. When asking a question to a chat-based AI coding assistant, the questions and requests written in natural language are also transformed into embeddings. A similar process to the one described above takes place: the embeddings created for the natural language prompts are matched to embeddings already stored in vector databases. Vector databases and embeddings allow algorithms to quickly search for approximate matches (not just exact ones) on the data they store. This is important because if an LLM’s algorithms only make exact matches, it could be the case that no data is included as context. Embeddings improve an LLM’s semantic understanding, so the LLM can find data that might be relevant to a developer’s code or question and use it as context to generate a useful response. Have questions about what data GitHub Copilot uses and how? Read this for answers to frequently asked questions and visit the GitHub Copilot Trust Center for more details. In-context learning In-context learning, a method sometimes referred to as prompt engineering, is when developers give the model specific instructions or examples at the time of inference (also known as the time they’re typing or vocalizing a question or request). By providing these instructions and examples, the LLM understands the developer is asking it to infer what they need and will generate a contextually relevant output. In-context learning can be done in a variety of ways, like providing examples, rephrasing your queries, and adding a sentence that states your goal at a high-level. Discover more LLM prompting tips in our guide Fine-tuning Fine-tuning your model can result in a highly customized LLM that excels at a specific task. There are two ways to customize your model with fine-tuning: supervised learning and reinforcement learning from human feedback (RLHF). Under supervised learning, there is a predefined correct answer that the model is taught to generate. Under RLHF, there is high-level feedback that the model uses to gauge whether its generated response is acceptable or not. Let’s dive deeper. Supervised learning This method is when the model’s generated output is evaluated against an intended or known output. For example, you know that the sentiment behind a statement like this is negative: “This sentence is unclear.” To evaluate the LLM, you’d feed this sentence to the model and query it to label the sentiment as positive or negative. If the model labels it as positive, then you’d adjust the model’s parameters (variables that can be weighed or prioritized differently to change a model’s output) and try prompting it again to see if it can classify the sentiment as negative. But even smaller models can have over 300 million parameters. Those are a lot of variables to sift through and adjust (and re-adjust). This method also requires time-intensive labeling. Each input sample requires an output that’s labeled with exactly the correct answer, such as “Negative,” for the example above. That label gives the output something to measure against so adjustments can be made to the model’s parameters. Reinforcement learning from human feedback (RLHF) RLHF requires either direct human feedback or creating a reward model that’s trained to model human feedback (by predicting if a user will accept or reject the output from the pre-trained LLM). The learnings from the reward model are passed to the pre-trained LLM, which will adjust its outputs based on user acceptance rate. The benefit to RLHF is that it doesn’t require supervised learning and, consequently, expands the criteria for what’s an acceptable output. For example, with enough human feedback, the LLM can learn that if there’s an 80% probability that a user will accept an output, then it’s fine to generate. For more on LLMs and how they process data, read: The architecture of today’s LLM applications How LLMs can do things they weren’t trained to do How generative AI is changing the way developers work How GitHub Copilot is getting better at understanding your code What developers need to know about generative AI A developer’s guide to prompt engineering and LLMs How to customize GitHub Copilot GitHub Copilot’s contextual understanding has continuously evolved over time. The first version was only able to consider the file you were working on in your IDE to be contextually relevant. We then expanded the context to neighboring tabs, which are all the open files in your IDE that GItHub Copilot can comb through to find additional context. Just a year and a half later, we launched GitHub Copilot Enterprise, which uses an organization’s indexed repositories to provide developers with coding assistance that’s customized to their codebases. With GitHub Copilot Enterprise, organizations can tailor GitHub Copilot suggestions in the following ways: Index their source code repositories in vector databases, which improves semantic search and gives their developers a customized coding experience. Create knowledge bases, which are Markdown files from a collection of repositories that provide GitHub Copilot with additional context through unstructured data, or data that doesn’t live in a database or spreadsheet. In practice, this can benefit organizations in several ways: Enterprise developers gain a deeper understanding of your organization’s unique codebase. Senior and junior developers alike can prompt GitHub Copilot for code summaries, coding suggestions, and answers about code behavior. As a result of this streamlined code navigation and comprehension, enterprise developers implement features, resolve issues, and modernize code faster. Complex data is quickly translated into organizational knowledge and best practices. Because GitHub Copilot receives context through the repositories and documentation your organization chooses to index, developers receive coding suggestions and guidance that are more useful because they align with organizational knowledge and best practices. It’s not just developers, but also their non-developer and cross-functional team members who can use natural language to prompt Copilot Chat in GitHub.com for answers and guidance on relevant documentation or existing solutions. Data and solutions captured in repositories becomes more accessible across the organization, improving collaboration and increasing awareness of business goals and practices. Faster pull requests create smart, efficient, and accessible development workflows. With GitHub Copilot Enterprise, developers can use GitHub Copilot to generate pull request summaries directly in GitHub.com, helping them communicate clearly with reviewers while also saving valuable time. For developers reviewing pull requests, GitHub Copilot can be used to help them quickly gain a strong understanding of proposed changes and, as a result, focus more time on providing valuable feedback. GitHub Copilot Enterprise is now generally available. Read more about GitHub’s most advanced AI offering, and how it’s customized to your organization’s knowledge and codebase. Best practices for customizing your LLM Customized LLMs help organizations increase value out of all of the data they have access to, even if that data’s unstructured. Using this data to customize an LLM can reveal valuable insights, help you make data-driven decisions, and make enterprise information easier to find overall. Here are our top tips for customizing an LLM. Select an AI solution that uses RAG Like we mentioned above, not all of your organization’s data will be contained in a database or spreadsheet. A lot of data comes in the form of text, like code documentation. Organizations that opt into GitHub Copilot Enterprise will have a customized chat experience with GitHub Copilot in GitHub.com. GitHub Copilot Chat will have access to the organization’s selected repositories and knowledge base files (also known as Markdown documentation files) across a collection of those repositories. Adopt innersource practices Kyle Daigle, GitHub’s chief operating officer, previously shared the value of adapting communication best practices from the open source community to their internal teams in a process known as innersource. One of those best practices is writing something down and making it easily discoverable. How does this practice pay off? It provides more documentation, which means more context for an AI tool to generate tailored solutions to our organization. Effective AI adoption requires establishing this foundation of context. Moreover, developers can use GitHub Copilot Chat in their preferred natural language—from German to Telugu. That means more documentation, and therefore more context for AI, improves global collaboration. All of your developers can work on the same code while using their own natural language to understand and improve it. Here are Daigle’s top tips for innersource adoption: If you like what you hear, record it and make it discoverable (and remember: plenty of video and productivity tools now provide AI-powered summaries and action items). If you come up with a useful solution for your team, share it out with the wider organization so they can benefit from it, too. Offer feedback to publicly shared information and solutions. But remember to critique the work, not the person. If you request a change to a project or document, explain why you’re requesting that change. Bonus points if you add all of these notes to your relevant GitHub repositories and format them in Markdown. How do you expand your LLM results? The answer lies in search engine integration. Transformer-based LLMs have impressive semantic understanding even without embedding and high-dimensional vectors. This is because they’re trained on a large_ _amount of unlabeled natural language data and publicly available source code. They also use a self-supervised learning process where they use a portion of input data to learn basic learning objectives, and then apply what they’ve learned to the rest of the input. When a search engine is integrated into an LLM application, the LLM is able to retrieve search engine results relevant to your prompt because of the semantic understanding it’s gained through its training. That means an LLM-based coding assistant with search engine integration (made possible through a search engine’s API) will have a broader pool of current information that it can retrieve information from. Why does this matter to your organization? Let’s say a developer asks an AI coding tool a question about the most recent version of Java. However, the LLM was trained on data from before the release, and the organization hasn’t updated its repositories’ knowledge with information about the latest release. The AI coding tool can still answer the developer’s question by conducting a web search to retrieve the answer. A generative AI coding assistant that can retrieve data from both custom and publicly available data sources gives employees customized and comprehensive guidance. The path forward 50% of enterprise software engineers are expected to use machine-learning powered coding tools by 2027, according to Gartner. Today, developers are using AI coding assistants to get a head start on complex code translation tasks, build better test coverage, tackle new problems with creative solutions, and find answers to coding-related questions without leaving their IDEs. With customization, developers can also quickly find solutions tailored to an organization’s proprietary or private source code, and build better communication and collaboration with their non-technical team members. In the future, we imagine a workspace that offers more customization for organizations. For example, your ability to fine-tune a generative AI coding assistant could improve code completion suggestions. Additionally, integrating an AI coding tool into your custom tech stack could feed the tool with more context that’s specific to your organization and from services and data beyond GitHub. Looking to bring the power of GitHub Copilot Enterprise to your organization? Learn more or purchase your plan today. The post Customizing and fine-tuning LLMs: What you need to know appeared first on The GitHub Blog. View the full article
  25. Intel shows how you can run Llama 2 on an Arc A770 GPU, using its PyTorch optimizations. View the full article
  • Forum Statistics

    70.4k
    Total Topics
    68.3k
    Total Posts
×
×
  • Create New...