Nick: tutorials

2024-04-22 12:42:46 -07:00 · 2024-04-22 12:42:46 -07:00 · 18450b5f9a
commit 18450b5f9a
parent de7e1f501b
2 changed files with 186 additions and 0 deletions
--- a/tutorials/data-extraction-using-llms.mdx
+++ b/tutorials/data-extraction-using-llms.mdx
@ -0,0 +1,95 @@
 ---
 title: "Extract website data using LLMs"
 description: "Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code."
 'og:image': "/images/og.png"
 'twitter:image': "/images/og.png"
 ---
 ## Setup
 Install the Python dependencies, `groq` and `firecrawl-py`.
 ```bash
 pip install groq firecrawl-py
 ```
 ## Getting your Groq and Firecrawl API Keys
 To use Groq and Firecrawl, you need an API key for each. You can get your Groq API key from [here](https://groq.com) and your Firecrawl API key from [here](https://firecrawl.dev).
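Hardcoding keys in source files makes them easy to leak. A minimal sketch of one alternative, assuming you export the keys as environment variables first; the variable names `FIRECRAWL_API_KEY` and `GROQ_API_KEY` and the helper `require_key` are illustrative choices for this example, not something the tutorial depends on:

```python
import os


def require_key(name: str, env=None) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing API key: set the {name} environment variable")
    return value


# Usage (assumed variable names):
# firecrawl_key = require_key("FIRECRAWL_API_KEY")
# groq_key = require_key("GROQ_API_KEY")
```

The returned strings can then be passed as `api_key=` wherever the examples use a placeholder key.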
 ## Load website with Firecrawl
 To get all the data from a web page in the cleanest possible format, we will use [Firecrawl](https://firecrawl.dev). It handles JS-blocked websites, extracts the main content, and outputs it in an LLM-readable format for increased accuracy.
 Here is how we scrape a URL with Firecrawl. We also set `pageOptions` to extract only the main content of the page (`onlyMainContent: True`), excluding navs, footers, etc.
 ```python
 from firecrawl import FirecrawlApp  # Firecrawl client used for scraping
 url = "https://about.fb.com/news/2024/04/introducing-our-open-mixed-reality-ecosystem/"
 firecrawl = FirecrawlApp(
    api_key="fc-YOUR_FIRECRAWL_API_KEY",
 )
 page_content = firecrawl.scrape_url(url=url,  # Target URL to scrape
    params={
        "pageOptions":{
            "onlyMainContent": True # Ignore navs, footers, etc.
        }
    })
 print(page_content)
 ```
 Perfect, now we have clean data from the website - ready to be fed to the LLM for data extraction.
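The exact shape of the value returned by `scrape_url` can differ between firecrawl-py versions, so check what `print(page_content)` shows before relying on it. The sketch below assumes the result is a dict holding the page text under a `"markdown"` or `"content"` key (an assumption, not confirmed here) and caps the length so the text is more likely to fit in a model's context window. The helper name `page_text` and the default budget are hypothetical:

```python
def page_text(result, max_chars: int = 20_000) -> str:
    """Pull the page text out of a scrape result and cap its length.

    Assumption: a dict result holds the text under "markdown" or "content";
    a dict with neither key yields an empty string. Anything that is not a
    dict is converted with str().
    """
    if isinstance(result, dict):
        text = result.get("markdown") or result.get("content") or ""
    else:
        text = str(result)
    # A character cap is only a rough proxy for a token limit.
    return text[:max_chars]


print(page_text({"markdown": "# Title\n\nBody text"}, max_chars=7))  # prints "# Title"
```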
 ## Extraction and Generation
 Now that we have the website data, let's use Groq to pull out the information we need. We'll use Groq's Llama 3 model in JSON mode and pick out specific fields from the page content.
 We are using the Llama 3 8B model for this example. Feel free to use bigger models for improved results.
 ```python
 import json
 from groq import Groq
 client = Groq(
    api_key="gsk_YOUR_GROQ_API_KEY",  # Note: Replace with your actual Groq API key
 )
 # Here we define the fields we want to extract from the page content
 extract = ["summary","date","companies_building_with_quest","title_of_the_article","people_testimonials"]
 completion = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {
            "role": "system",
            "content": "You are a legal advisor who extracts information from documents in JSON."
        },
        {
            "role": "user",
            # Here we pass the page content and the fields we want to extract
            "content": f"Extract the following information from the provided documentation:\n\nPage content:\n\n{page_content}\n\nInformation to extract: {extract}"
        }
    ],
    temperature=0,
    max_tokens=1024,
    top_p=1,
    stream=False,
    stop=None,
    # We set the response format to JSON object
    response_format={"type": "json_object"}
 )
 # Parse the JSON string returned by the model, then pretty print it
 data_extracted = json.dumps(json.loads(completion.choices[0].message.content), indent=4)
 print(data_extracted)
 ```
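JSON mode asks the model for a JSON object, but it does not guarantee that every requested field is present. A small sketch of a check you could run on the reply before using it; `parse_extraction` is a hypothetical helper, and the reply string is a hand-written placeholder standing in for real model output:

```python
import json


def parse_extraction(raw: str, fields: list) -> tuple:
    """Parse the model's JSON reply and report which requested fields are absent."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [f for f in fields if f not in data]
    return data, missing


# Placeholder reply, written by hand for illustration
reply = '{"summary": "example summary", "date": "example date"}'
data, missing = parse_extraction(reply, ["summary", "date", "title_of_the_article"])
print(missing)  # ['title_of_the_article']
```

If `missing` is not empty, you could retry the request or fall back to a larger model.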
 ## And Voila!
 You have now built a data extraction bot using Groq and Firecrawl. You can now use this bot to extract structured data from any website.
 If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev).
--- a/tutorials/rag-llama3.mdx
+++ b/tutorials/rag-llama3.mdx
@ -0,0 +1,91 @@
 ---
 title: "Build a 'Chat with website' using Groq Llama 3"
 description: "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot."
 ---
 ## Setup
 Install the Python dependencies: langchain, groq, faiss, ollama, and firecrawl-py.
 ```bash
 pip install --upgrade --quiet langchain langchain-community groq faiss-cpu ollama firecrawl-py
 ```
 We will be using Ollama for the embeddings. You can download Ollama [here](https://ollama.com/), but feel free to use any other embeddings you prefer.
 ## Load website with Firecrawl
 To get all the data from a website in the cleanest possible format, we will use Firecrawl. Firecrawl integrates easily with Langchain as a document loader.
 Here is how you can load a website with Firecrawl:
 ```python
 from langchain_community.document_loaders import FireCrawlLoader  # Importing the FireCrawlLoader
 url = "https://firecrawl.dev"
 loader = FireCrawlLoader(
    api_key="fc-YOUR_API_KEY", # Note: Replace 'YOUR_API_KEY' with your actual FireCrawl API key
    url=url,  # Target URL to crawl
    mode="crawl"  # Mode set to 'crawl' to crawl all accessible subpages
 )
 docs = loader.load()
 ```
 ## Set up the Vectorstore
 Next, we will set up the vectorstore. The vectorstore is a data structure that lets us store and query embeddings. We will use Ollama embeddings and the FAISS vectorstore.
 We split the documents into chunks of 1000 characters each, with a 200-character overlap. This keeps the chunks neither too small nor too big, so that the retrieved chunks fit into the LLM's context window when we query it.
 ```python
 from langchain_community.embeddings import OllamaEmbeddings
 from langchain_text_splitters import RecursiveCharacterTextSplitter
 from langchain_community.vectorstores import FAISS
 text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
 splits = text_splitter.split_documents(docs)
 vectorstore = FAISS.from_documents(documents=splits, embedding=OllamaEmbeddings())
 ```
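To make the chunk size and overlap concrete, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line boundaries); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share text:

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = sliding_chunks("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With a step of 800 characters, a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.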
 ## Retrieval
 Now that our documents are loaded and the vectorstore is set up, we can run a similarity search on the user's question to retrieve the most relevant documents. These documents are then fed to the LLM.
 ```python
 question = "What is firecrawl?"
 docs = vectorstore.similarity_search(query=question)
 ```
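Conceptually, a similarity search embeds the question and ranks the stored chunk vectors by how close they are to it. The sketch below uses cosine similarity over hand-made toy vectors to show the idea; the metric your FAISS index actually uses may differ, and the helper names and vectors are invented for illustration:

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, items, k: int = 2) -> list:
    """Return the k texts whose vectors are most similar to query_vec.

    `items` is a list of (text, vector) pairs.
    """
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy 2-dimensional "embeddings", invented for illustration only
items = [("pricing", [0.0, 1.0]), ("what it is", [1.0, 0.1]), ("careers", [-1.0, 0.0])]
print(top_k([1.0, 0.0], items, k=2))  # ['what it is', 'pricing']
```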
 ## Generation
 Last but not least, we use Groq to generate an answer to the question based on the documents we retrieved.
 ```python
 from groq import Groq
 client = Groq(
    api_key="YOUR_GROQ_API_KEY",
 )
 completion = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {
            "role": "user",
            "content": f"You are a friendly assistant. Your job is to answer the user's question based on the documentation provided below:\nDocs:\n\n{docs}\n\nQuestion: {question}"
        }
    ],
    temperature=1,
    max_tokens=1024,
    top_p=1,
    stream=False,
    stop=None,
 )
 print(completion.choices[0].message.content)  # Print only the generated text
 ```
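Interpolating `{docs}` directly puts the printed form of the whole document list into the prompt, metadata included. One option is to join only the text of each document first. A minimal sketch, assuming each retrieved document exposes a `page_content` attribute as LangChain documents do; `format_docs` and the stand-in `Doc` class are hypothetical and exist only so the example runs on its own:

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only `page_content` is used here."""
    page_content: str


def format_docs(docs, max_chars: int = 12_000) -> str:
    """Join the text of each document, separated by blank lines, capped at max_chars."""
    joined = "\n\n".join(d.page_content.strip() for d in docs)
    return joined[:max_chars]


context = format_docs([Doc("first chunk "), Doc("second chunk")])
print(context)
```

You could then use `{context}` in the prompt in place of `{docs}`.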
 ## And Voila!
 You have now built a 'Chat with your website' bot using Groq Llama 3, Langchain, and Firecrawl. You can use this bot to answer questions based on the documentation of your website.
 If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev).