diff --git a/tutorials/data-extraction-using-llms.mdx b/tutorials/data-extraction-using-llms.mdx
new file mode 100644
index 0000000..554e787
--- /dev/null
+++ b/tutorials/data-extraction-using-llms.mdx
@@ -0,0 +1,95 @@
---
title: "Extract website data using LLMs"
description: "Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code."
'og:image': "/images/og.png"
'twitter:image': "/images/og.png"
---

## Setup

Install the Python dependencies, including groq and firecrawl-py.

```bash
pip install groq firecrawl-py
```

## Getting your Groq and Firecrawl API Keys

To use Groq and Firecrawl, you will need API keys. You can get your Groq API key [here](https://groq.com) and your Firecrawl API key [here](https://firecrawl.dev).

## Load website with Firecrawl

To get all the data from a web page in the cleanest possible format, we will use [Firecrawl](https://firecrawl.dev). It handles bypassing JS-blocked websites, extracts the main content, and outputs it in an LLM-readable format for increased accuracy.

Here is how we will scrape a website URL using Firecrawl. We also set `pageOptions` to extract only the main content of the page (`onlyMainContent: True`), excluding navs, footers, etc.

```python
from firecrawl import FirecrawlApp  # Firecrawl Python SDK

url = "https://about.fb.com/news/2024/04/introducing-our-open-mixed-reality-ecosystem/"

firecrawl = FirecrawlApp(
    api_key="fc-YOUR_FIRECRAWL_API_KEY",
)
page_content = firecrawl.scrape_url(
    url=url,  # Target URL to scrape
    params={
        "pageOptions": {
            "onlyMainContent": True  # Ignore navs, footers, etc.
        }
    },
)
print(page_content)
```

Perfect, now we have clean data from the website, ready to be fed to the LLM for data extraction.
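Scraping on every run costs API calls; a minimal sketch of caching the scraped result to disk so repeated runs reuse it. The `page_content` dict here is a made-up stand-in for the value returned by `scrape_url` above, and `scraped_page.json` is an arbitrary filename:

```python
import json
from pathlib import Path

# Stand-in for the dict returned by scrape_url above
page_content = {"content": "# Introducing Our Open Mixed Reality Ecosystem\n..."}

# Write the scraped result to a local cache file
cache_file = Path("scraped_page.json")
cache_file.write_text(json.dumps(page_content))

# On later runs, load from the cache instead of scraping again
if cache_file.exists():
    page_content = json.loads(cache_file.read_text())
```

This also makes it easy to iterate on the extraction prompt in the next step without re-scraping the page each time.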
## Extraction and Generation

Now that we have the website data, let's use Groq to pull out the information we need. We'll use the Groq Llama 3 model in JSON mode and extract specific fields from the page content.

We are using the Llama 3 8B model for this example. Feel free to use bigger models for improved results.

```python
import json
from groq import Groq

client = Groq(
    api_key="gsk_YOUR_GROQ_API_KEY",  # Note: Replace with your actual Groq API key
)

# Here we define the fields we want to extract from the page content
extract = ["summary", "date", "companies_building_with_quest", "title_of_the_article", "people_testimonials"]

completion = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {
            "role": "system",
            "content": "You are an assistant that extracts information from documents and responds in JSON."
        },
        {
            "role": "user",
            # Here we pass the page content and the fields we want to extract
            "content": f"Extract the following information from the provided documentation.\n\nPage content:\n\n{page_content}\n\nInformation to extract: {extract}"
        }
    ],
    temperature=0,
    max_tokens=1024,
    top_p=1,
    stream=False,
    stop=None,
    # We set the response format to JSON object
    response_format={"type": "json_object"}
)

# Parse the JSON string returned by the model, then pretty-print it
data_extracted = json.dumps(json.loads(completion.choices[0].message.content), indent=4)

print(data_extracted)
```

## And Voila!

You have now built a data extraction bot using Groq and Firecrawl. You can use it to extract structured data from any website.

If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev).
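Because `response_format={"type": "json_object"}` makes the model return a JSON string, you can load it into a Python dict and work with individual fields. A minimal sketch; the `raw` payload below is a made-up stand-in shaped like the fields requested above, while in practice the string comes from `completion.choices[0].message.content`:

```python
import json

# Illustrative payload; the real one comes from the Groq completion above
raw = '{"title_of_the_article": "Example title", "date": "2024-04-01", "companies_building_with_quest": ["Example Co"], "summary": "An example summary.", "people_testimonials": []}'

data = json.loads(raw)  # parse the model's JSON string into a dict

print(data["title_of_the_article"])            # → Example title
print(data["companies_building_with_quest"])   # → ['Example Co']
```

From here the fields can be stored in a database, written to a CSV, or passed to downstream code like any other Python data.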
\ No newline at end of file
diff --git a/tutorials/rag-llama3.mdx b/tutorials/rag-llama3.mdx
new file mode 100644
index 0000000..ae9c48f
--- /dev/null
+++ b/tutorials/rag-llama3.mdx
@@ -0,0 +1,91 @@
---
title: "Build a 'Chat with website' using Groq Llama 3"
description: "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot."
---

## Setup

Install the Python dependencies, including langchain, groq, faiss, ollama, and firecrawl-py.

```bash
pip install --upgrade --quiet langchain langchain-community groq faiss-cpu ollama firecrawl-py
```

We will use Ollama for the embeddings; you can download Ollama [here](https://ollama.com/). Feel free to use any other embeddings you prefer.

## Load website with Firecrawl

To get all the data from a website in the cleanest possible format, we will use Firecrawl. Firecrawl integrates easily with Langchain as a document loader.

Here is how you can load a website with Firecrawl:

```python
from langchain_community.document_loaders import FireCrawlLoader  # Importing the FireCrawlLoader

url = "https://firecrawl.dev"
loader = FireCrawlLoader(
    api_key="fc-YOUR_API_KEY",  # Note: Replace 'YOUR_API_KEY' with your actual Firecrawl API key
    url=url,  # Target URL to crawl
    mode="crawl"  # Mode set to 'crawl' to crawl all accessible subpages
)
docs = loader.load()
```

## Set up the Vectorstore

Next, we will set up the vectorstore. The vectorstore is a data structure that lets us store and query embeddings. We will use Ollama embeddings and the FAISS vectorstore.
We split the documents into chunks of 1000 characters each, with a 200-character overlap. This keeps chunks neither too small nor too big, so each one fits into the LLM's context window when we query it.
```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = FAISS.from_documents(documents=splits, embedding=OllamaEmbeddings())
```

## Retrieval and Generation

Now that our documents are loaded and the vectorstore is set up, we can run a similarity search against the user's question to retrieve the most relevant documents, then feed those documents to the LLM.

```python
question = "What is firecrawl?"
docs = vectorstore.similarity_search(query=question)
```

## Generation

Last but not least, you can use Groq to generate a response to the question based on the documents we have retrieved.

```python
from groq import Groq

client = Groq(
    api_key="YOUR_GROQ_API_KEY",  # Note: Replace with your actual Groq API key
)

completion = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {
            "role": "user",
            "content": f"You are a friendly assistant. Your job is to answer the user's question based on the documentation provided below:\nDocs:\n\n{docs}\n\nQuestion: {question}"
        }
    ],
    temperature=1,
    max_tokens=1024,
    top_p=1,
    stream=False,
    stop=None,
)

print(completion.choices[0].message.content)
```

## And Voila!

You have now built a 'Chat with your website' bot using Groq Llama 3, Langchain, and Firecrawl. You can use it to answer questions based on the documentation of your website.

If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev).
\ No newline at end of file
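As a closing aside, the chunking strategy used in the vectorstore step (1000-character chunks with a 200-character overlap) can be illustrated in plain Python. This is a simplified character-window sketch, not what `RecursiveCharacterTextSplitter` actually does internally; the real splitter also prefers natural boundaries like paragraphs and sentences:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    # Slide a window of `chunk_size` characters, stepping by
    # chunk_size - chunk_overlap so consecutive chunks share context.
    step = chunk_size - chunk_overlap
    # Stop before the final overlap region to avoid emitting a trailing
    # chunk that is entirely contained in the previous one.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("x" * 2500)
print(len(chunks))      # → 3
print(len(chunks[0]))   # → 1000
```

The overlap means the tail of each chunk reappears at the head of the next, so a sentence that straddles a chunk boundary is still retrievable as a whole from at least one chunk.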