Nick: tutorials

parent de7e1f501b
commit 18450b5f9a

tutorials/data-extraction-using-llms.mdx (new file, 95 lines)
@@ -0,0 +1,95 @@
---
title: "Extract website data using LLMs"
description: "Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code."
'og:image': "/images/og.png"
'twitter:image': "/images/og.png"
---

## Setup
Install our Python dependencies, including groq and firecrawl-py.

```bash
pip install groq firecrawl-py
```

## Getting your Groq and Firecrawl API Keys
To use Groq and Firecrawl, you will need to get your API keys. You can get your Groq API key from [here](https://groq.com) and your Firecrawl API key from [here](https://firecrawl.dev).
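Both clients accept the key as a plain string, but if you prefer not to hard-code secrets you can read them from environment variables instead - a minimal sketch, where the variable names `GROQ_API_KEY` and `FIRECRAWL_API_KEY` are just the convention used in this example:

```python
import os

# Convention used in this example - export these in your shell first, e.g.:
#   export GROQ_API_KEY=gsk_...
#   export FIRECRAWL_API_KEY=fc-...
groq_api_key = os.environ["GROQ_API_KEY"]
firecrawl_api_key = os.environ["FIRECRAWL_API_KEY"]
```
You can then pass these variables to the clients in the snippets below instead of the literal placeholder strings.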
## Load website with Firecrawl
To get all the data from a web page and make sure it is in the cleanest format, we will use [Firecrawl](https://firecrawl.dev). It handles bypassing JS-blocked websites, extracting the main content, and outputting it in an LLM-readable format for increased accuracy.

Here is how we will scrape a website URL using Firecrawl. We will also set `pageOptions` to extract only the main content (`onlyMainContent: True`) of the page - excluding the navs, footers, etc.
```python
from firecrawl import FirecrawlApp  # Importing the Firecrawl SDK

url = "https://about.fb.com/news/2024/04/introducing-our-open-mixed-reality-ecosystem/"

firecrawl = FirecrawlApp(
    api_key="fc-YOUR_FIRECRAWL_API_KEY",
)
page_content = firecrawl.scrape_url(
    url=url,  # Target URL to scrape
    params={
        "pageOptions": {
            "onlyMainContent": True  # Ignore navs, footers, etc.
        }
    },
)
print(page_content)
```
Perfect, now we have clean data from the website - ready to be fed to the LLM for data extraction.
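If you want to double-check what Firecrawl returned before moving on, the sketch below previews the cleaned text; the `markdown` and `content` keys are an assumption here (field names vary across firecrawl-py versions), so adjust them to match what `print(page_content)` showed you.

```python
# Assumption: the scrape result is a dict exposing the cleaned page text under
# a "markdown" or "content" key - adjust to whatever print(page_content) showed.
if isinstance(page_content, dict):
    page_text = page_content.get("markdown") or page_content.get("content") or str(page_content)
else:
    page_text = str(page_content)

print(page_text[:500])  # preview the first 500 characters
```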
## Extraction and Generation
Now that we have the website data, let's use Groq to pull out the information we need. We'll use the Groq Llama 3 model in JSON mode and pick out specific fields from the page content.

We are using the Llama 3 8B model for this example. Feel free to use larger models for improved results.
```python
import json
from groq import Groq

client = Groq(
    api_key="gsk_YOUR_GROQ_API_KEY",  # Note: Replace 'gsk_YOUR_GROQ_API_KEY' with your actual Groq API key
)

# Here we define the fields we want to extract from the page content
extract = ["summary", "date", "companies_building_with_quest", "title_of_the_article", "people_testimonials"]

completion = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {
            "role": "system",
            "content": "You are a legal advisor who extracts information from documents in JSON."
        },
        {
            "role": "user",
            # Here we pass the page content and the fields we want to extract
            "content": f"Extract the following information from the provided documentation:\nPage content:\n\n{page_content}\n\nInformation to extract: {extract}"
        }
    ],
    temperature=0,
    max_tokens=1024,
    top_p=1,
    stream=False,
    stop=None,
    # We set the response format to JSON object
    response_format={"type": "json_object"}
)

# Pretty print the JSON response
data_extracted = json.loads(completion.choices[0].message.content)
print(json.dumps(data_extracted, indent=4))
```
## And Voila!
You have now built a data extraction bot using Groq and Firecrawl. You can use it to extract structured data from any website.
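As a follow-up, here is one way to wrap the scrape-and-extract steps into a single reusable helper; it is only a sketch that reuses the `firecrawl` and `client` objects created above, and the prompt wording is illustrative rather than prescribed.

```python
def extract_fields(url: str, fields: list[str]) -> dict:
    """Scrape `url` with Firecrawl, then ask the LLM to return `fields` as JSON."""
    page = firecrawl.scrape_url(
        url=url,
        params={"pageOptions": {"onlyMainContent": True}},
    )
    completion = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[
            {"role": "system", "content": "You extract information from documents and reply in JSON."},
            {"role": "user", "content": f"Extract the following information.\nPage content:\n\n{page}\n\nInformation to extract: {fields}"},
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)

print(extract_fields(url, ["summary", "title_of_the_article"]))
```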
If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev).
tutorials/rag-llama3.mdx (new file, 91 lines)
@@ -0,0 +1,91 @@
---
title: "Build a 'Chat with website' using Groq Llama 3"
description: "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot."
---

## Setup
Install our Python dependencies, including langchain, groq, faiss, ollama, and firecrawl-py.

```bash
pip install --upgrade --quiet langchain langchain-community groq faiss-cpu ollama firecrawl-py
```
We will be using Ollama for the embeddings; you can download Ollama [here](https://ollama.com/). Feel free to use any other embeddings you prefer.
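Before moving on, you can sanity-check that Ollama is reachable by embedding a test string - a minimal sketch, assuming the Ollama server is running locally and its default embedding model has already been pulled:

```python
from langchain_community.embeddings import OllamaEmbeddings

# Assumes a local Ollama server is running and its default model is available.
embeddings = OllamaEmbeddings()
vector = embeddings.embed_query("Hello, Firecrawl!")
print(len(vector))  # dimensionality of the embedding vector
```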
## Load website with Firecrawl
To get all the data from a website and make sure it is in the cleanest format, we will use Firecrawl. Firecrawl integrates very easily with Langchain as a document loader.

Here is how you can load a website with Firecrawl:
```python
from langchain_community.document_loaders import FireCrawlLoader  # Importing the FireCrawlLoader

url = "https://firecrawl.dev"
loader = FireCrawlLoader(
    api_key="fc-YOUR_API_KEY",  # Note: Replace 'YOUR_API_KEY' with your actual Firecrawl API key
    url=url,  # Target URL to crawl
    mode="crawl"  # Mode set to 'crawl' to crawl all accessible subpages
)
docs = loader.load()
```
## Setup the Vectorstore
Next, we will set up the vectorstore. The vectorstore is a data structure that allows us to store and query embeddings. We will use the Ollama embeddings and the FAISS vectorstore.
We split the documents into chunks of 1000 characters each, with a 200-character overlap. This ensures the chunks are neither too small nor too big, and that they fit into the LLM's context window when we query it.
```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = FAISS.from_documents(documents=splits, embedding=OllamaEmbeddings())
```
## Retrieval and Generation
Now that our documents are loaded and the vectorstore is set up, we can run a similarity search on the user's question to retrieve the most relevant documents. Those documents are then fed to the LLM.
```python
question = "What is firecrawl?"
docs = vectorstore.similarity_search(query=question)
```
## Generation
Last but not least, you can use Groq to generate a response to the question based on the documents we retrieved.
```python
from groq import Groq

client = Groq(
    api_key="YOUR_GROQ_API_KEY",  # Note: Replace 'YOUR_GROQ_API_KEY' with your actual Groq API key
)

completion = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {
            "role": "user",
            "content": f"You are a friendly assistant. Your job is to answer the user's question based on the documentation provided below:\nDocs:\n\n{docs}\n\nQuestion: {question}"
        }
    ],
    temperature=1,
    max_tokens=1024,
    top_p=1,
    stream=False,
    stop=None,
)

print(completion.choices[0].message.content)
```
## And Voila!
You have now built a 'Chat with your website' bot using Groq Llama 3, Langchain, and Firecrawl. You can use it to answer questions based on your website's documentation.
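To keep asking questions, you can fold the retrieval and generation steps into one helper - a sketch that reuses the `vectorstore` and `client` objects defined above, with an illustrative prompt:

```python
def chat(question: str) -> str:
    """Retrieve the most relevant chunks and ask Groq Llama 3 to answer."""
    relevant_docs = vectorstore.similarity_search(query=question)
    completion = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[
            {
                "role": "user",
                "content": f"You are a friendly assistant. Answer the user's question based on the documentation below:\nDocs:\n\n{relevant_docs}\n\nQuestion: {question}",
            }
        ],
    )
    return completion.choices[0].message.content

print(chat("How do I scrape a page with Firecrawl?"))
```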
If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev).