
Initial commit

This commit is contained in:
Nicolas 2024-04-15 17:01:47 -04:00
commit a6c2a87811
74 changed files with 10873 additions and 0 deletions

BIN
.DS_Store vendored Normal file

Binary file not shown.

2
.gitattributes vendored Normal file
View File

@ -0,0 +1,2 @@
# Auto detect text files and perform LF normalization
* text=auto

20
.github/workflows/fly.yml vendored Normal file
View File

@ -0,0 +1,20 @@
name: Fly Deploy
on:
push:
branches:
- main
# schedule:
# - cron: '0 */4 * * *'
jobs:
deploy:
name: Deploy app
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: superfly/flyctl-actions/setup-flyctl@master
- name: Deploy
  run: flyctl deploy --remote-only
  working-directory: apps/api
  env:
    FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

6
.gitignore vendored Normal file
View File

@ -0,0 +1,6 @@
/node_modules/
/dist/
.env
*.csv
dump.rdb
/mongo-data

4
CONTRIBUTING.md Normal file
View File

@ -0,0 +1,4 @@
# Contributing
We love contributions! Please read this guide before submitting a pull request.

201
LICENSE Normal file
View File

@ -0,0 +1,201 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2024 Firecrawl | Mendable.ai
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

108
README.md Normal file
View File

@ -0,0 +1,108 @@
# 🔥 Firecrawl
Crawl and convert any website into clean markdown
*This repo is still in early development.*
## What is Firecrawl?
[Firecrawl](https://firecrawl.dev?ref=github) is an API service that takes a URL, crawls it, and converts it into clean markdown. We crawl all accessible subpages and give you clean markdown for each. No sitemap required.
## How to use it?
We provide an easy-to-use API with our hosted version. You can find the playground and documentation [here](https://firecrawl.com/playground). You can also self-host the backend if you'd like.
- [x] API
- [x] Python SDK
- [ ] JS SDK (coming soon)
To self-host, refer to the guide [here](https://github.com/mendableai/firecrawl/blob/main/SELF_HOST.md).
### API Key
To use the API, you need to sign up on [Firecrawl](https://firecrawl.com) and get an API key.
### Crawling
Used to crawl a URL and all accessible subpages. This submits a crawl job and returns a job ID to check the status of the crawl.
```bash
curl -X POST https://api.firecrawl.dev/v0/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://mendable.ai"
}'
```
Returns a jobId
```json
{ "jobId": "1234-5678-9101" }
```
### Check Crawl Job
Used to check the status of a crawl job and get its result.
```bash
curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY'
```
```json
{
  "status": "completed",
  "current": 22,
  "total": 22,
  "data": [
    {
      "content": "Raw Content",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www.mendable.ai/"
      }
    }
  ]
}
```
## Using Python SDK
### Installing Python SDK
```bash
pip install firecrawl-py
```
### Crawl a website
```python
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="YOUR_API_KEY")
crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})
# Get the markdown
for result in crawl_result:
print(result['markdown'])
```
### Scraping a URL
To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.
```python
url = 'https://example.com'
scraped_data = app.scrape_url(url)
```
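Assuming the returned dictionary mirrors the crawl results shown above (with `markdown` and `metadata` keys; the exact shape may differ), you can read the fields directly:
```python
url = 'https://example.com'
scraped_data = app.scrape_url(url)

# Assumed keys, mirroring the crawl response shown earlier
print(scraped_data['markdown'])
print(scraped_data['metadata']['title'])
```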
## Contributing
We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.

6
SELF_HOST.md Normal file
View File

@ -0,0 +1,6 @@
# Self-hosting Firecrawl
Guide coming soon.

BIN
apps/.DS_Store vendored Normal file

Binary file not shown.

4
apps/api/.dockerignore Normal file
View File

@ -0,0 +1,4 @@
/node_modules/
/dist/
.env
*.csv

8
apps/api/.env.local Normal file
View File

@ -0,0 +1,8 @@
PORT=
HOST=
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=
REDIS_URL=
OPENAI_API_KEY=
SCRAPING_BEE_API_KEY=

2
apps/api/.gitattributes vendored Normal file
View File

@ -0,0 +1,2 @@
# Auto detect text files and perform LF normalization
* text=auto

6
apps/api/.gitignore vendored Normal file
View File

@ -0,0 +1,6 @@
/node_modules/
/dist/
.env
*.csv
dump.rdb
/mongo-data

36
apps/api/Dockerfile Normal file
View File

@ -0,0 +1,36 @@
FROM node:20-slim AS base
ENV PNPM_HOME="/pnpm"
ENV PATH="$PNPM_HOME:$PATH"
LABEL fly_launch_runtime="Node.js"
RUN corepack enable
COPY . /app
WORKDIR /app
FROM base AS prod-deps
RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm install --prod --frozen-lockfile
FROM base AS build
RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm install --frozen-lockfile
RUN pnpm run build
# Install packages needed for deployment
FROM base
RUN apt-get update -qq && \
apt-get install --no-install-recommends -y chromium chromium-sandbox && \
rm -rf /var/lib/apt/lists /var/cache/apt/archives
COPY --from=prod-deps /app/node_modules /app/node_modules
COPY --from=build /app /app
# Start the API server by default; this can be overridden at runtime (e.g. the worker
# process in fly.toml runs worker:production instead). Only the last CMD in a Dockerfile
# takes effect, so listing two CMDs would silently drop the first one.
EXPOSE 8080
ENV PUPPETEER_EXECUTABLE_PATH="/usr/bin/chromium"
CMD [ "pnpm", "run", "start:production" ]

47
apps/api/fly.toml Normal file
View File

@ -0,0 +1,47 @@
# fly.toml app configuration file generated for firecrawl-scraper-js on 2024-04-07T21:09:59-03:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#
app = 'firecrawl-scraper-js'
primary_region = 'mia'
kill_signal = 'SIGINT'
kill_timeout = '5s'
[build]
[processes]
app = 'npm run start:production'
worker = 'npm run worker:production'
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0
processes = ['app']
[[services]]
protocol = 'tcp'
internal_port = 8080
processes = ['app']
[[services.ports]]
port = 80
handlers = ['http']
force_https = true
[[services.ports]]
port = 443
handlers = ['tls', 'http']
[services.concurrency]
type = 'connections'
hard_limit = 45
soft_limit = 20
[[vm]]
size = 'performance-1x'

5
apps/api/jest.config.js Normal file
View File

@ -0,0 +1,5 @@
module.exports = {
preset: "ts-jest",
testEnvironment: "node",
setupFiles: ["./jest.setup.js"],
};

1
apps/api/jest.setup.js Normal file
View File

@ -0,0 +1 @@
global.fetch = require('jest-fetch-mock');

98
apps/api/package.json Normal file
View File

@ -0,0 +1,98 @@
{
"name": "firecrawl-scraper-js",
"version": "1.0.0",
"description": "",
"main": "index.js",
"scripts": {
"start": "nodemon --exec ts-node src/index.ts",
"start:production": "tsc && node dist/src/index.js",
"format": "prettier --write \"src/**/*.(js|ts)\"",
"flyio": "node dist/src/index.js",
"start:dev": "nodemon --exec ts-node src/index.ts",
"build": "tsc",
"test": "jest --verbose",
"workers": "nodemon --exec ts-node src/services/queue-worker.ts",
"worker:production": "node dist/src/services/queue-worker.js",
"mongo-docker": "docker run -d -p 2717:27017 -v ./mongo-data:/data/db --name mongodb mongo:latest",
"mongo-docker-console": "docker exec -it mongodb mongosh",
"run-example": "npx ts-node src/example.ts"
},
"author": "",
"license": "ISC",
"devDependencies": {
"@flydotio/dockerfile": "^0.4.10",
"@tsconfig/recommended": "^1.0.3",
"@types/body-parser": "^1.19.2",
"@types/bull": "^4.10.0",
"@types/cors": "^2.8.13",
"@types/express": "^4.17.17",
"@types/jest": "^29.5.6",
"body-parser": "^1.20.1",
"express": "^4.18.2",
"jest": "^29.6.3",
"jest-fetch-mock": "^3.0.3",
"nodemon": "^2.0.20",
"supabase": "^1.77.9",
"supertest": "^6.3.3",
"ts-jest": "^29.1.1",
"ts-node": "^10.9.1",
"typescript": "^5.4.2"
},
"dependencies": {
"@brillout/import": "^0.2.2",
"@bull-board/api": "^5.14.2",
"@bull-board/express": "^5.8.0",
"@devil7softwares/pos": "^1.0.2",
"@dqbd/tiktoken": "^1.0.7",
"@logtail/node": "^0.4.12",
"@nangohq/node": "^0.36.33",
"@sentry/node": "^7.48.0",
"@supabase/supabase-js": "^2.7.1",
"async": "^3.2.5",
"async-mutex": "^0.4.0",
"axios": "^1.3.4",
"bottleneck": "^2.19.5",
"bull": "^4.11.4",
"cheerio": "^1.0.0-rc.12",
"cohere": "^1.1.1",
"cors": "^2.8.5",
"cron-parser": "^4.9.0",
"date-fns": "^2.29.3",
"dotenv": "^16.3.1",
"express-rate-limit": "^6.7.0",
"glob": "^10.3.12",
"gpt3-tokenizer": "^1.1.5",
"ioredis": "^5.3.2",
"keyword-extractor": "^0.0.25",
"langchain": "^0.1.25",
"languagedetect": "^2.0.0",
"logsnag": "^0.1.6",
"luxon": "^3.4.3",
"md5": "^2.3.0",
"moment": "^2.29.4",
"mongoose": "^8.0.3",
"natural": "^6.3.0",
"openai": "^4.28.4",
"pos": "^0.4.2",
"promptable": "^0.0.9",
"puppeteer": "^22.6.3",
"rate-limiter-flexible": "^2.4.2",
"redis": "^4.6.7",
"robots-parser": "^3.0.1",
"scrapingbee": "^1.7.4",
"stripe": "^12.2.0",
"turndown": "^7.1.3",
"typesense": "^1.5.4",
"unstructured-client": "^0.9.4",
"uuid": "^9.0.1",
"wordpos": "^2.1.0",
"xml2js": "^0.6.2"
},
"nodemonConfig": {
"ignore": [
"*.docx",
"*.json",
"temp"
]
}
}

6146
apps/api/pnpm-lock.yaml Normal file

File diff suppressed because it is too large

53
apps/api/requests.http Normal file
View File

@ -0,0 +1,53 @@
### Crawl Website
POST http://localhost:3002/v0/crawl HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
content-type: application/json
{
"url":"https://docs.mendable.ai"
}
### Check Job Status
GET http://localhost:3002/v0/jobs/active HTTP/1.1
### Scrape Website
POST https://api.firecrawl.dev/v0/scrape HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
content-type: application/json
{
"url":"https://www.agentops.ai"
}
### Scrape Website
POST http://localhost:3002/v0/scrape HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
content-type: application/json
{
"url":"https://www.agentops.ai"
}
### Check Job Status
GET http://localhost:3002/v0/crawl/status/333ab225-dc3e-418b-9d4b-8fb833cbaf89 HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
### Get Job Result
POST https://api.firecrawl.dev/v0/crawl HTTP/1.1
Authorization: Bearer 30c90634-8377-4446-9ef9-a280b9be1702
content-type: application/json
{
"url":"https://markprompt.com"
}
### Check Job Status
GET https://api.firecrawl.dev/v0/crawl/status/cfcb71ac-23a3-4da5-bd85-d4e58b871d66
Authorization: Bearer 30c90634-8377-4446-9ef9-a280b9be1702

BIN
apps/api/src/.DS_Store vendored Normal file

Binary file not shown.

2
apps/api/src/control.ts Normal file
View File

@ -0,0 +1,2 @@
// ! If OpenAI goes down, flip this fallback flag to true
export const is_fallback = false;

18
apps/api/src/example.ts Normal file
View File

@ -0,0 +1,18 @@
import { WebScraperDataProvider } from "./scraper/WebScraper";
async function example() {
const example = new WebScraperDataProvider();
await example.setOptions({
mode: "crawl",
urls: ["https://mendable.ai"],
crawlerOptions: {},
});
const docs = await example.getDocuments(false);
docs.forEach((doc) => {
console.log(doc.metadata.sourceURL);
});
console.log(docs.length);
}
example();

352
apps/api/src/index.ts Normal file
View File

@ -0,0 +1,352 @@
import express from "express";
import bodyParser from "body-parser";
import cors from "cors";
import "dotenv/config";
import { getWebScraperQueue } from "./services/queue-service";
import { addWebScraperJob } from "./services/queue-jobs";
import { supabase_service } from "./services/supabase";
import { WebScraperDataProvider } from "./scraper/WebScraper";
import { billTeam, checkTeamCredits } from "./services/billing/credit_billing";
import { getRateLimiter, redisClient } from "./services/rate-limiter";
const { createBullBoard } = require("@bull-board/api");
const { BullAdapter } = require("@bull-board/api/bullAdapter");
const { ExpressAdapter } = require("@bull-board/express");
export const app = express();
global.isProduction = process.env.IS_PRODUCTION === "true";
app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json({ limit: "10mb" }));
app.use(cors()); // Enable CORS
const serverAdapter = new ExpressAdapter();
serverAdapter.setBasePath(`/admin/${process.env.BULL_AUTH_KEY}/queues`);
const { addQueue, removeQueue, setQueues, replaceQueues } = createBullBoard({
queues: [new BullAdapter(getWebScraperQueue())],
serverAdapter: serverAdapter,
});
app.use(
`/admin/${process.env.BULL_AUTH_KEY}/queues`,
serverAdapter.getRouter()
);
app.get("/", (req, res) => {
res.send("SCRAPERS-JS: Hello, world! Fly.io");
});
// Simple test route
app.get("/test", async (req, res) => {
res.send("Hello, world!");
});
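// Resolves the caller's team_id from the api_keys table, after applying an IP+token rate limit.
// A preview token is accepted for scrape requests. On failure this sends the error response itself.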
async function authenticateUser(req, res, mode?: string): Promise<string> {
const authHeader = req.headers.authorization;
if (!authHeader) {
return res.status(401).json({ error: "Unauthorized" });
}
const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
if (!token) {
return res.status(401).json({ error: "Unauthorized: Token missing" });
}
try {
const incomingIP = (req.headers["x-forwarded-for"] ||
req.socket.remoteAddress) as string;
const iptoken = incomingIP + token;
await getRateLimiter(
token === "this_is_just_a_preview_token" ? true : false
).consume(iptoken);
} catch (rateLimiterRes) {
console.error(rateLimiterRes);
return res.status(429).json({
error: "Rate limit exceeded. Too many requests, try again in 1 minute.",
});
}
if (token === "this_is_just_a_preview_token" && mode === "scrape") {
return "preview";
}
// make sure api key is valid, based on the api_keys table in supabase
const { data, error } = await supabase_service
.from("api_keys")
.select("*")
.eq("key", token);
if (error || !data || data.length === 0) {
return res.status(401).json({ error: "Unauthorized: Invalid token" });
}
return data[0].team_id;
}
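// POST /v0/scrape: authenticate, check credits, scrape a single URL, bill for non-empty
// documents, and return the first result.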
app.post("/v0/scrape", async (req, res) => {
try {
// make sure to authenticate user first, Bearer <token>
const team_id = await authenticateUser(req, res, "scrape");
try {
const { success: creditsCheckSuccess, message: creditsCheckMessage } =
await checkTeamCredits(team_id, 1);
if (!creditsCheckSuccess) {
return res.status(402).json({ error: "Insufficient credits" });
}
} catch (error) {
console.error(error);
return res.status(500).json({ error: "Internal server error" });
}
// authenticate on supabase
const url = req.body.url;
if (!url) {
return res.status(400).json({ error: "Url is required" });
}
try {
const a = new WebScraperDataProvider();
await a.setOptions({
mode: "single_urls",
urls: [url],
});
const docs = await a.getDocuments(false);
// make sure doc.content is not empty
const filteredDocs = docs.filter(
(doc: { content?: string }) =>
doc.content && doc.content.trim().length > 0
);
if (filteredDocs.length === 0) {
return res.status(200).json({ success: true, data: [] });
}
const { success, credit_usage } = await billTeam(
team_id,
filteredDocs.length
);
if (!success) {
// throw new Error("Failed to bill team, no subscription was found");
// return {
// success: false,
// message: "Failed to bill team, no subscription was found",
// docs: [],
// };
return res
.status(402)
.json({ error: "Failed to bill, no subscription was found" });
}
return res.json({
success: true,
data: filteredDocs[0],
});
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
});
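// POST /v0/crawl: check credits, then either scrape inline for single_urls mode or enqueue a
// crawl job and respond with its jobId.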
app.post("/v0/crawl", async (req, res) => {
try {
const team_id = await authenticateUser(req, res);
const { success: creditsCheckSuccess, message: creditsCheckMessage } =
await checkTeamCredits(team_id, 1);
if (!creditsCheckSuccess) {
return res.status(402).json({ error: "Insufficient credits" });
}
// authenticate on supabase
const url = req.body.url;
if (!url) {
return res.status(400).json({ error: "Url is required" });
}
const mode = req.body.mode ?? "crawl";
const crawlerOptions = req.body.crawlerOptions ?? {};
if (mode === "single_urls" && !url.includes(",")) {
try {
const a = new WebScraperDataProvider();
await a.setOptions({
mode: "single_urls",
urls: [url],
crawlerOptions: {
returnOnlyUrls: true,
},
});
// No queue job exists in this branch, so there is nothing to report progress to.
const docs = await a.getDocuments(false);
return res.json({
success: true,
documents: docs,
});
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
}
const job = await addWebScraperJob({
url: url,
mode: mode ?? "crawl", // fix for single urls not working
crawlerOptions: { ...crawlerOptions },
team_id: team_id,
});
res.json({ jobId: job.id });
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
});
app.post("/v0/crawlWebsitePreview", async (req, res) => {
try {
// make sure to authenticate user first, Bearer <token>
const authHeader = req.headers.authorization;
if (!authHeader) {
return res.status(401).json({ error: "Unauthorized" });
}
const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
if (!token) {
return res.status(401).json({ error: "Unauthorized: Token missing" });
}
// authenticate on supabase
const url = req.body.url;
if (!url) {
return res.status(400).json({ error: "Url is required" });
}
const mode = req.body.mode ?? "crawl";
const crawlerOptions = req.body.crawlerOptions ?? {};
const job = await addWebScraperJob({
url: url,
mode: mode ?? "crawl", // fix for single urls not working
crawlerOptions: { ...crawlerOptions, limit: 5, maxCrawledLinks: 5 },
team_id: "preview",
});
res.json({ jobId: job.id });
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
});
app.get("/v0/crawl/status/:jobId", async (req, res) => {
try {
const authHeader = req.headers.authorization;
if (!authHeader) {
return res.status(401).json({ error: "Unauthorized" });
}
const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
if (!token) {
return res.status(401).json({ error: "Unauthorized: Token missing" });
}
// make sure api key is valid, based on the api_keys table in supabase
const { data, error } = await supabase_service
.from("api_keys")
.select("*")
.eq("key", token);
if (error || !data || data.length === 0) {
return res.status(401).json({ error: "Unauthorized: Invalid token" });
}
const job = await getWebScraperQueue().getJob(req.params.jobId);
if (!job) {
return res.status(404).json({ error: "Job not found" });
}
const { current, current_url, total, current_step } = await job.progress();
res.json({
status: await job.getState(),
// progress: job.progress(),
current: current,
current_url: current_url,
current_step: current_step,
total: total,
data: job.returnvalue,
});
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
});
app.get("/v0/checkJobStatus/:jobId", async (req, res) => {
try {
const job = await getWebScraperQueue().getJob(req.params.jobId);
if (!job) {
return res.status(404).json({ error: "Job not found" });
}
const { current, current_url, total, current_step } = await job.progress();
res.json({
status: await job.getState(),
// progress: job.progress(),
current: current,
current_url: current_url,
current_step: current_step,
total: total,
data: job.returnvalue,
});
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
});
const DEFAULT_PORT = process.env.PORT ?? 3002;
const HOST = process.env.HOST ?? "localhost";
redisClient.connect();
export function startServer(port = DEFAULT_PORT) {
const server = app.listen(Number(port), HOST, () => {
console.log(`Server listening on port ${port}`);
console.log(`For the UI, open http://${HOST}:${port}/admin/queues`);
console.log("");
console.log("1. Make sure Redis is running on port 6379 by default");
console.log(
"2. If you want to run nango, make sure you do port forwarding in 3002 using ngrok http 3002 "
);
});
return server;
}
if (require.main === module) {
startServer();
}
// Use this as a health check so we don't tear the server down while jobs are still active
app.get(`/admin/${process.env.BULL_AUTH_KEY}/queues`, async (req, res) => {
try {
const webScraperQueue = getWebScraperQueue();
const [webScraperActive] = await Promise.all([
webScraperQueue.getActiveCount(),
]);
const noActiveJobs = webScraperActive === 0;
// 200 if no active jobs, 500 if there are active jobs
return res.status(noActiveJobs ? 200 : 500).json({
webScraperActive,
noActiveJobs,
});
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
});
app.get("/is-production", (req, res) => {
res.send({ isProduction: global.isProduction });
});

View File

@ -0,0 +1,16 @@
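// Runs asyncFunction over the array in sequential batches of batchSize, awaiting each batch
// before starting the next. Example (hypothetical values):
//   await batchProcess([1, 2, 3, 4], 2, async (n) => console.log(n));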
export async function batchProcess<T>(
array: T[],
batchSize: number,
asyncFunction: (item: T, index: number) => Promise<void>
): Promise<void> {
const batches = [];
for (let i = 0; i < array.length; i += batchSize) {
const batch = array.slice(i, i + batchSize);
batches.push(batch);
}
for (const batch of batches) {
await Promise.all(batch.map((item, i) => asyncFunction(item, i)));
}
}

View File

@ -0,0 +1,21 @@
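// Error subclass that carries an HTTP status code, a status label, and (optionally) the data
// ingestion job that triggered it.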
export class CustomError extends Error {
statusCode: number;
status: string;
message: string;
dataIngestionJob: any;
constructor(
statusCode: number,
status: string,
message: string = "",
dataIngestionJob?: any,
) {
super(message);
this.statusCode = statusCode;
this.status = status;
this.message = message;
this.dataIngestionJob = dataIngestionJob;
Object.setPrototypeOf(this, CustomError.prototype);
}
}

View File

@ -0,0 +1,37 @@
export interface Progress {
current: number;
total: number;
status: string;
metadata?: {
sourceURL?: string;
[key: string]: any;
};
currentDocumentUrl?: string;
}
export class Document {
id?: string;
content: string;
markdown?: string;
createdAt?: Date;
updatedAt?: Date;
type?: string;
metadata: {
sourceURL?: string;
[key: string]: any;
};
childrenLinks?: string[];
constructor(data: Partial<Document>) {
if (!data.content) {
throw new Error("Missing required fields");
}
this.content = data.content;
this.createdAt = data.createdAt || new Date();
this.updatedAt = data.updatedAt || new Date();
this.type = data.type || "unknown";
this.metadata = data.metadata || { sourceURL: "" };
this.markdown = data.markdown || "";
this.childrenLinks = data.childrenLinks || undefined;
}
}

View File

@ -0,0 +1,51 @@
export function parseMarkdown(html: string) {
var TurndownService = require("turndown");
const turndownService = new TurndownService();
turndownService.addRule("inlineLink", {
filter: function (node, options) {
return (
options.linkStyle === "inlined" &&
node.nodeName === "A" &&
node.getAttribute("href")
);
},
replacement: function (content, node) {
var href = node.getAttribute("href").trim();
var title = node.title ? ' "' + node.title + '"' : "";
return "[" + content.trim() + "](" + href + title + ")\n";
},
});
let markdownContent = turndownService.turndown(html);
// Handle multi-line link text: escape newlines inside [...] so the link stays on one markdown line
let insideLinkContent = false;
let newMarkdownContent = "";
let linkOpenCount = 0;
for (let i = 0; i < markdownContent.length; i++) {
const char = markdownContent[i];
if (char == "[") {
linkOpenCount++;
} else if (char == "]") {
linkOpenCount = Math.max(0, linkOpenCount - 1);
}
insideLinkContent = linkOpenCount > 0;
if (insideLinkContent && char == "\n") {
newMarkdownContent += "\\" + "\n";
} else {
newMarkdownContent += char;
}
}
markdownContent = newMarkdownContent;
// Remove [Skip to Content](#page) and [Skip to content](#skip)
markdownContent = markdownContent.replace(
/\[Skip to Content\]\(#[^\)]*\)/gi,
""
);
return markdownContent;
}

View File

@ -0,0 +1,12 @@
export function parseMode(mode: string) {
switch (mode) {
case "single_urls":
return "single_urls";
case "sitemap":
return "sitemap";
case "crawl":
return "crawl";
default:
return "single_urls";
}
}

View File

@ -0,0 +1,96 @@
import { Job } from "bull";
import { CrawlResult, WebScraperOptions } from "../types";
import { WebScraperDataProvider } from "../scraper/WebScraper";
import { Progress } from "../lib/entities";
import { billTeam } from "../services/billing/credit_billing";
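// Runs a queued scrape/crawl job, forwarding progress, success, and failure back to the Bull job.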
export async function startWebScraperPipeline({
job,
}: {
job: Job<WebScraperOptions>;
}) {
return (await runWebScraper({
url: job.data.url,
mode: job.data.mode,
crawlerOptions: job.data.crawlerOptions,
inProgress: (progress) => {
job.progress(progress);
},
onSuccess: (result) => {
job.moveToCompleted(result);
},
onError: (error) => {
job.moveToFailed(error);
},
team_id: job.data.team_id,
})) as { success: boolean; message: string; docs: CrawlResult[] };
}
export async function runWebScraper({
url,
mode,
crawlerOptions,
inProgress,
onSuccess,
onError,
team_id,
}: {
url: string;
mode: "crawl" | "single_urls" | "sitemap";
crawlerOptions: any;
inProgress: (progress: any) => void;
onSuccess: (result: any) => void;
onError: (error: any) => void;
team_id: string;
}): Promise<{ success: boolean; message: string; docs: CrawlResult[] }> {
try {
const provider = new WebScraperDataProvider();
if (mode === "crawl") {
await provider.setOptions({
mode: mode,
urls: [url],
crawlerOptions: crawlerOptions,
});
} else {
await provider.setOptions({
mode: mode,
urls: url.split(","),
crawlerOptions: crawlerOptions,
});
}
const docs = (await provider.getDocuments(false, (progress: Progress) => {
inProgress(progress);
})) as CrawlResult[];
if (docs.length === 0) {
return {
success: true,
message: "No pages found",
docs: [],
};
}
// remove docs with empty content
const filteredDocs = docs.filter((doc) => doc.content.trim().length > 0);
onSuccess(filteredDocs);
const { success, credit_usage } = await billTeam(
team_id,
filteredDocs.length
);
if (!success) {
// throw new Error("Failed to bill team, no subscription was found");
return {
success: false,
message: "Failed to bill team, no subscription was found",
docs: [],
};
}
return { success: true, message: "", docs: filteredDocs as CrawlResult[] };
} catch (error) {
console.error("Error running web scraper", error);
onError(error);
return { success: false, message: error.message, docs: [] };
}
}

View File

@ -0,0 +1,295 @@
import axios from "axios";
import cheerio, { load } from "cheerio";
import { URL } from "url";
import { getLinksFromSitemap } from "./sitemap";
import async from "async";
import { Progress } from "../../lib/entities";
import { scrapWithScrapingBee } from "./single_url";
import robotsParser from "robots-parser";
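// Crawler that respects robots.txt and include/exclude patterns, prefers sitemap.xml when one is
// available, and caps the number of crawled links via maxCrawledLinks/limit.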
export class WebCrawler {
private initialUrl: string;
private baseUrl: string;
private includes: string[];
private excludes: string[];
private maxCrawledLinks: number;
private visited: Set<string> = new Set();
private crawledUrls: Set<string> = new Set();
private limit: number;
private robotsTxtUrl: string;
private robots: any;
constructor({
initialUrl,
includes,
excludes,
maxCrawledLinks,
limit = 10000,
}: {
initialUrl: string;
includes?: string[];
excludes?: string[];
maxCrawledLinks?: number;
limit?: number;
}) {
this.initialUrl = initialUrl;
this.baseUrl = new URL(initialUrl).origin;
this.includes = includes ?? [];
this.excludes = excludes ?? [];
this.limit = limit;
this.robotsTxtUrl = `${this.baseUrl}/robots.txt`;
this.robots = robotsParser(this.robotsTxtUrl, "");
// Deprecated, use limit instead
this.maxCrawledLinks = maxCrawledLinks ?? limit;
}
private filterLinks(sitemapLinks: string[], limit: number): string[] {
return sitemapLinks
.filter((link) => {
const url = new URL(link);
const path = url.pathname;
// Check if the link should be excluded
if (this.excludes.length > 0 && this.excludes[0] !== "") {
if (
this.excludes.some((excludePattern) =>
new RegExp(excludePattern).test(path)
)
) {
return false;
}
}
// Check if the link matches the include patterns, if any are specified
if (this.includes.length > 0 && this.includes[0] !== "") {
return this.includes.some((includePattern) =>
new RegExp(includePattern).test(path)
);
}
const isAllowed = this.robots.isAllowed(link, "FireCrawlAgent") ?? true;
// Check if the link is disallowed by robots.txt
if (!isAllowed) {
console.log(`Link disallowed by robots.txt: ${link}`);
return false;
}
return true;
})
.slice(0, limit);
}
public async start(
inProgress?: (progress: Progress) => void,
concurrencyLimit: number = 5,
limit: number = 10000
): Promise<string[]> {
// Fetch and parse robots.txt
try {
const response = await axios.get(this.robotsTxtUrl);
this.robots = robotsParser(this.robotsTxtUrl, response.data);
} catch (error) {
console.error(`Failed to fetch robots.txt from ${this.robotsTxtUrl}`);
}
const sitemapLinks = await this.tryFetchSitemapLinks(this.initialUrl);
if (sitemapLinks.length > 0) {
const filteredLinks = this.filterLinks(sitemapLinks, limit);
return filteredLinks;
}
const urls = await this.crawlUrls(
[this.initialUrl],
concurrencyLimit,
inProgress
);
if (
urls.length === 0 &&
this.filterLinks([this.initialUrl], limit).length > 0
) {
return [this.initialUrl];
}
// make sure to run include exclude here again
return this.filterLinks(urls, limit);
}
private async crawlUrls(
urls: string[],
concurrencyLimit: number,
inProgress?: (progress: Progress) => void
): Promise<string[]> {
const queue = async.queue(async (task: string, callback) => {
if (this.crawledUrls.size >= this.maxCrawledLinks) {
if (callback && typeof callback === "function") {
callback();
}
return;
}
const newUrls = await this.crawl(task);
newUrls.forEach((url) => this.crawledUrls.add(url));
if (inProgress && newUrls.length > 0) {
inProgress({
current: this.crawledUrls.size,
total: this.maxCrawledLinks,
status: "SCRAPING",
currentDocumentUrl: newUrls[newUrls.length - 1],
});
} else if (inProgress) {
inProgress({
current: this.crawledUrls.size,
total: this.maxCrawledLinks,
status: "SCRAPING",
currentDocumentUrl: task,
});
}
await this.crawlUrls(newUrls, concurrencyLimit, inProgress);
if (callback && typeof callback === "function") {
callback();
}
}, concurrencyLimit);
queue.push(
urls.filter(
(url) =>
!this.visited.has(url) && this.robots.isAllowed(url, "FireCrawlAgent")
),
(err) => {
if (err) console.error(err);
}
);
await queue.drain();
return Array.from(this.crawledUrls);
}
async crawl(url: string): Promise<string[]> {
if (this.visited.has(url) || !this.robots.isAllowed(url, "FireCrawlAgent"))
return [];
this.visited.add(url);
if (!url.startsWith("http")) {
url = "https://" + url;
}
if (url.endsWith("/")) {
url = url.slice(0, -1);
}
if (this.isFile(url) || this.isSocialMediaOrEmail(url)) {
return [];
}
try {
let content;
// If it is the first link, fetch with scrapingbee
if (this.visited.size === 1) {
content = await scrapWithScrapingBee(url, "load");
} else {
const response = await axios.get(url);
content = response.data;
}
const $ = load(content);
let links: string[] = [];
$("a").each((_, element) => {
const href = $(element).attr("href");
if (href) {
let fullUrl = href;
if (!href.startsWith("http")) {
fullUrl = new URL(href, this.baseUrl).toString();
}
const url = new URL(fullUrl);
const path = url.pathname;
if (
// fullUrl.startsWith(this.initialUrl) && // this condition makes it stop crawling back the url
this.isInternalLink(fullUrl) &&
this.matchesPattern(fullUrl) &&
this.noSections(fullUrl) &&
this.matchesIncludes(path) &&
!this.matchesExcludes(path) &&
this.robots.isAllowed(fullUrl, "FireCrawlAgent")
) {
links.push(fullUrl);
}
}
});
return links.filter((link) => !this.visited.has(link));
} catch (error) {
return [];
}
}
private matchesIncludes(url: string): boolean {
if (this.includes.length === 0 || this.includes[0] == "") return true;
return this.includes.some((pattern) => new RegExp(pattern).test(url));
}
private matchesExcludes(url: string): boolean {
if (this.excludes.length === 0 || this.excludes[0] == "") return false;
return this.excludes.some((pattern) => new RegExp(pattern).test(url));
}
private noSections(link: string): boolean {
return !link.includes("#");
}
private isInternalLink(link: string): boolean {
const urlObj = new URL(link, this.baseUrl);
const domainWithoutProtocol = this.baseUrl.replace(/^https?:\/\//, "");
return urlObj.hostname === domainWithoutProtocol;
}
private matchesPattern(link: string): boolean {
return true; // Placeholder for future pattern matching implementation
}
private isFile(url: string): boolean {
const fileExtensions = [
".png",
".jpg",
".jpeg",
".gif",
".css",
".js",
".ico",
".svg",
".pdf",
".zip",
".exe",
".dmg",
".mp4",
".mp3",
".pptx",
".docx",
".xlsx",
".xml",
];
return fileExtensions.some((ext) => url.endsWith(ext));
}
private isSocialMediaOrEmail(url: string): boolean {
const socialMediaOrEmail = [
"facebook.com",
"twitter.com",
"linkedin.com",
"instagram.com",
"pinterest.com",
"mailto:",
];
return socialMediaOrEmail.some((ext) => url.includes(ext));
}
private async tryFetchSitemapLinks(url: string): Promise<string[]> {
const sitemapUrl = url.endsWith("/sitemap.xml")
? url
: `${url}/sitemap.xml`;
try {
const response = await axios.get(sitemapUrl);
if (response.status === 200) {
return await getLinksFromSitemap(sitemapUrl);
}
} catch (error) {
console.error(`Failed to fetch sitemap from ${sitemapUrl}: ${error}`);
}
return [];
}
}

View File

@ -0,0 +1,287 @@
import { Document } from "../../lib/entities";
import { Progress } from "../../lib/entities";
import { scrapSingleUrl } from "./single_url";
import { SitemapEntry, fetchSitemapData, getLinksFromSitemap } from "./sitemap";
import { WebCrawler } from "./crawler";
import { getValue, setValue } from "../../services/redis";
export type WebScraperOptions = {
urls: string[];
mode: "single_urls" | "sitemap" | "crawl";
crawlerOptions?: {
returnOnlyUrls?: boolean;
includes?: string[];
excludes?: string[];
maxCrawledLinks?: number;
limit?: number;
};
concurrentRequests?: number;
};
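// Converts URLs into Documents in concurrent batches; in crawl and sitemap modes it discovers the
// links first, enriches documents with sitemap metadata, and caches results in Redis for 10 days.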
export class WebScraperDataProvider {
private urls: string[] = [""];
private mode: "single_urls" | "sitemap" | "crawl" = "single_urls";
private includes: string[];
private excludes: string[];
private maxCrawledLinks: number;
private returnOnlyUrls: boolean;
private limit: number = 10000;
private concurrentRequests: number = 20;
authorize(): void {
throw new Error("Method not implemented.");
}
authorizeNango(): Promise<void> {
throw new Error("Method not implemented.");
}
private async convertUrlsToDocuments(
urls: string[],
inProgress?: (progress: Progress) => void
): Promise<Document[]> {
const totalUrls = urls.length;
let processedUrls = 0;
console.log("Converting urls to documents");
console.log("Total urls", urls);
const results: (Document | null)[] = new Array(urls.length).fill(null);
for (let i = 0; i < urls.length; i += this.concurrentRequests) {
const batchUrls = urls.slice(i, i + this.concurrentRequests);
await Promise.all(batchUrls.map(async (url, index) => {
const result = await scrapSingleUrl(url, true);
processedUrls++;
if (inProgress) {
inProgress({
current: processedUrls,
total: totalUrls,
status: "SCRAPING",
currentDocumentUrl: url,
});
}
results[i + index] = result;
}));
}
return results.filter((result) => result !== null) as Document[];
}
async getDocuments(
useCaching: boolean = false,
inProgress?: (progress: Progress) => void
): Promise<Document[]> {
if (this.urls[0].trim() === "") {
throw new Error("Url is required");
}
if (!useCaching) {
if (this.mode === "crawl") {
const crawler = new WebCrawler({
initialUrl: this.urls[0],
includes: this.includes,
excludes: this.excludes,
maxCrawledLinks: this.maxCrawledLinks,
limit: this.limit,
});
const links = await crawler.start(inProgress, 5, this.limit);
if (this.returnOnlyUrls) {
return links.map((url) => ({
content: "",
metadata: { sourceURL: url },
provider: "web",
type: "text",
}));
}
let documents = await this.convertUrlsToDocuments(links, inProgress);
documents = await this.getSitemapData(this.urls[0], documents);
console.log("documents", documents)
// CACHING DOCUMENTS
// - parent document
const cachedParentDocumentString = await getValue('web-scraper-cache:' + this.normalizeUrl(this.urls[0]));
if (cachedParentDocumentString != null) {
let cachedParentDocument = JSON.parse(cachedParentDocumentString);
if (!cachedParentDocument.childrenLinks || cachedParentDocument.childrenLinks.length < links.length - 1) {
cachedParentDocument.childrenLinks = links.filter((link) => link !== this.urls[0]);
await setValue('web-scraper-cache:' + this.normalizeUrl(this.urls[0]), JSON.stringify(cachedParentDocument), 60 * 60 * 24 * 10); // 10 days
}
} else {
let parentDocument = documents.filter((document) => this.normalizeUrl(document.metadata.sourceURL) === this.normalizeUrl(this.urls[0]))
await this.setCachedDocuments(parentDocument, links);
}
await this.setCachedDocuments(documents.filter((document) => this.normalizeUrl(document.metadata.sourceURL) !== this.normalizeUrl(this.urls[0])), []);
documents = this.removeChildLinks(documents);
documents = documents.splice(0, this.limit);
return documents;
}
if (this.mode === "single_urls") {
let documents = await this.convertUrlsToDocuments(this.urls, inProgress);
const baseUrl = new URL(this.urls[0]).origin;
documents = await this.getSitemapData(baseUrl, documents);
await this.setCachedDocuments(documents);
documents = this.removeChildLinks(documents);
documents = documents.splice(0, this.limit);
return documents;
}
if (this.mode === "sitemap") {
const links = await getLinksFromSitemap(this.urls[0]);
let documents = await this.convertUrlsToDocuments(links.slice(0, this.limit), inProgress);
documents = await this.getSitemapData(this.urls[0], documents);
await this.setCachedDocuments(documents);
documents = this.removeChildLinks(documents);
documents = documents.splice(0, this.limit);
return documents;
}
return [];
}
let documents = await this.getCachedDocuments(this.urls.slice(0, this.limit));
if (documents.length < this.limit) {
const newDocuments: Document[] = await this.getDocuments(false, inProgress);
newDocuments.forEach(doc => {
if (!documents.some(d => this.normalizeUrl(d.metadata.sourceURL) === this.normalizeUrl(doc.metadata?.sourceURL))) {
documents.push(doc);
}
});
}
documents = this.filterDocsExcludeInclude(documents);
documents = this.removeChildLinks(documents);
documents = documents.splice(0, this.limit);
return documents;
}
private filterDocsExcludeInclude(documents: Document[]): Document[] {
return documents.filter((document) => {
const url = new URL(document.metadata.sourceURL);
const path = url.pathname;
if (this.excludes.length > 0 && this.excludes[0] !== '') {
// Check if the link should be excluded
if (this.excludes.some(excludePattern => new RegExp(excludePattern).test(path))) {
return false;
}
}
if (this.includes.length > 0 && this.includes[0] !== '') {
// Check if the link matches the include patterns, if any are specified
if (this.includes.length > 0) {
return this.includes.some(includePattern => new RegExp(includePattern).test(path));
}
}
return true;
});
}
private normalizeUrl(url: string): string {
if (url.includes("//www.")) {
return url.replace("//www.", "//");
}
return url;
}
private removeChildLinks(documents: Document[]): Document[] {
for (let document of documents) {
if (document?.childrenLinks) delete document.childrenLinks;
};
return documents;
}
async setCachedDocuments(documents: Document[], childrenLinks?: string[]) {
for (const document of documents) {
if (document.content.trim().length === 0) {
continue;
}
const normalizedUrl = this.normalizeUrl(document.metadata.sourceURL);
await setValue('web-scraper-cache:' + normalizedUrl, JSON.stringify({
...document,
childrenLinks: childrenLinks || []
}), 60 * 60 * 24 * 10); // 10 days
}
}
async getCachedDocuments(urls: string[]): Promise<Document[]> {
let documents: Document[] = [];
for (const url of urls) {
const normalizedUrl = this.normalizeUrl(url);
console.log("Getting cached document for web-scraper-cache:" + normalizedUrl)
const cachedDocumentString = await getValue('web-scraper-cache:' + normalizedUrl);
if (cachedDocumentString) {
const cachedDocument = JSON.parse(cachedDocumentString);
documents.push(cachedDocument);
// get children documents
for (const childUrl of cachedDocument.childrenLinks) {
const normalizedChildUrl = this.normalizeUrl(childUrl);
const childCachedDocumentString = await getValue('web-scraper-cache:' + normalizedChildUrl);
if (childCachedDocumentString) {
const childCachedDocument = JSON.parse(childCachedDocumentString);
if (!documents.find((doc) => doc.metadata.sourceURL === childCachedDocument.metadata.sourceURL)) {
documents.push(childCachedDocument);
}
}
}
}
}
return documents;
}
setOptions(options: WebScraperOptions): void {
if (!options.urls) {
throw new Error("Urls are required");
}
console.log("options", options.crawlerOptions?.excludes)
this.urls = options.urls;
this.mode = options.mode;
this.concurrentRequests = options.concurrentRequests ?? 20;
this.includes = options.crawlerOptions?.includes ?? [];
this.excludes = options.crawlerOptions?.excludes ?? [];
this.maxCrawledLinks = options.crawlerOptions?.maxCrawledLinks ?? 1000;
this.returnOnlyUrls = options.crawlerOptions?.returnOnlyUrls ?? false;
this.limit = options.crawlerOptions?.limit ?? 10000;
//! @nicolas, for some reason this was being injected and breaking everything. Don't have time to find the source of the issue, so adding this check
this.excludes = this.excludes.filter(item => item !== '');
// make sure all urls start with https://
this.urls = this.urls.map((url) => {
if (!url.trim().startsWith("http")) {
return `https://${url}`;
}
return url;
});
}
private async getSitemapData(baseUrl: string, documents: Document[]) {
const sitemapData = await fetchSitemapData(baseUrl)
if (sitemapData) {
for (let i = 0; i < documents.length; i++) {
const docInSitemapData = sitemapData.find((data) => this.normalizeUrl(data.loc) === this.normalizeUrl(documents[i].metadata.sourceURL))
if (docInSitemapData) {
let sitemapDocData: Partial<SitemapEntry> = {};
if (docInSitemapData.changefreq) {
sitemapDocData.changefreq = docInSitemapData.changefreq;
}
if (docInSitemapData.priority) {
sitemapDocData.priority = Number(docInSitemapData.priority);
}
if (docInSitemapData.lastmod) {
sitemapDocData.lastmod = docInSitemapData.lastmod;
}
if (Object.keys(sitemapDocData).length !== 0) {
documents[i].metadata.sitemap = sitemapDocData;
}
}
}
}
return documents;
}
}

View File

@ -0,0 +1,145 @@
import * as cheerio from "cheerio";
import { ScrapingBeeClient } from "scrapingbee";
import { attemptScrapWithRequests, sanitizeText } from "./utils/utils";
import { extractMetadata } from "./utils/metadata";
import dotenv from "dotenv";
import { Document } from "../../lib/entities";
import { parseMarkdown } from "../../lib/html-to-markdown";
// import puppeteer from "puppeteer";
dotenv.config();
export async function scrapWithScrapingBee(url: string, wait_browser:string = "domcontentloaded"): Promise<string> {
try {
const client = new ScrapingBeeClient(process.env.SCRAPING_BEE_API_KEY);
const response = await client.get({
url: url,
params: { timeout: 15000, wait_browser: wait_browser },
headers: { "ScrapingService-Request": "TRUE" },
});
if (response.status !== 200 && response.status !== 404) {
console.error(
`Scraping bee error in ${url} with status code ${response.status}`
);
return "";
}
const decoder = new TextDecoder();
const text = decoder.decode(response.data);
return text;
} catch (error) {
console.error(`Error scraping with Scraping Bee: ${error}`);
return "";
}
}
export async function scrapWithPlaywright(url: string): Promise<string> {
try {
const response = await fetch(process.env.PLAYWRIGHT_MICROSERVICE_URL, {
method: 'POST',
headers: {
"Content-Type": "application/json",
},
body: JSON.stringify({ url: url }),
});
if (!response.ok) {
console.error(`Error fetching w/ playwright server -> URL: ${url} with status: ${response.status}`);
return "";
}
const data = await response.json();
const html = data.content;
return html ?? "";
} catch (error) {
console.error(`Error scraping with Playwright: ${error}`);
return "";
}
}
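// Scrapes a single URL, falling back through ScrapingBee, the Playwright service, ScrapingBee with
// a full page load, and a plain fetch until usable content is found, then converts the cleaned
// HTML to markdown.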
export async function scrapSingleUrl(
urlToScrap: string,
toMarkdown: boolean = true
): Promise<Document> {
console.log(`Scraping URL: ${urlToScrap}`);
urlToScrap = urlToScrap.trim();
const removeUnwantedElements = (html: string) => {
const soup = cheerio.load(html);
soup("script, style, iframe, noscript, meta, head").remove();
return soup.html();
};
const attemptScraping = async (url: string, method: 'scrapingBee' | 'playwright' | 'scrapingBeeLoad' | 'fetch') => {
let text = "";
switch (method) {
case 'scrapingBee':
if (process.env.SCRAPING_BEE_API_KEY) {
text = await scrapWithScrapingBee(url);
}
break;
case 'playwright':
if (process.env.PLAYWRIGHT_MICROSERVICE_URL) {
text = await scrapWithPlaywright(url);
}
break;
case 'scrapingBeeLoad':
if (process.env.SCRAPING_BEE_API_KEY) {
text = await scrapWithScrapingBee(url, "networkidle2");
}
break;
case 'fetch':
try {
const response = await fetch(url);
if (!response.ok) {
console.error(`Error fetching URL: ${url} with status: ${response.status}`);
return "";
}
text = await response.text();
} catch (error) {
console.error(`Error scraping URL: ${error}`);
return "";
}
break;
}
const cleanedHtml = removeUnwantedElements(text);
return [await parseMarkdown(cleanedHtml), text];
};
try {
let [text, html ] = await attemptScraping(urlToScrap, 'scrapingBee');
if (!text || text.length < 100) {
console.log("Falling back to playwright");
[text, html] = await attemptScraping(urlToScrap, 'playwright');
}
if (!text || text.length < 100) {
console.log("Falling back to scraping bee load");
[text, html] = await attemptScraping(urlToScrap, 'scrapingBeeLoad');
}
if (!text || text.length < 100) {
console.log("Falling back to fetch");
[text, html] = await attemptScraping(urlToScrap, 'fetch');
}
const soup = cheerio.load(html);
const metadata = extractMetadata(soup, urlToScrap);
return {
content: text,
markdown: text,
metadata: { ...metadata, sourceURL: urlToScrap },
} as Document;
} catch (error) {
console.error(`Error: ${error} - Failed to fetch URL: ${urlToScrap}`);
return {
content: "",
markdown: "",
metadata: { sourceURL: urlToScrap },
} as Document;
}
}

View File

@ -0,0 +1,74 @@
import axios from "axios";
import { parseStringPromise } from "xml2js";
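// Recursively collects page URLs from a sitemap, following nested sitemap index entries.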
export async function getLinksFromSitemap(
sitemapUrl: string,
allUrls: string[] = []
): Promise<string[]> {
try {
let content: string;
try {
const response = await axios.get(sitemapUrl);
content = response.data;
} catch (error) {
console.error(`Request failed for ${sitemapUrl}: ${error}`);
return allUrls;
}
const parsed = await parseStringPromise(content);
const root = parsed.urlset || parsed.sitemapindex;
if (root && root.sitemap) {
for (const sitemap of root.sitemap) {
if (sitemap.loc && sitemap.loc.length > 0) {
await getLinksFromSitemap(sitemap.loc[0], allUrls);
}
}
} else if (root && root.url) {
for (const url of root.url) {
if (url.loc && url.loc.length > 0) {
allUrls.push(url.loc[0]);
}
}
}
} catch (error) {
console.error(`Error processing ${sitemapUrl}: ${error}`);
}
return allUrls;
}
export const fetchSitemapData = async (url: string): Promise<SitemapEntry[] | null> => {
const sitemapUrl = url.endsWith("/sitemap.xml") ? url : `${url}/sitemap.xml`;
try {
const response = await axios.get(sitemapUrl);
if (response.status === 200) {
const xml = response.data;
const parsedXml = await parseStringPromise(xml);
const sitemapData: SitemapEntry[] = [];
if (parsedXml.urlset && parsedXml.urlset.url) {
for (const urlElement of parsedXml.urlset.url) {
const sitemapEntry: SitemapEntry = { loc: urlElement.loc[0] };
if (urlElement.lastmod) sitemapEntry.lastmod = urlElement.lastmod[0];
if (urlElement.changefreq) sitemapEntry.changefreq = urlElement.changefreq[0];
if (urlElement.priority) sitemapEntry.priority = Number(urlElement.priority[0]);
sitemapData.push(sitemapEntry);
}
}
return sitemapData;
}
return null;
  } catch (error) {
    console.error(`Error fetching sitemap data from ${sitemapUrl}: ${error}`);
  }
return [];
}
export interface SitemapEntry {
loc: string;
lastmod?: string;
changefreq?: string;
priority?: number;
}

View File

@ -0,0 +1,109 @@
// import * as cheerio from 'cheerio';
import { CheerioAPI } from "cheerio";
interface Metadata {
title?: string;
description?: string;
language?: string;
keywords?: string;
robots?: string;
ogTitle?: string;
ogDescription?: string;
dctermsCreated?: string;
dcDateCreated?: string;
dcDate?: string;
dctermsType?: string;
dcType?: string;
dctermsAudience?: string;
dctermsSubject?: string;
dcSubject?: string;
dcDescription?: string;
ogImage?: string;
dctermsKeywords?: string;
modifiedTime?: string;
publishedTime?: string;
articleTag?: string;
articleSection?: string;
}
export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
let title: string | null = null;
let description: string | null = null;
let language: string | null = null;
let keywords: string | null = null;
let robots: string | null = null;
let ogTitle: string | null = null;
let ogDescription: string | null = null;
let dctermsCreated: string | null = null;
let dcDateCreated: string | null = null;
let dcDate: string | null = null;
let dctermsType: string | null = null;
let dcType: string | null = null;
let dctermsAudience: string | null = null;
let dctermsSubject: string | null = null;
let dcSubject: string | null = null;
let dcDescription: string | null = null;
let ogImage: string | null = null;
let dctermsKeywords: string | null = null;
let modifiedTime: string | null = null;
let publishedTime: string | null = null;
let articleTag: string | null = null;
let articleSection: string | null = null;
try {
title = soup("title").text() || null;
description = soup('meta[name="description"]').attr("content") || null;
// Assuming the language is part of the URL as per the regex pattern
const pattern = /([a-zA-Z]+-[A-Z]{2})/;
const match = pattern.exec(url);
language = match ? match[1] : null;
keywords = soup('meta[name="keywords"]').attr("content") || null;
robots = soup('meta[name="robots"]').attr("content") || null;
ogTitle = soup('meta[property="og:title"]').attr("content") || null;
ogDescription = soup('meta[property="og:description"]').attr("content") || null;
articleSection = soup('meta[name="article:section"]').attr("content") || null;
articleTag = soup('meta[name="article:tag"]').attr("content") || null;
publishedTime = soup('meta[property="article:published_time"]').attr("content") || null;
modifiedTime = soup('meta[property="article:modified_time"]').attr("content") || null;
ogImage = soup('meta[property="og:image"]').attr("content") || null;
dctermsKeywords = soup('meta[name="dcterms.keywords"]').attr("content") || null;
dcDescription = soup('meta[name="dc.description"]').attr("content") || null;
dcSubject = soup('meta[name="dc.subject"]').attr("content") || null;
dctermsSubject = soup('meta[name="dcterms.subject"]').attr("content") || null;
dctermsAudience = soup('meta[name="dcterms.audience"]').attr("content") || null;
dcType = soup('meta[name="dc.type"]').attr("content") || null;
dctermsType = soup('meta[name="dcterms.type"]').attr("content") || null;
dcDate = soup('meta[name="dc.date"]').attr("content") || null;
dcDateCreated = soup('meta[name="dc.date.created"]').attr("content") || null;
dctermsCreated = soup('meta[name="dcterms.created"]').attr("content") || null;
} catch (error) {
console.error("Error extracting metadata:", error);
}
return {
...(title ? { title } : {}),
...(description ? { description } : {}),
...(language ? { language } : {}),
...(keywords ? { keywords } : {}),
...(robots ? { robots } : {}),
...(ogTitle ? { ogTitle } : {}),
...(ogDescription ? { ogDescription } : {}),
...(dctermsCreated ? { dctermsCreated } : {}),
...(dcDateCreated ? { dcDateCreated } : {}),
...(dcDate ? { dcDate } : {}),
...(dctermsType ? { dctermsType } : {}),
...(dcType ? { dcType } : {}),
...(dctermsAudience ? { dctermsAudience } : {}),
...(dctermsSubject ? { dctermsSubject } : {}),
...(dcSubject ? { dcSubject } : {}),
...(dcDescription ? { dcDescription } : {}),
...(ogImage ? { ogImage } : {}),
...(dctermsKeywords ? { dctermsKeywords } : {}),
...(modifiedTime ? { modifiedTime } : {}),
...(publishedTime ? { publishedTime } : {}),
...(articleTag ? { articleTag } : {}),
...(articleSection ? { articleSection } : {}),
};
}

View File

@ -0,0 +1,23 @@
import axios from "axios";
export async function attemptScrapWithRequests(
urlToScrap: string
): Promise<string | null> {
try {
const response = await axios.get(urlToScrap);
if (!response.data) {
console.log("Failed normal requests as well");
return null;
}
return response.data;
} catch (error) {
console.error(`Error in attemptScrapWithRequests: ${error}`);
return null;
}
}
export function sanitizeText(text: string): string {
  return text.replace(/\u0000/g, "");
}

View File

@ -0,0 +1,219 @@
import { supabase_service } from "../supabase";
const FREE_CREDITS = 100;
export async function billTeam(team_id: string, credits: number) {
if (team_id === "preview") {
return { success: true, message: "Preview team, no credits used" };
}
console.log(`Billing team ${team_id} for ${credits} credits`);
// When the API is used, you can log the credit usage in the credit_usage table:
// team_id: The ID of the team using the API.
// subscription_id: The ID of the team's active subscription.
// credits_used: The number of credits consumed by the API call.
// created_at: The timestamp of the API usage.
// 1. get the subscription
const { data: subscription } = await supabase_service
.from("subscriptions")
.select("*")
.eq("team_id", team_id)
.eq("status", "active")
.single();
if (!subscription) {
const { data: credit_usage } = await supabase_service
.from("credit_usage")
.insert([
{
team_id,
credits_used: credits,
created_at: new Date(),
},
])
.select();
return { success: true, credit_usage };
}
// 2. add the credits to the credits_usage
const { data: credit_usage } = await supabase_service
.from("credit_usage")
.insert([
{
team_id,
subscription_id: subscription.id,
credits_used: credits,
created_at: new Date(),
},
])
.select();
return { success: true, credit_usage };
}
// if team has enough credits for the operation, return true, else return false
export async function checkTeamCredits(team_id: string, credits: number) {
if (team_id === "preview") {
return { success: true, message: "Preview team, no credits used" };
}
// 1. Retrieve the team's active subscription based on the team_id.
const { data: subscription, error: subscriptionError } =
await supabase_service
.from("subscriptions")
.select("id, price_id, current_period_start, current_period_end")
.eq("team_id", team_id)
.eq("status", "active")
.single();
if (subscriptionError || !subscription) {
const { data: creditUsages, error: creditUsageError } =
await supabase_service
.from("credit_usage")
.select("credits_used")
.is("subscription_id", null)
.eq("team_id", team_id);
// .gte("created_at", subscription.current_period_start)
// .lte("created_at", subscription.current_period_end);
if (creditUsageError) {
throw new Error(
        `Failed to retrieve credit usage for team_id: ${team_id}`
);
}
const totalCreditsUsed = creditUsages.reduce(
(acc, usage) => acc + usage.credits_used,
0
);
console.log("totalCreditsUsed", totalCreditsUsed);
// 5. Compare the total credits used with the credits allowed by the plan.
if (totalCreditsUsed + credits > FREE_CREDITS) {
return {
success: false,
message: "Insufficient credits, please upgrade!",
};
}
return { success: true, message: "Sufficient credits available" };
}
// 2. Get the price_id from the subscription.
const { data: price, error: priceError } = await supabase_service
.from("prices")
.select("credits")
.eq("id", subscription.price_id)
.single();
if (priceError) {
throw new Error(
`Failed to retrieve price for price_id: ${subscription.price_id}`
);
}
// 4. Calculate the total credits used by the team within the current billing period.
const { data: creditUsages, error: creditUsageError } = await supabase_service
.from("credit_usage")
.select("credits_used")
.eq("subscription_id", subscription.id)
.gte("created_at", subscription.current_period_start)
.lte("created_at", subscription.current_period_end);
if (creditUsageError) {
throw new Error(
`Failed to retrieve credit usage for subscription_id: ${subscription.id}`
);
}
const totalCreditsUsed = creditUsages.reduce(
(acc, usage) => acc + usage.credits_used,
0
);
// 5. Compare the total credits used with the credits allowed by the plan.
if (totalCreditsUsed + credits > price.credits) {
return { success: false, message: "Insufficient credits, please upgrade!" };
}
return { success: true, message: "Sufficient credits available" };
}
// Count the total credits used by a team within the current billing period and return the remaining credits.
export async function countCreditsAndRemainingForCurrentBillingPeriod(
team_id: string
) {
// 1. Retrieve the team's active subscription based on the team_id.
const { data: subscription, error: subscriptionError } =
await supabase_service
.from("subscriptions")
.select("id, price_id, current_period_start, current_period_end")
.eq("team_id", team_id)
.single();
if (subscriptionError || !subscription) {
// throw new Error(`Failed to retrieve subscription for team_id: ${team_id}`);
// Free
const { data: creditUsages, error: creditUsageError } =
await supabase_service
.from("credit_usage")
.select("credits_used")
.is("subscription_id", null)
.eq("team_id", team_id);
// .gte("created_at", subscription.current_period_start)
// .lte("created_at", subscription.current_period_end);
if (creditUsageError || !creditUsages) {
throw new Error(
        `Failed to retrieve credit usage for team_id: ${team_id}`
);
}
const totalCreditsUsed = creditUsages.reduce(
(acc, usage) => acc + usage.credits_used,
0
);
// 4. Calculate remaining credits.
const remainingCredits = FREE_CREDITS - totalCreditsUsed;
return { totalCreditsUsed, remainingCredits, totalCredits: FREE_CREDITS };
}
// 2. Get the price_id from the subscription to retrieve the total credits available.
const { data: price, error: priceError } = await supabase_service
.from("prices")
.select("credits")
.eq("id", subscription.price_id)
.single();
if (priceError || !price) {
throw new Error(
`Failed to retrieve price for price_id: ${subscription.price_id}`
);
}
// 3. Calculate the total credits used by the team within the current billing period.
const { data: creditUsages, error: creditUsageError } = await supabase_service
.from("credit_usage")
.select("credits_used")
.eq("subscription_id", subscription.id)
.gte("created_at", subscription.current_period_start)
.lte("created_at", subscription.current_period_end);
if (creditUsageError || !creditUsages) {
throw new Error(
`Failed to retrieve credit usage for subscription_id: ${subscription.id}`
);
}
const totalCreditsUsed = creditUsages.reduce(
(acc, usage) => acc + usage.credits_used,
0
);
// 4. Calculate remaining credits.
const remainingCredits = price.credits - totalCreditsUsed;
return { totalCreditsUsed, remainingCredits, totalCredits: price.credits };
}

View File

@ -0,0 +1,4 @@
const { Logtail } = require("@logtail/node");
//dot env
require("dotenv").config();
export const logtail = new Logtail(process.env.LOGTAIL_KEY);

View File

@ -0,0 +1,17 @@
import { Job, Queue } from "bull";
import {
getWebScraperQueue,
} from "./queue-service";
import { v4 as uuidv4 } from "uuid";
import { WebScraperOptions } from "../types";
export async function addWebScraperJob(
webScraperOptions: WebScraperOptions,
options: any = {}
): Promise<Job> {
return await getWebScraperQueue().add(webScraperOptions, {
...options,
jobId: uuidv4(),
});
}

View File

@ -0,0 +1,16 @@
import Queue from "bull";
let webScraperQueue;
export function getWebScraperQueue() {
if (!webScraperQueue) {
webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, {
settings: {
lockDuration: 4 * 60 * 60 * 1000, // 4 hours in milliseconds,
lockRenewTime: 30 * 60 * 1000, // 30 minutes in milliseconds
},
});
console.log("Web scraper queue created");
}
return webScraperQueue;
}

View File

@ -0,0 +1,62 @@
import { CustomError } from "../lib/custom-error";
import { getWebScraperQueue } from "./queue-service";
import "dotenv/config";
import { logtail } from "./logtail";
import { startWebScraperPipeline } from "../main/runWebScraper";
import { WebScraperDataProvider } from "../scraper/WebScraper";
import { callWebhook } from "./webhook";
getWebScraperQueue().process(
Math.floor(Number(process.env.NUM_WORKERS_PER_QUEUE ?? 8)),
async function (job, done) {
try {
job.progress({
current: 1,
total: 100,
current_step: "SCRAPING",
current_url: "",
});
const { success, message, docs } = await startWebScraperPipeline({ job });
const data = {
success: success,
result: {
links: docs.map((doc) => {
return { content: doc, source: doc.metadata.sourceURL };
}),
},
project_id: job.data.project_id,
error: message /* etc... */,
};
await callWebhook(job.data.team_id, data);
done(null, data);
} catch (error) {
if (error instanceof CustomError) {
// Here we handle the error, then save the failed job
console.error(error.message); // or any other error handling
logtail.error("Custom error while ingesting", {
job_id: job.id,
error: error.message,
dataIngestionJob: error.dataIngestionJob,
});
}
console.log(error);
logtail.error("Overall error ingesting", {
job_id: job.id,
error: error.message,
});
const data = {
success: false,
project_id: job.data.project_id,
error:
"Something went wrong... Contact help@mendable.ai or try again." /* etc... */,
};
await callWebhook(job.data.team_id, data);
done(null, data);
}
}
);

View File

@ -0,0 +1,65 @@
import { RateLimiterRedis } from "rate-limiter-flexible";
import * as redis from "redis";
const MAX_REQUESTS_PER_MINUTE_PREVIEW = 5;
const MAX_CRAWLS_PER_MINUTE_STARTER = 2;
const MAX_CRAWLS_PER_MINUTE_STANDARD = 4;
const MAX_CRAWLS_PER_MINUTE_SCALE = 20;
const MAX_REQUESTS_PER_MINUTE_ACCOUNT = 40;
export const redisClient = redis.createClient({
url: process.env.REDIS_URL,
legacyMode: true,
});
export const previewRateLimiter = new RateLimiterRedis({
storeClient: redisClient,
keyPrefix: "middleware",
points: MAX_REQUESTS_PER_MINUTE_PREVIEW,
duration: 60, // Duration in seconds
});
export const serverRateLimiter = new RateLimiterRedis({
storeClient: redisClient,
keyPrefix: "middleware",
points: MAX_REQUESTS_PER_MINUTE_ACCOUNT,
duration: 60, // Duration in seconds
});
export function crawlRateLimit(plan: string){
if(plan === "standard"){
return new RateLimiterRedis({
storeClient: redisClient,
keyPrefix: "middleware",
      points: MAX_CRAWLS_PER_MINUTE_STANDARD,
duration: 60, // Duration in seconds
});
}else if(plan === "scale"){
return new RateLimiterRedis({
storeClient: redisClient,
keyPrefix: "middleware",
points: MAX_CRAWLS_PER_MINUTE_SCALE,
duration: 60, // Duration in seconds
});
}
return new RateLimiterRedis({
storeClient: redisClient,
keyPrefix: "middleware",
points: MAX_CRAWLS_PER_MINUTE_STARTER,
duration: 60, // Duration in seconds
});
}
export function getRateLimiter(preview: boolean){
if(preview){
return previewRateLimiter;
}else{
return serverRateLimiter;
}
}

View File

@ -0,0 +1,38 @@
import Redis from 'ioredis';
// Initialize Redis client
const redis = new Redis(process.env.REDIS_URL);
/**
* Set a value in Redis with an optional expiration time.
* @param {string} key The key under which to store the value.
* @param {string} value The value to store.
* @param {number} [expire] Optional expiration time in seconds.
*/
const setValue = async (key: string, value: string, expire?: number) => {
if (expire) {
await redis.set(key, value, 'EX', expire);
} else {
await redis.set(key, value);
}
};
/**
* Get a value from Redis.
* @param {string} key The key of the value to retrieve.
* @returns {Promise<string|null>} The value, if found, otherwise null.
*/
const getValue = async (key: string): Promise<string | null> => {
const value = await redis.get(key);
return value;
};
/**
* Delete a key from Redis.
* @param {string} key The key to delete.
*/
const deleteKey = async (key: string) => {
await redis.del(key);
};
export { setValue, getValue, deleteKey };

View File

@ -0,0 +1,6 @@
import { createClient } from "@supabase/supabase-js";
export const supabase_service = createClient<any>(
process.env.SUPABASE_URL,
process.env.SUPABASE_SERVICE_TOKEN,
);

View File

@ -0,0 +1,41 @@
import { supabase_service } from "./supabase";
export const callWebhook = async (teamId: string, data: any) => {
const { data: webhooksData, error } = await supabase_service
.from('webhooks')
.select('url')
.eq('team_id', teamId)
.limit(1);
if (error) {
console.error(`Error fetching webhook URL for team ID: ${teamId}`, error.message);
return null;
}
if (!webhooksData || webhooksData.length === 0) {
return null;
}
let dataToSend = [];
  if (data.result && data.result.links && data.result.links.length !== 0) {
for (let i = 0; i < data.result.links.length; i++) {
dataToSend.push({
content: data.result.links[i].content.content,
markdown: data.result.links[i].content.markdown,
metadata: data.result.links[i].content.metadata,
});
}
}
await fetch(webhooksData[0].url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
success: data.success,
data: dataToSend,
error: data.error || undefined,
}),
});
}

2
apps/api/src/strings.ts Normal file
View File

@ -0,0 +1,2 @@
export const errorNoResults =
"No results found, please check the URL or contact us at help@mendable.ai to file a ticket.";

File diff suppressed because it is too large

26
apps/api/src/types.ts Normal file
View File

@ -0,0 +1,26 @@
export interface CrawlResult {
source: string;
content: string;
options?: {
summarize?: boolean;
summarize_max_chars?: number;
};
metadata?: any;
raw_context_id?: number | string;
permissions?: any[];
}
export interface IngestResult {
success: boolean;
error: string;
data: CrawlResult[];
}
export interface WebScraperOptions {
url: string;
mode: "crawl" | "single_urls" | "sitemap";
crawlerOptions: any;
team_id: string;
}

17
apps/api/tsconfig.json Normal file
View File

@ -0,0 +1,17 @@
{
"compilerOptions": {
"rootDir": "./src",
"lib": ["es6","DOM"],
"target": "ES2020", // or higher
"module": "commonjs",
"esModuleInterop": true,
"sourceMap": true,
"outDir": "./dist/src",
"moduleResolution": "node",
"baseUrl": ".",
"paths": {
"*": ["node_modules/*", "src/types/*"],
}
},
"include": ["src/","src/**/*", "services/db/supabase.ts", "utils/utils.ts", "services/db/supabaseEmbeddings.ts", "utils/EventEmmitter.ts", "src/services/queue-service.ts"]
}

BIN
apps/playwright-service/.DS_Store vendored Normal file

Binary file not shown.

152
apps/playwright-service/.gitignore vendored Normal file
View File

@ -0,0 +1,152 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

View File

@ -0,0 +1,38 @@
FROM python:3.11-slim
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_DISABLE_PIP_VERSION_CHECK=1
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
libstdc++6
WORKDIR /app
# Install Python dependencies
COPY requirements.txt ./
# Remove py which is pulled in by retry, py is not needed and is a CVE
RUN pip install --no-cache-dir --upgrade -r requirements.txt && \
pip uninstall -y py && \
playwright install chromium && playwright install-deps chromium && \
ln -s /usr/local/bin/supervisord /usr/bin/supervisord
# Cleanup for CVEs and size reduction
# https://github.com/tornadoweb/tornado/issues/3107
# xserver-common and xvfb included by playwright installation but not needed after
# perl-base is part of the base Python Debian image but not needed by this service
# perl-base could only be removed with --allow-remove-essential
COPY . ./
EXPOSE $PORT
# run fast api hypercorn
CMD hypercorn main:app --bind [::]:$PORT
# CMD ["hypercorn", "main:app", "--bind", "[::]:$PORT"]
# CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port $PORT"]

View File

View File

@ -0,0 +1,28 @@
from fastapi import FastAPI, Response
from playwright.async_api import async_playwright
import os
from fastapi.responses import JSONResponse
from pydantic import BaseModel
app = FastAPI()
class UrlModel(BaseModel):
url: str
@app.post("/html") # Kept as POST to accept body parameters
async def root(body: UrlModel): # Using Pydantic model for request body
async with async_playwright() as p:
browser = await p.chromium.launch()
context = await browser.new_context()
page = await context.new_page()
await page.goto(body.url) # Adjusted to use the url from the request body model
page_content = await page.content() # Get the HTML content of the page
await browser.close()
json_compatible_item_data = {"content": page_content}
return JSONResponse(content=json_compatible_item_data)

View File

View File

@ -0,0 +1,4 @@
hypercorn==0.16.0
fastapi==0.110.0
playwright==1.42.0
uvicorn

View File

@ -0,0 +1 @@
3.11

91
apps/python-sdk/README.md Normal file
View File

@ -0,0 +1,91 @@
# Firecrawl Python SDK
The Firecrawl Python SDK is a library that allows you to easily scrape and crawl websites, and output the data in a format ready for use with language models (LLMs). It provides a simple and intuitive interface for interacting with the Firecrawl API.
## Installation
To install the Firecrawl Python SDK, you can use pip:
```bash
pip install firecrawl-py
```
## Usage
1. Get an API key from [firecrawl.dev](https://firecrawl.dev)
2. Set the API key as an environment variable named `FIRECRAWL_API_KEY` or pass it as a parameter to the `FirecrawlApp` class.
Here's an example of how to use the SDK:
```python
from firecrawl import FirecrawlApp
# Initialize the FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')
# Scrape a single URL
url = 'https://mendable.ai'
scraped_data = app.scrape_url(url)
# Crawl a website
crawl_url = 'https://mendable.ai'
crawl_params = {
'crawlerOptions': {
'excludes': ['blog/*'],
'includes': [], # leave empty for all pages
'limit': 1000,
}
}
crawl_result = app.crawl_url(crawl_url, params=crawl_params)
```
### Scraping a URL
To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.
```python
url = 'https://example.com'
scraped_data = app.scrape_url(url)
```
### Crawling a Website
To crawl a website, use the `crawl_url` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.
The `wait_until_done` parameter determines whether the method should wait for the crawl job to complete before returning the result. If set to `True`, the method will periodically check the status of the crawl job until it is completed or the specified `timeout` (in seconds) is reached. If set to `False`, the method will return immediately with the job ID, and you can manually check the status of the crawl job using the `check_crawl_status` method.
```python
crawl_url = 'https://example.com'
crawl_params = {
'crawlerOptions': {
'excludes': ['blog/*'],
'includes': [], # leave empty for all pages
'limit': 1000,
}
}
crawl_result = app.crawl_url(crawl_url, params=crawl_params, wait_until_done=True, timeout=5)
```
If `wait_until_done` is set to `True`, the `crawl_url` method will return the crawl result once the job is completed. If the job fails or is stopped, an exception will be raised.
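If `wait_until_done` is set to `False`, the call returns immediately with the job ID. A minimal sketch of that path, reusing the `crawl_params` from above:
```python
# Kick off the crawl without blocking; only the job ID is returned
job = app.crawl_url(crawl_url, params=crawl_params, wait_until_done=False)
job_id = job['jobId']
# Poll the job later with check_crawl_status, as shown in the next section
```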
### Checking Crawl Status
To check the status of a crawl job, use the `check_crawl_status` method. It takes the job ID as a parameter and returns the current status of the crawl job.
```python
job_id = crawl_result['jobId']
status = app.check_crawl_status(job_id)
```
## Error Handling
The SDK handles errors returned by the Firecrawl API and raises appropriate exceptions. If an error occurs during a request, an exception will be raised with a descriptive error message.
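For example, a failed request can be handled like any other Python exception. A minimal sketch (the exact message depends on the API response; the URL here is just a placeholder):
```python
try:
    data = app.scrape_url('https://example.com/a-page-that-does-not-exist')
except Exception as e:
    # The SDK raises a plain Exception whose message includes the status code and error text
    print(f'Scrape failed: {e}')
```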
## Contributing
Contributions to the Firecrawl Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
## License
The Firecrawl Python SDK is open-source and released under the [MIT License](https://opensource.org/licenses/MIT).

View File

@ -0,0 +1 @@
from .firecrawl import FirecrawlApp

View File

@ -0,0 +1,96 @@
import os
import requests
class FirecrawlApp:
def __init__(self, api_key=None):
self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
if self.api_key is None:
raise ValueError('No API key provided')
def scrape_url(self, url, params=None):
headers = {
'Content-Type': 'application/json',
'Authorization': f'Bearer {self.api_key}'
}
json_data = {'url': url}
if params:
json_data.update(params)
response = requests.post(
'https://api.firecrawl.dev/v0/scrape',
headers=headers,
json=json_data
)
if response.status_code == 200:
response = response.json()
if response['success'] == True:
return response['data']
else:
raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
elif response.status_code in [402, 409, 500]:
error_message = response.json().get('error', 'Unknown error occurred')
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
else:
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
headers = self._prepare_headers()
json_data = {'url': url}
if params:
json_data.update(params)
response = self._post_request('https://api.firecrawl.dev/v0/crawl', json_data, headers)
if response.status_code == 200:
job_id = response.json().get('jobId')
if wait_until_done:
return self._monitor_job_status(job_id, headers, timeout)
else:
return {'jobId': job_id}
else:
self._handle_error(response, 'start crawl job')
def check_crawl_status(self, job_id):
headers = self._prepare_headers()
response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
if response.status_code == 200:
return response.json()
else:
self._handle_error(response, 'check crawl status')
def _prepare_headers(self):
return {
'Content-Type': 'application/json',
'Authorization': f'Bearer {self.api_key}'
}
def _post_request(self, url, data, headers):
return requests.post(url, headers=headers, json=data)
def _get_request(self, url, headers):
return requests.get(url, headers=headers)
def _monitor_job_status(self, job_id, headers, timeout):
import time
while True:
status_response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
if status_response.status_code == 200:
status_data = status_response.json()
if status_data['status'] == 'completed':
if 'data' in status_data:
return status_data['data']
else:
raise Exception('Crawl job completed but no data was returned')
elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
if timeout < 2:
timeout = 2
time.sleep(timeout) # Wait for the specified timeout before checking again
else:
raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
else:
self._handle_error(status_response, 'check crawl status')
def _handle_error(self, response, action):
if response.status_code in [402, 409, 500]:
error_message = response.json().get('error', 'Unknown error occurred')
raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
else:
raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')

Binary file not shown.

Binary file not shown.

View File

@ -0,0 +1,13 @@
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="a6a2d63a-ed2b-46a9-946d-2a7207efed4d")
# Crawl and wait for the results (the default); returns a list of scraped pages
crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})
print(crawl_result[0]['markdown'])
# Start a second crawl without waiting so a job ID is returned, then poll its status
job = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}}, wait_until_done=False)
job_id = job['jobId']
print(job_id)
status = app.check_crawl_status(job_id)
print(status)

View File

@ -0,0 +1 @@
from .firecrawl import FirecrawlApp

View File

@ -0,0 +1,96 @@
import os
import requests
class FirecrawlApp:
def __init__(self, api_key=None):
self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
if self.api_key is None:
raise ValueError('No API key provided')
def scrape_url(self, url, params=None):
headers = {
'Content-Type': 'application/json',
'Authorization': f'Bearer {self.api_key}'
}
json_data = {'url': url}
if params:
json_data.update(params)
response = requests.post(
'https://api.firecrawl.dev/v0/scrape',
headers=headers,
json=json_data
)
if response.status_code == 200:
response = response.json()
if response['success'] == True:
return response['data']
else:
raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
elif response.status_code in [402, 409, 500]:
error_message = response.json().get('error', 'Unknown error occurred')
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
else:
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
headers = self._prepare_headers()
json_data = {'url': url}
if params:
json_data.update(params)
response = self._post_request('https://api.firecrawl.dev/v0/crawl', json_data, headers)
if response.status_code == 200:
job_id = response.json().get('jobId')
if wait_until_done:
return self._monitor_job_status(job_id, headers, timeout)
else:
return {'jobId': job_id}
else:
self._handle_error(response, 'start crawl job')
def check_crawl_status(self, job_id):
headers = self._prepare_headers()
response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
if response.status_code == 200:
return response.json()
else:
self._handle_error(response, 'check crawl status')
def _prepare_headers(self):
return {
'Content-Type': 'application/json',
'Authorization': f'Bearer {self.api_key}'
}
def _post_request(self, url, data, headers):
return requests.post(url, headers=headers, json=data)
def _get_request(self, url, headers):
return requests.get(url, headers=headers)
def _monitor_job_status(self, job_id, headers, timeout):
import time
while True:
status_response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
if status_response.status_code == 200:
status_data = status_response.json()
if status_data['status'] == 'completed':
if 'data' in status_data:
return status_data['data']
else:
raise Exception('Crawl job completed but no data was returned')
elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
if timeout < 2:
timeout = 2
time.sleep(timeout) # Wait for the specified timeout before checking again
else:
raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
else:
self._handle_error(status_response, 'check crawl status')
def _handle_error(self, response, action):
if response.status_code in [402, 409, 500]:
error_message = response.json().get('error', 'Unknown error occurred')
raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
else:
raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')

View File

@ -0,0 +1,7 @@
Metadata-Version: 2.1
Name: firecrawl-py
Version: 0.0.5
Summary: Python SDK for Firecrawl API
Home-page: https://github.com/mendableai/firecrawl-py
Author: Mendable.ai
Author-email: nick@mendable.ai

View File

@ -0,0 +1,9 @@
README.md
setup.py
firecrawl/__init__.py
firecrawl/firecrawl.py
firecrawl_py.egg-info/PKG-INFO
firecrawl_py.egg-info/SOURCES.txt
firecrawl_py.egg-info/dependency_links.txt
firecrawl_py.egg-info/requires.txt
firecrawl_py.egg-info/top_level.txt

View File

@ -0,0 +1 @@

View File

@ -0,0 +1 @@
requests

View File

@ -0,0 +1 @@
firecrawl

14
apps/python-sdk/setup.py Normal file
View File

@ -0,0 +1,14 @@
from setuptools import setup, find_packages
setup(
name='firecrawl-py',
version='0.0.5',
url='https://github.com/mendableai/firecrawl-py',
author='Mendable.ai',
author_email='nick@mendable.ai',
description='Python SDK for Firecrawl API',
packages=find_packages(),
install_requires=[
'requests',
],
)

1
apps/www/README.md Normal file
View File

@ -0,0 +1 @@
Coming soon!