Initial commit

2024-04-15 17:01:47 -04:00 · 2024-04-15 17:01:47 -04:00 · a6c2a87811
commit a6c2a87811
74 changed files with 10873 additions and 0 deletions
--- a/.DS_Store
+++ b/.DS_Store
--- a/.gitattributes
+++ b/.gitattributes
@ -0,0 +1,2 @@
+# Auto detect text files and perform LF normalization
+* text=auto
--- a/.github/workflows/fly.yml
+++ b/.github/workflows/fly.yml
@ -0,0 +1,20 @@
+name: Fly Deploy
+on:
+  push:
+    branches:
+      - main
+  # schedule:
+  #   - cron: '0 */4 * * *'
+
+jobs:
+  deploy:
+    name: Deploy app
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - uses: superfly/flyctl-actions/setup-flyctl@master
+      - name: Deploy from apps/api
+        run: flyctl deploy --remote-only
+        working-directory: apps/api
+        env:
+          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,6 @@
+/node_modules/
+/dist/
+.env
+*.csv
+dump.rdb
+/mongo-data
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -0,0 +1,4 @@
+# Contributing
+
+We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.
+
--- a/201
+++ b/201
@ -0,0 +1,201 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright 2024 Firecrawl | Mendable.ai
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
--- a/README.md
+++ b/README.md
@ -0,0 +1,108 @@
+# 🔥 Firecrawl
+
+Crawl and convert any website into clean markdown
+
+
+*This repo is still in early development.*
+
+## What is Firecrawl?
+
+[Firecrawl](https://firecrawl.dev?ref=github) is an API service that takes a URL, crawls it, and converts it into clean markdown. We crawl all accessible subpages and give you clean markdown for each. No sitemap required.
+
+## How to use it?
+
+We provide an easy-to-use API with our hosted version. You can find the playground and documentation [here](https://firecrawl.com/playground). You can also self-host the backend if you'd like.
+
+- [x] API
+- [x] Python SDK
+- [ ] JS SDK - Coming Soon
+
+To self-host, refer to the guide [here](https://github.com/mendableai/firecrawl/blob/main/SELF_HOST.md).
+
+### API Key
+
+To use the API, you need to sign up on [Firecrawl](https://firecrawl.com) and get an API key.
+
+### Crawling
+
+Used to crawl a URL and all accessible subpages. This submits a crawl job and returns a job ID to check the status of the crawl.
+
+```bash
+curl -X POST https://api.firecrawl.dev/v0/crawl \
+    -H 'Content-Type: application/json' \
+    -H 'Authorization: Bearer YOUR_API_KEY' \
+    -d '{
+      "url": "https://mendable.ai"
+    }'
+```
+
+Returns a `jobId`:
+
+```json
+{ "jobId": "1234-5678-9101" }
+```
+
+### Check Crawl Job
+
+Used to check the status of a crawl job and get its result.
+
+```bash
+curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
+  -H 'Content-Type: application/json' \
+  -H 'Authorization: Bearer YOUR_API_KEY'
+```
+
+```json
+{
+    "status": "completed",
+    "current": 22,
+    "total": 22,
+    "data": [
+        {
+            "content": "Raw Content ",
+            "markdown": "# Markdown Content",
+            "provider": "web-scraper",
+            "metadata": {
+                "title": "Mendable | AI for CX and Sales",
+                "description": "AI for CX and Sales",
+                "language": null,
+                "sourceURL": "https://www.mendable.ai/"
+            }
+        }
+    ]
+}
+```
+
+## Using Python SDK
+
+### Installing Python SDK
+
+```bash
+pip install firecrawl-py
+```
+
+### Crawl a website
+
+```python
+from firecrawl import FirecrawlApp
+
+app = FirecrawlApp(api_key="YOUR_API_KEY")
+
+crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})
+
+# Get the markdown
+for result in crawl_result:
+    print(result['markdown'])
+```
+
+### Scraping a URL
+
+To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.
+
+```python
+url = 'https://example.com'
+scraped_data = app.scrape_url(url)
+```
+
+## Contributing
+
+We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.
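
The README's "Check Crawl Job" endpoint is meant to be polled until the job finishes. The following is a minimal sketch of such a loop using only the Python standard library, not part of the SDK. The helper names `fetch_status` and `wait_for_crawl` are hypothetical, and treating `"failed"` as a terminal status is an assumption; only `"completed"` appears in the README.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

API_BASE = "https://api.firecrawl.dev/v0"  # base URL taken from the README examples

# Assumption: "completed" is documented in the README; "failed" is assumed
# to be the other state in which polling should stop.
TERMINAL_STATES = {"completed", "failed"}


def fetch_status(job_id: str, api_key: str) -> Dict[str, Any]:
    """GET /crawl/status/<job_id> and decode the JSON body (hypothetical helper)."""
    request = urllib.request.Request(
        f"{API_BASE}/crawl/status/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_crawl(
    get_status: Callable[[], Dict[str, Any]],
    interval_seconds: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call get_status until the job reaches a terminal state or polls run out."""
    for _ in range(max_polls):
        payload = get_status()
        if payload.get("status") in TERMINAL_STATES:
            return payload
        sleep(interval_seconds)
    raise TimeoutError(f"crawl did not finish after {max_polls} polls")
```

A call might look like `wait_for_crawl(lambda: fetch_status("1234-5678-9101", "YOUR_API_KEY"))`. Passing the status getter and `sleep` as arguments keeps the loop testable without network access.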
--- a/SELF_HOST.md
+++ b/SELF_HOST.md
@ -0,0 +1,6 @@
+# Self-hosting Firecrawl
+
+Guide coming soon.
+
+
+
--- a/apps/.DS_Store
+++ b/apps/.DS_Store
--- a/apps/api/.dockerignore
+++ b/apps/api/.dockerignore
@ -0,0 +1,4 @@
+/node_modules/
+/dist/
+.env
+*.csv
--- a/apps/api/.env.local
+++ b/apps/api/.env.local
@ -0,0 +1,8 @@
+PORT=
+HOST=
+SUPABASE_ANON_TOKEN=
+SUPABASE_URL=
+SUPABASE_SERVICE_TOKEN=
+REDIS_URL=
+OPENAI_API_KEY=
+SCRAPING_BEE_API_KEY=
--- a/apps/api/.gitattributes
+++ b/apps/api/.gitattributes
@ -0,0 +1,2 @@
+# Auto detect text files and perform LF normalization
+* text=auto
--- a/apps/api/.gitignore
+++ b/apps/api/.gitignore
@ -0,0 +1,6 @@
+/node_modules/
+/dist/
+.env
+*.csv
+dump.rdb
+/mongo-data
--- a/apps/api/Dockerfile
+++ b/apps/api/Dockerfile
@ -0,0 +1,36 @@
+FROM node:20-slim AS base
+ENV PNPM_HOME="/pnpm"
+ENV PATH="$PNPM_HOME:$PATH"
+LABEL fly_launch_runtime="Node.js"
+RUN corepack enable
+COPY . /app
+WORKDIR /app
+
+FROM base AS prod-deps
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm install --prod --frozen-lockfile
+
+FROM base AS build
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm install --frozen-lockfile
+
+RUN pnpm install
+RUN pnpm run build
+
+# Install packages needed for deployment
+
+
+FROM base
+RUN apt-get update -qq && \
+    apt-get install --no-install-recommends -y chromium chromium-sandbox && \
+    rm -rf /var/lib/apt/lists /var/cache/apt/archives
+COPY --from=prod-deps /app/node_modules /app/node_modules
+COPY --from=build /app /app
+
+
+
+
+# Start the server by default; this can be overridden at runtime
+EXPOSE 8080
+ENV PUPPETEER_EXECUTABLE_PATH="/usr/bin/chromium"
+CMD [ "pnpm", "run", "start:production" ]
+# Docker honors only the last CMD, so the worker is started via [processes] in fly.toml
+
--- a/apps/api/fly.toml
+++ b/apps/api/fly.toml
@ -0,0 +1,47 @@
+# fly.toml app configuration file generated for firecrawl-scraper-js on 2024-04-07T21:09:59-03:00
+#
+# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
+#
+
+app = 'firecrawl-scraper-js'
+primary_region = 'mia'
+kill_signal = 'SIGINT'
+kill_timeout = '5s'
+
+[build]
+
+[processes]
+  app = 'npm run start:production'
+  worker = 'npm run worker:production'
+
+[http_service]
+  internal_port = 8080
+  force_https = true
+  auto_stop_machines = true
+  auto_start_machines = true
+  min_machines_running = 0
+  processes = ['app']
+
+[[services]]
+  protocol = 'tcp'
+  internal_port = 8080
+  processes = ['app']
+
+[[services.ports]]
+    port = 80
+    handlers = ['http']
+    force_https = true
+
+[[services.ports]]
+    port = 443
+    handlers = ['tls', 'http']
+
+  [services.concurrency]
+    type = 'connections'
+    hard_limit = 45
+    soft_limit = 20
+
+[[vm]]
+  size = 'performance-1x'
+
+  
--- a/apps/api/jest.config.js
+++ b/apps/api/jest.config.js
@ -0,0 +1,5 @@
+module.exports = {
+  preset: "ts-jest",
+  testEnvironment: "node",
+  setupFiles: ["./jest.setup.js"],
+};
--- a/apps/api/jest.setup.js
+++ b/apps/api/jest.setup.js
@ -0,0 +1 @@
+global.fetch = require('jest-fetch-mock');
--- a/apps/api/package.json
+++ b/apps/api/package.json
@ -0,0 +1,98 @@
+{
+  "name": "firecrawl-scraper-js",
+  "version": "1.0.0",
+  "description": "",
+  "main": "index.js",
+  "scripts": {
+    "start": "nodemon --exec ts-node src/index.ts",
+    "start:production": "tsc && node dist/src/index.js",
+    "format": "prettier --write \"src/**/*.(js|ts)\"",
+    "flyio": "node dist/src/index.js",
+    "start:dev": "nodemon --exec ts-node src/index.ts",
+    "build": "tsc",
+    "test": "jest --verbose",
+    "workers": "nodemon --exec ts-node src/services/queue-worker.ts",
+    "worker:production": "node dist/src/services/queue-worker.js",
+    "mongo-docker": "docker run -d -p 2717:27017 -v ./mongo-data:/data/db --name mongodb mongo:latest",
+    "mongo-docker-console": "docker exec -it mongodb mongosh",
+    "run-example": "npx ts-node src/example.ts"
+  },
+  "author": "",
+  "license": "ISC",
+  "devDependencies": {
+    "@flydotio/dockerfile": "^0.4.10",
+    "@tsconfig/recommended": "^1.0.3",
+    "@types/body-parser": "^1.19.2",
+    "@types/bull": "^4.10.0",
+    "@types/cors": "^2.8.13",
+    "@types/express": "^4.17.17",
+    "@types/jest": "^29.5.6",
+    "body-parser": "^1.20.1",
+    "express": "^4.18.2",
+    "jest": "^29.6.3",
+    "jest-fetch-mock": "^3.0.3",
+    "nodemon": "^2.0.20",
+    "supabase": "^1.77.9",
+    "supertest": "^6.3.3",
+    "ts-jest": "^29.1.1",
+    "ts-node": "^10.9.1",
+    "typescript": "^5.4.2"
+  },
+  "dependencies": {
+    "@brillout/import": "^0.2.2",
+    "@bull-board/api": "^5.14.2",
+    "@bull-board/express": "^5.8.0",
+    "@devil7softwares/pos": "^1.0.2",
+    "@dqbd/tiktoken": "^1.0.7",
+    "@logtail/node": "^0.4.12",
+    "@nangohq/node": "^0.36.33",
+    "@sentry/node": "^7.48.0",
+    "@supabase/supabase-js": "^2.7.1",
+    "async": "^3.2.5",
+    "async-mutex": "^0.4.0",
+    "axios": "^1.3.4",
+    "bottleneck": "^2.19.5",
+    "bull": "^4.11.4",
+    "cheerio": "^1.0.0-rc.12",
+    "cohere": "^1.1.1",
+    "cors": "^2.8.5",
+    "cron-parser": "^4.9.0",
+    "date-fns": "^2.29.3",
+    "dotenv": "^16.3.1",
+    "express-rate-limit": "^6.7.0",
+    "glob": "^10.3.12",
+    "gpt3-tokenizer": "^1.1.5",
+    "ioredis": "^5.3.2",
+    "keyword-extractor": "^0.0.25",
+    "langchain": "^0.1.25",
+    "languagedetect": "^2.0.0",
+    "logsnag": "^0.1.6",
+    "luxon": "^3.4.3",
+    "md5": "^2.3.0",
+    "moment": "^2.29.4",
+    "mongoose": "^8.0.3",
+    "natural": "^6.3.0",
+    "openai": "^4.28.4",
+    "pos": "^0.4.2",
+    "promptable": "^0.0.9",
+    "puppeteer": "^22.6.3",
+    "rate-limiter-flexible": "^2.4.2",
+    "redis": "^4.6.7",
+    "robots-parser": "^3.0.1",
+    "scrapingbee": "^1.7.4",
+    "stripe": "^12.2.0",
+    "turndown": "^7.1.3",
+    "typesense": "^1.5.4",
+    "unstructured-client": "^0.9.4",
+    "uuid": "^9.0.1",
+    "wordpos": "^2.1.0",
+    "xml2js": "^0.6.2"
+  },
+  "nodemonConfig": {
+    "ignore": [
+      "*.docx",
+      "*.json",
+      "temp"
+    ]
+  }
+}
--- a/apps/api/pnpm-lock.yaml
+++ b/apps/api/pnpm-lock.yaml
--- a/apps/api/requests.http
+++ b/apps/api/requests.http
@ -0,0 +1,53 @@
+
+
+### Crawl Website
+POST http://localhost:3002/v0/crawl HTTP/1.1
+Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
+content-type: application/json
+
+{
+    "url":"https://docs.mendable.ai"
+}
+
+
+### Check Job Status
+GET http://localhost:3002/v0/jobs/active HTTP/1.1
+
+
+### Scrape Website
+POST https://api.firecrawl.dev/v0/scrape HTTP/1.1
+Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
+content-type: application/json
+
+{
+    "url":"https://www.agentops.ai"
+}
+
+
+### Scrape Website
+POST http://localhost:3002/v0/scrape HTTP/1.1
+Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
+content-type: application/json
+
+{
+    "url":"https://www.agentops.ai"
+}
+
+
+
+### Check Job Status
+GET http://localhost:3002/v0/crawl/status/333ab225-dc3e-418b-9d4b-8fb833cbaf89 HTTP/1.1
+Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
+
+### Crawl Website
+
+POST https://api.firecrawl.dev/v0/crawl HTTP/1.1
+Authorization: Bearer 30c90634-8377-4446-9ef9-a280b9be1702
+content-type: application/json
+
+{
+    "url":"https://markprompt.com"
+}
+
+### Check Job Status
+GET https://api.firecrawl.dev/v0/crawl/status/cfcb71ac-23a3-4da5-bd85-d4e58b871d66
+Authorization: Bearer 30c90634-8377-4446-9ef9-a280b9be1702
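
The crawl requests in this file can also be built from a script. The sketch below assembles the same `POST /v0/crawl` request with Python's standard library, assuming the body fields the API handler reads (`url` and an optional `crawlerOptions`). The function name `build_crawl_request` and its parameters are hypothetical, and the `excludes` option is borrowed from the README's SDK example.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_crawl_request(
    base_url: str,
    api_key: str,
    url: str,
    crawler_options: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /v0/crawl request."""
    body: Dict[str, Any] = {"url": url}
    if crawler_options:
        body["crawlerOptions"] = crawler_options
    return urllib.request.Request(
        f"{base_url}/v0/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            # The server parses JSON bodies, so the content type must be declared.
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is a matter of passing the result to `urllib.request.urlopen`; the response is expected to carry the `jobId` used by the status endpoint.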
--- a/apps/api/src/.DS_Store
+++ b/apps/api/src/.DS_Store
--- a/apps/api/src/control.ts
+++ b/apps/api/src/control.ts
@ -0,0 +1,2 @@
+// If OpenAI goes down, set this to true to activate the fallback
+export const is_fallback = false;
--- a/apps/api/src/example.ts
+++ b/apps/api/src/example.ts
@ -0,0 +1,18 @@
+import { WebScraperDataProvider } from "./scraper/WebScraper";
+
+async function example() {
+  const example = new WebScraperDataProvider();
+
+  await example.setOptions({
+    mode: "crawl",
+    urls: ["https://mendable.ai"],
+    crawlerOptions: {},
+  });
+  const docs = await example.getDocuments(false);
+  docs.forEach((doc) => {
+    console.log(doc.metadata.sourceURL);
+  });
+  console.log(docs.length);
+}
+
+// example();
--- a/apps/api/src/index.ts
+++ b/apps/api/src/index.ts
@ -0,0 +1,352 @@
+import express from "express";
+import bodyParser from "body-parser";
+import cors from "cors";
+import "dotenv/config";
+import { getWebScraperQueue } from "./services/queue-service";
+import { addWebScraperJob } from "./services/queue-jobs";
+import { supabase_service } from "./services/supabase";
+import { WebScraperDataProvider } from "./scraper/WebScraper";
+import { billTeam, checkTeamCredits } from "./services/billing/credit_billing";
+import { getRateLimiter, redisClient } from "./services/rate-limiter";
+
+const { createBullBoard } = require("@bull-board/api");
+const { BullAdapter } = require("@bull-board/api/bullAdapter");
+const { ExpressAdapter } = require("@bull-board/express");
+
+export const app = express();
+
+global.isProduction = process.env.IS_PRODUCTION === "true";
+
+app.use(bodyParser.urlencoded({ extended: true }));
+app.use(bodyParser.json({ limit: "10mb" }));
+
+app.use(cors()); // Add this line to enable CORS
+
+const serverAdapter = new ExpressAdapter();
+serverAdapter.setBasePath(`/admin/${process.env.BULL_AUTH_KEY}/queues`);
+
+const { addQueue, removeQueue, setQueues, replaceQueues } = createBullBoard({
+  queues: [new BullAdapter(getWebScraperQueue())],
+  serverAdapter: serverAdapter,
+});
+
+app.use(
+  `/admin/${process.env.BULL_AUTH_KEY}/queues`,
+  serverAdapter.getRouter()
+);
+
+app.get("/", (req, res) => {
+  res.send("SCRAPERS-JS: Hello, world! Fly.io");
+});
+
+// Simple route for checking that the server responds
+app.get("/test", async (req, res) => {
+  res.send("Hello, world!");
+});
+
+async function authenticateUser(req, res, mode?: string): Promise<string> {
+  const authHeader = req.headers.authorization;
+  if (!authHeader) {
+    return res.status(401).json({ error: "Unauthorized" });
+  }
+  const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
+  if (!token) {
+    return res.status(401).json({ error: "Unauthorized: Token missing" });
+  }
+
+  try {
+    const incomingIP = (req.headers["x-forwarded-for"] ||
+      req.socket.remoteAddress) as string;
+    const iptoken = incomingIP + token;
+    await getRateLimiter(token === "this_is_just_a_preview_token").consume(
+      iptoken
+    );
+  } catch (rateLimiterRes) {
+    console.error(rateLimiterRes);
+    return res.status(429).json({
+      error: "Rate limit exceeded. Too many requests, try again in 1 minute.",
+    });
+  }
+
+  if (token === "this_is_just_a_preview_token" && mode === "scrape") {
+    return "preview";
+  }
+  // make sure api key is valid, based on the api_keys table in supabase
+  const { data, error } = await supabase_service
+    .from("api_keys")
+    .select("*")
+    .eq("key", token);
+  if (error || !data || data.length === 0) {
+    return res.status(401).json({ error: "Unauthorized: Invalid token" });
+  }
+
+  return data[0].team_id;
+}
+
+app.post("/v0/scrape", async (req, res) => {
+  try {
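+    // make sure to authenticate user first, Bearer <token>
+    const team_id = await authenticateUser(req, res, "scrape");
+    // authenticateUser has already sent a 401/429 response on failure
+    if (res.headersSent) return;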
+
+    try {
+      const { success: creditsCheckSuccess, message: creditsCheckMessage } =
+        await checkTeamCredits(team_id, 1);
+      if (!creditsCheckSuccess) {
+        return res.status(402).json({ error: "Insufficient credits" });
+      }
+    } catch (error) {
+      console.error(error);
+      return res.status(500).json({ error: "Internal server error" });
+    }
+
+    // validate the request body
+    const url = req.body.url;
+    if (!url) {
+      return res.status(400).json({ error: "Url is required" });
+    }
+
+    try {
+      const a = new WebScraperDataProvider();
+      await a.setOptions({
+        mode: "single_urls",
+        urls: [url],
+      });
+
+      const docs = await a.getDocuments(false);
+      // make sure doc.content is not empty
+      const filteredDocs = docs.filter(
+        (doc: { content?: string }) =>
+          doc.content && doc.content.trim().length > 0
+      );
+      if (filteredDocs.length === 0) {
+        return res.status(200).json({ success: true, data: [] });
+      }
+      const { success, credit_usage } = await billTeam(
+        team_id,
+        filteredDocs.length
+      );
+      if (!success) {
+        // throw new Error("Failed to bill team, no subscription was found");
+        // return {
+        //   success: false,
+        //   message: "Failed to bill team, no subscription was found",
+        //   docs: [],
+        // };
+        return res
+          .status(402)
+          .json({ error: "Failed to bill, no subscription was found" });
+      }
+      return res.json({
+        success: true,
+        data: filteredDocs[0],
+      });
+    } catch (error) {
+      console.error(error);
+      return res.status(500).json({ error: error.message });
+    }
+  } catch (error) {
+    console.error(error);
+    return res.status(500).json({ error: error.message });
+  }
+});
+
+app.post("/v0/crawl", async (req, res) => {
+  try {
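+    const team_id = await authenticateUser(req, res);
+    // authenticateUser has already sent a 401/429 response on failure
+    if (res.headersSent) return;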
+
+    const { success: creditsCheckSuccess, message: creditsCheckMessage } =
+      await checkTeamCredits(team_id, 1);
+    if (!creditsCheckSuccess) {
+      return res.status(402).json({ error: "Insufficient credits" });
+    }
+
+    // validate the request body
+    const url = req.body.url;
+    if (!url) {
+      return res.status(400).json({ error: "Url is required" });
+    }
+    const mode = req.body.mode ?? "crawl";
+    const crawlerOptions = req.body.crawlerOptions ?? {};
+
+    if (mode === "single_urls" && !url.includes(",")) {
+      try {
+        const a = new WebScraperDataProvider();
+        await a.setOptions({
+          mode: "single_urls",
+          urls: [url],
+          crawlerOptions: {
+            returnOnlyUrls: true,
+          },
+        });
+
+        // No queue job exists yet in this synchronous path, so there is no progress to report
+        const docs = await a.getDocuments(false);
+        return res.json({
+          success: true,
+          documents: docs,
+        });
+      } catch (error) {
+        console.error(error);
+        return res.status(500).json({ error: error.message });
+      }
+    }
+    const job = await addWebScraperJob({
+      url: url,
+      mode: mode ?? "crawl", // fix for single urls not working
+      crawlerOptions: { ...crawlerOptions },
+      team_id: team_id,
+    });
+
+    res.json({ jobId: job.id });
+  } catch (error) {
+    console.error(error);
+    return res.status(500).json({ error: error.message });
+  }
+});
+app.post("/v0/crawlWebsitePreview", async (req, res) => {
+  try {
+    // make sure to authenticate user first, Bearer <token>
+    const authHeader = req.headers.authorization;
+    if (!authHeader) {
+      return res.status(401).json({ error: "Unauthorized" });
+    }
+    const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
+    if (!token) {
+      return res.status(401).json({ error: "Unauthorized: Token missing" });
+    }
+
+    // validate the request body
+    const url = req.body.url;
+    if (!url) {
+      return res.status(400).json({ error: "Url is required" });
+    }
+    const mode = req.body.mode ?? "crawl";
+    const crawlerOptions = req.body.crawlerOptions ?? {};
+    const job = await addWebScraperJob({
+      url: url,
+      mode: mode ?? "crawl", // fix for single urls not working
+      crawlerOptions: { ...crawlerOptions, limit: 5, maxCrawledLinks: 5 },
+      team_id: "preview",
+    });
+
+    res.json({ jobId: job.id });
+  } catch (error) {
+    console.error(error);
+    return res.status(500).json({ error: error.message });
+  }
+});
+
+app.get("/v0/crawl/status/:jobId", async (req, res) => {
+  try {
+    const authHeader = req.headers.authorization;
+    if (!authHeader) {
+      return res.status(401).json({ error: "Unauthorized" });
+    }
+    const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
+    if (!token) {
+      return res.status(401).json({ error: "Unauthorized: Token missing" });
+    }
+
+    // make sure api key is valid, based on the api_keys table in supabase
+    const { data, error } = await supabase_service
+      .from("api_keys")
+      .select("*")
+      .eq("key", token);
+    if (error || !data || data.length === 0) {
+      return res.status(401).json({ error: "Unauthorized: Invalid token" });
+    }
+    const job = await getWebScraperQueue().getJob(req.params.jobId);
+    if (!job) {
+      return res.status(404).json({ error: "Job not found" });
+    }
+
+    const { current, current_url, total, current_step } = await job.progress();
+    res.json({
+      status: await job.getState(),
+      // progress: job.progress(),
+      current: current,
+      current_url: current_url,
+      current_step: current_step,
+      total: total,
+      data: job.returnvalue,
+    });
+  } catch (error) {
+    console.error(error);
+    return res.status(500).json({ error: error.message });
+  }
+});
+
+app.get("/v0/checkJobStatus/:jobId", async (req, res) => {
+  try {
+    const job = await getWebScraperQueue().getJob(req.params.jobId);
+    if (!job) {
+      return res.status(404).json({ error: "Job not found" });
+    }
+
+    const { current, current_url, total, current_step } = await job.progress();
+    res.json({
+      status: await job.getState(),
+      // progress: job.progress(),
+      current: current,
+      current_url: current_url,
+      current_step: current_step,
+      total: total,
+      data: job.returnvalue,
+    });
+  } catch (error) {
+    console.error(error);
+    return res.status(500).json({ error: error.message });
+  }
+});
+
+const DEFAULT_PORT = process.env.PORT ?? 3002;
+const HOST = process.env.HOST ?? "localhost";
+redisClient.connect();
+
+export function startServer(port = DEFAULT_PORT) {
+  const server = app.listen(Number(port), HOST, () => {
+    console.log(`Server listening on port ${port}`);
+    console.log(`For the UI, open http://${HOST}:${port}/admin/queues`);
+    console.log("");
+    console.log("1. Make sure Redis is running (default port: 6379)");
+    console.log(
+      "2. If you want to run Nango, forward port 3002 with: ngrok http 3002"
+    );
+  });
+  return server;
+}
+
+if (require.main === module) {
+  startServer();
+}
+
+// Use this as a health check so that we don't destroy the server
+app.get(`/admin/${process.env.BULL_AUTH_KEY}/queues`, async (req, res) => {
+  try {
+    const webScraperQueue = getWebScraperQueue();
+    const [webScraperActive] = await Promise.all([
+      webScraperQueue.getActiveCount(),
+    ]);
+
+    const noActiveJobs = webScraperActive === 0;
+    // 200 if no active jobs, 500 if there are active jobs
+    return res.status(noActiveJobs ? 200 : 500).json({
+      webScraperActive,
+      noActiveJobs,
+    });
+  } catch (error) {
+    console.error(error);
+    return res.status(500).json({ error: error.message });
+  }
+});
+
+app.get("/is-production", (req, res) => {
+  res.send({ isProduction: global.isProduction });
+});
+
--- a/apps/api/src/lib/batch-process.ts
+++ b/apps/api/src/lib/batch-process.ts
@ -0,0 +1,16 @@
+export async function batchProcess<T>(
+    array: T[],
+    batchSize: number,
+    asyncFunction: (item: T, index: number) => Promise<void>
+  ): Promise<void> {
+    // Run the batches one after another; items within a batch run concurrently.
+    for (let start = 0; start < array.length; start += batchSize) {
+      const batch = array.slice(start, start + batchSize);
+      await Promise.all(
+        // Pass each item's index in the original array, not its position
+        // within the batch.
+        batch.map((item, i) => asyncFunction(item, start + i))
+      );
+    }
+  }
+  
--- a/apps/api/src/lib/custom-error.ts
+++ b/apps/api/src/lib/custom-error.ts
@ -0,0 +1,21 @@
+export class CustomError extends Error {
+  statusCode: number;
+  status: string;
+  message: string;
+  dataIngestionJob: any;
+
+  constructor(
+    statusCode: number,
+    status: string,
+    message: string = "",
+    dataIngestionJob?: any,
+  ) {
+    super(message);
+    this.statusCode = statusCode;
+    this.status = status;
+    this.message = message;
+    this.dataIngestionJob = dataIngestionJob;
+
+    Object.setPrototypeOf(this, CustomError.prototype);
+  }
+}
--- a/apps/api/src/lib/entities.ts
+++ b/apps/api/src/lib/entities.ts
@ -0,0 +1,37 @@
+export interface Progress {
+  current: number;
+  total: number;
+  status: string;
+  metadata?: {
+    sourceURL?: string;
+    [key: string]: any;
+  };
+  currentDocumentUrl?: string;
+}
+
+export class Document {
+  id?: string;
+  content: string;
+  markdown?: string;
+  createdAt?: Date;
+  updatedAt?: Date;
+  type?: string;
+  metadata: {
+    sourceURL?: string;
+    [key: string]: any;
+  };
+  childrenLinks?: string[];
+
+  constructor(data: Partial<Document>) {
+    if (!data.content) {
+      throw new Error("Missing required fields");
+    }
+    this.content = data.content;
+    this.createdAt = data.createdAt || new Date();
+    this.updatedAt = data.updatedAt || new Date();
+    this.type = data.type || "unknown";
+    this.metadata = data.metadata || { sourceURL: "" };
+    this.markdown = data.markdown || "";
+    this.childrenLinks = data.childrenLinks || undefined;
+  }
+}
--- a/apps/api/src/lib/html-to-markdown.ts
+++ b/apps/api/src/lib/html-to-markdown.ts
@ -0,0 +1,51 @@
+export function parseMarkdown(html: string) {
+  var TurndownService = require("turndown");
+
+  const turndownService = new TurndownService();
+  turndownService.addRule("inlineLink", {
+    filter: function (node, options) {
+      return (
+        options.linkStyle === "inlined" &&
+        node.nodeName === "A" &&
+        node.getAttribute("href")
+      );
+    },
+    replacement: function (content, node) {
+      var href = node.getAttribute("href").trim();
+      var title = node.title ? ' "' + node.title + '"' : "";
+      return "[" + content.trim() + "](" + href + title + ")\n";
+    },
+  });
+
+  let markdownContent = turndownService.turndown(html);
+
+  // multiple line links
+  let insideLinkContent = false;
+  let newMarkdownContent = "";
+  let linkOpenCount = 0;
+  for (let i = 0; i < markdownContent.length; i++) {
+    const char = markdownContent[i];
+
+    if (char == "[") {
+      linkOpenCount++;
+    } else if (char == "]") {
+      linkOpenCount = Math.max(0, linkOpenCount - 1);
+    }
+    insideLinkContent = linkOpenCount > 0;
+
+    if (insideLinkContent && char == "\n") {
+      newMarkdownContent += "\\" + "\n";
+    } else {
+      newMarkdownContent += char;
+    }
+  }
+  markdownContent = newMarkdownContent;
+
+  // Remove [Skip to Content](#page) and [Skip to content](#skip)
+  markdownContent = markdownContent.replace(
+    /\[Skip to Content\]\(#[^\)]*\)/gi,
+    ""
+  );
+
+  return markdownContent;
+}
--- a/apps/api/src/lib/parse-mode.ts
+++ b/apps/api/src/lib/parse-mode.ts
@ -0,0 +1,12 @@
+export function parseMode(mode: string) {
+  switch (mode) {
+    case "single_urls":
+      return "single_urls";
+    case "sitemap":
+      return "sitemap";
+    case "crawl":
+      return "crawl";
+    default:
+      return "single_urls";
+  }
+}
--- a/apps/api/src/main/runWebScraper.ts
+++ b/apps/api/src/main/runWebScraper.ts
@ -0,0 +1,96 @@
+import { Job } from "bull";
+import { CrawlResult, WebScraperOptions } from "../types";
+import { WebScraperDataProvider } from "../scraper/WebScraper";
+import { Progress } from "../lib/entities";
+import { billTeam } from "../services/billing/credit_billing";
+
+export async function startWebScraperPipeline({
+  job,
+}: {
+  job: Job<WebScraperOptions>;
+}) {
+  return (await runWebScraper({
+    url: job.data.url,
+    mode: job.data.mode,
+    crawlerOptions: job.data.crawlerOptions,
+    inProgress: (progress) => {
+      job.progress(progress);
+    },
+    onSuccess: (result) => {
+      job.moveToCompleted(result);
+    },
+    onError: (error) => {
+      job.moveToFailed(error);
+    },
+    team_id: job.data.team_id,
+  })) as { success: boolean; message: string; docs: CrawlResult[] };
+}
+export async function runWebScraper({
+  url,
+  mode,
+  crawlerOptions,
+  inProgress,
+  onSuccess,
+  onError,
+  team_id,
+}: {
+  url: string;
+  mode: "crawl" | "single_urls" | "sitemap";
+  crawlerOptions: any;
+  inProgress: (progress: any) => void;
+  onSuccess: (result: any) => void;
+  onError: (error: any) => void;
+  team_id: string;
+}): Promise<{ success: boolean; message: string; docs: CrawlResult[] }> {
+  try {
+    const provider = new WebScraperDataProvider();
+
+    if (mode === "crawl") {
+      await provider.setOptions({
+        mode: mode,
+        urls: [url],
+        crawlerOptions: crawlerOptions,
+      });
+    } else {
+      await provider.setOptions({
+        mode: mode,
+        urls: url.split(","),
+        crawlerOptions: crawlerOptions,
+      });
+    }
+    const docs = (await provider.getDocuments(false, (progress: Progress) => {
+      inProgress(progress);
+    })) as CrawlResult[];
+
+    if (docs.length === 0) {
+      return {
+        success: true,
+        message: "No pages found",
+        docs: [],
+      };
+    }
+
+    // remove docs with empty content
+    const filteredDocs = docs.filter((doc) => doc.content.trim().length > 0);
+    onSuccess(filteredDocs);
+
+    const { success, credit_usage } = await billTeam(
+      team_id,
+      filteredDocs.length
+    );
+    if (!success) {
+      // throw new Error("Failed to bill team, no subscription was found");
+      return {
+        success: false,
+        message: "Failed to bill team, no subscription was found",
+        docs: [],
+      };
+    }
+
+    return { success: true, message: "", docs: filteredDocs as CrawlResult[] };
+  } catch (error) {
+    console.error("Error running web scraper", error);
+    onError(error);
+    return { success: false, message: error.message, docs: [] };
+  }
+}
--- a/apps/api/src/scraper/WebScraper/crawler.ts
+++ b/apps/api/src/scraper/WebScraper/crawler.ts
@ -0,0 +1,295 @@
+import axios from "axios";
+import cheerio, { load } from "cheerio";
+import { URL } from "url";
+import { getLinksFromSitemap } from "./sitemap";
+import async from "async";
+import { Progress } from "../../lib/entities";
+import { scrapWithScrapingBee } from "./single_url";
+import robotsParser from "robots-parser";
+
+export class WebCrawler {
+  private initialUrl: string;
+  private baseUrl: string;
+  private includes: string[];
+  private excludes: string[];
+  private maxCrawledLinks: number;
+  private visited: Set<string> = new Set();
+  private crawledUrls: Set<string> = new Set();
+  private limit: number;
+  private robotsTxtUrl: string;
+  private robots: any;
+
+  constructor({
+    initialUrl,
+    includes,
+    excludes,
+    maxCrawledLinks,
+    limit = 10000,
+  }: {
+    initialUrl: string;
+    includes?: string[];
+    excludes?: string[];
+    maxCrawledLinks?: number;
+    limit?: number;
+  }) {
+    this.initialUrl = initialUrl;
+    this.baseUrl = new URL(initialUrl).origin;
+    this.includes = includes ?? [];
+    this.excludes = excludes ?? [];
+    this.limit = limit;
+    this.robotsTxtUrl = `${this.baseUrl}/robots.txt`;
+    this.robots = robotsParser(this.robotsTxtUrl, "");
+    // Deprecated, use limit instead
+    this.maxCrawledLinks = maxCrawledLinks ?? limit;
+  }
+
+  private filterLinks(sitemapLinks: string[], limit: number): string[] {
+    return sitemapLinks
+      .filter((link) => {
+        const url = new URL(link);
+        const path = url.pathname;
+
+        // Check if the link should be excluded
+        if (this.excludes.length > 0 && this.excludes[0] !== "") {
+          if (
+            this.excludes.some((excludePattern) =>
+              new RegExp(excludePattern).test(path)
+            )
+          ) {
+            return false;
+          }
+        }
+
+        // If include patterns are specified, reject links that match none of
+        // them; matching links still go through the robots.txt check below.
+        if (this.includes.length > 0 && this.includes[0] !== "") {
+          const matchesPath = (p: string) => new RegExp(p).test(path);
+          if (!this.includes.some(matchesPath)) return false;
+        }
+
+        const isAllowed = this.robots.isAllowed(link, "FireCrawlAgent") ?? true;
+        // Check if the link is disallowed by robots.txt
+        if (!isAllowed) {
+          console.log(`Link disallowed by robots.txt: ${link}`);
+          return false;
+        }
+
+        return true;
+      })
+      .slice(0, limit);
+  }
+
+  public async start(
+    inProgress?: (progress: Progress) => void,
+    concurrencyLimit: number = 5,
+    limit: number = 10000
+  ): Promise<string[]> {
+    // Fetch and parse robots.txt
+    try {
+      const response = await axios.get(this.robotsTxtUrl);
+      this.robots = robotsParser(this.robotsTxtUrl, response.data);
+    } catch (error) {
+      console.error(`Failed to fetch robots.txt from ${this.robotsTxtUrl}`);
+    }
+
+    const sitemapLinks = await this.tryFetchSitemapLinks(this.initialUrl);
+    if (sitemapLinks.length > 0) {
+      const filteredLinks = this.filterLinks(sitemapLinks, limit);
+      return filteredLinks;
+    }
+
+    const urls = await this.crawlUrls(
+      [this.initialUrl],
+      concurrencyLimit,
+      inProgress
+    );
+    if (
+      urls.length === 0 &&
+      this.filterLinks([this.initialUrl], limit).length > 0
+    ) {
+      return [this.initialUrl];
+    }
+
+    // make sure to run include exclude here again
+    return this.filterLinks(urls, limit);
+  }
+
+  private async crawlUrls(
+    urls: string[],
+    concurrencyLimit: number,
+    inProgress?: (progress: Progress) => void
+  ): Promise<string[]> {
+    const queue = async.queue(async (task: string, callback) => {
+      if (this.crawledUrls.size >= this.maxCrawledLinks) {
+        if (callback && typeof callback === "function") {
+          callback();
+        }
+        return;
+      }
+      const newUrls = await this.crawl(task);
+      newUrls.forEach((url) => this.crawledUrls.add(url));
+      if (inProgress && newUrls.length > 0) {
+        inProgress({
+          current: this.crawledUrls.size,
+          total: this.maxCrawledLinks,
+          status: "SCRAPING",
+          currentDocumentUrl: newUrls[newUrls.length - 1],
+        });
+      } else if (inProgress) {
+        inProgress({
+          current: this.crawledUrls.size,
+          total: this.maxCrawledLinks,
+          status: "SCRAPING",
+          currentDocumentUrl: task,
+        });
+      }
+      await this.crawlUrls(newUrls, concurrencyLimit, inProgress);
+      if (callback && typeof callback === "function") {
+        callback();
+      }
+    }, concurrencyLimit);
+
+    queue.push(
+      urls.filter(
+        (url) =>
+          !this.visited.has(url) && this.robots.isAllowed(url, "FireCrawlAgent")
+      ),
+      (err) => {
+        if (err) console.error(err);
+      }
+    );
+    await queue.drain();
+    return Array.from(this.crawledUrls);
+  }
+
+  async crawl(url: string): Promise<string[]> {
+    if (this.visited.has(url) || !this.robots.isAllowed(url, "FireCrawlAgent"))
+      return [];
+    this.visited.add(url);
+    if (!url.startsWith("http")) {
+      url = "https://" + url;
+    }
+    if (url.endsWith("/")) {
+      url = url.slice(0, -1);
+    }
+    if (this.isFile(url) || this.isSocialMediaOrEmail(url)) {
+      return [];
+    }
+
+    try {
+      let content;
+      // If it is the first link, fetch with scrapingbee
+      if (this.visited.size === 1) {
+        content = await scrapWithScrapingBee(url, "load");
+      } else {
+        const response = await axios.get(url);
+        content = response.data;
+      }
+      const $ = load(content);
+      let links: string[] = [];
+
+      $("a").each((_, element) => {
+        const href = $(element).attr("href");
+        if (href) {
+          let fullUrl = href;
+          if (!href.startsWith("http")) {
+            fullUrl = new URL(href, this.baseUrl).toString();
+          }
+          const url = new URL(fullUrl);
+          const path = url.pathname;
+
+          if (
+            // fullUrl.startsWith(this.initialUrl) && // this condition makes it stop crawling back the url
+            this.isInternalLink(fullUrl) &&
+            this.matchesPattern(fullUrl) &&
+            this.noSections(fullUrl) &&
+            this.matchesIncludes(path) &&
+            !this.matchesExcludes(path) &&
+            this.robots.isAllowed(fullUrl, "FireCrawlAgent")
+          ) {
+            links.push(fullUrl);
+          }
+        }
+      });
+
+      return links.filter((link) => !this.visited.has(link));
+    } catch (error) {
+      return [];
+    }
+  }
+
+  private matchesIncludes(url: string): boolean {
+    if (this.includes.length === 0 || this.includes[0] == "") return true;
+    return this.includes.some((pattern) => new RegExp(pattern).test(url));
+  }
+
+  private matchesExcludes(url: string): boolean {
+    if (this.excludes.length === 0 || this.excludes[0] == "") return false;
+    return this.excludes.some((pattern) => new RegExp(pattern).test(url));
+  }
+
+  private noSections(link: string): boolean {
+    return !link.includes("#");
+  }
+
+  private isInternalLink(link: string): boolean {
+    const urlObj = new URL(link, this.baseUrl);
+    const domainWithoutProtocol = this.baseUrl.replace(/^https?:\/\//, "");
+    return urlObj.hostname === domainWithoutProtocol;
+  }
+
+  private matchesPattern(link: string): boolean {
+    return true; // Placeholder for future pattern matching implementation
+  }
+
+  private isFile(url: string): boolean {
+    const fileExtensions = [
+      ".png",
+      ".jpg",
+      ".jpeg",
+      ".gif",
+      ".css",
+      ".js",
+      ".ico",
+      ".svg",
+      ".pdf",
+      ".zip",
+      ".exe",
+      ".dmg",
+      ".mp4",
+      ".mp3",
+      ".pptx",
+      ".docx",
+      ".xlsx",
+      ".xml",
+    ];
+    return fileExtensions.some((ext) => url.endsWith(ext));
+  }
+
+  private isSocialMediaOrEmail(url: string): boolean {
+    const socialMediaOrEmail = [
+      "facebook.com",
+      "twitter.com",
+      "linkedin.com",
+      "instagram.com",
+      "pinterest.com",
+      "mailto:",
+    ];
+    return socialMediaOrEmail.some((ext) => url.includes(ext));
+  }
+
+  private async tryFetchSitemapLinks(url: string): Promise<string[]> {
+    const sitemapUrl = url.endsWith("/sitemap.xml")
+      ? url
+      : `${url}/sitemap.xml`;
+    try {
+      const response = await axios.get(sitemapUrl);
+      if (response.status === 200) {
+        return await getLinksFromSitemap(sitemapUrl);
+      }
+    } catch (error) {
+      // Error handling for failed sitemap fetch
+    }
+    return [];
+  }
+}
--- a/apps/api/src/scraper/WebScraper/index.ts
+++ b/apps/api/src/scraper/WebScraper/index.ts
@ -0,0 +1,287 @@
+import { Document } from "../../lib/entities";
+import { Progress } from "../../lib/entities";
+import { scrapSingleUrl } from "./single_url";
+import { SitemapEntry, fetchSitemapData, getLinksFromSitemap } from "./sitemap";
+import { WebCrawler } from "./crawler";
+import { getValue, setValue } from "../../services/redis";
+
+export type WebScraperOptions = {
+  urls: string[];
+  mode: "single_urls" | "sitemap" | "crawl";
+  crawlerOptions?: {
+    returnOnlyUrls?: boolean;
+    includes?: string[];
+    excludes?: string[];
+    maxCrawledLinks?: number;
+    limit?: number;
+
+  };
+  concurrentRequests?: number;
+};
+export class WebScraperDataProvider {
+  private urls: string[] = [""];
+  private mode: "single_urls" | "sitemap" | "crawl" = "single_urls";
+  private includes: string[];
+  private excludes: string[];
+  private maxCrawledLinks: number;
+  private returnOnlyUrls: boolean;
+  private limit: number = 10000;
+  private concurrentRequests: number = 20;
+
+  authorize(): void {
+    throw new Error("Method not implemented.");
+  }
+
+  authorizeNango(): Promise<void> {
+    throw new Error("Method not implemented.");
+  }
+
+  private async convertUrlsToDocuments(
+    urls: string[],
+    inProgress?: (progress: Progress) => void
+  ): Promise<Document[]> {
+    const totalUrls = urls.length;
+    let processedUrls = 0;
+    console.log("Converting urls to documents");
+    console.log("Total urls", urls);
+    const results: (Document | null)[] = new Array(urls.length).fill(null);
+    for (let i = 0; i < urls.length; i += this.concurrentRequests) {
+      const batchUrls = urls.slice(i, i + this.concurrentRequests);
+      await Promise.all(batchUrls.map(async (url, index) => {
+        const result = await scrapSingleUrl(url, true);
+        processedUrls++;
+        if (inProgress) {
+          inProgress({
+            current: processedUrls,
+            total: totalUrls,
+            status: "SCRAPING",
+            currentDocumentUrl: url,
+          });
+        }
+        results[i + index] = result;
+      }));
+    }
+    return results.filter((result) => result !== null) as Document[];
+  }
+
+  async getDocuments(
+    useCaching: boolean = false,
+    inProgress?: (progress: Progress) => void
+  ): Promise<Document[]> {
+    if (this.urls[0].trim() === "") {
+      throw new Error("Url is required");
+    }
+
+    if (!useCaching) {
+      if (this.mode === "crawl") {
+        const crawler = new WebCrawler({
+          initialUrl: this.urls[0],
+          includes: this.includes,
+          excludes: this.excludes,
+          maxCrawledLinks: this.maxCrawledLinks,
+          limit: this.limit,
+        });
+        const links = await crawler.start(inProgress, 5, this.limit);
+        if (this.returnOnlyUrls) {
+          return links.map((url) => ({
+            content: "",
+            metadata: { sourceURL: url },
+            provider: "web",
+            type: "text",
+          }));
+        }
+        let documents = await this.convertUrlsToDocuments(links, inProgress);
+        documents = await this.getSitemapData(this.urls[0], documents);
+        console.log("documents", documents)
+
+        // CACHING DOCUMENTS
+        // - parent document
+        const cachedParentDocumentString = await getValue('web-scraper-cache:' + this.normalizeUrl(this.urls[0]));
+        if (cachedParentDocumentString != null) {
+          let cachedParentDocument = JSON.parse(cachedParentDocumentString);
+          if (!cachedParentDocument.childrenLinks || cachedParentDocument.childrenLinks.length < links.length - 1) {
+            cachedParentDocument.childrenLinks = links.filter((link) => link !== this.urls[0]);
+            await setValue('web-scraper-cache:' + this.normalizeUrl(this.urls[0]), JSON.stringify(cachedParentDocument), 60 * 60 * 24 * 10); // 10 days
+          }
+        } else {
+          let parentDocument = documents.filter((document) => this.normalizeUrl(document.metadata.sourceURL) === this.normalizeUrl(this.urls[0]))
+          await this.setCachedDocuments(parentDocument, links);
+        }
+
+        await this.setCachedDocuments(documents.filter((document) => this.normalizeUrl(document.metadata.sourceURL) !== this.normalizeUrl(this.urls[0])), []);
+        documents = this.removeChildLinks(documents);
+        documents = documents.splice(0, this.limit);
+        return documents;
+      }
+
+      if (this.mode === "single_urls") {
+        let documents = await this.convertUrlsToDocuments(this.urls, inProgress);
+        
+        const baseUrl = new URL(this.urls[0]).origin;
+        documents = await this.getSitemapData(baseUrl, documents);
+        
+        await this.setCachedDocuments(documents);
+        documents = this.removeChildLinks(documents);
+        documents = documents.splice(0, this.limit);
+        return documents;
+      }
+      if (this.mode === "sitemap") {
+        const links = await getLinksFromSitemap(this.urls[0]);
+        let documents = await this.convertUrlsToDocuments(links.slice(0, this.limit), inProgress);
+
+        documents = await this.getSitemapData(this.urls[0], documents);
+        
+        await this.setCachedDocuments(documents);
+        documents = this.removeChildLinks(documents);
+        documents = documents.splice(0, this.limit);
+        return documents;
+      }
+
+      return [];
+    }
+
+    let documents = await this.getCachedDocuments(this.urls.slice(0, this.limit));
+    if (documents.length < this.limit) {
+      const newDocuments: Document[] = await this.getDocuments(false, inProgress);
+      newDocuments.forEach(doc => {
+        if (!documents.some(d => this.normalizeUrl(d.metadata.sourceURL) === this.normalizeUrl(doc.metadata?.sourceURL))) {
+          documents.push(doc);
+        }
+      });
+    }
+    documents = this.filterDocsExcludeInclude(documents);
+    documents = this.removeChildLinks(documents);
+    documents = documents.splice(0, this.limit);
+    return documents;
+  }
+
+  private filterDocsExcludeInclude(documents: Document[]): Document[] {
+    return documents.filter((document) => {
+      const url = new URL(document.metadata.sourceURL);
+      const path = url.pathname;
+
+      if (this.excludes.length > 0 && this.excludes[0] !== '') {
+        // Check if the link should be excluded
+        if (this.excludes.some(excludePattern => new RegExp(excludePattern).test(path))) {
+          return false;
+        }
+      }
+      
+      if (this.includes.length > 0 && this.includes[0] !== '') {
+        // Check if the link matches the include patterns, if any are specified
+        if (this.includes.length > 0) {
+          return this.includes.some(includePattern => new RegExp(includePattern).test(path));
+        }
+      }
+      return true;
+    });
+  }
+
+  private normalizeUrl(url: string): string {
+    if (url.includes("//www.")) {
+      return url.replace("//www.", "//");
+    }
+    return url;
+  }
+
+  private removeChildLinks(documents: Document[]): Document[] {
+    for (let document of documents) {
+      if (document?.childrenLinks) delete document.childrenLinks;
+    }
+    return documents;
+  }
+
+  async setCachedDocuments(documents: Document[], childrenLinks?: string[]) {
+    for (const document of documents) {
+      if (document.content.trim().length === 0) {
+        continue;
+      }
+      const normalizedUrl = this.normalizeUrl(document.metadata.sourceURL);
+      await setValue('web-scraper-cache:' + normalizedUrl, JSON.stringify({
+        ...document,
+        childrenLinks: childrenLinks || []
+      }), 60 * 60 * 24 * 10); // 10 days
+    }
+  }
+
+  async getCachedDocuments(urls: string[]): Promise<Document[]> {
+    let documents: Document[] = [];
+    for (const url of urls) {
+      const normalizedUrl = this.normalizeUrl(url);
+      console.log("Getting cached document for web-scraper-cache:" + normalizedUrl)
+      const cachedDocumentString = await getValue('web-scraper-cache:' + normalizedUrl);
+      if (cachedDocumentString) {
+        const cachedDocument = JSON.parse(cachedDocumentString);
+        documents.push(cachedDocument);
+
+        // get children documents
+        for (const childUrl of cachedDocument.childrenLinks ?? []) {
+          const normalizedChildUrl = this.normalizeUrl(childUrl);
+          const childCachedDocumentString = await getValue('web-scraper-cache:' + normalizedChildUrl);
+          if (childCachedDocumentString) {
+            const childCachedDocument = JSON.parse(childCachedDocumentString);
+            if (!documents.find((doc) => doc.metadata.sourceURL === childCachedDocument.metadata.sourceURL)) {
+              documents.push(childCachedDocument);
+            }
+          }
+        }
+      }
+    }
+    return documents;
+  }
+
+  setOptions(options: WebScraperOptions): void {
+    if (!options.urls) {
+      throw new Error("Urls are required");
+    }
+
+    console.log("options", options.crawlerOptions?.excludes)
+    this.urls = options.urls;
+    this.mode = options.mode;
+    this.concurrentRequests = options.concurrentRequests ?? 20;
+    this.includes = options.crawlerOptions?.includes ?? [];
+    this.excludes = options.crawlerOptions?.excludes ?? [];
+    this.maxCrawledLinks = options.crawlerOptions?.maxCrawledLinks ?? 1000;
+    this.returnOnlyUrls = options.crawlerOptions?.returnOnlyUrls ?? false;
+    this.limit = options.crawlerOptions?.limit ?? 10000;
+
+
+    //! @nicolas, for some reason this was being injected and breaking everything. Don't have time to find the source of the issue, so adding this check
+    this.excludes = this.excludes.filter(item => item !== '');
+  
+  
+    // make sure all urls have a scheme; default to https://
+    this.urls = this.urls.map((url) => {
+      if (!url.trim().startsWith("http")) {
+        return `https://${url.trim()}`;
+      }
+      return url;
+    });
+  }
+
+  private async getSitemapData(baseUrl: string, documents: Document[]) {
+    const sitemapData = await fetchSitemapData(baseUrl)
+    if (sitemapData) {
+      for (let i = 0; i < documents.length; i++) {
+        const docInSitemapData = sitemapData.find((data) => this.normalizeUrl(data.loc) === this.normalizeUrl(documents[i].metadata.sourceURL))
+        if (docInSitemapData) {
+          let sitemapDocData: Partial<SitemapEntry> = {};
+          if (docInSitemapData.changefreq) {
+            sitemapDocData.changefreq = docInSitemapData.changefreq;
+          }
+          if (docInSitemapData.priority) {
+            sitemapDocData.priority = Number(docInSitemapData.priority);
+          }
+          if (docInSitemapData.lastmod) {
+            sitemapDocData.lastmod = docInSitemapData.lastmod;
+          }
+          if (Object.keys(sitemapDocData).length !== 0) {
+            documents[i].metadata.sitemap = sitemapDocData;
+          }
+        }
+      }
+    }
+    return documents;
+  }
+}
+
--- a/apps/api/src/scraper/WebScraper/single_url.ts
+++ b/apps/api/src/scraper/WebScraper/single_url.ts
@ -0,0 +1,145 @@
+import * as cheerio from "cheerio";
+import { ScrapingBeeClient } from "scrapingbee";
+import { attemptScrapWithRequests, sanitizeText } from "./utils/utils";
+import { extractMetadata } from "./utils/metadata";
+import dotenv from "dotenv";
+import { Document } from "../../lib/entities";
+import { parseMarkdown } from "../../lib/html-to-markdown";
+// import puppeteer from "puppeteer";
+
+dotenv.config();
+
+
+
+export async function scrapWithScrapingBee(url: string, wait_browser:string = "domcontentloaded"): Promise<string> {
+  try {
+    const client = new ScrapingBeeClient(process.env.SCRAPING_BEE_API_KEY);
+    const response = await client.get({
+      url: url,
+      params: { timeout: 15000, wait_browser: wait_browser },
+      headers: { "ScrapingService-Request": "TRUE" },
+    });
+
+    if (response.status !== 200 && response.status !== 404) {
+      console.error(
+        `Scraping bee error in ${url} with status code ${response.status}`
+      );
+      return "";
+    }
+    const decoder = new TextDecoder();
+    const text = decoder.decode(response.data);
+    return text;
+  } catch (error) {
+    console.error(`Error scraping with Scraping Bee: ${error}`);
+    return "";
+  }
+}
+
+
+export async function scrapWithPlaywright(url: string): Promise<string> {
+  try {
+    const response = await fetch(process.env.PLAYWRIGHT_MICROSERVICE_URL, {
+      method: 'POST',
+      headers: {
+        "Content-Type": "application/json",
+      },
+      body: JSON.stringify({ url: url }),
+    });
+
+    if (!response.ok) {
+      console.error(`Error fetching w/ playwright server -> URL: ${url} with status: ${response.status}`);
+      return "";
+    }
+
+    const data = await response.json();
+    const html = data.content;
+    return html ?? "";
+  } catch (error) {
+    console.error(`Error scraping with Playwright: ${error}`);
+    return "";
+  }
+}
+
+export async function scrapSingleUrl(
+  urlToScrap: string,
+  toMarkdown: boolean = true
+): Promise<Document> {
+  console.log(`Scraping URL: ${urlToScrap}`);
+  urlToScrap = urlToScrap.trim();
+
+  const removeUnwantedElements = (html: string) => {
+    const soup = cheerio.load(html);
+    soup("script, style, iframe, noscript, meta, head").remove();
+    return soup.html();
+  };
+
+  const attemptScraping = async (url: string, method: 'scrapingBee' | 'playwright' | 'scrapingBeeLoad' | 'fetch') => {
+    let text = "";
+    switch (method) {
+      case 'scrapingBee':
+        if (process.env.SCRAPING_BEE_API_KEY) {
+          text = await scrapWithScrapingBee(url);
+        }
+        break;
+      case 'playwright':
+        if (process.env.PLAYWRIGHT_MICROSERVICE_URL) {
+          text = await scrapWithPlaywright(url);
+        }
+        break;
+      case 'scrapingBeeLoad':
+        if (process.env.SCRAPING_BEE_API_KEY) {
+          text = await scrapWithScrapingBee(url, "networkidle2");
+        }
+        break;
+      case 'fetch':
+        try {
+          const response = await fetch(url);
+          if (!response.ok) {
+            console.error(`Error fetching URL: ${url} with status: ${response.status}`);
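+            return ["", ""]; // keep the [markdown, html] tuple shape callers destructure
+          }
+          text = await response.text();
+        } catch (error) {
+          console.error(`Error scraping URL: ${error}`);
+          return ["", ""]; // keep the [markdown, html] tuple shape callers destructure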
+        }
+        break;
+      
+    }
+    const cleanedHtml = removeUnwantedElements(text);
+    return [await parseMarkdown(cleanedHtml), text];
+  };
+
+  try {
+    let [text, html ] = await attemptScraping(urlToScrap, 'scrapingBee');
+    if (!text || text.length < 100) {
+      console.log("Falling back to playwright");
+      [text, html] = await attemptScraping(urlToScrap, 'playwright');
+    }
+
+    if (!text || text.length < 100) {
+      console.log("Falling back to scraping bee load");
+      [text, html] = await attemptScraping(urlToScrap, 'scrapingBeeLoad');
+    }
+    if (!text || text.length < 100) {
+      console.log("Falling back to fetch");
+      [text, html] = await attemptScraping(urlToScrap, 'fetch');
+    }
+
+    const soup = cheerio.load(html);
+    const metadata = extractMetadata(soup, urlToScrap);
+
+    return {
+      content: text,
+      markdown: text,
+      metadata: { ...metadata, sourceURL: urlToScrap },
+    } as Document;
+  } catch (error) {
+    console.error(`Error: ${error} - Failed to fetch URL: ${urlToScrap}`);
+    return {
+      content: "",
+      markdown: "",
+      metadata: { sourceURL: urlToScrap },
+    } as Document;
+  }
+}
--- a/apps/api/src/scraper/WebScraper/sitemap.ts
+++ b/apps/api/src/scraper/WebScraper/sitemap.ts
@ -0,0 +1,74 @@
+import axios from "axios";
+import { parseStringPromise } from "xml2js";
+
+export async function getLinksFromSitemap(
+  sitemapUrl: string,
+  allUrls: string[] = []
+): Promise<string[]> {
+  try {
+    let content: string;
+    try {
+      const response = await axios.get(sitemapUrl);
+      content = response.data;
+    } catch (error) {
+      console.error(`Request failed for ${sitemapUrl}: ${error}`);
+      return allUrls;
+    }
+
+    const parsed = await parseStringPromise(content);
+    const root = parsed.urlset || parsed.sitemapindex;
+
+    if (root && root.sitemap) {
+      for (const sitemap of root.sitemap) {
+        if (sitemap.loc && sitemap.loc.length > 0) {
+          await getLinksFromSitemap(sitemap.loc[0], allUrls);
+        }
+      }
+    } else if (root && root.url) {
+      for (const url of root.url) {
+        if (url.loc && url.loc.length > 0) {
+          allUrls.push(url.loc[0]);
+        }
+      }
+    }
+  } catch (error) {
+    console.error(`Error processing ${sitemapUrl}: ${error}`);
+  }
+
+  return allUrls;
+}
+
+export const fetchSitemapData = async (url: string): Promise<SitemapEntry[] | null> => {
+  const sitemapUrl = url.endsWith("/sitemap.xml") ? url : `${url}/sitemap.xml`;
+  try {
+    const response = await axios.get(sitemapUrl);
+    if (response.status === 200) {
+      const xml = response.data;
+      const parsedXml = await parseStringPromise(xml);
+
+      const sitemapData: SitemapEntry[] = [];
+      if (parsedXml.urlset && parsedXml.urlset.url) {
+        for (const urlElement of parsedXml.urlset.url) {
+          const sitemapEntry: SitemapEntry = { loc: urlElement.loc[0] };
+          if (urlElement.lastmod) sitemapEntry.lastmod = urlElement.lastmod[0];
+          if (urlElement.changefreq) sitemapEntry.changefreq = urlElement.changefreq[0];
+          if (urlElement.priority) sitemapEntry.priority = Number(urlElement.priority[0]);
+          sitemapData.push(sitemapEntry);
+        }
+      }
+
+      return sitemapData;
+    }
+    return null;
+  } catch (error) {
+    console.error(`Failed to fetch sitemap ${sitemapUrl}: ${error}`);
+  }
+  return [];
+}
+
+export interface SitemapEntry {
+  loc: string;
+  lastmod?: string;
+  changefreq?: string;
+  priority?: number;
+}
--- a/apps/api/src/scraper/WebScraper/utils/metadata.ts
+++ b/apps/api/src/scraper/WebScraper/utils/metadata.ts
@ -0,0 +1,109 @@
+// import * as cheerio from 'cheerio';
+import { CheerioAPI } from "cheerio";
+interface Metadata {
+  title?: string;
+  description?: string;
+  language?: string;
+  keywords?: string;
+  robots?: string;
+  ogTitle?: string;
+  ogDescription?: string;
+  dctermsCreated?: string;
+  dcDateCreated?: string;
+  dcDate?: string;
+  dctermsType?: string;
+  dcType?: string;
+  dctermsAudience?: string;
+  dctermsSubject?: string;
+  dcSubject?: string;
+  dcDescription?: string;
+  ogImage?: string;
+  dctermsKeywords?: string;
+  modifiedTime?: string;
+  publishedTime?: string;
+  articleTag?: string;
+  articleSection?: string;
+}
+
+export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
+  let title: string | null = null;
+  let description: string | null = null;
+  let language: string | null = null;
+  let keywords: string | null = null;
+  let robots: string | null = null;
+  let ogTitle: string | null = null;
+  let ogDescription: string | null = null;
+  let dctermsCreated: string | null = null;
+  let dcDateCreated: string | null = null;
+  let dcDate: string | null = null;
+  let dctermsType: string | null = null;
+  let dcType: string | null = null;
+  let dctermsAudience: string | null = null;
+  let dctermsSubject: string | null = null;
+  let dcSubject: string | null = null;
+  let dcDescription: string | null = null;
+  let ogImage: string | null = null;
+  let dctermsKeywords: string | null = null;
+  let modifiedTime: string | null = null;
+  let publishedTime: string | null = null;
+  let articleTag: string | null = null;
+  let articleSection: string | null = null;
+
+  try {
+    title = soup("title").text() || null;
+    description = soup('meta[name="description"]').attr("content") || null;
+    
+    // Assuming the language is part of the URL as per the regex pattern
+    const pattern = /([a-zA-Z]+-[A-Z]{2})/;
+    const match = pattern.exec(url);
+    language = match ? match[1] : null;
+
+    keywords = soup('meta[name="keywords"]').attr("content") || null;
+    robots = soup('meta[name="robots"]').attr("content") || null;
+    ogTitle = soup('meta[property="og:title"]').attr("content") || null;
+    ogDescription = soup('meta[property="og:description"]').attr("content") || null;
+    articleSection = soup('meta[name="article:section"]').attr("content") || null;
+    articleTag = soup('meta[name="article:tag"]').attr("content") || null;
+    publishedTime = soup('meta[property="article:published_time"]').attr("content") || null;
+    modifiedTime = soup('meta[property="article:modified_time"]').attr("content") || null;
+    ogImage = soup('meta[property="og:image"]').attr("content") || null;
+    dctermsKeywords = soup('meta[name="dcterms.keywords"]').attr("content") || null;
+    dcDescription = soup('meta[name="dc.description"]').attr("content") || null;
+    dcSubject = soup('meta[name="dc.subject"]').attr("content") || null;
+    dctermsSubject = soup('meta[name="dcterms.subject"]').attr("content") || null;
+    dctermsAudience = soup('meta[name="dcterms.audience"]').attr("content") || null;
+    dcType = soup('meta[name="dc.type"]').attr("content") || null;
+    dctermsType = soup('meta[name="dcterms.type"]').attr("content") || null;
+    dcDate = soup('meta[name="dc.date"]').attr("content") || null;
+    dcDateCreated = soup('meta[name="dc.date.created"]').attr("content") || null;
+    dctermsCreated = soup('meta[name="dcterms.created"]').attr("content") || null;
+
+  } catch (error) {
+    console.error("Error extracting metadata:", error);
+  }
+
+  return {
+    ...(title ? { title } : {}),
+    ...(description ? { description } : {}),
+    ...(language ? { language } : {}),
+    ...(keywords ? { keywords } : {}),
+    ...(robots ? { robots } : {}),
+    ...(ogTitle ? { ogTitle } : {}),
+    ...(ogDescription ? { ogDescription } : {}),
+    ...(dctermsCreated ? { dctermsCreated } : {}),
+    ...(dcDateCreated ? { dcDateCreated } : {}),
+    ...(dcDate ? { dcDate } : {}),
+    ...(dctermsType ? { dctermsType } : {}),
+    ...(dcType ? { dcType } : {}),
+    ...(dctermsAudience ? { dctermsAudience } : {}),
+    ...(dctermsSubject ? { dctermsSubject } : {}),
+    ...(dcSubject ? { dcSubject } : {}),
+    ...(dcDescription ? { dcDescription } : {}),
+    ...(ogImage ? { ogImage } : {}),
+    ...(dctermsKeywords ? { dctermsKeywords } : {}),
+    ...(modifiedTime ? { modifiedTime } : {}),
+    ...(publishedTime ? { publishedTime } : {}),
+    ...(articleTag ? { articleTag } : {}),
+    ...(articleSection ? { articleSection } : {}),
+  };
+}
--- a/apps/api/src/scraper/WebScraper/utils/utils.ts
+++ b/apps/api/src/scraper/WebScraper/utils/utils.ts
@ -0,0 +1,23 @@
+import axios from "axios";
+
+export async function attemptScrapWithRequests(
+  urlToScrap: string
+): Promise<string | null> {
+  try {
+    const response = await axios.get(urlToScrap);
+
+    if (!response.data) {
+      console.log("Failed normal requests as well");
+      return null;
+    }
+
+    return response.data;
+  } catch (error) {
+    console.error(`Error in attemptScrapWithRequests: ${error}`);
+    return null;
+  }
+}
+
+export function sanitizeText(text: string): string {
+  return text.replace(/\u0000/g, ""); // strip every NUL character, not just the first
+}
--- a/apps/api/src/services/billing/credit_billing.ts
+++ b/apps/api/src/services/billing/credit_billing.ts
@ -0,0 +1,219 @@
+import { supabase_service } from "../supabase";
+
+const FREE_CREDITS = 100;
+export async function billTeam(team_id: string, credits: number) {
+  if (team_id === "preview") {
+    return { success: true, message: "Preview team, no credits used" };
+  }
+  console.log(`Billing team ${team_id} for ${credits} credits`);
+  // Each API call is logged in the credit_usage table with these fields:
+  // team_id: The ID of the team using the API.
+  // subscription_id: The ID of the team's active subscription.
+  // credits_used: The number of credits consumed by the API call.
+  // created_at: The timestamp of the API usage.
+
+  // 1. get the subscription
+
+  const { data: subscription } = await supabase_service
+    .from("subscriptions")
+    .select("*")
+    .eq("team_id", team_id)
+    .eq("status", "active")
+    .single();
+
+  if (!subscription) {
+    const { data: credit_usage } = await supabase_service
+      .from("credit_usage")
+      .insert([
+        {
+          team_id,
+          credits_used: credits,
+          created_at: new Date(),
+        },
+      ])
+      .select();
+
+    return { success: true, credit_usage };
+  }
+
+  // 2. Record the credits used in the credit_usage table
+  const { data: credit_usage } = await supabase_service
+    .from("credit_usage")
+    .insert([
+      {
+        team_id,
+        subscription_id: subscription.id,
+        credits_used: credits,
+        created_at: new Date(),
+      },
+    ])
+    .select();
+
+  return { success: true, credit_usage };
+}
+
+// if team has enough credits for the operation, return true, else return false
+export async function checkTeamCredits(team_id: string, credits: number) {
+  if (team_id === "preview") {
+    return { success: true, message: "Preview team, no credits used" };
+  }
+  // 1. Retrieve the team's active subscription based on the team_id.
+  const { data: subscription, error: subscriptionError } =
+    await supabase_service
+      .from("subscriptions")
+      .select("id, price_id, current_period_start, current_period_end")
+      .eq("team_id", team_id)
+      .eq("status", "active")
+      .single();
+
+  if (subscriptionError || !subscription) {
+    const { data: creditUsages, error: creditUsageError } =
+      await supabase_service
+        .from("credit_usage")
+        .select("credits_used")
+        .is("subscription_id", null)
+        .eq("team_id", team_id);
+    // .gte("created_at", subscription.current_period_start)
+    // .lte("created_at", subscription.current_period_end);
+
+    if (creditUsageError) {
+      throw new Error(
+        `Failed to retrieve credit usage for team_id: ${team_id}`
+      );
+    }
+
+    const totalCreditsUsed = creditUsages.reduce(
+      (acc, usage) => acc + usage.credits_used,
+      0
+    );
+
+    console.log("totalCreditsUsed", totalCreditsUsed);
+    // 5. Compare the total credits used with the credits allowed by the plan.
+    if (totalCreditsUsed + credits > FREE_CREDITS) {
+      return {
+        success: false,
+        message: "Insufficient credits, please upgrade!",
+      };
+    }
+    return { success: true, message: "Sufficient credits available" };
+  }
+
+  // 2. Get the price_id from the subscription.
+  const { data: price, error: priceError } = await supabase_service
+    .from("prices")
+    .select("credits")
+    .eq("id", subscription.price_id)
+    .single();
+
+  if (priceError) {
+    throw new Error(
+      `Failed to retrieve price for price_id: ${subscription.price_id}`
+    );
+  }
+
+  // 4. Calculate the total credits used by the team within the current billing period.
+  const { data: creditUsages, error: creditUsageError } = await supabase_service
+    .from("credit_usage")
+    .select("credits_used")
+    .eq("subscription_id", subscription.id)
+    .gte("created_at", subscription.current_period_start)
+    .lte("created_at", subscription.current_period_end);
+
+  if (creditUsageError) {
+    throw new Error(
+      `Failed to retrieve credit usage for subscription_id: ${subscription.id}`
+    );
+  }
+
+  const totalCreditsUsed = creditUsages.reduce(
+    (acc, usage) => acc + usage.credits_used,
+    0
+  );
+
+  // 5. Compare the total credits used with the credits allowed by the plan.
+  if (totalCreditsUsed + credits > price.credits) {
+    return { success: false, message: "Insufficient credits, please upgrade!" };
+  }
+
+  return { success: true, message: "Sufficient credits available" };
+}
+
+// Count the total credits used by a team within the current billing period and return the remaining credits.
+export async function countCreditsAndRemainingForCurrentBillingPeriod(
+  team_id: string
+) {
+  // 1. Retrieve the team's active subscription based on the team_id.
+  const { data: subscription, error: subscriptionError } =
+    await supabase_service
+      .from("subscriptions")
+      .select("id, price_id, current_period_start, current_period_end")
+      .eq("team_id", team_id)
+      .single();
+
+  if (subscriptionError || !subscription) {
+    // throw new Error(`Failed to retrieve subscription for team_id: ${team_id}`);
+
+    // Free
+    const { data: creditUsages, error: creditUsageError } =
+      await supabase_service
+        .from("credit_usage")
+        .select("credits_used")
+        .is("subscription_id", null)
+        .eq("team_id", team_id);
+    // .gte("created_at", subscription.current_period_start)
+    // .lte("created_at", subscription.current_period_end);
+
+    if (creditUsageError || !creditUsages) {
+      throw new Error(
+        `Failed to retrieve credit usage for team_id: ${team_id}`
+      );
+    }
+
+    const totalCreditsUsed = creditUsages.reduce(
+      (acc, usage) => acc + usage.credits_used,
+      0
+    );
+
+    // 4. Calculate remaining credits.
+    const remainingCredits = FREE_CREDITS - totalCreditsUsed;
+
+    return { totalCreditsUsed, remainingCredits, totalCredits: FREE_CREDITS };
+  }
+
+  // 2. Get the price_id from the subscription to retrieve the total credits available.
+  const { data: price, error: priceError } = await supabase_service
+    .from("prices")
+    .select("credits")
+    .eq("id", subscription.price_id)
+    .single();
+
+  if (priceError || !price) {
+    throw new Error(
+      `Failed to retrieve price for price_id: ${subscription.price_id}`
+    );
+  }
+
+  // 3. Calculate the total credits used by the team within the current billing period.
+  const { data: creditUsages, error: creditUsageError } = await supabase_service
+    .from("credit_usage")
+    .select("credits_used")
+    .eq("subscription_id", subscription.id)
+    .gte("created_at", subscription.current_period_start)
+    .lte("created_at", subscription.current_period_end);
+
+  if (creditUsageError || !creditUsages) {
+    throw new Error(
+      `Failed to retrieve credit usage for subscription_id: ${subscription.id}`
+    );
+  }
+
+  const totalCreditsUsed = creditUsages.reduce(
+    (acc, usage) => acc + usage.credits_used,
+    0
+  );
+
+  // 4. Calculate remaining credits.
+  const remainingCredits = price.credits - totalCreditsUsed;
+
+  return { totalCreditsUsed, remainingCredits, totalCredits: price.credits };
+}
--- a/apps/api/src/services/logtail.ts
+++ b/apps/api/src/services/logtail.ts
@ -0,0 +1,4 @@
+const { Logtail } = require("@logtail/node");
+// Load environment variables from .env
+require("dotenv").config();
+export const logtail = new Logtail(process.env.LOGTAIL_KEY);
--- a/apps/api/src/services/queue-jobs.ts
+++ b/apps/api/src/services/queue-jobs.ts
@ -0,0 +1,17 @@
+import { Job, Queue } from "bull";
+import {
+  getWebScraperQueue,
+} from "./queue-service";
+import { v4 as uuidv4 } from "uuid";
+import { WebScraperOptions } from "../types";
+
+export async function addWebScraperJob(
+  webScraperOptions: WebScraperOptions,
+  options: any = {}
+): Promise<Job> {
+  return await getWebScraperQueue().add(webScraperOptions, {
+    ...options,
+    jobId: uuidv4(),
+  });
+}
+
--- a/apps/api/src/services/queue-service.ts
+++ b/apps/api/src/services/queue-service.ts
@ -0,0 +1,16 @@
+import Queue from "bull";
+
+let webScraperQueue;
+
+export function getWebScraperQueue() {
+  if (!webScraperQueue) {
+    webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, {
+      settings: {
+        lockDuration: 4 * 60 * 60 * 1000, // 4 hours in milliseconds
+        lockRenewTime: 30 * 60 * 1000, // 30 minutes in milliseconds
+      },
+    });
+    console.log("Web scraper queue created");
+  }
+  return webScraperQueue;
+}
--- a/apps/api/src/services/queue-worker.ts
+++ b/apps/api/src/services/queue-worker.ts
@ -0,0 +1,62 @@
+import { CustomError } from "../lib/custom-error";
+import { getWebScraperQueue } from "./queue-service";
+import "dotenv/config";
+import { logtail } from "./logtail";
+import { startWebScraperPipeline } from "../main/runWebScraper";
+import { WebScraperDataProvider } from "../scraper/WebScraper";
+import { callWebhook } from "./webhook";
+
+getWebScraperQueue().process(
+  Math.floor(Number(process.env.NUM_WORKERS_PER_QUEUE ?? 8)),
+  async function (job, done) {
+    try {
+      job.progress({
+        current: 1,
+        total: 100,
+        current_step: "SCRAPING",
+        current_url: "",
+      });
+      const { success, message, docs } = await startWebScraperPipeline({ job });
+
+      const data = {
+        success: success,
+        result: {
+          links: docs.map((doc) => {
+            return { content: doc, source: doc.metadata.sourceURL };
+          }),
+        },
+        project_id: job.data.project_id,
+        error: message /* etc... */,
+      };
+
+      await callWebhook(job.data.team_id, data);
+      done(null, data);
+    } catch (error) {
+      if (error instanceof CustomError) {
+        // Here we handle the error, then save the failed job
+        console.error(error.message); // or any other error handling
+
+        logtail.error("Custom error while ingesting", {
+          job_id: job.id,
+          error: error.message,
+          dataIngestionJob: error.dataIngestionJob,
+        });
+      }
+      console.log(error);
+
+      logtail.error("Overall error ingesting", {
+        job_id: job.id,
+        error: error.message,
+      });
+
+      const data = {
+        success: false,
+        project_id: job.data.project_id,
+        error:
+          "Something went wrong... Contact help@mendable.ai or try again." /* etc... */,
+      };
+      await callWebhook(job.data.team_id, data);
+      done(null, data);
+    }
+  }
+);
--- a/apps/api/src/services/rate-limiter.ts
+++ b/apps/api/src/services/rate-limiter.ts
@ -0,0 +1,65 @@
+import { RateLimiterRedis } from "rate-limiter-flexible";
+import * as redis from "redis";
+
+const MAX_REQUESTS_PER_MINUTE_PREVIEW = 5;
+const MAX_CRAWLS_PER_MINUTE_STARTER = 2;
+const MAX_CRAWLS_PER_MINUTE_STANDARD = 4;
+const MAX_CRAWLS_PER_MINUTE_SCALE = 20;
+
+const MAX_REQUESTS_PER_MINUTE_ACCOUNT = 40;
+
+
+
+export const redisClient = redis.createClient({
+  url: process.env.REDIS_URL,
+  legacyMode: true,
+});
+
+export const previewRateLimiter = new RateLimiterRedis({
+  storeClient: redisClient,
+  keyPrefix: "middleware",
+  points: MAX_REQUESTS_PER_MINUTE_PREVIEW,
+  duration: 60, // Duration in seconds
+});
+
+export const serverRateLimiter = new RateLimiterRedis({
+  storeClient: redisClient,
+  keyPrefix: "middleware",
+  points: MAX_REQUESTS_PER_MINUTE_ACCOUNT,
+  duration: 60, // Duration in seconds
+});
+
+
+export function crawlRateLimit(plan: string){
+  if(plan === "standard"){
+    return new RateLimiterRedis({
+      storeClient: redisClient,
+      keyPrefix: "middleware",
+      points: MAX_CRAWLS_PER_MINUTE_STANDARD,
+      duration: 60, // Duration in seconds
+    });
+  }else if(plan === "scale"){
+    return new RateLimiterRedis({
+      storeClient: redisClient,
+      keyPrefix: "middleware",
+      points: MAX_CRAWLS_PER_MINUTE_SCALE,
+      duration: 60, // Duration in seconds
+    });
+  }
+  return new RateLimiterRedis({
+    storeClient: redisClient,
+    keyPrefix: "middleware",
+    points: MAX_CRAWLS_PER_MINUTE_STARTER,
+    duration: 60, // Duration in seconds
+  });
+
+}
+
+
+export function getRateLimiter(preview: boolean){
+  if(preview){
+    return previewRateLimiter;
+  }else{
+    return serverRateLimiter;
+  }
+}
--- a/apps/api/src/services/redis.ts
+++ b/apps/api/src/services/redis.ts
@ -0,0 +1,38 @@
+import Redis from 'ioredis';
+
+// Initialize Redis client
+const redis = new Redis(process.env.REDIS_URL);
+
+/**
+ * Set a value in Redis with an optional expiration time.
+ * @param {string} key The key under which to store the value.
+ * @param {string} value The value to store.
+ * @param {number} [expire] Optional expiration time in seconds.
+ */
+const setValue = async (key: string, value: string, expire?: number) => {
+  if (expire) {
+    await redis.set(key, value, 'EX', expire);
+  } else {
+    await redis.set(key, value);
+  }
+};
+
+/**
+ * Get a value from Redis.
+ * @param {string} key The key of the value to retrieve.
+ * @returns {Promise<string|null>} The value, if found, otherwise null.
+ */
+const getValue = async (key: string): Promise<string | null> => {
+  const value = await redis.get(key);
+  return value;
+};
+
+/**
+ * Delete a key from Redis.
+ * @param {string} key The key to delete.
+ */
+const deleteKey = async (key: string) => {
+  await redis.del(key);
+};
+
+export { setValue, getValue, deleteKey };
--- a/apps/api/src/services/supabase.ts
+++ b/apps/api/src/services/supabase.ts
@ -0,0 +1,6 @@
+import { createClient } from "@supabase/supabase-js";
+
+export const supabase_service = createClient<any>(
+  process.env.SUPABASE_URL,
+  process.env.SUPABASE_SERVICE_TOKEN,
+);
--- a/apps/api/src/services/webhook.ts
+++ b/apps/api/src/services/webhook.ts
@ -0,0 +1,41 @@
+import { supabase_service } from "./supabase";
+
+export const callWebhook = async (teamId: string, data: any) => {
+  const { data: webhooksData, error } = await supabase_service
+    .from('webhooks')
+    .select('url')
+    .eq('team_id', teamId)
+    .limit(1);
+
+  if (error) {
+    console.error(`Error fetching webhook URL for team ID: ${teamId}`, error.message);
+    return null;
+  }
+
+  if (!webhooksData || webhooksData.length === 0) {
+    return null;
+  }
+
+  let dataToSend = [];
+  if (data.result?.links && data.result.links.length !== 0) {
+    for (let i = 0; i < data.result.links.length; i++) {
+      dataToSend.push({
+        content: data.result.links[i].content.content,
+        markdown: data.result.links[i].content.markdown,
+        metadata: data.result.links[i].content.metadata,
+      });
+    }
+  }
+
+  await fetch(webhooksData[0].url, {
+    method: 'POST',
+    headers: {
+      'Content-Type': 'application/json',
+    },
+    body: JSON.stringify({
+      success: data.success,
+      data: dataToSend,
+      error: data.error || undefined,
+    }),
+  });
+}
--- a/apps/api/src/strings.ts
+++ b/apps/api/src/strings.ts
@ -0,0 +1,2 @@
+export const errorNoResults =
+  "No results found, please check the URL or contact us at help@mendable.ai to file a ticket.";
--- a/apps/api/src/supabase_types.ts
+++ b/apps/api/src/supabase_types.ts
--- a/apps/api/src/types.ts
+++ b/apps/api/src/types.ts
@ -0,0 +1,26 @@
+export interface CrawlResult {
+  source: string;
+  content: string;
+  options?: {
+    summarize?: boolean;
+    summarize_max_chars?: number;
+  };
+  metadata?: any;
+  raw_context_id?: number | string;
+  permissions?: any[];
+}
+
+export interface IngestResult {
+  success: boolean;
+  error: string;
+  data: CrawlResult[];
+}
+
+export interface WebScraperOptions {
+  url: string;
+  mode: "crawl" | "single_urls" | "sitemap";
+  crawlerOptions: any;
+  team_id: string;
+}
+
+
--- a/apps/api/tsconfig.json
+++ b/apps/api/tsconfig.json
@ -0,0 +1,17 @@
+{
+  "compilerOptions": {
+    "rootDir": "./src",
+    "lib": ["es6","DOM"],
+    "target": "ES2020", // or higher
+    "module": "commonjs",
+    "esModuleInterop": true,
+    "sourceMap": true,
+    "outDir": "./dist/src",
+    "moduleResolution": "node",
+    "baseUrl": ".",
+    "paths": {
+      "*": ["node_modules/*", "src/types/*"],
+    }
+  },
+  "include": ["src/","src/**/*", "services/db/supabase.ts", "utils/utils.ts", "services/db/supabaseEmbeddings.ts", "utils/EventEmmitter.ts", "src/services/queue-service.ts"]
+}
--- a/apps/playwright-service/.DS_Store
+++ b/apps/playwright-service/.DS_Store
--- a/apps/playwright-service/.gitignore
+++ b/apps/playwright-service/.gitignore
@ -0,0 +1,152 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
--- a/apps/playwright-service/Dockerfile
+++ b/apps/playwright-service/Dockerfile
@ -0,0 +1,38 @@
+FROM python:3.11-slim
+
+ENV PYTHONUNBUFFERED=1
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PIP_DISABLE_PIP_VERSION_CHECK=1
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    gcc \
+    libstdc++6
+
+WORKDIR /app
+
+# Install Python dependencies
+COPY requirements.txt ./
+
+# Remove py which is pulled in by retry, py is not needed and is a CVE
+RUN pip install --no-cache-dir --upgrade -r requirements.txt && \
+    pip uninstall -y py && \
+    playwright install chromium && playwright install-deps chromium && \
+    ln -s /usr/local/bin/supervisord /usr/bin/supervisord
+
+# Cleanup for CVEs and size reduction
+# https://github.com/tornadoweb/tornado/issues/3107
+# xserver-common and xvfb included by playwright installation but not needed after
+# perl-base is part of the base Python Debian image but not needed for Danswer functionality
+# perl-base could only be removed with --allow-remove-essential
+
+
+
+
+
+COPY . ./
+
+EXPOSE $PORT
+# Run the FastAPI app with Hypercorn
+CMD hypercorn main:app --bind [::]:$PORT
+# CMD ["hypercorn", "main:app", "--bind", "[::]:$PORT"]
+# CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port $PORT"]
--- a/apps/playwright-service/README.md
+++ b/apps/playwright-service/README.md
--- a/apps/playwright-service/main.py
+++ b/apps/playwright-service/main.py
@ -0,0 +1,28 @@
+from fastapi import FastAPI, Response
+from playwright.async_api import async_playwright
+import os
+from fastapi.responses import JSONResponse
+from pydantic import BaseModel
+
+app = FastAPI()
+
+
+class UrlModel(BaseModel):
+    url: str
+
+@app.post("/html")  # Kept as POST to accept body parameters
+async def root(body: UrlModel):  # Using Pydantic model for request body
+    async with async_playwright() as p:
+        browser = await p.chromium.launch()
+
+        context = await browser.new_context()
+        page = await context.new_page()
+
+        await page.goto(body.url)  # Adjusted to use the url from the request body model
+        page_content = await page.content()  # Get the HTML content of the page
+
+        await browser.close()
+
+        json_compatible_item_data = {"content": page_content}
+        return JSONResponse(content=json_compatible_item_data)
+    
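A client for the `/html` endpoint above only needs to POST a JSON body with a `url` field. The following is a minimal sketch of assembling such a request with the standard library; the helper name `build_html_request` and the base address `http://localhost:3000` are assumptions for illustration, since the real host and port depend on how the service is deployed.

```python
import json
import urllib.request


def build_html_request(service_base, target_url):
    """Build (but do not send) a POST request for the /html endpoint.

    service_base is a hypothetical address for the playwright service.
    """
    body = json.dumps({'url': target_url}).encode('utf-8')
    return urllib.request.Request(
        f'{service_base}/html',
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


# Assumed local address; adjust to wherever the service is actually bound.
req = build_html_request('http://localhost:3000', 'https://example.com')
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; the rendered HTML would then be in the `content` field of the JSON response, as returned by the handler above.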
--- a/apps/playwright-service/requests.http
+++ b/apps/playwright-service/requests.http
--- a/apps/playwright-service/requirements.txt
+++ b/apps/playwright-service/requirements.txt
@ -0,0 +1,4 @@
+hypercorn==0.16.0
+fastapi==0.110.0
+playwright==1.42.0
+uvicorn
--- a/apps/playwright-service/runtime.txt
+++ b/apps/playwright-service/runtime.txt
@ -0,0 +1 @@
+3.11
--- a/apps/python-sdk/README.md
+++ b/apps/python-sdk/README.md
@ -0,0 +1,91 @@
+# Firecrawl Python SDK
+
+The Firecrawl Python SDK is a library that allows you to easily scrape and crawl websites, and output the data in a format ready for use with large language models (LLMs). It provides a simple and intuitive interface for interacting with the Firecrawl API.
+
+## Installation
+
+To install the Firecrawl Python SDK, you can use pip:
+
+```bash
+pip install firecrawl-py
+```
+
+## Usage
+
+1. Get an API key from [firecrawl.dev](https://firecrawl.dev)
+2. Set the API key as an environment variable named `FIRECRAWL_API_KEY` or pass it as a parameter to the `FirecrawlApp` class.
+
+
+Here's an example of how to use the SDK:
+
+```python
+from firecrawl import FirecrawlApp
+
+# Initialize the FirecrawlApp with your API key
+app = FirecrawlApp(api_key='your_api_key')
+
+# Scrape a single URL
+url = 'https://mendable.ai'
+scraped_data = app.scrape_url(url)
+
+# Crawl a website
+crawl_url = 'https://mendable.ai'
+crawl_params = {
+    'crawlerOptions': {
+        'excludes': ['blog/*'],
+        'includes': [], # leave empty for all pages
+        'limit': 1000,
+    }
+}
+crawl_result = app.crawl_url(crawl_url, params=crawl_params)
+```
+
+### Scraping a URL
+
+To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.
+
+```python
+url = 'https://example.com'
+scraped_data = app.scrape_url(url)
+```
+
+### Crawling a Website
+
+To crawl a website, use the `crawl_url` method. It takes the starting URL and optional parameters as arguments. The `params` argument lets you specify additional options for the crawl job; for example, `crawlerOptions` can hold URL patterns to include or exclude and a limit on the number of pages to crawl.
+
+The `wait_until_done` parameter (default `True`) determines whether the method waits for the crawl job to complete before returning. If set to `True`, the method checks the status of the crawl job every `timeout` seconds (default 2, with a minimum of 2) until the job completes. Despite its name, `timeout` is the interval between status checks, not an overall deadline. If set to `False`, the method returns immediately with the job ID, and you can check the status of the crawl job yourself using the `check_crawl_status` method.
+
+```python
+crawl_url = 'https://example.com'
+crawl_params = {
+    'crawlerOptions': {
+        'excludes': ['blog/*'],
+        'includes': [], # leave empty for all pages
+        'limit': 1000,
+    }
+}
+crawl_result = app.crawl_url(crawl_url, params=crawl_params, wait_until_done=True, timeout=5)
+```
+
+If `wait_until_done` is set to `True`, the `crawl_url` method will return the crawl result once the job is completed. If the job fails or is stopped, an exception will be raised.
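The waiting behaviour described above amounts to a polling loop. The following is a minimal sketch of that idea, not the SDK's own code: `wait_for_job`, `fetch_status`, and `poll_interval` are hypothetical names, and `fetch_status` stands in for a call such as `app.check_crawl_status`. The status values are the ones this version of the SDK treats as in progress.

```python
import time


def wait_for_job(fetch_status, job_id, poll_interval=2.0, sleep=time.sleep):
    """Poll fetch_status(job_id) until the job completes, then return its data.

    fetch_status is a hypothetical stand-in for app.check_crawl_status.
    """
    in_progress = {'active', 'paused', 'pending', 'queued'}
    while True:
        status = fetch_status(job_id)
        state = status.get('status')
        if state == 'completed':
            if 'data' not in status:
                raise RuntimeError('Crawl job completed but no data was returned')
            return status['data']
        if state not in in_progress:
            raise RuntimeError(f'Crawl job failed or was stopped. Status: {state}')
        # Never poll more often than every 2 seconds
        sleep(max(poll_interval, 2))


# Simulated status responses so the sketch runs without network access
_responses = iter([
    {'status': 'active'},
    {'status': 'completed', 'data': [{'markdown': '# Hello'}]},
])
pages = wait_for_job(lambda job_id: next(_responses), 'job-123', sleep=lambda s: None)
print(pages[0]['markdown'])
```

Injecting `sleep` keeps the loop testable; in real use the default `time.sleep` applies.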
+
+### Checking Crawl Status
+
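+To check the status of a crawl job, use the `check_crawl_status` method. It takes the job ID as a parameter and returns the current status of the crawl job. The job ID is available as `jobId` in the dictionary that `crawl_url` returns when called with `wait_until_done=False`.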
+
+```python
+job_id = crawl_result['jobId']
+status = app.check_crawl_status(job_id)
+```
+
+## Error Handling
+
+The SDK checks the responses returned by the Firecrawl API and raises an `Exception` when a request fails. For status codes 402, 409, and 500, the exception message includes the error message returned by the API; for other unexpected status codes, it includes only the status code.
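Because the errors surface as plain exceptions, a caller that wants to branch on the HTTP status has to read it from the message. The sketch below assumes the message format used in this version of the SDK source (`Status code: <n>`), which may change; `status_code_from_error` and `fake_scrape` are hypothetical helpers, with `fake_scrape` standing in for `app.scrape_url`.

```python
import re

# Matches the "Status code: 402" fragment in SDK-style error messages
STATUS_RE = re.compile(r'Status code: (\d+)')


def status_code_from_error(exc):
    """Return the HTTP status code embedded in an error message, or None."""
    match = STATUS_RE.search(str(exc))
    return int(match.group(1)) if match else None


def fake_scrape(url):
    """Hypothetical stand-in for app.scrape_url that always fails."""
    raise Exception('Failed to scrape URL. Status code: 402. Error: Payment required')


try:
    fake_scrape('https://example.com')
except Exception as exc:
    code = status_code_from_error(exc)
    print(f'Scrape failed with status {code}')
```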
+
+## Contributing
+
+Contributions to the Firecrawl Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
+
+## License
+
+The Firecrawl Python SDK is open-source and released under the [MIT License](https://opensource.org/licenses/MIT).
--- a/apps/python-sdk/build/lib/firecrawl/init.py
+++ b/apps/python-sdk/build/lib/firecrawl/init.py
@ -0,0 +1 @@
+from .firecrawl import FirecrawlApp
--- a/apps/python-sdk/build/lib/firecrawl/firecrawl.py
+++ b/apps/python-sdk/build/lib/firecrawl/firecrawl.py
@ -0,0 +1,96 @@
+import os
+import requests
+
+class FirecrawlApp:
+    def __init__(self, api_key=None):
+        self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
+        if self.api_key is None:
+            raise ValueError('No API key provided')
+    
+    def scrape_url(self, url, params=None):
+        headers = {
+            'Content-Type': 'application/json',
+            'Authorization': f'Bearer {self.api_key}'
+        }
+        json_data = {'url': url}
+        if params:
+            json_data.update(params)
+        response = requests.post(
+            'https://api.firecrawl.dev/v0/scrape',
+            headers=headers,
+            json=json_data
+        )
+        if response.status_code == 200:
+            response = response.json()
+            if response['success'] == True:
+                return response['data']
+            else:
+                raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
+            
+        elif response.status_code in [402, 409, 500]:
+            error_message = response.json().get('error', 'Unknown error occurred')
+            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
+        else:
+            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
+
+    def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
+        headers = self._prepare_headers()
+        json_data = {'url': url}
+        if params:
+            json_data.update(params)
+        response = self._post_request('https://api.firecrawl.dev/v0/crawl', json_data, headers)
+        if response.status_code == 200:
+            job_id = response.json().get('jobId')
+            if wait_until_done:
+                return self._monitor_job_status(job_id, headers, timeout)
+            else:
+                return {'jobId': job_id}
+        else:
+            self._handle_error(response, 'start crawl job')
+
+    def check_crawl_status(self, job_id):
+        headers = self._prepare_headers()
+        response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
+        if response.status_code == 200:
+            return response.json()
+        else:
+            self._handle_error(response, 'check crawl status')
+
+    def _prepare_headers(self):
+        return {
+            'Content-Type': 'application/json',
+            'Authorization': f'Bearer {self.api_key}'
+        }
+
+    def _post_request(self, url, data, headers):
+        return requests.post(url, headers=headers, json=data)
+
+    def _get_request(self, url, headers):
+        return requests.get(url, headers=headers)
+
+    def _monitor_job_status(self, job_id, headers, timeout):
+        import time
+        while True:
+            status_response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
+            if status_response.status_code == 200:
+                status_data = status_response.json()
+                if status_data['status'] == 'completed':
+                    if 'data' in status_data:
+                        return status_data['data']
+                    else:
+                        raise Exception('Crawl job completed but no data was returned')
+                elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
+                    if timeout < 2:
+                        timeout = 2
+                    time.sleep(timeout)  # Wait for the specified timeout before checking again
+                else:
+                    raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
+            else:
+                self._handle_error(status_response, 'check crawl status')
+
+    def _handle_error(self, response, action):
+        if response.status_code in [402, 409, 500]:
+            error_message = response.json().get('error', 'Unknown error occurred')
+            raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
+        else:
+            raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')
--- a/apps/python-sdk/dist/firecrawl-py-0.0.5.tar.gz
+++ b/apps/python-sdk/dist/firecrawl-py-0.0.5.tar.gz
--- a/apps/python-sdk/dist/firecrawl_py-0.0.5-py3-none-any.whl
+++ b/apps/python-sdk/dist/firecrawl_py-0.0.5-py3-none-any.whl
--- a/apps/python-sdk/example.py
+++ b/apps/python-sdk/example.py
@ -0,0 +1,13 @@
+from firecrawl import FirecrawlApp
+
+
+app = FirecrawlApp(api_key="a6a2d63a-ed2b-46a9-946d-2a7207efed4d")
+
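+# Blocking crawl (default): returns the list of crawled pages once the job completes
+crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})
+print(crawl_result[0]['markdown'])
+# Non-blocking crawl: returns {'jobId': ...} so the status can be checked separately
+job_id = app.crawl_url('mendable.ai', wait_until_done=False)['jobId']
+print(job_id)
+status = app.check_crawl_status(job_id)
+print(status)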
--- a/apps/python-sdk/firecrawl/init.py
+++ b/apps/python-sdk/firecrawl/init.py
@ -0,0 +1 @@
+from .firecrawl import FirecrawlApp
--- a/apps/python-sdk/firecrawl/pycache/init.cpython-311.pyc
+++ b/apps/python-sdk/firecrawl/pycache/init.cpython-311.pyc
--- a/apps/python-sdk/firecrawl/pycache/firecrawl.cpython-311.pyc
+++ b/apps/python-sdk/firecrawl/pycache/firecrawl.cpython-311.pyc
--- a/apps/python-sdk/firecrawl/firecrawl.py
+++ b/apps/python-sdk/firecrawl/firecrawl.py
@ -0,0 +1,96 @@
+import os
+import requests
+
+class FirecrawlApp:
+    def __init__(self, api_key=None):
+        self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
+        if self.api_key is None:
+            raise ValueError('No API key provided')
+    
+    def scrape_url(self, url, params=None):
+        headers = {
+            'Content-Type': 'application/json',
+            'Authorization': f'Bearer {self.api_key}'
+        }
+        json_data = {'url': url}
+        if params:
+            json_data.update(params)
+        response = requests.post(
+            'https://api.firecrawl.dev/v0/scrape',
+            headers=headers,
+            json=json_data
+        )
+        if response.status_code == 200:
+            response = response.json()
+            if response['success'] == True:
+                return response['data']
+            else:
+                raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
+            
+        elif response.status_code in [402, 409, 500]:
+            error_message = response.json().get('error', 'Unknown error occurred')
+            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
+        else:
+            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
+
+    def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
+        headers = self._prepare_headers()
+        json_data = {'url': url}
+        if params:
+            json_data.update(params)
+        response = self._post_request('https://api.firecrawl.dev/v0/crawl', json_data, headers)
+        if response.status_code == 200:
+            job_id = response.json().get('jobId')
+            if wait_until_done:
+                return self._monitor_job_status(job_id, headers, timeout)
+            else:
+                return {'jobId': job_id}
+        else:
+            self._handle_error(response, 'start crawl job')
+
+    def check_crawl_status(self, job_id):
+        headers = self._prepare_headers()
+        response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
+        if response.status_code == 200:
+            return response.json()
+        else:
+            self._handle_error(response, 'check crawl status')
+
+    def _prepare_headers(self):
+        return {
+            'Content-Type': 'application/json',
+            'Authorization': f'Bearer {self.api_key}'
+        }
+
+    def _post_request(self, url, data, headers):
+        return requests.post(url, headers=headers, json=data)
+
+    def _get_request(self, url, headers):
+        return requests.get(url, headers=headers)
+
+    def _monitor_job_status(self, job_id, headers, timeout):
+        import time
+        while True:
+            status_response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
+            if status_response.status_code == 200:
+                status_data = status_response.json()
+                if status_data['status'] == 'completed':
+                    if 'data' in status_data:
+                        return status_data['data']
+                    else:
+                        raise Exception('Crawl job completed but no data was returned')
+                elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
+                    if timeout < 2:
+                        timeout = 2
+                    time.sleep(timeout)  # Wait for the specified timeout before checking again
+                else:
+                    raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
+            else:
+                self._handle_error(status_response, 'check crawl status')
+
+    def _handle_error(self, response, action):
+        if response.status_code in [402, 409, 500]:
+            error_message = response.json().get('error', 'Unknown error occurred')
+            raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
+        else:
+            raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')
--- a/apps/python-sdk/firecrawl_py.egg-info/PKG-INFO
+++ b/apps/python-sdk/firecrawl_py.egg-info/PKG-INFO
@ -0,0 +1,7 @@
+Metadata-Version: 2.1
+Name: firecrawl-py
+Version: 0.0.5
+Summary: Python SDK for Firecrawl API
+Home-page: https://github.com/mendableai/firecrawl-py
+Author: Mendable.ai
+Author-email: nick@mendable.ai
--- a/apps/python-sdk/firecrawl_py.egg-info/SOURCES.txt
+++ b/apps/python-sdk/firecrawl_py.egg-info/SOURCES.txt
@ -0,0 +1,9 @@
+README.md
+setup.py
+firecrawl/__init__.py
+firecrawl/firecrawl.py
+firecrawl_py.egg-info/PKG-INFO
+firecrawl_py.egg-info/SOURCES.txt
+firecrawl_py.egg-info/dependency_links.txt
+firecrawl_py.egg-info/requires.txt
+firecrawl_py.egg-info/top_level.txt
--- a/apps/python-sdk/firecrawl_py.egg-info/dependency_links.txt
+++ b/apps/python-sdk/firecrawl_py.egg-info/dependency_links.txt
@ -0,0 +1 @@
+
--- a/apps/python-sdk/firecrawl_py.egg-info/requires.txt
+++ b/apps/python-sdk/firecrawl_py.egg-info/requires.txt
@ -0,0 +1 @@
+requests
--- a/apps/python-sdk/firecrawl_py.egg-info/top_level.txt
+++ b/apps/python-sdk/firecrawl_py.egg-info/top_level.txt
@ -0,0 +1 @@
+firecrawl
--- a/apps/python-sdk/setup.py
+++ b/apps/python-sdk/setup.py
@ -0,0 +1,14 @@
+from setuptools import setup, find_packages
+
+setup(
+    name='firecrawl-py',
+    version='0.0.5',
+    url='https://github.com/mendableai/firecrawl-py',
+    author='Mendable.ai',
+    author_email='nick@mendable.ai',
+    description='Python SDK for Firecrawl API',
+    packages=find_packages(),    
+    install_requires=[
+        'requests',
+    ],
+)
--- a/apps/www/README.md
+++ b/apps/www/README.md
@ -0,0 +1 @@
+Coming soon!