Merge remote-tracking branch 'origin/main' into test/load-testing
commit ad7795f973
35
.github/ISSUE_TEMPLATE/bug_report.md
vendored
Normal file
@@ -0,0 +1,35 @@
+---
+name: Bug report
+about: Create a report to help us improve
+title: "[BUG]"
+labels: bug
+assignees: ''
+
+---
+
+**Describe the Bug**
+Provide a clear and concise description of what the bug is.
+
+**To Reproduce**
+Steps to reproduce the issue:
+1. Configure the environment or settings with '...'
+2. Run the command '...'
+3. Observe the error or unexpected output at '...'
+4. Log output/error message
+
+**Expected Behavior**
+A clear and concise description of what you expected to happen.
+
+**Screenshots**
+If applicable, add screenshots or copies of the command line output to help explain the issue.
+
+**Environment (please complete the following information):**
+- OS: [e.g. macOS, Linux, Windows]
+- Firecrawl Version: [e.g. 1.2.3]
+- Node.js Version: [e.g. 14.x]
+
+**Logs**
+If applicable, include detailed logs to help understand the problem.
+
+**Additional Context**
+Add any other context about the problem here, such as configuration specifics, network conditions, data volumes, etc.
26
.github/ISSUE_TEMPLATE/feature_request.md
vendored
Normal file
@@ -0,0 +1,26 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+title: "[Feat]"
+labels: ''
+assignees: ''
+
+---
+
+**Problem Description**
+Describe the issue you're experiencing that has prompted this feature request. For example, "I find it difficult when..."
+
+**Proposed Feature**
+Provide a clear and concise description of the feature you would like implemented.
+
+**Alternatives Considered**
+Discuss any alternative solutions or features you've considered. Why were these alternatives not suitable?
+
+**Implementation Suggestions**
+If you have ideas on how the feature could be implemented, share them here. This could include technical details, API changes, or interaction mechanisms.
+
+**Use Case**
+Explain how this feature would be used and what benefits it would bring. Include specific examples to illustrate how this would improve functionality or user experience.
+
+**Additional Context**
+Add any other context such as comparisons with similar features in other products, or links to prototypes or mockups.
58
.github/archive/js-sdk.yml
vendored
Normal file
@@ -0,0 +1,58 @@
+name: Run JavaScript SDK E2E Tests
+
+on: []
+
+env:
+  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+  BULL_AUTH_KEY: ${{ secrets.BULL_AUTH_KEY }}
+  FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
+  HOST: ${{ secrets.HOST }}
+  LLAMAPARSE_API_KEY: ${{ secrets.LLAMAPARSE_API_KEY }}
+  LOGTAIL_KEY: ${{ secrets.LOGTAIL_KEY }}
+  POSTHOG_API_KEY: ${{ secrets.POSTHOG_API_KEY }}
+  POSTHOG_HOST: ${{ secrets.POSTHOG_HOST }}
+  NUM_WORKERS_PER_QUEUE: ${{ secrets.NUM_WORKERS_PER_QUEUE }}
+  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+  PLAYWRIGHT_MICROSERVICE_URL: ${{ secrets.PLAYWRIGHT_MICROSERVICE_URL }}
+  PORT: ${{ secrets.PORT }}
+  REDIS_URL: ${{ secrets.REDIS_URL }}
+  SCRAPING_BEE_API_KEY: ${{ secrets.SCRAPING_BEE_API_KEY }}
+  SUPABASE_ANON_TOKEN: ${{ secrets.SUPABASE_ANON_TOKEN }}
+  SUPABASE_SERVICE_TOKEN: ${{ secrets.SUPABASE_SERVICE_TOKEN }}
+  SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
+  TEST_API_KEY: ${{ secrets.TEST_API_KEY }}
+  HYPERDX_API_KEY: ${{ secrets.HYPERDX_API_KEY }}
+  HDX_NODE_BETA_MODE: 1
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    services:
+      redis:
+        image: redis
+        ports:
+          - 6379:6379
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Node.js
+        uses: actions/setup-node@v3
+        with:
+          node-version: "20"
+      - name: Install pnpm
+        run: npm install -g pnpm
+      - name: Install dependencies for API
+        run: pnpm install
+        working-directory: ./apps/api
+      - name: Start the application
+        run: npm start &
+        working-directory: ./apps/api
+      - name: Start workers
+        run: npm run workers &
+        working-directory: ./apps/api
+      - name: Install dependencies for JavaScript SDK
+        run: pnpm install
+        working-directory: ./apps/js-sdk/firecrawl
+      - name: Run E2E tests for JavaScript SDK
+        run: npm run test
+        working-directory: ./apps/js-sdk/firecrawl
46
.github/archive/publish-js-sdk.yml
vendored
Normal file
@@ -0,0 +1,46 @@
+name: Publish JavaScript SDK
+
+on: []
+
+env:
+  NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
+
+jobs:
+  build-and-publish:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Node.js
+        uses: actions/setup-node@v3
+        with:
+          node-version: '20'
+          registry-url: 'https://registry.npmjs.org/'
+          scope: '@mendable'
+          always-auth: true
+
+      - name: Install pnpm
+        run: npm install -g pnpm
+
+      - name: Install python for running version check script
+        run: |
+          python -m pip install --upgrade pip
+          pip install setuptools wheel requests packaging
+
+      - name: Install dependencies for JavaScript SDK
+        run: pnpm install
+        working-directory: ./apps/js-sdk/firecrawl
+
+      - name: Run version check script
+        id: version_check_script
+        run: |
+          VERSION_INCREMENTED=$(python .github/scripts/check_version_has_incremented.py js ./apps/js-sdk/firecrawl @mendable/firecrawl-js)
+          echo "VERSION_INCREMENTED=$VERSION_INCREMENTED" >> $GITHUB_ENV
+
+      - name: Build and publish to npm
+        if: ${{ env.VERSION_INCREMENTED == 'true' }}
+        env:
+          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
+        run: |
+          npm run build-and-publish
+        working-directory: ./apps/js-sdk/firecrawl
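Review note on the version gate: the "Run version check script" step captures the script's stdout and appends a `NAME=value` line to the file named by `$GITHUB_ENV`; the publish step then reads it back as `env.VERSION_INCREMENTED`. The sketch below illustrates that hand-off in Python. It assumes the single-line `NAME=value` form only, uses a temporary file in place of the runner-provided file, and the helper names (`append_env`, `read_env`, `should_publish`, `demo`) are hypothetical, not part of this commit.

```python
import os
import tempfile


def append_env(env_file: str, name: str, value: str) -> None:
    """Append one NAME=value line, as `echo "NAME=$VALUE" >> $GITHUB_ENV` does."""
    with open(env_file, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


def read_env(env_file: str) -> dict:
    """Parse NAME=value lines back into a dict (single-line values only)."""
    values = {}
    with open(env_file, encoding="utf-8") as fh:
        for line in fh:
            name, sep, value = line.rstrip("\n").partition("=")
            if sep:
                values[name] = value
    return values


def should_publish(env: dict) -> bool:
    """Approximate `env.VERSION_INCREMENTED == 'true'`.

    GitHub's expression syntax ignores case when comparing strings,
    so the value is lowercased before the comparison.
    """
    return env.get("VERSION_INCREMENTED", "").lower() == "true"


def demo() -> bool:
    # A temporary file stands in for the real $GITHUB_ENV file (assumption).
    with tempfile.TemporaryDirectory() as tmp:
        env_file = os.path.join(tmp, "github_env")
        append_env(env_file, "VERSION_INCREMENTED", "true")
        return should_publish(read_env(env_file))
```

Because the whole stdout of the script becomes the value, any extra line printed by the script would end up in the env file as well, which is presumably why the debug prints in `check_version_has_incremented.py` are commented out.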
47
.github/archive/publish-python-sdk.yml
vendored
Normal file
@@ -0,0 +1,47 @@
+name: Publish Python SDK
+
+on: []
+
+env:
+  PYPI_USERNAME: ${{ secrets.PYPI_USERNAME }}
+  PYPI_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
+
+jobs:
+  build-and-publish:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v3
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.x'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install setuptools wheel twine build requests packaging
+
+      - name: Run version check script
+        id: version_check_script
+        run: |
+          VERSION_INCREMENTED=$(python .github/scripts/check_version_has_incremented.py python ./apps/python-sdk/firecrawl firecrawl-py)
+          echo "VERSION_INCREMENTED=$VERSION_INCREMENTED" >> $GITHUB_ENV
+
+      - name: Build the package
+        if: ${{ env.VERSION_INCREMENTED == 'true' }}
+        run: |
+          python -m build
+        working-directory: ./apps/python-sdk
+
+      - name: Publish to PyPI
+        if: ${{ env.VERSION_INCREMENTED == 'true' }}
+        env:
+          TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
+          TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
+        run: |
+          twine upload dist/*
+        working-directory: ./apps/python-sdk
+
70
.github/archive/python-sdk.yml
vendored
Normal file
@@ -0,0 +1,70 @@
+name: Run Python SDK E2E Tests
+
+on: []
+
+env:
+  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+  BULL_AUTH_KEY: ${{ secrets.BULL_AUTH_KEY }}
+  FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
+  HOST: ${{ secrets.HOST }}
+  LLAMAPARSE_API_KEY: ${{ secrets.LLAMAPARSE_API_KEY }}
+  LOGTAIL_KEY: ${{ secrets.LOGTAIL_KEY }}
+  POSTHOG_API_KEY: ${{ secrets.POSTHOG_API_KEY }}
+  POSTHOG_HOST: ${{ secrets.POSTHOG_HOST }}
+  NUM_WORKERS_PER_QUEUE: ${{ secrets.NUM_WORKERS_PER_QUEUE }}
+  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+  PLAYWRIGHT_MICROSERVICE_URL: ${{ secrets.PLAYWRIGHT_MICROSERVICE_URL }}
+  PORT: ${{ secrets.PORT }}
+  REDIS_URL: ${{ secrets.REDIS_URL }}
+  SCRAPING_BEE_API_KEY: ${{ secrets.SCRAPING_BEE_API_KEY }}
+  SUPABASE_ANON_TOKEN: ${{ secrets.SUPABASE_ANON_TOKEN }}
+  SUPABASE_SERVICE_TOKEN: ${{ secrets.SUPABASE_SERVICE_TOKEN }}
+  SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
+  TEST_API_KEY: ${{ secrets.TEST_API_KEY }}
+  HYPERDX_API_KEY: ${{ secrets.HYPERDX_API_KEY }}
+  HDX_NODE_BETA_MODE: 1
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10"]
+    services:
+      redis:
+        image: redis
+        ports:
+          - 6379:6379
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Node.js
+        uses: actions/setup-node@v3
+        with:
+          node-version: "20"
+      - name: Install pnpm
+        run: npm install -g pnpm
+      - name: Install dependencies for API
+        run: pnpm install
+        working-directory: ./apps/api
+      - name: Start the application
+        run: npm start &
+        working-directory: ./apps/api
+        id: start_app
+      - name: Start workers
+        run: npm run workers &
+        working-directory: ./apps/api
+        id: start_workers
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install Python dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt
+        working-directory: ./apps/python-sdk
+      - name: Run E2E tests for Python SDK
+        run: |
+          pytest firecrawl/__tests__/e2e_withAuth/test.py
+        working-directory: ./apps/python-sdk
88
.github/scripts/check_version_has_incremented.py
vendored
Normal file
@@ -0,0 +1,88 @@
+"""
+checks local versions against published versions.
+
+# Usage:
+
+python .github/scripts/check_version_has_incremented.py js ./apps/js-sdk/firecrawl @mendable/firecrawl-js
+Local version: 0.0.22
+Published version: 0.0.21
+true
+
+python .github/scripts/check_version_has_incremented.py python ./apps/python-sdk/firecrawl firecrawl-py
+Local version: 0.0.11
+Published version: 0.0.11
+false
+
+"""
+import json
+import os
+import re
+import sys
+from pathlib import Path
+
+import requests
+from packaging.version import Version
+from packaging.version import parse as parse_version
+
+
+def get_python_version(file_path: str) -> str:
+    """Extract version string from Python file."""
+    version_file = Path(file_path).read_text()
+    version_match = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]", version_file, re.M)
+    if version_match:
+        return version_match.group(1).strip()
+    raise RuntimeError("Unable to find version string.")
+
+def get_pypi_version(package_name: str) -> str:
+    """Get latest version of Python package from PyPI."""
+    response = requests.get(f"https://pypi.org/pypi/{package_name}/json")
+    version = response.json()['info']['version']
+    return version.strip()
+
+def get_js_version(file_path: str) -> str:
+    """Extract version string from package.json."""
+    with open(file_path, 'r') as file:
+        package_json = json.load(file)
+    if 'version' in package_json:
+        return package_json['version'].strip()
+    raise RuntimeError("Unable to find version string in package.json.")
+
+def get_npm_version(package_name: str) -> str:
+    """Get latest version of JavaScript package from npm."""
+    response = requests.get(f"https://registry.npmjs.org/{package_name}/latest")
+    version = response.json()['version']
+    return version.strip()
+
+def is_version_incremented(local_version: str, published_version: str) -> bool:
+    """Compare local and published versions."""
+    local_version_parsed: Version = parse_version(local_version)
+    published_version_parsed: Version = parse_version(published_version)
+    return local_version_parsed > published_version_parsed
+
+if __name__ == "__main__":
+    package_type = sys.argv[1]
+    package_path = sys.argv[2]
+    package_name = sys.argv[3]
+
+    if package_type == "python":
+        # Get current version from __init__.py
+        current_version = get_python_version(os.path.join(package_path, '__init__.py'))
+        # Get published version from PyPI
+        published_version = get_pypi_version(package_name)
+    elif package_type == "js":
+        # Get current version from package.json
+        current_version = get_js_version(os.path.join(package_path, 'package.json'))
+        # Get published version from npm
+        published_version = get_npm_version(package_name)
+    else:
+        raise ValueError("Invalid package type. Use 'python' or 'js'.")
+
+    # Print versions for debugging
+    # print(f"Local version: {current_version}")
+    # print(f"Published version: {published_version}")
+
+    # Compare versions and print result
+    if is_version_incremented(current_version, published_version):
+        print("true")
+    else:
+        print("false")
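Review note on the comparison: `is_version_incremented` relies on `packaging.version.parse`, which orders versions numerically rather than as strings. A stdlib-only sketch of the same idea follows. It assumes plain dotted numeric versions such as `0.0.22` and does not cover pre-release or build suffixes, nor the padding rule that lets `packaging` treat `1.0` and `1.0.0` as equal. The names `version_tuple` and `is_newer` are hypothetical.

```python
def version_tuple(version: str) -> tuple:
    """Turn '0.0.22' into (0, 0, 22); raises ValueError on non-numeric parts."""
    return tuple(int(part) for part in version.strip().split("."))


def is_newer(local: str, published: str) -> bool:
    """True only when the local version is strictly greater than the published one."""
    return version_tuple(local) > version_tuple(published)


if __name__ == "__main__":
    # The two cases from the script's docstring.
    print(is_newer("0.0.22", "0.0.21"))
    print(is_newer("0.0.11", "0.0.11"))
    # Numeric versus plain string comparison for a two-digit component.
    print(is_newer("0.0.10", "0.0.9"), "0.0.10" > "0.0.9")
```

The last line shows why a plain string comparison would not be enough: as strings, `"0.0.10"` sorts before `"0.0.9"`, so a release from `0.0.9` to `0.0.10` would never be published.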
2
.github/scripts/requirements.txt
vendored
Normal file
@@ -0,0 +1,2 @@
+requests
+packaging
20
.github/workflows/clean-before-24h-complete-jobs.yml
vendored
Normal file
@@ -0,0 +1,20 @@
+name: Clean Before 24h Completed Jobs
+on:
+  schedule:
+    - cron: '0 0 * * *'
+
+env:
+  BULL_AUTH_KEY: ${{ secrets.BULL_AUTH_KEY }}
+
+jobs:
+  clean-jobs:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Send GET request to clean jobs
+        run: |
+          response=$(curl --write-out '%{http_code}' --silent --output /dev/null https://api.firecrawl.dev/admin/${{ secrets.BULL_AUTH_KEY }}/clean-before-24h-complete-jobs)
+          if [ "$response" -ne 200 ]; then
+            echo "Failed to clean jobs. Response: $response"
+            exit 1
+          fi
+          echo "Successfully cleaned jobs. Response: $response"
37
.github/workflows/fly-direct.yml
vendored
Normal file
@@ -0,0 +1,37 @@
+name: Fly Deploy Direct
+on:
+  schedule:
+    - cron: '0 * * * *'
+
+env:
+  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+  BULL_AUTH_KEY: ${{ secrets.BULL_AUTH_KEY }}
+  FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
+  HOST: ${{ secrets.HOST }}
+  LLAMAPARSE_API_KEY: ${{ secrets.LLAMAPARSE_API_KEY }}
+  LOGTAIL_KEY: ${{ secrets.LOGTAIL_KEY }}
+  POSTHOG_API_KEY: ${{ secrets.POSTHOG_API_KEY }}
+  POSTHOG_HOST: ${{ secrets.POSTHOG_HOST }}
+  NUM_WORKERS_PER_QUEUE: ${{ secrets.NUM_WORKERS_PER_QUEUE }}
+  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+  PLAYWRIGHT_MICROSERVICE_URL: ${{ secrets.PLAYWRIGHT_MICROSERVICE_URL }}
+  PORT: ${{ secrets.PORT }}
+  REDIS_URL: ${{ secrets.REDIS_URL }}
+  SCRAPING_BEE_API_KEY: ${{ secrets.SCRAPING_BEE_API_KEY }}
+  SUPABASE_ANON_TOKEN: ${{ secrets.SUPABASE_ANON_TOKEN }}
+  SUPABASE_SERVICE_TOKEN: ${{ secrets.SUPABASE_SERVICE_TOKEN }}
+  SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
+  TEST_API_KEY: ${{ secrets.TEST_API_KEY }}
+
+jobs:
+  deploy:
+    name: Deploy app
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Change directory
+        run: cd apps/api
+      - uses: superfly/flyctl-actions/setup-flyctl@master
+      - run: flyctl deploy ./apps/api --remote-only -a firecrawl-scraper-js
+        env:
+          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
166
.github/workflows/fly.yml
vendored
@@ -3,8 +3,6 @@ on:
   push:
     branches:
       - main
-  # schedule:
-  # - cron: '0 */4 * * *'
 
 env:
   ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
@@ -25,9 +23,12 @@ env:
   SUPABASE_SERVICE_TOKEN: ${{ secrets.SUPABASE_SERVICE_TOKEN }}
   SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
   TEST_API_KEY: ${{ secrets.TEST_API_KEY }}
+  PYPI_USERNAME: ${{ secrets.PYPI_USERNAME }}
+  PYPI_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
+  NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
 
 jobs:
-  pre-deploy:
+  pre-deploy-e2e-tests:
     name: Pre-deploy checks
     runs-on: ubuntu-latest
     services:
@@ -61,7 +62,7 @@ jobs:
 
   pre-deploy-test-suite:
     name: Test Suite
-    needs: pre-deploy
+    needs: pre-deploy-e2e-tests
     runs-on: ubuntu-latest
     services:
       redis:
@@ -95,10 +96,83 @@ jobs:
           npm run test
         working-directory: ./apps/test-suite
 
+  python-sdk-tests:
+    name: Python SDK Tests
+    needs: pre-deploy-e2e-tests
+    runs-on: ubuntu-latest
+    services:
+      redis:
+        image: redis
+        ports:
+          - 6379:6379
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.x'
+      - name: Install pnpm
+        run: npm install -g pnpm
+      - name: Install dependencies
+        run: pnpm install
+        working-directory: ./apps/api
+      - name: Start the application
+        run: npm start &
+        working-directory: ./apps/api
+        id: start_app
+      - name: Start workers
+        run: npm run workers &
+        working-directory: ./apps/api
+        id: start_workers
+      - name: Install Python dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt
+        working-directory: ./apps/python-sdk
+      - name: Run E2E tests for Python SDK
+        run: |
+          pytest firecrawl/__tests__/e2e_withAuth/test.py
+        working-directory: ./apps/python-sdk
+
+  js-sdk-tests:
+    name: JavaScript SDK Tests
+    needs: pre-deploy-e2e-tests
+    runs-on: ubuntu-latest
+    services:
+      redis:
+        image: redis
+        ports:
+          - 6379:6379
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Node.js
+        uses: actions/setup-node@v3
+        with:
+          node-version: "20"
+      - name: Install pnpm
+        run: npm install -g pnpm
+      - name: Install dependencies
+        run: pnpm install
+        working-directory: ./apps/api
+      - name: Start the application
+        run: npm start &
+        working-directory: ./apps/api
+        id: start_app
+      - name: Start workers
+        run: npm run workers &
+        working-directory: ./apps/api
+        id: start_workers
+      - name: Install dependencies for JavaScript SDK
+        run: pnpm install
+        working-directory: ./apps/js-sdk/firecrawl
+      - name: Run E2E tests for JavaScript SDK
+        run: npm run test
+        working-directory: ./apps/js-sdk/firecrawl
+
   deploy:
     name: Deploy app
     runs-on: ubuntu-latest
-    needs: pre-deploy-test-suite
+    needs: [pre-deploy-test-suite, python-sdk-tests, js-sdk-tests]
     steps:
       - uses: actions/checkout@v3
       - name: Change directory
@@ -107,3 +181,85 @@ jobs:
       - run: flyctl deploy ./apps/api --remote-only -a firecrawl-scraper-js
         env:
           FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
+
+  build-and-publish-python-sdk:
+    name: Build and publish Python SDK
+    runs-on: ubuntu-latest
+    needs: deploy
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v3
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.x'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install setuptools wheel twine build requests packaging
+
+      - name: Run version check script
+        id: version_check_script
+        run: |
+          PYTHON_SDK_VERSION_INCREMENTED=$(python .github/scripts/check_version_has_incremented.py python ./apps/python-sdk/firecrawl firecrawl-py)
+          echo "PYTHON_SDK_VERSION_INCREMENTED=$PYTHON_SDK_VERSION_INCREMENTED" >> $GITHUB_ENV
+
+      - name: Build the package
+        if: ${{ env.PYTHON_SDK_VERSION_INCREMENTED == 'true' }}
+        run: |
+          python -m build
+        working-directory: ./apps/python-sdk
+
+      - name: Publish to PyPI
+        if: ${{ env.PYTHON_SDK_VERSION_INCREMENTED == 'true' }}
+        env:
+          TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
+          TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
+        run: |
+          twine upload dist/*
+        working-directory: ./apps/python-sdk
+
+  build-and-publish-js-sdk:
+    name: Build and publish JavaScript SDK
+    runs-on: ubuntu-latest
+    needs: deploy
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Node.js
+        uses: actions/setup-node@v3
+        with:
+          node-version: '20'
+          registry-url: 'https://registry.npmjs.org/'
+          scope: '@mendable'
+          always-auth: true
+
+      - name: Install pnpm
+        run: npm install -g pnpm
+
+      - name: Install python for running version check script
+        run: |
+          python -m pip install --upgrade pip
+          pip install setuptools wheel requests packaging
+
+      - name: Install dependencies for JavaScript SDK
+        run: pnpm install
+        working-directory: ./apps/js-sdk/firecrawl
+
+      - name: Run version check script
+        id: version_check_script
+        run: |
+          VERSION_INCREMENTED=$(python .github/scripts/check_version_has_incremented.py js ./apps/js-sdk/firecrawl @mendable/firecrawl-js)
+          echo "VERSION_INCREMENTED=$VERSION_INCREMENTED" >> $GITHUB_ENV
+
+      - name: Build and publish to npm
+        if: ${{ env.VERSION_INCREMENTED == 'true' }}
+        env:
+          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
+        run: |
+          npm run build-and-publish
+        working-directory: ./apps/js-sdk/firecrawl
+
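Review note on job ordering: after this change the `needs:` keys in `fly.yml` form a small dependency graph, and GitHub Actions starts a job only once every job it needs has succeeded. The sketch below derives the resulting stages with the standard-library `graphlib` module (Python 3.9+). The mapping is transcribed from the hunks above; treating `pre-deploy-e2e-tests` as having no `needs` is an assumption, since its full definition is not visible in this diff, and `execution_stages` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Job -> jobs it waits on, transcribed from the `needs:` keys in the fly.yml hunks.
# Assumption: pre-deploy-e2e-tests has no `needs` of its own.
NEEDS = {
    "pre-deploy-e2e-tests": [],
    "pre-deploy-test-suite": ["pre-deploy-e2e-tests"],
    "python-sdk-tests": ["pre-deploy-e2e-tests"],
    "js-sdk-tests": ["pre-deploy-e2e-tests"],
    "deploy": ["pre-deploy-test-suite", "python-sdk-tests", "js-sdk-tests"],
    "build-and-publish-python-sdk": ["deploy"],
    "build-and-publish-js-sdk": ["deploy"],
}


def execution_stages(needs: dict) -> list:
    """Group jobs into stages; jobs within one stage can run in parallel."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(execution_stages(NEEDS), start=1):
        print(number, ", ".join(stage))
```

Under these assumptions the three test jobs share one stage, `deploy` waits for all of them, and both SDK publish jobs run only after a successful deploy.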
@@ -39,7 +39,7 @@ SUPABASE_SERVICE_TOKEN=
 TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
 SCRAPING_BEE_API_KEY= #Set if you'd like to use scraping Be to handle JS blocking
 OPENAI_API_KEY= # add for LLM dependednt features (image alt generation, etc.)
-BULL_AUTH_KEY= #
+BULL_AUTH_KEY= @
 LOGTAIL_KEY= # Use if you're configuring basic logging with logtail
 PLAYWRIGHT_MICROSERVICE_URL= # set if you'd like to run a playwright fallback
 LLAMAPARSE_API_KEY= #Set if you have a llamaparse key you'd like to use to parse pdfs
@@ -1,6 +1,6 @@
 # 🔥 Firecrawl
 
-Crawl and convert any website into LLM-ready markdown. Built by [Mendable.ai](https://mendable.ai?ref=gfirecrawl) and the firecrawl community.
+Crawl and convert any website into LLM-ready markdown or structured data. Built by [Mendable.ai](https://mendable.ai?ref=gfirecrawl) and the Firecrawl community. Includes powerful scraping, crawling and data extraction capabilities.
 
 _This repository is in its early development stages. We are still merging custom modules in the mono repo. It's not completely yet ready for full self-host deployment, but you can already run it locally._
 
@@ -402,7 +402,6 @@ const searchResults = await app.search(query, {
 
 ```
 
-
 ## Contributing
 
 We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.
@@ -29,3 +29,6 @@ docker compose up
 
 
 This will run a local instance of Firecrawl which can be accessed at `http://localhost:3002`.
+
+# Install Firecrawl on a Kubernetes Cluster (Simple Version)
+Read the [examples/k8n/README.md](examples/k8n/README.md) for instructions on how to install Firecrawl on a Kubernetes Cluster.
@@ -3,7 +3,7 @@ NUM_WORKERS_PER_QUEUE=8
 PORT=3002
 HOST=0.0.0.0
 REDIS_URL=redis://localhost:6379
-PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000
+PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
 
 ## To turn on DB authentication, you need to set up supabase.
 USE_DB_AUTHENTICATION=true
@@ -21,7 +21,7 @@ RATE_LIMIT_TEST_API_KEY_SCRAPE= # set if you'd like to test the scraping rate li
 RATE_LIMIT_TEST_API_KEY_CRAWL= # set if you'd like to test the crawling rate limit
 SCRAPING_BEE_API_KEY= #Set if you'd like to use scraping Be to handle JS blocking
 OPENAI_API_KEY= # add for LLM dependednt features (image alt generation, etc.)
-BULL_AUTH_KEY= #
+BULL_AUTH_KEY= @
 LOGTAIL_KEY= # Use if you're configuring basic logging with logtail
 LLAMAPARSE_API_KEY= #Set if you have a llamaparse key you'd like to use to parse pdfs
 SERPER_API_KEY= #Set if you have a serper key you'd like to use as a search api
@@ -31,8 +31,28 @@ POSTHOG_HOST= # set if you'd like to send posthog events like job logs
 
 STRIPE_PRICE_ID_STANDARD=
 STRIPE_PRICE_ID_SCALE=
+STRIPE_PRICE_ID_STARTER=
+STRIPE_PRICE_ID_HOBBY=
+STRIPE_PRICE_ID_HOBBY_YEARLY=
+STRIPE_PRICE_ID_STANDARD_NEW=
+STRIPE_PRICE_ID_STANDARD_NEW_YEARLY=
+STRIPE_PRICE_ID_GROWTH=
+STRIPE_PRICE_ID_GROWTH_YEARLY=
 
 HYPERDX_API_KEY=
 HDX_NODE_BETA_MODE=1
 
 FIRE_ENGINE_BETA_URL= # set if you'd like to use the fire engine closed beta
+
+# Proxy Settings for Playwright (Alternative you can can use a proxy service like oxylabs, which rotates IPs for you on every request)
+PROXY_SERVER=
+PROXY_USERNAME=
+PROXY_PASSWORD=
+# set if you'd like to block media requests to save proxy bandwidth
+BLOCK_MEDIA=
+
+# Set this to the URL of your webhook when using the self-hosted version of FireCrawl
+SELF_HOSTED_WEBHOOK_URL=
+
+# Resend API Key for transactional emails
+RESEND_API_KEY=
@ -24,8 +24,15 @@ kill_timeout = '5s'
|
|||||||
|
|
||||||
[http_service.concurrency]
|
[http_service.concurrency]
|
||||||
type = "requests"
|
type = "requests"
|
||||||
hard_limit = 200
|
hard_limit = 100
|
||||||
soft_limit = 100
|
soft_limit = 50
|
||||||
|
|
||||||
|
[[http_service.checks]]
|
||||||
|
grace_period = "20s"
|
||||||
|
interval = "30s"
|
||||||
|
method = "GET"
|
||||||
|
timeout = "15s"
|
||||||
|
path = "/"
|
||||||
|
|
||||||
[[services]]
|
[[services]]
|
||||||
protocol = 'tcp'
|
protocol = 'tcp'
|
||||||
@ -43,8 +50,8 @@ kill_timeout = '5s'
|
|||||||
|
|
||||||
[services.concurrency]
|
[services.concurrency]
|
||||||
type = 'connections'
|
type = 'connections'
|
||||||
hard_limit = 75
|
hard_limit = 30
|
||||||
soft_limit = 30
|
soft_limit = 12
|
||||||
|
|
||||||
[[vm]]
|
[[vm]]
|
||||||
size = 'performance-4x'
|
size = 'performance-4x'
|
||||||
|
@ -50,6 +50,27 @@
|
|||||||
"type": "boolean",
|
"type": "boolean",
|
||||||
"description": "Include the raw HTML content of the page. Will output a html key in the response.",
|
"description": "Include the raw HTML content of the page. Will output a html key in the response.",
|
||||||
"default": false
|
"default": false
|
||||||
|
},
|
||||||
|
"screenshot": {
|
||||||
|
"type": "boolean",
|
||||||
|
"description": "Include a screenshot of the top of the page that you are scraping.",
|
||||||
|
"default": false
|
||||||
|
},
|
||||||
|
"waitFor": {
|
||||||
|
"type": "integer",
|
||||||
|
"description": "Wait x amount of milliseconds for the page to load to fetch content",
|
||||||
|
"default": 0
|
||||||
|
},
|
||||||
|
"removeTags": {
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "string"
|
||||||
|
},
|
||||||
|
"description": "Tags, classes and ids to remove from the page. Use comma separated values. Example: 'script, .ad, #footer'"
|
||||||
|
},
|
||||||
|
"headers": {
|
||||||
|
"type": "object",
|
||||||
|
"description": "Headers to send with the request. Can be used to send cookies, user-agent, etc."
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
@ -171,10 +192,20 @@
|
|||||||
"description": "The crawling mode to use. Fast mode crawls websites without a sitemap 4x faster, but may be less accurate and shouldn't be used on heavily JS-rendered websites.",
|
"description": "The crawling mode to use. Fast mode crawls websites without a sitemap 4x faster, but may be less accurate and shouldn't be used on heavily JS-rendered websites.",
|
||||||
"default": "default"
|
"default": "default"
|
||||||
},
|
},
|
||||||
|
"ignoreSitemap": {
|
||||||
|
"type": "boolean",
|
||||||
|
"description": "Ignore the website sitemap when crawling",
|
||||||
|
"default": false
|
||||||
|
},
|
||||||
"limit": {
|
"limit": {
|
||||||
"type": "integer",
|
"type": "integer",
|
||||||
"description": "Maximum number of pages to crawl",
|
"description": "Maximum number of pages to crawl",
|
||||||
"default": 10000
|
"default": 10000
|
||||||
|
},
|
||||||
|
"allowBackwardCrawling": {
|
||||||
|
"type": "boolean",
|
||||||
|
"description": "Allow backward crawling (crawl from the base URL to the previous URLs)",
|
||||||
|
"default": false
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
@ -190,6 +221,27 @@
|
|||||||
"type": "boolean",
|
"type": "boolean",
|
||||||
"description": "Include the raw HTML content of the page. Will output a html key in the response.",
|
"description": "Include the raw HTML content of the page. Will output a html key in the response.",
|
||||||
"default": false
|
"default": false
|
||||||
|
},
|
||||||
|
"screenshot": {
|
||||||
|
"type": "boolean",
|
||||||
|
"description": "Include a screenshot of the top of the page that you are scraping.",
|
||||||
|
"default": false
|
||||||
|
},
|
||||||
|
"headers": {
|
||||||
|
"type": "object",
|
||||||
|
"description": "Headers to send with the request when scraping. Can be used to send cookies, user-agent, etc."
|
||||||
|
},
|
||||||
|
"removeTags": {
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "string"
|
||||||
|
},
|
||||||
|
"description": "Tags, classes and ids to remove from the page. Use comma separated values. Example: 'script, .ad, #footer'"
|
||||||
|
},
|
||||||
|
"replaceAllPathsWithAbsolutePaths": {
|
||||||
|
"type": "boolean",
|
||||||
|
"description": "Replace all relative paths with absolute paths for images and links",
|
||||||
|
"default": false
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@ -363,7 +415,7 @@
|
|||||||
"items": {
|
"items": {
|
||||||
"$ref": "#/components/schemas/CrawlStatusResponseObj"
|
"$ref": "#/components/schemas/CrawlStatusResponseObj"
|
||||||
},
|
},
|
||||||
"description": "Partial documents returned as it is being crawls (streaming). When a page is ready it will append to the parial_data array - so no need to wait for all the website to be crawled."
|
"description": "Partial documents returned while the site is being crawled (streaming). **This feature is currently in alpha - expect breaking changes.** When a page is ready, it is appended to the partial_data array, so there is no need to wait for the entire website to be crawled. The array holds at most 50 items; when a new item is added, the oldest item (top of the array) is removed."
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@ -474,9 +526,126 @@
|
|||||||
"type": "string",
|
"type": "string",
|
||||||
"nullable": true
|
"nullable": true
|
||||||
},
|
},
|
||||||
|
"keywords": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"robots": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogTitle": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogDescription": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogUrl": {
|
||||||
|
"type": "string",
|
||||||
|
"format": "uri",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogImage": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogAudio": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogDeterminer": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogLocale": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogLocaleAlternate": {
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "string"
|
||||||
|
},
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogSiteName": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogVideo": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dctermsCreated": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dcDateCreated": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dcDate": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dctermsType": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dcType": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dctermsAudience": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dctermsSubject": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dcSubject": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dcDescription": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dctermsKeywords": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"modifiedTime": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"publishedTime": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"articleTag": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"articleSection": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
"sourceURL": {
|
"sourceURL": {
|
||||||
"type": "string",
|
"type": "string",
|
||||||
"format": "uri"
|
"format": "uri"
|
||||||
|
},
|
||||||
|
"pageStatusCode": {
|
||||||
|
"type": "integer",
|
||||||
|
"description": "The status code of the page"
|
||||||
|
},
|
||||||
|
"pageError": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true,
|
||||||
|
"description": "The error message of the page"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
@ -508,6 +677,10 @@
|
|||||||
"nullable": true,
|
"nullable": true,
|
||||||
"description": "Raw HTML content of the page if `includeHtml` is true"
|
"description": "Raw HTML content of the page if `includeHtml` is true"
|
||||||
},
|
},
|
||||||
|
"index": {
|
||||||
|
"type": "integer",
|
||||||
|
"description": "The number of the page that was crawled. This is useful for `partial_data` so you know which page the data is from."
|
||||||
|
},
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"type": "object",
|
"type": "object",
|
||||||
"properties": {
|
"properties": {
|
||||||
@ -521,9 +694,126 @@
|
|||||||
"type": "string",
|
"type": "string",
|
||||||
"nullable": true
|
"nullable": true
|
||||||
},
|
},
|
||||||
|
"keywords": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"robots": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogTitle": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogDescription": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogUrl": {
|
||||||
|
"type": "string",
|
||||||
|
"format": "uri",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogImage": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogAudio": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogDeterminer": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogLocale": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogLocaleAlternate": {
|
||||||
|
"type": "array",
|
||||||
|
"items": {
|
||||||
|
"type": "string"
|
||||||
|
},
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogSiteName": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"ogVideo": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dctermsCreated": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dcDateCreated": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dcDate": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dctermsType": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dcType": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dctermsAudience": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dctermsSubject": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dcSubject": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dcDescription": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"dctermsKeywords": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"modifiedTime": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"publishedTime": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"articleTag": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
|
"articleSection": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true
|
||||||
|
},
|
||||||
"sourceURL": {
|
"sourceURL": {
|
||||||
"type": "string",
|
"type": "string",
|
||||||
"format": "uri"
|
"format": "uri"
|
||||||
|
},
|
||||||
|
"pageStatusCode": {
|
||||||
|
"type": "integer",
|
||||||
|
"description": "The status code of the page"
|
||||||
|
},
|
||||||
|
"pageError": {
|
||||||
|
"type": "string",
|
||||||
|
"nullable": true,
|
||||||
|
"description": "The error message of the page"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
@ -30,6 +30,7 @@
|
|||||||
"@types/cors": "^2.8.13",
|
"@types/cors": "^2.8.13",
|
||||||
"@types/express": "^4.17.17",
|
"@types/express": "^4.17.17",
|
||||||
"@types/jest": "^29.5.12",
|
"@types/jest": "^29.5.12",
|
||||||
|
"@types/node": "^20.14.1",
|
||||||
"body-parser": "^1.20.1",
|
"body-parser": "^1.20.1",
|
||||||
"express": "^4.18.2",
|
"express": "^4.18.2",
|
||||||
"jest": "^29.6.3",
|
"jest": "^29.6.3",
|
||||||
@ -90,6 +91,7 @@
|
|||||||
"puppeteer": "^22.6.3",
|
"puppeteer": "^22.6.3",
|
||||||
"rate-limiter-flexible": "^2.4.2",
|
"rate-limiter-flexible": "^2.4.2",
|
||||||
"redis": "^4.6.7",
|
"redis": "^4.6.7",
|
||||||
|
"resend": "^3.2.0",
|
||||||
"robots-parser": "^3.0.1",
|
"robots-parser": "^3.0.1",
|
||||||
"scrapingbee": "^1.7.4",
|
"scrapingbee": "^1.7.4",
|
||||||
"stripe": "^12.2.0",
|
"stripe": "^12.2.0",
|
||||||
|
File diff suppressed because it is too large
@ -1,5 +1,4 @@
|
|||||||
import request from "supertest";
|
import request from "supertest";
|
||||||
import { app } from "../../index";
|
|
||||||
import dotenv from "dotenv";
|
import dotenv from "dotenv";
|
||||||
const fs = require("fs");
|
const fs = require("fs");
|
||||||
const path = require("path");
|
const path = require("path");
|
||||||
|
File diff suppressed because it is too large
@ -1,12 +1,13 @@
|
|||||||
import { parseApi } from "../../src/lib/parseApi";
|
import { parseApi } from "../../src/lib/parseApi";
|
||||||
import { getRateLimiter, } from "../../src/services/rate-limiter";
|
import { getRateLimiter, } from "../../src/services/rate-limiter";
|
||||||
import { AuthResponse, RateLimiterMode } from "../../src/types";
|
import { AuthResponse, NotificationType, RateLimiterMode } from "../../src/types";
|
||||||
import { supabase_service } from "../../src/services/supabase";
|
import { supabase_service } from "../../src/services/supabase";
|
||||||
import { withAuth } from "../../src/lib/withAuth";
|
import { withAuth } from "../../src/lib/withAuth";
|
||||||
import { RateLimiterRedis } from "rate-limiter-flexible";
|
import { RateLimiterRedis } from "rate-limiter-flexible";
|
||||||
import { setTraceAttributes } from '@hyperdx/node-opentelemetry';
|
import { setTraceAttributes } from '@hyperdx/node-opentelemetry';
|
||||||
|
import { sendNotification } from "../services/notification/email_notification";
|
||||||
|
|
||||||
export async function authenticateUser(req, res, mode?: RateLimiterMode) : Promise<AuthResponse> {
|
export async function authenticateUser(req, res, mode?: RateLimiterMode): Promise<AuthResponse> {
|
||||||
return withAuth(supaAuthenticateUser)(req, res, mode);
|
return withAuth(supaAuthenticateUser)(req, res, mode);
|
||||||
}
|
}
|
||||||
function setTrace(team_id: string, api_key: string) {
|
function setTrace(team_id: string, api_key: string) {
|
||||||
@ -29,6 +30,7 @@ export async function supaAuthenticateUser(
|
|||||||
team_id?: string;
|
team_id?: string;
|
||||||
error?: string;
|
error?: string;
|
||||||
status?: number;
|
status?: number;
|
||||||
|
plan?: string;
|
||||||
}> {
|
}> {
|
||||||
const authHeader = req.headers.authorization;
|
const authHeader = req.headers.authorization;
|
||||||
if (!authHeader) {
|
if (!authHeader) {
|
||||||
@ -51,8 +53,11 @@ export async function supaAuthenticateUser(
|
|||||||
let subscriptionData: { team_id: string, plan: string } | null = null;
|
let subscriptionData: { team_id: string, plan: string } | null = null;
|
||||||
let normalizedApi: string;
|
let normalizedApi: string;
|
||||||
|
|
||||||
|
let team_id: string;
|
||||||
|
|
||||||
if (token == "this_is_just_a_preview_token") {
|
if (token == "this_is_just_a_preview_token") {
|
||||||
rateLimiter = getRateLimiter(RateLimiterMode.Preview, token);
|
rateLimiter = getRateLimiter(RateLimiterMode.Preview, token);
|
||||||
|
team_id = "preview";
|
||||||
} else {
|
} else {
|
||||||
normalizedApi = parseApi(token);
|
normalizedApi = parseApi(token);
|
||||||
|
|
||||||
@ -89,7 +94,9 @@ export async function supaAuthenticateUser(
|
|||||||
status: 401,
|
status: 401,
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
const team_id = data[0].team_id;
|
const internal_team_id = data[0].team_id;
|
||||||
|
team_id = internal_team_id;
|
||||||
|
|
||||||
const plan = getPlanByPriceId(data[0].price_id);
|
const plan = getPlanByPriceId(data[0].price_id);
|
||||||
// HyperDX Logging
|
// HyperDX Logging
|
||||||
setTrace(team_id, normalizedApi);
|
setTrace(team_id, normalizedApi);
|
||||||
@ -104,12 +111,13 @@ export async function supaAuthenticateUser(
|
|||||||
case RateLimiterMode.Scrape:
|
case RateLimiterMode.Scrape:
|
||||||
rateLimiter = getRateLimiter(RateLimiterMode.Scrape, token, subscriptionData.plan);
|
rateLimiter = getRateLimiter(RateLimiterMode.Scrape, token, subscriptionData.plan);
|
||||||
break;
|
break;
|
||||||
|
case RateLimiterMode.Search:
|
||||||
|
rateLimiter = getRateLimiter(RateLimiterMode.Search, token, subscriptionData.plan);
|
||||||
|
break;
|
||||||
case RateLimiterMode.CrawlStatus:
|
case RateLimiterMode.CrawlStatus:
|
||||||
rateLimiter = getRateLimiter(RateLimiterMode.CrawlStatus, token);
|
rateLimiter = getRateLimiter(RateLimiterMode.CrawlStatus, token);
|
||||||
break;
|
break;
|
||||||
case RateLimiterMode.Search:
|
|
||||||
rateLimiter = getRateLimiter(RateLimiterMode.Search, token);
|
|
||||||
break;
|
|
||||||
case RateLimiterMode.Preview:
|
case RateLimiterMode.Preview:
|
||||||
rateLimiter = getRateLimiter(RateLimiterMode.Preview, token);
|
rateLimiter = getRateLimiter(RateLimiterMode.Preview, token);
|
||||||
break;
|
break;
|
||||||
@ -122,13 +130,23 @@ export async function supaAuthenticateUser(
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
const team_endpoint_token = token === "this_is_just_a_preview_token" ? iptoken : team_id;
|
||||||
|
|
||||||
try {
|
try {
|
||||||
await rateLimiter.consume(iptoken);
|
await rateLimiter.consume(team_endpoint_token);
|
||||||
} catch (rateLimiterRes) {
|
} catch (rateLimiterRes) {
|
||||||
console.error(rateLimiterRes);
|
console.error(rateLimiterRes);
|
||||||
|
const secs = Math.round(rateLimiterRes.msBeforeNext / 1000) || 1;
|
||||||
|
const retryDate = new Date(Date.now() + rateLimiterRes.msBeforeNext);
|
||||||
|
|
||||||
|
// Send at most one rate limit email every 7 days; sendNotification already checks whether one was sent within this date range
|
||||||
|
const startDate = new Date();
|
||||||
|
const endDate = new Date();
|
||||||
|
endDate.setDate(endDate.getDate() + 7);
|
||||||
|
// await sendNotification(team_id, NotificationType.RATE_LIMIT_REACHED, startDate.toISOString(), endDate.toISOString());
|
||||||
return {
|
return {
|
||||||
success: false,
|
success: false,
|
||||||
error: "Rate limit exceeded. Too many requests, try again in 1 minute.",
|
error: `Rate limit exceeded. Consumed points: ${rateLimiterRes.consumedPoints}, Remaining points: ${rateLimiterRes.remainingPoints}. Upgrade your plan at https://firecrawl.dev/pricing for increased rate limits or please retry after ${secs}s, resets at ${retryDate}`,
|
||||||
status: 429,
|
status: 429,
|
||||||
};
|
};
|
||||||
}
|
}
|
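The retry hint in the new error message is derived from the `msBeforeNext` value on the rate limiter rejection. A minimal sketch of that arithmetic, assuming a hypothetical helper named `retryInfo` (not part of the commit) and an injectable `now` so the result is reproducible:

```typescript
// Hypothetical helper; msBeforeNext mirrors the field read from the
// rate limiter rejection in the diff.
function retryInfo(
  msBeforeNext: number,
  now: number = Date.now()
): { secs: number; retryDate: Date } {
  // Round to whole seconds, but never report 0 seconds to the caller.
  const secs = Math.round(msBeforeNext / 1000) || 1;
  // Absolute time at which the limiter window resets.
  const retryDate = new Date(now + msBeforeNext);
  return { secs, retryDate };
}
```

The `|| 1` fallback matters for short waits: anything under 500 ms rounds to 0, which would otherwise read as "retry after 0s".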
||||||
@ -170,16 +188,24 @@ export async function supaAuthenticateUser(
|
|||||||
subscriptionData = data[0];
|
subscriptionData = data[0];
|
||||||
}
|
}
|
||||||
|
|
||||||
return { success: true, team_id: subscriptionData.team_id };
|
return { success: true, team_id: subscriptionData.team_id, plan: subscriptionData.plan ?? ""};
|
||||||
}
|
}
|
||||||
|
|
||||||
function getPlanByPriceId(price_id: string) {
|
function getPlanByPriceId(price_id: string) {
|
||||||
switch (price_id) {
|
switch (price_id) {
|
||||||
|
case process.env.STRIPE_PRICE_ID_STARTER:
|
||||||
|
return 'starter';
|
||||||
case process.env.STRIPE_PRICE_ID_STANDARD:
|
case process.env.STRIPE_PRICE_ID_STANDARD:
|
||||||
return 'standard';
|
return 'standard';
|
||||||
case process.env.STRIPE_PRICE_ID_SCALE:
|
case process.env.STRIPE_PRICE_ID_SCALE:
|
||||||
return 'scale';
|
return 'scale';
|
||||||
|
case process.env.STRIPE_PRICE_ID_HOBBY:
case process.env.STRIPE_PRICE_ID_HOBBY_YEARLY:
|
||||||
|
return 'hobby';
|
||||||
|
case process.env.STRIPE_PRICE_ID_STANDARD_NEW:
case process.env.STRIPE_PRICE_ID_STANDARD_NEW_YEARLY:
|
||||||
|
return 'standard-new';
|
||||||
|
case process.env.STRIPE_PRICE_ID_GROWTH:
case process.env.STRIPE_PRICE_ID_GROWTH_YEARLY:
|
||||||
|
return 'growth';
|
||||||
default:
|
default:
|
||||||
return 'starter';
|
return 'free';
|
||||||
}
|
}
|
||||||
}
|
}
|
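A `switch` compares each `case` expression to the subject by strict equality, so a label written as `a || b` collapses to a single value before the comparison. One way to map several price IDs to the same plan is a lookup table. The sketch below is an illustration only: `buildPlanLookup` and `planForPriceId` are hypothetical names, and only the environment variable names and plan names are taken from the diff.

```typescript
// Hypothetical helpers; only the env variable names and plan names come from the diff.
type Env = Record<string, string | undefined>;

function buildPlanLookup(env: Env): Map<string, string> {
  const pairs: Array<[string | undefined, string]> = [
    [env.STRIPE_PRICE_ID_STARTER, "starter"],
    [env.STRIPE_PRICE_ID_STANDARD, "standard"],
    [env.STRIPE_PRICE_ID_SCALE, "scale"],
    [env.STRIPE_PRICE_ID_HOBBY, "hobby"],
    [env.STRIPE_PRICE_ID_HOBBY_YEARLY, "hobby"],
    [env.STRIPE_PRICE_ID_STANDARD_NEW, "standard-new"],
    [env.STRIPE_PRICE_ID_STANDARD_NEW_YEARLY, "standard-new"],
    [env.STRIPE_PRICE_ID_GROWTH, "growth"],
    [env.STRIPE_PRICE_ID_GROWTH_YEARLY, "growth"],
  ];
  const lookup = new Map<string, string>();
  for (const [priceId, plan] of pairs) {
    // Skip unset or empty variables so they can never match a real price ID.
    if (priceId) lookup.set(priceId, plan);
  }
  return lookup;
}

function planForPriceId(
  lookup: Map<string, string>,
  priceId: string | null | undefined
): string {
  if (!priceId) return "free";
  // Unknown price IDs fall back to the free plan, as in the diff's default branch.
  return lookup.get(priceId) ?? "free";
}
```

In use, the table would be built once, for example with `buildPlanLookup(process.env)`, and consulted per request.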
@ -7,6 +7,8 @@ import { RateLimiterMode } from "../../src/types";
|
|||||||
import { addWebScraperJob } from "../../src/services/queue-jobs";
|
import { addWebScraperJob } from "../../src/services/queue-jobs";
|
||||||
import { isUrlBlocked } from "../../src/scraper/WebScraper/utils/blocklist";
|
import { isUrlBlocked } from "../../src/scraper/WebScraper/utils/blocklist";
|
||||||
import { logCrawl } from "../../src/services/logging/crawl_log";
|
import { logCrawl } from "../../src/services/logging/crawl_log";
|
||||||
|
import { validateIdempotencyKey } from "../../src/services/idempotency/validate";
|
||||||
|
import { createIdempotencyKey } from "../../src/services/idempotency/create";
|
||||||
|
|
||||||
export async function crawlController(req: Request, res: Response) {
|
export async function crawlController(req: Request, res: Response) {
|
||||||
try {
|
try {
|
||||||
@ -19,6 +21,19 @@ export async function crawlController(req: Request, res: Response) {
|
|||||||
return res.status(status).json({ error });
|
return res.status(status).json({ error });
|
||||||
}
|
}
|
||||||
|
|
||||||
|
if (req.headers["x-idempotency-key"]) {
|
||||||
|
const isIdempotencyValid = await validateIdempotencyKey(req);
|
||||||
|
if (!isIdempotencyValid) {
|
||||||
|
return res.status(409).json({ error: "Idempotency key already used" });
|
||||||
|
}
|
||||||
|
try {
|
||||||
|
createIdempotencyKey(req);
|
||||||
|
} catch (error) {
|
||||||
|
console.error(error);
|
||||||
|
return res.status(500).json({ error: error.message });
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
const { success: creditsCheckSuccess, message: creditsCheckMessage } =
|
const { success: creditsCheckSuccess, message: creditsCheckMessage } =
|
||||||
await checkTeamCredits(team_id, 1);
|
await checkTeamCredits(team_id, 1);
|
||||||
if (!creditsCheckSuccess) {
|
if (!creditsCheckSuccess) {
|
||||||
@ -40,8 +55,16 @@ export async function crawlController(req: Request, res: Response) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
const mode = req.body.mode ?? "crawl";
|
const mode = req.body.mode ?? "crawl";
|
||||||
const crawlerOptions = req.body.crawlerOptions ?? {};
|
|
||||||
const pageOptions = req.body.pageOptions ?? { onlyMainContent: false, includeHtml: false };
|
const crawlerOptions = req.body.crawlerOptions ?? {
|
||||||
|
allowBackwardCrawling: false
|
||||||
|
};
|
||||||
|
const pageOptions = req.body.pageOptions ?? {
|
||||||
|
onlyMainContent: false,
|
||||||
|
includeHtml: false,
|
||||||
|
removeTags: [],
|
||||||
|
parsePDF: true
|
||||||
|
};
|
||||||
|
|
||||||
if (mode === "single_urls" && !url.includes(",")) {
|
if (mode === "single_urls" && !url.includes(",")) {
|
||||||
try {
|
try {
|
||||||
@ -49,9 +72,7 @@ export async function crawlController(req: Request, res: Response) {
|
|||||||
await a.setOptions({
|
await a.setOptions({
|
||||||
mode: "single_urls",
|
mode: "single_urls",
|
||||||
urls: [url],
|
urls: [url],
|
||||||
crawlerOptions: {
|
crawlerOptions: { ...crawlerOptions, returnOnlyUrls: true },
|
||||||
returnOnlyUrls: true,
|
|
||||||
},
|
|
||||||
pageOptions: pageOptions,
|
pageOptions: pageOptions,
|
||||||
});
|
});
|
||||||
|
|
||||||
@ -76,7 +97,7 @@ export async function crawlController(req: Request, res: Response) {
|
|||||||
const job = await addWebScraperJob({
|
const job = await addWebScraperJob({
|
||||||
url: url,
|
url: url,
|
||||||
mode: mode ?? "crawl", // fix for single urls not working
|
mode: mode ?? "crawl", // fix for single urls not working
|
||||||
crawlerOptions: { ...crawlerOptions },
|
crawlerOptions: crawlerOptions,
|
||||||
team_id: team_id,
|
team_id: team_id,
|
||||||
pageOptions: pageOptions,
|
pageOptions: pageOptions,
|
||||||
origin: req.body.origin ?? "api",
|
origin: req.body.origin ?? "api",
|
||||||
|
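The idempotency check added to `crawlController` follows a claim-once pattern: a request without the `x-idempotency-key` header passes through, the first request with a given key is accepted and recorded, and a repeat of that key is rejected with 409. The sketch below illustrates the pattern with an in-memory set. The real `validateIdempotencyKey` and `createIdempotencyKey` are not shown in this diff, so `InMemoryIdempotencyStore` and `checkIdempotency` are hypothetical stand-ins rather than the project's implementation.

```typescript
// Hypothetical in-memory stand-in; the real validateIdempotencyKey and
// createIdempotencyKey implementations are not shown in this diff.
class InMemoryIdempotencyStore {
  private readonly seen = new Set<string>();

  // Returns true and records the key the first time it is seen, false afterwards.
  claim(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

type HeaderBag = Record<string, string | string[] | undefined>;

// Mirrors the controller flow: no header means no check, a reused key means 409.
function checkIdempotency(
  store: InMemoryIdempotencyStore,
  headers: HeaderBag
): { ok: boolean; status: number } {
  const raw = headers["x-idempotency-key"];
  const key = Array.isArray(raw) ? raw[0] : raw;
  if (!key) return { ok: true, status: 200 };
  return store.claim(key)
    ? { ok: true, status: 200 }
    : { ok: false, status: 409 };
}
```

Combining the check and the write in a single `claim` step avoids a window in which two concurrent requests both pass validation before either key is stored; a persistent store would need an equivalent atomic insert.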
@ -26,7 +26,7 @@ export async function crawlPreviewController(req: Request, res: Response) {
|
|||||||
|
|
||||||
const mode = req.body.mode ?? "crawl";
|
const mode = req.body.mode ?? "crawl";
|
||||||
const crawlerOptions = req.body.crawlerOptions ?? {};
|
const crawlerOptions = req.body.crawlerOptions ?? {};
|
||||||
const pageOptions = req.body.pageOptions ?? { onlyMainContent: false, includeHtml: false };
|
const pageOptions = req.body.pageOptions ?? { onlyMainContent: false, includeHtml: false, removeTags: [] };
|
||||||
|
|
||||||
const job = await addWebScraperJob({
|
const job = await addWebScraperJob({
|
||||||
url: url,
|
url: url,
|
||||||
|
@ -15,7 +15,8 @@ export async function scrapeHelper(
|
|||||||
crawlerOptions: any,
|
crawlerOptions: any,
|
||||||
pageOptions: PageOptions,
|
pageOptions: PageOptions,
|
||||||
extractorOptions: ExtractorOptions,
|
extractorOptions: ExtractorOptions,
|
||||||
timeout: number
|
timeout: number,
|
||||||
|
plan?: string
|
||||||
): Promise<{
|
): Promise<{
|
||||||
success: boolean;
|
success: boolean;
|
||||||
error?: string;
|
error?: string;
|
||||||
@ -60,11 +61,13 @@ export async function scrapeHelper(
|
|||||||
(doc: { content?: string }) => doc.content && doc.content.trim().length > 0
|
(doc: { content?: string }) => doc.content && doc.content.trim().length > 0
|
||||||
);
|
);
|
||||||
if (filteredDocs.length === 0) {
|
if (filteredDocs.length === 0) {
|
||||||
return { success: true, error: "No page found", returnCode: 200 };
|
return { success: true, error: "No page found", returnCode: 200, data: docs[0] };
|
||||||
}
|
}
|
||||||
|
|
||||||
let creditsToBeBilled = filteredDocs.length;
|
let creditsToBeBilled = filteredDocs.length;
|
||||||
const creditsPerLLMExtract = 5;
|
const creditsPerLLMExtract = 50;
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
if (extractorOptions.mode === "llm-extraction") {
|
if (extractorOptions.mode === "llm-extraction") {
|
||||||
creditsToBeBilled = creditsToBeBilled + (creditsPerLLMExtract * filteredDocs.length);
|
creditsToBeBilled = creditsToBeBilled + (creditsPerLLMExtract * filteredDocs.length);
|
||||||
@ -93,7 +96,7 @@ export async function scrapeHelper(
|
|||||||
export async function scrapeController(req: Request, res: Response) {
|
export async function scrapeController(req: Request, res: Response) {
|
||||||
try {
|
try {
|
||||||
// make sure to authenticate user first, Bearer <token>
|
// make sure to authenticate user first, Bearer <token>
|
||||||
const { success, team_id, error, status } = await authenticateUser(
|
const { success, team_id, error, status, plan } = await authenticateUser(
|
||||||
req,
|
req,
|
||||||
res,
|
res,
|
||||||
RateLimiterMode.Scrape
|
RateLimiterMode.Scrape
|
||||||
@ -102,7 +105,13 @@ export async function scrapeController(req: Request, res: Response) {
|
|||||||
return res.status(status).json({ error });
|
return res.status(status).json({ error });
|
||||||
}
|
}
|
||||||
const crawlerOptions = req.body.crawlerOptions ?? {};
|
const crawlerOptions = req.body.crawlerOptions ?? {};
|
||||||
const pageOptions = req.body.pageOptions ?? { onlyMainContent: false, includeHtml: false };
|
const pageOptions = req.body.pageOptions ?? {
|
||||||
|
onlyMainContent: false,
|
||||||
|
includeHtml: false,
|
||||||
|
waitFor: 0,
|
||||||
|
screenshot: false,
|
||||||
|
parsePDF: true
|
||||||
|
};
|
||||||
const extractorOptions = req.body.extractorOptions ?? {
|
const extractorOptions = req.body.extractorOptions ?? {
|
||||||
mode: "markdown"
|
mode: "markdown"
|
||||||
}
|
}
|
||||||
@ -129,7 +138,8 @@ export async function scrapeController(req: Request, res: Response) {
|
|||||||
crawlerOptions,
|
crawlerOptions,
|
||||||
pageOptions,
|
pageOptions,
|
||||||
extractorOptions,
|
extractorOptions,
|
||||||
timeout
|
timeout,
|
||||||
|
plan
|
||||||
);
|
);
|
||||||
const endTime = new Date().getTime();
|
const endTime = new Date().getTime();
|
||||||
const timeTakenInSeconds = (endTime - startTime) / 1000;
|
const timeTakenInSeconds = (endTime - startTime) / 1000;
|
||||||
|
@ -28,11 +28,13 @@ export async function searchHelper(
|
|||||||
|
|
||||||
const tbs = searchOptions.tbs ?? null;
|
const tbs = searchOptions.tbs ?? null;
|
||||||
const filter = searchOptions.filter ?? null;
|
const filter = searchOptions.filter ?? null;
|
||||||
|
const num_results = searchOptions.limit ?? 7;
|
||||||
|
const num_results_buffer = Math.floor(num_results * 1.5);
|
||||||
|
|
||||||
let res = await search({
|
let res = await search({
|
||||||
query: query,
|
query: query,
|
||||||
advanced: advanced,
|
advanced: advanced,
|
||||||
num_results: searchOptions.limit ?? 7,
|
num_results: num_results_buffer,
|
||||||
tbs: tbs,
|
tbs: tbs,
|
||||||
filter: filter,
|
filter: filter,
|
||||||
lang: searchOptions.lang ?? "en",
|
lang: searchOptions.lang ?? "en",
|
||||||
@ -42,11 +44,27 @@ export async function searchHelper(
|
|||||||
|
|
||||||
let justSearch = pageOptions.fetchPageContent === false;
|
let justSearch = pageOptions.fetchPageContent === false;
|
||||||
|
|
||||||
|
|
||||||
if (justSearch) {
|
if (justSearch) {
|
||||||
|
const billingResult = await billTeam(
|
||||||
|
team_id,
|
||||||
|
res.length
|
||||||
|
);
|
||||||
|
if (!billingResult.success) {
|
||||||
|
return {
|
||||||
|
success: false,
|
||||||
|
error:
|
||||||
|
"Failed to bill team. Insufficient credits or subscription not found.",
|
||||||
|
returnCode: 402,
|
||||||
|
};
|
||||||
|
}
|
||||||
return { success: true, data: res, returnCode: 200 };
|
return { success: true, data: res, returnCode: 200 };
|
||||||
}
|
}
|
||||||
|
|
||||||
res = res.filter((r) => !isUrlBlocked(r.url));
|
res = res.filter((r) => !isUrlBlocked(r.url));
|
||||||
|
if (res.length > num_results) {
|
||||||
|
res = res.slice(0, num_results);
|
||||||
|
}
|
||||||
|
|
||||||
if (res.length === 0) {
|
if (res.length === 0) {
|
||||||
return { success: true, error: "No search results found", returnCode: 200 };
|
return { success: true, error: "No search results found", returnCode: 200 };
|
||||||
@ -67,6 +85,7 @@ export async function searchHelper(
|
|||||||
onlyMainContent: pageOptions?.onlyMainContent ?? true,
|
onlyMainContent: pageOptions?.onlyMainContent ?? true,
|
||||||
fetchPageContent: pageOptions?.fetchPageContent ?? true,
|
fetchPageContent: pageOptions?.fetchPageContent ?? true,
|
||||||
includeHtml: pageOptions?.includeHtml ?? false,
|
includeHtml: pageOptions?.includeHtml ?? false,
|
||||||
|
removeTags: pageOptions?.removeTags ?? [],
|
||||||
fallback: false,
|
fallback: false,
|
||||||
},
|
},
|
||||||
});
|
});
|
||||||
@ -82,7 +101,7 @@ export async function searchHelper(
|
|||||||
);
|
);
|
||||||
|
|
||||||
if (filteredDocs.length === 0) {
|
if (filteredDocs.length === 0) {
|
||||||
return { success: true, error: "No page found", returnCode: 200 };
|
return { success: true, error: "No page found", returnCode: 200, data: docs };
|
||||||
}
|
}
|
||||||
|
|
||||||
const billingResult = await billTeam(
|
const billingResult = await billTeam(
|
||||||
@ -121,6 +140,7 @@ export async function searchController(req: Request, res: Response) {
|
|||||||
includeHtml: false,
|
includeHtml: false,
|
||||||
onlyMainContent: true,
|
onlyMainContent: true,
|
||||||
fetchPageContent: true,
|
fetchPageContent: true,
|
||||||
|
removeTags: [],
|
||||||
fallback: false,
|
fallback: false,
|
||||||
};
|
};
|
||||||
const origin = req.body.origin ?? "api";
|
const origin = req.body.origin ?? "api";
|
||||||
|
@ -5,59 +5,77 @@ import "dotenv/config";
|
|||||||
import { getWebScraperQueue } from "./services/queue-service";
|
import { getWebScraperQueue } from "./services/queue-service";
|
||||||
import { redisClient } from "./services/rate-limiter";
|
import { redisClient } from "./services/rate-limiter";
|
||||||
import { v0Router } from "./routes/v0";
|
import { v0Router } from "./routes/v0";
|
||||||
import { initSDK } from '@hyperdx/node-opentelemetry';
|
import { initSDK } from "@hyperdx/node-opentelemetry";
|
||||||
|
import cluster from "cluster";
|
||||||
|
import os from "os";
|
||||||
|
|
||||||
const { createBullBoard } = require("@bull-board/api");
|
const { createBullBoard } = require("@bull-board/api");
|
||||||
const { BullAdapter } = require("@bull-board/api/bullAdapter");
|
const { BullAdapter } = require("@bull-board/api/bullAdapter");
|
||||||
const { ExpressAdapter } = require("@bull-board/express");
|
const { ExpressAdapter } = require("@bull-board/express");
|
||||||
|
|
||||||
export const app = express();
|
const numCPUs = process.env.ENV === "local" ? 2 : os.cpus().length;
|
||||||
|
console.log(`Number of CPUs: ${numCPUs} available`);
|
||||||
|
|
||||||
global.isProduction = process.env.IS_PRODUCTION === "true";
|
if (cluster.isMaster) {
|
||||||
|
console.log(`Master ${process.pid} is running`);
|
||||||
|
|
||||||
app.use(bodyParser.urlencoded({ extended: true }));
|
// Fork workers.
|
||||||
app.use(bodyParser.json({ limit: "10mb" }));
|
for (let i = 0; i < numCPUs; i++) {
|
||||||
|
cluster.fork();
|
||||||
|
}
|
||||||
|
|
||||||
app.use(cors()); // Add this line to enable CORS
|
cluster.on("exit", (worker, code, signal) => {
|
||||||
|
console.log(`Worker ${worker.process.pid} exited`);
|
||||||
|
console.log("Starting a new worker");
|
||||||
|
cluster.fork();
|
||||||
|
});
|
||||||
|
} else {
|
||||||
|
const app = express();
|
||||||
|
|
||||||
const serverAdapter = new ExpressAdapter();
|
global.isProduction = process.env.IS_PRODUCTION === "true";
|
||||||
serverAdapter.setBasePath(`/admin/${process.env.BULL_AUTH_KEY}/queues`);
|
|
||||||
|
|
||||||
const { addQueue, removeQueue, setQueues, replaceQueues } = createBullBoard({
|
app.use(bodyParser.urlencoded({ extended: true }));
|
||||||
|
app.use(bodyParser.json({ limit: "10mb" }));
|
||||||
|
|
||||||
|
app.use(cors()); // Add this line to enable CORS
|
||||||
|
|
||||||
|
const serverAdapter = new ExpressAdapter();
|
||||||
|
serverAdapter.setBasePath(`/admin/${process.env.BULL_AUTH_KEY}/queues`);
|
||||||
|
|
||||||
|
const { addQueue, removeQueue, setQueues, replaceQueues } = createBullBoard({
|
||||||
queues: [new BullAdapter(getWebScraperQueue())],
|
queues: [new BullAdapter(getWebScraperQueue())],
|
||||||
serverAdapter: serverAdapter,
|
serverAdapter: serverAdapter,
|
||||||
});
|
});
|
||||||
|
|
||||||
app.use(
|
app.use(
|
||||||
`/admin/${process.env.BULL_AUTH_KEY}/queues`,
|
`/admin/${process.env.BULL_AUTH_KEY}/queues`,
|
||||||
serverAdapter.getRouter()
|
serverAdapter.getRouter()
|
||||||
);
|
);
|
||||||
|
|
||||||
app.get("/", (req, res) => {
|
app.get("/", (req, res) => {
|
||||||
res.send("SCRAPERS-JS: Hello, world! Fly.io");
|
res.send("SCRAPERS-JS: Hello, world! Fly.io");
|
||||||
});
|
});
|
||||||
|
|
||||||
//write a simple test function
|
//write a simple test function
|
||||||
app.get("/test", async (req, res) => {
|
app.get("/test", async (req, res) => {
|
||||||
res.send("Hello, world!");
|
res.send("Hello, world!");
|
||||||
});
|
});
|
||||||
|
|
||||||
// register router
|
// register router
|
||||||
app.use(v0Router);
|
app.use(v0Router);
|
||||||
|
|
||||||
const DEFAULT_PORT = process.env.PORT ?? 3002;
|
const DEFAULT_PORT = process.env.PORT ?? 3002;
|
||||||
const HOST = process.env.HOST ?? "localhost";
|
const HOST = process.env.HOST ?? "localhost";
|
||||||
redisClient.connect();
|
redisClient.connect();
|
||||||
|
|
||||||
// HyperDX OpenTelemetry
|
// HyperDX OpenTelemetry
|
||||||
if(process.env.ENV === 'production') {
|
if (process.env.ENV === "production") {
|
||||||
initSDK({ consoleCapture: true, additionalInstrumentations: []});
|
initSDK({ consoleCapture: true, additionalInstrumentations: [] });
|
||||||
}
|
}
|
||||||
|
|
||||||
|
function startServer(port = DEFAULT_PORT) {
|
||||||
export function startServer(port = DEFAULT_PORT) {
|
|
||||||
const server = app.listen(Number(port), HOST, () => {
|
const server = app.listen(Number(port), HOST, () => {
|
||||||
console.log(`Server listening on port ${port}`);
|
console.log(`Worker ${process.pid} listening on port ${port}`);
|
||||||
console.log(
|
console.log(
|
||||||
`For the UI, open http://${HOST}:${port}/admin/${process.env.BULL_AUTH_KEY}/queues`
|
`For the UI, open http://${HOST}:${port}/admin/${process.env.BULL_AUTH_KEY}/queues`
|
||||||
);
|
);
|
||||||
@ -68,14 +86,14 @@ export function startServer(port = DEFAULT_PORT) {
|
|||||||
);
|
);
|
||||||
});
|
});
|
||||||
return server;
|
return server;
|
||||||
}
|
}
|
||||||
|
|
||||||
if (require.main === module) {
|
if (require.main === module) {
|
||||||
startServer();
|
startServer();
|
||||||
}
|
}
|
||||||
|
|
||||||
// Use this as a "health check" that way we dont destroy the server
|
// Use this as a "health check" that way we dont destroy the server
|
||||||
app.get(`/admin/${process.env.BULL_AUTH_KEY}/queues`, async (req, res) => {
|
app.get(`/admin/${process.env.BULL_AUTH_KEY}/queues`, async (req, res) => {
|
||||||
try {
|
try {
|
||||||
const webScraperQueue = getWebScraperQueue();
|
const webScraperQueue = getWebScraperQueue();
|
||||||
const [webScraperActive] = await Promise.all([
|
const [webScraperActive] = await Promise.all([
|
||||||
@ -92,9 +110,9 @@ app.get(`/admin/${process.env.BULL_AUTH_KEY}/queues`, async (req, res) => {
|
|||||||
console.error(error);
|
console.error(error);
|
||||||
return res.status(500).json({ error: error.message });
|
return res.status(500).json({ error: error.message });
|
||||||
}
|
}
|
||||||
});
|
});
|
||||||
|
|
||||||
app.get(`/serverHealthCheck`, async (req, res) => {
|
app.get(`/serverHealthCheck`, async (req, res) => {
|
||||||
try {
|
try {
|
||||||
const webScraperQueue = getWebScraperQueue();
|
const webScraperQueue = getWebScraperQueue();
|
||||||
const [waitingJobs] = await Promise.all([
|
const [waitingJobs] = await Promise.all([
|
||||||
@ -110,9 +128,9 @@ app.get(`/serverHealthCheck`, async (req, res) => {
|
|||||||
console.error(error);
|
console.error(error);
|
||||||
return res.status(500).json({ error: error.message });
|
return res.status(500).json({ error: error.message });
|
||||||
}
|
}
|
||||||
});
|
});
|
||||||
|
|
||||||
app.get('/serverHealthCheck/notify', async (req, res) => {
|
app.get("/serverHealthCheck/notify", async (req, res) => {
|
||||||
if (process.env.SLACK_WEBHOOK_URL) {
|
if (process.env.SLACK_WEBHOOK_URL) {
|
||||||
const treshold = 1; // The treshold value for the active jobs
|
const treshold = 1; // The treshold value for the active jobs
|
||||||
const timeout = 60000; // 1 minute // The timeout value for the check in milliseconds
|
const timeout = 60000; // 1 minute // The timeout value for the check in milliseconds
|
||||||
@ -138,19 +156,21 @@ app.get('/serverHealthCheck/notify', async (req, res) => {
|
|||||||
if (waitingJobsCount >= treshold) {
|
if (waitingJobsCount >= treshold) {
|
||||||
const slackWebhookUrl = process.env.SLACK_WEBHOOK_URL;
|
const slackWebhookUrl = process.env.SLACK_WEBHOOK_URL;
|
||||||
const message = {
|
const message = {
|
||||||
text: `⚠️ Warning: The number of active jobs (${waitingJobsCount}) has exceeded the threshold (${treshold}) for more than ${timeout/60000} minute(s).`,
|
text: `⚠️ Warning: The number of active jobs (${waitingJobsCount}) has exceeded the threshold (${treshold}) for more than ${
|
||||||
|
timeout / 60000
|
||||||
|
} minute(s).`,
|
||||||
};
|
};
|
||||||
|
|
||||||
const response = await fetch(slackWebhookUrl, {
|
const response = await fetch(slackWebhookUrl, {
|
||||||
method: 'POST',
|
method: "POST",
|
||||||
headers: {
|
headers: {
|
||||||
'Content-Type': 'application/json',
|
"Content-Type": "application/json",
|
||||||
},
|
},
|
||||||
body: JSON.stringify(message),
|
body: JSON.stringify(message),
|
||||||
})
|
});
|
||||||
|
|
||||||
if (!response.ok) {
|
if (!response.ok) {
|
||||||
console.error('Failed to send Slack notification')
|
console.error("Failed to send Slack notification");
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}, timeout);
|
}, timeout);
|
||||||
@ -162,9 +182,38 @@ app.get('/serverHealthCheck/notify', async (req, res) => {
|
|||||||
|
|
||||||
checkWaitingJobs();
|
checkWaitingJobs();
|
||||||
}
|
}
|
||||||
});
|
});
|
||||||
|
|
||||||
|
app.get(
|
||||||
|
`/admin/${process.env.BULL_AUTH_KEY}/clean-before-24h-complete-jobs`,
|
||||||
|
async (req, res) => {
|
||||||
|
try {
|
||||||
|
const webScraperQueue = getWebScraperQueue();
|
||||||
|
const completedJobs = await webScraperQueue.getJobs(["completed"]);
|
||||||
|
const before24hJobs = completedJobs.filter(
|
||||||
|
(job) => job.finishedOn < Date.now() - 24 * 60 * 60 * 1000
|
||||||
|
);
|
||||||
|
const jobIds = before24hJobs.map((job) => job.id) as string[];
|
||||||
|
let count = 0;
|
||||||
|
for (const jobId of jobIds) {
|
||||||
|
try {
|
||||||
|
await webScraperQueue.removeJobs(jobId);
|
||||||
|
count++;
|
||||||
|
} catch (jobError) {
|
||||||
|
console.error(`Failed to remove job with ID ${jobId}:`, jobError);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
res.status(200).send(`Removed ${count} completed jobs.`);
|
||||||
|
} catch (error) {
|
||||||
|
console.error("Failed to clean last 24h complete jobs:", error);
|
||||||
|
res.status(500).send("Failed to clean jobs");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
);
|
||||||
|
|
||||||
app.get("/is-production", (req, res) => {
|
app.get("/is-production", (req, res) => {
|
||||||
res.send({ isProduction: global.isProduction });
|
res.send({ isProduction: global.isProduction });
|
||||||
});
|
});
|
||||||
|
|
||||||
|
console.log(`Worker ${process.pid} started`);
|
||||||
|
}
|
||||||
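Review note: the new `clean-before-24h-complete-jobs` route selects completed jobs whose `finishedOn` timestamp is more than 24 hours old and removes them one at a time. The selection step can be sketched in isolation as below. `FinishedJob` and `selectJobsOlderThan` are hypothetical names, and `now` is passed in explicitly (an assumption made for testability; the route itself calls `Date.now()`).

```typescript
// Hypothetical, simplified job shape: only the fields the filter needs.
interface FinishedJob {
  id: string;
  finishedOn?: number; // epoch milliseconds; may be unset on unfinished jobs
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Return the IDs of jobs that finished more than `maxAgeMs` before `now`.
function selectJobsOlderThan(
  jobs: FinishedJob[],
  now: number,
  maxAgeMs: number = DAY_MS
): string[] {
  return jobs
    .filter((job) => job.finishedOn !== undefined && job.finishedOn < now - maxAgeMs)
    .map((job) => job.id);
}
```

Separating selection from removal keeps the age threshold unit-testable without a live queue.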
|
@ -1,4 +1,3 @@
|
|||||||
import Turndown from "turndown";
|
|
||||||
import OpenAI from "openai";
|
import OpenAI from "openai";
|
||||||
import Ajv from "ajv";
|
import Ajv from "ajv";
|
||||||
const ajv = new Ajv(); // Initialize AJV for JSON schema validation
|
const ajv = new Ajv(); // Initialize AJV for JSON schema validation
|
||||||
|
@ -15,6 +15,12 @@ export type PageOptions = {
|
|||||||
includeHtml?: boolean;
|
includeHtml?: boolean;
|
||||||
fallback?: boolean;
|
fallback?: boolean;
|
||||||
fetchPageContent?: boolean;
|
fetchPageContent?: boolean;
|
||||||
|
waitFor?: number;
|
||||||
|
screenshot?: boolean;
|
||||||
|
headers?: Record<string, string>;
|
||||||
|
replaceAllPathsWithAbsolutePaths?: boolean;
|
||||||
|
parsePDF?: boolean;
|
||||||
|
removeTags?: string | string[];
|
||||||
};
|
};
|
||||||
|
|
||||||
export type ExtractorOptions = {
|
export type ExtractorOptions = {
|
||||||
@ -32,10 +38,7 @@ export type SearchOptions = {
|
|||||||
location?: string;
|
location?: string;
|
||||||
};
|
};
|
||||||
|
|
||||||
export type WebScraperOptions = {
|
export type CrawlerOptions = {
|
||||||
urls: string[];
|
|
||||||
mode: "single_urls" | "sitemap" | "crawl";
|
|
||||||
crawlerOptions?: {
|
|
||||||
returnOnlyUrls?: boolean;
|
returnOnlyUrls?: boolean;
|
||||||
includes?: string[];
|
includes?: string[];
|
||||||
excludes?: string[];
|
excludes?: string[];
|
||||||
@ -44,8 +47,15 @@ export type WebScraperOptions = {
|
|||||||
limit?: number;
|
limit?: number;
|
||||||
generateImgAltText?: boolean;
|
generateImgAltText?: boolean;
|
||||||
replaceAllPathsWithAbsolutePaths?: boolean;
|
replaceAllPathsWithAbsolutePaths?: boolean;
|
||||||
|
ignoreSitemap?: boolean;
|
||||||
mode?: "default" | "fast"; // have a mode of some sort
|
mode?: "default" | "fast"; // have a mode of some sort
|
||||||
};
|
allowBackwardCrawling?: boolean;
|
||||||
|
}
|
||||||
|
|
||||||
|
export type WebScraperOptions = {
|
||||||
|
urls: string[];
|
||||||
|
mode: "single_urls" | "sitemap" | "crawl";
|
||||||
|
crawlerOptions?: CrawlerOptions;
|
||||||
pageOptions?: PageOptions;
|
pageOptions?: PageOptions;
|
||||||
extractorOptions?: ExtractorOptions;
|
extractorOptions?: ExtractorOptions;
|
||||||
concurrentRequests?: number;
|
concurrentRequests?: number;
|
||||||
@ -74,6 +84,8 @@ export class Document {
|
|||||||
provider?: string;
|
provider?: string;
|
||||||
warning?: string;
|
warning?: string;
|
||||||
|
|
||||||
|
index?: number;
|
||||||
|
|
||||||
constructor(data: Partial<Document>) {
|
constructor(data: Partial<Document>) {
|
||||||
if (!data.content) {
|
if (!data.content) {
|
||||||
throw new Error("Missing required fields");
|
throw new Error("Missing required fields");
|
||||||
@ -105,3 +117,11 @@ export class SearchResult {
|
|||||||
return `SearchResult(url=${this.url}, title=${this.title}, description=${this.description})`;
|
return `SearchResult(url=${this.url}, title=${this.title}, description=${this.description})`;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
export interface FireEngineResponse {
|
||||||
|
html: string;
|
||||||
|
screenshot: string;
|
||||||
|
pageStatusCode?: number;
|
||||||
|
pageError?: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
@ -1,42 +1,42 @@
|
|||||||
import { scrapWithFireEngine } from "../../src/scraper/WebScraper/single_url";
|
// import { scrapWithFireEngine } from "../../src/scraper/WebScraper/single_url";
|
||||||
|
|
||||||
const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
|
// const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
|
||||||
|
|
||||||
const scrapInBatches = async (
|
// const scrapInBatches = async (
|
||||||
urls: string[],
|
// urls: string[],
|
||||||
batchSize: number,
|
// batchSize: number,
|
||||||
delayMs: number
|
// delayMs: number
|
||||||
) => {
|
// ) => {
|
||||||
let successCount = 0;
|
// let successCount = 0;
|
||||||
let errorCount = 0;
|
// let errorCount = 0;
|
||||||
|
|
||||||
for (let i = 0; i < urls.length; i += batchSize) {
|
// for (let i = 0; i < urls.length; i += batchSize) {
|
||||||
const batch = urls
|
// const batch = urls
|
||||||
.slice(i, i + batchSize)
|
// .slice(i, i + batchSize)
|
||||||
.map((url) => scrapWithFireEngine(url));
|
// .map((url) => scrapWithFireEngine(url));
|
||||||
try {
|
// try {
|
||||||
const results = await Promise.all(batch);
|
// const results = await Promise.all(batch);
|
||||||
results.forEach((data, index) => {
|
// results.forEach((data, index) => {
|
||||||
if (data.trim() === "") {
|
// if (data.trim() === "") {
|
||||||
errorCount++;
|
// errorCount++;
|
||||||
} else {
|
// } else {
|
||||||
successCount++;
|
// successCount++;
|
||||||
console.log(
|
// console.log(
|
||||||
`Scraping result ${i + index + 1}:`,
|
// `Scraping result ${i + index + 1}:`,
|
||||||
data.trim().substring(0, 20) + "..."
|
// data.trim().substring(0, 20) + "..."
|
||||||
);
|
// );
|
||||||
}
|
// }
|
||||||
});
|
// });
|
||||||
} catch (error) {
|
// } catch (error) {
|
||||||
console.error("Error during scraping:", error);
|
// console.error("Error during scraping:", error);
|
||||||
}
|
// }
|
||||||
await delay(delayMs);
|
// await delay(delayMs);
|
||||||
}
|
// }
|
||||||
|
|
||||||
console.log(`Total successful scrapes: ${successCount}`);
|
// console.log(`Total successful scrapes: ${successCount}`);
|
||||||
console.log(`Total errored scrapes: ${errorCount}`);
|
// console.log(`Total errored scrapes: ${errorCount}`);
|
||||||
};
|
// };
|
||||||
function run() {
|
// function run() {
|
||||||
const urls = Array.from({ length: 200 }, () => "https://scrapethissite.com");
|
// const urls = Array.from({ length: 200 }, () => "https://scrapethissite.com");
|
||||||
scrapInBatches(urls, 10, 1000);
|
// scrapInBatches(urls, 10, 1000);
|
||||||
}
|
// }
|
||||||
|
@ -19,6 +19,9 @@ export async function startWebScraperPipeline({
|
|||||||
inProgress: (progress) => {
|
inProgress: (progress) => {
|
||||||
if (progress.currentDocument) {
|
if (progress.currentDocument) {
|
||||||
partialDocs.push(progress.currentDocument);
|
partialDocs.push(progress.currentDocument);
|
||||||
|
if (partialDocs.length > 50) {
|
||||||
|
partialDocs = partialDocs.slice(-50);
|
||||||
|
}
|
||||||
job.progress({ ...progress, partialDocs: partialDocs });
|
job.progress({ ...progress, partialDocs: partialDocs });
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
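Review note: the hunk above caps `partialDocs` at the 50 most recent documents, so the progress payload stops growing with every scraped page. A minimal sketch of the same sliding-window idea follows. The helper name `pushBounded` is hypothetical, and it returns a new array rather than mutating in place, which is a choice made for this example only.

```typescript
// Hypothetical helper: append `item` and keep only the last `max` entries.
function pushBounded<T>(buffer: T[], item: T, max: number = 50): T[] {
  if (max <= 0) {
    return []; // guard: slice(-0) would return the whole array
  }
  const next = [...buffer, item];
  return next.length > max ? next.slice(-max) : next;
}
```

With the default cap, pushing 120 items one after another leaves only the final 50 in the buffer.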
|
@ -3,7 +3,7 @@ import cheerio, { load } from "cheerio";
|
|||||||
import { URL } from "url";
|
import { URL } from "url";
|
||||||
import { getLinksFromSitemap } from "./sitemap";
|
import { getLinksFromSitemap } from "./sitemap";
|
||||||
import async from "async";
|
import async from "async";
|
||||||
import { Progress } from "../../lib/entities";
|
import { CrawlerOptions, PageOptions, Progress } from "../../lib/entities";
|
||||||
import { scrapSingleUrl, scrapWithScrapingBee } from "./single_url";
|
import { scrapSingleUrl, scrapWithScrapingBee } from "./single_url";
|
||||||
import robotsParser from "robots-parser";
|
import robotsParser from "robots-parser";
|
||||||
|
|
||||||
@ -20,15 +20,17 @@ export class WebCrawler {
|
|||||||
private robotsTxtUrl: string;
|
private robotsTxtUrl: string;
|
||||||
private robots: any;
|
private robots: any;
|
||||||
private generateImgAltText: boolean;
|
private generateImgAltText: boolean;
|
||||||
|
private allowBackwardCrawling: boolean;
|
||||||
|
|
||||||
constructor({
|
constructor({
|
||||||
initialUrl,
|
initialUrl,
|
||||||
includes,
|
includes,
|
||||||
excludes,
|
excludes,
|
||||||
maxCrawledLinks,
|
maxCrawledLinks = 10000,
|
||||||
limit = 10000,
|
limit = 10000,
|
||||||
generateImgAltText = false,
|
generateImgAltText = false,
|
||||||
maxCrawledDepth = 10,
|
maxCrawledDepth = 10,
|
||||||
|
allowBackwardCrawling = false
|
||||||
}: {
|
}: {
|
||||||
initialUrl: string;
|
initialUrl: string;
|
||||||
includes?: string[];
|
includes?: string[];
|
||||||
@ -37,6 +39,7 @@ export class WebCrawler {
|
|||||||
limit?: number;
|
limit?: number;
|
||||||
generateImgAltText?: boolean;
|
generateImgAltText?: boolean;
|
||||||
maxCrawledDepth?: number;
|
maxCrawledDepth?: number;
|
||||||
|
allowBackwardCrawling?: boolean;
|
||||||
}) {
|
}) {
|
||||||
this.initialUrl = initialUrl;
|
this.initialUrl = initialUrl;
|
||||||
this.baseUrl = new URL(initialUrl).origin;
|
this.baseUrl = new URL(initialUrl).origin;
|
||||||
@ -49,6 +52,7 @@ export class WebCrawler {
|
|||||||
this.maxCrawledLinks = maxCrawledLinks ?? limit;
|
this.maxCrawledLinks = maxCrawledLinks ?? limit;
|
||||||
this.maxCrawledDepth = maxCrawledDepth ?? 10;
|
this.maxCrawledDepth = maxCrawledDepth ?? 10;
|
||||||
this.generateImgAltText = generateImgAltText ?? false;
|
this.generateImgAltText = generateImgAltText ?? false;
|
||||||
|
this.allowBackwardCrawling = allowBackwardCrawling ?? false;
|
||||||
}
|
}
|
||||||
|
|
||||||
private filterLinks(sitemapLinks: string[], limit: number, maxDepth: number): string[] {
|
private filterLinks(sitemapLinks: string[], limit: number, maxDepth: number): string[] {
|
||||||
@ -90,10 +94,16 @@ export class WebCrawler {
|
|||||||
const linkHostname = normalizedLink.hostname.replace(/^www\./, '');
|
const linkHostname = normalizedLink.hostname.replace(/^www\./, '');
|
||||||
|
|
||||||
// Ensure the protocol and hostname match, and the path starts with the initial URL's path
|
// Ensure the protocol and hostname match, and the path starts with the initial URL's path
|
||||||
if (linkHostname !== initialHostname || !normalizedLink.pathname.startsWith(normalizedInitialUrl.pathname)) {
|
if (linkHostname !== initialHostname) {
|
||||||
return false;
|
return false;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
if (!this.allowBackwardCrawling) {
|
||||||
|
if (!normalizedLink.pathname.startsWith(normalizedInitialUrl.pathname)) {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
const isAllowed = this.robots.isAllowed(link, "FireCrawlAgent") ?? true;
|
const isAllowed = this.robots.isAllowed(link, "FireCrawlAgent") ?? true;
|
||||||
// Check if the link is disallowed by robots.txt
|
// Check if the link is disallowed by robots.txt
|
||||||
if (!isAllowed) {
|
if (!isAllowed) {
|
||||||
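Review note: this hunk splits the old combined condition in two. The hostname must always match, while the "path must start with the initial URL's path" rule now applies only when `allowBackwardCrawling` is false. A self-contained sketch of that decision is shown below. The function name `isPathInScope` is hypothetical, and stripping a leading `www.` from both hostnames is an assumption based on the context line above (only the link side is visible in the diff).

```typescript
// Hypothetical standalone version of the scope check in the hunk above.
function isPathInScope(
  link: string,
  initialUrl: string,
  allowBackwardCrawling: boolean = false
): boolean {
  const linkUrl = new URL(link);
  const initial = new URL(initialUrl);
  const stripWww = (hostname: string) => hostname.replace(/^www\./, "");

  // The hostname check always applies.
  if (stripWww(linkUrl.hostname) !== stripWww(initial.hostname)) {
    return false;
  }
  // The path-prefix check applies only when backward crawling is disabled.
  if (!allowBackwardCrawling && !linkUrl.pathname.startsWith(initial.pathname)) {
    return false;
  }
  return true;
}
```

Starting from `https://example.com/docs`, a link to `/blog` on the same host is rejected by default and accepted once `allowBackwardCrawling` is true.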
@ -108,6 +118,8 @@ export class WebCrawler {
|
|||||||
|
|
||||||
public async start(
|
public async start(
|
||||||
inProgress?: (progress: Progress) => void,
|
inProgress?: (progress: Progress) => void,
|
||||||
|
pageOptions?: PageOptions,
|
||||||
|
crawlerOptions?: CrawlerOptions,
|
||||||
concurrencyLimit: number = 5,
|
concurrencyLimit: number = 5,
|
||||||
limit: number = 10000,
|
limit: number = 10000,
|
||||||
maxDepth: number = 10
|
maxDepth: number = 10
|
||||||
@ -122,17 +134,21 @@ export class WebCrawler {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
if(!crawlerOptions?.ignoreSitemap){
|
||||||
const sitemapLinks = await this.tryFetchSitemapLinks(this.initialUrl);
|
const sitemapLinks = await this.tryFetchSitemapLinks(this.initialUrl);
|
||||||
if (sitemapLinks.length > 0) {
|
if (sitemapLinks.length > 0) {
|
||||||
let filteredLinks = this.filterLinks(sitemapLinks, limit, maxDepth);
|
let filteredLinks = this.filterLinks(sitemapLinks, limit, maxDepth);
|
||||||
return filteredLinks.map(link => ({ url: link, html: "" }));
|
return filteredLinks.map(link => ({ url: link, html: "" }));
|
||||||
}
|
}
|
||||||
|
}
|
||||||
|
|
||||||
const urls = await this.crawlUrls(
|
const urls = await this.crawlUrls(
|
||||||
[this.initialUrl],
|
[this.initialUrl],
|
||||||
|
pageOptions,
|
||||||
concurrencyLimit,
|
concurrencyLimit,
|
||||||
inProgress
|
inProgress
|
||||||
);
|
);
|
||||||
|
|
||||||
if (
|
if (
|
||||||
urls.length === 0 &&
|
urls.length === 0 &&
|
||||||
this.filterLinks([this.initialUrl], limit, this.maxCrawledDepth).length > 0
|
this.filterLinks([this.initialUrl], limit, this.maxCrawledDepth).length > 0
|
||||||
@ -140,7 +156,6 @@ export class WebCrawler {
|
|||||||
return [{ url: this.initialUrl, html: "" }];
|
return [{ url: this.initialUrl, html: "" }];
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
// make sure to run include exclude here again
|
// make sure to run include exclude here again
|
||||||
const filteredUrls = this.filterLinks(urls.map(urlObj => urlObj.url), limit, this.maxCrawledDepth);
|
const filteredUrls = this.filterLinks(urls.map(urlObj => urlObj.url), limit, this.maxCrawledDepth);
|
||||||
return filteredUrls.map(url => ({ url, html: urls.find(urlObj => urlObj.url === url)?.html || "" }));
|
return filteredUrls.map(url => ({ url, html: urls.find(urlObj => urlObj.url === url)?.html || "" }));
|
||||||
@ -148,17 +163,18 @@ export class WebCrawler {
|
|||||||
|
|
||||||
private async crawlUrls(
|
private async crawlUrls(
|
||||||
urls: string[],
|
urls: string[],
|
||||||
|
pageOptions: PageOptions,
|
||||||
concurrencyLimit: number,
|
concurrencyLimit: number,
|
||||||
inProgress?: (progress: Progress) => void,
|
inProgress?: (progress: Progress) => void,
|
||||||
): Promise<{ url: string, html: string }[]> {
|
): Promise<{ url: string, html: string }[]> {
|
||||||
const queue = async.queue(async (task: string, callback) => {
|
const queue = async.queue(async (task: string, callback) => {
|
||||||
if (this.crawledUrls.size >= this.maxCrawledLinks) {
|
if (this.crawledUrls.size >= Math.min(this.maxCrawledLinks, this.limit)) {
|
||||||
if (callback && typeof callback === "function") {
|
if (callback && typeof callback === "function") {
|
||||||
callback();
|
callback();
|
||||||
}
|
}
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
const newUrls = await this.crawl(task);
|
const newUrls = await this.crawl(task, pageOptions);
|
||||||
// add the initial url if not already added
|
// add the initial url if not already added
|
||||||
// if (this.visited.size === 1) {
|
// if (this.visited.size === 1) {
|
||||||
// let normalizedInitial = this.initialUrl;
|
// let normalizedInitial = this.initialUrl;
|
||||||
@ -176,19 +192,19 @@ export class WebCrawler {
|
|||||||
if (inProgress && newUrls.length > 0) {
|
if (inProgress && newUrls.length > 0) {
|
||||||
inProgress({
|
inProgress({
|
||||||
current: this.crawledUrls.size,
|
current: this.crawledUrls.size,
|
||||||
total: this.maxCrawledLinks,
|
total: Math.min(this.maxCrawledLinks, this.limit),
|
||||||
status: "SCRAPING",
|
status: "SCRAPING",
|
||||||
currentDocumentUrl: newUrls[newUrls.length - 1].url,
|
currentDocumentUrl: newUrls[newUrls.length - 1].url,
|
||||||
});
|
});
|
||||||
} else if (inProgress) {
|
} else if (inProgress) {
|
||||||
inProgress({
|
inProgress({
|
||||||
current: this.crawledUrls.size,
|
current: this.crawledUrls.size,
|
||||||
total: this.maxCrawledLinks,
|
total: Math.min(this.maxCrawledLinks, this.limit),
|
||||||
status: "SCRAPING",
|
status: "SCRAPING",
|
||||||
currentDocumentUrl: task,
|
currentDocumentUrl: task,
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
await this.crawlUrls(newUrls.map((p) => p.url), concurrencyLimit, inProgress);
|
await this.crawlUrls(newUrls.map((p) => p.url), pageOptions, concurrencyLimit, inProgress);
|
||||||
if (callback && typeof callback === "function") {
|
if (callback && typeof callback === "function") {
|
||||||
callback();
|
callback();
|
||||||
}
|
}
|
||||||
@ -207,20 +223,18 @@ export class WebCrawler {
|
|||||||
return Array.from(this.crawledUrls.entries()).map(([url, html]) => ({ url, html }));
|
return Array.from(this.crawledUrls.entries()).map(([url, html]) => ({ url, html }));
|
||||||
}
|
}
|
||||||
|
|
||||||
async crawl(url: string): Promise<{url: string, html: string}[]> {
|
async crawl(url: string, pageOptions: PageOptions): Promise<{url: string, html: string, pageStatusCode?: number, pageError?: string}[]> {
|
||||||
if (this.visited.has(url) || !this.robots.isAllowed(url, "FireCrawlAgent")){
|
const normalizedUrl = this.normalizeCrawlUrl(url);
|
||||||
|
if (this.visited.has(normalizedUrl) || !this.robots.isAllowed(url, "FireCrawlAgent")) {
|
||||||
return [];
|
return [];
|
||||||
}
|
}
|
||||||
this.visited.add(url);
|
this.visited.add(normalizedUrl);
|
||||||
|
|
||||||
|
|
||||||
if (!url.startsWith("http")) {
|
if (!url.startsWith("http")) {
|
||||||
url = "https://" + url;
|
url = "https://" + url;
|
||||||
|
|
||||||
}
|
}
|
||||||
if (url.endsWith("/")) {
|
if (url.endsWith("/")) {
|
||||||
url = url.slice(0, -1);
|
url = url.slice(0, -1);
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
||||||
if (this.isFile(url) || this.isSocialMediaOrEmail(url)) {
|
if (this.isFile(url) || this.isSocialMediaOrEmail(url)) {
|
||||||
@ -228,25 +242,30 @@ export class WebCrawler {
|
|||||||
}
|
}
|
||||||
|
|
||||||
try {
|
try {
|
||||||
let content : string = "";
|
let content: string = "";
|
||||||
|
let pageStatusCode: number;
|
||||||
|
let pageError: string | undefined = undefined;
|
||||||
|
|
||||||
// If it is the first link, fetch with single url
|
// If it is the first link, fetch with single url
|
||||||
if (this.visited.size === 1) {
|
if (this.visited.size === 1) {
|
||||||
const page = await scrapSingleUrl(url, {includeHtml: true});
|
const page = await scrapSingleUrl(url, { ...pageOptions, includeHtml: true });
|
||||||
content = page.html ?? ""
|
content = page.html ?? "";
|
||||||
|
pageStatusCode = page.metadata?.pageStatusCode;
|
||||||
|
pageError = page.metadata?.pageError || undefined;
|
||||||
} else {
|
} else {
|
||||||
const response = await axios.get(url);
|
const response = await axios.get(url);
|
||||||
content = response.data ?? "";
|
content = response.data ?? "";
|
||||||
|
pageStatusCode = response.status;
|
||||||
|
pageError = response.statusText != "OK" ? response.statusText : undefined;
|
||||||
}
|
}
|
||||||
const $ = load(content);
|
const $ = load(content);
|
||||||
let links: {url: string, html: string}[] = [];
|
let links: { url: string, html: string, pageStatusCode?: number, pageError?: string }[] = [];
|
||||||
|
|
||||||
// Add the initial URL to the list of links
|
// Add the initial URL to the list of links
|
||||||
if(this.visited.size === 1)
|
if (this.visited.size === 1) {
|
||||||
{
|
links.push({ url, html: content, pageStatusCode, pageError });
|
||||||
links.push({url, html: content});
|
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
$("a").each((_, element) => {
|
$("a").each((_, element) => {
|
||||||
const href = $(element).attr("href");
|
const href = $(element).attr("href");
|
||||||
if (href) {
|
if (href) {
|
||||||
@ -254,32 +273,43 @@ export class WebCrawler {
|
|||||||
if (!href.startsWith("http")) {
|
if (!href.startsWith("http")) {
|
||||||
fullUrl = new URL(href, this.baseUrl).toString();
|
fullUrl = new URL(href, this.baseUrl).toString();
|
||||||
}
|
}
|
||||||
const url = new URL(fullUrl);
|
const urlObj = new URL(fullUrl);
|
||||||
const path = url.pathname;
|
const path = urlObj.pathname;
|
||||||
|
|
||||||
if (
|
if (
|
||||||
this.isInternalLink(fullUrl) &&
|
this.isInternalLink(fullUrl) &&
|
||||||
this.matchesPattern(fullUrl) &&
|
this.matchesPattern(fullUrl) &&
|
||||||
this.noSections(fullUrl) &&
|
this.noSections(fullUrl) &&
|
||||||
this.matchesIncludes(path) &&
|
// The idea here to comment this out is to allow wider website coverage as we filter this anyway afterwards
|
||||||
|
// this.matchesIncludes(path) &&
|
||||||
!this.matchesExcludes(path) &&
|
!this.matchesExcludes(path) &&
|
||||||
this.robots.isAllowed(fullUrl, "FireCrawlAgent")
|
this.robots.isAllowed(fullUrl, "FireCrawlAgent")
|
||||||
) {
|
) {
|
||||||
links.push({url: fullUrl, html: content});
|
links.push({ url: fullUrl, html: content, pageStatusCode, pageError });
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
});
|
});
|
||||||
|
|
||||||
if(this.visited.size === 1){
|
if (this.visited.size === 1) {
|
||||||
return links;
|
return links;
|
||||||
}
|
}
|
||||||
// Create a new list to return to avoid modifying the visited list
|
// Create a new list to return to avoid modifying the visited list
|
||||||
return links.filter((link) => !this.visited.has(link.url));
|
return links.filter((link) => !this.visited.has(this.normalizeCrawlUrl(link.url)));
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
return [];
|
return [];
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
private normalizeCrawlUrl(url: string): string {
|
||||||
|
try{
|
||||||
|
const urlObj = new URL(url);
|
||||||
|
urlObj.searchParams.sort(); // Sort query parameters to normalize
|
||||||
|
return urlObj.toString();
|
||||||
|
} catch (error) {
|
||||||
|
return url;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
private matchesIncludes(url: string): boolean {
|
private matchesIncludes(url: string): boolean {
|
||||||
if (this.includes.length === 0 || this.includes[0] == "") return true;
|
if (this.includes.length === 0 || this.includes[0] == "") return true;
|
||||||
return this.includes.some((pattern) => new RegExp(pattern).test(url));
|
return this.includes.some((pattern) => new RegExp(pattern).test(url));
|
||||||
@ -324,6 +354,12 @@ export class WebCrawler {
|
|||||||
// ".docx",
|
// ".docx",
|
||||||
".xlsx",
|
".xlsx",
|
||||||
".xml",
|
".xml",
|
||||||
|
".avi",
|
||||||
|
".flv",
|
||||||
|
".woff",
|
||||||
|
".ttf",
|
||||||
|
".woff2",
|
||||||
|
".webp"
|
||||||
];
|
];
|
||||||
return fileExtensions.some((ext) => url.endsWith(ext));
|
return fileExtensions.some((ext) => url.endsWith(ext));
|
||||||
}
|
}
|
||||||
@ -382,7 +418,6 @@ export class WebCrawler {
|
|||||||
|
|
||||||
// Normalize and check if the URL is present in any of the sitemaps
|
// Normalize and check if the URL is present in any of the sitemaps
|
||||||
const normalizedUrl = normalizeUrl(url);
|
const normalizedUrl = normalizeUrl(url);
|
||||||
|
|
||||||
const normalizedSitemapLinks = sitemapLinks.map(link => normalizeUrl(link));
|
const normalizedSitemapLinks = sitemapLinks.map(link => normalizeUrl(link));
|
||||||
|
|
||||||
// has to be greater than 0 to avoid adding the initial URL to the sitemap links, which would prevent the crawler from crawling
|
// has to be greater than 0 to avoid adding the initial URL to the sitemap links, which would prevent the crawler from crawling
|
||||||
|
@ -0,0 +1,51 @@
|
|||||||
|
export async function handleCustomScraping(
|
||||||
|
text: string,
|
||||||
|
url: string
|
||||||
|
): Promise<{ scraper: string; url: string; waitAfterLoad?: number, pageOptions?: { scrollXPaths?: string[] } } | null> {
|
||||||
|
// Check for Readme Docs special case
|
||||||
|
if (text.includes('<meta name="readme-deploy"')) {
|
||||||
|
console.log(
|
||||||
|
`Special use case detected for ${url}, using Fire Engine with wait time 1000ms`
|
||||||
|
);
|
||||||
|
return {
|
||||||
|
scraper: "fire-engine",
|
||||||
|
url: url,
|
||||||
|
waitAfterLoad: 1000,
|
||||||
|
pageOptions: {
|
||||||
|
scrollXPaths: ['//*[@id="ReferencePlayground"]/section[3]/div/pre/div/div/div[5]']
|
||||||
|
}
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
// Check for Vanta security portals
|
||||||
|
if (text.includes('<link href="https://static.vanta.com')) {
|
||||||
|
console.log(
|
||||||
|
`Vanta link detected for ${url}, using Fire Engine with wait time 3000ms`
|
||||||
|
);
|
||||||
|
return {
|
||||||
|
scraper: "fire-engine",
|
||||||
|
url: url,
|
||||||
|
waitAfterLoad: 3000,
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
// Check for Google Drive PDF links in the raw HTML
|
||||||
|
const googleDrivePdfPattern =
|
||||||
|
/https:\/\/drive\.google\.com\/file\/d\/([^\/]+)\/view/;
|
||||||
|
const googleDrivePdfLink = text.match(googleDrivePdfPattern);
|
||||||
|
if (googleDrivePdfLink) {
|
||||||
|
console.log(
|
||||||
|
`Google Drive PDF link detected for ${url}: ${googleDrivePdfLink[0]}`
|
||||||
|
);
|
||||||
|
|
||||||
|
const fileId = googleDrivePdfLink[1];
|
||||||
|
const pdfUrl = `https://drive.google.com/uc?export=download&id=${fileId}`;
|
||||||
|
|
||||||
|
return {
|
||||||
|
scraper: "pdf",
|
||||||
|
url: pdfUrl
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
return null;
|
||||||
|
}
|
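The Google Drive branch of `handleCustomScraping` can be exercised in isolation. The sketch below reuses the same regular expression and download URL template from the diff; the helper name `toDriveDownloadUrl` and the sample HTML (including the placeholder file ID) are assumptions made for illustration.

```typescript
// Same pattern as in the diff: captures the file ID from a Drive "view" link.
const drivePattern = /https:\/\/drive\.google\.com\/file\/d\/([^\/]+)\/view/;

// Hypothetical helper: returns the direct-download URL, or null if no Drive link is found.
function toDriveDownloadUrl(html: string): string | null {
  const match = html.match(drivePattern);
  if (!match) return null;
  const fileId = match[1];
  return `https://drive.google.com/uc?export=download&id=${fileId}`;
}

// Placeholder markup with a made-up file ID.
const sampleHtml =
  '<a href="https://drive.google.com/file/d/FILE_ID_123/view?usp=sharing">Report</a>';

console.log(toDriveDownloadUrl(sampleHtml));
console.log(toDriveDownloadUrl("<p>no drive link here</p>"));
```

Only the first match in the page is used, which mirrors the non-global `text.match(...)` call in the diff.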
@ -31,12 +31,14 @@ export class WebScraperDataProvider {
|
|||||||
private limit: number = 10000;
|
private limit: number = 10000;
|
||||||
private concurrentRequests: number = 20;
|
private concurrentRequests: number = 20;
|
||||||
private generateImgAltText: boolean = false;
|
private generateImgAltText: boolean = false;
|
||||||
|
private ignoreSitemap: boolean = false;
|
||||||
private pageOptions?: PageOptions;
|
private pageOptions?: PageOptions;
|
||||||
private extractorOptions?: ExtractorOptions;
|
private extractorOptions?: ExtractorOptions;
|
||||||
private replaceAllPathsWithAbsolutePaths?: boolean = false;
|
private replaceAllPathsWithAbsolutePaths?: boolean = false;
|
||||||
private generateImgAltTextModel: "gpt-4-turbo" | "claude-3-opus" =
|
private generateImgAltTextModel: "gpt-4-turbo" | "claude-3-opus" =
|
||||||
"gpt-4-turbo";
|
"gpt-4-turbo";
|
||||||
private crawlerMode: string = "default";
|
private crawlerMode: string = "default";
|
||||||
|
private allowBackwardCrawling: boolean = false;
|
||||||
|
|
||||||
authorize(): void {
|
authorize(): void {
|
||||||
throw new Error("Method not implemented.");
|
throw new Error("Method not implemented.");
|
||||||
@ -72,7 +74,7 @@ export class WebScraperDataProvider {
|
|||||||
total: totalUrls,
|
total: totalUrls,
|
||||||
status: "SCRAPING",
|
status: "SCRAPING",
|
||||||
currentDocumentUrl: url,
|
currentDocumentUrl: url,
|
||||||
currentDocument: result,
|
currentDocument: { ...result, index: processedUrls },
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -84,13 +86,15 @@ export class WebScraperDataProvider {
|
|||||||
const job = await getWebScraperQueue().getJob(this.bullJobId);
|
const job = await getWebScraperQueue().getJob(this.bullJobId);
|
||||||
const jobStatus = await job.getState();
|
const jobStatus = await job.getState();
|
||||||
if (jobStatus === "failed") {
|
if (jobStatus === "failed") {
|
||||||
throw new Error(
|
console.error(
|
||||||
"Job has failed or has been cancelled by the user. Stopping the job..."
|
"Job has failed or has been cancelled by the user. Stopping the job..."
|
||||||
);
|
);
|
||||||
|
return [] as Document[];
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error(error);
|
console.error(error);
|
||||||
|
return [] as Document[];
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
return results.filter((result) => result !== null) as Document[];
|
return results.filter((result) => result !== null) as Document[];
|
||||||
@ -159,18 +163,27 @@ export class WebScraperDataProvider {
|
|||||||
inProgress?: (progress: Progress) => void
|
inProgress?: (progress: Progress) => void
|
||||||
): Promise<Document[]> {
|
): Promise<Document[]> {
|
||||||
|
|
||||||
|
const pathSplits = new URL(this.urls[0]).pathname.split('/');
|
||||||
|
const baseURLDepth = pathSplits.length - (pathSplits[0].length === 0 && pathSplits[pathSplits.length - 1].length === 0 ? 1 : 0);
|
||||||
|
const adjustedMaxDepth = this.maxCrawledDepth + baseURLDepth;
|
||||||
|
|
||||||
const crawler = new WebCrawler({
|
const crawler = new WebCrawler({
|
||||||
initialUrl: this.urls[0],
|
initialUrl: this.urls[0],
|
||||||
includes: this.includes,
|
includes: this.includes,
|
||||||
excludes: this.excludes,
|
excludes: this.excludes,
|
||||||
maxCrawledLinks: this.maxCrawledLinks,
|
maxCrawledLinks: this.maxCrawledLinks,
|
||||||
maxCrawledDepth: this.maxCrawledDepth,
|
maxCrawledDepth: adjustedMaxDepth,
|
||||||
limit: this.limit,
|
limit: this.limit,
|
||||||
generateImgAltText: this.generateImgAltText,
|
generateImgAltText: this.generateImgAltText,
|
||||||
|
allowBackwardCrawling: this.allowBackwardCrawling,
|
||||||
});
|
});
|
||||||
|
|
||||||
let links = await crawler.start(
|
let links = await crawler.start(
|
||||||
inProgress,
|
inProgress,
|
||||||
|
this.pageOptions,
|
||||||
|
{
|
||||||
|
ignoreSitemap: this.ignoreSitemap,
|
||||||
|
},
|
||||||
5,
|
5,
|
||||||
this.limit,
|
this.limit,
|
||||||
this.maxCrawledDepth
|
this.maxCrawledDepth
|
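The depth adjustment added in this hunk is easier to follow with concrete values. The sketch below reproduces the `baseURLDepth` arithmetic as a standalone function; the name `baseUrlDepth`, the example URLs, and the `maxCrawledDepth` value of 2 are illustrative assumptions.

```typescript
// Hypothetical standalone version of the baseURLDepth computation from the diff.
function baseUrlDepth(url: string): number {
  const pathSplits = new URL(url).pathname.split("/");
  // A pathname always starts with "/", so the first segment is empty.
  // If the last segment is empty too (trailing slash), one segment is discounted.
  const hasTrailingSlash =
    pathSplits[0] === "" && pathSplits[pathSplits.length - 1] === "";
  return pathSplits.length - (hasTrailingSlash ? 1 : 0);
}

const rootDepth = baseUrlDepth("https://example.com/");            // ["", ""] gives 1
const nestedDepth = baseUrlDepth("https://example.com/docs/api");   // ["", "docs", "api"] gives 3
const slashDepth = baseUrlDepth("https://example.com/docs/api/");   // trailing slash ignored, gives 3

// With an assumed user-supplied maxCrawledDepth of 2, the crawler receives 2 + 3 = 5.
const adjustedMaxDepth = 2 + nestedDepth;
console.log(rootDepth, nestedDepth, slashDepth, adjustedMaxDepth);
```

The effect is that the user-supplied depth is measured relative to the starting URL rather than to the site root.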
||||||
@ -213,6 +226,7 @@ export class WebScraperDataProvider {
|
|||||||
return this.returnOnlyUrlsResponse(links, inProgress);
|
return this.returnOnlyUrlsResponse(links, inProgress);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
let documents = await this.processLinks(links, inProgress);
|
let documents = await this.processLinks(links, inProgress);
|
||||||
return this.cacheAndFinalizeDocuments(documents, links);
|
return this.cacheAndFinalizeDocuments(documents, links);
|
||||||
}
|
}
|
||||||
@ -231,7 +245,7 @@ export class WebScraperDataProvider {
|
|||||||
content: "",
|
content: "",
|
||||||
html: this.pageOptions?.includeHtml ? "" : undefined,
|
html: this.pageOptions?.includeHtml ? "" : undefined,
|
||||||
markdown: "",
|
markdown: "",
|
||||||
metadata: { sourceURL: url },
|
metadata: { sourceURL: url, pageStatusCode: 200 },
|
||||||
}));
|
}));
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -270,10 +284,10 @@ export class WebScraperDataProvider {
|
|||||||
private async fetchPdfDocuments(pdfLinks: string[]): Promise<Document[]> {
|
private async fetchPdfDocuments(pdfLinks: string[]): Promise<Document[]> {
|
||||||
return Promise.all(
|
return Promise.all(
|
||||||
pdfLinks.map(async (pdfLink) => {
|
pdfLinks.map(async (pdfLink) => {
|
||||||
const pdfContent = await fetchAndProcessPdf(pdfLink);
|
const { content, pageStatusCode, pageError } = await fetchAndProcessPdf(pdfLink, this.pageOptions?.parsePDF);
|
||||||
return {
|
return {
|
||||||
content: pdfContent,
|
content: content,
|
||||||
metadata: { sourceURL: pdfLink },
|
metadata: { sourceURL: pdfLink, pageStatusCode, pageError },
|
||||||
provider: "web-scraper",
|
provider: "web-scraper",
|
||||||
};
|
};
|
||||||
})
|
})
|
||||||
@ -282,10 +296,10 @@ export class WebScraperDataProvider {
|
|||||||
private async fetchDocxDocuments(docxLinks: string[]): Promise<Document[]> {
|
private async fetchDocxDocuments(docxLinks: string[]): Promise<Document[]> {
|
||||||
return Promise.all(
|
return Promise.all(
|
||||||
docxLinks.map(async (p) => {
|
docxLinks.map(async (p) => {
|
||||||
const docXDocument = await fetchAndProcessDocx(p);
|
const { content, pageStatusCode, pageError } = await fetchAndProcessDocx(p);
|
||||||
return {
|
return {
|
||||||
content: docXDocument,
|
content,
|
||||||
metadata: { sourceURL: p },
|
metadata: { sourceURL: p, pageStatusCode, pageError },
|
||||||
provider: "web-scraper",
|
provider: "web-scraper",
|
||||||
};
|
};
|
||||||
})
|
})
|
||||||
@ -293,9 +307,10 @@ export class WebScraperDataProvider {
|
|||||||
}
|
}
|
||||||
|
|
||||||
private applyPathReplacements(documents: Document[]): Document[] {
|
private applyPathReplacements(documents: Document[]): Document[] {
|
||||||
return this.replaceAllPathsWithAbsolutePaths
|
if (this.replaceAllPathsWithAbsolutePaths) {
|
||||||
? replacePathsWithAbsolutePaths(documents)
|
documents = replacePathsWithAbsolutePaths(documents);
|
||||||
: replaceImgPathsWithAbsolutePaths(documents);
|
}
|
||||||
|
return replaceImgPathsWithAbsolutePaths(documents);
|
||||||
}
|
}
|
||||||
|
|
||||||
private async applyImgAltText(documents: Document[]): Promise<Document[]> {
|
private async applyImgAltText(documents: Document[]): Promise<Document[]> {
|
||||||
@ -464,12 +479,20 @@ export class WebScraperDataProvider {
|
|||||||
this.limit = options.crawlerOptions?.limit ?? 10000;
|
this.limit = options.crawlerOptions?.limit ?? 10000;
|
||||||
this.generateImgAltText =
|
this.generateImgAltText =
|
||||||
options.crawlerOptions?.generateImgAltText ?? false;
|
options.crawlerOptions?.generateImgAltText ?? false;
|
||||||
this.pageOptions = options.pageOptions ?? { onlyMainContent: false, includeHtml: false };
|
this.pageOptions = options.pageOptions ?? {
|
||||||
|
onlyMainContent: false,
|
||||||
|
includeHtml: false,
|
||||||
|
replaceAllPathsWithAbsolutePaths: false,
|
||||||
|
parsePDF: true,
|
||||||
|
removeTags: []
|
||||||
|
};
|
||||||
this.extractorOptions = options.extractorOptions ?? {mode: "markdown"}
|
this.extractorOptions = options.extractorOptions ?? {mode: "markdown"}
|
||||||
this.replaceAllPathsWithAbsolutePaths = options.crawlerOptions?.replaceAllPathsWithAbsolutePaths ?? false;
|
this.replaceAllPathsWithAbsolutePaths = options.crawlerOptions?.replaceAllPathsWithAbsolutePaths ?? options.pageOptions?.replaceAllPathsWithAbsolutePaths ?? false;
|
||||||
//! @nicolas, for some reason this was being injected and breaking everything. Don't have time to find source of the issue so adding this check
|
//! @nicolas, for some reason this was being injected and breaking everything. Don't have time to find source of the issue so adding this check
|
||||||
this.excludes = this.excludes.filter((item) => item !== "");
|
this.excludes = this.excludes.filter((item) => item !== "");
|
||||||
this.crawlerMode = options.crawlerOptions?.mode ?? "default";
|
this.crawlerMode = options.crawlerOptions?.mode ?? "default";
|
||||||
|
this.ignoreSitemap = options.crawlerOptions?.ignoreSitemap ?? false;
|
||||||
|
this.allowBackwardCrawling = options.crawlerOptions?.allowBackwardCrawling ?? false;
|
||||||
|
|
||||||
// make sure all urls start with https://
|
// make sure all urls start with https://
|
||||||
this.urls = this.urls.map((url) => {
|
this.urls = this.urls.map((url) => {
|
||||||
|
@ -2,11 +2,13 @@ import * as cheerio from "cheerio";
|
|||||||
import { ScrapingBeeClient } from "scrapingbee";
|
import { ScrapingBeeClient } from "scrapingbee";
|
||||||
import { extractMetadata } from "./utils/metadata";
|
import { extractMetadata } from "./utils/metadata";
|
||||||
import dotenv from "dotenv";
|
import dotenv from "dotenv";
|
||||||
import { Document, PageOptions } from "../../lib/entities";
|
import { Document, PageOptions, FireEngineResponse } from "../../lib/entities";
|
||||||
import { parseMarkdown } from "../../lib/html-to-markdown";
|
import { parseMarkdown } from "../../lib/html-to-markdown";
|
||||||
import { excludeNonMainTags } from "./utils/excludeTags";
|
import { excludeNonMainTags } from "./utils/excludeTags";
|
||||||
import { urlSpecificParams } from "./utils/custom/website_params";
|
import { urlSpecificParams } from "./utils/custom/website_params";
|
||||||
import { fetchAndProcessPdf } from "./utils/pdfProcessor";
|
import { fetchAndProcessPdf } from "./utils/pdfProcessor";
|
||||||
|
import { handleCustomScraping } from "./custom/handleCustomScraping";
|
||||||
|
import axios from "axios";
|
||||||
|
|
||||||
dotenv.config();
|
dotenv.config();
|
||||||
|
|
||||||
@ -18,6 +20,7 @@ const baseScrapers = [
|
|||||||
"fetch",
|
"fetch",
|
||||||
] as const;
|
] as const;
|
||||||
|
|
||||||
|
const universalTimeout = 15000;
|
||||||
|
|
||||||
export async function generateRequestParams(
|
export async function generateRequestParams(
|
||||||
url: string,
|
url: string,
|
||||||
@ -44,131 +47,195 @@ export async function generateRequestParams(
|
|||||||
}
|
}
|
||||||
export async function scrapWithFireEngine(
|
export async function scrapWithFireEngine(
|
||||||
url: string,
|
url: string,
|
||||||
|
waitFor: number = 0,
|
||||||
|
screenshot: boolean = false,
|
||||||
|
pageOptions: { scrollXPaths?: string[], parsePDF?: boolean } = { parsePDF: true },
|
||||||
|
headers?: Record<string, string>,
|
||||||
options?: any
|
options?: any
|
||||||
): Promise<string> {
|
): Promise<FireEngineResponse> {
|
||||||
try {
|
try {
|
||||||
const reqParams = await generateRequestParams(url);
|
const reqParams = await generateRequestParams(url);
|
||||||
const wait_playwright = reqParams["params"]?.wait ?? 0;
|
// If the user has passed a wait parameter in the request, use that
|
||||||
|
const waitParam = reqParams["params"]?.wait ?? waitFor;
|
||||||
|
const screenshotParam = reqParams["params"]?.screenshot ?? screenshot;
|
||||||
|
console.log(
|
||||||
|
`[Fire-Engine] Scraping ${url} with wait: ${waitParam} and screenshot: ${screenshotParam}`
|
||||||
|
);
|
||||||
|
|
||||||
const response = await fetch(process.env.FIRE_ENGINE_BETA_URL+ "/scrape", {
|
const response = await axios.post(
|
||||||
method: "POST",
|
process.env.FIRE_ENGINE_BETA_URL + "/scrape",
|
||||||
|
{
|
||||||
|
url: url,
|
||||||
|
wait: waitParam,
|
||||||
|
screenshot: screenshotParam,
|
||||||
|
headers: headers,
|
||||||
|
pageOptions: pageOptions,
|
||||||
|
},
|
||||||
|
{
|
||||||
headers: {
|
headers: {
|
||||||
"Content-Type": "application/json",
|
"Content-Type": "application/json",
|
||||||
},
|
},
|
||||||
body: JSON.stringify({ url: url, wait: wait_playwright }),
|
timeout: universalTimeout + waitParam
|
||||||
});
|
}
|
||||||
|
);
|
||||||
|
|
||||||
if (!response.ok) {
|
if (response.status !== 200) {
|
||||||
console.error(
|
console.error(
|
||||||
`[Fire-Engine] Error fetching url: ${url} with status: ${response.status}`
|
`[Fire-Engine] Error fetching url: ${url} with status: ${response.status}`
|
||||||
);
|
);
|
||||||
return "";
|
return { html: "", screenshot: "", pageStatusCode: response.data?.pageStatusCode, pageError: response.data?.pageError };
|
||||||
}
|
}
|
||||||
|
|
||||||
const contentType = response.headers['content-type'];
|
const contentType = response.headers["content-type"];
|
||||||
if (contentType && contentType.includes('application/pdf')) {
|
if (contentType && contentType.includes("application/pdf")) {
|
||||||
return fetchAndProcessPdf(url);
|
const { content, pageStatusCode, pageError } = await fetchAndProcessPdf(url, pageOptions?.parsePDF);
|
||||||
|
return { html: content, screenshot: "", pageStatusCode, pageError };
|
||||||
} else {
|
} else {
|
||||||
const data = await response.json();
|
const data = response.data;
|
||||||
const html = data.content;
|
const html = data.content;
|
||||||
return html ?? "";
|
const screenshot = data.screenshot;
|
||||||
|
return { html: html ?? "", screenshot: screenshot ?? "", pageStatusCode: data.pageStatusCode, pageError: data.pageError };
|
||||||
}
|
}
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
|
if (error.code === 'ECONNABORTED') {
|
||||||
|
console.log(`[Fire-Engine] Request timed out for ${url}`);
|
||||||
|
} else {
|
||||||
console.error(`[Fire-Engine][c] Error fetching url: ${url} -> ${error}`);
|
console.error(`[Fire-Engine][c] Error fetching url: ${url} -> ${error}`);
|
||||||
return "";
|
}
|
||||||
|
return { html: "", screenshot: "" };
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function scrapWithScrapingBee(
|
export async function scrapWithScrapingBee(
|
||||||
url: string,
|
url: string,
|
||||||
wait_browser: string = "domcontentloaded",
|
wait_browser: string = "domcontentloaded",
|
||||||
timeout: number = 15000
|
timeout: number = universalTimeout,
|
||||||
): Promise<string> {
|
pageOptions: { parsePDF?: boolean } = { parsePDF: true }
|
||||||
|
): Promise<{ content: string, pageStatusCode?: number, pageError?: string }> {
|
||||||
try {
|
try {
|
||||||
const client = new ScrapingBeeClient(process.env.SCRAPING_BEE_API_KEY);
|
const client = new ScrapingBeeClient(process.env.SCRAPING_BEE_API_KEY);
|
||||||
const clientParams = await generateRequestParams(
|
const clientParams = await generateRequestParams(
|
||||||
url,
|
url,
|
||||||
wait_browser,
|
wait_browser,
|
||||||
timeout
|
timeout,
|
||||||
);
|
);
|
||||||
|
|
||||||
const response = await client.get(clientParams);
|
const response = await client.get({
|
||||||
|
...clientParams,
|
||||||
if (response.status !== 200 && response.status !== 404) {
|
params: {
|
||||||
console.error(
|
...clientParams.params,
|
||||||
`[ScrapingBee] Error fetching url: ${url} with status code ${response.status}`
|
'transparent_status_code': 'True'
|
||||||
);
|
|
||||||
return "";
|
|
||||||
}
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
const contentType = response.headers["content-type"];
|
||||||
|
if (contentType && contentType.includes("application/pdf")) {
|
||||||
|
return await fetchAndProcessPdf(url, pageOptions?.parsePDF);
|
||||||
|
|
||||||
const contentType = response.headers['content-type'];
|
|
||||||
if (contentType && contentType.includes('application/pdf')) {
|
|
||||||
return fetchAndProcessPdf(url);
|
|
||||||
} else {
|
} else {
|
||||||
|
let text = "";
|
||||||
|
try {
|
||||||
const decoder = new TextDecoder();
|
const decoder = new TextDecoder();
|
||||||
const text = decoder.decode(response.data);
|
text = decoder.decode(response.data);
|
||||||
return text;
|
} catch (decodeError) {
|
||||||
|
console.error(`[ScrapingBee][c] Error decoding response data for url: ${url} -> ${decodeError}`);
|
||||||
|
}
|
||||||
|
return { content: text, pageStatusCode: response.status, pageError: response.statusText !== "OK" ? response.statusText : undefined };
|
||||||
}
|
}
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error(`[ScrapingBee][c] Error fetching url: ${url} -> ${error}`);
|
console.error(`[ScrapingBee][c] Error fetching url: ${url} -> ${error}`);
|
||||||
return "";
|
return { content: "", pageStatusCode: error.response?.status, pageError: error.response?.statusText };
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function scrapWithPlaywright(url: string): Promise<string> {
|
export async function scrapWithPlaywright(
|
||||||
|
url: string,
|
||||||
|
waitFor: number = 0,
|
||||||
|
headers?: Record<string, string>,
|
||||||
|
pageOptions: { parsePDF?: boolean } = { parsePDF: true }
|
||||||
|
): Promise<{ content: string, pageStatusCode?: number, pageError?: string }> {
|
||||||
try {
|
try {
|
||||||
const reqParams = await generateRequestParams(url);
|
const reqParams = await generateRequestParams(url);
|
||||||
const wait_playwright = reqParams["params"]?.wait ?? 0;
|
// If the user has passed a wait parameter in the request, use that
|
||||||
|
const waitParam = reqParams["params"]?.wait ?? waitFor;
|
||||||
|
|
||||||
const response = await fetch(process.env.PLAYWRIGHT_MICROSERVICE_URL, {
|
const response = await axios.post(process.env.PLAYWRIGHT_MICROSERVICE_URL, {
|
||||||
method: "POST",
|
url: url,
|
||||||
|
wait_after_load: waitParam,
|
||||||
|
headers: headers,
|
||||||
|
}, {
|
||||||
headers: {
|
headers: {
|
||||||
"Content-Type": "application/json",
|
"Content-Type": "application/json",
|
||||||
},
|
},
|
||||||
body: JSON.stringify({ url: url, wait: wait_playwright }),
|
timeout: universalTimeout + waitParam, // Add waitParam to timeout to account for the wait time
|
||||||
|
transformResponse: [(data) => data] // Prevent axios from parsing JSON automatically
|
||||||
});
|
});
|
||||||
|
|
||||||
if (!response.ok) {
|
if (response.status !== 200) {
|
||||||
console.error(
|
console.error(
|
||||||
`[Playwright] Error fetching url: ${url} with status: ${response.status}`
|
`[Playwright] Error fetching url: ${url} with status: ${response.status}`
|
||||||
);
|
);
|
||||||
return "";
|
return { content: "", pageStatusCode: response.data?.pageStatusCode, pageError: response.data?.pageError };
|
||||||
}
|
}
|
||||||
|
|
||||||
const contentType = response.headers['content-type'];
|
const contentType = response.headers["content-type"];
|
||||||
if (contentType && contentType.includes('application/pdf')) {
|
if (contentType && contentType.includes("application/pdf")) {
|
||||||
return fetchAndProcessPdf(url);
|
return await fetchAndProcessPdf(url, pageOptions?.parsePDF);
|
||||||
} else {
|
} else {
|
||||||
const data = await response.json();
|
const textData = response.data;
|
||||||
|
try {
|
||||||
|
const data = JSON.parse(textData);
|
||||||
const html = data.content;
|
const html = data.content;
|
||||||
return html ?? "";
|
return { content: html ?? "", pageStatusCode: data.pageStatusCode, pageError: data.pageError };
|
||||||
|
} catch (jsonError) {
|
||||||
|
console.error(`[Playwright] Error parsing JSON response for url: ${url} -> ${jsonError}`);
|
||||||
|
return { content: "" };
|
||||||
|
}
|
||||||
}
|
}
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error(`[Playwright][c] Error fetching url: ${url} -> ${error}`);
|
if (error.code === 'ECONNABORTED') {
|
||||||
return "";
|
console.log(`[Playwright] Request timed out for ${url}`);
|
||||||
|
} else {
|
||||||
|
console.error(`[Playwright] Error fetching url: ${url} -> ${error}`);
|
||||||
|
}
|
||||||
|
return { content: "" };
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function scrapWithFetch(url: string): Promise<string> {
|
export async function scrapWithFetch(
|
||||||
|
url: string,
|
||||||
|
pageOptions: { parsePDF?: boolean } = { parsePDF: true }
|
||||||
|
): Promise<{ content: string, pageStatusCode?: number, pageError?: string }> {
|
||||||
try {
|
try {
|
||||||
const response = await fetch(url);
|
const response = await axios.get(url, {
|
||||||
if (!response.ok) {
|
headers: {
|
||||||
|
"Content-Type": "application/json",
|
||||||
|
},
|
||||||
|
timeout: universalTimeout,
|
||||||
|
transformResponse: [(data) => data] // Prevent axios from parsing JSON automatically
|
||||||
|
});
|
||||||
|
|
||||||
|
if (response.status !== 200) {
|
||||||
console.error(
|
console.error(
|
||||||
`[Fetch] Error fetching url: ${url} with status: ${response.status}`
|
`[Axios] Error fetching url: ${url} with status: ${response.status}`
|
||||||
);
|
);
|
||||||
return "";
|
return { content: "", pageStatusCode: response.status, pageError: response.statusText };
|
||||||
}
|
}
|
||||||
|
|
||||||
const contentType = response.headers['content-type'];
|
const contentType = response.headers["content-type"];
|
||||||
if (contentType && contentType.includes('application/pdf')) {
|
if (contentType && contentType.includes("application/pdf")) {
|
||||||
return fetchAndProcessPdf(url);
|
return await fetchAndProcessPdf(url, pageOptions?.parsePDF);
|
||||||
} else {
|
} else {
|
||||||
const text = await response.text();
|
const text = response.data;
|
||||||
return text;
|
return { content: text, pageStatusCode: 200 };
|
||||||
}
|
}
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error(`[Fetch][c] Error fetching url: ${url} -> ${error}`);
|
if (error.code === 'ECONNABORTED') {
|
||||||
return "";
|
console.log(`[Axios] Request timed out for ${url}`);
|
||||||
|
} else {
|
||||||
|
console.error(`[Axios] Error fetching url: ${url} -> ${error}`);
|
||||||
|
}
|
||||||
|
return { content: "" };
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -178,8 +245,13 @@ export async function scrapWithFetch(url: string): Promise<string> {
|
|||||||
* @param defaultScraper The default scraper to use if the URL does not have a specific scraper order defined
|
* @param defaultScraper The default scraper to use if the URL does not have a specific scraper order defined
|
||||||
* @returns The order of scrapers to be used for scraping a URL
|
* @returns The order of scrapers to be used for scraping a URL
|
||||||
*/
|
*/
|
||||||
function getScrapingFallbackOrder(defaultScraper?: string) {
|
function getScrapingFallbackOrder(
|
||||||
const availableScrapers = baseScrapers.filter(scraper => {
|
defaultScraper?: string,
|
||||||
|
isWaitPresent: boolean = false,
|
||||||
|
isScreenshotPresent: boolean = false,
|
||||||
|
isHeadersPresent: boolean = false
|
||||||
|
) {
|
||||||
|
const availableScrapers = baseScrapers.filter((scraper) => {
|
||||||
switch (scraper) {
|
switch (scraper) {
|
||||||
case "scrapingBee":
|
case "scrapingBee":
|
||||||
case "scrapingBeeLoad":
|
case "scrapingBeeLoad":
|
||||||
@ -193,16 +265,50 @@ function getScrapingFallbackOrder(defaultScraper?: string) {
|
|||||||
}
|
}
|
||||||
});
|
});
|
||||||
|
|
||||||
const defaultOrder = ["fire-engine", "scrapingBee", "playwright", "scrapingBeeLoad", "fetch"];
|
let defaultOrder = [
|
||||||
const filteredDefaultOrder = defaultOrder.filter((scraper: typeof baseScrapers[number]) => availableScrapers.includes(scraper));
|
"scrapingBee",
|
||||||
const uniqueScrapers = new Set(defaultScraper ? [defaultScraper, ...filteredDefaultOrder, ...availableScrapers] : [...filteredDefaultOrder, ...availableScrapers]);
|
"fire-engine",
|
||||||
|
"playwright",
|
||||||
|
"scrapingBeeLoad",
|
||||||
|
"fetch",
|
||||||
|
];
|
||||||
|
|
||||||
|
if (isWaitPresent || isScreenshotPresent || isHeadersPresent) {
|
||||||
|
defaultOrder = [
|
||||||
|
"fire-engine",
|
||||||
|
"playwright",
|
||||||
|
...defaultOrder.filter(
|
||||||
|
(scraper) => scraper !== "fire-engine" && scraper !== "playwright"
|
||||||
|
),
|
||||||
|
];
|
||||||
|
}
|
||||||
|
|
||||||
|
const filteredDefaultOrder = defaultOrder.filter(
|
||||||
|
(scraper: (typeof baseScrapers)[number]) =>
|
||||||
|
availableScrapers.includes(scraper)
|
||||||
|
);
|
||||||
|
const uniqueScrapers = new Set(
|
||||||
|
defaultScraper
|
||||||
|
? [defaultScraper, ...filteredDefaultOrder, ...availableScrapers]
|
||||||
|
: [...filteredDefaultOrder, ...availableScrapers]
|
||||||
|
);
|
||||||
|
|
||||||
const scrapersInOrder = Array.from(uniqueScrapers);
|
const scrapersInOrder = Array.from(uniqueScrapers);
|
||||||
return scrapersInOrder as typeof baseScrapers[number][];
|
return scrapersInOrder as (typeof baseScrapers)[number][];
|
||||||
}
|
}
|
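The reordering rule in `getScrapingFallbackOrder` can be sketched on its own: when a wait time, screenshot, or custom headers are requested, the two scrapers that accept those options move to the front. The function `orderScrapers` and its single boolean parameter are simplifications assumed for illustration; the environment-based filtering of available scrapers in the real function is left out.

```typescript
type Scraper = "fire-engine" | "scrapingBee" | "playwright" | "scrapingBeeLoad" | "fetch";

// Hypothetical simplification: one flag stands in for
// isWaitPresent || isScreenshotPresent || isHeadersPresent.
function orderScrapers(needsBrowserOptions: boolean): Scraper[] {
  const defaultOrder: Scraper[] = [
    "scrapingBee",
    "fire-engine",
    "playwright",
    "scrapingBeeLoad",
    "fetch",
  ];
  if (!needsBrowserOptions) return defaultOrder;
  // Promote the scrapers that receive waitFor/headers in the diff; keep the rest in order.
  return [
    "fire-engine",
    "playwright",
    ...defaultOrder.filter((s) => s !== "fire-engine" && s !== "playwright"),
  ];
}

const plainOrder = orderScrapers(false).join(",");
const promotedOrder = orderScrapers(true).join(",");
console.log(plainOrder);
console.log(promotedOrder);
```

Keeping the remaining scrapers in their original relative order means the fallback chain still ends with the plain `fetch` scraper in both cases.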
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
export async function scrapSingleUrl(
|
export async function scrapSingleUrl(
|
||||||
urlToScrap: string,
|
urlToScrap: string,
|
||||||
pageOptions: PageOptions = { onlyMainContent: true, includeHtml: false },
|
pageOptions: PageOptions = {
|
||||||
|
onlyMainContent: true,
|
||||||
|
includeHtml: false,
|
||||||
|
waitFor: 0,
|
||||||
|
screenshot: false,
|
||||||
|
headers: undefined
|
||||||
|
},
|
||||||
existingHtml: string = ""
|
existingHtml: string = ""
|
||||||
): Promise<Document> {
|
): Promise<Document> {
|
||||||
urlToScrap = urlToScrap.trim();
|
urlToScrap = urlToScrap.trim();
|
||||||
@ -210,6 +316,19 @@ export async function scrapSingleUrl(
|
|||||||
const removeUnwantedElements = (html: string, pageOptions: PageOptions) => {
|
const removeUnwantedElements = (html: string, pageOptions: PageOptions) => {
|
||||||
const soup = cheerio.load(html);
|
const soup = cheerio.load(html);
|
||||||
soup("script, style, iframe, noscript, meta, head").remove();
|
soup("script, style, iframe, noscript, meta, head").remove();
|
||||||
|
|
||||||
|
if (pageOptions.removeTags) {
|
||||||
|
if (typeof pageOptions.removeTags === 'string') {
|
||||||
|
pageOptions.removeTags.split(',').forEach((tag) => {
|
||||||
|
soup(tag.trim()).remove();
|
||||||
|
});
|
||||||
|
} else if (Array.isArray(pageOptions.removeTags)) {
|
||||||
|
pageOptions.removeTags.forEach((tag) => {
|
||||||
|
soup(tag).remove();
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
if (pageOptions.onlyMainContent) {
|
if (pageOptions.onlyMainContent) {
|
||||||
// remove any other tags that are not in the main content
|
// remove any other tags that are not in the main content
|
||||||
excludeNonMainTags.forEach((tag) => {
|
excludeNonMainTags.forEach((tag) => {
|
||||||
@ -221,46 +340,100 @@ export async function scrapSingleUrl(
|
|||||||
|
|
||||||
const attemptScraping = async (
|
const attemptScraping = async (
|
||||||
url: string,
|
url: string,
|
||||||
method: typeof baseScrapers[number]
|
method: (typeof baseScrapers)[number]
|
||||||
) => {
|
) => {
|
||||||
let text = "";
|
let scraperResponse: { text: string, screenshot: string, metadata: { pageStatusCode?: number, pageError?: string | null } } = { text: "", screenshot: "", metadata: {} };
|
||||||
|
let screenshot = "";
|
||||||
switch (method) {
|
switch (method) {
|
||||||
case "fire-engine":
|
case "fire-engine":
|
||||||
if (process.env.FIRE_ENGINE_BETA_URL) {
|
if (process.env.FIRE_ENGINE_BETA_URL) {
|
||||||
text = await scrapWithFireEngine(url);
|
console.log(`Scraping ${url} with Fire Engine`);
|
||||||
|
const response = await scrapWithFireEngine(
|
||||||
|
url,
|
||||||
|
pageOptions.waitFor,
|
||||||
|
pageOptions.screenshot,
|
||||||
|
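{ parsePDF: pageOptions.parsePDF }, pageOptions.headers // pageOptions is the 4th parameter of scrapWithFireEngine, headers the 5th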
|
||||||
|
);
|
||||||
|
scraperResponse.text = response.html;
|
||||||
|
scraperResponse.screenshot = response.screenshot;
|
||||||
|
scraperResponse.metadata.pageStatusCode = response.pageStatusCode;
|
||||||
|
scraperResponse.metadata.pageError = response.pageError;
|
||||||
}
|
}
|
||||||
break;
|
break;
|
||||||
case "scrapingBee":
|
case "scrapingBee":
|
||||||
if (process.env.SCRAPING_BEE_API_KEY) {
|
if (process.env.SCRAPING_BEE_API_KEY) {
|
||||||
text = await scrapWithScrapingBee(
|
const response = await scrapWithScrapingBee(
|
||||||
url,
|
url,
|
||||||
"domcontentloaded",
|
"domcontentloaded",
|
||||||
pageOptions.fallback === false ? 7000 : 15000
|
pageOptions.fallback === false ? 7000 : 15000
|
||||||
);
|
);
|
||||||
|
scraperResponse.text = response.content;
|
||||||
|
scraperResponse.metadata.pageStatusCode = response.pageStatusCode;
|
||||||
|
scraperResponse.metadata.pageError = response.pageError;
|
||||||
}
|
}
|
||||||
break;
|
break;
|
||||||
case "playwright":
|
case "playwright":
|
||||||
if (process.env.PLAYWRIGHT_MICROSERVICE_URL) {
|
if (process.env.PLAYWRIGHT_MICROSERVICE_URL) {
|
||||||
text = await scrapWithPlaywright(url);
|
const response = await scrapWithPlaywright(url, pageOptions.waitFor, pageOptions.headers);
|
||||||
|
scraperResponse.text = response.content;
|
||||||
|
scraperResponse.metadata.pageStatusCode = response.pageStatusCode;
|
||||||
|
scraperResponse.metadata.pageError = response.pageError;
|
||||||
}
|
}
|
||||||
break;
|
break;
|
||||||
case "scrapingBeeLoad":
|
case "scrapingBeeLoad":
|
||||||
if (process.env.SCRAPING_BEE_API_KEY) {
|
if (process.env.SCRAPING_BEE_API_KEY) {
|
||||||
text = await scrapWithScrapingBee(url, "networkidle2");
|
const response = await scrapWithScrapingBee(url, "networkidle2");
|
||||||
|
scraperResponse.text = response.content;
|
||||||
|
scraperResponse.metadata.pageStatusCode = response.pageStatusCode;
|
||||||
|
scraperResponse.metadata.pageError = response.pageError;
|
||||||
}
|
}
|
||||||
break;
|
break;
|
||||||
case "fetch":
|
case "fetch":
|
||||||
text = await scrapWithFetch(url);
|
const response = await scrapWithFetch(url);
|
||||||
|
scraperResponse.text = response.content;
|
||||||
|
scraperResponse.metadata.pageStatusCode = response.pageStatusCode;
|
||||||
|
scraperResponse.metadata.pageError = response.pageError;
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
|
|
||||||
//* TODO: add an option to return markdown or structured/extracted content
|
let customScrapedContent : FireEngineResponse | null = null;
|
||||||
let cleanedHtml = removeUnwantedElements(text, pageOptions);
|
|
||||||
|
|
||||||
return [await parseMarkdown(cleanedHtml), text];
|
// Check for custom scraping conditions
|
||||||
|
const customScraperResult = await handleCustomScraping(scraperResponse.text, url);
|
||||||
|
|
||||||
|
if (customScraperResult){
|
||||||
|
switch (customScraperResult.scraper) {
|
||||||
|
case "fire-engine":
|
||||||
|
customScrapedContent = await scrapWithFireEngine(customScraperResult.url, customScraperResult.waitAfterLoad, false, customScraperResult.pageOptions)
|
||||||
|
if (screenshot) {
|
||||||
|
customScrapedContent.screenshot = screenshot;
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
case "pdf":
|
||||||
|
const { content, pageStatusCode, pageError } = await fetchAndProcessPdf(customScraperResult.url, pageOptions?.parsePDF);
|
||||||
|
customScrapedContent = { html: content, screenshot, pageStatusCode, pageError }
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (customScrapedContent) {
|
||||||
|
scraperResponse.text = customScrapedContent.html;
|
||||||
|
screenshot = customScrapedContent.screenshot;
|
||||||
|
}
|
||||||
|
|
||||||
|
//* TODO: add an option to return markdown or structured/extracted content
|
||||||
|
let cleanedHtml = removeUnwantedElements(scraperResponse.text, pageOptions);
|
||||||
|
|
||||||
|
return {
|
||||||
|
text: await parseMarkdown(cleanedHtml),
|
||||||
|
html: scraperResponse.text,
|
||||||
|
screenshot: scraperResponse.screenshot,
|
||||||
|
pageStatusCode: scraperResponse.metadata.pageStatusCode,
|
||||||
|
pageError: scraperResponse.metadata.pageError || undefined
|
||||||
};
|
};
|
||||||
|
};
|
||||||
|
let { text, html, screenshot, pageStatusCode, pageError } = { text: "", html: "", screenshot: "", pageStatusCode: 200, pageError: undefined };
|
||||||
try {
|
try {
|
||||||
let [text, html] = ["", ""];
|
|
||||||
let urlKey = urlToScrap;
|
let urlKey = urlToScrap;
|
||||||
try {
|
try {
|
||||||
urlKey = new URL(urlToScrap).hostname.replace(/^www\./, "");
|
urlKey = new URL(urlToScrap).hostname.replace(/^www\./, "");
|
||||||
@ -268,7 +441,12 @@ export async function scrapSingleUrl(
|
|||||||
console.error(`Invalid URL key, trying: ${urlToScrap}`);
|
console.error(`Invalid URL key, trying: ${urlToScrap}`);
|
||||||
}
|
}
|
||||||
const defaultScraper = urlSpecificParams[urlKey]?.defaultScraper ?? "";
|
const defaultScraper = urlSpecificParams[urlKey]?.defaultScraper ?? "";
|
||||||
const scrapersInOrder = getScrapingFallbackOrder(defaultScraper)
|
const scrapersInOrder = getScrapingFallbackOrder(
|
||||||
|
defaultScraper,
|
||||||
|
pageOptions && pageOptions.waitFor && pageOptions.waitFor > 0,
|
||||||
|
pageOptions && pageOptions.screenshot && pageOptions.screenshot === true,
|
||||||
|
pageOptions && pageOptions.headers && pageOptions.headers !== undefined
|
||||||
|
);
|
||||||
|
|
||||||
for (const scraper of scrapersInOrder) {
|
for (const scraper of scrapersInOrder) {
|
||||||
// If text from the crawler already exists, use it
|
// If text from the crawler already exists, use it
|
||||||
@ -278,8 +456,21 @@ export async function scrapSingleUrl(
|
|||||||
html = existingHtml;
|
html = existingHtml;
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
[text, html] = await attemptScraping(urlToScrap, scraper);
|
|
||||||
|
const attempt = await attemptScraping(urlToScrap, scraper);
|
||||||
|
text = attempt.text ?? '';
|
||||||
|
html = attempt.html ?? '';
|
||||||
|
screenshot = attempt.screenshot ?? '';
|
||||||
|
if (attempt.pageStatusCode) {
|
||||||
|
pageStatusCode = attempt.pageStatusCode;
|
||||||
|
}
|
||||||
|
if (attempt.pageError) {
|
||||||
|
pageError = attempt.pageError;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
if (text && text.trim().length >= 100) break;
|
if (text && text.trim().length >= 100) break;
|
||||||
|
if (pageStatusCode && pageStatusCode === 404) break;
|
||||||
const nextScraperIndex = scrapersInOrder.indexOf(scraper) + 1;
|
const nextScraperIndex = scrapersInOrder.indexOf(scraper) + 1;
|
||||||
if (nextScraperIndex < scrapersInOrder.length) {
|
if (nextScraperIndex < scrapersInOrder.length) {
|
||||||
console.info(`Falling back to ${scrapersInOrder[nextScraperIndex]}`);
|
console.info(`Falling back to ${scrapersInOrder[nextScraperIndex]}`);
|
||||||
@ -292,12 +483,34 @@ export async function scrapSingleUrl(
|
|||||||
|
|
||||||
const soup = cheerio.load(html);
|
const soup = cheerio.load(html);
|
||||||
const metadata = extractMetadata(soup, urlToScrap);
|
const metadata = extractMetadata(soup, urlToScrap);
|
||||||
const document: Document = {
|
|
||||||
|
let document: Document;
|
||||||
|
if (screenshot && screenshot.length > 0) {
|
||||||
|
document = {
|
||||||
content: text,
|
content: text,
|
||||||
markdown: text,
|
markdown: text,
|
||||||
html: pageOptions.includeHtml ? html : undefined,
|
html: pageOptions.includeHtml ? html : undefined,
|
||||||
metadata: { ...metadata, sourceURL: urlToScrap },
|
metadata: {
|
||||||
|
...metadata,
|
||||||
|
screenshot: screenshot,
|
||||||
|
sourceURL: urlToScrap,
|
||||||
|
pageStatusCode: pageStatusCode,
|
||||||
|
pageError: pageError
|
||||||
|
},
|
||||||
};
|
};
|
||||||
|
} else {
|
||||||
|
document = {
|
||||||
|
content: text,
|
||||||
|
markdown: text,
|
||||||
|
html: pageOptions.includeHtml ? html : undefined,
|
||||||
|
metadata: {
|
||||||
|
...metadata,
|
||||||
|
sourceURL: urlToScrap,
|
||||||
|
pageStatusCode: pageStatusCode,
|
||||||
|
pageError: pageError
|
||||||
|
},
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
return document;
|
return document;
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
@ -306,7 +519,11 @@ export async function scrapSingleUrl(
|
|||||||
content: "",
|
content: "",
|
||||||
markdown: "",
|
markdown: "",
|
||||||
html: "",
|
html: "",
|
||||||
metadata: { sourceURL: urlToScrap },
|
metadata: {
|
||||||
|
sourceURL: urlToScrap,
|
||||||
|
pageStatusCode: pageStatusCode,
|
||||||
|
pageError: pageError
|
||||||
|
},
|
||||||
} as Document;
|
} as Document;
|
||||||
}
|
}
|
||||||
}
|
}
|
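The fallback loop in `scrapSingleUrl` above stops on the first attempt that yields at least 100 characters of trimmed text, or as soon as a 404 is seen. A minimal synchronous sketch of that stop logic follows. The names `Attempt`, `FallbackResult`, and `resolveFallback` are hypothetical and exist only for illustration; the real loop awaits each scraper in turn instead of walking a precomputed list.

```typescript
// Hypothetical shape of one scraper attempt (not the project's actual type).
type Attempt = { text?: string; html?: string; pageStatusCode?: number; pageError?: string };

interface FallbackResult {
  text: string;
  html: string;
  pageStatusCode: number;
  pageError?: string;
  attemptsUsed: number;
}

// Walks attempts in fallback order and applies the same stop conditions as the diff.
function resolveFallback(attempts: Attempt[]): FallbackResult {
  const result: FallbackResult = { text: "", html: "", pageStatusCode: 200, attemptsUsed: 0 };
  for (const attempt of attempts) {
    result.attemptsUsed += 1;
    result.text = attempt.text ?? "";
    result.html = attempt.html ?? "";
    // Status and error are only overwritten when the attempt reports them,
    // so an earlier error message can survive a later successful attempt.
    if (attempt.pageStatusCode) result.pageStatusCode = attempt.pageStatusCode;
    if (attempt.pageError) result.pageError = attempt.pageError;
    if (result.text.trim().length >= 100) break; // enough content, stop falling back
    if (result.pageStatusCode === 404) break; // page does not exist, retrying is pointless
  }
  return result;
}

const outcome = resolveFallback([
  { text: "short", pageStatusCode: 500, pageError: "Internal Server Error" },
  { text: "x".repeat(120), pageStatusCode: 200 },
  { text: "never reached" },
]);
console.log(outcome.attemptsUsed, outcome.pageStatusCode);
```

One consequence worth noting: because `pageError` is only assigned when an attempt reports one, the error from a failed first scraper stays attached to the result even when a later scraper succeeds.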
||||||
|
@ -12,6 +12,7 @@ export async function getLinksFromSitemap(
|
|||||||
content = response.data;
|
content = response.data;
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error(`Request failed for ${sitemapUrl}: ${error}`);
|
console.error(`Request failed for ${sitemapUrl}: ${error}`);
|
||||||
|
|
||||||
return allUrls;
|
return allUrls;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -3,11 +3,13 @@ import * as docxProcessor from "../docxProcessor";
|
|||||||
describe("DOCX Processing Module - Integration Test", () => {
|
describe("DOCX Processing Module - Integration Test", () => {
|
||||||
it("should correctly process a simple DOCX file without the LLAMAPARSE_API_KEY", async () => {
|
it("should correctly process a simple DOCX file without the LLAMAPARSE_API_KEY", async () => {
|
||||||
delete process.env.LLAMAPARSE_API_KEY;
|
delete process.env.LLAMAPARSE_API_KEY;
|
||||||
const docxContent = await docxProcessor.fetchAndProcessDocx(
|
const { content, pageStatusCode, pageError } = await docxProcessor.fetchAndProcessDocx(
|
||||||
"https://nvca.org/wp-content/uploads/2019/06/NVCA-Model-Document-Stock-Purchase-Agreement.docx"
|
"https://nvca.org/wp-content/uploads/2019/06/NVCA-Model-Document-Stock-Purchase-Agreement.docx"
|
||||||
);
|
);
|
||||||
expect(docxContent.trim()).toContain(
|
expect(content.trim()).toContain(
|
||||||
"SERIES A PREFERRED STOCK PURCHASE AGREEMENT"
|
"SERIES A PREFERRED STOCK PURCHASE AGREEMENT"
|
||||||
);
|
);
|
||||||
|
expect(pageStatusCode).toBe(200);
|
||||||
|
expect(pageError).toBeUndefined();
|
||||||
});
|
});
|
||||||
});
|
});
|
||||||
|
@ -3,8 +3,10 @@ import * as pdfProcessor from '../pdfProcessor';
|
|||||||
describe('PDF Processing Module - Integration Test', () => {
|
describe('PDF Processing Module - Integration Test', () => {
|
||||||
it('should correctly process a simple PDF file without the LLAMAPARSE_API_KEY', async () => {
|
it('should correctly process a simple PDF file without the LLAMAPARSE_API_KEY', async () => {
|
||||||
delete process.env.LLAMAPARSE_API_KEY;
|
delete process.env.LLAMAPARSE_API_KEY;
|
||||||
const pdfContent = await pdfProcessor.fetchAndProcessPdf('https://s3.us-east-1.amazonaws.com/storage.mendable.ai/rafa-testing/test%20%281%29.pdf');
|
const { content, pageStatusCode, pageError } = await pdfProcessor.fetchAndProcessPdf('https://s3.us-east-1.amazonaws.com/storage.mendable.ai/rafa-testing/test%20%281%29.pdf', true);
|
||||||
expect(pdfContent.trim()).toEqual("Dummy PDF file");
|
expect(content.trim()).toEqual("Dummy PDF file");
|
||||||
|
expect(pageStatusCode).toEqual(200);
|
||||||
|
expect(pageError).toBeUndefined();
|
||||||
});
|
});
|
||||||
|
|
||||||
// We're hitting the LLAMAPARSE rate limit 🫠
|
// We're hitting the LLAMAPARSE rate limit 🫠
|
||||||
|
@ -6,12 +6,14 @@ describe('replacePaths', () => {
|
|||||||
it('should replace relative paths with absolute paths', () => {
|
it('should replace relative paths with absolute paths', () => {
|
||||||
const documents: Document[] = [{
|
const documents: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'This is a [link](/path/to/resource) and an image ![alt text](/path/to/image.jpg).'
|
content: 'This is a [link](/path/to/resource).',
|
||||||
|
markdown: 'This is a [link](/path/to/resource).'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const expectedDocuments: Document[] = [{
|
const expectedDocuments: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'This is a [link](https://example.com/path/to/resource) and an image ![alt text](https://example.com/path/to/image.jpg).'
|
content: 'This is a [link](https://example.com/path/to/resource).',
|
||||||
|
markdown: 'This is a [link](https://example.com/path/to/resource).'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const result = replacePathsWithAbsolutePaths(documents);
|
const result = replacePathsWithAbsolutePaths(documents);
|
||||||
@ -21,7 +23,8 @@ describe('replacePaths', () => {
|
|||||||
it('should not alter absolute URLs', () => {
|
it('should not alter absolute URLs', () => {
|
||||||
const documents: Document[] = [{
|
const documents: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'This is an [external link](https://external.com/path) and an image ![alt text](https://example.com/path/to/image.jpg).'
|
content: 'This is an [external link](https://external.com/path).',
|
||||||
|
markdown: 'This is an [external link](https://external.com/path).'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const result = replacePathsWithAbsolutePaths(documents);
|
const result = replacePathsWithAbsolutePaths(documents);
|
||||||
@ -31,7 +34,8 @@ describe('replacePaths', () => {
|
|||||||
it('should not alter data URLs for images', () => {
|
it('should not alter data URLs for images', () => {
|
||||||
const documents: Document[] = [{
|
const documents: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'This is an image: ![alt text]().'
|
content: 'This is an image: ![alt text]().',
|
||||||
|
markdown: 'This is an image: ![alt text]().'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const result = replacePathsWithAbsolutePaths(documents);
|
const result = replacePathsWithAbsolutePaths(documents);
|
||||||
@ -41,12 +45,14 @@ describe('replacePaths', () => {
|
|||||||
it('should handle multiple links and images correctly', () => {
|
it('should handle multiple links and images correctly', () => {
|
||||||
const documents: Document[] = [{
|
const documents: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'Here are two links: [link1](/path1) and [link2](/path2), and two images: ![img1](/img1.jpg) ![img2](/img2.jpg).'
|
content: 'Here are two links: [link1](/path1) and [link2](/path2).',
|
||||||
|
markdown: 'Here are two links: [link1](/path1) and [link2](/path2).'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const expectedDocuments: Document[] = [{
|
const expectedDocuments: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'Here are two links: [link1](https://example.com/path1) and [link2](https://example.com/path2), and two images: ![img1](https://example.com/img1.jpg) ![img2](https://example.com/img2.jpg).'
|
content: 'Here are two links: [link1](https://example.com/path1) and [link2](https://example.com/path2).',
|
||||||
|
markdown: 'Here are two links: [link1](https://example.com/path1) and [link2](https://example.com/path2).'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const result = replacePathsWithAbsolutePaths(documents);
|
const result = replacePathsWithAbsolutePaths(documents);
|
||||||
@ -56,12 +62,14 @@ describe('replacePaths', () => {
|
|||||||
it('should correctly handle a mix of absolute and relative paths', () => {
|
it('should correctly handle a mix of absolute and relative paths', () => {
|
||||||
const documents: Document[] = [{
|
const documents: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'Mixed paths: [relative](/path), [absolute](https://example.com/path), and [data image]().'
|
content: 'Mixed paths: [relative](/path), [absolute](https://example.com/path), and [data image]().',
|
||||||
|
markdown: 'Mixed paths: [relative](/path), [absolute](https://example.com/path), and [data image]().'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const expectedDocuments: Document[] = [{
|
const expectedDocuments: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'Mixed paths: [relative](https://example.com/path), [absolute](https://example.com/path), and [data image]().'
|
content: 'Mixed paths: [relative](https://example.com/path), [absolute](https://example.com/path), and [data image]().',
|
||||||
|
markdown: 'Mixed paths: [relative](https://example.com/path), [absolute](https://example.com/path), and [data image]().'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const result = replacePathsWithAbsolutePaths(documents);
|
const result = replacePathsWithAbsolutePaths(documents);
|
||||||
@ -74,12 +82,14 @@ describe('replacePaths', () => {
|
|||||||
it('should replace relative image paths with absolute paths', () => {
|
it('should replace relative image paths with absolute paths', () => {
|
||||||
const documents: Document[] = [{
|
const documents: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'Here is an image: ![alt text](/path/to/image.jpg).'
|
content: 'Here is an image: ![alt text](/path/to/image.jpg).',
|
||||||
|
markdown: 'Here is an image: ![alt text](/path/to/image.jpg).'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const expectedDocuments: Document[] = [{
|
const expectedDocuments: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'Here is an image: ![alt text](https://example.com/path/to/image.jpg).'
|
content: 'Here is an image: ![alt text](https://example.com/path/to/image.jpg).',
|
||||||
|
markdown: 'Here is an image: ![alt text](https://example.com/path/to/image.jpg).'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const result = replaceImgPathsWithAbsolutePaths(documents);
|
const result = replaceImgPathsWithAbsolutePaths(documents);
|
||||||
@ -89,7 +99,8 @@ describe('replacePaths', () => {
|
|||||||
it('should not alter data:image URLs', () => {
|
it('should not alter data:image URLs', () => {
|
||||||
const documents: Document[] = [{
|
const documents: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'An image with a data URL: ![alt text]().'
|
content: 'An image with a data URL: ![alt text]().',
|
||||||
|
markdown: 'An image with a data URL: ![alt text](data:image/png;base64,ABC123==).'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const result = replaceImgPathsWithAbsolutePaths(documents);
|
const result = replaceImgPathsWithAbsolutePaths(documents);
|
||||||
@ -99,12 +110,14 @@ describe('replacePaths', () => {
|
|||||||
it('should handle multiple images with a mix of data and relative URLs', () => {
|
it('should handle multiple images with a mix of data and relative URLs', () => {
|
||||||
const documents: Document[] = [{
|
const documents: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'Multiple images: ![img1](/img1.jpg) ![img2]() ![img3](/img3.jpg).'
|
content: 'Multiple images: ![img1](/img1.jpg) ![img2]() ![img3](/img3.jpg).',
|
||||||
|
markdown: 'Multiple images: ![img1](/img1.jpg) ![img2]() ![img3](/img3.jpg).'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const expectedDocuments: Document[] = [{
|
const expectedDocuments: Document[] = [{
|
||||||
metadata: { sourceURL: 'https://example.com' },
|
metadata: { sourceURL: 'https://example.com' },
|
||||||
content: 'Multiple images: ![img1](https://example.com/img1.jpg) ![img2]() ![img3](https://example.com/img3.jpg).'
|
content: 'Multiple images: ![img1](https://example.com/img1.jpg) ![img2]() ![img3](https://example.com/img3.jpg).',
|
||||||
|
markdown: 'Multiple images: ![img1](https://example.com/img1.jpg) ![img2]() ![img3](https://example.com/img3.jpg).'
|
||||||
}];
|
}];
|
||||||
|
|
||||||
const result = replaceImgPathsWithAbsolutePaths(documents);
|
const result = replaceImgPathsWithAbsolutePaths(documents);
|
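The `replacePaths` tests above expect relative markdown targets to be resolved against `metadata.sourceURL`, while absolute and `data:` URLs stay untouched. A minimal sketch of that behaviour, assuming a single hypothetical helper `absolutizeMarkdownLinks` that handles links and images in one pass (the project itself splits this across `replacePathsWithAbsolutePaths` and `replaceImgPathsWithAbsolutePaths`):

```typescript
// Hypothetical helper for illustration; not the project's implementation.
function absolutizeMarkdownLinks(markdown: string, sourceURL: string): string {
  // Matches [text](target) and ![alt](target) with a non-empty target.
  const linkPattern = /(!?\[[^\]]*\])\(([^)\s]+)\)/g;
  return markdown.replace(linkPattern, (match: string, label: string, target: string) => {
    // Absolute http(s) URLs and data URLs are left exactly as written.
    if (/^(https?:|data:)/i.test(target)) return match;
    try {
      // The WHATWG URL constructor resolves the relative target against the base.
      return `${label}(${new URL(target, sourceURL).toString()})`;
    } catch {
      return match; // an unparsable base or target keeps the original text
    }
  });
}

console.log(absolutizeMarkdownLinks("This is a [link](/path/to/resource).", "https://example.com"));
```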
||||||
|
@ -0,0 +1,66 @@
|
|||||||
|
import { isUrlBlocked } from '../blocklist';
|
||||||
|
|
||||||
|
describe('isUrlBlocked', () => {
|
||||||
|
it('should return true for blocked social media URLs', () => {
|
||||||
|
const blockedUrls = [
|
||||||
|
'https://www.facebook.com',
|
||||||
|
'https://twitter.com/someuser',
|
||||||
|
'https://instagram.com/someuser',
|
||||||
|
'https://www.linkedin.com/in/someuser',
|
||||||
|
'https://pinterest.com/someuser',
|
||||||
|
'https://snapchat.com/someuser',
|
||||||
|
'https://tiktok.com/@someuser',
|
||||||
|
'https://reddit.com/r/somesubreddit',
|
||||||
|
'https://flickr.com/photos/someuser',
|
||||||
|
'https://whatsapp.com/someuser',
|
||||||
|
'https://wechat.com/someuser',
|
||||||
|
'https://telegram.org/someuser',
|
||||||
|
];
|
||||||
|
|
||||||
|
blockedUrls.forEach(url => {
|
||||||
|
if (!isUrlBlocked(url)) {
|
||||||
|
console.log(`URL not blocked: ${url}`);
|
||||||
|
}
|
||||||
|
expect(isUrlBlocked(url)).toBe(true);
|
||||||
|
});
|
||||||
|
});
|
||||||
|
|
||||||
|
it('should return false for URLs containing allowed keywords', () => {
|
||||||
|
const allowedUrls = [
|
||||||
|
'https://www.facebook.com/privacy',
|
||||||
|
'https://twitter.com/terms',
|
||||||
|
'https://instagram.com/legal',
|
||||||
|
'https://www.linkedin.com/help',
|
||||||
|
'https://pinterest.com/about',
|
||||||
|
'https://snapchat.com/support',
|
||||||
|
'https://tiktok.com/contact',
|
||||||
|
'https://reddit.com/user-agreement',
|
||||||
|
'https://tumblr.com/policy',
|
||||||
|
'https://flickr.com/blog',
|
||||||
|
'https://whatsapp.com/press',
|
||||||
|
'https://wechat.com/careers',
|
||||||
|
'https://telegram.org/conditions',
|
||||||
|
'https://wix.com/careers',
|
||||||
|
];
|
||||||
|
|
||||||
|
allowedUrls.forEach(url => {
|
||||||
|
expect(isUrlBlocked(url)).toBe(false);
|
||||||
|
});
|
||||||
|
});
|
||||||
|
|
||||||
|
it('should return false for non-blocked URLs', () => {
|
||||||
|
const nonBlockedUrls = [
|
||||||
|
'https://www.example.com',
|
||||||
|
'https://www.somewebsite.org',
|
||||||
|
'https://subdomain.example.com',
|
||||||
|
'firecrawl.dev',
|
||||||
|
'amazon.com',
|
||||||
|
'wix.com',
|
||||||
|
'https://wix.com'
|
||||||
|
];
|
||||||
|
|
||||||
|
nonBlockedUrls.forEach(url => {
|
||||||
|
expect(isUrlBlocked(url)).toBe(false);
|
||||||
|
});
|
||||||
|
});
|
||||||
|
});
|
@ -1,5 +1,6 @@
|
|||||||
const socialMediaBlocklist = [
|
const socialMediaBlocklist = [
|
||||||
'facebook.com',
|
'facebook.com',
|
||||||
|
'x.com',
|
||||||
'twitter.com',
|
'twitter.com',
|
||||||
'instagram.com',
|
'instagram.com',
|
||||||
'linkedin.com',
|
'linkedin.com',
|
||||||
@ -14,14 +15,40 @@ const socialMediaBlocklist = [
|
|||||||
'telegram.org',
|
'telegram.org',
|
||||||
];
|
];
|
||||||
|
|
||||||
const allowedUrls = [
|
const allowedKeywords = [
|
||||||
'linkedin.com/pulse'
|
'pulse',
|
||||||
|
'privacy',
|
||||||
|
'terms',
|
||||||
|
'policy',
|
||||||
|
'user-agreement',
|
||||||
|
'legal',
|
||||||
|
'help',
|
||||||
|
'support',
|
||||||
|
'contact',
|
||||||
|
'about',
|
||||||
|
'careers',
|
||||||
|
'blog',
|
||||||
|
'press',
|
||||||
|
'conditions',
|
||||||
];
|
];
|
||||||
|
|
||||||
export function isUrlBlocked(url: string): boolean {
|
export function isUrlBlocked(url: string): boolean {
|
||||||
if (allowedUrls.some(allowedUrl => url.includes(allowedUrl))) {
|
// Check if the URL contains any allowed keywords
|
||||||
|
if (allowedKeywords.some(keyword => url.includes(keyword))) {
|
||||||
return false;
|
return false;
|
||||||
}
|
}
|
||||||
|
|
||||||
return socialMediaBlocklist.some(domain => url.includes(domain));
|
try {
|
||||||
|
// Check if the URL matches any domain in the blocklist
|
||||||
|
return socialMediaBlocklist.some(domain => {
|
||||||
|
// Create a regular expression to match the exact domain
|
||||||
|
const domainPattern = new RegExp(`(^|\\.)${domain.replace(/\./g, '\\.')}$`);
|
||||||
|
// Test the hostname of the URL against the pattern
|
||||||
|
return domainPattern.test(new URL(url).hostname);
|
||||||
|
});
|
||||||
|
} catch (e) {
|
||||||
|
// If an error occurs (e.g., invalid URL), return false
|
||||||
|
return false;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
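The updated `isUrlBlocked` above first lets through any URL containing an allowed keyword, then tests the parsed hostname against each blocked domain, so that `wix.com` no longer matches `x.com`. A minimal self-contained sketch of the same idea follows. The lists are illustrative subsets, and `escapeRegExp` and `isBlocked` are hypothetical names; the helper escapes every regular-expression metacharacter in a domain before it is placed in the pattern.

```typescript
const blockedDomains = ["facebook.com", "twitter.com", "x.com"]; // illustrative subset
const allowedKeywords = ["privacy", "terms", "help"]; // illustrative subset

// Escapes characters that have special meaning inside a regular expression.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function isBlocked(url: string): boolean {
  // Keyword check runs on the raw string, as in the diff above.
  if (allowedKeywords.some((keyword) => url.includes(keyword))) return false;
  try {
    const hostname = new URL(url).hostname;
    // Match the domain itself or any subdomain, but not a longer name ending in it.
    return blockedDomains.some((domain) =>
      new RegExp(`(^|\\.)${escapeRegExp(domain)}$`).test(hostname)
    );
  } catch {
    return false; // strings without a scheme are not valid URLs and are not blocked
  }
}

console.log(isBlocked("https://www.facebook.com"), isBlocked("https://wix.com"));
```

Anchoring the pattern with `(^|\.)` and `$` is what separates `x.com` from `wix.com`; a plain substring test cannot make that distinction.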
||||||
|
@ -171,5 +171,22 @@ export const urlSpecificParams = {
|
|||||||
accept:
|
accept:
|
||||||
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
|
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
|
||||||
},
|
},
|
||||||
}
|
},
|
||||||
|
"firecrawl.dev":{
|
||||||
|
defaultScraper: "fire-engine",
|
||||||
|
params: {
|
||||||
|
headers: {
|
||||||
|
"User-Agent":
|
||||||
|
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
|
||||||
|
"sec-fetch-site": "same-origin",
|
||||||
|
"sec-fetch-mode": "cors",
|
||||||
|
"sec-fetch-dest": "empty",
|
||||||
|
referer: "https://www.google.com/",
|
||||||
|
"accept-language": "en-US,en;q=0.9",
|
||||||
|
"accept-encoding": "gzip, deflate, br",
|
||||||
|
accept:
|
||||||
|
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
},
|
||||||
};
|
};
|
||||||
|
@ -5,14 +5,14 @@ import path from "path";
|
|||||||
import os from "os";
|
import os from "os";
|
||||||
import mammoth from "mammoth";
|
import mammoth from "mammoth";
|
||||||
|
|
||||||
export async function fetchAndProcessDocx(url: string): Promise<string> {
|
export async function fetchAndProcessDocx(url: string): Promise<{ content: string; pageStatusCode: number; pageError?: string }> {
|
||||||
const tempFilePath = await downloadDocx(url);
|
const { tempFilePath, pageStatusCode, pageError } = await downloadDocx(url);
|
||||||
const content = await processDocxToText(tempFilePath);
|
const content = await processDocxToText(tempFilePath);
|
||||||
fs.unlinkSync(tempFilePath); // Clean up the temporary file
|
fs.unlinkSync(tempFilePath); // Clean up the temporary file
|
||||||
return content;
|
return { content, pageStatusCode, pageError };
|
||||||
}
|
}
|
||||||
|
|
||||||
async function downloadDocx(url: string): Promise<string> {
|
async function downloadDocx(url: string): Promise<{ tempFilePath: string; pageStatusCode: number; pageError?: string }> {
|
||||||
const response = await axios({
|
const response = await axios({
|
||||||
url,
|
url,
|
||||||
method: "GET",
|
method: "GET",
|
||||||
@ -25,7 +25,7 @@ async function downloadDocx(url: string): Promise<string> {
|
|||||||
response.data.pipe(writer);
|
response.data.pipe(writer);
|
||||||
|
|
||||||
return new Promise((resolve, reject) => {
|
return new Promise((resolve, reject) => {
|
||||||
writer.on("finish", () => resolve(tempFilePath));
|
writer.on("finish", () => resolve({ tempFilePath, pageStatusCode: response.status, pageError: response.statusText !== "OK" ? response.statusText : undefined }));
|
||||||
writer.on("error", reject);
|
writer.on("error", reject);
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
@ -29,6 +29,9 @@ interface Metadata {
|
|||||||
publishedTime?: string;
|
publishedTime?: string;
|
||||||
articleTag?: string;
|
articleTag?: string;
|
||||||
articleSection?: string;
|
articleSection?: string;
|
||||||
|
sourceURL?: string;
|
||||||
|
pageStatusCode?: number;
|
||||||
|
pageError?: string;
|
||||||
}
|
}
|
||||||
|
|
||||||
export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
|
export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
|
||||||
@ -61,6 +64,9 @@ export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
|
|||||||
let publishedTime: string | null = null;
|
let publishedTime: string | null = null;
|
||||||
let articleTag: string | null = null;
|
let articleTag: string | null = null;
|
||||||
let articleSection: string | null = null;
|
let articleSection: string | null = null;
|
||||||
|
let sourceURL: string | null = null;
|
||||||
|
let pageStatusCode: number | null = null;
|
||||||
|
let pageError: string | null = null;
|
||||||
|
|
||||||
try {
|
try {
|
||||||
title = soup("title").text() || null;
|
title = soup("title").text() || null;
|
||||||
@ -132,5 +138,8 @@ export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
|
|||||||
...(publishedTime ? { publishedTime } : {}),
|
...(publishedTime ? { publishedTime } : {}),
|
||||||
...(articleTag ? { articleTag } : {}),
|
...(articleTag ? { articleTag } : {}),
|
||||||
...(articleSection ? { articleSection } : {}),
|
...(articleSection ? { articleSection } : {}),
|
||||||
|
...(sourceURL ? { sourceURL } : {}),
|
||||||
|
...(pageStatusCode ? { pageStatusCode } : {}),
|
||||||
|
...(pageError ? { pageError } : {}),
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
@ -9,14 +9,14 @@ import os from "os";
|
|||||||
|
|
||||||
dotenv.config();
|
dotenv.config();
|
||||||
|
|
||||||
export async function fetchAndProcessPdf(url: string): Promise<string> {
|
export async function fetchAndProcessPdf(url: string, parsePDF: boolean): Promise<{ content: string, pageStatusCode?: number, pageError?: string }> {
|
||||||
const tempFilePath = await downloadPdf(url);
|
const { tempFilePath, pageStatusCode, pageError } = await downloadPdf(url);
|
||||||
const content = await processPdfToText(tempFilePath);
|
const content = await processPdfToText(tempFilePath, parsePDF);
|
||||||
fs.unlinkSync(tempFilePath); // Clean up the temporary file
|
fs.unlinkSync(tempFilePath); // Clean up the temporary file
|
||||||
return content;
|
return { content, pageStatusCode, pageError };
|
||||||
}
|
}
|
||||||
|
|
||||||
async function downloadPdf(url: string): Promise<string> {
|
async function downloadPdf(url: string): Promise<{ tempFilePath: string, pageStatusCode?: number, pageError?: string }> {
|
||||||
const response = await axios({
|
const response = await axios({
|
||||||
url,
|
url,
|
||||||
method: "GET",
|
method: "GET",
|
||||||
@ -29,15 +29,15 @@ async function downloadPdf(url: string): Promise<string> {
|
|||||||
response.data.pipe(writer);
|
response.data.pipe(writer);
|
||||||
|
|
||||||
return new Promise((resolve, reject) => {
|
return new Promise((resolve, reject) => {
|
||||||
writer.on("finish", () => resolve(tempFilePath));
|
writer.on("finish", () => resolve({ tempFilePath, pageStatusCode: response.status, pageError: response.statusText !== "OK" ? response.statusText : undefined }));
|
||||||
writer.on("error", reject);
|
writer.on("error", reject);
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function processPdfToText(filePath: string): Promise<string> {
|
export async function processPdfToText(filePath: string, parsePDF: boolean): Promise<string> {
|
||||||
let content = "";
|
let content = "";
|
||||||
|
|
||||||
if (process.env.LLAMAPARSE_API_KEY) {
|
if (process.env.LLAMAPARSE_API_KEY && parsePDF) {
|
||||||
const apiKey = process.env.LLAMAPARSE_API_KEY;
|
const apiKey = process.env.LLAMAPARSE_API_KEY;
|
||||||
const headers = {
|
const headers = {
|
||||||
Authorization: `Bearer ${apiKey}`,
|
Authorization: `Bearer ${apiKey}`,
|
||||||
@ -80,7 +80,7 @@ export async function processPdfToText(filePath: string): Promise<string> {
|
|||||||
await new Promise((resolve) => setTimeout(resolve, 500)); // Wait for 0.5 seconds
|
await new Promise((resolve) => setTimeout(resolve, 500)); // Wait for 0.5 seconds
|
||||||
}
|
}
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error("Error fetching result:", error || '');
|
console.error("Error fetching result w/ LlamaIndex");
|
||||||
attempt++;
|
attempt++;
|
||||||
await new Promise((resolve) => setTimeout(resolve, 500)); // Wait for 0.5 seconds before retrying
|
await new Promise((resolve) => setTimeout(resolve, 500)); // Wait for 0.5 seconds before retrying
|
||||||
// You may want to handle specific errors differently
|
// You may want to handle specific errors differently
|
||||||
@ -92,11 +92,13 @@ export async function processPdfToText(filePath: string): Promise<string> {
|
|||||||
}
|
}
|
||||||
content = resultResponse.data[resultType];
|
content = resultResponse.data[resultType];
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error("Error processing document:", filePath, error);
|
console.error("Error processing pdf document w/ LlamaIndex(2)");
|
||||||
content = await processPdf(filePath);
|
content = await processPdf(filePath);
|
||||||
}
|
}
|
||||||
} else {
|
} else if (parsePDF) {
|
||||||
content = await processPdf(filePath);
|
content = await processPdf(filePath);
|
||||||
|
} else {
|
||||||
|
content = fs.readFileSync(filePath, "utf-8");
|
||||||
}
|
}
|
||||||
return content;
|
return content;
|
||||||
}
|
}
|
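After this change `processPdfToText` has three paths: LlamaParse when an API key is set and `parsePDF` is true, the local `processPdf` parser when only `parsePDF` is true, and a raw UTF-8 read of the file otherwise. The decision can be summarised as a small pure function. `PdfStrategy` and `choosePdfStrategy` are hypothetical names used only for this sketch.

```typescript
// Hypothetical labels for the three branches in processPdfToText.
type PdfStrategy = "llamaparse" | "local-parser" | "raw-read";

function choosePdfStrategy(hasLlamaParseKey: boolean, parsePDF: boolean): PdfStrategy {
  if (hasLlamaParseKey && parsePDF) return "llamaparse"; // remote parsing, falls back to the local parser on error
  if (parsePDF) return "local-parser"; // no API key, parse locally
  return "raw-read"; // parsing disabled, return the file contents as text
}

console.log(choosePdfStrategy(false, true));
```

Note that with `parsePDF` set to false the API key is irrelevant: the file is read as text without any parsing.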
||||||
|
@ -10,6 +10,7 @@ export const replacePathsWithAbsolutePaths = (documents: Document[]): Document[]
|
|||||||
) || [];
|
) || [];
|
||||||
|
|
||||||
paths.forEach((path: string) => {
|
paths.forEach((path: string) => {
|
||||||
|
try {
|
||||||
const isImage = path.startsWith("!");
|
const isImage = path.startsWith("!");
|
||||||
let matchedUrl = path.match(/\(([^)]+)\)/) || path.match(/href="([^"]+)"/);
|
let matchedUrl = path.match(/\(([^)]+)\)/) || path.match(/href="([^"]+)"/);
|
||||||
let url = matchedUrl[1];
|
let url = matchedUrl[1];
|
||||||
@ -22,18 +23,18 @@ export const replacePathsWithAbsolutePaths = (documents: Document[]): Document[]
|
|||||||
}
|
}
|
||||||
|
|
||||||
const markdownLinkOrImageText = path.match(/(!?\[.*?\])/)[0];
|
const markdownLinkOrImageText = path.match(/(!?\[.*?\])/)[0];
|
||||||
if (isImage) {
|
// Image is handled afterwards
|
||||||
document.content = document.content.replace(
|
if (!isImage) {
|
||||||
path,
|
|
||||||
`${markdownLinkOrImageText}(${url})`
|
|
||||||
);
|
|
||||||
} else {
|
|
||||||
document.content = document.content.replace(
|
document.content = document.content.replace(
|
||||||
path,
|
path,
|
||||||
`${markdownLinkOrImageText}(${url})`
|
`${markdownLinkOrImageText}(${url})`
|
||||||
);
|
);
|
||||||
|
}
|
||||||
|
} catch (error) {
|
||||||
|
|
||||||
}
|
}
|
||||||
});
|
});
|
||||||
|
document.markdown = document.content;
|
||||||
});
|
});
|
||||||
|
|
||||||
return documents;
|
return documents;
|
||||||
@ -60,8 +61,10 @@ export const replaceImgPathsWithAbsolutePaths = (documents: Document[]): Documen
|
|||||||
if (!imageUrl.startsWith("http")) {
|
if (!imageUrl.startsWith("http")) {
|
||||||
if (imageUrl.startsWith("/")) {
|
if (imageUrl.startsWith("/")) {
|
||||||
imageUrl = imageUrl.substring(1);
|
imageUrl = imageUrl.substring(1);
|
||||||
}
|
|
||||||
imageUrl = new URL(imageUrl, baseUrl).toString();
|
imageUrl = new URL(imageUrl, baseUrl).toString();
|
||||||
|
} else {
|
||||||
|
imageUrl = new URL(imageUrl, document.metadata.sourceURL).toString();
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -70,6 +73,7 @@ export const replaceImgPathsWithAbsolutePaths = (documents: Document[]): Documen
|
|||||||
`![${altText}](${imageUrl})`
|
`![${altText}](${imageUrl})`
|
||||||
);
|
);
|
||||||
});
|
});
|
||||||
|
document.markdown = document.content;
|
||||||
});
|
});
|
||||||
|
|
||||||
return documents;
|
return documents;
|
||||||
|
@ -1,7 +1,9 @@
|
|||||||
|
import { NotificationType } from "../../types";
|
||||||
import { withAuth } from "../../lib/withAuth";
|
import { withAuth } from "../../lib/withAuth";
|
||||||
|
import { sendNotification } from "../notification/email_notification";
|
||||||
import { supabase_service } from "../supabase";
|
import { supabase_service } from "../supabase";
|
||||||
|
|
||||||
const FREE_CREDITS = 300;
|
const FREE_CREDITS = 500;
|
||||||
|
|
||||||
export async function billTeam(team_id: string, credits: number) {
|
export async function billTeam(team_id: string, credits: number) {
|
||||||
return withAuth(supaBillTeam)(team_id, credits);
|
return withAuth(supaBillTeam)(team_id, credits);
|
||||||
@ -34,7 +36,10 @@ export async function supaBillTeam(team_id: string, credits: number) {
|
|||||||
|
|
||||||
let couponCredits = 0;
|
let couponCredits = 0;
|
||||||
if (coupons && coupons.length > 0) {
|
if (coupons && coupons.length > 0) {
|
||||||
couponCredits = coupons.reduce((total, coupon) => total + coupon.credits, 0);
|
couponCredits = coupons.reduce(
|
||||||
|
(total, coupon) => total + coupon.credits,
|
||||||
|
0
|
||||||
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
let sortedCoupons = coupons.sort((a, b) => b.credits - a.credits);
|
let sortedCoupons = coupons.sort((a, b) => b.credits - a.credits);
|
||||||
@ -55,17 +60,16 @@ export async function supaBillTeam(team_id: string, credits: number) {
|
|||||||
await supabase_service
|
await supabase_service
|
||||||
.from("coupons")
|
.from("coupons")
|
||||||
.update({
|
.update({
|
||||||
credits: 0
|
credits: 0,
|
||||||
})
|
})
|
||||||
.eq("id", sortedCoupons[0].id);
|
.eq("id", sortedCoupons[0].id);
|
||||||
sortedCoupons.shift();
|
sortedCoupons.shift();
|
||||||
|
|
||||||
} else {
|
} else {
|
||||||
// update coupon credits
|
// update coupon credits
|
||||||
await supabase_service
|
await supabase_service
|
||||||
.from("coupons")
|
.from("coupons")
|
||||||
.update({
|
.update({
|
||||||
credits: sortedCoupons[0].credits - usedCredits
|
credits: sortedCoupons[0].credits - usedCredits,
|
||||||
})
|
})
|
||||||
.eq("id", sortedCoupons[0].id);
|
.eq("id", sortedCoupons[0].id);
|
||||||
usedCredits = 0;
|
usedCredits = 0;
|
||||||
@ -82,7 +86,7 @@ export async function supaBillTeam(team_id: string, credits: number) {
|
|||||||
await supabase_service
|
await supabase_service
|
||||||
.from("coupons")
|
.from("coupons")
|
||||||
.update({
|
.update({
|
||||||
credits: 0
|
credits: 0,
|
||||||
})
|
})
|
||||||
.eq("id", sortedCoupons[i].id);
|
.eq("id", sortedCoupons[i].id);
|
||||||
}
|
}
|
||||||
@ -99,14 +103,18 @@ export async function supaBillTeam(team_id: string, credits: number) {
|
|||||||
await supabase_service
|
await supabase_service
|
||||||
.from("coupons")
|
.from("coupons")
|
||||||
.update({
|
.update({
|
||||||
credits: 0
|
credits: 0,
|
||||||
})
|
})
|
||||||
.eq("id", sortedCoupons[i].id);
|
.eq("id", sortedCoupons[i].id);
|
||||||
}
|
}
|
||||||
const usedCredits = credits - couponCredits;
|
const usedCredits = credits - couponCredits;
|
||||||
return await createCreditUsage({ team_id, subscription_id: subscription.id, credits: usedCredits });
|
return await createCreditUsage({
|
||||||
|
team_id,
|
||||||
} else { // using only coupon credits
|
subscription_id: subscription.id,
|
||||||
|
credits: usedCredits,
|
||||||
|
});
|
||||||
|
} else {
|
||||||
|
// using only coupon credits
|
||||||
let usedCredits = credits;
|
let usedCredits = credits;
|
||||||
while (usedCredits > 0) {
|
while (usedCredits > 0) {
|
||||||
// update coupons
|
// update coupons
|
||||||
@ -116,24 +124,27 @@ export async function supaBillTeam(team_id: string, credits: number) {
|
|||||||
await supabase_service
|
await supabase_service
|
||||||
.from("coupons")
|
.from("coupons")
|
||||||
.update({
|
.update({
|
||||||
credits: 0
|
credits: 0,
|
||||||
})
|
})
|
||||||
.eq("id", sortedCoupons[0].id);
|
.eq("id", sortedCoupons[0].id);
|
||||||
sortedCoupons.shift();
|
sortedCoupons.shift();
|
||||||
|
|
||||||
} else {
|
} else {
|
||||||
// update coupon credits
|
// update coupon credits
|
||||||
await supabase_service
|
await supabase_service
|
||||||
.from("coupons")
|
.from("coupons")
|
||||||
.update({
|
.update({
|
||||||
credits: sortedCoupons[0].credits - usedCredits
|
credits: sortedCoupons[0].credits - usedCredits,
|
||||||
})
|
})
|
||||||
.eq("id", sortedCoupons[0].id);
|
.eq("id", sortedCoupons[0].id);
|
||||||
usedCredits = 0;
|
usedCredits = 0;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
return await createCreditUsage({ team_id, subscription_id: subscription.id, credits: 0 });
|
return await createCreditUsage({
|
||||||
|
team_id,
|
||||||
|
subscription_id: subscription.id,
|
||||||
|
credits: 0,
|
||||||
|
});
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -142,7 +153,11 @@ export async function supaBillTeam(team_id: string, credits: number) {
|
|||||||
return await createCreditUsage({ team_id, credits });
|
return await createCreditUsage({ team_id, credits });
|
||||||
}
|
}
|
||||||
|
|
||||||
return await createCreditUsage({ team_id, subscription_id: subscription.id, credits });
|
return await createCreditUsage({
|
||||||
|
team_id,
|
||||||
|
subscription_id: subscription.id,
|
||||||
|
credits,
|
||||||
|
});
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function checkTeamCredits(team_id: string, credits: number) {
|
export async function checkTeamCredits(team_id: string, credits: number) {
|
||||||
@ -155,7 +170,8 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// Retrieve the team's active subscription
|
// Retrieve the team's active subscription
|
||||||
const { data: subscription, error: subscriptionError } = await supabase_service
|
const { data: subscription, error: subscriptionError } =
|
||||||
|
await supabase_service
|
||||||
.from("subscriptions")
|
.from("subscriptions")
|
||||||
.select("id, price_id, current_period_start, current_period_end")
|
.select("id, price_id, current_period_start, current_period_end")
|
||||||
.eq("team_id", team_id)
|
.eq("team_id", team_id)
|
||||||
@ -171,7 +187,10 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
|
|||||||
|
|
||||||
let couponCredits = 0;
|
let couponCredits = 0;
|
||||||
if (coupons && coupons.length > 0) {
|
if (coupons && coupons.length > 0) {
|
||||||
couponCredits = coupons.reduce((total, coupon) => total + coupon.credits, 0);
|
couponCredits = coupons.reduce(
|
||||||
|
(total, coupon) => total + coupon.credits,
|
||||||
|
0
|
||||||
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
// Free credits, no coupons
|
// Free credits, no coupons
|
||||||
@ -187,12 +206,10 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
|
|||||||
.select("credits_used")
|
.select("credits_used")
|
||||||
.is("subscription_id", null)
|
.is("subscription_id", null)
|
||||||
.eq("team_id", team_id);
|
.eq("team_id", team_id);
|
||||||
// .gte("created_at", subscription.current_period_start)
|
|
||||||
// .lte("created_at", subscription.current_period_end);
|
|
||||||
|
|
||||||
if (creditUsageError) {
|
if (creditUsageError) {
|
||||||
throw new Error(
|
throw new Error(
|
||||||
`Failed to retrieve credit usage for subscription_id: ${subscription.id}`
|
`Failed to retrieve credit usage for team_id: ${team_id}`
|
||||||
);
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -202,8 +219,32 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
|
|||||||
);
|
);
|
||||||
|
|
||||||
console.log("totalCreditsUsed", totalCreditsUsed);
|
console.log("totalCreditsUsed", totalCreditsUsed);
|
||||||
|
|
||||||
|
const end = new Date();
|
||||||
|
end.setDate(end.getDate() + 30);
|
||||||
|
// check if usage is within 80% of the limit
|
||||||
|
const creditLimit = FREE_CREDITS;
|
||||||
|
const creditUsagePercentage = (totalCreditsUsed + credits) / creditLimit;
|
||||||
|
|
||||||
|
if (creditUsagePercentage >= 0.8) {
|
||||||
|
await sendNotification(
|
||||||
|
team_id,
|
||||||
|
NotificationType.APPROACHING_LIMIT,
|
||||||
|
new Date().toISOString(),
|
||||||
|
end.toISOString()
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
// 5. Compare the total credits used with the credits allowed by the plan.
|
// 5. Compare the total credits used with the credits allowed by the plan.
|
||||||
if (totalCreditsUsed + credits > FREE_CREDITS) {
|
if (totalCreditsUsed + credits > FREE_CREDITS) {
|
||||||
|
// Send email notification for insufficient credits
|
||||||
|
|
||||||
|
await sendNotification(
|
||||||
|
team_id,
|
||||||
|
NotificationType.LIMIT_REACHED,
|
||||||
|
new Date().toISOString(),
|
||||||
|
end.toISOString()
|
||||||
|
);
|
||||||
return {
|
return {
|
||||||
success: false,
|
success: false,
|
||||||
message: "Insufficient credits, please upgrade!",
|
message: "Insufficient credits, please upgrade!",
|
||||||
@ -214,11 +255,11 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
|
|||||||
|
|
||||||
let totalCreditsUsed = 0;
|
let totalCreditsUsed = 0;
|
||||||
try {
|
try {
|
||||||
const { data: creditUsages, error: creditUsageError } = await supabase_service
|
const { data: creditUsages, error: creditUsageError } =
|
||||||
.rpc("get_credit_usage_2", {
|
await supabase_service.rpc("get_credit_usage_2", {
|
||||||
sub_id: subscription.id,
|
sub_id: subscription.id,
|
||||||
start_time: subscription.current_period_start,
|
start_time: subscription.current_period_start,
|
||||||
end_time: subscription.current_period_end
|
end_time: subscription.current_period_end,
|
||||||
});
|
});
|
||||||
|
|
||||||
if (creditUsageError) {
|
if (creditUsageError) {
|
||||||
@ -227,12 +268,11 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
|
|||||||
|
|
||||||
if (creditUsages && creditUsages.length > 0) {
|
if (creditUsages && creditUsages.length > 0) {
|
||||||
totalCreditsUsed = creditUsages[0].total_credits_used;
|
totalCreditsUsed = creditUsages[0].total_credits_used;
|
||||||
// console.log("Total Credits Used:", totalCreditsUsed);
|
|
||||||
}
|
}
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error("Error calculating credit usage:", error);
|
console.error("Error calculating credit usage:", error);
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
||||||
// Adjust total credits used by subtracting coupon value
|
// Adjust total credits used by subtracting coupon value
|
||||||
const adjustedCreditsUsed = Math.max(0, totalCreditsUsed - couponCredits);
|
const adjustedCreditsUsed = Math.max(0, totalCreditsUsed - couponCredits);
|
||||||
|
|
||||||
@ -244,12 +284,31 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
|
|||||||
.single();
|
.single();
|
||||||
|
|
||||||
if (priceError) {
|
if (priceError) {
|
||||||
throw new Error(`Failed to retrieve price for price_id: ${subscription.price_id}`);
|
throw new Error(
|
||||||
|
`Failed to retrieve price for price_id: ${subscription.price_id}`
|
||||||
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
const creditLimit = price.credits;
|
||||||
|
const creditUsagePercentage = (adjustedCreditsUsed + credits) / creditLimit;
|
||||||
|
|
||||||
// Compare the adjusted total credits used with the credits allowed by the plan
|
// Compare the adjusted total credits used with the credits allowed by the plan
|
||||||
if (adjustedCreditsUsed + credits > price.credits) {
|
if (adjustedCreditsUsed + credits > price.credits) {
|
||||||
|
await sendNotification(
|
||||||
|
team_id,
|
||||||
|
NotificationType.LIMIT_REACHED,
|
||||||
|
subscription.current_period_start,
|
||||||
|
subscription.current_period_end
|
||||||
|
);
|
||||||
return { success: false, message: "Insufficient credits, please upgrade!" };
|
return { success: false, message: "Insufficient credits, please upgrade!" };
|
||||||
|
} else if (creditUsagePercentage >= 0.8) {
|
||||||
|
// Send email notification for approaching credit limit
|
||||||
|
await sendNotification(
|
||||||
|
team_id,
|
||||||
|
NotificationType.APPROACHING_LIMIT,
|
||||||
|
subscription.current_period_start,
|
||||||
|
subscription.current_period_end
|
||||||
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
return { success: true, message: "Sufficient credits available" };
|
return { success: true, message: "Sufficient credits available" };
|
||||||
@ -275,7 +334,10 @@ export async function countCreditsAndRemainingForCurrentBillingPeriod(
|
|||||||
|
|
||||||
let couponCredits = 0;
|
let couponCredits = 0;
|
||||||
if (coupons && coupons.length > 0) {
|
if (coupons && coupons.length > 0) {
|
||||||
couponCredits = coupons.reduce((total, coupon) => total + coupon.credits, 0);
|
couponCredits = coupons.reduce(
|
||||||
|
(total, coupon) => total + coupon.credits,
|
||||||
|
0
|
||||||
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
if (subscriptionError || !subscription) {
|
if (subscriptionError || !subscription) {
|
||||||
@ -288,7 +350,9 @@ export async function countCreditsAndRemainingForCurrentBillingPeriod(
|
|||||||
.eq("team_id", team_id);
|
.eq("team_id", team_id);
|
||||||
|
|
||||||
if (creditUsageError || !creditUsages) {
|
if (creditUsageError || !creditUsages) {
|
||||||
throw new Error(`Failed to retrieve credit usage for team_id: ${team_id}`);
|
throw new Error(
|
||||||
|
`Failed to retrieve credit usage for team_id: ${team_id}`
|
||||||
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
const totalCreditsUsed = creditUsages.reduce(
|
const totalCreditsUsed = creditUsages.reduce(
|
||||||
@ -297,7 +361,11 @@ export async function countCreditsAndRemainingForCurrentBillingPeriod(
|
|||||||
);
|
);
|
||||||
|
|
||||||
const remainingCredits = FREE_CREDITS + couponCredits - totalCreditsUsed;
|
const remainingCredits = FREE_CREDITS + couponCredits - totalCreditsUsed;
|
||||||
return { totalCreditsUsed: totalCreditsUsed, remainingCredits, totalCredits: FREE_CREDITS + couponCredits };
|
return {
|
||||||
|
totalCreditsUsed: totalCreditsUsed,
|
||||||
|
remainingCredits,
|
||||||
|
totalCredits: FREE_CREDITS + couponCredits,
|
||||||
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
const { data: creditUsages, error: creditUsageError } = await supabase_service
|
const { data: creditUsages, error: creditUsageError } = await supabase_service
|
||||||
@ -308,10 +376,15 @@ export async function countCreditsAndRemainingForCurrentBillingPeriod(
|
|||||||
.lte("created_at", subscription.current_period_end);
|
.lte("created_at", subscription.current_period_end);
|
||||||
|
|
||||||
if (creditUsageError || !creditUsages) {
|
if (creditUsageError || !creditUsages) {
|
||||||
throw new Error(`Failed to retrieve credit usage for subscription_id: ${subscription.id}`);
|
throw new Error(
|
||||||
|
`Failed to retrieve credit usage for subscription_id: ${subscription.id}`
|
||||||
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
const totalCreditsUsed = creditUsages.reduce((acc, usage) => acc + usage.credits_used, 0);
|
const totalCreditsUsed = creditUsages.reduce(
|
||||||
|
(acc, usage) => acc + usage.credits_used,
|
||||||
|
0
|
||||||
|
);
|
||||||
|
|
||||||
const { data: price, error: priceError } = await supabase_service
|
const { data: price, error: priceError } = await supabase_service
|
||||||
.from("prices")
|
.from("prices")
|
||||||
@ -320,7 +393,9 @@ export async function countCreditsAndRemainingForCurrentBillingPeriod(
|
|||||||
.single();
|
.single();
|
||||||
|
|
||||||
if (priceError || !price) {
|
if (priceError || !price) {
|
||||||
throw new Error(`Failed to retrieve price for price_id: ${subscription.price_id}`);
|
throw new Error(
|
||||||
|
`Failed to retrieve price for price_id: ${subscription.price_id}`
|
||||||
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
const remainingCredits = price.credits + couponCredits - totalCreditsUsed;
|
const remainingCredits = price.credits + couponCredits - totalCreditsUsed;
|
||||||
@ -328,11 +403,19 @@ export async function countCreditsAndRemainingForCurrentBillingPeriod(
|
|||||||
return {
|
return {
|
||||||
totalCreditsUsed,
|
totalCreditsUsed,
|
||||||
remainingCredits,
|
remainingCredits,
|
||||||
totalCredits: price.credits
|
totalCredits: price.credits,
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
async function createCreditUsage({ team_id, subscription_id, credits }: { team_id: string, subscription_id?: string, credits: number }) {
|
async function createCreditUsage({
|
||||||
|
team_id,
|
||||||
|
subscription_id,
|
||||||
|
credits,
|
||||||
|
}: {
|
||||||
|
team_id: string;
|
||||||
|
subscription_id?: string;
|
||||||
|
credits: number;
|
||||||
|
}) {
|
||||||
const { data: credit_usage } = await supabase_service
|
const { data: credit_usage } = await supabase_service
|
||||||
.from("credit_usage")
|
.from("credit_usage")
|
||||||
.insert([
|
.insert([
|
||||||
|
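Reviewer note: the credit checks added in this diff compare projected usage (credits already used plus credits requested) against the plan limit, and notify at 80% of that limit. The sketch below isolates that decision as a pure function, following the else-if ordering of the subscription branch. `classifyCreditUsage`, `CreditStatus`, and `APPROACHING_LIMIT_RATIO` are hypothetical names introduced for illustration; they do not appear in the commit.

```typescript
// Hypothetical helper (not part of the commit): classifies projected credit usage.
type CreditStatus = "ok" | "approaching_limit" | "limit_reached";

// Assumed to mirror the 0.8 threshold used in the diff.
const APPROACHING_LIMIT_RATIO = 0.8;

function classifyCreditUsage(
  creditsUsed: number,
  creditsRequested: number,
  creditLimit: number
): CreditStatus {
  const projected = creditsUsed + creditsRequested;

  // Over the limit: the request should be rejected.
  if (projected > creditLimit) {
    return "limit_reached";
  }

  // At or above 80% of the limit: the request is allowed, but a warning is due.
  if (creditLimit > 0 && projected / creditLimit >= APPROACHING_LIMIT_RATIO) {
    return "approaching_limit";
  }

  return "ok";
}
```

Keeping the classification separate from `sendNotification` would make the threshold testable without Supabase or email access. Note that the free-plan branch in the diff evaluates the 80% check before the over-limit check, so an over-limit request there can trigger both notification types.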
22
apps/api/src/services/idempotency/create.ts
Normal file
@ -0,0 +1,22 @@
|
|||||||
|
import { Request } from "express";
|
||||||
|
import { supabase_service } from "../supabase";
|
||||||
|
|
||||||
|
export async function createIdempotencyKey(
|
||||||
|
req: Request,
|
||||||
|
): Promise<string> {
|
||||||
|
const idempotencyKey = req.headers['x-idempotency-key'] as string;
|
||||||
|
if (!idempotencyKey) {
|
||||||
|
throw new Error("No idempotency key provided in the request headers.");
|
||||||
|
}
|
||||||
|
|
||||||
|
const { data, error } = await supabase_service
|
||||||
|
.from("idempotency_keys")
|
||||||
|
.insert({ key: idempotencyKey });
|
||||||
|
|
||||||
|
if (error) {
|
||||||
|
console.error("Failed to create idempotency key:", error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
|
||||||
|
return idempotencyKey;
|
||||||
|
}
|
32
apps/api/src/services/idempotency/validate.ts
Normal file
@ -0,0 +1,32 @@
|
|||||||
|
import { Request } from "express";
|
||||||
|
import { supabase_service } from "../supabase";
|
||||||
|
import { validate as isUuid } from 'uuid';
|
||||||
|
|
||||||
|
export async function validateIdempotencyKey(
|
||||||
|
req: Request,
|
||||||
|
): Promise<boolean> {
|
||||||
|
const idempotencyKey = req.headers['x-idempotency-key'];
|
||||||
|
if (!idempotencyKey) {
|
||||||
|
// // not returning for missing idempotency key for now
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
if (!isUuid(idempotencyKey)) {
|
||||||
|
console.error("Invalid idempotency key provided in the request headers.");
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
|
||||||
|
const { data, error } = await supabase_service
|
||||||
|
.from("idempotency_keys")
|
||||||
|
.select("key")
|
||||||
|
.eq("key", idempotencyKey);
|
||||||
|
|
||||||
|
if (error) {
|
||||||
|
console.error(error);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!data || data.length === 0) {
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
|
||||||
|
return false;
|
||||||
|
}
|
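Reviewer note: taken together, `createIdempotencyKey` and `validateIdempotencyKey` implement a check-then-record flow: a missing key is allowed, a malformed key is rejected, and a key that was already stored is rejected. The sketch below reproduces that flow with an in-memory `Set` standing in for the `idempotency_keys` table. `InMemoryIdempotencyStore` and `UUID_PATTERN` are hypothetical names, and the regular expression is only an approximation of the `uuid` package's `validate`.

```typescript
// Approximate UUID shape check; an assumption standing in for uuid's validate().
const UUID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[1-8][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

// Hypothetical in-memory stand-in for the "idempotency_keys" table.
class InMemoryIdempotencyStore {
  private readonly seen = new Set<string>();

  // Returns true when the request may proceed.
  validate(key: string | string[] | undefined): boolean {
    // A missing key is accepted, as in the diff.
    if (key === undefined || key === "") {
      return true;
    }
    // Repeated headers arrive as string[]; treat them as invalid here.
    if (typeof key !== "string" || !UUID_PATTERN.test(key)) {
      return false;
    }
    // A key that was already recorded marks a duplicate request.
    return !this.seen.has(key);
  }

  // Records the key so that later requests carrying it are rejected.
  create(key: string): string {
    if (!key) {
      throw new Error("No idempotency key provided.");
    }
    this.seen.add(key);
    return key;
  }
}
```

Because validation and insertion are separate steps, two concurrent requests with the same key could both pass validation before either is recorded. A unique constraint on the key column, if one exists in the real table, would be one way to close that gap.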
121
apps/api/src/services/notification/email_notification.ts
Normal file
@ -0,0 +1,121 @@
|
|||||||
|
import { supabase_service } from "../supabase";
|
||||||
|
import { withAuth } from "../../lib/withAuth";
|
||||||
|
import { Resend } from "resend";
|
||||||
|
import { NotificationType } from "../../types";
|
||||||
|
|
||||||
|
const emailTemplates: Record<
|
||||||
|
NotificationType,
|
||||||
|
{ subject: string; html: string }
|
||||||
|
> = {
|
||||||
|
[NotificationType.APPROACHING_LIMIT]: {
|
||||||
|
subject: "You've used 80% of your credit limit - Firecrawl",
|
||||||
|
html: "Hey there,<br/><p>You are approaching your credit limit for this billing period. Your usage right now is around 80% of your total credit limit. Consider upgrading your plan to avoid hitting the limit. Check out our <a href='https://firecrawl.dev/pricing'>pricing page</a> for more info.</p><br/>Thanks,<br/>Firecrawl Team<br/>",
|
||||||
|
},
|
||||||
|
[NotificationType.LIMIT_REACHED]: {
|
||||||
|
subject:
|
||||||
|
"Credit Limit Reached! Take action now to resume usage - Firecrawl",
|
||||||
|
html: "Hey there,<br/><p>You have reached your credit limit for this billing period. To resume usage, please upgrade your plan. Check out our <a href='https://firecrawl.dev/pricing'>pricing page</a> for more info.</p><br/>Thanks,<br/>Firecrawl Team<br/>",
|
||||||
|
},
|
||||||
|
[NotificationType.RATE_LIMIT_REACHED]: {
|
||||||
|
subject: "Rate Limit Reached - Firecrawl",
|
||||||
|
html: "Hey there,<br/><p>You've hit one of the Firecrawl endpoint's rate limit! Take a breather and try again in a few moments. If you need higher rate limits, consider upgrading your plan. Check out our <a href='https://firecrawl.dev/pricing'>pricing page</a> for more info.</p><p>If you have any questions, feel free to reach out to us at <a href='mailto:hello@firecrawl.com'>hello@firecrawl.com</a></p><br/>Thanks,<br/>Firecrawl Team<br/><br/>Ps. this email is only sent once every 7 days if you reach a rate limit.",
|
||||||
|
},
|
||||||
|
};
|
||||||
|
|
||||||
|
export async function sendNotification(
|
||||||
|
team_id: string,
|
||||||
|
notificationType: NotificationType,
|
||||||
|
startDateString: string,
|
||||||
|
endDateString: string
|
||||||
|
) {
|
||||||
|
return withAuth(sendNotificationInternal)(
|
||||||
|
team_id,
|
||||||
|
notificationType,
|
||||||
|
startDateString,
|
||||||
|
endDateString
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
async function sendEmailNotification(
|
||||||
|
email: string,
|
||||||
|
notificationType: NotificationType
|
||||||
|
) {
|
||||||
|
const resend = new Resend(process.env.RESEND_API_KEY);
|
||||||
|
|
||||||
|
try {
|
||||||
|
const { data, error } = await resend.emails.send({
|
||||||
|
from: "Firecrawl <firecrawl@getmendableai.com>",
|
||||||
|
to: [email],
|
||||||
|
reply_to: "hello@firecrawl.com",
|
||||||
|
subject: emailTemplates[notificationType].subject,
|
||||||
|
html: emailTemplates[notificationType].html,
|
||||||
|
});
|
||||||
|
|
||||||
|
if (error) {
|
||||||
|
console.error("Error sending email: ", error);
|
||||||
|
return { success: false };
|
||||||
|
}
|
||||||
|
} catch (error) {
|
||||||
|
console.error("Error sending email (2): ", error);
|
||||||
|
return { success: false };
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function sendNotificationInternal(
|
||||||
|
team_id: string,
|
||||||
|
notificationType: NotificationType,
|
||||||
|
startDateString: string,
|
||||||
|
endDateString: string
|
||||||
|
): Promise<{ success: boolean }> {
|
||||||
|
if (team_id === "preview") {
|
||||||
|
return { success: true };
|
||||||
|
}
|
||||||
|
const { data, error } = await supabase_service
|
||||||
|
.from("user_notifications")
|
||||||
|
.select("*")
|
||||||
|
.eq("team_id", team_id)
|
||||||
|
.eq("notification_type", notificationType)
|
||||||
|
.gte("sent_date", startDateString)
|
||||||
|
.lte("sent_date", endDateString);
|
||||||
|
|
||||||
|
if (error) {
|
||||||
|
console.error("Error fetching notifications: ", error);
|
||||||
|
return { success: false };
|
||||||
|
}
|
||||||
|
|
||||||
|
if (data.length !== 0) {
|
||||||
|
return { success: false };
|
||||||
|
} else {
|
||||||
|
// get the emails from the user with the team_id
|
||||||
|
const { data: emails, error: emailsError } = await supabase_service
|
||||||
|
.from("users")
|
||||||
|
.select("email")
|
||||||
|
.eq("team_id", team_id);
|
||||||
|
|
||||||
|
if (emailsError) {
|
||||||
|
console.error("Error fetching emails: ", emailsError);
|
||||||
|
return { success: false };
|
||||||
|
}
|
||||||
|
|
||||||
|
for (const email of emails) {
|
||||||
|
await sendEmailNotification(email.email, notificationType);
|
||||||
|
}
|
||||||
|
|
||||||
|
const { error: insertError } = await supabase_service
|
||||||
|
.from("user_notifications")
|
||||||
|
.insert([
|
||||||
|
{
|
||||||
|
team_id: team_id,
|
||||||
|
notification_type: notificationType,
|
||||||
|
sent_date: new Date().toISOString(),
|
||||||
|
},
|
||||||
|
]);
|
||||||
|
|
||||||
|
if (insertError) {
|
||||||
|
console.error("Error inserting notification record: ", insertError);
|
||||||
|
return { success: false };
|
||||||
|
}
|
||||||
|
|
||||||
|
return { success: true };
|
||||||
|
}
|
||||||
|
}
|
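Reviewer note: `sendNotificationInternal` sends at most one notification of a given type per team within the supplied date window. It looks for an existing `user_notifications` row in that window and only sends and records when none is found. The sketch below shows the same once-per-window logic against an in-memory log. `SentNotification`, `wasAlreadySent`, and `notifyOnce` are hypothetical names, and the `send` callback stands in for the Resend call.

```typescript
// Hypothetical record shape mirroring the columns used in the diff.
interface SentNotification {
  teamId: string;
  type: string;
  sentAt: Date;
}

// True when a notification of this type was already sent inside the window.
function wasAlreadySent(
  log: SentNotification[],
  teamId: string,
  type: string,
  periodStart: Date,
  periodEnd: Date
): boolean {
  return log.some(
    (n) =>
      n.teamId === teamId &&
      n.type === type &&
      n.sentAt.getTime() >= periodStart.getTime() &&
      n.sentAt.getTime() <= periodEnd.getTime()
  );
}

// Sends and records the notification unless one already exists in the window.
function notifyOnce(
  log: SentNotification[],
  teamId: string,
  type: string,
  periodStart: Date,
  periodEnd: Date,
  now: Date,
  send: () => void
): boolean {
  if (wasAlreadySent(log, teamId, type, periodStart, periodEnd)) {
    return false;
  }
  send();
  log.push({ teamId, type, sentAt: now });
  return true;
}
```

Passing `now` explicitly keeps the sketch deterministic; the code in the diff uses `new Date()` when it records the row.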
@ -1,6 +1,7 @@
|
|||||||
import Queue from "bull";
|
import Queue from "bull";
|
||||||
|
import { Queue as BullQueue } from "bull";
|
||||||
|
|
||||||
let webScraperQueue;
|
let webScraperQueue: BullQueue;
|
||||||
|
|
||||||
export function getWebScraperQueue() {
|
export function getWebScraperQueue() {
|
||||||
if (!webScraperQueue) {
|
if (!webScraperQueue) {
|
||||||
|
@ -38,7 +38,7 @@ getWebScraperQueue().process(
|
|||||||
error: message /* etc... */,
|
error: message /* etc... */,
|
||||||
};
|
};
|
||||||
|
|
||||||
await callWebhook(job.data.team_id, data);
|
await callWebhook(job.data.team_id, job.id as string, data);
|
||||||
|
|
||||||
await logJob({
|
await logJob({
|
||||||
success: success,
|
success: success,
|
||||||
@ -78,7 +78,7 @@ getWebScraperQueue().process(
|
|||||||
error:
|
error:
|
||||||
"Something went wrong... Contact help@mendable.ai or try again." /* etc... */,
|
"Something went wrong... Contact help@mendable.ai or try again." /* etc... */,
|
||||||
};
|
};
|
||||||
await callWebhook(job.data.team_id, data);
|
await callWebhook(job.data.team_id, job.id as string, data);
|
||||||
await logJob({
|
await logJob({
|
||||||
success: false,
|
success: false,
|
||||||
message: typeof error === 'string' ? error : (error.message ?? "Something went wrong... Contact help@mendable.ai"),
|
message: typeof error === 'string' ? error : (error.message ?? "Something went wrong... Contact help@mendable.ai"),
|
||||||
|
@ -2,47 +2,75 @@ import { RateLimiterRedis } from "rate-limiter-flexible";
|
|||||||
import * as redis from "redis";
|
import * as redis from "redis";
|
||||||
import { RateLimiterMode } from "../../src/types";
|
import { RateLimiterMode } from "../../src/types";
|
||||||
|
|
||||||
const MAX_CRAWLS_PER_MINUTE_STARTER = 3;
|
const RATE_LIMITS = {
|
||||||
const MAX_CRAWLS_PER_MINUTE_STANDARD = 5;
|
crawl: {
|
||||||
const MAX_CRAWLS_PER_MINUTE_SCALE = 20;
|
default: 3,
|
||||||
|
free: 2,
|
||||||
const MAX_SCRAPES_PER_MINUTE_STARTER = 20;
|
starter: 3,
|
||||||
const MAX_SCRAPES_PER_MINUTE_STANDARD = 40;
|
standard: 5,
|
||||||
const MAX_SCRAPES_PER_MINUTE_SCALE = 50;
|
standardOld: 40,
|
||||||
|
scale: 20,
|
||||||
const MAX_SEARCHES_PER_MINUTE_STARTER = 20;
|
hobby: 3,
|
||||||
const MAX_SEARCHES_PER_MINUTE_STANDARD = 40;
|
standardNew: 10,
|
||||||
const MAX_SEARCHES_PER_MINUTE_SCALE = 50;
|
growth: 50,
|
||||||
|
},
|
||||||
const MAX_REQUESTS_PER_MINUTE_PREVIEW = 5;
|
scrape: {
|
||||||
const MAX_REQUESTS_PER_MINUTE_ACCOUNT = 20;
|
default: 20,
|
||||||
const MAX_REQUESTS_PER_MINUTE_CRAWL_STATUS = 150;
|
free: 5,
|
||||||
|
starter: 20,
|
||||||
|
standard: 50,
|
||||||
|
standardOld: 40,
|
||||||
|
scale: 50,
|
||||||
|
hobby: 10,
|
||||||
|
standardNew: 50,
|
||||||
|
growth: 500,
|
||||||
|
},
|
||||||
|
search: {
|
||||||
|
default: 20,
|
||||||
|
free: 5,
|
||||||
|
starter: 20,
|
||||||
|
standard: 40,
|
||||||
|
standardOld: 40,
|
||||||
|
scale: 50,
|
||||||
|
hobby: 10,
|
||||||
|
standardNew: 50,
|
||||||
|
growth: 500,
|
||||||
|
},
|
||||||
|
preview: {
|
||||||
|
free: 5,
|
||||||
|
default: 5,
|
||||||
|
},
|
||||||
|
account: {
|
||||||
|
free: 20,
|
||||||
|
default: 20,
|
||||||
|
},
|
||||||
|
crawlStatus: {
|
||||||
|
free: 150,
|
||||||
|
default: 150,
|
||||||
|
},
|
||||||
|
testSuite: {
|
||||||
|
free: 10000,
|
||||||
|
default: 10000,
|
||||||
|
},
|
||||||
|
};
|
||||||
|
|
||||||
export const redisClient = redis.createClient({
|
export const redisClient = redis.createClient({
|
||||||
url: process.env.REDIS_URL,
|
url: process.env.REDIS_URL,
|
||||||
legacyMode: true,
|
legacyMode: true,
|
||||||
});
|
});
|
||||||
|
|
||||||
export const previewRateLimiter = new RateLimiterRedis({
|
const createRateLimiter = (keyPrefix, points) =>
|
||||||
|
new RateLimiterRedis({
|
||||||
storeClient: redisClient,
|
storeClient: redisClient,
|
||||||
keyPrefix: "preview",
|
keyPrefix,
|
||||||
points: MAX_REQUESTS_PER_MINUTE_PREVIEW,
|
points,
|
||||||
duration: 60, // Duration in seconds
|
duration: 60, // Duration in seconds
|
||||||
});
|
});
|
||||||
|
|
||||||
export const serverRateLimiter = new RateLimiterRedis({
|
export const serverRateLimiter = createRateLimiter(
|
||||||
storeClient: redisClient,
|
"server",
|
||||||
keyPrefix: "server",
|
RATE_LIMITS.account.default
|
||||||
points: MAX_REQUESTS_PER_MINUTE_ACCOUNT,
|
);
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
|
|
||||||
export const crawlStatusRateLimiter = new RateLimiterRedis({
|
|
||||||
storeClient: redisClient,
|
|
||||||
keyPrefix: "crawl-status",
|
|
||||||
points: MAX_REQUESTS_PER_MINUTE_CRAWL_STATUS,
|
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
|
|
||||||
export const testSuiteRateLimiter = new RateLimiterRedis({
|
export const testSuiteRateLimiter = new RateLimiterRedis({
|
||||||
storeClient: redisClient,
|
storeClient: redisClient,
|
||||||
@ -51,84 +79,21 @@ export const testSuiteRateLimiter = new RateLimiterRedis({
|
|||||||
duration: 60, // Duration in seconds
|
duration: 60, // Duration in seconds
|
||||||
});
|
});
|
||||||
|
|
||||||
|
export function getRateLimiter(
|
||||||
export function getRateLimiter(mode: RateLimiterMode, token: string, plan?: string){
|
mode: RateLimiterMode,
|
||||||
// Special test suite case. TODO: Change this later.
|
token: string,
|
||||||
if (token.includes("57017") || token.includes("6254cf9")){
|
plan?: string
|
||||||
|
) {
|
||||||
|
if (token.includes("a01ccae") || token.includes("6254cf9")) {
|
||||||
return testSuiteRateLimiter;
|
return testSuiteRateLimiter;
|
||||||
}
|
}
|
||||||
switch (mode) {
|
|
||||||
case RateLimiterMode.Preview:
|
const rateLimitConfig = RATE_LIMITS[mode]; // {default : 5}
|
||||||
return previewRateLimiter;
|
if (!rateLimitConfig) return serverRateLimiter;
|
||||||
case RateLimiterMode.CrawlStatus:
|
|
||||||
return crawlStatusRateLimiter;
|
const planKey = plan ? plan.replace("-", "") : "default"; // "default"
|
||||||
case RateLimiterMode.Crawl:
|
const points =
|
||||||
if (plan === "standard"){
|
rateLimitConfig[planKey] || rateLimitConfig.default || rateLimitConfig; // 5
|
||||||
return new RateLimiterRedis({
|
|
||||||
storeClient: redisClient,
|
return createRateLimiter(`${mode}-${planKey}`, points);
|
||||||
keyPrefix: "crawl-standard",
|
|
||||||
points: MAX_CRAWLS_PER_MINUTE_STANDARD,
|
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
} else if (plan === "scale"){
|
|
||||||
return new RateLimiterRedis({
|
|
||||||
storeClient: redisClient,
|
|
||||||
keyPrefix: "crawl-scale",
|
|
||||||
points: MAX_CRAWLS_PER_MINUTE_SCALE,
|
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
}
|
|
||||||
return new RateLimiterRedis({
|
|
||||||
storeClient: redisClient,
|
|
||||||
keyPrefix: "crawl-starter",
|
|
||||||
points: MAX_CRAWLS_PER_MINUTE_STARTER,
|
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
case RateLimiterMode.Scrape:
|
|
||||||
if (plan === "standard"){
|
|
||||||
return new RateLimiterRedis({
|
|
||||||
storeClient: redisClient,
|
|
||||||
keyPrefix: "scrape-standard",
|
|
||||||
points: MAX_SCRAPES_PER_MINUTE_STANDARD,
|
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
} else if (plan === "scale"){
|
|
||||||
return new RateLimiterRedis({
|
|
||||||
storeClient: redisClient,
|
|
||||||
keyPrefix: "scrape-scale",
|
|
||||||
points: MAX_SCRAPES_PER_MINUTE_SCALE,
|
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
}
|
|
||||||
return new RateLimiterRedis({
|
|
||||||
storeClient: redisClient,
|
|
||||||
keyPrefix: "scrape-starter",
|
|
||||||
points: MAX_SCRAPES_PER_MINUTE_STARTER,
|
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
case RateLimiterMode.Search:
|
|
||||||
if (plan === "standard"){
|
|
||||||
return new RateLimiterRedis({
|
|
||||||
storeClient: redisClient,
|
|
||||||
keyPrefix: "search-standard",
|
|
||||||
points: MAX_SEARCHES_PER_MINUTE_STANDARD,
|
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
} else if (plan === "scale"){
|
|
||||||
return new RateLimiterRedis({
|
|
||||||
storeClient: redisClient,
|
|
||||||
keyPrefix: "search-scale",
|
|
||||||
points: MAX_SEARCHES_PER_MINUTE_SCALE,
|
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
}
|
|
||||||
return new RateLimiterRedis({
|
|
||||||
storeClient: redisClient,
|
|
||||||
keyPrefix: "search-starter",
|
|
||||||
points: MAX_SEARCHES_PER_MINUTE_STARTER,
|
|
||||||
duration: 60, // Duration in seconds
|
|
||||||
});
|
|
||||||
default:
|
|
||||||
return serverRateLimiter;
|
|
||||||
}
|
|
||||||
}
|
}
|
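Review note: the refactored `getRateLimiter` above replaces the per-mode `switch` with a table lookup. The sketch below isolates that lookup as pure functions so it can be checked without Redis. The `LIMITS` table, its numbers, and the helper names `toPlanKey`, `resolvePoints` and `limiterKeyPrefix` are assumptions made for illustration; they are not the project's actual `RATE_LIMITS` values or API. One deliberate difference: the sketch strips every hyphen from the plan name, whereas `plan.replace("-", "")` in the diff removes only the first one.

```typescript
// Hypothetical limits table; the numbers are placeholders, not the project's real values.
type PlanLimits = { default: number; [plan: string]: number };

const LIMITS: Record<string, PlanLimits> = {
  crawl: { default: 3, starter: 3, standard: 5, scale: 20 },
  scrape: { default: 20, starter: 20, standard: 50, scale: 50 },
  preview: { default: 5 },
};

// Normalise a plan name into a table key; a missing plan maps to "default".
// Assumption: all hyphens are stripped, not just the first.
function toPlanKey(plan?: string): string {
  return plan ? plan.replace(/-/g, "") : "default";
}

// Resolve the per-minute points for a mode and plan.
// Returns undefined for an unknown mode so the caller can fall back to a shared limiter.
function resolvePoints(mode: string, plan?: string): number | undefined {
  const config = LIMITS[mode];
  if (!config) return undefined;
  return config[toPlanKey(plan)] ?? config.default;
}

// Build a key prefix in the same `${mode}-${planKey}` shape the diff uses.
function limiterKeyPrefix(mode: string, plan?: string): string {
  return `${mode}-${toPlanKey(plan)}`;
}
```

A caller would pass the two results to something like the diff's `createRateLimiter(keyPrefix, points)`. Since the diff builds a new limiter on every call, caching limiters by key prefix may be worth considering, though that is outside what the diff shows.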
||||||
|
@ -1,8 +1,35 @@
|
|||||||
import Redis from 'ioredis';
|
import Redis from "ioredis";
|
||||||
|
|
||||||
// Initialize Redis client
|
// Initialize Redis client
|
||||||
const redis = new Redis(process.env.REDIS_URL);
|
const redis = new Redis(process.env.REDIS_URL);
|
||||||
|
|
||||||
|
// Listen to 'error' events on the Redis connection
|
||||||
|
redis.on("error", (error) => {
|
||||||
|
try {
|
||||||
|
if (error.message === "ECONNRESET") {
|
||||||
|
console.log("Connection to Redis Session Store timed out.");
|
||||||
|
} else if (error.message === "ECONNREFUSED") {
|
||||||
|
console.log("Connection to Redis Session Store refused!");
|
||||||
|
} else console.log(error);
|
||||||
|
} catch (error) {}
|
||||||
|
});
|
||||||
|
|
||||||
|
// Listen to 'reconnecting' event to Redis
|
||||||
|
redis.on("reconnecting", (err) => {
|
||||||
|
try {
|
||||||
|
if (redis.status === "reconnecting")
|
||||||
|
console.log("Reconnecting to Redis Session Store...");
|
||||||
|
else console.log("Error reconnecting to Redis Session Store.");
|
||||||
|
} catch (error) {}
|
||||||
|
});
|
||||||
|
|
||||||
|
// Listen to the 'connect' event to Redis
|
||||||
|
redis.on("connect", (err) => {
|
||||||
|
try {
|
||||||
|
if (!err) console.log("Connected to Redis Session Store!");
|
||||||
|
} catch (error) {}
|
||||||
|
});
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Set a value in Redis with an optional expiration time.
|
* Set a value in Redis with an optional expiration time.
|
||||||
* @param {string} key The key under which to store the value.
|
* @param {string} key The key under which to store the value.
|
||||||
@ -11,7 +38,7 @@ const redis = new Redis(process.env.REDIS_URL);
|
|||||||
*/
|
*/
|
||||||
const setValue = async (key: string, value: string, expire?: number) => {
|
const setValue = async (key: string, value: string, expire?: number) => {
|
||||||
if (expire) {
|
if (expire) {
|
||||||
await redis.set(key, value, 'EX', expire);
|
await redis.set(key, value, "EX", expire);
|
||||||
} else {
|
} else {
|
||||||
await redis.set(key, value);
|
await redis.set(key, value);
|
||||||
}
|
}
|
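Review note: the new `error` handler above compares `error.message` against `"ECONNRESET"` and `"ECONNREFUSED"` with strict equality. Node.js system errors normally carry that identifier in `error.code`, while the message is a longer string such as `connect ECONNREFUSED 127.0.0.1:6379`, so the equality checks may never match. The sketch below shows one way to classify by `code` instead; `ErrorWithCode` and `describeRedisError` are hypothetical names introduced here, not part of the commit.

```typescript
// Minimal shape of a Node.js system error; only `code` and `message` are used here.
interface ErrorWithCode {
  message: string;
  code?: string;
}

// Map a connection error to the log line used in the diff,
// falling back to the raw message for anything unrecognised.
function describeRedisError(error: ErrorWithCode): string {
  switch (error.code) {
    case "ECONNRESET":
      return "Connection to Redis Session Store timed out.";
    case "ECONNREFUSED":
      return "Connection to Redis Session Store refused!";
    default:
      return error.message;
  }
}
```

The handler would then reduce to logging `describeRedisError(error)` inside `redis.on("error", ...)`, which also removes the need for the empty `catch` block.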
||||||
|
@ -1,15 +1,24 @@
|
|||||||
import { supabase_service } from "./supabase";
|
import { supabase_service } from "./supabase";
|
||||||
|
|
||||||
export const callWebhook = async (teamId: string, data: any) => {
|
export const callWebhook = async (teamId: string, jobId: string, data: any) => {
|
||||||
try {
|
try {
|
||||||
const { data: webhooksData, error } = await supabase_service
|
const selfHostedUrl = process.env.SELF_HOSTED_WEBHOOK_URL;
|
||||||
.from('webhooks')
|
const useDbAuthentication = process.env.USE_DB_AUTHENTICATION === 'true';
|
||||||
.select('url')
|
let webhookUrl = selfHostedUrl;
|
||||||
.eq('team_id', teamId)
|
|
||||||
.limit(1);
|
|
||||||
|
|
||||||
|
// Only fetch the webhook URL from the database if the self-hosted webhook URL is not set
|
||||||
|
// and the USE_DB_AUTHENTICATION environment variable is set to true
|
||||||
|
if (!selfHostedUrl && useDbAuthentication) {
|
||||||
|
const { data: webhooksData, error } = await supabase_service
|
||||||
|
.from("webhooks")
|
||||||
|
.select("url")
|
||||||
|
.eq("team_id", teamId)
|
||||||
|
.limit(1);
|
||||||
if (error) {
|
if (error) {
|
||||||
console.error(`Error fetching webhook URL for team ID: ${teamId}`, error.message);
|
console.error(
|
||||||
|
`Error fetching webhook URL for team ID: ${teamId}`,
|
||||||
|
error.message
|
||||||
|
);
|
||||||
return null;
|
return null;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -17,6 +26,9 @@ export const callWebhook = async (teamId: string, data: any) => {
|
|||||||
return null;
|
return null;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
webhookUrl = webhooksData[0].url;
|
||||||
|
}
|
||||||
|
|
||||||
let dataToSend = [];
|
let dataToSend = [];
|
||||||
if (data.result.links && data.result.links.length !== 0) {
|
if (data.result.links && data.result.links.length !== 0) {
|
||||||
for (let i = 0; i < data.result.links.length; i++) {
|
for (let i = 0; i < data.result.links.length; i++) {
|
||||||
@ -28,19 +40,22 @@ export const callWebhook = async (teamId: string, data: any) => {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
await fetch(webhooksData[0].url, {
|
await fetch(webhookUrl, {
|
||||||
method: 'POST',
|
method: "POST",
|
||||||
headers: {
|
headers: {
|
||||||
'Content-Type': 'application/json',
|
"Content-Type": "application/json",
|
||||||
},
|
},
|
||||||
body: JSON.stringify({
|
body: JSON.stringify({
|
||||||
success: data.success,
|
success: data.success,
|
||||||
|
jobId: jobId,
|
||||||
data: dataToSend,
|
data: dataToSend,
|
||||||
error: data.error || undefined,
|
error: data.error || undefined,
|
||||||
}),
|
}),
|
||||||
});
|
});
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error(`Error sending webhook for team ID: ${teamId}`, error.message);
|
console.error(
|
||||||
|
`Error sending webhook for team ID: ${teamId}`,
|
||||||
|
error.message
|
||||||
|
);
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
|
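Review note: the webhook change above picks the target URL from two inputs, `SELF_HOSTED_WEBHOOK_URL` and `USE_DB_AUTHENTICATION`. The sketch below spells out that decision as a small pure function. `WebhookSource` and `chooseWebhookSource` are hypothetical names for illustration only. In the lines shown, the case where neither input is set appears to leave `webhookUrl` undefined before `fetch` is called; the sketch makes that third case explicit so a caller could return early.

```typescript
// The three outcomes of the URL decision, modelled as a tagged union.
type WebhookSource =
  | { kind: "selfHosted"; url: string }
  | { kind: "database" }
  | { kind: "none" };

// Decide where the webhook URL should come from.
// An empty or missing self-hosted URL counts as "not set".
function chooseWebhookSource(
  selfHostedUrl: string | undefined,
  useDbAuthentication: boolean
): WebhookSource {
  if (selfHostedUrl) return { kind: "selfHosted", url: selfHostedUrl };
  if (useDbAuthentication) return { kind: "database" };
  return { kind: "none" };
}
```

With this shape, the Supabase query would run only for `kind === "database"`, and `kind === "none"` could skip the request entirely.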
||||||
|
@ -57,6 +57,12 @@ export interface AuthResponse {
|
|||||||
team_id?: string;
|
team_id?: string;
|
||||||
error?: string;
|
error?: string;
|
||||||
status?: number;
|
status?: number;
|
||||||
|
plan?: string;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
export enum NotificationType {
|
||||||
|
APPROACHING_LIMIT = "approachingLimit",
|
||||||
|
LIMIT_REACHED = "limitReached",
|
||||||
|
RATE_LIMIT_REACHED = "rateLimitReached",
|
||||||
|
}
|
@ -1,3 +1,4 @@
|
|||||||
|
import { v4 as uuidv4 } from 'uuid';
|
||||||
import FirecrawlApp from '@mendable/firecrawl-js';
|
import FirecrawlApp from '@mendable/firecrawl-js';
|
||||||
import { z } from "zod";
|
import { z } from "zod";
|
||||||
|
|
||||||
@ -8,7 +9,8 @@ const scrapeResult = await app.scrapeUrl('firecrawl.dev');
|
|||||||
console.log(scrapeResult.data.content)
|
console.log(scrapeResult.data.content)
|
||||||
|
|
||||||
// Crawl a website:
|
// Crawl a website:
|
||||||
const crawlResult = await app.crawlUrl('mendable.ai', {crawlerOptions: {excludes: ['blog/*'], limit: 5}}, false);
|
const idempotencyKey = uuidv4(); // optional
|
||||||
|
const crawlResult = await app.crawlUrl('mendable.ai', {crawlerOptions: {excludes: ['blog/*'], limit: 5}}, false, 2, idempotencyKey);
|
||||||
console.log(crawlResult)
|
console.log(crawlResult)
|
||||||
|
|
||||||
const jobId = await crawlResult['jobId'];
|
const jobId = await crawlResult['jobId'];
|
||||||
|
3
apps/js-sdk/firecrawl/.env.example
Normal file
@ -0,0 +1,3 @@
|
|||||||
|
API_URL=http://localhost:3002
|
||||||
|
TEST_API_KEY=fc-YOUR_API_KEY
|
||||||
|
|
@ -19,6 +19,7 @@ export default class FirecrawlApp {
|
|||||||
* @param {FirecrawlAppConfig} config - Configuration options for the FirecrawlApp instance.
|
* @param {FirecrawlAppConfig} config - Configuration options for the FirecrawlApp instance.
|
||||||
*/
|
*/
|
||||||
constructor({ apiKey = null }) {
|
constructor({ apiKey = null }) {
|
||||||
|
this.apiUrl = "https://api.firecrawl.dev";
|
||||||
this.apiKey = apiKey || "";
|
this.apiKey = apiKey || "";
|
||||||
if (!this.apiKey) {
|
if (!this.apiKey) {
|
||||||
throw new Error("No API key provided");
|
throw new Error("No API key provided");
|
||||||
@ -47,7 +48,7 @@ export default class FirecrawlApp {
|
|||||||
jsonData = Object.assign(Object.assign({}, jsonData), { extractorOptions: Object.assign(Object.assign({}, params.extractorOptions), { extractionSchema: schema, mode: params.extractorOptions.mode || "llm-extraction" }) });
|
jsonData = Object.assign(Object.assign({}, jsonData), { extractorOptions: Object.assign(Object.assign({}, params.extractorOptions), { extractionSchema: schema, mode: params.extractorOptions.mode || "llm-extraction" }) });
|
||||||
}
|
}
|
||||||
try {
|
try {
|
||||||
const response = yield axios.post("https://api.firecrawl.dev/v0/scrape", jsonData, { headers });
|
const response = yield axios.post(this.apiUrl + "/v0/scrape", jsonData, { headers });
|
||||||
if (response.status === 200) {
|
if (response.status === 200) {
|
||||||
const responseData = response.data;
|
const responseData = response.data;
|
||||||
if (responseData.success) {
|
if (responseData.success) {
|
||||||
@ -84,7 +85,7 @@ export default class FirecrawlApp {
|
|||||||
jsonData = Object.assign(Object.assign({}, jsonData), params);
|
jsonData = Object.assign(Object.assign({}, jsonData), params);
|
||||||
}
|
}
|
||||||
try {
|
try {
|
||||||
const response = yield axios.post("https://api.firecrawl.dev/v0/search", jsonData, { headers });
|
const response = yield axios.post(this.apiUrl + "/v0/search", jsonData, { headers });
|
||||||
if (response.status === 200) {
|
if (response.status === 200) {
|
||||||
const responseData = response.data;
|
const responseData = response.data;
|
||||||
if (responseData.success) {
|
if (responseData.success) {
|
||||||
@ -109,22 +110,23 @@ export default class FirecrawlApp {
|
|||||||
* @param {string} url - The URL to crawl.
|
* @param {string} url - The URL to crawl.
|
||||||
* @param {Params | null} params - Additional parameters for the crawl request.
|
* @param {Params | null} params - Additional parameters for the crawl request.
|
||||||
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
|
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
|
||||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
* @param {number} pollInterval - Interval in seconds between job status checks.
|
||||||
|
* @param {string} idempotencyKey - Optional idempotency key for the request.
|
||||||
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
|
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
|
||||||
*/
|
*/
|
||||||
crawlUrl(url_1) {
|
crawlUrl(url_1) {
|
||||||
return __awaiter(this, arguments, void 0, function* (url, params = null, waitUntilDone = true, timeout = 2) {
|
return __awaiter(this, arguments, void 0, function* (url, params = null, waitUntilDone = true, pollInterval = 2, idempotencyKey) {
|
||||||
const headers = this.prepareHeaders();
|
const headers = this.prepareHeaders(idempotencyKey);
|
||||||
let jsonData = { url };
|
let jsonData = { url };
|
||||||
if (params) {
|
if (params) {
|
||||||
jsonData = Object.assign(Object.assign({}, jsonData), params);
|
jsonData = Object.assign(Object.assign({}, jsonData), params);
|
||||||
}
|
}
|
||||||
try {
|
try {
|
||||||
const response = yield this.postRequest("https://api.firecrawl.dev/v0/crawl", jsonData, headers);
|
const response = yield this.postRequest(this.apiUrl + "/v0/crawl", jsonData, headers);
|
||||||
if (response.status === 200) {
|
if (response.status === 200) {
|
||||||
const jobId = response.data.jobId;
|
const jobId = response.data.jobId;
|
||||||
if (waitUntilDone) {
|
if (waitUntilDone) {
|
||||||
return this.monitorJobStatus(jobId, headers, timeout);
|
return this.monitorJobStatus(jobId, headers, pollInterval);
|
||||||
}
|
}
|
||||||
else {
|
else {
|
||||||
return { success: true, jobId };
|
return { success: true, jobId };
|
||||||
@ -150,9 +152,14 @@ export default class FirecrawlApp {
|
|||||||
return __awaiter(this, void 0, void 0, function* () {
|
return __awaiter(this, void 0, void 0, function* () {
|
||||||
const headers = this.prepareHeaders();
|
const headers = this.prepareHeaders();
|
||||||
try {
|
try {
|
||||||
const response = yield this.getRequest(`https://api.firecrawl.dev/v0/crawl/status/${jobId}`, headers);
|
const response = yield this.getRequest(this.apiUrl + `/v0/crawl/status/${jobId}`, headers);
|
||||||
if (response.status === 200) {
|
if (response.status === 200) {
|
||||||
return response.data;
|
return {
|
||||||
|
success: true,
|
||||||
|
status: response.data.status,
|
||||||
|
data: response.data.data,
|
||||||
|
partial_data: !response.data.data ? response.data.partial_data : undefined,
|
||||||
|
};
|
||||||
}
|
}
|
||||||
else {
|
else {
|
||||||
this.handleError(response, "check crawl status");
|
this.handleError(response, "check crawl status");
|
||||||
@ -172,11 +179,8 @@ export default class FirecrawlApp {
|
|||||||
* Prepares the headers for an API request.
|
* Prepares the headers for an API request.
|
||||||
* @returns {AxiosRequestHeaders} The prepared headers.
|
* @returns {AxiosRequestHeaders} The prepared headers.
|
||||||
*/
|
*/
|
||||||
prepareHeaders() {
|
prepareHeaders(idempotencyKey) {
|
||||||
return {
|
return Object.assign({ 'Content-Type': 'application/json', 'Authorization': `Bearer ${this.apiKey}` }, (idempotencyKey ? { 'x-idempotency-key': idempotencyKey } : {}));
|
||||||
"Content-Type": "application/json",
|
|
||||||
Authorization: `Bearer ${this.apiKey}`,
|
|
||||||
};
|
|
||||||
}
|
}
|
||||||
/**
|
/**
|
||||||
* Sends a POST request to the specified URL.
|
* Sends a POST request to the specified URL.
|
||||||
@ -204,10 +208,10 @@ export default class FirecrawlApp {
|
|||||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
* @param {number} timeout - Timeout in seconds for job status checks.
|
||||||
* @returns {Promise<any>} The final job status or data.
|
* @returns {Promise<any>} The final job status or data.
|
||||||
*/
|
*/
|
||||||
monitorJobStatus(jobId, headers, timeout) {
|
monitorJobStatus(jobId, headers, checkInterval) {
|
||||||
return __awaiter(this, void 0, void 0, function* () {
|
return __awaiter(this, void 0, void 0, function* () {
|
||||||
while (true) {
|
while (true) {
|
||||||
const statusResponse = yield this.getRequest(`https://api.firecrawl.dev/v0/crawl/status/${jobId}`, headers);
|
const statusResponse = yield this.getRequest(this.apiUrl + `/v0/crawl/status/${jobId}`, headers);
|
||||||
if (statusResponse.status === 200) {
|
if (statusResponse.status === 200) {
|
||||||
const statusData = statusResponse.data;
|
const statusData = statusResponse.data;
|
||||||
if (statusData.status === "completed") {
|
if (statusData.status === "completed") {
|
||||||
@ -219,10 +223,10 @@ export default class FirecrawlApp {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
else if (["active", "paused", "pending", "queued"].includes(statusData.status)) {
|
else if (["active", "paused", "pending", "queued"].includes(statusData.status)) {
|
||||||
if (timeout < 2) {
|
if (checkInterval < 2) {
|
||||||
timeout = 2;
|
checkInterval = 2;
|
||||||
}
|
}
|
||||||
yield new Promise((resolve) => setTimeout(resolve, timeout * 1000)); // Wait for the specified timeout before checking again
|
yield new Promise((resolve) => setTimeout(resolve, checkInterval * 1000)); // Wait for the check interval before polling again
|
||||||
}
|
}
|
||||||
else {
|
else {
|
||||||
throw new Error(`Crawl job failed or was stopped. Status: ${statusData.status}`);
|
throw new Error(`Crawl job failed or was stopped. Status: ${statusData.status}`);
|
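Review note: two small behaviours in the SDK changes above are easy to miss in the compiled output: the `x-idempotency-key` header is added only when a key is supplied, and the polling interval is raised to at least 2 seconds. The sketch below restates both as standalone functions. `buildHeaders` and `effectivePollInterval` are hypothetical names, and this is a reading of the compiled diff, not the SDK's TypeScript source.

```typescript
// Build request headers; the idempotency header is included only when a key is given.
function buildHeaders(apiKey: string, idempotencyKey?: string): Record<string, string> {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${apiKey}`,
  };
  if (idempotencyKey) {
    headers["x-idempotency-key"] = idempotencyKey;
  }
  return headers;
}

// Clamp the polling interval (in seconds) to the 2-second minimum used in the diff.
function effectivePollInterval(seconds: number): number {
  return Math.max(seconds, 2);
}
```

The polling loop then waits `effectivePollInterval(pollInterval) * 1000` milliseconds between status checks. Note that this value is a delay between checks, not an overall timeout, which matches the rename from `timeout` to `pollInterval` in the diff.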
||||||
|
66
apps/js-sdk/firecrawl/package-lock.json
generated
@ -1,22 +1,27 @@
|
|||||||
{
|
{
|
||||||
"name": "@mendable/firecrawl-js",
|
"name": "@mendable/firecrawl-js",
|
||||||
"version": "0.0.17-beta.8",
|
"version": "0.0.22",
|
||||||
"lockfileVersion": 3,
|
"lockfileVersion": 3,
|
||||||
"requires": true,
|
"requires": true,
|
||||||
"packages": {
|
"packages": {
|
||||||
"": {
|
"": {
|
||||||
"name": "@mendable/firecrawl-js",
|
"name": "@mendable/firecrawl-js",
|
||||||
"version": "0.0.17-beta.8",
|
"version": "0.0.22",
|
||||||
"license": "MIT",
|
"license": "MIT",
|
||||||
"dependencies": {
|
"dependencies": {
|
||||||
"axios": "^1.6.8",
|
"axios": "^1.6.8",
|
||||||
|
"dotenv": "^16.4.5",
|
||||||
|
"uuid": "^9.0.1",
|
||||||
"zod": "^3.23.8",
|
"zod": "^3.23.8",
|
||||||
"zod-to-json-schema": "^3.23.0"
|
"zod-to-json-schema": "^3.23.0"
|
||||||
},
|
},
|
||||||
"devDependencies": {
|
"devDependencies": {
|
||||||
"@jest/globals": "^29.7.0",
|
"@jest/globals": "^29.7.0",
|
||||||
"@types/axios": "^0.14.0",
|
"@types/axios": "^0.14.0",
|
||||||
"@types/node": "^20.12.7",
|
"@types/dotenv": "^8.2.0",
|
||||||
|
"@types/jest": "^29.5.12",
|
||||||
|
"@types/node": "^20.12.12",
|
||||||
|
"@types/uuid": "^9.0.8",
|
||||||
"jest": "^29.7.0",
|
"jest": "^29.7.0",
|
||||||
"ts-jest": "^29.1.2",
|
"ts-jest": "^29.1.2",
|
||||||
"typescript": "^5.4.5"
|
"typescript": "^5.4.5"
|
||||||
@ -1013,6 +1018,16 @@
|
|||||||
"@babel/types": "^7.20.7"
|
"@babel/types": "^7.20.7"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
"node_modules/@types/dotenv": {
|
||||||
|
"version": "8.2.0",
|
||||||
|
"resolved": "https://registry.npmjs.org/@types/dotenv/-/dotenv-8.2.0.tgz",
|
||||||
|
"integrity": "sha512-ylSC9GhfRH7m1EUXBXofhgx4lUWmFeQDINW5oLuS+gxWdfUeW4zJdeVTYVkexEW+e2VUvlZR2kGnGGipAWR7kw==",
|
||||||
|
"deprecated": "This is a stub types definition. dotenv provides its own type definitions, so you do not need this installed.",
|
||||||
|
"dev": true,
|
||||||
|
"dependencies": {
|
||||||
|
"dotenv": "*"
|
||||||
|
}
|
||||||
|
},
|
||||||
"node_modules/@types/graceful-fs": {
|
"node_modules/@types/graceful-fs": {
|
||||||
"version": "4.1.9",
|
"version": "4.1.9",
|
||||||
"resolved": "https://registry.npmjs.org/@types/graceful-fs/-/graceful-fs-4.1.9.tgz",
|
"resolved": "https://registry.npmjs.org/@types/graceful-fs/-/graceful-fs-4.1.9.tgz",
|
||||||
@ -1046,10 +1061,20 @@
|
|||||||
"@types/istanbul-lib-report": "*"
|
"@types/istanbul-lib-report": "*"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
"node_modules/@types/jest": {
|
||||||
|
"version": "29.5.12",
|
||||||
|
"resolved": "https://registry.npmjs.org/@types/jest/-/jest-29.5.12.tgz",
|
||||||
|
"integrity": "sha512-eDC8bTvT/QhYdxJAulQikueigY5AsdBRH2yDKW3yveW7svY3+DzN84/2NUgkw10RTiJbWqZrTtoGVdYlvFJdLw==",
|
||||||
|
"dev": true,
|
||||||
|
"dependencies": {
|
||||||
|
"expect": "^29.0.0",
|
||||||
|
"pretty-format": "^29.0.0"
|
||||||
|
}
|
||||||
|
},
|
||||||
"node_modules/@types/node": {
|
"node_modules/@types/node": {
|
||||||
"version": "20.12.7",
|
"version": "20.12.12",
|
||||||
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.12.7.tgz",
|
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.12.12.tgz",
|
||||||
"integrity": "sha512-wq0cICSkRLVaf3UGLMGItu/PtdY7oaXaI/RVU+xliKVOtRna3PRY57ZDfztpDL0n11vfymMUnXv8QwYCO7L1wg==",
|
"integrity": "sha512-eWLDGF/FOSPtAvEqeRAQ4C8LSA7M1I7i0ky1I8U7kD1J5ITyW3AsRhQrKVoWf5pFKZ2kILsEGJhsI9r93PYnOw==",
|
||||||
"dev": true,
|
"dev": true,
|
||||||
"dependencies": {
|
"dependencies": {
|
||||||
"undici-types": "~5.26.4"
|
"undici-types": "~5.26.4"
|
||||||
@ -1061,6 +1086,12 @@
|
|||||||
"integrity": "sha512-9aEbYZ3TbYMznPdcdr3SmIrLXwC/AKZXQeCf9Pgao5CKb8CyHuEX5jzWPTkvregvhRJHcpRO6BFoGW9ycaOkYw==",
|
"integrity": "sha512-9aEbYZ3TbYMznPdcdr3SmIrLXwC/AKZXQeCf9Pgao5CKb8CyHuEX5jzWPTkvregvhRJHcpRO6BFoGW9ycaOkYw==",
|
||||||
"dev": true
|
"dev": true
|
||||||
},
|
},
|
||||||
|
"node_modules/@types/uuid": {
|
||||||
|
"version": "9.0.8",
|
||||||
|
"resolved": "https://registry.npmjs.org/@types/uuid/-/uuid-9.0.8.tgz",
|
||||||
|
"integrity": "sha512-jg+97EGIcY9AGHJJRaaPVgetKDsrTgbRjQ5Msgjh/DQKEFl0DtyRr/VCOyD1T2R1MNeWPK/u7JoGhlDZnKBAfA==",
|
||||||
|
"dev": true
|
||||||
|
},
|
||||||
"node_modules/@types/yargs": {
|
"node_modules/@types/yargs": {
|
||||||
"version": "17.0.32",
|
"version": "17.0.32",
|
||||||
"resolved": "https://registry.npmjs.org/@types/yargs/-/yargs-17.0.32.tgz",
|
"resolved": "https://registry.npmjs.org/@types/yargs/-/yargs-17.0.32.tgz",
|
||||||
@ -1602,6 +1633,17 @@
|
|||||||
"node": "^14.15.0 || ^16.10.0 || >=18.0.0"
|
"node": "^14.15.0 || ^16.10.0 || >=18.0.0"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
"node_modules/dotenv": {
|
||||||
|
"version": "16.4.5",
|
||||||
|
"resolved": "https://registry.npmjs.org/dotenv/-/dotenv-16.4.5.tgz",
|
||||||
|
"integrity": "sha512-ZmdL2rui+eB2YwhsWzjInR8LldtZHGDoQ1ugH85ppHKwpUHL7j7rN0Ti9NCnGiQbhaZ11FpR+7ao1dNsmduNUg==",
|
||||||
|
"engines": {
|
||||||
|
"node": ">=12"
|
||||||
|
},
|
||||||
|
"funding": {
|
||||||
|
"url": "https://dotenvx.com"
|
||||||
|
}
|
||||||
|
},
|
||||||
"node_modules/electron-to-chromium": {
|
"node_modules/electron-to-chromium": {
|
||||||
"version": "1.4.748",
|
"version": "1.4.748",
|
||||||
"resolved": "https://registry.npmjs.org/electron-to-chromium/-/electron-to-chromium-1.4.748.tgz",
|
"resolved": "https://registry.npmjs.org/electron-to-chromium/-/electron-to-chromium-1.4.748.tgz",
|
||||||
@ -3641,6 +3683,18 @@
|
|||||||
"browserslist": ">= 4.21.0"
|
"browserslist": ">= 4.21.0"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
"node_modules/uuid": {
|
||||||
|
"version": "9.0.1",
|
||||||
|
"resolved": "https://registry.npmjs.org/uuid/-/uuid-9.0.1.tgz",
|
||||||
|
"integrity": "sha512-b+1eJOlsR9K8HJpow9Ok3fiWOWSIcIzXodvv0rQjVoOVNpWMpxf1wZNpt4y9h10odCNrqnYp1OBzRktckBe3sA==",
|
||||||
|
"funding": [
|
||||||
|
"https://github.com/sponsors/broofa",
|
||||||
|
"https://github.com/sponsors/ctavan"
|
||||||
|
],
|
||||||
|
"bin": {
|
||||||
|
"uuid": "dist/bin/uuid"
|
||||||
|
}
|
||||||
|
},
|
||||||
"node_modules/v8-to-istanbul": {
|
"node_modules/v8-to-istanbul": {
|
||||||
"version": "9.2.0",
|
"version": "9.2.0",
|
||||||
"resolved": "https://registry.npmjs.org/v8-to-istanbul/-/v8-to-istanbul-9.2.0.tgz",
|
"resolved": "https://registry.npmjs.org/v8-to-istanbul/-/v8-to-istanbul-9.2.0.tgz",
|
||||||
|
@ -1,15 +1,15 @@
|
|||||||
{
|
{
|
||||||
"name": "@mendable/firecrawl-js",
|
"name": "@mendable/firecrawl-js",
|
||||||
"version": "0.0.21",
|
"version": "0.0.26",
|
||||||
"description": "JavaScript SDK for Firecrawl API",
|
"description": "JavaScript SDK for Firecrawl API",
|
||||||
"main": "build/index.js",
|
"main": "build/index.js",
|
||||||
"types": "types/index.d.ts",
|
"types": "types/index.d.ts",
|
||||||
"type": "module",
|
"type": "module",
|
||||||
"scripts": {
|
"scripts": {
|
||||||
"build": "tsc",
|
"build": "tsc",
|
||||||
"publish": "npm run build && npm publish --access public",
|
"build-and-publish": "npm run build && npm publish --access public",
|
||||||
"publish-beta": "npm run build && npm publish --access public --tag beta",
|
"publish-beta": "npm run build && npm publish --access public --tag beta",
|
||||||
"test": "jest src/**/*.test.ts"
|
"test": "jest src/__tests__/**/*.test.ts"
|
||||||
},
|
},
|
||||||
"repository": {
|
"repository": {
|
||||||
"type": "git",
|
"type": "git",
|
||||||
@ -19,6 +19,8 @@
|
|||||||
"license": "MIT",
|
"license": "MIT",
|
||||||
"dependencies": {
|
"dependencies": {
|
||||||
"axios": "^1.6.8",
|
"axios": "^1.6.8",
|
||||||
|
"dotenv": "^16.4.5",
|
||||||
|
"uuid": "^9.0.1",
|
||||||
"zod": "^3.23.8",
|
"zod": "^3.23.8",
|
||||||
"zod-to-json-schema": "^3.23.0"
|
"zod-to-json-schema": "^3.23.0"
|
||||||
},
|
},
|
||||||
@ -29,7 +31,10 @@
|
|||||||
"devDependencies": {
|
"devDependencies": {
|
||||||
"@jest/globals": "^29.7.0",
|
"@jest/globals": "^29.7.0",
|
||||||
"@types/axios": "^0.14.0",
|
"@types/axios": "^0.14.0",
|
||||||
"@types/node": "^20.12.7",
|
"@types/dotenv": "^8.2.0",
|
||||||
|
"@types/jest": "^29.5.12",
|
||||||
|
"@types/node": "^20.12.12",
|
||||||
|
"@types/uuid": "^9.0.8",
|
||||||
"jest": "^29.7.0",
|
"jest": "^29.7.0",
|
||||||
"ts-jest": "^29.1.2",
|
"ts-jest": "^29.1.2",
|
||||||
"typescript": "^5.4.5"
|
"typescript": "^5.4.5"
|
||||||
|
155
apps/js-sdk/firecrawl/src/__tests__/e2e_withAuth/index.test.ts
Normal file
@ -0,0 +1,155 @@
|
|||||||
|
import FirecrawlApp from '../../index';
|
||||||
|
import { v4 as uuidv4 } from 'uuid';
|
||||||
|
import dotenv from 'dotenv';
|
||||||
|
|
||||||
|
dotenv.config();
|
||||||
|
|
||||||
|
const TEST_API_KEY = process.env.TEST_API_KEY;
|
||||||
|
const API_URL = "http://127.0.0.1:3002";
|
||||||
|
|
||||||
|
describe('FirecrawlApp E2E Tests', () => {
|
||||||
|
test.concurrent('should throw error for no API key', () => {
|
||||||
|
expect(() => {
|
||||||
|
new FirecrawlApp({ apiKey: null, apiUrl: API_URL });
|
||||||
|
}).toThrow("No API key provided");
|
||||||
|
});
|
||||||
|
|
||||||
|
test.concurrent('should throw error for invalid API key on scrape', async () => {
|
||||||
|
const invalidApp = new FirecrawlApp({ apiKey: "invalid_api_key", apiUrl: API_URL });
|
||||||
|
await expect(invalidApp.scrapeUrl('https://roastmywebsite.ai')).rejects.toThrow("Request failed with status code 401");
|
||||||
|
});
|
||||||
|
|
||||||
|
test.concurrent('should throw error for blocklisted URL on scrape', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const blocklistedUrl = "https://facebook.com/fake-test";
|
||||||
|
await expect(app.scrapeUrl(blocklistedUrl)).rejects.toThrow("Request failed with status code 403");
|
||||||
|
});
|
||||||
|
|
||||||
|
test.concurrent('should return successful response with valid preview token', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: "this_is_just_a_preview_token", apiUrl: API_URL });
|
||||||
|
const response = await app.scrapeUrl('https://roastmywebsite.ai');
|
||||||
|
expect(response).not.toBeNull();
|
||||||
|
expect(response.data.content).toContain("_Roast_");
|
||||||
|
}, 30000); // 30 seconds timeout
|
||||||
|
|
||||||
|
test.concurrent('should return successful response for valid scrape', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const response = await app.scrapeUrl('https://roastmywebsite.ai');
|
||||||
|
expect(response).not.toBeNull();
|
||||||
|
expect(response.data.content).toContain("_Roast_");
|
||||||
|
expect(response.data).toHaveProperty('markdown');
|
||||||
|
expect(response.data).toHaveProperty('metadata');
|
||||||
|
expect(response.data).not.toHaveProperty('html');
|
||||||
|
}, 30000); // 30 seconds timeout
|
||||||
|
|
||||||
|
test.concurrent('should return successful response with valid API key and include HTML', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const response = await app.scrapeUrl('https://roastmywebsite.ai', { pageOptions: { includeHtml: true } });
|
||||||
|
expect(response).not.toBeNull();
|
||||||
|
expect(response.data.content).toContain("_Roast_");
|
||||||
|
expect(response.data.markdown).toContain("_Roast_");
|
||||||
|
expect(response.data.html).toContain("<h1");
|
||||||
|
}, 30000); // 30 seconds timeout
|
||||||
|
|
||||||
|
test.concurrent('should return successful response for valid scrape with PDF file', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const response = await app.scrapeUrl('https://arxiv.org/pdf/astro-ph/9301001.pdf');
|
||||||
|
expect(response).not.toBeNull();
|
||||||
|
expect(response.data.content).toContain('We present spectrophotometric observations of the Broad Line Radio Galaxy');
|
||||||
|
}, 30000); // 30 seconds timeout
|
||||||
|
|
||||||
|
test.concurrent('should return successful response for valid scrape with PDF file without explicit extension', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const response = await app.scrapeUrl('https://arxiv.org/pdf/astro-ph/9301001');
|
||||||
|
expect(response).not.toBeNull();
|
||||||
|
expect(response.data.content).toContain('We present spectrophotometric observations of the Broad Line Radio Galaxy');
|
||||||
|
}, 30000); // 30 seconds timeout
|
||||||
|
|
||||||
|
test.concurrent('should throw error for invalid API key on crawl', async () => {
|
||||||
|
const invalidApp = new FirecrawlApp({ apiKey: "invalid_api_key", apiUrl: API_URL });
|
||||||
|
await expect(invalidApp.crawlUrl('https://roastmywebsite.ai')).rejects.toThrow("Request failed with status code 401");
|
||||||
|
});
|
||||||
|
|
||||||
|
test.concurrent('should throw error for blocklisted URL on crawl', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const blocklistedUrl = "https://twitter.com/fake-test";
|
||||||
|
await expect(app.crawlUrl(blocklistedUrl)).rejects.toThrow("Request failed with status code 403");
|
||||||
|
});
|
||||||
|
|
||||||
|
test.concurrent('should return successful response for crawl and wait for completion', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const response = await app.crawlUrl('https://roastmywebsite.ai', { crawlerOptions: { excludes: ['blog/*'] } }, true, 30);
|
||||||
|
expect(response).not.toBeNull();
|
||||||
|
expect(response[0].content).toContain("_Roast_");
|
||||||
|
}, 60000); // 60 seconds timeout
|
||||||
|
|
||||||
|
test.concurrent('should handle idempotency key for crawl', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const uniqueIdempotencyKey = uuidv4();
|
||||||
|
const response = await app.crawlUrl('https://roastmywebsite.ai', { crawlerOptions: { excludes: ['blog/*'] } }, false, 2, uniqueIdempotencyKey);
|
||||||
|
expect(response).not.toBeNull();
|
||||||
|
expect(response.jobId).toBeDefined();
|
||||||
|
|
||||||
|
await expect(app.crawlUrl('https://roastmywebsite.ai', { crawlerOptions: { excludes: ['blog/*'] } }, true, 2, uniqueIdempotencyKey)).rejects.toThrow("Request failed with status code 409");
|
||||||
|
});
|
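The idempotency test above relies on the SDK attaching the key as a request header. A minimal Python sketch of that header logic follows, modeled on the JS `prepareHeaders` change in this commit. The function name `prepare_headers` and its standalone form are assumptions for illustration; the header name `x-idempotency-key` is taken from the JS SDK diff, and the Python SDK's actual `_prepare_headers` body is not shown here.

```python
import uuid
from typing import Dict, Optional


def prepare_headers(api_key: str, idempotency_key: Optional[str] = None) -> Dict[str, str]:
    """Build request headers, adding the idempotency header only when a key is given.

    Hypothetical standalone helper; mirrors the JS SDK's prepareHeaders(idempotencyKey).
    """
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    if idempotency_key:
        # Header name as used by the JS SDK in this commit.
        headers["x-idempotency-key"] = idempotency_key
    return headers


# Usage: generate one key per logical request and reuse it on retries.
key = str(uuid.uuid4())
first_attempt = prepare_headers("example-api-key", key)
```

Per the test above, sending a second crawl request with the same key is expected to be rejected with status code 409.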
||||||
|
|
||||||
|
test.concurrent('should check crawl status', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const response = await app.crawlUrl('https://roastmywebsite.ai', { crawlerOptions: { excludes: ['blog/*'] } }, false);
|
||||||
|
expect(response).not.toBeNull();
|
||||||
|
expect(response.jobId).toBeDefined();
|
||||||
|
|
||||||
|
let statusResponse = await app.checkCrawlStatus(response.jobId);
|
||||||
|
const maxChecks = 15;
|
||||||
|
let checks = 0;
|
||||||
|
|
||||||
|
while (statusResponse.status === 'active' && checks < maxChecks) {
|
||||||
|
await new Promise(resolve => setTimeout(resolve, 1000));
|
||||||
|
expect(statusResponse.partial_data).not.toBeNull();
|
||||||
|
statusResponse = await app.checkCrawlStatus(response.jobId);
|
||||||
|
checks++;
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(statusResponse).not.toBeNull();
|
||||||
|
expect(statusResponse.status).toBe('completed');
|
||||||
|
expect(statusResponse.data.length).toBeGreaterThan(0);
|
||||||
|
}, 35000); // 35 seconds timeout
|
||||||
|
|
||||||
|
test.concurrent('should return successful response for search', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const response = await app.search("test query");
|
||||||
|
expect(response).not.toBeNull();
|
||||||
|
expect(response.data[0].content).toBeDefined();
|
||||||
|
expect(response.data.length).toBeGreaterThan(2);
|
||||||
|
}, 30000); // 30 seconds timeout
|
||||||
|
|
||||||
|
test.concurrent('should throw error for invalid API key on search', async () => {
|
||||||
|
const invalidApp = new FirecrawlApp({ apiKey: "invalid_api_key", apiUrl: API_URL });
|
||||||
|
await expect(invalidApp.search("test query")).rejects.toThrow("Request failed with status code 401");
|
||||||
|
});
|
||||||
|
|
||||||
|
test.concurrent('should perform LLM extraction', async () => {
|
||||||
|
const app = new FirecrawlApp({ apiKey: TEST_API_KEY, apiUrl: API_URL });
|
||||||
|
const response = await app.scrapeUrl("https://mendable.ai", {
|
||||||
|
extractorOptions: {
|
||||||
|
mode: 'llm-extraction',
|
||||||
|
extractionPrompt: "Based on the information on the page, find what the company's mission is and whether it supports SSO, and whether it is open source",
|
||||||
|
extractionSchema: {
|
||||||
|
type: 'object',
|
||||||
|
properties: {
|
||||||
|
company_mission: { type: 'string' },
|
||||||
|
supports_sso: { type: 'boolean' },
|
||||||
|
is_open_source: { type: 'boolean' }
|
||||||
|
},
|
||||||
|
required: ['company_mission', 'supports_sso', 'is_open_source']
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
expect(response).not.toBeNull();
|
||||||
|
expect(response.data.llm_extraction).toBeDefined();
|
||||||
|
const llmExtraction = response.data.llm_extraction;
|
||||||
|
expect(llmExtraction.company_mission).toBeDefined();
|
||||||
|
expect(typeof llmExtraction.supports_sso).toBe('boolean');
|
||||||
|
expect(typeof llmExtraction.is_open_source).toBe('boolean');
|
||||||
|
}, 30000); // 30 seconds timeout
|
||||||
|
});
|
@ -6,6 +6,7 @@ import { zodToJsonSchema } from "zod-to-json-schema";
|
|||||||
*/
|
*/
|
||||||
export interface FirecrawlAppConfig {
|
export interface FirecrawlAppConfig {
|
||||||
apiKey?: string | null;
|
apiKey?: string | null;
|
||||||
|
apiUrl?: string | null;
|
||||||
}
|
}
|
||||||
|
|
||||||
/**
|
/**
|
||||||
@ -55,6 +56,7 @@ export interface JobStatusResponse {
|
|||||||
status: string;
|
status: string;
|
||||||
jobId?: string;
|
jobId?: string;
|
||||||
data?: any;
|
data?: any;
|
||||||
|
partial_data?: any;
|
||||||
error?: string;
|
error?: string;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -63,6 +65,7 @@ export interface JobStatusResponse {
|
|||||||
*/
|
*/
|
||||||
export default class FirecrawlApp {
|
export default class FirecrawlApp {
|
||||||
private apiKey: string;
|
private apiKey: string;
|
||||||
|
private apiUrl: string = "https://api.firecrawl.dev";
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Initializes a new instance of the FirecrawlApp class.
|
* Initializes a new instance of the FirecrawlApp class.
|
||||||
@ -107,7 +110,7 @@ export default class FirecrawlApp {
|
|||||||
}
|
}
|
||||||
try {
|
try {
|
||||||
const response: AxiosResponse = await axios.post(
|
const response: AxiosResponse = await axios.post(
|
||||||
"https://api.firecrawl.dev/v0/scrape",
|
this.apiUrl + "/v0/scrape",
|
||||||
jsonData,
|
jsonData,
|
||||||
{ headers },
|
{ headers },
|
||||||
);
|
);
|
||||||
@ -147,7 +150,7 @@ export default class FirecrawlApp {
|
|||||||
}
|
}
|
||||||
try {
|
try {
|
||||||
const response: AxiosResponse = await axios.post(
|
const response: AxiosResponse = await axios.post(
|
||||||
"https://api.firecrawl.dev/v0/search",
|
this.apiUrl + "/v0/search",
|
||||||
jsonData,
|
jsonData,
|
||||||
{ headers }
|
{ headers }
|
||||||
);
|
);
|
||||||
@ -172,30 +175,32 @@ export default class FirecrawlApp {
|
|||||||
* @param {string} url - The URL to crawl.
|
* @param {string} url - The URL to crawl.
|
||||||
* @param {Params | null} params - Additional parameters for the crawl request.
|
* @param {Params | null} params - Additional parameters for the crawl request.
|
||||||
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
|
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
|
||||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
* @param {number} pollInterval - Interval in seconds between job status checks.
|
||||||
|
* @param {string} idempotencyKey - Optional idempotency key for the request.
|
||||||
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
|
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
|
||||||
*/
|
*/
|
||||||
async crawlUrl(
|
async crawlUrl(
|
||||||
url: string,
|
url: string,
|
||||||
params: Params | null = null,
|
params: Params | null = null,
|
||||||
waitUntilDone: boolean = true,
|
waitUntilDone: boolean = true,
|
||||||
timeout: number = 2
|
pollInterval: number = 2,
|
||||||
|
idempotencyKey?: string
|
||||||
): Promise<CrawlResponse | any> {
|
): Promise<CrawlResponse | any> {
|
||||||
const headers = this.prepareHeaders();
|
const headers = this.prepareHeaders(idempotencyKey);
|
||||||
let jsonData: Params = { url };
|
let jsonData: Params = { url };
|
||||||
if (params) {
|
if (params) {
|
||||||
jsonData = { ...jsonData, ...params };
|
jsonData = { ...jsonData, ...params };
|
||||||
}
|
}
|
||||||
try {
|
try {
|
||||||
const response: AxiosResponse = await this.postRequest(
|
const response: AxiosResponse = await this.postRequest(
|
||||||
"https://api.firecrawl.dev/v0/crawl",
|
this.apiUrl + "/v0/crawl",
|
||||||
jsonData,
|
jsonData,
|
||||||
headers
|
headers
|
||||||
);
|
);
|
||||||
if (response.status === 200) {
|
if (response.status === 200) {
|
||||||
const jobId: string = response.data.jobId;
|
const jobId: string = response.data.jobId;
|
||||||
if (waitUntilDone) {
|
if (waitUntilDone) {
|
||||||
return this.monitorJobStatus(jobId, headers, timeout);
|
return this.monitorJobStatus(jobId, headers, pollInterval);
|
||||||
} else {
|
} else {
|
||||||
return { success: true, jobId };
|
return { success: true, jobId };
|
||||||
}
|
}
|
||||||
@ -218,11 +223,16 @@ export default class FirecrawlApp {
|
|||||||
const headers: AxiosRequestHeaders = this.prepareHeaders();
|
const headers: AxiosRequestHeaders = this.prepareHeaders();
|
||||||
try {
|
try {
|
||||||
const response: AxiosResponse = await this.getRequest(
|
const response: AxiosResponse = await this.getRequest(
|
||||||
`https://api.firecrawl.dev/v0/crawl/status/${jobId}`,
|
this.apiUrl + `/v0/crawl/status/${jobId}`,
|
||||||
headers
|
headers
|
||||||
);
|
);
|
||||||
if (response.status === 200) {
|
if (response.status === 200) {
|
||||||
return response.data;
|
return {
|
||||||
|
success: true,
|
||||||
|
status: response.data.status,
|
||||||
|
data: response.data.data,
|
||||||
|
partial_data: !response.data.data ? response.data.partial_data : undefined,
|
||||||
|
};
|
||||||
} else {
|
} else {
|
||||||
this.handleError(response, "check crawl status");
|
this.handleError(response, "check crawl status");
|
||||||
}
|
}
|
||||||
@ -240,11 +250,12 @@ export default class FirecrawlApp {
|
|||||||
* Prepares the headers for an API request.
|
* Prepares the headers for an API request.
|
||||||
* @returns {AxiosRequestHeaders} The prepared headers.
|
* @returns {AxiosRequestHeaders} The prepared headers.
|
||||||
*/
|
*/
|
||||||
prepareHeaders(): AxiosRequestHeaders {
|
prepareHeaders(idempotencyKey?: string): AxiosRequestHeaders {
|
||||||
return {
|
return {
|
||||||
"Content-Type": "application/json",
|
'Content-Type': 'application/json',
|
||||||
Authorization: `Bearer ${this.apiKey}`,
|
'Authorization': `Bearer ${this.apiKey}`,
|
||||||
} as AxiosRequestHeaders;
|
...(idempotencyKey ? { 'x-idempotency-key': idempotencyKey } : {}),
|
||||||
|
} as AxiosRequestHeaders & { 'x-idempotency-key'?: string };
|
||||||
}
|
}
|
||||||
|
|
||||||
/**
|
/**
|
||||||
@ -285,11 +296,11 @@ export default class FirecrawlApp {
|
|||||||
async monitorJobStatus(
|
async monitorJobStatus(
|
||||||
jobId: string,
|
jobId: string,
|
||||||
headers: AxiosRequestHeaders,
|
headers: AxiosRequestHeaders,
|
||||||
timeout: number
|
checkInterval: number
|
||||||
): Promise<any> {
|
): Promise<any> {
|
||||||
while (true) {
|
while (true) {
|
||||||
const statusResponse: AxiosResponse = await this.getRequest(
|
const statusResponse: AxiosResponse = await this.getRequest(
|
||||||
`https://api.firecrawl.dev/v0/crawl/status/${jobId}`,
|
this.apiUrl + `/v0/crawl/status/${jobId}`,
|
||||||
headers
|
headers
|
||||||
);
|
);
|
||||||
if (statusResponse.status === 200) {
|
if (statusResponse.status === 200) {
|
||||||
@ -303,10 +314,10 @@ export default class FirecrawlApp {
|
|||||||
} else if (
|
} else if (
|
||||||
["active", "paused", "pending", "queued"].includes(statusData.status)
|
["active", "paused", "pending", "queued"].includes(statusData.status)
|
||||||
) {
|
) {
|
||||||
if (timeout < 2) {
|
if (checkInterval < 2) {
|
||||||
timeout = 2;
|
checkInterval = 2;
|
||||||
}
|
}
|
||||||
await new Promise((resolve) => setTimeout(resolve, timeout * 1000)); // Wait for the specified timeout before checking again
|
await new Promise((resolve) => setTimeout(resolve, checkInterval * 1000)); // Wait for the check interval before polling again
|
||||||
} else {
|
} else {
|
||||||
throw new Error(
|
throw new Error(
|
||||||
`Crawl job failed or was stopped. Status: ${statusData.status}`
|
`Crawl job failed or was stopped. Status: ${statusData.status}`
|
||||||
|
12
apps/js-sdk/firecrawl/types/index.d.ts
vendored
@ -5,6 +5,7 @@ import { z } from "zod";
|
|||||||
*/
|
*/
|
||||||
export interface FirecrawlAppConfig {
|
export interface FirecrawlAppConfig {
|
||||||
apiKey?: string | null;
|
apiKey?: string | null;
|
||||||
|
apiUrl?: string | null;
|
||||||
}
|
}
|
||||||
/**
|
/**
|
||||||
* Generic parameter interface.
|
* Generic parameter interface.
|
||||||
@ -50,6 +51,7 @@ export interface JobStatusResponse {
|
|||||||
status: string;
|
status: string;
|
||||||
jobId?: string;
|
jobId?: string;
|
||||||
data?: any;
|
data?: any;
|
||||||
|
partial_data?: any;
|
||||||
error?: string;
|
error?: string;
|
||||||
}
|
}
|
||||||
/**
|
/**
|
||||||
@ -57,6 +59,7 @@ export interface JobStatusResponse {
|
|||||||
*/
|
*/
|
||||||
export default class FirecrawlApp {
|
export default class FirecrawlApp {
|
||||||
private apiKey;
|
private apiKey;
|
||||||
|
private apiUrl;
|
||||||
/**
|
/**
|
||||||
* Initializes a new instance of the FirecrawlApp class.
|
* Initializes a new instance of the FirecrawlApp class.
|
||||||
* @param {FirecrawlAppConfig} config - Configuration options for the FirecrawlApp instance.
|
* @param {FirecrawlAppConfig} config - Configuration options for the FirecrawlApp instance.
|
||||||
@ -81,10 +84,11 @@ export default class FirecrawlApp {
|
|||||||
* @param {string} url - The URL to crawl.
|
* @param {string} url - The URL to crawl.
|
||||||
* @param {Params | null} params - Additional parameters for the crawl request.
|
* @param {Params | null} params - Additional parameters for the crawl request.
|
||||||
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
|
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
|
||||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
* @param {number} pollInterval - Interval in seconds between job status checks.
|
||||||
|
* @param {string} idempotencyKey - Optional idempotency key for the request.
|
||||||
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
|
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
|
||||||
*/
|
*/
|
||||||
crawlUrl(url: string, params?: Params | null, waitUntilDone?: boolean, timeout?: number): Promise<CrawlResponse | any>;
|
crawlUrl(url: string, params?: Params | null, waitUntilDone?: boolean, pollInterval?: number, idempotencyKey?: string): Promise<CrawlResponse | any>;
|
||||||
/**
|
/**
|
||||||
* Checks the status of a crawl job using the Firecrawl API.
|
* Checks the status of a crawl job using the Firecrawl API.
|
||||||
* @param {string} jobId - The job ID of the crawl operation.
|
* @param {string} jobId - The job ID of the crawl operation.
|
||||||
@ -95,7 +99,7 @@ export default class FirecrawlApp {
|
|||||||
* Prepares the headers for an API request.
|
* Prepares the headers for an API request.
|
||||||
* @returns {AxiosRequestHeaders} The prepared headers.
|
* @returns {AxiosRequestHeaders} The prepared headers.
|
||||||
*/
|
*/
|
||||||
prepareHeaders(): AxiosRequestHeaders;
|
prepareHeaders(idempotencyKey?: string): AxiosRequestHeaders;
|
||||||
/**
|
/**
|
||||||
* Sends a POST request to the specified URL.
|
* Sends a POST request to the specified URL.
|
||||||
* @param {string} url - The URL to send the request to.
|
* @param {string} url - The URL to send the request to.
|
||||||
@ -118,7 +122,7 @@ export default class FirecrawlApp {
|
|||||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
* @param {number} timeout - Timeout in seconds for job status checks.
|
||||||
* @returns {Promise<any>} The final job status or data.
|
* @returns {Promise<any>} The final job status or data.
|
||||||
*/
|
*/
|
||||||
monitorJobStatus(jobId: string, headers: AxiosRequestHeaders, timeout: number): Promise<any>;
|
monitorJobStatus(jobId: string, headers: AxiosRequestHeaders, checkInterval: number): Promise<any>;
|
||||||
/**
|
/**
|
||||||
* Handles errors from API responses.
|
* Handles errors from API responses.
|
||||||
* @param {AxiosResponse} response - The response from the API.
|
* @param {AxiosResponse} response - The response from the API.
|
||||||
|
25
apps/js-sdk/package-lock.json
generated
@ -11,8 +11,10 @@
|
|||||||
"dependencies": {
|
"dependencies": {
|
||||||
"@mendable/firecrawl-js": "^0.0.19",
|
"@mendable/firecrawl-js": "^0.0.19",
|
||||||
"axios": "^1.6.8",
|
"axios": "^1.6.8",
|
||||||
|
"dotenv": "^16.4.5",
|
||||||
"ts-node": "^10.9.2",
|
"ts-node": "^10.9.2",
|
||||||
"typescript": "^5.4.5",
|
"typescript": "^5.4.5",
|
||||||
|
"uuid": "^9.0.1",
|
||||||
"zod": "^3.23.8"
|
"zod": "^3.23.8"
|
||||||
},
|
},
|
||||||
"devDependencies": {
|
"devDependencies": {
|
||||||
@ -530,6 +532,17 @@
|
|||||||
"node": ">=0.3.1"
|
"node": ">=0.3.1"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
"node_modules/dotenv": {
|
||||||
|
"version": "16.4.5",
|
||||||
|
"resolved": "https://registry.npmjs.org/dotenv/-/dotenv-16.4.5.tgz",
|
||||||
|
"integrity": "sha512-ZmdL2rui+eB2YwhsWzjInR8LldtZHGDoQ1ugH85ppHKwpUHL7j7rN0Ti9NCnGiQbhaZ11FpR+7ao1dNsmduNUg==",
|
||||||
|
"engines": {
|
||||||
|
"node": ">=12"
|
||||||
|
},
|
||||||
|
"funding": {
|
||||||
|
"url": "https://dotenvx.com"
|
||||||
|
}
|
||||||
|
},
|
||||||
"node_modules/esbuild": {
|
"node_modules/esbuild": {
|
||||||
"version": "0.20.2",
|
"version": "0.20.2",
|
||||||
"resolved": "https://registry.npmjs.org/esbuild/-/esbuild-0.20.2.tgz",
|
"resolved": "https://registry.npmjs.org/esbuild/-/esbuild-0.20.2.tgz",
|
||||||
@ -743,6 +756,18 @@
|
|||||||
"integrity": "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA==",
|
"integrity": "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA==",
|
||||||
"peer": true
|
"peer": true
|
||||||
},
|
},
|
||||||
|
"node_modules/uuid": {
|
||||||
|
"version": "9.0.1",
|
||||||
|
"resolved": "https://registry.npmjs.org/uuid/-/uuid-9.0.1.tgz",
|
||||||
|
"integrity": "sha512-b+1eJOlsR9K8HJpow9Ok3fiWOWSIcIzXodvv0rQjVoOVNpWMpxf1wZNpt4y9h10odCNrqnYp1OBzRktckBe3sA==",
|
||||||
|
"funding": [
|
||||||
|
"https://github.com/sponsors/broofa",
|
||||||
|
"https://github.com/sponsors/ctavan"
|
||||||
|
],
|
||||||
|
"bin": {
|
||||||
|
"uuid": "dist/bin/uuid"
|
||||||
|
}
|
||||||
|
},
|
||||||
"node_modules/v8-compile-cache-lib": {
|
"node_modules/v8-compile-cache-lib": {
|
||||||
"version": "3.0.1",
|
"version": "3.0.1",
|
||||||
"resolved": "https://registry.npmjs.org/v8-compile-cache-lib/-/v8-compile-cache-lib-3.0.1.tgz",
|
"resolved": "https://registry.npmjs.org/v8-compile-cache-lib/-/v8-compile-cache-lib-3.0.1.tgz",
|
||||||
|
63
apps/playwright-service/get_error.py
Normal file
@ -0,0 +1,63 @@
|
|||||||
|
def get_error(status_code: int) -> str:
|
||||||
|
error_messages = {
|
||||||
|
300: "Multiple Choices",
|
||||||
|
301: "Moved Permanently",
|
||||||
|
302: "Found",
|
||||||
|
303: "See Other",
|
||||||
|
304: "Not Modified",
|
||||||
|
305: "Use Proxy",
|
||||||
|
307: "Temporary Redirect",
|
||||||
|
308: "Permanent Redirect",
|
||||||
|
309: "Resume Incomplete",
|
||||||
|
310: "Too Many Redirects",
|
||||||
|
311: "Unavailable For Legal Reasons",
|
||||||
|
312: "Previously Used",
|
||||||
|
313: "I'm Used",
|
||||||
|
314: "Switch Proxy",
|
||||||
|
315: "Temporary Redirect",
|
||||||
|
316: "Resume Incomplete",
|
||||||
|
317: "Too Many Redirects",
|
||||||
|
400: "Bad Request",
|
||||||
|
401: "Unauthorized",
|
||||||
|
403: "Forbidden",
|
||||||
|
404: "Not Found",
|
||||||
|
405: "Method Not Allowed",
|
||||||
|
406: "Not Acceptable",
|
||||||
|
407: "Proxy Authentication Required",
|
||||||
|
408: "Request Timeout",
|
||||||
|
409: "Conflict",
|
||||||
|
410: "Gone",
|
||||||
|
411: "Length Required",
|
||||||
|
412: "Precondition Failed",
|
||||||
|
413: "Payload Too Large",
|
||||||
|
414: "URI Too Long",
|
||||||
|
415: "Unsupported Media Type",
|
||||||
|
416: "Range Not Satisfiable",
|
||||||
|
417: "Expectation Failed",
|
||||||
|
418: "I'm a teapot",
|
||||||
|
421: "Misdirected Request",
|
||||||
|
422: "Unprocessable Entity",
|
||||||
|
423: "Locked",
|
||||||
|
424: "Failed Dependency",
|
||||||
|
425: "Too Early",
|
||||||
|
426: "Upgrade Required",
|
||||||
|
428: "Precondition Required",
|
||||||
|
429: "Too Many Requests",
|
||||||
|
431: "Request Header Fields Too Large",
|
||||||
|
451: "Unavailable For Legal Reasons",
|
||||||
|
500: "Internal Server Error",
|
||||||
|
501: "Not Implemented",
|
||||||
|
502: "Bad Gateway",
|
||||||
|
503: "Service Unavailable",
|
||||||
|
504: "Gateway Timeout",
|
||||||
|
505: "HTTP Version Not Supported",
|
||||||
|
506: "Variant Also Negotiates",
|
||||||
|
507: "Insufficient Storage",
|
||||||
|
508: "Loop Detected",
|
||||||
|
510: "Not Extended",
|
||||||
|
511: "Network Authentication Required",
|
||||||
|
599: "Network Connect Timeout Error"
|
||||||
|
}
|
||||||
|
if status_code < 300:
|
||||||
|
return None
|
||||||
|
return error_messages.get(status_code, "Unknown Error")
|
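For comparison, here is a compact sketch of the same lookup built on the standard library's `http.HTTPStatus` enum instead of a hand-written table. This is a swapped-in technique, not the code from this commit. The name `get_error_sketch` is hypothetical, and the results differ from the table above: codes that the enum does not register (such as 599, or the non-standard 309 to 317 entries) fall through to "Unknown Error", and a few phrases may be worded differently depending on the Python version.

```python
from http import HTTPStatus
from typing import Optional


def get_error_sketch(status_code: int) -> Optional[str]:
    """Return a reason phrase for codes >= 300, or None for codes below 300."""
    if status_code < 300:
        return None
    try:
        # HTTPStatus(...) raises ValueError for codes it does not know.
        return HTTPStatus(status_code).phrase
    except ValueError:
        return "Unknown Error"
```

The `Optional[str]` return annotation also reflects that the function returns `None` for successful responses.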
@ -1,38 +1,95 @@
|
|||||||
|
"""
|
||||||
|
This module provides a FastAPI application that uses Playwright to fetch and return
|
||||||
|
the HTML content of a specified URL. It supports optional proxy settings and media blocking.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from os import environ
|
||||||
|
|
||||||
from fastapi import FastAPI
|
from fastapi import FastAPI
|
||||||
from playwright.async_api import async_playwright, Browser
|
|
||||||
from fastapi.responses import JSONResponse
|
from fastapi.responses import JSONResponse
|
||||||
|
from playwright.async_api import Browser, async_playwright
|
||||||
from pydantic import BaseModel
|
from pydantic import BaseModel
|
||||||
|
from get_error import get_error
|
||||||
|
|
||||||
|
PROXY_SERVER = environ.get("PROXY_SERVER", None)
|
||||||
|
PROXY_USERNAME = environ.get("PROXY_USERNAME", None)
|
||||||
|
PROXY_PASSWORD = environ.get("PROXY_PASSWORD", None)
|
||||||
|
BLOCK_MEDIA = environ.get("BLOCK_MEDIA", "False").upper() == "TRUE"
|
||||||
|
|
||||||
app = FastAPI()
|
app = FastAPI()
|
||||||
|
|
||||||
class UrlModel(BaseModel):
|
class UrlModel(BaseModel):
|
||||||
|
"""Model representing the URL and associated parameters for the request."""
|
||||||
url: str
|
url: str
|
||||||
wait: int = None
|
wait_after_load: int = 0
|
||||||
|
timeout: int = 15000
|
||||||
|
headers: dict = None
|
||||||
|
|
||||||
browser: Browser = None
|
browser: Browser = None
|
||||||
|
|
||||||
|
|
||||||
@app.on_event("startup")
|
@app.on_event("startup")
|
||||||
async def startup_event():
|
async def startup_event():
|
||||||
|
"""Event handler for application startup to initialize the browser."""
|
||||||
global browser
|
global browser
|
||||||
playwright = await async_playwright().start()
|
playwright = await async_playwright().start()
|
||||||
browser = await playwright.chromium.launch()
|
browser = await playwright.chromium.launch()
|
||||||
|
|
||||||
|
|
||||||
@app.on_event("shutdown")
|
@app.on_event("shutdown")
|
||||||
async def shutdown_event():
|
async def shutdown_event():
|
||||||
|
"""Event handler for application shutdown to close the browser."""
|
||||||
await browser.close()
|
await browser.close()
|
||||||
|
|
||||||
|
|
||||||
@app.post("/html")
|
@app.post("/html")
|
||||||
async def root(body: UrlModel):
|
async def root(body: UrlModel):
|
||||||
|
"""
|
||||||
|
Endpoint to fetch and return HTML content of a given URL.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
body (UrlModel): The URL model containing the target URL, wait time, and timeout.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JSONResponse: The HTML content of the page.
|
||||||
|
"""
|
||||||
|
context = None
|
||||||
|
if PROXY_SERVER and PROXY_USERNAME and PROXY_PASSWORD:
|
||||||
|
context = await browser.new_context(
|
||||||
|
proxy={
|
||||||
|
"server": PROXY_SERVER,
|
||||||
|
"username": PROXY_USERNAME,
|
||||||
|
"password": PROXY_PASSWORD,
|
||||||
|
}
|
||||||
|
)
|
||||||
|
else:
|
||||||
context = await browser.new_context()
|
context = await browser.new_context()
|
||||||
|
|
||||||
|
if BLOCK_MEDIA:
|
||||||
|
await context.route(
|
||||||
|
"**/*.{png,jpg,jpeg,gif,svg,mp3,mp4,avi,flac,ogg,wav,webm}",
|
||||||
|
handler=lambda route, request: route.abort(),
|
||||||
|
)
|
||||||
|
|
||||||
page = await context.new_page()
|
page = await context.new_page()
|
||||||
await page.goto(body.url, timeout=15000) # Set max timeout to 15s
|
|
||||||
if body.wait: # Check if wait parameter is provided in the request body
|
# Set headers if provided
|
||||||
await page.wait_for_timeout(body.wait) # Convert seconds to milliseconds for playwright
|
if body.headers:
|
||||||
|
await page.set_extra_http_headers(body.headers)
|
||||||
|
|
||||||
|
response = await page.goto(
|
||||||
|
body.url,
|
||||||
|
wait_until="load",
|
||||||
|
timeout=body.timeout,
|
||||||
|
)
|
||||||
|
page_status_code = response.status
|
||||||
|
page_error = get_error(page_status_code)
|
||||||
|
# Wait != timeout. Wait is the time to wait after the page is loaded - useful in some cases where "load" / "networkidle" is not enough
|
||||||
|
if body.wait_after_load > 0:
|
||||||
|
await page.wait_for_timeout(body.wait_after_load)
|
||||||
|
|
||||||
page_content = await page.content()
|
page_content = await page.content()
|
||||||
await context.close()
|
await context.close()
|
||||||
json_compatible_item_data = {"content": page_content}
|
json_compatible_item_data = {
|
||||||
|
"content": page_content,
|
||||||
|
"pageStatusCode": page_status_code,
|
||||||
|
"pageError": page_error
|
||||||
|
}
|
||||||
return JSONResponse(content=json_compatible_item_data)
|
return JSONResponse(content=json_compatible_item_data)
|
2
apps/python-sdk/.pylintrc
Normal file
@ -0,0 +1,2 @@
|
|||||||
|
[FORMAT]
|
||||||
|
max-line-length = 120
|
@ -117,6 +117,25 @@ status = app.check_crawl_status(job_id)
|
|||||||
|
|
||||||
The SDK handles errors returned by the Firecrawl API and raises appropriate exceptions. If an error occurs during a request, an exception will be raised with a descriptive error message.
|
The SDK handles errors returned by the Firecrawl API and raises appropriate exceptions. If an error occurs during a request, an exception will be raised with a descriptive error message.
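A minimal sketch of handling such an exception on the caller's side is shown below. The helper `call_with_error_capture` and the stand-in `failing_scrape` are hypothetical names used only for illustration; in real code the operation would be a call such as `app.scrape_url(...)`. The message format follows the one the SDK raises for a failed scrape.

```python
from typing import Any, Callable, Optional, Tuple


def call_with_error_capture(operation: Callable[[], Any]) -> Tuple[Optional[Any], Optional[str]]:
    """Run an SDK call and return (result, None) or (None, error_message)."""
    try:
        return operation(), None
    except Exception as exc:  # the SDK raises plain Exception instances
        return None, str(exc)


def failing_scrape() -> Any:
    # Hypothetical stand-in for app.scrape_url(...) hitting an auth error.
    raise Exception("Failed to scrape URL. Status code: 401")


result, error = call_with_error_capture(failing_scrape)
if error is not None:
    print(f"Request failed: {error}")
```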
|
||||||
|
|
||||||
|
## Running the Tests with Pytest
|
||||||
|
|
||||||
|
The Firecrawl Python SDK includes end-to-end tests written with `pytest`. They cover URL scraping, web searching, and website crawling.
|
||||||
|
|
||||||
|
### Running the Tests
|
||||||
|
|
||||||
|
To run the tests, execute the following commands:
|
||||||
|
|
||||||
|
Install pytest:
|
||||||
|
```bash
|
||||||
|
pip install pytest
|
||||||
|
```
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
pytest firecrawl/__tests__/e2e_withAuth/test.py
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
## Contributing
|
## Contributing
|
||||||
|
|
||||||
Contributions to the Firecrawl Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
|
Contributions to the Firecrawl Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
|
||||||
|
@ -1,18 +1,50 @@
|
|||||||
|
"""
|
||||||
|
FirecrawlApp Module
|
||||||
|
|
||||||
|
This module provides a class `FirecrawlApp` for interacting with the Firecrawl API.
|
||||||
|
It includes methods to scrape URLs, perform searches, initiate and monitor crawl jobs,
|
||||||
|
and check the status of these jobs. The module uses requests for HTTP communication
|
||||||
|
and handles retries for certain HTTP status codes.
|
||||||
|
|
||||||
|
Classes:
|
||||||
|
- FirecrawlApp: Main class for interacting with the Firecrawl API.
|
||||||
|
"""
|
||||||
|
|
||||||
import os
|
import os
|
||||||
from typing import Any, Dict, Optional
|
|
||||||
import requests
|
|
||||||
import time
|
import time
|
||||||
|
from typing import Any, Dict, Optional
|
||||||
|
|
||||||
|
import requests
|
||||||
|
|
||||||
|
|
||||||
class FirecrawlApp:
|
class FirecrawlApp:
|
||||||
def __init__(self, api_key=None, api_url='https://api.firecrawl.dev'):
|
"""
|
||||||
|
Initialize the FirecrawlApp instance.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
api_key (Optional[str]): API key for authenticating with the Firecrawl API.
|
||||||
|
api_url (Optional[str]): Base URL for the Firecrawl API.
|
||||||
|
"""
|
||||||
|
def __init__(self, api_key: Optional[str] = None, api_url: Optional[str] = None) -> None:
|
||||||
self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
|
self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
|
||||||
if self.api_key is None:
|
if self.api_key is None:
|
||||||
raise ValueError('No API key provided')
|
raise ValueError('No API key provided')
|
||||||
self.api_url = api_url or os.getenv('FIRECRAWL_API_URL')
|
self.api_url = api_url or os.getenv('FIRECRAWL_API_URL', 'https://api.firecrawl.dev')
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def scrape_url(self, url: str, params: Optional[Dict[str, Any]] = None) -> Any:
|
def scrape_url(self, url: str, params: Optional[Dict[str, Any]] = None) -> Any:
|
||||||
|
"""
|
||||||
|
Scrape the specified URL using the Firecrawl API.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
url (str): The URL to scrape.
|
||||||
|
params (Optional[Dict[str, Any]]): Additional parameters for the scrape request.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Any: The scraped data if the request is successful.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: If the scrape request fails.
|
||||||
|
"""
|
||||||
|
|
||||||
headers = {
|
headers = {
|
||||||
'Content-Type': 'application/json',
|
'Content-Type': 'application/json',
|
||||||
'Authorization': f'Bearer {self.api_key}'
|
'Authorization': f'Bearer {self.api_key}'
|
||||||
@ -41,11 +73,11 @@ class FirecrawlApp:
|
|||||||
response = requests.post(
|
response = requests.post(
|
||||||
f'{self.api_url}/v0/scrape',
|
f'{self.api_url}/v0/scrape',
|
||||||
headers=headers,
|
headers=headers,
|
||||||
json=scrape_params
|
json=scrape_params,
|
||||||
)
|
)
|
||||||
if response.status_code == 200:
|
if response.status_code == 200:
|
||||||
response = response.json()
|
response = response.json()
|
||||||
if response['success']:
|
if response['success'] and 'data' in response:
|
||||||
return response['data']
|
return response['data']
|
||||||
else:
|
else:
|
||||||
raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
|
raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
|
||||||
@ -56,6 +88,19 @@ class FirecrawlApp:
|
|||||||
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
|
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
|
||||||
|
|
||||||
def search(self, query, params=None):
|
def search(self, query, params=None):
|
||||||
|
"""
|
||||||
|
Perform a search using the Firecrawl API.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
query (str): The search query.
|
||||||
|
params (Optional[Dict[str, Any]]): Additional parameters for the search request.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Any: The search results if the request is successful.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: If the search request fails.
|
||||||
|
"""
|
||||||
headers = {
|
headers = {
|
||||||
'Content-Type': 'application/json',
|
'Content-Type': 'application/json',
|
||||||
'Authorization': f'Bearer {self.api_key}'
|
'Authorization': f'Bearer {self.api_key}'
|
||||||
@ -70,7 +115,8 @@ class FirecrawlApp:
|
|||||||
)
|
)
|
||||||
if response.status_code == 200:
|
if response.status_code == 200:
|
||||||
response = response.json()
|
response = response.json()
|
||||||
if response['success'] == True:
|
|
||||||
|
if response['success'] and 'data' in response:
|
||||||
return response['data']
|
return response['data']
|
||||||
else:
|
else:
|
||||||
raise Exception(f'Failed to search. Error: {response["error"]}')
|
raise Exception(f'Failed to search. Error: {response["error"]}')
|
||||||
@ -81,8 +127,24 @@ class FirecrawlApp:
|
|||||||
else:
|
else:
|
||||||
raise Exception(f'Failed to search. Status code: {response.status_code}')
|
raise Exception(f'Failed to search. Status code: {response.status_code}')
|
||||||
|
|
||||||
def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
|
def crawl_url(self, url, params=None, wait_until_done=True, timeout=2, idempotency_key=None):
|
||||||
headers = self._prepare_headers()
|
"""
|
||||||
|
Initiate a crawl job for the specified URL using the Firecrawl API.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
url (str): The URL to crawl.
|
||||||
|
params (Optional[Dict[str, Any]]): Additional parameters for the crawl request.
|
||||||
|
wait_until_done (bool): Whether to wait until the crawl job is completed.
|
||||||
|
timeout (int): Seconds to wait between status checks while waiting for job completion (values below 2 are raised to 2).
|
||||||
|
idempotency_key (Optional[str]): A unique UUID used to ensure idempotency of requests.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Any: The crawl job ID or the crawl results if waiting until completion.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: If the crawl job initiation or monitoring fails.
|
||||||
|
"""
|
||||||
|
headers = self._prepare_headers(idempotency_key)
|
||||||
json_data = {'url': url}
|
json_data = {'url': url}
|
||||||
if params:
|
if params:
|
||||||
json_data.update(params)
|
json_data.update(params)
|
||||||
@ -97,6 +159,18 @@ class FirecrawlApp:
|
|||||||
self._handle_error(response, 'start crawl job')
|
self._handle_error(response, 'start crawl job')
|
||||||
|
|
||||||
def check_crawl_status(self, job_id):
|
def check_crawl_status(self, job_id):
|
||||||
|
"""
|
||||||
|
Check the status of a crawl job using the Firecrawl API.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
job_id (str): The ID of the crawl job.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Any: The status of the crawl job.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: If the status check request fails.
|
||||||
|
"""
|
||||||
headers = self._prepare_headers()
|
headers = self._prepare_headers()
|
||||||
response = self._get_request(f'{self.api_url}/v0/crawl/status/{job_id}', headers)
|
response = self._get_request(f'{self.api_url}/v0/crawl/status/{job_id}', headers)
|
||||||
if response.status_code == 200:
|
if response.status_code == 200:
|
||||||
@ -104,13 +178,45 @@ class FirecrawlApp:
|
|||||||
else:
|
else:
|
||||||
self._handle_error(response, 'check crawl status')
|
self._handle_error(response, 'check crawl status')
|
||||||
|
|
||||||
def _prepare_headers(self):
|
def _prepare_headers(self, idempotency_key=None):
|
||||||
|
"""
|
||||||
|
Prepare the headers for API requests.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
idempotency_key (Optional[str]): A unique key to ensure idempotency of requests.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict[str, str]: The headers including content type, authorization, and optionally idempotency key.
|
||||||
|
"""
|
||||||
|
if idempotency_key:
|
||||||
return {
|
return {
|
||||||
'Content-Type': 'application/json',
|
'Content-Type': 'application/json',
|
||||||
'Authorization': f'Bearer {self.api_key}'
|
'Authorization': f'Bearer {self.api_key}',
|
||||||
|
'x-idempotency-key': idempotency_key
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
'Content-Type': 'application/json',
|
||||||
|
'Authorization': f'Bearer {self.api_key}',
|
||||||
}
|
}
|
||||||
|
|
||||||
def _post_request(self, url, data, headers, retries=3, backoff_factor=0.5):
|
def _post_request(self, url, data, headers, retries=3, backoff_factor=0.5):
|
||||||
|
"""
|
||||||
|
Make a POST request with retries.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
url (str): The URL to send the POST request to.
|
||||||
|
data (Dict[str, Any]): The JSON data to include in the POST request.
|
||||||
|
headers (Dict[str, str]): The headers to include in the POST request.
|
||||||
|
retries (int): Number of retries for the request.
|
||||||
|
backoff_factor (float): Backoff factor for retries.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
requests.Response: The response from the POST request.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
requests.RequestException: If the request fails after the specified retries.
|
||||||
|
"""
|
||||||
for attempt in range(retries):
|
for attempt in range(retries):
|
||||||
response = requests.post(url, headers=headers, json=data)
|
response = requests.post(url, headers=headers, json=data)
|
||||||
if response.status_code == 502:
|
if response.status_code == 502:
|
||||||
@ -120,6 +226,21 @@ class FirecrawlApp:
|
|||||||
return response
|
return response
|
||||||
|
|
||||||
def _get_request(self, url, headers, retries=3, backoff_factor=0.5):
|
def _get_request(self, url, headers, retries=3, backoff_factor=0.5):
|
||||||
|
"""
|
||||||
|
Make a GET request with retries.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
url (str): The URL to send the GET request to.
|
||||||
|
headers (Dict[str, str]): The headers to include in the GET request.
|
||||||
|
retries (int): Number of retries for the request.
|
||||||
|
backoff_factor (float): Backoff factor for retries.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
requests.Response: The response from the GET request.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
requests.RequestException: If the request fails after the specified retries.
|
||||||
|
"""
|
||||||
for attempt in range(retries):
|
for attempt in range(retries):
|
||||||
response = requests.get(url, headers=headers)
|
response = requests.get(url, headers=headers)
|
||||||
if response.status_code == 502:
|
if response.status_code == 502:
|
||||||
@ -129,7 +250,20 @@ class FirecrawlApp:
|
|||||||
return response
|
return response
|
||||||
|
|
||||||
def _monitor_job_status(self, job_id, headers, timeout):
|
def _monitor_job_status(self, job_id, headers, timeout):
|
||||||
import time
|
"""
|
||||||
|
Monitor the status of a crawl job until completion.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
job_id (str): The ID of the crawl job.
|
||||||
|
headers (Dict[str, str]): The headers to include in the status check requests.
|
||||||
|
timeout (int): Seconds to wait between status checks (minimum 2).
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Any: The crawl results if the job is completed successfully.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: If the job fails or an error occurs during status checks.
|
||||||
|
"""
|
||||||
while True:
|
while True:
|
||||||
status_response = self._get_request(f'{self.api_url}/v0/crawl/status/{job_id}', headers)
|
status_response = self._get_request(f'{self.api_url}/v0/crawl/status/{job_id}', headers)
|
||||||
if status_response.status_code == 200:
|
if status_response.status_code == 200:
|
||||||
@ -139,9 +273,8 @@ class FirecrawlApp:
|
|||||||
return status_data['data']
|
return status_data['data']
|
||||||
else:
|
else:
|
||||||
raise Exception('Crawl job completed but no data was returned')
|
raise Exception('Crawl job completed but no data was returned')
|
||||||
elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
|
elif status_data['status'] in ['active', 'paused', 'pending', 'queued', 'waiting']:
|
||||||
if timeout < 2:
|
timeout = max(timeout, 2)
|
||||||
timeout = 2
|
|
||||||
time.sleep(timeout) # Wait for the specified timeout before checking again
|
time.sleep(timeout) # Wait for the specified timeout before checking again
|
||||||
else:
|
else:
|
||||||
raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
|
raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
|
||||||
@ -149,6 +282,16 @@ class FirecrawlApp:
|
|||||||
self._handle_error(status_response, 'check crawl status')
|
self._handle_error(status_response, 'check crawl status')
|
||||||
|
|
||||||
def _handle_error(self, response, action):
|
def _handle_error(self, response, action):
|
||||||
|
"""
|
||||||
|
Handle errors from API responses.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
response (requests.Response): The response object from the API request.
|
||||||
|
action (str): Description of the action that was being performed.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: An exception with a message containing the status code and error details from the response.
|
||||||
|
"""
|
||||||
if response.status_code in [402, 408, 409, 500]:
|
if response.status_code in [402, 408, 409, 500]:
|
||||||
error_message = response.json().get('error', 'Unknown error occurred')
|
error_message = response.json().get('error', 'Unknown error occurred')
|
||||||
raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
|
raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
|
||||||
|
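Note on the retry helpers above: the diff shows only the start of each loop (`for attempt in range(retries)` and the `502` check), so the sleep schedule is not visible here. The following is a minimal sketch of one common reading, assuming exponential backoff of `backoff_factor * 2**attempt` seconds. The name `post_with_retries` and the injected `send` and `sleep` callables are hypothetical and exist only to make the sketch runnable without a network.

```python
import time
from typing import Callable


def post_with_retries(send: Callable[[], int],
                      retries: int = 3,
                      backoff_factor: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns a status other than 502 or attempts run out.

    send is a hypothetical stand-in for requests.post(...).status_code.
    """
    status = 0
    for attempt in range(retries):
        status = send()
        if status != 502:
            return status
        # Assumed schedule (not confirmed by the diff): 0.5s, 1.0s, 2.0s, ...
        # No sleep after the final attempt, since nothing follows it.
        if attempt < retries - 1:
            sleep(backoff_factor * (2 ** attempt))
    return status
```

Like the SDK helpers, the sketch returns the last status when every attempt yields 502 and leaves error handling to the caller.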
BIN
apps/python-sdk/dist/firecrawl-py-0.0.12.tar.gz
vendored
Normal file
Binary file not shown.
BIN
apps/python-sdk/dist/firecrawl-py-0.0.9.tar.gz
vendored
Binary file not shown.
BIN
apps/python-sdk/dist/firecrawl_py-0.0.12-py3-none-any.whl
vendored
Normal file
Binary file not shown.
Binary file not shown.
@ -1,4 +1,5 @@
|
|||||||
from firecrawl import FirecrawlApp
|
import uuid
|
||||||
|
from firecrawl.firecrawl import FirecrawlApp
|
||||||
|
|
||||||
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
|
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
|
||||||
|
|
||||||
@ -7,7 +8,8 @@ scrape_result = app.scrape_url('firecrawl.dev')
|
|||||||
print(scrape_result['markdown'])
|
print(scrape_result['markdown'])
|
||||||
|
|
||||||
# Crawl a website:
|
# Crawl a website:
|
||||||
crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})
|
idempotency_key = str(uuid.uuid4()) # optional idempotency key
|
||||||
|
crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}}, True, 2, idempotency_key)
|
||||||
print(crawl_result)
|
print(crawl_result)
|
||||||
|
|
||||||
# LLM Extraction:
|
# LLM Extraction:
|
||||||
|
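Note on the idempotency key in the example above: the key is passed through to `_prepare_headers`, which adds an `x-idempotency-key` header only when a key is given. Here is a small standalone sketch of that header logic, assuming a free function `prepare_headers` (a hypothetical name; the SDK uses a method on `FirecrawlApp`).

```python
import uuid
from typing import Dict, Optional


def prepare_headers(api_key: str, idempotency_key: Optional[str] = None) -> Dict[str, str]:
    """Build request headers, adding the idempotency header only when a key is given."""
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {api_key}',
    }
    if idempotency_key:
        headers['x-idempotency-key'] = idempotency_key
    return headers


# Generate one key per logical crawl request.
key = str(uuid.uuid4())
first = prepare_headers('fc-YOUR_API_KEY', key)
plain = prepare_headers('fc-YOUR_API_KEY')
```

According to the e2e test added in this commit, a second crawl request that reuses the same key is rejected with a conflict error, so a fresh key is needed for each new job.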
@ -1 +1,57 @@
|
|||||||
|
"""
|
||||||
|
This is the Firecrawl package.
|
||||||
|
|
||||||
|
This package provides a Python SDK for interacting with the Firecrawl API.
|
||||||
|
It includes methods to scrape URLs, perform searches, initiate and monitor crawl jobs,
|
||||||
|
and check the status of these jobs.
|
||||||
|
|
||||||
|
For more information visit https://github.com/firecrawl/
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
|
||||||
from .firecrawl import FirecrawlApp
|
from .firecrawl import FirecrawlApp
|
||||||
|
|
||||||
|
__version__ = "0.0.16"
|
||||||
|
|
||||||
|
# Define the logger for the Firecrawl project
|
||||||
|
logger: logging.Logger = logging.getLogger("firecrawl")
|
||||||
|
|
||||||
|
|
||||||
|
def _basic_config() -> None:
|
||||||
|
"""Set up basic configuration for logging with a specific format and date format."""
|
||||||
|
try:
|
||||||
|
logging.basicConfig(
|
||||||
|
format="[%(asctime)s - %(name)s:%(lineno)d - %(levelname)s] %(message)s",
|
||||||
|
datefmt="%Y-%m-%d %H:%M:%S",
|
||||||
|
)
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("Failed to configure logging: %s", e)
|
||||||
|
|
||||||
|
|
||||||
|
def setup_logging() -> None:
|
||||||
|
"""Set up logging based on the FIRECRAWL_LOGGING_LEVEL environment variable."""
|
||||||
|
env = os.environ.get(
|
||||||
|
"FIRECRAWL_LOGGING_LEVEL", "INFO"
|
||||||
|
).upper() # Default to 'INFO' level
|
||||||
|
_basic_config()
|
||||||
|
|
||||||
|
if env == "DEBUG":
|
||||||
|
logger.setLevel(logging.DEBUG)
|
||||||
|
elif env == "INFO":
|
||||||
|
logger.setLevel(logging.INFO)
|
||||||
|
elif env == "WARNING":
|
||||||
|
logger.setLevel(logging.WARNING)
|
||||||
|
elif env == "ERROR":
|
||||||
|
logger.setLevel(logging.ERROR)
|
||||||
|
elif env == "CRITICAL":
|
||||||
|
logger.setLevel(logging.CRITICAL)
|
||||||
|
else:
|
||||||
|
logger.setLevel(logging.INFO)
|
||||||
|
logger.warning("Unknown logging level: %s, defaulting to INFO", env)
|
||||||
|
|
||||||
|
|
||||||
|
# Initialize logging configuration when the module is imported
|
||||||
|
setup_logging()
|
||||||
|
logger.debug("Debugging logger setup")
|
||||||
|
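Note on `setup_logging` above: the `if`/`elif` chain maps each level name to its `logging` constant and falls back to `INFO` for unknown values. A table lookup expresses the same mapping more compactly. This is only a sketch of an alternative, not what the commit does; `resolve_level` and `_LEVELS` are hypothetical names.

```python
import logging

# Level names accepted by FIRECRAWL_LOGGING_LEVEL, per the diff above.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_level(name: str, default: int = logging.INFO) -> int:
    """Return the logging constant for name, or default if the name is unknown."""
    return _LEVELS.get(name.strip().upper(), default)
```

A caller could then write `logger.setLevel(resolve_level(env))` and emit the "unknown level" warning only when `env.strip().upper()` is not in `_LEVELS`.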
Binary file not shown.
Binary file not shown.
@ -0,0 +1,3 @@
|
|||||||
|
API_URL=http://localhost:3002
|
||||||
|
ABSOLUTE_FIRECRAWL_PATH=/Users/user/firecrawl/apps/python-sdk/firecrawl/firecrawl.py
|
||||||
|
TEST_API_KEY=fc-YOUR_API_KEY
|
Binary file not shown.
168
apps/python-sdk/firecrawl/__tests__/e2e_withAuth/test.py
Normal file
@ -0,0 +1,168 @@
|
|||||||
|
import importlib.util
|
||||||
|
import pytest
|
||||||
|
import time
|
||||||
|
import os
|
||||||
|
from uuid import uuid4
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
API_URL = "http://127.0.0.1:3002"
|
||||||
|
ABSOLUTE_FIRECRAWL_PATH = "firecrawl/firecrawl.py"
|
||||||
|
TEST_API_KEY = os.getenv('TEST_API_KEY')
|
||||||
|
|
||||||
|
print(f"ABSOLUTE_FIRECRAWL_PATH: {ABSOLUTE_FIRECRAWL_PATH}")
|
||||||
|
|
||||||
|
spec = importlib.util.spec_from_file_location("FirecrawlApp", ABSOLUTE_FIRECRAWL_PATH)
|
||||||
|
firecrawl = importlib.util.module_from_spec(spec)
|
||||||
|
spec.loader.exec_module(firecrawl)
|
||||||
|
FirecrawlApp = firecrawl.FirecrawlApp
|
||||||
|
|
||||||
|
def test_no_api_key():
|
||||||
|
with pytest.raises(Exception) as excinfo:
|
||||||
|
invalid_app = FirecrawlApp(api_url=API_URL)
|
||||||
|
assert "No API key provided" in str(excinfo.value)
|
||||||
|
|
||||||
|
def test_scrape_url_invalid_api_key():
|
||||||
|
invalid_app = FirecrawlApp(api_url=API_URL, api_key="invalid_api_key")
|
||||||
|
with pytest.raises(Exception) as excinfo:
|
||||||
|
invalid_app.scrape_url('https://firecrawl.dev')
|
||||||
|
assert "Unexpected error during scrape URL: Status code 401. Unauthorized: Invalid token" in str(excinfo.value)
|
||||||
|
|
||||||
|
def test_blocklisted_url():
|
||||||
|
blocklisted_url = "https://facebook.com/fake-test"
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
with pytest.raises(Exception) as excinfo:
|
||||||
|
app.scrape_url(blocklisted_url)
|
||||||
|
assert "Unexpected error during scrape URL: Status code 403. Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it." in str(excinfo.value)
|
||||||
|
|
||||||
|
def test_successful_response_with_valid_preview_token():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key="this_is_just_a_preview_token")
|
||||||
|
response = app.scrape_url('https://roastmywebsite.ai')
|
||||||
|
assert response is not None
|
||||||
|
assert 'content' in response
|
||||||
|
assert "_Roast_" in response['content']
|
||||||
|
|
||||||
|
def test_scrape_url_e2e():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
response = app.scrape_url('https://roastmywebsite.ai')
|
||||||
|
assert response is not None
|
||||||
|
assert 'content' in response
|
||||||
|
assert 'markdown' in response
|
||||||
|
assert 'metadata' in response
|
||||||
|
assert 'html' not in response
|
||||||
|
assert "_Roast_" in response['content']
|
||||||
|
|
||||||
|
def test_successful_response_with_valid_api_key_and_include_html():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
response = app.scrape_url('https://roastmywebsite.ai', {'pageOptions': {'includeHtml': True}})
|
||||||
|
assert response is not None
|
||||||
|
assert 'content' in response
|
||||||
|
assert 'markdown' in response
|
||||||
|
assert 'html' in response
|
||||||
|
assert 'metadata' in response
|
||||||
|
assert "_Roast_" in response['content']
|
||||||
|
assert "_Roast_" in response['markdown']
|
||||||
|
assert "<h1" in response['html']
|
||||||
|
|
||||||
|
def test_successful_response_for_valid_scrape_with_pdf_file():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
response = app.scrape_url('https://arxiv.org/pdf/astro-ph/9301001.pdf')
|
||||||
|
assert response is not None
|
||||||
|
assert 'content' in response
|
||||||
|
assert 'metadata' in response
|
||||||
|
assert 'We present spectrophotometric observations of the Broad Line Radio Galaxy' in response['content']
|
||||||
|
|
||||||
|
def test_successful_response_for_valid_scrape_with_pdf_file_without_explicit_extension():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
response = app.scrape_url('https://arxiv.org/pdf/astro-ph/9301001')
|
||||||
|
time.sleep(6) # wait for 6 seconds
|
||||||
|
assert response is not None
|
||||||
|
assert 'content' in response
|
||||||
|
assert 'metadata' in response
|
||||||
|
assert 'We present spectrophotometric observations of the Broad Line Radio Galaxy' in response['content']
|
||||||
|
|
||||||
|
def test_crawl_url_invalid_api_key():
|
||||||
|
invalid_app = FirecrawlApp(api_url=API_URL, api_key="invalid_api_key")
|
||||||
|
with pytest.raises(Exception) as excinfo:
|
||||||
|
invalid_app.crawl_url('https://firecrawl.dev')
|
||||||
|
assert "Unexpected error during start crawl job: Status code 401. Unauthorized: Invalid token" in str(excinfo.value)
|
||||||
|
|
||||||
|
def test_should_return_error_for_blocklisted_url():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
blocklisted_url = "https://twitter.com/fake-test"
|
||||||
|
with pytest.raises(Exception) as excinfo:
|
||||||
|
app.crawl_url(blocklisted_url)
|
||||||
|
assert "Unexpected error during start crawl job: Status code 403. Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it." in str(excinfo.value)
|
||||||
|
|
||||||
|
def test_crawl_url_wait_for_completion_e2e():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
response = app.crawl_url('https://roastmywebsite.ai', {'crawlerOptions': {'excludes': ['blog/*']}}, True)
|
||||||
|
assert response is not None
|
||||||
|
assert len(response) > 0
|
||||||
|
assert 'content' in response[0]
|
||||||
|
assert "_Roast_" in response[0]['content']
|
||||||
|
|
||||||
|
def test_crawl_url_with_idempotency_key_e2e():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
uniqueIdempotencyKey = str(uuid4())
|
||||||
|
response = app.crawl_url('https://roastmywebsite.ai', {'crawlerOptions': {'excludes': ['blog/*']}}, True, 2, uniqueIdempotencyKey)
|
||||||
|
assert response is not None
|
||||||
|
assert len(response) > 0
|
||||||
|
assert 'content' in response[0]
|
||||||
|
assert "_Roast_" in response[0]['content']
|
||||||
|
|
||||||
|
with pytest.raises(Exception) as excinfo:
|
||||||
|
app.crawl_url('https://firecrawl.dev', {'crawlerOptions': {'excludes': ['blog/*']}}, True, 2, uniqueIdempotencyKey)
|
||||||
|
assert "Conflict: Failed to start crawl job due to a conflict. Idempotency key already used" in str(excinfo.value)
|
||||||
|
|
||||||
|
def test_check_crawl_status_e2e():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
response = app.crawl_url('https://firecrawl.dev', {'crawlerOptions': {'excludes': ['blog/*']}}, False)
|
||||||
|
assert response is not None
|
||||||
|
assert 'jobId' in response
|
||||||
|
|
||||||
|
time.sleep(30) # wait for 30 seconds
|
||||||
|
status_response = app.check_crawl_status(response['jobId'])
|
||||||
|
assert status_response is not None
|
||||||
|
assert 'status' in status_response
|
||||||
|
assert status_response['status'] == 'completed'
|
||||||
|
assert 'data' in status_response
|
||||||
|
assert len(status_response['data']) > 0
|
||||||
|
|
||||||
|
def test_search_e2e():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
response = app.search("test query")
|
||||||
|
assert response is not None
|
||||||
|
assert 'content' in response[0]
|
||||||
|
assert len(response) > 2
|
||||||
|
|
||||||
|
def test_search_invalid_api_key():
|
||||||
|
invalid_app = FirecrawlApp(api_url=API_URL, api_key="invalid_api_key")
|
||||||
|
with pytest.raises(Exception) as excinfo:
|
||||||
|
invalid_app.search("test query")
|
||||||
|
assert "Unexpected error during search: Status code 401. Unauthorized: Invalid token" in str(excinfo.value)
|
||||||
|
|
||||||
|
def test_llm_extraction():
|
||||||
|
app = FirecrawlApp(api_url=API_URL, api_key=TEST_API_KEY)
|
||||||
|
response = app.scrape_url("https://mendable.ai", {
|
||||||
|
'extractorOptions': {
|
||||||
|
'mode': 'llm-extraction',
|
||||||
|
'extractionPrompt': "Based on the information on the page, find what the company's mission is and whether it supports SSO, and whether it is open source",
|
||||||
|
'extractionSchema': {
|
||||||
|
'type': 'object',
|
||||||
|
'properties': {
|
||||||
|
'company_mission': {'type': 'string'},
|
||||||
|
'supports_sso': {'type': 'boolean'},
|
||||||
|
'is_open_source': {'type': 'boolean'}
|
||||||
|
},
|
||||||
|
'required': ['company_mission', 'supports_sso', 'is_open_source']
|
||||||
|
}
|
||||||
|
}
|
||||||
|
})
|
||||||
|
assert response is not None
|
||||||
|
assert 'llm_extraction' in response
|
||||||
|
llm_extraction = response['llm_extraction']
|
||||||
|
assert 'company_mission' in llm_extraction
|
||||||
|
assert isinstance(llm_extraction['supports_sso'], bool)
|
||||||
|
assert isinstance(llm_extraction['is_open_source'], bool)
|
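Note on the crawl tests above: with `wait_until_done=True`, `crawl_url` blocks in `_monitor_job_status`, which polls the status endpoint until the job completes or reaches an unexpected state. The loop can be sketched in isolation as follows, assuming an injected `fetch_status` callable in place of the HTTP request and an injected `sleep` (both hypothetical, for testability). The status names and the two-second minimum are taken from the diff.

```python
import time
from typing import Any, Callable, Dict

# Statuses that mean "keep polling", as listed in _monitor_job_status.
IN_PROGRESS = {'active', 'paused', 'pending', 'queued', 'waiting'}


def wait_for_crawl(fetch_status: Callable[[], Dict[str, Any]],
                   poll_interval: int = 2,
                   sleep: Callable[[float], None] = time.sleep) -> Any:
    """Poll fetch_status() until the job completes, then return its data."""
    poll_interval = max(poll_interval, 2)  # enforce the 2-second floor
    while True:
        status_data = fetch_status()
        status = status_data['status']
        if status == 'completed':
            if 'data' in status_data:
                return status_data['data']
            raise Exception('Crawl job completed but no data was returned')
        if status in IN_PROGRESS:
            sleep(poll_interval)
            continue
        raise Exception(f'Crawl job failed or was stopped. Status: {status}')
```

The sketch has no overall deadline, so a job that never leaves an in-progress state would poll forever. A production caller may want to add a maximum wait.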
@ -1,22 +1,60 @@
|
|||||||
|
"""
|
||||||
|
FirecrawlApp Module
|
||||||
|
|
||||||
|
This module provides a class `FirecrawlApp` for interacting with the Firecrawl API.
|
||||||
|
It includes methods to scrape URLs, perform searches, initiate and monitor crawl jobs,
|
||||||
|
and check the status of these jobs. The module uses requests for HTTP communication
|
||||||
|
and handles retries for certain HTTP status codes.
|
||||||
|
|
||||||
|
Classes:
|
||||||
|
- FirecrawlApp: Main class for interacting with the Firecrawl API.
|
||||||
|
"""
|
||||||
|
import logging
|
||||||
import os
|
import os
|
||||||
from typing import Any, Dict, Optional
|
|
||||||
import requests
|
|
||||||
import time
|
import time
|
||||||
|
from typing import Any, Dict, Optional
|
||||||
|
|
||||||
|
import requests
|
||||||
|
|
||||||
|
logger: logging.Logger = logging.getLogger("firecrawl")
|
||||||
|
|
||||||
class FirecrawlApp:
|
class FirecrawlApp:
|
||||||
def __init__(self, api_key=None, api_url='https://api.firecrawl.dev'):
|
"""
|
||||||
|
Initialize the FirecrawlApp instance.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
api_key (Optional[str]): API key for authenticating with the Firecrawl API.
|
||||||
|
api_url (Optional[str]): Base URL for the Firecrawl API.
|
||||||
|
"""
|
||||||
|
def __init__(self, api_key: Optional[str] = None, api_url: Optional[str] = None) -> None:
|
||||||
self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
|
self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
|
||||||
if self.api_key is None:
|
if self.api_key is None:
|
||||||
|
logger.warning("No API key provided")
|
||||||
raise ValueError('No API key provided')
|
raise ValueError('No API key provided')
|
||||||
self.api_url = api_url or os.getenv('FIRECRAWL_API_URL')
|
else:
|
||||||
|
logger.debug("Initialized FirecrawlApp with an API key")  # avoid writing the secret itself to logs
|
||||||
|
|
||||||
|
self.api_url = api_url or os.getenv('FIRECRAWL_API_URL', 'https://api.firecrawl.dev')
|
||||||
|
if self.api_url != 'https://api.firecrawl.dev':
|
||||||
|
logger.debug("Initialized FirecrawlApp with API URL: %s", self.api_url)
|
||||||
|
|
||||||
def scrape_url(self, url: str, params: Optional[Dict[str, Any]] = None) -> Any:
|
def scrape_url(self, url: str, params: Optional[Dict[str, Any]] = None) -> Any:
|
||||||
headers = {
|
"""
|
||||||
'Content-Type': 'application/json',
|
Scrape the specified URL using the Firecrawl API.
|
||||||
'Authorization': f'Bearer {self.api_key}'
|
|
||||||
}
|
Args:
|
||||||
|
url (str): The URL to scrape.
|
||||||
|
params (Optional[Dict[str, Any]]): Additional parameters for the scrape request.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Any: The scraped data if the request is successful.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: If the scrape request fails.
|
||||||
|
"""
|
||||||
|
|
||||||
|
headers = self._prepare_headers()
|
||||||
|
|
||||||
# Prepare the base scrape parameters with the URL
|
# Prepare the base scrape parameters with the URL
|
||||||
scrape_params = {'url': url}
|
scrape_params = {'url': url}
|
||||||
|
|
||||||
@ -41,25 +79,32 @@ class FirecrawlApp:
|
|||||||
response = requests.post(
|
response = requests.post(
|
||||||
f'{self.api_url}/v0/scrape',
|
f'{self.api_url}/v0/scrape',
|
||||||
headers=headers,
|
headers=headers,
|
||||||
json=scrape_params
|
json=scrape_params,
|
||||||
)
|
)
|
||||||
if response.status_code == 200:
|
if response.status_code == 200:
|
||||||
response = response.json()
|
response = response.json()
|
||||||
if response['success']:
|
if response['success'] and 'data' in response:
|
||||||
return response['data']
|
return response['data']
|
||||||
else:
|
else:
|
||||||
raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
|
raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
|
||||||
elif response.status_code in [402, 408, 409, 500]:
|
|
||||||
error_message = response.json().get('error', 'Unknown error occurred')
|
|
||||||
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
|
|
||||||
else:
|
else:
|
||||||
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
|
self._handle_error(response, 'scrape URL')
|
||||||
|
|
||||||
def search(self, query, params=None):
|
def search(self, query: str, params: Optional[Dict[str, Any]] = None) -> Any:
|
||||||
headers = {
|
"""
|
||||||
'Content-Type': 'application/json',
|
Perform a search using the Firecrawl API.
|
||||||
'Authorization': f'Bearer {self.api_key}'
|
|
||||||
}
|
Args:
|
||||||
|
query (str): The search query.
|
||||||
|
params (Optional[Dict[str, Any]]): Additional parameters for the search request.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Any: The search results if the request is successful.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: If the search request fails.
|
||||||
|
"""
|
||||||
|
headers = self._prepare_headers()
|
||||||
json_data = {'query': query}
|
json_data = {'query': query}
|
||||||
if params:
|
if params:
|
||||||
json_data.update(params)
|
json_data.update(params)
|
||||||
@ -70,19 +115,37 @@ class FirecrawlApp:
|
|||||||
)
|
)
|
||||||
if response.status_code == 200:
|
if response.status_code == 200:
|
||||||
response = response.json()
|
response = response.json()
|
||||||
if response['success'] == True:
|
|
||||||
|
if response['success'] and 'data' in response:
|
||||||
return response['data']
|
return response['data']
|
||||||
else:
|
else:
|
||||||
raise Exception(f'Failed to search. Error: {response["error"]}')
|
raise Exception(f'Failed to search. Error: {response["error"]}')
|
||||||
|
|
||||||
elif response.status_code in [402, 409, 500]:
|
|
||||||
error_message = response.json().get('error', 'Unknown error occurred')
|
|
||||||
raise Exception(f'Failed to search. Status code: {response.status_code}. Error: {error_message}')
|
|
||||||
else:
|
else:
|
||||||
raise Exception(f'Failed to search. Status code: {response.status_code}')
|
self._handle_error(response, 'search')
|
||||||
|
|
||||||
def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
|
def crawl_url(self, url: str,
|
||||||
headers = self._prepare_headers()
|
params: Optional[Dict[str, Any]] = None,
|
||||||
|
wait_until_done: bool = True,
|
||||||
|
poll_interval: int = 2,
|
||||||
|
idempotency_key: Optional[str] = None) -> Any:
|
||||||
|
"""
|
||||||
|
Initiate a crawl job for the specified URL using the Firecrawl API.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
url (str): The URL to crawl.
|
||||||
|
params (Optional[Dict[str, Any]]): Additional parameters for the crawl request.
|
||||||
|
wait_until_done (bool): Whether to wait until the crawl job is completed.
|
||||||
|
poll_interval (int): Time in seconds between status checks when waiting for job completion.
|
||||||
|
idempotency_key (Optional[str]): A unique UUID used to ensure idempotency of requests.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Any: The crawl job ID or the crawl results if waiting until completion.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: If the crawl job initiation or monitoring fails.
|
||||||
|
"""
|
||||||
|
headers = self._prepare_headers(idempotency_key)
|
||||||
json_data = {'url': url}
|
json_data = {'url': url}
|
||||||
if params:
|
if params:
|
||||||
json_data.update(params)
|
json_data.update(params)
|
||||||
@ -90,13 +153,25 @@ class FirecrawlApp:
|
|||||||
if response.status_code == 200:
|
if response.status_code == 200:
|
||||||
job_id = response.json().get('jobId')
|
job_id = response.json().get('jobId')
|
||||||
if wait_until_done:
|
if wait_until_done:
|
||||||
return self._monitor_job_status(job_id, headers, timeout)
|
return self._monitor_job_status(job_id, headers, poll_interval)
|
||||||
else:
|
else:
|
||||||
return {'jobId': job_id}
|
return {'jobId': job_id}
|
||||||
else:
|
else:
|
||||||
self._handle_error(response, 'start crawl job')
|
self._handle_error(response, 'start crawl job')
|
||||||
|
|
||||||
def check_crawl_status(self, job_id):
|
def check_crawl_status(self, job_id: str) -> Any:
|
||||||
|
"""
|
||||||
|
Check the status of a crawl job using the Firecrawl API.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
job_id (str): The ID of the crawl job.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Any: The status of the crawl job.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: If the status check request fails.
|
||||||
|
"""
|
||||||
headers = self._prepare_headers()
|
headers = self._prepare_headers()
|
||||||
response = self._get_request(f'{self.api_url}/v0/crawl/status/{job_id}', headers)
|
response = self._get_request(f'{self.api_url}/v0/crawl/status/{job_id}', headers)
|
||||||
if response.status_code == 200:
|
if response.status_code == 200:
|
||||||
@ -104,13 +179,49 @@ class FirecrawlApp:
|
|||||||
else:
|
else:
|
||||||
self._handle_error(response, 'check crawl status')
|
self._handle_error(response, 'check crawl status')
|
||||||
|
|
||||||
def _prepare_headers(self):
|
def _prepare_headers(self, idempotency_key: Optional[str] = None) -> Dict[str, str]:
|
||||||
|
"""
|
||||||
|
Prepare the headers for API requests.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
idempotency_key (Optional[str]): A unique key to ensure idempotency of requests.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict[str, str]: The headers including content type, authorization, and optionally idempotency key.
|
||||||
|
"""
|
||||||
|
if idempotency_key:
|
||||||
return {
|
return {
|
||||||
'Content-Type': 'application/json',
|
'Content-Type': 'application/json',
|
||||||
'Authorization': f'Bearer {self.api_key}'
|
'Authorization': f'Bearer {self.api_key}',
|
||||||
|
'x-idempotency-key': idempotency_key
|
||||||
}
|
}
|
||||||
|
|
||||||
def _post_request(self, url, data, headers, retries=3, backoff_factor=0.5):
|
return {
|
||||||
|
'Content-Type': 'application/json',
|
||||||
|
'Authorization': f'Bearer {self.api_key}',
|
||||||
|
}
|
||||||
|
|
||||||
|
def _post_request(self, url: str,
|
||||||
|
data: Dict[str, Any],
|
||||||
|
headers: Dict[str, str],
|
||||||
|
retries: int = 3,
|
||||||
|
backoff_factor: float = 0.5) -> requests.Response:
|
||||||
|
"""
|
||||||
|
Make a POST request with retries.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
url (str): The URL to send the POST request to.
|
||||||
|
data (Dict[str, Any]): The JSON data to include in the POST request.
|
||||||
|
headers (Dict[str, str]): The headers to include in the POST request.
|
||||||
|
retries (int): Number of retries for the request.
|
||||||
|
backoff_factor (float): Backoff factor for retries.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
requests.Response: The response from the POST request.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
requests.RequestException: If the request fails after the specified retries.
|
||||||
|
"""
|
||||||
for attempt in range(retries):
|
for attempt in range(retries):
|
||||||
response = requests.post(url, headers=headers, json=data)
|
response = requests.post(url, headers=headers, json=data)
|
||||||
if response.status_code == 502:
|
if response.status_code == 502:
|
||||||
@ -119,7 +230,25 @@ class FirecrawlApp:
|
|||||||
return response
|
return response
|
||||||
return response
|
return response
|
||||||
|
|
||||||
def _get_request(self, url, headers, retries=3, backoff_factor=0.5):
|
def _get_request(self, url: str,
|
||||||
|
headers: Dict[str, str],
|
||||||
|
retries: int = 3,
|
||||||
|
backoff_factor: float = 0.5) -> requests.Response:
|
||||||
|
"""
|
||||||
|
Make a GET request with retries.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
url (str): The URL to send the GET request to.
|
||||||
|
headers (Dict[str, str]): The headers to include in the GET request.
|
||||||
|
retries (int): Number of retries for the request.
|
||||||
|
backoff_factor (float): Backoff factor for retries.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
requests.Response: The response from the GET request.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
requests.RequestException: If the request fails after the specified retries.
|
||||||
|
"""
|
||||||
for attempt in range(retries):
|
for attempt in range(retries):
|
||||||
response = requests.get(url, headers=headers)
|
response = requests.get(url, headers=headers)
|
||||||
if response.status_code == 502:
|
if response.status_code == 502:
|
||||||
@ -128,8 +257,21 @@ class FirecrawlApp:
|
|||||||
return response
|
return response
|
||||||
return response
|
return response
|
||||||
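The hunks above hide what happens between the `502` check and `return response`. A minimal sketch of one plausible retry loop follows, assuming an exponential delay of `backoff_factor * 2 ** attempt`; that formula, the name `request_with_retries`, and the injectable `send` and `sleep` callables are assumptions made for illustration and are not confirmed by the diff.

```python
import time
from typing import Callable, Optional


def request_with_retries(send: Callable[[], int],
                         retries: int = 3,
                         backoff_factor: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Call `send` up to `retries` times, retrying only while it returns 502.

    `send` stands in for the HTTP call and returns a status code.
    Returns the last status code seen, or None if `retries` is 0.
    """
    status: Optional[int] = None
    for attempt in range(retries):
        status = send()
        if status != 502:
            return status
        # Assumed backoff: 0.5s, 1s, 2s, ... with the default factor.
        sleep(backoff_factor * (2 ** attempt))
    return status
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.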
|
|
||||||
def _monitor_job_status(self, job_id, headers, timeout):
|
def _monitor_job_status(self, job_id: str, headers: Dict[str, str], poll_interval: int) -> Any:
|
||||||
import time
|
"""
|
||||||
|
Monitor the status of a crawl job until completion.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
job_id (str): The ID of the crawl job.
|
||||||
|
headers (Dict[str, str]): The headers to include in the status check requests.
|
||||||
|
poll_interval (int): Seconds between status checks.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Any: The crawl results if the job is completed successfully.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: If the job fails or an error occurs during status checks.
|
||||||
|
"""
|
||||||
while True:
|
while True:
|
||||||
status_response = self._get_request(f'{self.api_url}/v0/crawl/status/{job_id}', headers)
|
status_response = self._get_request(f'{self.api_url}/v0/crawl/status/{job_id}', headers)
|
||||||
if status_response.status_code == 200:
|
if status_response.status_code == 200:
|
||||||
@ -139,18 +281,38 @@ class FirecrawlApp:
|
|||||||
return status_data['data']
|
return status_data['data']
|
||||||
else:
|
else:
|
||||||
raise Exception('Crawl job completed but no data was returned')
|
raise Exception('Crawl job completed but no data was returned')
|
||||||
elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
|
elif status_data['status'] in ['active', 'paused', 'pending', 'queued', 'waiting']:
|
||||||
if timeout < 2:
|
poll_interval = max(poll_interval, 2)
|
||||||
timeout = 2
|
time.sleep(poll_interval) # Wait for the specified interval before checking again
|
||||||
time.sleep(timeout) # Wait for the specified timeout before checking again
|
|
||||||
else:
|
else:
|
||||||
raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
|
raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
|
||||||
else:
|
else:
|
||||||
self._handle_error(status_response, 'check crawl status')
|
self._handle_error(status_response, 'check crawl status')
|
||||||
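The polling behaviour of `_monitor_job_status` can be sketched without the HTTP layer. This is an illustrative reduction: `wait_for_job`, `fetch_status`, and the injectable `sleep` are hypothetical names, and the `'completed'` status string is inferred from the error message because the hunk elides that line.

```python
import time
from typing import Any, Callable, Dict

# Statuses treated as "still running" in the new side of the diff.
IN_PROGRESS = {'active', 'paused', 'pending', 'queued', 'waiting'}


def wait_for_job(fetch_status: Callable[[], Dict[str, Any]],
                 poll_interval: int,
                 sleep: Callable[[float], None] = time.sleep) -> Any:
    """Poll `fetch_status` until the job completes, fails, or is stopped."""
    # Enforce the two-second minimum once, as the diff does with max().
    poll_interval = max(poll_interval, 2)
    while True:
        status_data = fetch_status()
        status = status_data['status']
        if status == 'completed':  # assumed status name
            if 'data' in status_data:
                return status_data['data']
            raise Exception('Crawl job completed but no data was returned')
        if status in IN_PROGRESS:
            sleep(poll_interval)  # wait before checking again
        else:
            raise Exception(f'Crawl job failed or was stopped. Status: {status}')
```

Note that the loop has no overall deadline; a caller that needs one would have to add it separately.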
|
|
||||||
def _handle_error(self, response, action):
|
def _handle_error(self, response: requests.Response, action: str) -> None:
|
||||||
if response.status_code in [402, 408, 409, 500]:
|
"""
|
||||||
error_message = response.json().get('error', 'Unknown error occurred')
|
Handle errors from API responses.
|
||||||
raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
|
|
||||||
|
Args:
|
||||||
|
response (requests.Response): The response object from the API request.
|
||||||
|
action (str): Description of the action that was being performed.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
Exception: An exception with a message containing the status code and error details from the response.
|
||||||
|
"""
|
||||||
|
error_message = response.json().get('error', 'No additional error details provided.')
|
||||||
|
|
||||||
|
if response.status_code == 402:
|
||||||
|
message = f"Payment Required: Failed to {action}. {error_message}"
|
||||||
|
elif response.status_code == 408:
|
||||||
|
message = f"Request Timeout: Failed to {action} as the request timed out. {error_message}"
|
||||||
|
elif response.status_code == 409:
|
||||||
|
message = f"Conflict: Failed to {action} due to a conflict. {error_message}"
|
||||||
|
elif response.status_code == 500:
|
||||||
|
message = f"Internal Server Error: Failed to {action}. {error_message}"
|
||||||
else:
|
else:
|
||||||
raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')
|
message = f"Unexpected error during {action}: Status code {response.status_code}. {error_message}"
|
||||||
|
|
||||||
|
# Raise an HTTPError with the custom message and attach the response
|
||||||
|
raise requests.exceptions.HTTPError(message, response=response)
|
||||||
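The status-code branching in the new `_handle_error` can be isolated as a pure function, which makes the message format easy to check. A minimal sketch follows; `build_error_message` is a hypothetical helper, not part of the SDK, and it only builds the string that the diff then passes to `requests.exceptions.HTTPError`.

```python
def build_error_message(status_code: int, action: str, error_message: str) -> str:
    """Return the message text used for a failed API call, by status code."""
    if status_code == 402:
        return f"Payment Required: Failed to {action}. {error_message}"
    if status_code == 408:
        return f"Request Timeout: Failed to {action} as the request timed out. {error_message}"
    if status_code == 409:
        return f"Conflict: Failed to {action} due to a conflict. {error_message}"
    if status_code == 500:
        return f"Internal Server Error: Failed to {action}. {error_message}"
    # Any other status code falls through to the generic message.
    return f"Unexpected error during {action}: Status code {status_code}. {error_message}"
```

Because the diff raises `requests.exceptions.HTTPError` with `response=response`, callers can catch that exception type and inspect the attached response.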
|
|
@ -1,7 +1,179 @@
|
|||||||
Metadata-Version: 2.1
|
Metadata-Version: 2.1
|
||||||
Name: firecrawl-py
|
Name: firecrawl-py
|
||||||
Version: 0.0.9
|
Version: 0.0.12
|
||||||
Summary: Python SDK for Firecrawl API
|
Summary: Python SDK for Firecrawl API
|
||||||
Home-page: https://github.com/mendableai/firecrawl
|
Home-page: https://github.com/mendableai/firecrawl
|
||||||
Author: Mendable.ai
|
Author: Mendable.ai
|
||||||
Author-email: nick@mendable.ai
|
Author-email: nick@mendable.ai
|
||||||
|
License: GNU General Public License v3 (GPLv3)
|
||||||
|
Project-URL: Documentation, https://docs.firecrawl.dev
|
||||||
|
Project-URL: Source, https://github.com/mendableai/firecrawl
|
||||||
|
Project-URL: Tracker, https://github.com/mendableai/firecrawl/issues
|
||||||
|
Keywords: SDK API firecrawl
|
||||||
|
Classifier: Development Status :: 5 - Production/Stable
|
||||||
|
Classifier: Environment :: Web Environment
|
||||||
|
Classifier: Intended Audience :: Developers
|
||||||
|
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
|
||||||
|
Classifier: Natural Language :: English
|
||||||
|
Classifier: Operating System :: OS Independent
|
||||||
|
Classifier: Programming Language :: Python
|
||||||
|
Classifier: Programming Language :: Python :: 3
|
||||||
|
Classifier: Programming Language :: Python :: 3.8
|
||||||
|
Classifier: Programming Language :: Python :: 3.9
|
||||||
|
Classifier: Programming Language :: Python :: 3.10
|
||||||
|
Classifier: Topic :: Internet
|
||||||
|
Classifier: Topic :: Internet :: WWW/HTTP
|
||||||
|
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
|
||||||
|
Classifier: Topic :: Software Development
|
||||||
|
Classifier: Topic :: Software Development :: Libraries
|
||||||
|
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
||||||
|
Classifier: Topic :: Text Processing
|
||||||
|
Classifier: Topic :: Text Processing :: Indexing
|
||||||
|
Requires-Python: >=3.8
|
||||||
|
Description-Content-Type: text/markdown
|
||||||
|
|
||||||
|
# Firecrawl Python SDK
|
||||||
|
|
||||||
|
The Firecrawl Python SDK is a library that allows you to easily scrape and crawl websites, and output the data in a format ready for use with language models (LLMs). It provides a simple and intuitive interface for interacting with the Firecrawl API.
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
To install the Firecrawl Python SDK, you can use pip:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install firecrawl-py
|
||||||
|
```
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
1. Get an API key from [firecrawl.dev](https://firecrawl.dev)
|
||||||
|
2. Set the API key as an environment variable named `FIRECRAWL_API_KEY` or pass it as a parameter to the `FirecrawlApp` class.
|
||||||
|
|
||||||
|
|
||||||
|
Here's an example of how to use the SDK:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from firecrawl import FirecrawlApp
|
||||||
|
|
||||||
|
# Initialize the FirecrawlApp with your API key
|
||||||
|
app = FirecrawlApp(api_key='your_api_key')
|
||||||
|
|
||||||
|
# Scrape a single URL
|
||||||
|
url = 'https://mendable.ai'
|
||||||
|
scraped_data = app.scrape_url(url)
|
||||||
|
|
||||||
|
# Crawl a website
|
||||||
|
crawl_url = 'https://mendable.ai'
|
||||||
|
params = {
|
||||||
|
'pageOptions': {
|
||||||
|
'onlyMainContent': True
|
||||||
|
}
|
||||||
|
}
|
||||||
|
crawl_result = app.crawl_url(crawl_url, params=params)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Scraping a URL
|
||||||
|
|
||||||
|
To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.
|
||||||
|
|
||||||
|
```python
|
||||||
|
url = 'https://example.com'
|
||||||
|
scraped_data = app.scrape_url(url)
|
||||||
|
```
|
||||||
|
### Extracting structured data from a URL
|
||||||
|
|
||||||
|
With LLM extraction, you can easily extract structured data from any URL. We support pydantic schemas to make it easier for you too. Here is how to use it:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from typing import List

from pydantic import BaseModel, Field

class ArticleSchema(BaseModel):
|
||||||
|
title: str
|
||||||
|
points: int
|
||||||
|
by: str
|
||||||
|
commentsURL: str
|
||||||
|
|
||||||
|
class TopArticlesSchema(BaseModel):
|
||||||
|
top: List[ArticleSchema] = Field(..., max_items=5, description="Top 5 stories")
|
||||||
|
|
||||||
|
data = app.scrape_url('https://news.ycombinator.com', {
|
||||||
|
'extractorOptions': {
|
||||||
|
'extractionSchema': TopArticlesSchema.model_json_schema(),
|
||||||
|
'mode': 'llm-extraction'
|
||||||
|
},
|
||||||
|
'pageOptions':{
|
||||||
|
'onlyMainContent': True
|
||||||
|
}
|
||||||
|
})
|
||||||
|
print(data["llm_extraction"])
|
||||||
|
```
|
||||||
|
|
||||||
|
### Search for a query
|
||||||
|
|
||||||
|
Searches the web, gets the most relevant results, scrapes each page, and returns the markdown.
|
||||||
|
|
||||||
|
```python
|
||||||
|
query = 'what is mendable?'
|
||||||
|
search_result = app.search(query)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Crawling a Website
|
||||||
|
|
||||||
|
To crawl a website, use the `crawl_url` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.
|
||||||
|
|
||||||
|
The `wait_until_done` parameter determines whether the method should wait for the crawl job to complete before returning the result. If set to `True`, the method will periodically check the status of the crawl job until it is completed or the specified `timeout` (in seconds) is reached. If set to `False`, the method will return immediately with the job ID, and you can manually check the status of the crawl job using the `check_crawl_status` method.
|
||||||
|
|
||||||
|
```python
|
||||||
|
crawl_url = 'https://example.com'
|
||||||
|
params = {
|
||||||
|
'crawlerOptions': {
|
||||||
|
'excludes': ['blog/*'],
|
||||||
|
'includes': [], # leave empty for all pages
|
||||||
|
'limit': 1000,
|
||||||
|
},
|
||||||
|
'pageOptions': {
|
||||||
|
'onlyMainContent': True
|
||||||
|
}
|
||||||
|
}
|
||||||
|
crawl_result = app.crawl_url(crawl_url, params=params, wait_until_done=True, timeout=5)
|
||||||
|
```
|
||||||
|
|
||||||
|
If `wait_until_done` is set to `True`, the `crawl_url` method will return the crawl result once the job is completed. If the job fails or is stopped, an exception will be raised.
|
||||||
|
|
||||||
|
### Checking Crawl Status
|
||||||
|
|
||||||
|
To check the status of a crawl job, use the `check_crawl_status` method. It takes the job ID as a parameter and returns the current status of the crawl job.
|
||||||
|
|
||||||
|
```python
|
||||||
|
job_id = crawl_result['jobId']
|
||||||
|
status = app.check_crawl_status(job_id)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
The SDK handles errors returned by the Firecrawl API and raises appropriate exceptions. If an error occurs during a request, an exception will be raised with a descriptive error message.
|
||||||
|
|
||||||
|
## Running the Tests with Pytest
|
||||||
|
|
||||||
|
To ensure the functionality of the Firecrawl Python SDK, we have included end-to-end tests using `pytest`. These tests cover various aspects of the SDK, including URL scraping, web searching, and website crawling.
|
||||||
|
|
||||||
|
### Running the Tests
|
||||||
|
|
||||||
|
To run the tests, execute the following commands:
|
||||||
|
|
||||||
|
Install pytest:
|
||||||
|
```bash
|
||||||
|
pip install pytest
|
||||||
|
```
|
||||||
|
|
||||||
|
Run:
|
||||||
|
```bash
|
||||||
|
pytest firecrawl/__tests__/e2e_withAuth/test.py
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
## Contributing
|
||||||
|
|
||||||
|
Contributions to the Firecrawl Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
The Firecrawl Python SDK is open-source and released under the [MIT License](https://opensource.org/licenses/MIT).
|
||||||
|
@ -1 +1,3 @@
|
|||||||
requests
|
requests
|
||||||
|
pytest
|
||||||
|
python-dotenv
|
||||||
|
48
apps/python-sdk/pyproject.toml
Normal file
@ -0,0 +1,48 @@
|
|||||||
|
[build-system]
|
||||||
|
requires = ["setuptools>=42", "wheel"]
|
||||||
|
build-backend = "setuptools.build_meta"
|
||||||
|
|
||||||
|
[project]
|
||||||
|
dynamic = ["version"]
|
||||||
|
name = "firecrawl-py"
|
||||||
|
description = "Python SDK for Firecrawl API"
|
||||||
|
readme = {file="README.md", content-type = "text/markdown"}
|
||||||
|
requires-python = ">=3.8"
|
||||||
|
dependencies = [
|
||||||
|
"requests",
|
||||||
|
]
|
||||||
|
authors = [{name = "Mendable.ai",email = "nick@mendable.ai"}]
|
||||||
|
maintainers = [{name = "Mendable.ai",email = "nick@mendable.ai"}]
|
||||||
|
license = {text = "GNU General Public License v3 (GPLv3)"}
|
||||||
|
|
||||||
|
classifiers = [
|
||||||
|
"Development Status :: 5 - Production/Stable",
|
||||||
|
"Environment :: Web Environment",
|
||||||
|
"Intended Audience :: Developers",
|
||||||
|
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
|
||||||
|
"Natural Language :: English",
|
||||||
|
"Operating System :: OS Independent",
|
||||||
|
"Programming Language :: Python",
|
||||||
|
"Programming Language :: Python :: 3",
|
||||||
|
"Programming Language :: Python :: 3.8",
|
||||||
|
"Programming Language :: Python :: 3.9",
|
||||||
|
"Programming Language :: Python :: 3.10",
|
||||||
|
"Topic :: Internet",
|
||||||
|
"Topic :: Internet :: WWW/HTTP",
|
||||||
|
"Topic :: Internet :: WWW/HTTP :: Indexing/Search",
|
||||||
|
"Topic :: Software Development",
|
||||||
|
"Topic :: Software Development :: Libraries",
|
||||||
|
"Topic :: Software Development :: Libraries :: Python Modules",
|
||||||
|
"Topic :: Text Processing",
|
||||||
|
"Topic :: Text Processing :: Indexing",
|
||||||
|
]
|
||||||
|
|
||||||
|
keywords = ["SDK", "API", "firecrawl"]
|
||||||
|
|
||||||
|
[project.urls]
|
||||||
|
"Documentation" = "https://docs.firecrawl.dev"
|
||||||
|
"Source" = "https://github.com/mendableai/firecrawl"
|
||||||
|
"Tracker" = "https://github.com/mendableai/firecrawl/issues"
|
||||||
|
|
||||||
|
[tool.setuptools.packages.find]
|
||||||
|
where = ["."]
|
3
apps/python-sdk/requirements.txt
Normal file
@ -0,0 +1,3 @@
|
|||||||
|
requests
|
||||||
|
pytest
|
||||||
|
python-dotenv
|
@ -1,14 +1,63 @@
|
|||||||
from setuptools import setup, find_packages
|
import re
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from setuptools import find_packages, setup
|
||||||
|
|
||||||
|
this_directory = Path(__file__).parent
|
||||||
|
long_description_content = (this_directory / "README.md").read_text()
|
||||||
|
|
||||||
|
|
||||||
|
def get_version():
|
||||||
|
"""Dynamically set version"""
|
||||||
|
version_file = (this_directory / "firecrawl" / "__init__.py").read_text()
|
||||||
|
version_match = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]", version_file, re.M)
|
||||||
|
if version_match:
|
||||||
|
return version_match.group(1)
|
||||||
|
raise RuntimeError("Unable to find version string.")
|
||||||
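The regular expression in `get_version` can be tried on an in-memory string. This sketch reuses the pattern from the diff; `extract_version` and the sample text are illustrative names and data, and the file read from `firecrawl/__init__.py` is replaced by a plain string argument.

```python
import re


def extract_version(source: str) -> str:
    """Return the value assigned to __version__ in the given module source."""
    # re.M lets ^ match at the start of any line, not only the first one.
    match = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]", source, re.M)
    if match:
        return match.group(1)
    raise RuntimeError("Unable to find version string.")


# Hypothetical module text; the real value lives in firecrawl/__init__.py.
sample = 'import os\n__version__ = "0.0.12"\n'
print(extract_version(sample))
```

The pattern expects exactly one space on each side of `=`, so a line such as `__version__="0.0.12"` would not match and would raise the `RuntimeError`.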
|
|
||||||
|
|
||||||
setup(
|
setup(
|
||||||
name='firecrawl-py',
|
name="firecrawl-py",
|
||||||
version='0.0.9',
|
version=get_version(),
|
||||||
url='https://github.com/mendableai/firecrawl',
|
url="https://github.com/mendableai/firecrawl",
|
||||||
author='Mendable.ai',
|
author="Mendable.ai",
|
||||||
author_email='nick@mendable.ai',
|
author_email="nick@mendable.ai",
|
||||||
description='Python SDK for Firecrawl API',
|
description="Python SDK for Firecrawl API",
|
||||||
|
long_description=long_description_content,
|
||||||
|
long_description_content_type="text/markdown",
|
||||||
packages=find_packages(),
|
packages=find_packages(),
|
||||||
install_requires=[
|
install_requires=[
|
||||||
'requests',
|
'requests',
|
||||||
|
'pytest',
|
||||||
|
'python-dotenv',
|
||||||
],
|
],
|
||||||
|
python_requires=">=3.8",
|
||||||
|
classifiers=[
|
||||||
|
"Development Status :: 5 - Production/Stable",
|
||||||
|
"Environment :: Web Environment",
|
||||||
|
"Intended Audience :: Developers",
|
||||||
|
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
|
||||||
|
"Natural Language :: English",
|
||||||
|
"Operating System :: OS Independent",
|
||||||
|
"Programming Language :: Python",
|
||||||
|
"Programming Language :: Python :: 3",
|
||||||
|
"Programming Language :: Python :: 3.8",
|
||||||
|
"Programming Language :: Python :: 3.9",
|
||||||
|
"Programming Language :: Python :: 3.10",
|
||||||
|
"Topic :: Internet",
|
||||||
|
"Topic :: Internet :: WWW/HTTP",
|
||||||
|
"Topic :: Internet :: WWW/HTTP :: Indexing/Search",
|
||||||
|
"Topic :: Software Development",
|
||||||
|
"Topic :: Software Development :: Libraries",
|
||||||
|
"Topic :: Software Development :: Libraries :: Python Modules",
|
||||||
|
"Topic :: Text Processing",
|
||||||
|
"Topic :: Text Processing :: Indexing",
|
||||||
|
],
|
||||||
|
keywords="SDK API firecrawl",
|
||||||
|
project_urls={
|
||||||
|
"Documentation": "https://docs.firecrawl.dev",
|
||||||
|
"Source": "https://github.com/mendableai/firecrawl",
|
||||||
|
"Tracker": "https://github.com/mendableai/firecrawl/issues",
|
||||||
|
},
|
||||||
|
license="GNU General Public License v3 (GPLv3)",
|
||||||
)
|
)
|
||||||
|
@ -5,6 +5,10 @@ services:
|
|||||||
build: apps/playwright-service
|
build: apps/playwright-service
|
||||||
environment:
|
environment:
|
||||||
- PORT=3000
|
- PORT=3000
|
||||||
|
- PROXY_SERVER=${PROXY_SERVER}
|
||||||
|
- PROXY_USERNAME=${PROXY_USERNAME}
|
||||||
|
- PROXY_PASSWORD=${PROXY_PASSWORD}
|
||||||
|
- BLOCK_MEDIA=${BLOCK_MEDIA}
|
||||||
networks:
|
networks:
|
||||||
- backend
|
- backend
|
||||||
|
|
||||||
|
41
examples/kubernetes-cluster-install/README.md
Normal file
@ -0,0 +1,41 @@
|
|||||||
|
# Install Firecrawl on a Kubernetes Cluster (Simple Version)
|
||||||
|
# Before installing
|
||||||
|
1. Fill in [secret.yaml](secret.yaml) and [configmap.yaml](configmap.yaml), and do not commit secrets to version control
|
||||||
|
2. Build the Docker images and host them in your Docker registry (replace the target registry with your own)
|
||||||
|
1. API (which is also used as a worker image)
|
||||||
|
1. ```bash
|
||||||
|
docker build -t ghcr.io/winkk-dev/firecrawl:latest ../../apps/api
|
||||||
|
docker push ghcr.io/winkk-dev/firecrawl:latest
|
||||||
|
```
|
||||||
|
2. Playwright
|
||||||
|
1. ```bash
|
||||||
|
docker build -t ghcr.io/winkk-dev/firecrawl-playwright:latest ../../apps/playwright-service
|
||||||
|
docker push ghcr.io/winkk-dev/firecrawl-playwright:latest
|
||||||
|
```
|
||||||
|
3. Replace the image in [worker.yaml](worker.yaml), [api.yaml](api.yaml) and [playwright-service.yaml](playwright-service.yaml)
|
||||||
|
|
||||||
|
## Install
|
||||||
|
```bash
|
||||||
|
kubectl apply -f configmap.yaml
|
||||||
|
kubectl apply -f secret.yaml
|
||||||
|
kubectl apply -f playwright-service.yaml
|
||||||
|
kubectl apply -f api.yaml
|
||||||
|
kubectl apply -f worker.yaml
|
||||||
|
kubectl apply -f redis.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
# Port Forwarding for Testing
|
||||||
|
```bash
|
||||||
|
kubectl port-forward svc/api 3002:3002 -n dev
|
||||||
|
```
|
||||||
|
|
||||||
|
# Delete Firecrawl
|
||||||
|
```bash
|
||||||
|
kubectl delete -f configmap.yaml
|
||||||
|
kubectl delete -f secret.yaml
|
||||||
|
kubectl delete -f playwright-service.yaml
|
||||||
|
kubectl delete -f api.yaml
|
||||||
|
kubectl delete -f worker.yaml
|
||||||
|
kubectl delete -f redis.yaml
|
||||||
|
```
|
39
examples/kubernetes-cluster-install/api.yaml
Normal file
@ -0,0 +1,39 @@
|
|||||||
|
apiVersion: apps/v1
|
||||||
|
kind: Deployment
|
||||||
|
metadata:
|
||||||
|
name: api
|
||||||
|
spec:
|
||||||
|
replicas: 1
|
||||||
|
selector:
|
||||||
|
matchLabels:
|
||||||
|
app: api
|
||||||
|
template:
|
||||||
|
metadata:
|
||||||
|
labels:
|
||||||
|
app: api
|
||||||
|
spec:
|
||||||
|
imagePullSecrets:
|
||||||
|
- name: docker-registry-secret
|
||||||
|
containers:
|
||||||
|
- name: api
|
||||||
|
image: ghcr.io/winkk-dev/firecrawl:latest
|
||||||
|
args: [ "pnpm", "run", "start:production" ]
|
||||||
|
ports:
|
||||||
|
- containerPort: 3002
|
||||||
|
envFrom:
|
||||||
|
- configMapRef:
|
||||||
|
name: firecrawl-config
|
||||||
|
- secretRef:
|
||||||
|
name: firecrawl-secret
|
||||||
|
---
|
||||||
|
apiVersion: v1
|
||||||
|
kind: Service
|
||||||
|
metadata:
|
||||||
|
name: api
|
||||||
|
spec:
|
||||||
|
selector:
|
||||||
|
app: api
|
||||||
|
ports:
|
||||||
|
- protocol: TCP
|
||||||
|
port: 3002
|
||||||
|
targetPort: 3002
|
14
examples/kubernetes-cluster-install/configmap.yaml
Normal file
@ -0,0 +1,14 @@
|
|||||||
|
apiVersion: v1
|
||||||
|
kind: ConfigMap
|
||||||
|
metadata:
|
||||||
|
name: firecrawl-config
|
||||||
|
data:
|
||||||
|
NUM_WORKERS_PER_QUEUE: "8"
|
||||||
|
PORT: "3002"
|
||||||
|
HOST: "0.0.0.0"
|
||||||
|
REDIS_URL: "redis://redis:6379"
|
||||||
|
PLAYWRIGHT_MICROSERVICE_URL: "http://playwright-service:3000"
|
||||||
|
USE_DB_AUTHENTICATION: "false"
|
||||||
|
SUPABASE_ANON_TOKEN: ""
|
||||||
|
SUPABASE_URL: ""
|
||||||
|
SUPABASE_SERVICE_TOKEN: ""
|
36
examples/kubernetes-cluster-install/playwright-service.yaml
Normal file
@ -0,0 +1,36 @@
|
|||||||
|
apiVersion: apps/v1
|
||||||
|
kind: Deployment
|
||||||
|
metadata:
|
||||||
|
name: playwright-service
|
||||||
|
spec:
|
||||||
|
replicas: 1
|
||||||
|
selector:
|
||||||
|
matchLabels:
|
||||||
|
app: playwright-service
|
||||||
|
template:
|
||||||
|
metadata:
|
||||||
|
labels:
|
||||||
|
app: playwright-service
|
||||||
|
spec:
|
||||||
|
imagePullSecrets:
|
||||||
|
- name: docker-registry-secret
|
||||||
|
containers:
|
||||||
|
- name: playwright-service
|
||||||
|
image: ghcr.io/winkk-dev/firecrawl-playwright:latest
|
||||||
|
ports:
|
||||||
|
- containerPort: 3000
|
||||||
|
envFrom:
|
||||||
|
- configMapRef:
|
||||||
|
name: firecrawl-config
|
||||||
|
---
|
||||||
|
apiVersion: v1
|
||||||
|
kind: Service
|
||||||
|
metadata:
|
||||||
|
name: playwright-service
|
||||||
|
spec:
|
||||||
|
selector:
|
||||||
|
app: playwright-service
|
||||||
|
ports:
|
||||||
|
- protocol: TCP
|
||||||
|
port: 3000
|
||||||
|
targetPort: 3000
|
30
examples/kubernetes-cluster-install/redis.yaml
Normal file
@ -0,0 +1,30 @@
|
|||||||
|
apiVersion: apps/v1
|
||||||
|
kind: Deployment
|
||||||
|
metadata:
|
||||||
|
name: redis
|
||||||
|
spec:
|
||||||
|
replicas: 1
|
||||||
|
selector:
|
||||||
|
matchLabels:
|
||||||
|
app: redis
|
||||||
|
template:
|
||||||
|
metadata:
|
||||||
|
labels:
|
||||||
|
app: redis
|
||||||
|
spec:
|
||||||
|
containers:
|
||||||
|
- name: redis
|
||||||
|
image: redis:alpine
|
||||||
|
args: ["redis-server", "--bind", "0.0.0.0"]
|
||||||
|
---
|
||||||
|
apiVersion: v1
|
||||||
|
kind: Service
|
||||||
|
metadata:
|
||||||
|
name: redis
|
||||||
|
spec:
|
||||||
|
selector:
|
||||||
|
app: redis
|
||||||
|
ports:
|
||||||
|
- protocol: TCP
|
||||||
|
port: 6379
|
||||||
|
targetPort: 6379
|
20
examples/kubernetes-cluster-install/secret.yaml
Normal file
@ -0,0 +1,20 @@
|
|||||||
|
apiVersion: v1
|
||||||
|
kind: Secret
|
||||||
|
metadata:
|
||||||
|
name: firecrawl-secret
|
||||||
|
type: Opaque
|
||||||
|
data:
|
||||||
|
OPENAI_API_KEY: ""
|
||||||
|
SLACK_WEBHOOK_URL: ""
|
||||||
|
SERPER_API_KEY: ""
|
||||||
|
LLAMAPARSE_API_KEY: ""
|
||||||
|
LOGTAIL_KEY: ""
|
||||||
|
BULL_AUTH_KEY: ""
|
||||||
|
TEST_API_KEY: ""
|
||||||
|
POSTHOG_API_KEY: ""
|
||||||
|
POSTHOG_HOST: ""
|
||||||
|
SCRAPING_BEE_API_KEY: ""
|
||||||
|
STRIPE_PRICE_ID_STANDARD: ""
|
||||||
|
STRIPE_PRICE_ID_SCALE: ""
|
||||||
|
HYPERDX_API_KEY: ""
|
||||||
|
FIRE_ENGINE_BETA_URL: ""
|
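Kubernetes expects every value under `data:` in a Secret to be base64-encoded (plain-text values go under `stringData:` instead). A minimal sketch for producing such values follows; `encode_secret_value` is a hypothetical helper and the sample key is a placeholder, not a real credential.

```python
import base64


def encode_secret_value(value: str) -> str:
    """Base64-encode a plain-text value for the `data:` field of a Secret."""
    return base64.b64encode(value.encode('utf-8')).decode('ascii')


# Placeholder value for illustration only.
encoded = encode_secret_value('my-openai-key')
print(f'OPENAI_API_KEY: "{encoded}"')
```

Base64 is an encoding, not encryption, so the filled-in manifest still needs to stay out of version control.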
24
examples/kubernetes-cluster-install/worker.yaml
Normal file
@ -0,0 +1,24 @@
|
|||||||
|
apiVersion: apps/v1
|
||||||
|
kind: Deployment
|
||||||
|
metadata:
|
||||||
|
name: worker
|
||||||
|
spec:
|
||||||
|
replicas: 1
|
||||||
|
selector:
|
||||||
|
matchLabels:
|
||||||
|
app: worker
|
||||||
|
template:
|
||||||
|
metadata:
|
||||||
|
labels:
|
||||||
|
app: worker
|
||||||
|
spec:
|
||||||
|
imagePullSecrets:
|
||||||
|
- name: docker-registry-secret
|
||||||
|
containers:
|
||||||
|
- name: worker
|
||||||
|
image: ghcr.io/winkk-dev/firecrawl:latest
|
||||||
|
envFrom:
|
||||||
|
- configMapRef:
|
||||||
|
name: firecrawl-config
|
||||||
|
- secretRef:
|
||||||
|
name: firecrawl-secret
|
3
examples/roastmywebsite-example-app/.eslintrc.json
Normal file
@ -0,0 +1,3 @@
|
|||||||
|
{
|
||||||
|
"extends": "next/core-web-vitals"
|
||||||
|
}
|
38
examples/roastmywebsite-example-app/.gitignore
vendored
Normal file
@ -0,0 +1,38 @@
|
|||||||
|
# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
|
||||||
|
|
||||||
|
# dependencies
|
||||||
|
/node_modules
|
||||||
|
/.pnp
|
||||||
|
.pnp.js
|
||||||
|
.yarn/install-state.gz
|
||||||
|
|
||||||
|
# testing
|
||||||
|
/coverage
|
||||||
|
|
||||||
|
# next.js
|
||||||
|
/.next/
|
||||||
|
/out/
|
||||||
|
|
||||||
|
# production
|
||||||
|
/build
|
||||||
|
|
||||||
|
# misc
|
||||||
|
.DS_Store
|
||||||
|
*.pem
|
||||||
|
|
||||||
|
# debug
|
||||||
|
npm-debug.log*
|
||||||
|
yarn-debug.log*
|
||||||
|
yarn-error.log*
|
||||||
|
|
||||||
|
# local env files
|
||||||
|
.env*.local
|
||||||
|
|
||||||
|
# vercel
|
||||||
|
.vercel
|
||||||
|
|
||||||
|
# typescript
|
||||||
|
*.tsbuildinfo
|
||||||
|
next-env.d.ts
|
||||||
|
.env
|
||||||
|
node_modules
|
5
examples/roastmywebsite-example-app/README.md
Normal file
@ -0,0 +1,5 @@
|
|||||||
|
# Roast My Website 🔥
|
||||||
|
|
||||||
|
Welcome to Roast My Website, the ultimate tool for putting your website through the wringer! This repository harnesses the power of Firecrawl to scrape and capture screenshots of websites, and then unleashes the latest LLM vision models to mercilessly roast them.
|
||||||
|
|
||||||
|
Check it out at roastmywebsite.ai 😈
|
17
examples/roastmywebsite-example-app/components.json
Normal file
@ -0,0 +1,17 @@
|
|||||||
|
{
|
||||||
|
"$schema": "https://ui.shadcn.com/schema.json",
|
||||||
|
"style": "default",
|
||||||
|
"rsc": true,
|
||||||
|
"tsx": true,
|
||||||
|
"tailwind": {
|
||||||
|
"config": "tailwind.config.ts",
|
||||||
|
"css": "src/app/globals.css",
|
||||||
|
"baseColor": "zinc",
|
||||||
|
"cssVariables": false,
|
||||||
|
"prefix": ""
|
||||||
|
},
|
||||||
|
"aliases": {
|
||||||
|
"components": "@/components",
|
||||||
|
"utils": "@/lib/utils"
|
||||||
|
}
|
||||||
|
}
|
Some files were not shown because too many files have changed in this diff.