Nicolas
9e61d431f0
Nick: hyper dx integration init
2024-05-20 13:36:34 -07:00
Nicolas
6feb21cc35
Update website_params.ts
2024-05-17 11:21:26 -07:00
Nicolas
5be208f595
Nick: fixed
2024-05-17 10:40:44 -07:00
Nicolas
df6c3d1e7d
Merge branch 'main' into detect-pdfs
2024-05-17 09:55:51 -07:00
Nicolas
9d635cb2a3
Nick: docx support
2024-05-16 11:48:02 -07:00
Nicolas
098db17913
Update index.ts
2024-05-15 17:37:09 -07:00
Nicolas
6ca368327f
Merge branch 'main' into test/crawl-options
2024-05-15 17:18:25 -07:00
Nicolas
24be4866c5
Nick:
2024-05-15 17:16:20 -07:00
Nicolas
ade4e05cff
Nick: working
2024-05-15 17:13:04 -07:00
Nicolas
bfccaf670d
Nick: fixes most of it
2024-05-15 15:30:37 -07:00
rafaelsideguide
d91043376c
not working yet
2024-05-15 18:54:40 -03:00
rafaelsideguide
fa014defc7
Fixing child links only bug
2024-05-15 18:35:09 -03:00
Nicolas
2ba743fb1a
Merge pull request #27 from eltociear/patch-1
...
refactor: fix typo in WebScraper/index.ts
2024-05-15 13:28:38 -07:00
Nicolas
1b0d6341d3
Update index.ts
2024-05-15 11:48:12 -07:00
Nicolas
d10f81e7fe
Nick: fixes
2024-05-15 11:28:20 -07:00
Nicolas
87570bdfa1
Update index.ts
2024-05-15 11:06:03 -07:00
Ikko Eltociear Ashimine
e91c122c69
Merge branch 'main' into patch-1
2024-05-15 12:14:52 +09:00
Nicolas
a0fdc6f7c6
Nick:
2024-05-14 12:12:40 -07:00
Nicolas
7f31959be7
Nick:
2024-05-14 12:04:36 -07:00
Nicolas
8a72cf556b
Nick:
2024-05-13 21:10:58 -07:00
Nicolas
26a092f780
Update index.ts
2024-05-13 21:04:49 -07:00
Nicolas
8101cbee37
Update index.ts
2024-05-13 21:02:47 -07:00
Nicolas
86b8439844
Nick:
2024-05-13 20:51:42 -07:00
Nicolas
a96fc5b96d
Nick: 4x speed
2024-05-13 20:45:11 -07:00
rafaelsideguide
8eb2e95f19
Cleaned up
2024-05-13 16:13:10 -03:00
Nicolas
2ce045912f
Nick: disable vision right now
2024-05-13 10:56:08 -07:00
rafaelsideguide
f4348024c6
Added check during scraping to deal with pdfs
...
Checks if the URL is a PDF during the scraping process (single_url.ts).
TODO: Run integration tests. Does this strategy affect the running time?
P.S. Some comments need to be removed if we decide to proceed with this strategy.
2024-05-13 09:13:42 -03:00
Rafael Miller
5a2712fa5a
Merge branch 'main' into detect-pdfs
2024-05-10 15:53:13 -03:00
rafaelsideguide
bc6b929b43
[Bug] Fixing /crawl limit
2024-05-10 12:15:54 -03:00
Nicolas
66bd1e4020
Update website_params.ts
2024-05-09 18:41:15 -07:00
Nicolas
d21091bb06
Update single_url.ts
2024-05-09 17:52:46 -07:00
Nicolas
be85008622
Nick: better
2024-05-09 17:48:11 -07:00
Nicolas
be5661a768
Nick: a lot better
2024-05-09 17:45:16 -07:00
Nicolas
dcedb8d798
Merge branch 'main' into feat/max-depth
2024-05-07 10:20:49 -07:00
Nicolas
6505bf6bf2
Merge branch 'main' into feat/max-depth
2024-05-07 10:20:44 -07:00
Nicolas
bdbee963f7
Merge branch 'main' into nsc/cancel-job
2024-05-07 10:13:43 -07:00
rafaelsideguide
61d615c04b
Added tests
2024-05-07 14:03:00 -03:00
rafaelsideguide
e1f52c538f
nested includeHtml inside pageOptions
2024-05-07 13:40:24 -03:00
Nicolas
f46bf19fa5
Nick:
2024-05-07 09:26:52 -07:00
rafaelsideguide
83f3408634
Added max depth option
2024-05-07 11:06:26 -03:00
Nicolas
6d5da358cc
Nick: cancel job
2024-05-06 17:16:43 -07:00
rafaelsideguide
509250c4ef
changed to includeHtml
2024-05-06 19:45:56 -03:00
rafaelsideguide
538355f1af
Added toMarkdown option
2024-05-06 11:36:44 -03:00
Nicolas
15b774e974
Update index.ts
2024-05-04 12:44:30 -07:00
Nicolas
2aa09a3000
Nick: partial docs working, cleaner
2024-05-04 12:30:12 -07:00
Nicolas
00373228fa
Update index.ts
2024-05-04 11:53:16 -07:00
Nicolas
768166b066
Update single_url.ts
2024-04-30 16:57:44 -07:00
Nicolas
cbd9e88b77
Merge branch 'main' into llm-extraction
2024-04-30 14:49:20 -07:00
Nicolas
4f526cff92
Nick: cleanup
2024-04-30 12:19:43 -07:00
Caleb Peffer
3ca9e5153f
Caleb: trying to get logging working
2024-04-30 09:20:15 -07:00
rafaelsideguide
a095e1b63d
Resolve merge conflicts with main
2024-04-30 10:54:18 -03:00
rafaelsideguide
d3c36adaa7
Update index.ts
2024-04-29 17:58:47 -03:00
rafaelsideguide
f8b207793f
changed the request to do a HEAD to check for a PDF instead
2024-04-29 15:15:32 -03:00
Nicolas
b69feab916
Merge branch 'main' into llm-extraction
2024-04-29 08:40:44 -07:00
Caleb Peffer
2ad7a58eb7
Caleb: first test passing
2024-04-28 17:38:20 -07:00
Caleb Peffer
06497729e2
Caleb: got it to a testable state I believe
2024-04-28 15:52:09 -07:00
Caleb Peffer
6ee1f2d3bc
Caleb: initially pulled inspiration code from https://github.com/mishushakov/llm-scraper
2024-04-28 13:59:35 -07:00
Nicolas
68838c9e0d
Update single_url.ts
2024-04-28 12:44:00 -07:00
Nicolas
d8ee4e90d6
Update website_params.ts
2024-04-28 11:47:25 -07:00
Nicolas
8e44696c4d
Nick:
2024-04-28 11:34:25 -07:00
Nicolas
1dc6458c6a
Update crawler.ts
2024-04-27 11:17:10 -07:00
Nicolas
0f694e0608
Update crawler.ts
2024-04-27 11:14:52 -07:00
tractorjuice
a5d38039f2
Add additional file extensions to crawler.ts
...
Add additional file extensions.
2024-04-27 11:03:27 +01:00
rafaelsideguide
75597f72a1
[Feat] Added allowed urls
...
FireCrawl should be able to scrape LinkedIn Articles (/pulse/*)
2024-04-25 08:39:45 -03:00
Rafael Miller
f189589da4
Merge pull request #34 from mendableai/nsc/returnOnlyUrls
...
Implements the ability for the crawler to output all the links it found, without scraping
2024-04-24 10:34:42 -03:00
rafaelsideguide
942ac3b41c
Resolved merge conflicts between feat/added-anthropic-vision-api and main
2024-04-24 09:57:45 -03:00
Nicolas
8939ca570b
Merge branch 'main' into nsc/returnOnlyUrls
2024-04-23 18:05:48 -07:00
Nicolas
fdb2789eaa
Nick: added url as return param
2024-04-23 17:14:34 -07:00
Nicolas
734c76fc56
Merge branch 'main' into nsc/mvp-search
2024-04-23 17:04:31 -07:00
Nicolas
f0695c7123
Update single_url.ts
2024-04-23 17:04:10 -07:00
Nicolas
0146157876
Nick: mvp
2024-04-23 15:28:32 -07:00
rafaelsideguide
849c0b6ebf
[Feat] Added blocklist for social media urls
2024-04-23 18:50:35 -03:00
Nicolas
306cfe4ce1
Nick:
2024-04-23 11:15:11 -07:00
Nicolas
ddf9ff9c9a
Nick:
2024-04-20 11:46:06 -07:00
Nicolas
f1dd97af0f
Update index.ts
2024-04-19 15:37:27 -07:00
Nicolas
84cebf618b
Nick:
2024-04-19 15:36:00 -07:00
Nicolas
5b93799149
Nick: a bit faster
2024-04-19 15:13:17 -07:00
Nicolas
c5cb268b61
Update pdfProcessor.ts
2024-04-19 13:13:42 -07:00
Nicolas
43cfcec326
Nick: disabling in crawl and sitemap for now
2024-04-19 13:12:08 -07:00
Nicolas
140529c609
Nick: fixes pdfs not found
2024-04-19 13:05:21 -07:00
Ikko Eltociear Ashimine
9e9d66f7a3
refactor: fix typo in WebScraper/index.ts
...
breakign -> breaking
2024-04-20 02:27:53 +09:00
rafaelsideguide
72e1dadccd
adding option to replace all relative paths with absolute paths
2024-04-19 11:47:20 -03:00
rafaelsideguide
c4cc4b9262
fixing document response
2024-04-18 14:12:39 -03:00
Rafael Miller
704a059448
Update index.ts
2024-04-18 13:53:11 -03:00
rafaelsideguide
57e5b36014
[Feat] Adding pdf parser
2024-04-18 11:43:57 -03:00
Nicolas
ca2bf9cc12
Update single_url.ts
2024-04-17 18:27:08 -07:00
Nicolas
36abe0f7f9
Nick:
2024-04-17 18:24:46 -07:00
Nicolas
460763ba5f
Merge pull request #11 from mendableai/feat/parse-to-markdown-tables
...
[Feat] Added html to markdown table parser
2024-04-17 15:52:43 -04:00
Nicolas
52fb28bc1a
Update index.ts
2024-04-17 12:52:15 -07:00
Nicolas
de439f6529
Update index.ts
2024-04-17 12:51:29 -07:00
Nicolas
871d5d91b0
Update index.ts
2024-04-17 12:51:12 -07:00
Nicolas
08ed68ff55
Nick: fixes
2024-04-17 12:44:23 -07:00
rafaelsideguide
ee8a097252
adding unit tests and fixing the parse function
2024-04-17 15:56:01 -03:00
Nicolas
2eb81545fa
Update index.test.ts
2024-04-17 11:04:03 -07:00
rafaelsideguide
b375ce3e39
adding unit tests and bugfixing
2024-04-17 14:54:54 -03:00
Nicolas
db15724b0c
Update imageDescription.ts
2024-04-17 10:39:29 -07:00
Nicolas
27674a624d
Update index.ts
2024-04-17 10:39:00 -07:00
rafaelsideguide
ff622739b7
Added a html to markdown table parser
2024-04-17 11:01:19 -03:00
rafaelsideguide
ed5dc808c7
Update imageDescription.ts
2024-04-16 18:05:07 -03:00
rafaelsideguide
00941d94a4
Added anthropic vision to getImageDescription function
2024-04-16 18:03:48 -03:00