Merged
Binary file added en/.gitbook/assets/oxylabs_document_loader.png
1 change: 1 addition & 0 deletions en/SUMMARY.md
@@ -147,6 +147,7 @@
* [Microsoft Powerpoint](integrations/langchain/document-loaders/microsoft-powerpoint.md)
* [Microsoft Word](integrations/langchain/document-loaders/microsoft-word.md)
* [Notion](integrations/langchain/document-loaders/notion.md)
* [Oxylabs](integrations/langchain/document-loaders/oxylabs.md)
* [PDF Files](integrations/langchain/document-loaders/pdf-file.md)
* [Plain Text](integrations/langchain/document-loaders/plain-text.md)
* [Playwright Web Scraper](integrations/langchain/document-loaders/playwright-web-scraper.md)
42 changes: 42 additions & 0 deletions en/integrations/langchain/document-loaders/oxylabs.md
@@ -0,0 +1,42 @@
---
description: Get data from any website with Oxylabs.
---

# Oxylabs Document Loaders

Oxylabs is a web scraping service that retrieves public web data at scale, with tools designed to navigate regional restrictions.

<figure><img src="../../../.gitbook/assets/oxylabs_document_loader.png" alt="" width="260"><figcaption><p>Oxylabs Document Loader Node</p></figcaption></figure>


### Features
- Retrieve data from Google, Amazon and any other website
- Set geolocation
- Use browser (JavaScript) rendering
- Parse the retrieved data
- Specify User Agent types
- Process content with text splitters

### Required Parameters
- **Connect Credential**: Oxylabs API credentials
- **Query**: Search query or URL
- **Source**: One of the available sources:
  - Universal - scrape any website
  - Google Search - scrape Google Search results
  - Amazon Product - scrape Amazon Product information
  - Amazon Search - scrape Amazon Search results

### Optional Parameters
- **Geolocation**: Sets the proxy's geolocation used to retrieve the data. See the [documentation](https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FiwDdoZGfMbUe5cRL2417%2Fuploads%2FxoQb19qSyodB2D4no0DZ%2FList%20of%20supported%20geo_location%20values_sapi.json?alt=media&token=d2e2df7b-10ba-4399-a547-0c4a99e62293) for the supported values.
- **Render**: Enables JavaScript rendering when set to true.
- **Parse**: Returns parsed data when set to true, as long as a dedicated parser exists for the submitted URL's page type.
- **User Agent Type**: Device type and browser to present in the request's User-Agent.
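
To illustrate how the required and optional parameters might combine, here is a minimal TypeScript sketch that builds a request body from the node's inputs. Everything in it is an assumption made for illustration: the `LoaderParams` type, the `buildPayload` helper, the snake_case field names (`geo_location`, `user_agent_type`), the source identifiers, and the choice to send a URL for the Universal source and a query for the others. It is not the node's actual implementation and it makes no network call.

```typescript
// Hypothetical source identifiers standing in for the four Source options.
type OxylabsSource =
  | "universal"
  | "google_search"
  | "amazon_product"
  | "amazon_search";

// Hypothetical shape mirroring the node's parameters.
interface LoaderParams {
  query: string; // Query: search query or URL (required)
  source: OxylabsSource; // Source (required)
  geolocation?: string; // Geolocation (optional)
  render?: boolean; // Render (optional)
  parse?: boolean; // Parse (optional)
  userAgentType?: string; // User Agent Type (optional)
}

// Builds a plain request body; optional fields are added only when set.
function buildPayload(params: LoaderParams): Record<string, unknown> {
  if (params.query.trim() === "") {
    throw new Error("Query is required");
  }
  const payload: Record<string, unknown> = { source: params.source };

  // Assumption: Universal expects a URL, the other sources expect a query.
  if (params.source === "universal") {
    payload["url"] = params.query;
  } else {
    payload["query"] = params.query;
  }

  if (params.geolocation) payload["geo_location"] = params.geolocation;
  // Assumption: JavaScript rendering is requested with the value "html".
  if (params.render) payload["render"] = "html";
  if (params.parse) payload["parse"] = true;
  if (params.userAgentType) payload["user_agent_type"] = params.userAgentType;

  return payload;
}
```

For example, `buildPayload({ query: "https://example.com", source: "universal", render: true })` would produce a body with `source`, `url`, and `render` set, and no optional fields that were left unset.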

### Outputs
- **Document**: Array of document objects containing metadata and pageContent
- **Text**: Concatenated string from pageContent of documents
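
The relationship between the two outputs can be sketched as follows. This is a rough illustration only: the `LoadedDocument` interface and `toText` helper are hypothetical names, and the newline separator is an assumption, since the separator actually used by the node is not stated here.

```typescript
// Hypothetical shape of one entry in the Document output.
interface LoadedDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Derives a Text-style output by concatenating pageContent values.
// The default separator is an assumption made for this sketch.
function toText(docs: LoadedDocument[], separator: string = "\n"): string {
  return docs.map((doc) => doc.pageContent).join(separator);
}
```

The Document output suits downstream nodes that need per-document metadata, while the Text output is convenient when a single string is enough.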


## Document Structure
Each document contains:
- **pageContent**: Extracted page content