We're using AI agents for the orchestration of our fully automated web scrapers.
Instead of one large general-purpose agent that is hard to control and test, we use many smaller agents that each just pick the right strategy for a specific sub-task in our workflows. In our case, an agent is a medium-sized LLM prompt that has a) context and b) a set of functions available to call.
For example, we use them for:
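To make the "prompt + context + functions" idea concrete, here is a minimal sketch. It is not our actual implementation: the `Agent` class, the `choose` callable (a stand-in for the LLM call), and the tool names are all hypothetical and only illustrate how an agent can be restricted to picking one function from a fixed set.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Agent:
    """A small agent: a prompt, some context, and a fixed set of callable tools."""
    prompt: str
    tools: Dict[str, Callable[[Dict[str, Any]], Any]]
    context: Dict[str, Any] = field(default_factory=dict)

    def run(self, choose: Callable[[str, Dict[str, Any], List[str]], str]) -> Any:
        # `choose` stands in for the LLM call (an assumption): it gets the prompt,
        # the context, and the allowed tool names, and returns one tool name.
        name = choose(self.prompt, self.context, sorted(self.tools))
        if name not in self.tools:
            # Tight constraint: anything outside the allowed set is rejected.
            raise ValueError(f"agent picked unknown tool: {name!r}")
        return self.tools[name](self.context)


def pick_pagination(prompt: str, context: Dict[str, Any], tool_names: List[str]) -> str:
    # Hypothetical rule-based stand-in for the model, so the sketch runs offline.
    return "next_page" if context.get("has_next_link") else "scroll"


nav_agent = Agent(
    prompt="Pick how to reach more results on this page.",
    tools={
        "next_page": lambda ctx: "click next",
        "scroll": lambda ctx: "scroll down",
    },
    context={"has_next_link": True},
)
```

Calling `nav_agent.run(pick_pagination)` dispatches to the `next_page` tool. Because the agent can only return a name from its own tool set, each one is easy to test in isolation with a fake `choose` function.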
- Website Loading: Automate proxy and browser selection to load sites effectively. Start with the cheapest and simplest way of extracting data, which is fetching the site without any JS or actual browser. If that doesn't work, the agent tries to load the site with a browser and a simple proxy, and so on.
- Navigation: Detect navigation elements and handle actions like pagination or infinite scroll automatically.
- Network Analysis: Identify desired data within network calls.
- Validation: Hallucination checks and verification that the data is actually on the website and in the right format (this is mostly traditional code, though).
- Data transformation: Clean and map the data into the desired format. Small, performant fine-tuned LLMs handle this task with high reliability.
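As a rough illustration of two of the items above, the sketch below shows a cheapest-first loading ladder and a plain-code check that extracted values really appear on the page. Everything here is an assumption for illustration: the function names, the stub loaders (which return canned HTML instead of touching the network), and the substring-based check are not taken from our production system.

```python
from typing import Callable, Dict, List, Optional, Tuple

Loader = Callable[[str], Optional[str]]  # returns HTML, or None on failure


def load_with_escalation(
    url: str,
    ladder: List[Tuple[str, Loader]],
    looks_ok: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try loaders from cheapest to most expensive; return (strategy, html)."""
    for name, loader in ladder:
        try:
            html = loader(url)
        except Exception:
            html = None  # treat any loader error as "escalate to the next step"
        if html is not None and looks_ok(html):
            return name, html
    raise RuntimeError(f"all {len(ladder)} loading strategies failed for {url}")


def missing_values(record: Dict[str, object], page_text: str) -> List[str]:
    """Return the fields whose value does not appear in the page text."""
    def norm(text: str) -> str:
        # Collapse whitespace and ignore case before comparing.
        return " ".join(text.split()).lower()

    haystack = norm(page_text)
    return [key for key, value in record.items() if norm(str(value)) not in haystack]


# Hypothetical stub loaders so the sketch runs without network access.
def plain_http(url: str) -> Optional[str]:
    return "<html><body></body></html>"  # page is empty without JS


def browser_with_proxy(url: str) -> Optional[str]:
    return "<html><body><li>Red Lamp - $19.99</li></body></html>"


LADDER = [("plain_http", plain_http), ("browser_with_proxy", browser_with_proxy)]


def has_listing(html: str) -> bool:
    # Hypothetical success check: the page contains at least one list item.
    return "<li>" in html
```

With these stubs, `load_with_escalation("https://example.com/shop", LADDER, has_listing)` skips the empty plain-HTTP result and settles on `browser_with_proxy`. `missing_values` then flags any extracted field that cannot be found in the loaded page, which is one cheap way to catch hallucinated values before they reach the output.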
The main challenge:
We quickly realized that doing this for a few data sources with low complexity is one thing; doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.
Combining tightly constrained agents with traditional engineering methods is what solved this for us.
We're actively using this approach at scale, although still improving :) You can try out a simplified version of this in our playground: https://www.kadoa.com/add
Gave this a go. Just so happened that I had the page of an eBay seller open. Wondered if it could manage to do something as simple as extracting all 240 listed products on that page. Instead of determining that the most important data on this page would be the products, it identified these properties: categoryName, subCategories, link.
Yeah, I tried it with a type of website that I commonly write scrapers for, and I'm not sure I can do anything with these results.
AI + web scraping is hard. I've tried and given up, but that doesn't mean it's impossible; it just means I'm not a good engineer, so I will stay tuned to the Kadoa project.
Absolutely not knocking this project. Was just a somewhat unexpected result from such a simple site. Asked GPT-4 to write a scraper just to compare and it produced a quite usable boilerplate.
On a related note, I recently learned about the got-scraping module, which doesn't use Chromium or any browser, but it's good at mimicking a browser and executes JavaScript. I also wrote a module that parallelizes browserless.io / Playwright and makes it really cheap to use a cloud scraping solution.