Show HN: Crul – Query Any Webpage or API (crul.com)
241 points by portInit on Feb 28, 2023 | 79 comments
Hi HN, we’re Carl and Nic, the creators of crul (https://www.crul.com), and we’ve been hard at work for the last year and a half building our dream of turning the web into a dataset. In a nutshell, crul is a tool for querying and building web and API data feeds from anywhere to anywhere.

With crul you can crawl and transform web pages into CSV tables, explore and dynamically query APIs, filter and organize data, and push data sets to third party data lakes and analytics tools. Here’s a demo video - we’ve been told Nic sounds like John Mayer (lol): https://www.crul.com/demo-video

We’ve personally struggled wrangling data from the web using puppeteer/playwright/selenium, jq or cobbling together python scripts, client libraries, and schedulers to consume APIs. The reality is that shit is hard, doesn’t scale (classic blocking for-loop or async saturation), and comes with thorny maintenance/security issues. The tools we love to hate.

Crul’s value prop is simple: Query any Webpage or API for free.

At its core, crul is based on the foundational linked nature of Web/API content. It consists of a purpose built map/expand/reduce engine for hierarchical Web/API content (kind of like postman but with a membership to Gold's Gym) with a familiar parser expression grammar that naturally gets the job done (and layered caching to make it quick to fix when it doesn’t on the first try). There’s a boatload of other features like domain policies, scheduler, checkpoints, templates, REST API, Web UI, vault, OAuth for third parties and 20+ stores to send your data to.

Our goal is to open source crul as time and resources permit. At the end of the day it’s just the two of us trying to figure things out as we go! We’re just getting started.

Crul is one bad mother#^@%*& and the web is finally yours!

Download crul for free as a macOS desktop application or as a Docker image (https://www.crul.com) and let us know if you love it or hate it (https://forms.gle/5BXb5bLC1D5QG7i99). And come say hello to us on our Slack channel - we’re a friendly bunch! (https://crulinc.slack.com/)

Nic and Carl (https://www.crul.com/early-days)



Hey, just watched the video. This looks super useful! I'm the founder of WunderGraph (https://wundergraph.com) and we allow our users to easily integrate multiple data sources into a virtual graph, which they can then access using GraphQL. You can add various data sources, like GraphQL, Federation, OpenAPI, Databases, etc... I was just thinking, wouldn't it be cool if we could find an easy way to add a "Crul" datasource? If you're interested, please DM me in our discord (https://wundergraph.com/discord). I'd love to have a conversation!


Will absolutely reach out! Our experience has been that just getting data is often really challenging, so we've really focused on that piece, and being able to easily share with destinations that are purpose built for analytics, viz, etc.

Thanks for checking it out!


I'll make this same offer for Splitgraph :) If you feel like writing a Postgres FDW then we can add it to the engine on the backend, so that anyone with a Postgres client could connect to postgres://data.splitgraph.com:5432 and SELECT from a table backed by crul (either "mounted" for live querying, and/or ingested once/periodically for subsequent querying). The user just needs to provide parameters for the table; it's up to the FDW how to interpret those parameters.

It would take some thinking and planning, and it's possibly not even a good idea ;) But generally any "data source" is packageable as an FDW as long as you can model it in such a way that you can reasonably implement certain functions for operations like table scans. For most FDWs, this is easy and the tradeoff of a large query is usually limited to excess bandwidth and latency while the query executor reads the result from the FDW. But with a live source pointing to a crawler instance, a table scan could in the worst case mean waiting for the crawler to parse the responses to hundreds of rate-limited network requests. So it's probably better to ingest the data once (and/or periodically) for a particular crul "table" (whatever you decide that means) rather than to query it live.

Fortunately, you can still write an FDW as the adapter layer, because Splitgraph ingests data on a schedule by querying the FDW of the live data source (while tolerating a long-running query). Alternatively (or additionally) you could write an Airbyte adapter which we also support, but only for ingestion - if you want live queryable tables then an FDW is necessary.
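For readers unfamiliar with the FDW adapter layer being described, here is a minimal Python sketch in the style of Multicorn (the Postgres FDW framework for Python). Everything crul-specific is hypothetical - the `query` table option, the canned CSV output - and a stub base class is included so the sketch runs outside a Postgres/Multicorn environment; only the `execute(quals, columns)` contract comes from Multicorn:

```python
import csv
import io

try:
    from multicorn import ForeignDataWrapper
except ImportError:
    # Stub so the sketch runs standalone, outside Postgres/Multicorn.
    class ForeignDataWrapper:
        def __init__(self, options, columns):
            self.options, self.columns = options, columns


class CrulFDW(ForeignDataWrapper):
    """Hypothetical FDW exposing a crul query's CSV output as a table."""

    def __init__(self, options, columns):
        super().__init__(options, columns)
        # A crul query string would be passed as a table-level option.
        self.query = options.get("query", "")

    def _run_query(self):
        # A real adapter would call crul's REST API here and return its
        # CSV output; canned data keeps the sketch self-contained.
        return "url,title\nhttps://example.com,Example\n"

    def execute(self, quals, columns):
        # Multicorn calls execute() on every table scan; each yielded
        # dict becomes one row visible to the Postgres query executor.
        reader = csv.DictReader(io.StringIO(self._run_query()))
        for row in reader:
            yield {col: row.get(col) for col in columns}
```

As the comment above explains, an `execute()` that triggers a live crawl would make every table scan block on rate-limited network requests, which is why ingesting on a schedule is the safer default.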

We've been interested in adding something like this (think Apify + Postgres) for a while. If done well it could be really cool. Let me know if you want to talk about it: miles@splitgraph.com


How did you integrate with all those services (the third party service APIs, not asking about the DBs)? I've seen a bunch of sites do this and I'm curious if there's some open source library I should be using, so that I don't need to write from scratch each time I'd like to integrate with another service.

Edit: just saw "airbyte" in a bunch of places, which I assume answers my question. So updated question: airbyte works well for ya?


We hand-wrote a number of integrations; sometimes it was as simple as reusing a schema with slightly different values. We are also using the awesome https://www.benthos.dev/!


Thanks for the info, Benthos has not been on my radar, will check it out.


We'll have to see how close our current postgres integration is! Would like to understand more and will reach out.


Awesome, looking forward to it.

I just got to the section on destinations ("Stores"). Very cool. If you're building an Enterprise plan where you manage the infrastructure for your customers, we can deploy a dedicated white-labeled deployment of Splitgraph Cloud to any of the three major clouds. Perhaps you could find a use for that in your backend infrastructure.

ETA: Also, if you can write to Postgres, you can write to the Splitgraph DDN [0] with most DML and DDL statements, including INSERT and CREATE TABLE. So even without any FDW, you might be able to add Splitgraph as a "destination" for your users (who would for the most part just need to provide an API keypair).

[0] https://www.splitgraph.com/docs/add-data/from-ddn


I just started playing with CRUL recently to try to map out different media download links on various websites, in an effort to avoid clicking around looking for hidden content. The language is very robust, but just the `filter` command alone is crazy powerful for just exploring things quickly and intuitively.


Thanks! The find (https://www.crul.com/docs/commands/find) command works really well if you are trying to construct a filter expression and just want to quickly look for results containing a particular string so you can see the defining attributes/column+row values.

There's a short writeup of this pattern here: https://www.crul.com/docs/examples/how-to-find-filters


How does crul handle the dynamic nature of the web?

Yes, content changes, but so does structure. If I'm interested in content that shows up in a news feed div, and that div is renamed or moved as part of a site redesign, what happens?

I've worked on a bunch of tools in the past that do similar things, and structural changes were the kryptonite for all of them.

A secondary problem is when you use particular content as a reference point, and that content is later updated. Now your reference point is gone!


At this point we're considering it a foundational concept to build around - web content changes, so our best option currently is to make the query as easy as possible to change, and alert when things break.

We have done some preliminary work on AI and other pattern recognition to handle structural changes better, but there's still lots of work to do.

But the expanding and querying concepts also make a lot of sense with APIs, which tend to be a little more stable.


IMO, this is the hardest part of maintaining a web scraper. We had ~100 scripts to scrape ~1000 clients' sites and it was, at minimum, 50 hours a week to keep up with changes.

The second hardest part was 30% of our clients all used the same hosting provider, which would start to fail at 10-20 req/s. We had to throttle the sites by IP, cluster-wide.


This makes sense and I am curious about this. Was there consistency between those 1k client sites or were they all rather different? Mind if I reach out?


I just signed up and the email confirmation ended up at http://links.outseta.com/ls/click?upn=xKo-2FU5fxLX67yddEnUva... where I get a page that says "Malware and Phishing This site is blocked because it is a known security threat. Please contact your network administrator to gain access."

I can't tell if the message is from Brave or from the ISP though.


I think this is an issue with Outseta, the membership management platform that CRUL is using. If I remember correctly, I had issues earlier and manually set it to allow. I know a few other founder friends using Outseta, so I'm OK with it.


Just tried on mobile hotspot, same message. And on Firefox, same message.


That's unfortunate. A quick look suggests ISP but maybe not with the hotspot. Will do a little more digging.


Thanks, once I got back to my home network it worked (same browser). So definitely looks ISP-related.


Please share some examples of webpages with data to be wrangled that support this statement:

"The reality is that shit is hard, doesn't scale (classic blocking for-loop or async saturation), and comes with thorny maintenance/security issues."

Every web user's needs are different. One person might have a task that they struggle to accomplish while another might have one that presents no major challenges. As a web user, I transform web pages to CSV or SQL. I log HTTP and network requests. I do this for free using open source software. No web browser needed. No docker image needed. Works on both Linux and BSD.

For me, the web is a dataset from which I retrieve data/information. "Tech" companies want the web to be more like a video game, with visuals and constant interactivity.


Thanks for this, we're still trying to figure out these details ourselves.

Related to the quote, we've seen interest in API data wrangling, where prepackaged data feeds can be cumbersome to edit, or where other implementation details become challenging, like credential management, domain throttling, scheduling, checkpointing, export, etc.

It's also interesting for webpage data when you need to use a particular page as an index of links to filter and crawl. We've tried to build an abstraction layer around that.

Initially, we were mainly focused on webpages: we wanted to bypass the visuals of the browser and use a headless browser to fulfill network requests, render JS, etc., then convert the page to a flat table of enriched elements. With APIs as another data set there is some work for us to do around language.

We're now trying to figure out which workflows are most relevant for crul to optimize around. Honestly, we just built what we thought was cool. Some features/workflows will certainly be more straightforward with existing tools and software, especially for a technically savvy user.


Do you have an easy way to transform e.g. a typical Amazon product page into nicely structured data? Not that it's trying to be very video-gamey.


What should the structure look like? If it is CSV, what are the columns, i.e., what specific data does it need to include?

Taking a quick look at the Amazon site these product pages appear to be enormous in size. Interestingly, the website requires a "viewport-width" header. Otherwise one gets directed to a CAPTCHA.

The product page I checked already has some structured data in the form of JSON, including keys such as

   "title":"xxxxxxxxxx"
   "displayPrice":"$000.00"
   "priceAmount":000.00
   "currencySymbol":"$"
   "integerValue":"000"
   "decimalSeparator":"."
   "fractionalValue":"00"
   "symbolPosition":"left"
   "asin": "xxxxxxxxx"
   "asin":"xxxxxxxxxx"
   "acAsin":"xxxxxxxxxx"
   "buyingOptionTypes":["NEW"]
   "productAsin":"xxxxxxxxxx"
   "mediaAsin":"xxxxxxxxxxx"
   "parentAsin":"xxxxxxxxx"
   "asinList":"xxxxxxxxxx"
Thus, CSV with product name, price and ASIN would appear to be easy. No need to mess with the HTML.
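The "no need to mess with the HTML" route above can be sketched in a few lines of stdlib Python. This is only an illustration: the regex approach is one of several, and the key names are the ones quoted in the comment above, not a guaranteed Amazon schema:

```python
import csv
import io
import re

# Pull simple "key":"value" pairs out of JSON blobs embedded in a page's
# HTML, then write the chosen fields as one CSV row. The key names here
# are the ones quoted above; a real page may differ.
PAIR_RE = re.compile(r'"(title|displayPrice|asin)"\s*:\s*"([^"]*)"')

def page_to_csv_row(html, fields=("title", "displayPrice", "asin")):
    found = {}
    for key, value in PAIR_RE.findall(html):
        found.setdefault(key, value)  # keep the first occurrence of each key
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(fields)
    writer.writerow([found.get(f, "") for f in fields])
    return buf.getvalue()
```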

Other data, such as delivery time, seller, where the item ships from and number left in stock, can be extracted from the HTML.

Delivery time is in a <span> that contains "data-csa-c-delivery-time".

Seller, shipping info and number left in stock are under a <span> with class="a-size-base _p13n-desktop-sims-fbt_fbt-desktop_shipping-info-show-box__17yWM"

One needs to decide what data one wants from the page.

The way to present an example on which to evaluate a "new" solution such as the one in this thread is to present a problem, e.g.,

Get data items x, y and z from website xyzexample.com.

In the majority of cases I see submitted to HN, it is impossible to benchmark these "new" solutions against existing ones because no example websites are ever provided.


The output may be a collection of CSV files, or a JSON file with nicely structured data, because the page certainly has a pretty visible structure, with various data blocks.


I played around with this trying it on a few tricky cases. At least for the initial step of getting the data in a tidy format, it wasn't immediately obvious to me how crul could speed up my workflow.

1. Many JSON endpoints (especially ones not meant for public consumption) return a somewhat deeply nested list, and somewhere in that list are the individual items. Can your tool speed up (or automagically solve) getting just the items? I found that it just gives me thousands of columns (and truncates the results), where I would have wanted, say, two thousand rows of 23 columns. I couldn't wrangle the JSON within crul.

2. Many smaller sites still use wordpress, and may manually lay out items in a visual hierarchy that isn't immediately obvious in the structure of the HTML. Then you have to go in and parse every "row" and every "column" using xpath or css selectors. Crul wasn't particularly helpful for this either.

3. A scraping workflow requiring authentication, post request for searching, get request for picking the correct result, post request for downloading pdf and parsing said pdf... well, I couldn't get this to work at all.

Hope you'll end up solving such problems automagically and I end up paying you for it.
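The "thousands of columns" problem in point 1 usually comes from flattening the wrapper object instead of the list of records inside it. A small stdlib sketch of the distinction - the record path (here a made-up `["data", "results"]`) is something you discover by inspecting the payload, and the function names are illustrative:

```python
def flatten(record, prefix=""):
    """Flatten one nested dict into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

def extract_records(payload, record_path):
    """Walk down to the list of items, then flatten each item to a row."""
    node = payload
    for key in record_path:
        node = node[key]
    return [flatten(item) for item in node]
```

Calling `extract_records(resp, ["data", "results"])` on a nested response yields one row per item (N rows of M columns) instead of one giant row whose columns enumerate every item.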


Appreciate you taking the time to play around with crul and share your thoughts! They're incredibly valuable.

1. Although it has limitations, were you able to try the normalize command? https://www.crul.com/docs/queryconcepts/api-normalization

2. Will need to think about this some more.

3. Although we don't yet handle PDFs, the rest of the flow is one we are aiming to accomplish with crul. Other than PDF, the pieces should be there, and we would love to understand this further.


1: It does expand a level of hierarchy if you already know what you're looking for (from manually getting the data). Is there a way to omit columns that have more levels or keep them as list columns?

2: Probably too niche to be worth it for you :)

3: Parsing pdfs automagically isn't easy, but handling downloads and images and storing them in a bucket would go quite far (I guess there's a way to do that, but I didn't immediately see a "simple" example).

It sounds like it's potentially a great tool, the only open question is really, is it worth studying the docs and implementing a process in crul as opposed to whatever language I'm familiar with already?


Doesn't WordPress have a built in API?


If you have a link to some high quality Wordpress API hacks/dorks for scraping, I'm all ears. I think my problem pages are usually made in some sort of page builder, like Elementor, and the content is a static soup of HTML like it came straight out of FrontPage2003.


Really elegant work guys. You've identified the pain points well. I'd be curious how you've structured your infrastructure to scale though. But I imagine that's secret sauce.

What are some of your experiences with async networking, btw? Did you find that it didn't really handle concurrency well in practice? Or were there other issues? I've written a lot of async networking code but always found it horrible to profile.


Thank you and your comment really warms our hearts. It's been fun building in the "cave" but comes with self doubt.

We've built using a microservice architecture to allow us to scale out the parts that need to scale, mainly the workers, which interact with a queue, although we'll need to move off Node.js in some parts for some perf gains and a smaller footprint. All those microservices are consolidated for the desktop variants.

Network concurrency is mostly throttled by our domain policy manager (named "gonogo" - lol) at 1 req per outbound domain per second. It's a little slow as a default, but it's configurable and provides a nice guardrail for API request limits, etc. Overall, async networking has been quite tricky, especially with retries, and we're still iterating on it. Agreed on the profiling difficulties.
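The per-domain guardrail described above can be pictured as a tiny limiter that blocks only when the same domain is hit again too soon. This is a generic sketch of the idea, not crul's actual "gonogo" implementation:

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Block until at least `interval` seconds have passed since the
    last request to the same domain; other domains are unaffected."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self.last_hit = {}  # domain -> monotonic timestamp of last request

    def wait(self, url):
        domain = urlparse(url).netloc
        now = time.monotonic()
        earliest = self.last_hit.get(domain, float("-inf")) + self.interval
        if now < earliest:
            time.sleep(earliest - now)
        self.last_hit[domain] = time.monotonic()
```

Calling `throttle.wait(url)` before each fetch caps every domain at one request per `interval` without slowing down fetches that fan out across many domains.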


Ah, I see it's pronounced more like "Krull" than "cruel".


lol - yeah. https://www.imdb.com/title/tt0085811/?ref_=tt_urv

We've been thinking of ways to make the pronunciation a bit clearer, maybe a mascot or something. Open to ideas!


how about a millipede or something else that crawls?


crab?


Could use a cruller as a logo.


Congrats team! Getting data from and to places is a fundamental Computer Science challenge (or in marketing terms, ELT) that involves mostly data interoperability, but also things like rate limits, transformations, horizontal and vertical scaling, schema definitions, incremental syncs, and at-least-once delivery guarantees. Those are rarely things that can be automatically defined; they need user configuration. So I'm really curious how you solve it under the hood, or does it only work for a subset of problems?

Asking all those as we are working on CloudQuery (https://github.com/cloudquery/cloudquery) so been dealing with a lot of those underlying challenges as well.


Looks interesting. Customers of our data wrangling tool (Easy Data Transform) are asking to be able to pull data out of REST APIs, but we don't currently support this. So it could be an interesting way to bridge the gap.


Would love to chat and try a few use cases together! At first glance I see you can drag in csv files, which could be generated by crul and either manually downloaded or scheduled and written to the filesystem.

Crul is a really easy way of populating tools that need data, whether it's just a one time thing for a demo/static data set or a scheduled data feed, so this kind of usage makes sense to us.


That might be interesting. I will try to find some time to have a play with Crul. Do you support pulling data from an API on a schedule?


Yes that is possible, although we currently have the scheduler set as an enterprise feature. We should look into a free trial, I enjoyed the flow of the EasyDataTransform installation with the free trial option.


>I enjoyed the flow of the EasyDataTransform installation with the free trial option

We're taking a different approach to our enterprisey competitors. ;0)

>we currently have the scheduler set as an enterprise feature.

Understandable.


I was able to sign up successfully, but the download file is not set to public access (or something like that) on your S3.

Screenshot: https://www.dropbox.com/s/i254xxbdenmqde1/screenshot%202023-...


Thanks for the heads up. Could you share what browser you are using? And are you downloading from https://www.crul.com/account?


Yes, trying to download from inside the Crul account after I successfully signed in. I'm using Safari 16.3 on macOS 13.2.1.


I'm having trouble replicating. Although I did get a prompt from Safari to allow downloads from crul.com. The screenshot looks like it is redirecting from the button, is that what you are seeing - rather than the button triggering a download?


This looks very cool. Congrats on the launch!

Have you thought about open sourcing it? I realize you want to sustain a business, but an open core model with commercial features would allow you to do that. A free tool like this would be a welcome addition to the ecosystem, most of which is FOSS already, and would help build a community around your product.


We certainly have, it is admittedly quite new to both of us, so we have been exploring the best way to introduce something like this, as well as tying up technical bits. The open core model with commercial features is certainly appealing.

Open to any perspectives on this.


I've been involved with successful commercial projects using the open core model. Feel free to reach out over email if you have any questions. My contact info is in the profile.


Do you have an example of how to turn an html table element into a CSV? I saw the open and the scrape commands, but wasn't sure where to go from there.


Ah! We didn't quite get an html table command in this release but it will be in the next one.

Here's a query that shows an option, but the table command will be far more straightforward.

open https://www.w3schools.com/html/html_tables.asp --dimension || filter "(nodeName == 'TD')" || groupBy boundingClientRect.top || table _group.0.innerText _group.1.innerText _group.2.innerText
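Until a table command ships, the same extraction can also be done outside crul with the Python stdlib alone. A minimal sketch, assuming a simple table with no nested tables or colspans:

```python
import csv
import io
from html.parser import HTMLParser

class TableToCSV(HTMLParser):
    """Collect the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], [], []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell, self.cell = True, []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
            self.row.append("".join(self.cell).strip())
        elif tag == "tr" and self.row:
            self.rows.append(self.row)

    def handle_data(self, data):
        if self.in_cell:
            self.cell.append(data)

def table_html_to_csv(html):
    parser = TableToCSV()
    parser.feed(html)
    buf = io.StringIO()
    csv.writer(buf).writerows(parser.rows)
    return buf.getvalue()
```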


That's a pretty slick tool you guys built!


Thank you! Really means a lot to us


This is very cool! Does the Docker image contain everything, or does part of it rely on server side things hosted by crul?


Thank you! The Docker image has everything for functionality; there's just an update check that pings our end to check for updates.


How do you plan on tackling anti-bot blocking?


It's a tricky question. Part of it is looking at APIs as the main source of data for scheduled queries/data feeds.

Crul sort of operates as a text only browser when interacting with a single page at a time, but when you expand and open up multiple tabs it becomes a little more challenging. We have the concept of domain policies which allow you to control how quickly/slowly you access something. There are also some puppeteer level options that could be relevant, even a headful toggle.

We have not invested too much time into this yet as we focused on getting the core functionality working. We think there are use cases (particularly with APIs) that don't run into this problem, but if it comes up more often we'll come up with some options.


In my experience bot detection is moving more towards looking at network activity and IP reputations. Using a proxy will go a long way, it's easy to implement, and the cost can easily be passed on to the customer.


In my experience, the thing that makes me actually have to lug out my ol' headless browser is that a lot of websites are starting to implement obfuscated cryptographic puzzles in their JS, making it really difficult to emulate without just running it in a browser.


Starting to? That's been going on for at least the last 3-4 years. Akamai tends to rely on that more than Cloudflare and in my recent experience Akamai is winning that game and browser emulation alone, headless or not, isn't going to bypass Akamai. Recently I have seen browser emulation not be effective at all for bypassing bot detection.

The only kind of emulation where I have seen success is mobile and in that case you need to run a device emulator.


In my experience it's starting to move towards "use a headless browser with patched attributes" or nothing else will work.

edit: I have quite a bit of experience with Akamai and other vendors =)


You shouldn't


> status: complete (28.284 seconds / 14 results)

Why was it so slow? Is there a default delay between requests or something?


Yeah the default domain throttle policy is 1 req per second per domain. Configurable through domain policies https://www.crul.com/docs/features/domain-policies - although currently an enterprise feature.

We found that it becomes too easy to break API request limits or spam a website otherwise.

However if you rerun that query it should load pretty instantly due to the caching layers, so the actual querying/filtering of the data part is smoother/faster.


> due to the caching layers

Every time I see that, the "2 hardest things" springs to mind. Is there a clear-caches option, or I guess the opposite question: does that process honor the HTTP caching semantics? Scrapy actually has a bunch of configurable knobs for that (use RFC2616 Policy ( https://docs.scrapy.org/en/2.8/topics/downloader-middleware.... ), write your own policy, or a ton of other stuff: https://docs.scrapy.org/en/2.8/topics/downloader-middleware.... )


Agreed, caching does come with its own set of quirks and mind-numbing bugs, crul does have a caching override flag at the command/stage level which alleviates some of this: https://www.crul.com/docs/queryconcepts/common-flags#--cache
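A per-stage cache-override flag of the kind linked above is easy to picture as a thin TTL cache wrapped around the fetch layer. This is a generic sketch, not crul's implementation; the `fetch` callable is a stand-in for whatever actually makes the request:

```python
import time

class TTLCache:
    """Cache fetch results per URL; `refresh=True` mimics a per-stage
    cache-override flag by bypassing and overwriting the cached entry."""

    def __init__(self, fetch, ttl=300.0):
        self.fetch, self.ttl = fetch, ttl
        self.entries = {}  # url -> (stored_at, body)

    def get(self, url, refresh=False):
        entry = self.entries.get(url)
        fresh = entry is not None and (time.monotonic() - entry[0]) < self.ttl
        if fresh and not refresh:
            return entry[1]
        body = self.fetch(url)
        self.entries[url] = (time.monotonic(), body)
        return body
```

A design like Scrapy's goes further by honoring the server's own HTTP caching headers; a TTL plus an override flag is the simpler end of that spectrum.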

The links you provided are interesting and something for us to think about some more. Honestly, I would be quite interested in hearing more about your experiences.


> although currently an enterprise feature.

Wait, so we literally can't go faster than 1 req/s unless we pay?

I have to say I'm pretty disappointed :/


If you attach to the running docker container, these defaults appear to be defined in /crul/dist/crul-docker/packages/startup/.env

Don't spam APIs. That said, if you're determined to do so, there's not much this or any other tool can do to stop you from trying.


Yeah exactly. Living up to your username :) nice find! Note this is a global default, unlike domain policies which are associated with a domain.


Sorry to hear that - we do need to think about this. It's our first pass at product tiers and features and we may need to adjust.

Scheduling and Domain Policies were the main features we chose to gate initially as they don't affect core functionality other than performance and deployment.


Can it get the text of a web page given just a URL, like how the Pocket app saves the text of a link?


You would need to do a bit of filtering to get the exact text you need. For example, to get the text for this post you could run:

open https://news.ycombinator.com/item?id=34970917 || filter "(attributes.class == 'toptext')" || table innerText


This requires me to know the HTML structure in advance. I want to do it on any page. Mozilla has Readability.js which does this; I wanted to know if this tool has the same feature. BTW it's a great tool.


Thank you! The query below would get you the full page text, although it likely won't be too legible. I'll read up on Readability.js some; it looks quite magic and could possibly be added in.

open https://news.ycombinator.com/item?id=34970917 || filter "(nodeName == 'BODY')" || table innerText

Often you can use crul to discover the HTML structure in the results table, with a `find "text string"` to filter rows and then a filter on the column values that identify the desired elements.


Man, this looks cool, but the name's closeness to "curl" irritates me...


Minor nitpick, the download says M1, but it prompts me to install Rosetta.


Thank you! Will make this more clear.


The video made me laugh. All this... and more?! :)


looks like it is closed source:

https://github.com/crul-oss



