I'm not sure what advantage a somewhat comprehensive framework like Langchain gives you for this use case?
It starts to feel as if AI tech is slowly turning into web tech, with a million tools and frameworks, so I'm just wondering whether all of these are needed and whether it isn't easier to roll your own than to learn a foreign framework...
Not off-topic at all. After struggling with LangChain's hyper-opinionated implementation of classes I agree.
In fact, this is better off leveraging Llamaindex. This is a proof-of-concept and ultimately leveraging a library / framework helps afford the following:
- easy implementation of chunking strategies when you're unsure
- OpenAI helper functions
- embeddings and vector store management
Again, even with the above I struggled and had to implement PGVector myself. Once I have my document retrieval strategy and prompt tuning optimized, I would never take Langchain into production, simply because of the bloat and inflexible implementation of things like the PGVector class. Also, the footprint is massive; the LLM part can be done with 5% of the footprint in Golang and at 5% of the cloud costs.
>Storing context in an easy-to-access format for prompt insertion.
>Dealing with prompt limitations (e.g. 4096 tokens for Davinci) when context is too big.
>Dealing with text splitting.
Not sure if it isn't easier to roll one's own for that...?
I know a thing or two about the math behind LLMs and all this software build around a few core ideas just seems to be a lot of overkill...
When you mentioned PGVector, did you refer to this repo, or is there a class within LangChain with the same name?
https://github.com/pgvector/pgvector
You’re almost certainly going to have to write your own splitting code for anything nontrivial. LlamaIndex breaks down hard when there’s a lot of markup in the document, for example. You’ll also want control over the vector search strategy (just using the query or chunk embedding may not be enough)
I agree. I mentioned in a thread below that these frameworks are useful for discovering the index-retrieval strategy that works best for your product.
FWIW, individual classes are generally tiny, so we found using langchain is fine and then for places we need to beef up (chunking, not calling 'eval', ...), we do our own class/subclass. That way we can align with community for broader pieces and patterns, and decrease technical risks from smaller fly-by-night repos.
At the same time, the underlying APIs are super simple, so just rolling your own entirely, with no framework, can make sense. We need to deal with businesses wanting to plug in their own APIs & models, so that happens to be less attractive to us.
That said, purpose built frameworks can be great. Our data agent has a headless tier and we are building it fine with langchain, and benefiting from the ecosystem there, but I can imagine someone with more specific needs enjoying rasa..
Splitting things is easy! Store the dense vectors of 512 characters or so and use an overlaid index of terms to set the context of the current conversation.
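For example, a minimal character-based splitter with overlap, so context isn't lost at chunk boundaries (a sketch, not any particular library's implementation):

```python
def chunk_text(text, size=512, overlap=64):
    """Split text into fixed-size character chunks, with each chunk
    repeating the last `overlap` characters of the previous one."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Each chunk then gets embedded and stored; the term index on top is where the real design work goes.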
Ignoring footprint and bloat, the big problem you identify is inflexible class design. I wonder why it happened? Is it hard for langchain to expose all the desired features of a tool like PGVector via its own class?
I sorta did this, feel free to check it out and let me know your thoughts!
On the main langchain post (in January) that got traction on Hacker News, I left this comment: https://news.ycombinator.com/item?id=34422917 . It still remains true: a "simpler langchain".
> To offer this code-style interface on top of LLMs, I made something similar to LangChain, but scoped what i made to only focus on the bare functional interface and the concept of a "prompt function", and leave the power of the "execution flow" up to the language interpreter itself (in this case python) so the user can make anything with it.
Core things it does: Uses jinja templates, does sync and async, and most importantly treats LLM completion endpoints as "function calls", which you can compose and build structures around just with simple python. I also combined it with fastapi so you can just serve up any templates you want directly as rest endpoints. It also offers callback hooks so you can log & trace execution graphs.
All together its only ~600 lines of python.
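To give a flavor of the "prompt function" idea, here's a minimal sketch (stdlib templating instead of jinja, and a stubbed completion callable; `prompt_function`, `echo`, and `summarize` are illustrative names, not the library's actual API):

```python
from string import Template

def prompt_function(template, complete):
    """Turn a prompt template into a plain Python function.
    `complete` is any callable that takes a prompt string and
    returns the model's completion (e.g. a wrapped API call)."""
    tmpl = Template(template)
    def call(**kwargs):
        return complete(tmpl.substitute(**kwargs))
    return call

# Stub standing in for a real completion endpoint:
echo = lambda prompt: f"<completion of: {prompt}>"
summarize = prompt_function("Summarize in one line: $text", echo)
```

Because the result is just a function, composition and control flow fall out of ordinary Python rather than a chain abstraction.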
I haven't had a chance to really push all the different examples out there, so I think it hasn't seen much adoption outside of those that give it a try.
I hope to get back to it sometime in the next week to introduce local-mode (eg. all the open source smaller models are now available, I want to make those first-class)
The use-cases and tooling around language models are very immature. So any framework you build now will either look like bloatware or remain close to just calling an API.
The dust around language models needs to settle a bit, for a useful framework to emerge from it.
For our own use-cases, I built a framework from scratch, and it was the best decision we made.
it makes no sense deploying any of these libraries to prod. as-is. best to understand a configuration / workflow / tuning / etc. that fits your data best and write it from scratch in golang/rust/whatever.
They are not all computationally expensive. The rate-limiting step here is the LLM call itself over the API, so async is definitely needed. The other aspect would be loading the template from the filesystem; I would assume this is something that needs to be optimized in the application.
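Concretely, since the API round-trip dominates latency, running calls concurrently with asyncio is the main win; a sketch with a stubbed API call (`call_llm` and `run_batch` are illustrative names):

```python
import asyncio

async def call_llm(prompt):
    """Stand-in for an async LLM API call; the network round-trip
    dominates, so requests should overlap rather than queue."""
    await asyncio.sleep(0.01)  # simulated API latency
    return f"answer to: {prompt}"

async def run_batch(prompts):
    # Fire all calls concurrently; gather preserves input order.
    return await asyncio.gather(*(call_llm(p) for p in prompts))

results = asyncio.run(run_batch(["q1", "q2", "q3"]))
```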
That's pretty wild; I've been setting things up like this for about 5 years with just BERT or my own fine-tuned encoder-only systems. It should be done for free, not millions... Can I get millions for running `ls` too?
LMQL (language model query language) is a different take on prompting, and I find it less restrictive and more intuitive. Langchain is to LMQL what Keras is to Tensorflow
Thanks for the link. I skimmed the docs and couldn't find a motivation section. Can you expand on how you find it less restrictive and more intuitive?
My first impression is that this is a paradigm mismatch and an 'API' masquerading as a "language". LMQApi? Looks fine, and we have all the necessary ports for (query, model, []constraints, ...).
So what's the language bit? It's the 'scripted prompt'. That's the only bit that is reasonably a 'language', but as a language it is all over the place. Semantics are rather wild, don't you agree?
    sample(temperature=0.8)
       "A list of things not to forget when going to the sea (not travelling): \n"
       backpack = []
       for i in range(5):
          "-[THING]"
          backpack.append(THING.strip())
       print(backpack)
    from
       'openai/text-ada-001'
    where
       STOPS_AT(THING, "\n")
This part reminds me of shell scripting (and what I hate about it). For example, what are the semantics of > "something quoted" < in this language? How about "THING" and THING? Is that a token, a variable, or both?
So, we really have an 'imperative' language part (the scripted prompt) and then a pretense at "declarative language" with the elaborated api call spelled out as a sqlish query.
p.s. I appreciate and laud the effort of the team which produced this. This is just feedback.
Yeah the semantics are very weird, but I guess “prompt engineering” is weird too, so it makes sense :) .
Everything between “sample” and “from” is basically a script that generates a prompt, which is incrementally fed to the LM.
Each line contained in double quotes will get appended to the prompt, using an f-string syntax, like normal LM templates. So if you have a local python variable “foo”, you can say “how do I make {foo}?” and it will substitute its value into the prompt (not interesting).
But things in square brackets are called “hole variables”, and do the opposite. If you follow up the previous with the line “you make it by [instructions]” , the prompt up to that point is passed to the LM, and the hole in the prompt is filled, and the result is stored in a local variable “instructions” which you can reference later on in the prompt, or in python script.
Any lines in between that don’t have double quotes are interpreted as python. So you can make program logic and LM calls conditional on the result of previous LM calls, or other results of some other process. So for example you could build a critique loop like the critique chain in the LC docs out of an actual while loop, where the while loop breaks when the LM determines the output is acceptable.
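For illustration, the control-flow part of such a critique loop is just ordinary Python around two LM calls, here stubbed out (all names hypothetical, not LMQL's or LangChain's API):

```python
def critique_loop(draft, improve, rate, max_rounds=3):
    """Refine `draft` until the model rates it acceptable.
    `improve` and `rate` stand in for LM calls."""
    for _ in range(max_rounds):
        if rate(draft) == "good enough":
            break
        draft = improve(draft)
    return draft

# Stubbed LM calls for illustration:
rate = lambda text: "good enough" if "v2" in text else "needs improvement"
improve = lambda text: text + " v2"
final = critique_loop("screenplay draft", improve, rate)
```

The point is that the loop's structure is visible at a glance, which is what the LMQL style preserves.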
The exact same thing is possible with LangChain already, but it would involve creating templates, instantiating chains, etc, which isn’t bad, but adds complexity. In LMQL syntax, you can glance at the program and plainly see what it does using your programming brain… “yeah this while loop breaks when the screenplay is good enough, and the refined version gets returned” whereas I think LC’s abstractions make something simple like this look complex.
The “where” clause is where you specify constraints, which allow you to limit what the value of a hole can be. In this case you could apply a “where” constraint to a hole variable [rating] that forces it to be either “good enough” or “needs improvement”, and nothing else can possibly be sampled from the token distribution. This makes pipelines a lot more efficient by eliminating the need for “correction chains” in a lot of places. Also, once the tokens “ne” or “go” have been generated, LMQL doesn’t have to request any more tokens because the result is already uniquely determined, and it can substitute the rest and move on.
The other thing that I love about LMQL is that everything is async. Last time I tried, maybe two months ago, making a LC chain asynchronous didn’t feel natural. In my use cases, chains were async more often than not and it was kind of annoying.
In fact under the hood, the LMQL query is compiled to a decorated async function. So at the end of the day, you can use any of your queries as simple async functions. If you want to make react Agents, or any other LM abstraction you like, you pretty much just have to stick a few @lmql.query decorated functions inside a class definition and you’re good to go. That’s what I meant by the Tensorflow/keras analogy.
LMQL still isn’t mature and there’s a lot on the roadmap. Prompting is a wild west, and altogether we haven’t even discovered a lot of the problems we will need to solve. I like to think the situation is like how I imagine operating systems and a lot of software in general looked before Bell labs. For now at least, I think of all the options, LMQL is closest to the golden path.
Let me know if you have any more questions, feel free to send an email!
Very informative, thank you. You make a strong case for it. Interesting how in QL the query plan is kinda spelled out in the select. I also appreciate the motivation of 'seeing the algorithm'; makes sense. Why SQL-ish approach?
> Prompting is a wild west
I am racking my brain trying to remember a continuation based language that made it to hn frontpage recently. Wondering if something like that isn't a better approach for prompting.
Was it crystal by chance? I meant to have a look at it but never bothered, if it is maybe I will.
As for the SQLish approach, I’m not sure, it just seems to fit. I think it came from the way that there’s a thing you are requesting, and you have constraints you want applied to it. I think it’s one of those things where the analogy to SQL gives us developers just enough of a toe hold on what we’re doing to produce something but ultimately I think it will start to look less like SQL.
well to be fair, when you're scaling it does matter. i would want my techlead or seniors to care and know when/where to make specific trade-offs bc cloud costs are not forgiving.
i think that's where folks that make those comments are coming from.
- Reduce bloat, make packages optional e.g. pip install langchain[all]
- Reduce opinionated implementation of vector stores, I want my own schema
- Don't unnaturally force the chain abstraction
- Invest more in document retrieval
Unfortunately, there were not a whole lot of end-to-end examples of integrating Rasa with OpenAI, nor functional boilerplates on GitHub, so I put a working prototype together in a few days, and thus RasaGPT was born.
RasaGPT is a Python-based boilerplate and reference implementation of Rasa and Telegram utilizing an LLM library like Langchain for indexing, retrieval and context injection. FastAPI endpoints are made available for you to build your application on top of. Features include:
- Automated hand-off to human if queries are out of bounds
- "Training" pipeline done via API
- Multi-tenant support
- Generate category labels from questions
- Works right out of the box with docker-compose
- Ngrok reverse tunnel and dummy data included
- Multiple use cases and a great starting point
I haven't worked with Rasa so I was wondering if I understood things correctly.
Are you using a language model to look up the correct reply to a particular response inside Rasa? Where Rasa presumably connects to some kind of backend to retrieve information or 'do stuff'?
this implementation leverages Rasa and stands up a FastAPI server that receives the user-response webhook first; the message then gets processed by (or bypasses) Rasa.
The LLM queries a set of documents indexed by Langchain. Dummy data has been included (Pepe Corp.)
Rasa has support for a "fallback" mechanism whereby if a user's response scores low on your pre-configured Rasa intents (like Greet) you can have it route directly to the LLM as well. But for now, RasaGPT captures and routes the Telegram response to the FastAPI webhook endpoint.
the LLM itself, with the prompts I configured, provides a boolean on whether the response should be escalated to a human, based on LLM+Langchain not knowing the answer to the user's query from the indexed documents.
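For what it's worth, the parsing side of such an escalation flag can be very simple: have the prompt instruct the model to emit a marker when it can't answer from the indexed documents, then check for it. A sketch (the NO_ANSWER marker is illustrative, not RasaGPT's actual convention):

```python
def should_escalate(answer: str) -> bool:
    """Hand off to a human when the model signals it couldn't answer
    from the documents, or returns nothing usable."""
    return "NO_ANSWER" in answer or not answer.strip()
```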
I hope that answers your question, if not happy to follow-up!
A bit off topic, but you'd better change the name and remove the GPT.
OpenAI is claiming AI products that are using GPT in their name are causing confusion and is sending legal threats now.
They say "mark" and not "trademark" in this takedown notice probably for a reason - because they can't call it a trademark and it's not enforceable?
If somebody creates ClosedAI company with product called ChatLLM, does it mean they can start sending takedown notices to everybody left and right who is using "LLM" in their name?
> If somebody creates ClosedAI company with product called ChatLLM, does it mean they can start sending takedown notices to everybody left and right who is using "LLM" in their name?
Those things are nothing alike.
GPT is a very specific family of models, all created by OpenAI. The copy-cats came after OpenAI released those models, specifically to point out similarities with the OpenAI-created models.
LLM is a generic term and cannot be trademarked.
You can trademark "Coca Cola", but you can't trademark "Cola" or "Soda".
No, they are not all created by ClosedAI; e.g. GPT-J and GPT-NeoX were created by EleutherAI.
If LLM is pizza, then GPT is, e.g., thin-crust pizza.
ps. it's funny how quickly they went from open AI research to prosecuting projects for using the GPT acronym, while pumping money and doing exclusive deals with a mega corp - the very thing they were supposed to protect us from.
According to Wikipedia, "GPT" originated with OpenAI: "The concept and first such model were introduced in 2018 by the American artificial intelligence organization OpenAI"
It itself is not a GPT. It is a framework-on-framework project built on top of Rasa (https://github.com/RasaHQ/rasa) and Langchain, which by default uses gpt-3.5-turbo (change it in the .env file) or any foundation model you wish.
Sorry for the off topic question, but does anyone know how to buy consumer hardware optimal for running emerging open source chat models with the largest parameter chat models possible?
Would it be more cost effective to try to buy an absurd amount of ram and run on the cpu?
Or buy an Nvidia card with the biggest capacity available?
Or maybe buy a Mac with the most memory you can get?
ARM-based Macs are the easiest way to get an acceptable performance without the headaches right now, if you can afford the price.
Install https://github.com/oobabooga/text-generation-webui, update pytorch and llamacpp-python, and you should be able to run pretty much all models out there, in all formats, both on GPU and CPU. CPU on a Mac gives you the fastest speed, but you should pass the correct --threads argument (investigate how many performance cores you've got). GPU is slower, but more energy efficient. https://github.com/mlc-ai/mlc-llm gives me way better GPU performance compared to oobabooga, but they only support a couple of models right now - it's worth following their progress though.
If you're after the raw performance, I suggest using GGML models (meant for llama.cpp, but it's bundled in textgen, so you can use it there with the convenience of a web ui). q4_0 is the fastest quantization, while the q5_1 is the best quality right now.
If the GGML is not available, you can generate it quite easily from the safetensors yourself (note that you need enough RAM to load the model in pytorch though).
With 16GB RAM you can run any 13B model, as long as it's quantized to 4/5 bits. 32GB RAM allows you to run 30/33B models, and 64GB RAM, 65B models. 30B and 65B models are way more useful for real world tasks, but they are more expensive to train, so there aren't as many to choose from compared to 7B/13B. 7B and anything less is a toy in my opinion, while 13B is good enough for experimentation and prototyping.
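As a rough sanity check, those sizings follow from bits-per-weight arithmetic; a back-of-the-envelope sketch (the 20% overhead factor is my guess for context and runtime buffers, not a measured number):

```python
def approx_model_ram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Ballpark memory for a quantized model: parameter count times
    bits per weight, plus ~20% for KV cache and runtime buffers."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 2**30

# A 13B model at 4-bit comes out well under 16GB:
print(round(approx_model_ram_gb(13, 4), 1))
```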
Thanks, that makes sense and helps a lot. I have a 16gb m1 that I got llama 13b running on. It works really well but I really want to run bigger models, so your examples of ram -> model size are super helpful.
I’ll probably just end up getting a higher capacity Mac in the next few years. Right now 96gb configurations seem to be around 4k, if that comes down a bit in the future i’ll probably pick something up.
I’m not really looking to train myself so training cost isn’t an issue for me personally, I just want to be able to run the best of what the open source community comes up with (or contribute to a pool to train models, if that becomes a thing).
Just a heads up - GPUs are a looooot faster than CPUs depending on what models you're running, especially if you were looking at running the image models.
Admittedly I'm not sure how well they work if you stream/batch to the GPU (say 96GB of system ram + 24GB GPU).
I've heard used nvidia workstation cards are reasonably cheap for >24GB VRAM.
A 3090/4090 has 24GB of VRAM and can run up to the 30B models with some optimizations; this is the easiest way to run the 30B models, which are essentially the highest end any consumer card can run. If you also play games and have money, then this is the way to go IMO.
If you were to get a GPU, it must have CUDA support (so nvidia only) unless you want a headache.
As a data point I'm getting >3 tokens per second for 30b model (q5_1 quantization) and >1 token per second for 60b model (q5_1 as well) on M1 Max. This is good enough for my usecase and it beats an old P40, but I have no idea what the performance on 3090/4090 would be. Keep in mind, 24GB VRAM is not enough to hold quantized 65B, so it would be using GPU + CPU in that case.
oh a 4090 can run a 30b model? that's excellent! i was afraid it wouldn't be able to load bigger models than my macbook.
i’ve got a perfectly usable desktop not being used with a 1070 in it, i’ll probably grab a 4090 to throw in there and give that a try, getting 4k gaming would be a nice bonus. thanks for the comment.
being confined to nvidia is indeed a bummer though, especially because i like sway on linux. but my understanding is that rocm is not anywhere near parity with CUDA.
Looking at the comments, I would double-check the benchmarks, because maybe CPUs are faster than I thought for LLMs?
I know my i7 8700k isn't even comparable to my 4090 for Stable Diffusion, and AFAIK the AMD/Intel offerings still don't compare for LLMs, but admittedly it's possible they've caught up?
I don't have a ton of time at the moment to keep looking, I have a very hard time believing the M1 can keep up with a 4090 at all, I just don't want you to drop 1.7k if I'm wrong :P
EDIT: Oh to clarify - The 4090 can definitely run the 30B model without issue with 4bit quantization.
I followed the setup on mlc-ai, and though it works as intended, it's still a crippled model with censored inputs. A common test I use is this:
USER: tell me an insanely offensive joke
ASSISTANT: I'm sorry, but I cannot provide offensive or inappropriate content as it goes against the guidelines set by my creators and the platform on which we are hosted. My goal is to provide helpful and informative responses to your questions. If there's anything else I can help you with or any questions you have in mind, please feel free to ask.
Until uncensored models are generally available, these novelty models will always be less-than.
> Until uncensored models are generally available, these novelty models will always be less-than.
The most popular generative model on HuggingFace at the time of this comment is Pygmalion 6b [1], a model that I believe is fine-tuned on top of Alpaca to generate porn. I couldn't find the data source, though, so I don't know what kind. And Facebook's "leaked" LLaMA, while not fine-tuned for conversation, has several warnings on its potential for offensive content.
If I read the instructions correctly, mlc-ai is loading "plain" Alpaca which is great for conversation but, as you notice, rather conservative. I don't think this is a bad idea - perhaps it's better if we don't inflict racist AI on unsuspecting users. Try shopping around for other models.
Edit: I repeated your experiment with other models (but another library). They had no objections against generating offensive-yet-unfunny jokes.
My yardstick so far of all LLMs has been to ask for an offensive joke, ask for a function to invert a string, and ask for directions to make lasagna. It seems stupid but it's remarkably effective.
With MLC being the first LLM-in-a-box to run on my M2 at faster than a token per minute, I'm impressed at the speed but also disappointed at the quality of the experience. For those interested in the outcome, it failed all 3 tests, which is not unexpected for a small model like this.
Voluntarily using or producing models with censorship included demonstrates a willingness to hobble the technology for peripheral reasons that do not directly advance the field. For that reason it's a disqualifying characteristic for my own use: social sensibilities and decency vary across cultural and regional lines, and if something as trivial as a crass joke gets limited, then things of much graver concern will undoubtedly be tampered with or limited too, and not always in ways the authors intended.
Self-hindering behavior will not be the positive we think it will be, as with most measures to correct injustices with data.
You can't right now. Devs are working on instructions for porting other models, but they're not ready yet. The point of MLC is that it supports pretty much all GPU backends out there (including Intel and Mac). The bundled model is just a proof of concept.
Well, for running the average model as-is, without spending a few days figuring out why you're getting strange errors and can't get it working, you more or less need CUDA support.
As much VRAM as you can get is probably also a good idea.
For reference, I can seemingly run Vicuna-7B (I think the 4-bit version) on my 6GB 1660 Ti at roughly 1.5 tokens per second. Way too slow for anything useful, so you can imagine what CPU inference would look like.
I get 3 tokens per second on an M1 Max running 30B models, compared to 1 token per second on a GPU (P40), both quantized to 4-bit. So, in my opinion, CPUs are better for inference (at least fast CPUs with DDR5 versus the cheapest GPUs).
The reason GPUs are the de facto standard is that they scale better, are more power efficient, and are better supported by pytorch & co. Also, academia cares more about getting the best quality for their benchmarks than about performance and accessibility.
I don't know why no one has mentioned this yet: the $180 Nvidia Tesla P40 24GB is about as capable as a 4090 for running LLMs (~70% of the token throughput for 8x cheaper). You can even run two or more in SLI to run 65B or larger models.
Just search eBay for Nvidia P40. Be sure to add an aftermarket cooling fan ($15 on eBay), as the P40 does not come with its own.
The P40 is a LOT faster than an ARM Mac, and a lot cheaper.
(Note: Do not go older than a P40. Pascal or newer is required to run 4-bit quantized models. For example, the $100 M40 24GB is effectively only 6GB, as it must run models in 16-bit.)
Can you provide sources for this claim? If true, how? What is it that the 4090 has that the P40 doesn't to justify the price?
I understand that 4090 is aimed at gaming and has a lot of extra bells and whistles like the RTX cores. But it is also consumer electronics and much cheaper than the enterprise GPU lines for the same power.
According to this 4090 already has double the raw flop performance of the V100 and is competitive with the most powerful GPUs in the market from last year.
The claim of 70% of a 4090 is very strange; my 4090 runs a 30B at roughly 25 tokens/second, compared to the 1 token/second claimed by the P40 user here: https://news.ycombinator.com/item?id=35861360
i was looking into an nvidia k80 before (so thanks for including your comment about needing pascal or greater) but i had a couple of concerns about the power connectors and pcie lanes/speed.
i read that data center gpus need specialized power adapters, and i didn't find good resources to see if it would be able to hook up to a consumer grade power supply or what adapters i would need
i think my tomahawk b450's pcie 3.0 x16 would suffice, but i'm not 100% sure if there would be bandwidth issues when running an nvme ssd alongside it
driver-wise i think i would be fine, i'm not sure if datacenter drivers are typically included in what's provided by linux distros but i'm sure i could make it work if not
and yeah i would definitely grab a cooler and probably undervolt and/or run it at a slightly lower clock speed to be safe
Tangentially, it's interesting seeing an open source project like this actually spin up a domain name, contact email, and some branding (the image in the Readme), for a project the author said was created in just a few days.
I wonder what the objective is for that extra polish. If it's optimizing star count growth, how much do these touches help?
that's a somewhat cynical interpretation. what if i just care about aesthetics and want to raise the bar.
my primary motivation was to get users of Rasa out of a directional hole bc that's where i was.
of course i like stars. it's a video game and i like winning. it was actually created in a few days all by me. no ulterior motive, literally indexing a solution to my problem from ~a week ago.
my bg is eng + product so i do these things as reflex and have a love for good UX.
We are all crafters, and I admire the attention to these presentation points of OP. This is clear in the documentation and explanations at the repository too.
That said, I don't think the questioning of GP was malicious, just a natural curiosity. Yes, a little suspicious, but, well, we are in the internet after all. In the least, it's good to point when someone does the extra work to make a great presentation.
Sorry, I didn't mean to imply any nefarious ulterior motives here.
I'm more just intellectually curious about the dynamics of Github and marketing on it these days, whether it's for attracting contributors to non-commercial OSS projects or more commercial objectives where rapid growth leads to userbase, funding, etc.
The project looks quite interesting and I agree we need a way to bridge the gap between traditional bot creation frameworks and the more LLM-centric approaches of late.
my objective falls into neither bucket. i want rasa users to find it so i optimize for search (GH tags, clear description), ease of use (video, addt'l MD files) and perception (logos) but i'll be honest, for my intention it has a diminishing rate of return.
at minimum i find canonical README sections like quick start, installation, how it works is necessary if you want to be helpful. helpfulness is difficult to measure outside of inbound emails thanking you / forks w/ actual commits.
hope that gives some kind of insight. just make everything awesome :)
The sad truth is that it's the average README that's hilariously bad, not that you did something strange or wrong. As you say - whenever something is trying to be useful to anyone other than the author, the author should in turn try to be helpful. Unfortunately, programmers always had a love-hate relationship with documentation (love it when you need it and others wrote it for you, hate the guts of it if it's you who need to write it), but I think it got progressively worse in the last decade.
At work, I have to often seriously fight for there to even be a README. Lack of docs and docstrings, lack of meaningful comments in the code, utter lack of visualization was the norm in the 90s, then it got better for a while, and now we've done a full circle and are back with undocumented spaghetti everywhere.
It's really strange, and I don't understand why it's like this. I tell people who nominally are way past being juniors to read their code before making a PR - to see how easy to understand it is - and they look like they just got enlightened. Like, isn't this (reading your own code) the most basic of all ways of working with code? Same for READMEs, I tell them to put all the information needed for a new person to set the project up, and am met with blank stares - why would they, programmers, bother with writing down plain English and managing the information surrounding what they do? Have these guys never thought about what the "I" in "IT" means?
Sorry, that's a possibly unwarranted rant, but when I see posts like GP's that seem to assume that writing a helpful README is somehow strange and a waste of effort unless it translates into clicks, it just blows my mind, it a pretty negative way.
you're not alone my dude. i have a similar challenge w/ my engineers. my best lead is an artisan and is proud of what he authors, and like me sees his code as part of the product UX / funnel (1%). with the others (99%) i have to get a bit draconian, or simply create company templates they must adhere to or PRs get rejected and they hear from me in their 1:1s.
you either love it or you don't. and if you don't, follow the rules like a big boy or get called out.
ultimately you have to set a culture for it, even if it is like pulling teeth, because net-net it impacts the PnL.
I really appreciate your work on this - we have been building on Rasa and looking into alternative ways to use new LLM models. The question is: do we even need Rasa anymore? Does it make sense, given that RasaX is behind a very expensive paywall?
I’m making a similar “chatbot backend” kind of thing right now, meant to be configurable and versatile, while avoiding dependencies. It’ll be compatible with stuff like llama index and pinecone but doesn’t impose it. I’ll have a minimal foundation ready this week, I can send it your way when it’s up if you’d like
the next best platform I could find for my friend I was helping was google's dialog flow. again, it was managed, closed-source, opinionated, and not as flexible. and most importantly, its design considerations were for a pre-LLM world.
i personally think there is an acute opportunity for creating a bare bones rasa built with LLMs in mind. the core concepts behind rasa are useful (domains, intents, actions, etc.) but the underlying NLU technology and assumptions around the platform are obsolete so 70% of the footprint is unnecessary.
It allows you to set up "Input Channels", e.g. Slack and Telegram, and has an intents-and-responses pipeline.
It leverages pre-LLM NLU models (NLTK, BERT, etc.) to score intents, and based on the winning intent it serves a pre-configured response.
My implementation lets you route directly (or fall back) to GPT-3 or GPT-4 via Langchain document retrieval. So essentially this is an example of a knowledge-base customer support bot.
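The score-then-fallback routing described above can be boiled down to a few lines of plain Python. This is a hypothetical sketch, not the Rasa or Langchain APIs: the crude `SequenceMatcher` similarity stands in for a real NLU model, and the `"llm-fallback"` marker stands in for the actual handoff to GPT + document retrieval.

```python
# Sketch of the routing idea: score the message against known intents,
# and fall back to an LLM-backed answer when confidence is too low.
from difflib import SequenceMatcher

# Toy intent catalog with example utterances (stand-in for trained NLU).
INTENTS = {
    "greet": ["hello", "hi there", "good morning"],
    "goodbye": ["bye", "see you later"],
}

CONFIDENCE_THRESHOLD = 0.75

def score_intent(message: str) -> tuple[str, float]:
    """Return the best-matching intent and a crude similarity score."""
    best_intent, best_score = "fallback", 0.0
    for intent, examples in INTENTS.items():
        for example in examples:
            score = SequenceMatcher(None, message.lower(), example).ratio()
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score

def route(message: str) -> str:
    """Serve a canned response on high confidence, else hand off to the LLM."""
    intent, score = score_intent(message)
    if score >= CONFIDENCE_THRESHOLD:
        return f"canned-response:{intent}"
    # Low confidence: this is where the GPT + document-retrieval path kicks in.
    return "llm-fallback"

print(route("hello"))                      # matches the greet intent
print(route("how do I reset my router?"))  # falls through to the LLM
```

The point is just that the "obsolete 70% of the footprint" argument holds: the routing core is tiny, and everything interesting happens in the fallback branch.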
Can you actually build a reliable customer-facing chatbot on top of LLMs? With the "jailbreaking", and not knowing whether it's actually using the data you're supplying it or other data it was trained on, and so on.
in short, and as mentioned in the README.md, this is absolutely vulnerable to prompt injection. I think this is not a fully solved issue, but some interesting community research has been done to help address these things in production
I'm not sure it solves the problem of restricting the information it uses though. For example, as a proof of concept for a customer, I tried providing information from a vector database as context, but GPT would still answer questions that were not provided in that context. It would base its answers on information that was already crawled from the customer website and in the model. That is concerning because the website might get updated but you can't update the model yourself (among other reasons).
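One common (and imperfect) mitigation for this is to wrap the retrieved chunks in a prompt that instructs the model to answer only from the supplied context and to refuse otherwise. A minimal, hypothetical prompt builder, nothing here is a LangChain API, and the refusal instruction reduces but does not eliminate answers drawn from stale training data:

```python
# Build a prompt that constrains the model to the retrieved context.
def build_grounded_prompt(question: str, context_chunks: list[str]) -> str:
    # Join retrieved chunks with a visible separator so chunk boundaries
    # stay distinguishable inside the prompt.
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, reply exactly: "
        '"I don\'t have that information."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of purchase."],
)
print(prompt)
```

Even with this, models sometimes blend in pretrained knowledge, which is exactly the crawled-website problem described above, so the instruction is a guardrail, not a guarantee.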
i agree. perhaps you're confused about the intent. the only flag being planted is for folks using rasa looking for a reference implementation, just like i was a week ago. not sure if you're being intentionally cynical, but trying is a good thing. why? bc most ppl don't try. you make 0 of the shots you never take. and of course, if you're not intentionally being cynical -- gucci. if you are, i encourage you to make your next comment substantial or encouraging :)
only Google Bard could tell me (ChatGPT 3.5's knowledge cutoff date is too old). They are apparently trying to match the capabilities of ChatGPT plugins
>what is Langchain?
LangChain is a framework for developing applications powered by large language models (LLMs). It provides a standard interface for connecting LLMs to other sources of data, as well as for persisting state between calls of a chain/agent. LangChain also provides a collection of memory implementations, and examples of chains/agents that use memory.
Here are some of the features of LangChain:
Data-aware: LangChain allows you to connect your LLM to other sources of data, such as databases, APIs, and file systems. This makes it possible to build applications that are more powerful and differentiated than those that simply call out to an LLM via an API.
Agentic: LangChain allows you to build agents that use LLMs to make decisions and take actions. This makes it possible to build applications that can interact with their environment in a more sophisticated way.
Chains: LangChain allows you to chain together multiple calls to an LLM, or to other utilities. This makes it possible to build more complex applications that can perform multiple tasks.
LangChain is a powerful tool for building applications powered by LLMs. It is easy to use and provides a wide range of features. If you are interested in building an application that uses an LLM, I encourage you to check out LangChain.
>is langchain trying to extend the LLM with plugins, similar to what is possible with ChatGPT?
Yes, LangChain is trying to extend the LLM with plugins, similar to what is possible with ChatGPT. This will allow users to add new functionality to their LLMs without having to modify the underlying code.
For example, a user could create a plugin that adds the ability to translate text, or to generate different creative text formats. This would allow the user to use their LLM for a wider range of tasks.
LangChain is still under development, but the team is working on making it easy for users to create and deploy plugins. They are also working on adding new features to the plugin ecosystem, such as the ability to share plugins and to rate plugins.
I believe that LangChain has the potential to be a powerful tool for developers who want to build applications powered by LLMs. The ability to extend the LLM with plugins will make it even more powerful and versatile.
it solves how to integrate LLMs (via Langchain) into an application API pipeline with Rasa... something I could not find an out-of-the-box public example of on GitHub. and so here we are :)
It worked just fine for the Stable Diffusion community, where automatic1111 gives a ton of credit to 4chan for the development of Stable Diffusion tooling