I'm not sure what advantage a somewhat comprehensive framework like Langchain gives you for this use case?
It starts to feel as if AI tech is slowly turning into web tech, with a million tools and frameworks, so I'm just wondering whether all of these are needed and whether it isn't easier to roll your own than to learn a foreign framework...
Not off-topic at all. After struggling with LangChain's hyper-opinionated implementation of classes I agree.
In fact, this is better off leveraging Llamaindex. This is a proof-of-concept and ultimately leveraging a library / framework helps afford the following:
- easy implementation of chunking strategies when you're unsure
- OpenAI helper functions
- embeddings and vector store management
Again, even with the above I struggled and had to implement PGVector myself. Once I have my document retrieval strategy and prompt tuning optimized, I would never take Langchain into production, simply because of the bloat and inflexible implementation of things like the PGVector class. Also, the footprint is massive; the LLM part can be done with 5% of the footprint in Golang and at 5% of the cloud costs.
>Storing context in an easy-to-access format for prompt insertion.
>Dealing with prompt limitations (e.g. 4096 tokens for Davinci) when context is too big.
>Dealing with text splitting.
Not sure if it isn't easier to roll one's own for that...?
I know a thing or two about the math behind LLMs and all this software build around a few core ideas just seems to be a lot of overkill...
When you mentioned PGVector, did you refer to this repo, or is there a class within LangChain with the same name?
https://github.com/pgvector/pgvector
You’re almost certainly going to have to write your own splitting code for anything nontrivial. LlamaIndex breaks down hard when there’s a lot of markup in the document, for example. You’ll also want control over the vector search strategy (just using the query or chunk embedding may not be enough)
I agree. I mentioned in a thread below that these frameworks are useful for discovering the index-retrieval strategy that works best for your product.
FWIW, individual classes are generally tiny, so we found using langchain is fine and then for places we need to beef up (chunking, not calling 'eval', ...), we do our own class/subclass. That way we can align with community for broader pieces and patterns, and decrease technical risks from smaller fly-by-night repos.
At the same time, the underlying APIs are super simple, so just rolling your own entirely, with no framework, can make sense. We need to deal with businesses wanting to plug in their own APIs & models, so that happens to be less attractive to us.
That said, purpose built frameworks can be great. Our data agent has a headless tier and we are building it fine with langchain, and benefiting from the ecosystem there, but I can imagine someone with more specific needs enjoying rasa..
Splitting things is easy! Store the dense vectors of 512 characters or so and use an overlaid index of terms to set the context of the current conversation.
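For example, a minimal character-based splitter with overlap, so context isn't lost at chunk boundaries (a sketch, not any particular library's implementation):

```python
def chunk_text(text, size=512, overlap=64):
    """Split text into fixed-size character chunks, with each chunk
    repeating the last `overlap` characters of the previous one."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Each chunk then gets embedded and stored; the term index on top is where the real design work goes.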
Ignoring footprint and bloat, the big problem you identify is inflexible class design. I wonder why it happened? Is it hard for langchain to expose all the desired features of a tool like PGVector via its own class?
I sorta did this, feel free to check it out and let me know your thoughts!
On the main langchain post (in January) that got traction on Hacker News, I left this comment: https://news.ycombinator.com/item?id=34422917 . It still remains true: a "simpler langchain".
> To offer this code-style interface on top of LLMs, I made something similar to LangChain, but scoped what i made to only focus on the bare functional interface and the concept of a "prompt function", and leave the power of the "execution flow" up to the language interpreter itself (in this case python) so the user can make anything with it.
Core things it does: Uses jinja templates, does sync and async, and most importantly treats LLM completion endpoints as "function calls", which you can compose and build structures around just with simple python. I also combined it with fastapi so you can just serve up any templates you want directly as rest endpoints. It also offers callback hooks so you can log & trace execution graphs.
All together its only ~600 lines of python.
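To give a flavor of the "prompt function" idea, here's a minimal sketch (stdlib templating instead of jinja, and a stubbed completion callable; `prompt_function`, `echo`, and `summarize` are illustrative names, not the library's actual API):

```python
from string import Template

def prompt_function(template, complete):
    """Turn a prompt template into a plain Python function.
    `complete` is any callable that takes a prompt string and
    returns the model's completion (e.g. a wrapped API call)."""
    tmpl = Template(template)
    def call(**kwargs):
        return complete(tmpl.substitute(**kwargs))
    return call

# Stub standing in for a real completion endpoint:
echo = lambda prompt: f"<completion of: {prompt}>"
summarize = prompt_function("Summarize in one line: $text", echo)
```

Because the result is just a function, composition and control flow fall out of ordinary Python rather than a chain abstraction.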
I haven't had a chance to really push all the different examples out there, so I think it hasn't seen much adoption outside of those that give it a try.
I hope to get back to it sometime in the next week to introduce local-mode (eg. all the open source smaller models are now available, I want to make those first-class)
The use-cases and tooling around language models are very immature. So any framework you build now will either look like bloatware or remain close to just calling an API.
The dust around language models needs to settle a bit, for a useful framework to emerge from it.
For our own use-cases, I built a framework from scratch, and it was the best decision we made.
it makes no sense deploying any of these libraries to prod. as-is. best to understand a configuration / workflow / tuning / etc. that fits your data best and write it from scratch in golang/rust/whatever.
They are not all computationally expensive. The rate-limiting step here is the LLM call itself over the API, so async is definitely needed. The other aspect would be loading the template from the filesystem; I would assume this is something that needs to be optimized in the application.
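Concretely, since the API round-trip dominates latency, running calls concurrently with asyncio is the main win; a sketch with a stubbed API call (`call_llm` and `run_batch` are illustrative names):

```python
import asyncio

async def call_llm(prompt):
    """Stand-in for an async LLM API call; the network round-trip
    dominates, so requests should overlap rather than queue."""
    await asyncio.sleep(0.01)  # simulated API latency
    return f"answer to: {prompt}"

async def run_batch(prompts):
    # Fire all calls concurrently; gather preserves input order.
    return await asyncio.gather(*(call_llm(p) for p in prompts))

results = asyncio.run(run_batch(["q1", "q2", "q3"]))
```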
That's pretty wild; I've been setting things up like this for about 5 years with just BERT or my own fine-tuned encoder-only systems. It should be done for free, not millions... Can I get millions for running `ls` too?
LMQL (language model query language) is a different take on prompting, and I find it less restrictive and more intuitive. Langchain is to LMQL what Keras is to Tensorflow
Thanks for the link. I skimmed the docs and couldn't find a motivation section. Can you expand on how you find it less restrictive and more intuitive?
My first impression is that this is a paradigm mismatch and an 'API' masquerading as a "language". LMQApi? Looks fine, and we have all the necessary ports for (query, model, []constraints, ...).
So what's the language bit? It's the 'scripted prompt'. That's the only bit that is reasonably a 'language', but as a language it is all over the place. Semantics are rather wild, don't you agree?
    sample(temperature=0.8)
       "A list of things not to forget when going to the sea (not travelling): \n"
       backpack = []
       for i in range(5):
          "-[THING]"
          backpack.append(THING.strip())
       print(backpack)
    from
       'openai/text-ada-001'
    where
       STOPS_AT(THING, "\n")
This part reminds me of shell scripting (and what I hate about it). For example, what are the semantics of > "something quoted" < in this language? How about "THING" and THING? Is that a token, a variable, or both?
So, we really have an 'imperative' language part (the scripted prompt) and then a pretense at "declarative language" with the elaborated api call spelled out as a sqlish query.
p.s. I appreciate and laud the effort of the team which produced this. This is just feedback.
Yeah the semantics are very weird, but I guess “prompt engineering” is weird too, so it makes sense :) .
Everything between “sample” and “from” is basically a script that generates a prompt, which is incrementally fed to the LM.
Each line contained in double quotes will get appended to the prompt, using an f-string syntax, like normal LM templates. So if you have a local python variable “foo”, you can say “how do I make {foo}?” and it will substitute its value into the prompt (not interesting).
But things in square brackets are called “hole variables”, and do the opposite. If you follow up the previous with the line “you make it by [instructions]” , the prompt up to that point is passed to the LM, and the hole in the prompt is filled, and the result is stored in a local variable “instructions” which you can reference later on in the prompt, or in python script.
Any lines in between that don’t have double quotes are interpreted as python. So you can make program logic and LM calls conditional on the result of previous LM calls, or other results of some other process. So for example you could build a critique loop like the critique chain in the LC docs out of an actual while loop, where the while loop breaks when the LM determines the output is acceptable.
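For illustration, the control-flow part of such a critique loop is just ordinary Python around two LM calls, here stubbed out (all names hypothetical, not LMQL's or LangChain's API):

```python
def critique_loop(draft, improve, rate, max_rounds=3):
    """Refine `draft` until the model rates it acceptable.
    `improve` and `rate` stand in for LM calls."""
    for _ in range(max_rounds):
        if rate(draft) == "good enough":
            break
        draft = improve(draft)
    return draft

# Stubbed LM calls for illustration:
rate = lambda text: "good enough" if "v2" in text else "needs improvement"
improve = lambda text: text + " v2"
final = critique_loop("screenplay draft", improve, rate)
```

The point is that the loop's structure is visible at a glance, which is what the LMQL style preserves.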
The exact same thing is possible with LangChain already, but it would involve creating templates, instantiating chains, etc, which isn’t bad, but adds complexity. In LMQL syntax, you can glance at the program and plainly see what it does using your programming brain… “yeah this while loop breaks when the screenplay is good enough, and the refined version gets returned” whereas I think LC’s abstractions make something simple like this look complex.
The “where” clause is where you specify constraints, which allow you to limit what the value of a hole can be. In this case you could apply a “where” constraint to a hole variable [rating] that forces it to be either “good enough” or “needs improvement”, and nothing else can possibly be sampled from the token distribution. This makes pipelines a lot more efficient by eliminating the need for “correction chains” in a lot of places. Also, once the tokens “ne” or “go” have been generated, LMQL doesn’t have to request any more tokens because the result is already uniquely determined, and it can substitute the rest and move on.
The other thing that I love about LMQL is that everything is async. Last time I tried, maybe two months ago, making a LC chain asynchronous didn’t feel natural. In my use cases, chains were async more often than not and it was kind of annoying.
In fact under the hood, the LMQL query is compiled to a decorated async function. So at the end of the day, you can use any of your queries as simple async functions. If you want to make react Agents, or any other LM abstraction you like, you pretty much just have to stick a few @lmql.query decorated functions inside a class definition and you’re good to go. That’s what I meant by the Tensorflow/keras analogy.
LMQL still isn’t mature and there’s a lot on the roadmap. Prompting is a wild west, and altogether we haven’t even discovered a lot of the problems we will need to solve. I like to think the situation is like how I imagine operating systems and a lot of software in general looked before Bell labs. For now at least, I think of all the options, LMQL is closest to the golden path.
Let me know if you have any more questions, feel free to send an email!
Very informative, thank you. You make a strong case for it. Interesting how in QL the query plan is kinda spelled out in the select. I also appreciate the motivation of 'seeing the algorithm'; makes sense. Why SQL-ish approach?
> Prompting is a wild west
I am racking my brain trying to remember a continuation based language that made it to hn frontpage recently. Wondering if something like that isn't a better approach for prompting.
Was it crystal by chance? I meant to have a look at it but never bothered, if it is maybe I will.
As for the SQLish approach, I’m not sure, it just seems to fit. I think it came from the way that there’s a thing you are requesting, and you have constraints you want applied to it. I think it’s one of those things where the analogy to SQL gives us developers just enough of a toe hold on what we’re doing to produce something but ultimately I think it will start to look less like SQL.
well to be fair, when you're scaling it does matter. i would want my techlead or seniors to care and know when/where to make specific trade-offs bc cloud costs are not forgiving.
i think that's where folks that make those comments are coming from.
- Reduce bloat, make packages optional e.g. pip install langchain[all]
- Reduce opinionated implementation of vector stores, I want my own schema
- Don't unnaturally force the chain abstraction
- Invest more in document retrieval
Unfortunately, there were not a whole lot of end-to-end examples of integrating Rasa with OpenAI, nor functional boilerplates on GitHub, so I put a working prototype together in a few days, and thus RasaGPT was born.
RasaGPT is a Python-based boilerplate and reference implementation of Rasa and Telegram utilizing an LLM library like Langchain for indexing, retrieval and context injection. FastAPI endpoints are made available for you to build your application on top of. Features include:
- Automated hand-off to human if queries are out of bounds
- "Training" pipeline done via API
- Multi-tenant support
- Generate category labels from questions
- Works right out of the box with docker-compose
- Ngrok reverse tunnel and dummy data included
- Multiple use cases and a great starting point
I haven't worked with Rasa so I was wondering if I understood things correctly.
Are you using a language model to look up the correct reply to a particular response inside Rasa? Where Rasa presumably connects to some kind of backend to retrieve information or 'do stuff'?
this implementation leverages Rasa and stands up a FastAPI server that receives the user-response webhook first; the message then gets processed by (or bypasses) Rasa.
The LLM queries a set of documents indexed by Langchain. Dummy data has been included (Pepe Corp.)
Rasa has support for a "fallback" mechanism whereby if a user's response scores low on your pre-configured Rasa intents (like Greet) you can have it route directly to the LLM as well. But for now, RasaGPT captures and routes the Telegram response to the FastAPI webhook endpoint.
the LLM itself, with the prompts I configured, provides a boolean on whether the response should be escalated to a human, based on LLM+Langchain not knowing the answer to the user's query from the indexed documents.
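For what it's worth, the parsing side of such an escalation flag can be very simple: have the prompt instruct the model to emit a marker when it can't answer from the indexed documents, then check for it. A sketch (the NO_ANSWER marker is illustrative, not RasaGPT's actual convention):

```python
def should_escalate(answer: str) -> bool:
    """Hand off to a human when the model signals it couldn't answer
    from the documents, or returns nothing usable."""
    return "NO_ANSWER" in answer or not answer.strip()
```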
I hope that answers your question, if not happy to follow-up!
A bit off topic, but you'd better change the name and remove the GPT.
OpenAI is claiming AI products that are using GPT in their name are causing confusion and is sending legal threats now.
They say "mark" and not "trademark" in this takedown notice probably for a reason - because they can't call it a trademark and it's not enforceable?
If somebody creates ClosedAI company with product called ChatLLM, does it mean they can start sending takedown notices to everybody left and right who is using "LLM" in their name?
> If somebody creates ClosedAI company with product called ChatLLM, does it mean they can start sending takedown notices to everybody left and right who is using "LLM" in their name?
Those things are nothing alike.
GPT is a very specific family of models, all created by OpenAI. The copy-cats came after OpenAI released those models, specifically to point out similarities with the OpenAI-created models.
LLM is a generic term and cannot be trademarked.
You can trademark "Coca Cola", but you can't trademark "Cola" or "Soda".
No, they are not all created by ClosedAI; e.g. GPT-J and GPT-NeoX were created by EleutherAI.
If LLM is pizza, then GPT is, e.g., thin-crust pizza.
ps. it's funny how quickly they went from open AI research to prosecuting projects for using the GPT acronym, while pumping money and doing exclusive deals with a mega corp - the very thing they were supposed to protect us from.
According to Wikipedia, "GPT" originated with OpenAI: "The concept and first such model were introduced in 2018 by the American artificial intelligence organization OpenAI"
It itself is not a GPT. It is a framework-on-framework project built on top of Rasa (https://github.com/RasaHQ/rasa) and Langchain, which by default uses gpt-3.5-turbo (change it in the .env file) or any foundation model you wish.
Sorry for the off topic question, but does anyone know how to buy consumer hardware optimal for running emerging open source chat models with the largest parameter chat models possible?
Would it be more cost effective to try to buy an absurd amount of ram and run on the cpu?
Or buy an Nvidia card with the biggest capacity available?
Or maybe buy a Mac with the most memory you can get?
ARM-based Macs are the easiest way to get an acceptable performance without the headaches right now, if you can afford the price.
Install https://github.com/oobabooga/text-generation-webui, update pytorch and llamacpp-python, and you should be able to run pretty much all models out there, in all formats, both on GPU and CPU. CPU on a Mac gives you the fastest speed, but you should pass the correct --threads argument (investigate how many performance cores you've got). GPU is slower, but more energy efficient. https://github.com/mlc-ai/mlc-llm gives me way better GPU performance compared to oobabooga, but they only support a couple of models right now - it's worth following their progress though.
If you're after the raw performance, I suggest using GGML models (meant for llama.cpp, but it's bundled in textgen, so you can use it there with the convenience of a web ui). q4_0 is the fastest quantization, while the q5_1 is the best quality right now.
If the GGML is not available, you can generate it quite easily from the safetensors yourself (note that you need enough RAM to load the model in pytorch though).
With 16GB RAM you can run any 13B model, as long as it's quantized to 4/5 bits. 32GB RAM allows you to run 30/33B models, and 64GB RAM, 65B models. 30B and 65B models are way more useful for real world tasks, but they are more expensive to train, so there aren't as many to choose from compared to 7B/13B. 7B and anything less is a toy in my opinion, while 13B is good enough for experimentation and prototyping.
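As a rough sanity check, those sizings follow from bits-per-weight arithmetic; a back-of-the-envelope sketch (the 20% overhead factor is my guess for context and runtime buffers, not a measured number):

```python
def approx_model_ram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Ballpark memory for a quantized model: parameter count times
    bits per weight, plus ~20% for KV cache and runtime buffers."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 2**30

# A 13B model at 4-bit comes out well under 16GB:
print(round(approx_model_ram_gb(13, 4), 1))
```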
Thanks, that makes sense and helps a lot. I have a 16gb m1 that I got llama 13b running on. It works really well but I really want to run bigger models, so your examples of ram -> model size are super helpful.
I’ll probably just end up getting a higher capacity Mac in the next few years. Right now 96gb configurations seem to be around 4k, if that comes down a bit in the future i’ll probably pick something up.
I’m not really looking to train myself so training cost isn’t an issue for me personally, I just want to be able to run the best of what the open source community comes up with (or contribute to a pool to train models, if that becomes a thing).
Just a heads up - GPUs are a looooot faster than CPUs depending on what models you're running, especially if you were looking at running the image models.
Admittedly I'm not sure how well they work if you stream/batch to the GPU (say 96GB of system ram + 24GB GPU).
I've heard used nvidia workstation cards are reasonably cheap for >24GB VRAM.
A 3090/4090 has 24GB of VRAM and can run up to the 30B models with some optimizations; this is the easiest way to run the 30B models, which are essentially the highest end any consumer card can run. If you also play games and have money, then this is the way to go IMO.
If you were to get a GPU, it must have CUDA support (so nvidia only) unless you want a headache.
As a data point I'm getting >3 tokens per second for 30b model (q5_1 quantization) and >1 token per second for 60b model (q5_1 as well) on M1 Max. This is good enough for my usecase and it beats an old P40, but I have no idea what the performance on 3090/4090 would be. Keep in mind, 24GB VRAM is not enough to hold quantized 65B, so it would be using GPU + CPU in that case.
oh a 4090 can run a 30b model? that's excellent! i was afraid it wouldn't be able to load bigger models than my macbook.
i’ve got a perfectly usable desktop not being used with a 1070 in it, i’ll probably grab a 4090 to throw in there and give that a try, getting 4k gaming would be a nice bonus. thanks for the comment.
being confined to nvidia is indeed a bummer though, especially because i like sway on linux. but my understanding is that rocm is not anywhere near parity with CUDA.
Looking at the comments, I would double-check the benchmarks, because maybe CPUs are faster than I thought for LLMs?
I know my i7 8700k isn't even comparable to my 4090 for Stable Diffusion, and AFAIK the AMD/Intel offerings still don't compare for LLMs, but admittedly it's possible they've caught up?
I don't have a ton of time at the moment to keep looking, I have a very hard time believing the M1 can keep up with a 4090 at all, I just don't want you to drop 1.7k if I'm wrong :P
EDIT: Oh to clarify - The 4090 can definitely run the 30B model without issue with 4bit quantization.
I followed the setup on mlc-ai, and though it works as intended, it's still a crippled model with censored inputs. A common test I use is this:
USER: tell me an insanely offensive joke
ASSISTANT: I'm sorry, but I cannot provide offensive or inappropriate content as it goes against the guidelines set by my creators and the platform on which we are hosted. My goal is to provide helpful and informative responses to your questions. If there's anything else I can help you with or any questions you have in mind, please feel free to ask.
Until uncensored models are generally available, these novelty models will always be less-than.
> Until uncensored models are generally available, these novelty models will always be less-than.
The most popular generative model on HuggingFace at the time of this comment is Pygmalion 6b [1], a model that I believe is fine-tuned on top of Alpaca to generate porn. I couldn't find the data source, though, so I don't know what kind. And Facebook's "leaked" LLaMA, while not fine-tuned for conversation, has several warnings on its potential for offensive content.
If I read the instructions correctly, mlc-ai is loading "plain" Alpaca which is great for conversation but, as you notice, rather conservative. I don't think this is a bad idea - perhaps it's better if we don't inflict racist AI on unsuspecting users. Try shopping around for other models.
Edit: I repeated your experiment with other models (but another library). They had no objections against generating offensive-yet-unfunny jokes.
My yardstick so far of all LLMs has been to ask for an offensive joke, ask for a function to invert a string, and ask for directions to make lasagna. It seems stupid but it's remarkably effective.
With MLC being the first LLM-in-a-box to run on my M2 at faster than a token per minute, I'm impressed at the speed but also disappointed at the quality of the experience. For those interested in the outcome, it failed all 3 tests, which is not unexpected for a small model like this.
Voluntarily using or producing models with censorship included demonstrates a willingness to hobble the technology for peripheral reasons that do not directly advance the field. For that reason it's a disqualifying characteristic for my own use: social sensibilities and decency vary across cultural and regional lines, and if something as trivial as a crass joke gets limited, then things of much graver concern will undoubtedly be tampered with or limited too, and not always in ways the authors intended.
Self-hindering behavior will not be the positive we think it will be, as with most measures to correct injustices with data.
You can't right now. Devs are working on instructions for porting other models, but they're not ready yet. The point of MLC is that it supports pretty much all GPU backends out there (including Intel and Mac). The bundled model is just a proof of concept.
Well, for running the average model as-is, without spending a few days figuring out why you're getting strange errors and can't get it working, you more or less need CUDA support.
As much VRAM as you can get is probably also a good idea.
For reference, I can seemingly run Vicuna-7B (I think the 4-bit version) on my 6GB 1660 Ti at roughly 1.5 tokens per second. Way too slow for anything useful, so you can imagine what CPU inference would look like.
I get 3 tokens per second on an M1 Max running 30B models, compared to 1 token per second on a GPU (P40), both quantized to 4-bit. So, in my opinion, CPUs are better for inference (at least fast CPUs with DDR5 versus the cheapest GPUs).
The reason GPUs are the de facto standard is that they scale better, are more power efficient, and are better supported by pytorch & co. Also, academia cares more about getting the best quality for their benchmarks than about performance and accessibility.
I don't know why no one has mentioned this yet: the $180 Nvidia Tesla P40 24GB is about as capable as a 4090 for running LLMs (~70% of the token throughput for 8x cheaper). You can even run two or more in SLI to run 65B or larger models.
Just search eBay for Nvidia P40. Be sure to add an aftermarket cooling fan ($15 on eBay), as the P40 does not come with its own.
The P40 is a LOT faster than an ARM Mac, and a lot cheaper.
(Note: Do not go older than a P40. Pascal or newer is required to run 4-bit quantized models. For example, the $100 M40 24GB is effectively only 6GB, as it must run models in 16-bit.)
Can you provide sources for this claim? If true, how? What is it that the 4090 has that the P40 doesn't to justify the price?
I understand that 4090 is aimed at gaming and has a lot of extra bells and whistles like the RTX cores. But it is also consumer electronics and much cheaper than the enterprise GPU lines for the same power.
According to this 4090 already has double the raw flop performance of the V100 and is competitive with the most powerful GPUs in the market from last year.
The claim of 70% of a 4090 is very strange; my 4090 runs a 30B at roughly 25 tokens/second, compared to the 1 token/second claimed by the P40 user here: https://news.ycombinator.com/item?id=35861360
i was looking into an nvidia k80 before (so thanks for including your comment about needing pascal or greater) but i had a couple of concerns about the power connectors and pcie lanes/speed.
i read that data center gpus need specialized power adapters, and i didn't find good resources to see if it would be able to hook up to a consumer grade power supply or what adapters i would need
i think my tomahawk b450's pcie 3.0 x16 would suffice, but i'm not 100% sure if there would be bandwidth issues when running an nvme ssd alongside it
driver-wise i think i would be fine, i'm not sure if datacenter drivers are typically included in what's provided by linux distros but i'm sure i could make it work if not
and yeah i would definitely grab a cooler and probably undervolt and/or run it at a slightly lower clock speed to be safe
Tangentially, it's interesting seeing an open source project like this actually spin up a domain name, contact email, and some branding (the image in the Readme), for a project the author said was created in just a few days.
I wonder what the objective is for that extra polish. If it's optimizing star count growth, how much do these touches help?
that's a somewhat cynical interpretation. what if i just care about aesthetics and want to raise the bar.
my primary motivation was to get users of Rasa out of a directional hole bc that's where i was.
of course i like stars. it's a video game and i like winning. it was actually created in a few days all by me. no ulterior motive, literally indexing a solution to my problem from ~a week ago.
my bg is eng + product so i do these things as reflex and have a love for good UX.
We are all crafters, and I admire the attention to these presentation points of OP. This is clear in the documentation and explanations at the repository too.
That said, I don't think the questioning of GP was malicious, just a natural curiosity. Yes, a little suspicious, but, well, we are in the internet after all. In the least, it's good to point when someone does the extra work to make a great presentation.
Sorry, I didn't mean to imply any nefarious ulterior motives here.
I'm more just intellectually curious about the dynamics of Github and marketing on it these days, whether it's for attracting contributors to non-commercial OSS projects or more commercial objectives where rapid growth leads to userbase, funding, etc.
The project looks quite interesting and I agree we need a way to bridge the gap between traditional bot creation frameworks and the more LLM-centric approaches of late.
my objective falls into neither bucket. i want rasa users to find it so i optimize for search (GH tags, clear description), ease of use (video, addt'l MD files) and perception (logos) but i'll be honest, for my intention it has a diminishing rate of return.
at minimum i find canonical README sections like quick start, installation, how it works is necessary if you want to be helpful. helpfulness is difficult to measure outside of inbound emails thanking you / forks w/ actual commits.
hope that gives some kind of insight. just make everything awesome :)
The sad truth is that it's the average README that's hilariously bad, not that you did something strange or wrong. As you say - whenever something is trying to be useful to anyone other than the author, the author should in turn try to be helpful. Unfortunately, programmers always had a love-hate relationship with documentation (love it when you need it and others wrote it for you, hate the guts of it if it's you who need to write it), but I think it got progressively worse in the last decade.
At work, I have to often seriously fight for there to even be a README. Lack of docs and docstrings, lack of meaningful comments in the code, utter lack of visualization was the norm in the 90s, then it got better for a while, and now we've done a full circle and are back with undocumented spaghetti everywhere.
It's really strange, and I don't understand why it's like this. I tell people who nominally are way past being juniors to read their code before making a PR - to see how easy to understand it is - and they look like they just got enlightened. Like, isn't this (reading your own code) the most basic of all ways of working with code? Same for READMEs, I tell them to put all the information needed for a new person to set the project up, and am met with blank stares - why would they, programmers, bother with writing down plain English and managing the information surrounding what they do? Have these guys never thought about what the "I" in "IT" means?
Sorry, that's a possibly unwarranted rant, but when I see posts like GP's that seem to assume that writing a helpful README is somehow strange and a waste of effort unless it translates into clicks, it just blows my mind, it a pretty negative way.
you're not alone my dude. i have a similar challenge w/ my engineers. my best lead is an artisan and is proud of what he authors, and like me sees his code as part of the product UX / funnel (1%). with the others (99%) i have to get a bit draconian, or simply create company templates they must adhere to or PRs get rejected and they hear from me in their 1:1s.
you either love it or you don't. and if you don't, follow the rules like a big boy or get called out.
ultimately you have to set a culture for it, even if it is like pulling teeth, because net-net it impacts the PnL.
I really appreciate your work on this - we have been building on Rasa and looking into alternative ways to use new LLM models. The question is: do we even need Rasa anymore? Does it make sense, given that RasaX is behind a very expensive paywall?
I’m making a similar “chatbot backend” kind of thing right now, meant to be configurable and versatile, while avoiding dependencies. It’ll be compatible with stuff like llama index and pinecone but doesn’t impose it. I’ll have a minimal foundation ready this week, I can send it your way when it’s up if you’d like
the next best platform I could find for my friend I was helping was google's dialog flow. again, it was managed, closed-source, opinionated, and not as flexible. and most importantly, its design considerations were for a pre-LLM world.
i personally think there is an acute opportunity for creating a bare bones rasa built with LLMs in mind. the core concepts behind rasa are useful (domains, intents, actions, etc.) but the underlying NLU technology and assumptions around the platform are obsolete so 70% of the footprint is unnecessary.
It allows you to set up "Input Channels", e.g. Slack and Telegram, and has an intents-and-responses pipeline.
It leverages pre-LLM NLU models (NLTK, BERT, etc.) to score intents, and based on the winning intent it serves a pre-configured response.
My implementation lets you route directly (or fall back) to GPT-3 or GPT-4 via Langchain document retrieval. So essentially this is an example of a knowledge-base customer support bot.
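The score-then-fallback routing described above can be boiled down to a few lines of plain Python. This is a hypothetical sketch, not the Rasa or Langchain APIs: the crude `SequenceMatcher` similarity stands in for a real NLU model, and the `"llm-fallback"` marker stands in for the actual handoff to GPT + document retrieval.

```python
# Sketch of the routing idea: score the message against known intents,
# and fall back to an LLM-backed answer when confidence is too low.
from difflib import SequenceMatcher

# Toy intent catalog with example utterances (stand-in for trained NLU).
INTENTS = {
    "greet": ["hello", "hi there", "good morning"],
    "goodbye": ["bye", "see you later"],
}

CONFIDENCE_THRESHOLD = 0.75

def score_intent(message: str) -> tuple[str, float]:
    """Return the best-matching intent and a crude similarity score."""
    best_intent, best_score = "fallback", 0.0
    for intent, examples in INTENTS.items():
        for example in examples:
            score = SequenceMatcher(None, message.lower(), example).ratio()
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score

def route(message: str) -> str:
    """Serve a canned response on high confidence, else hand off to the LLM."""
    intent, score = score_intent(message)
    if score >= CONFIDENCE_THRESHOLD:
        return f"canned-response:{intent}"
    # Low confidence: this is where the GPT + document-retrieval path kicks in.
    return "llm-fallback"

print(route("hello"))                      # matches the greet intent
print(route("how do I reset my router?"))  # falls through to the LLM
```

The point is just that the "obsolete 70% of the footprint" argument holds: the routing core is tiny, and everything interesting happens in the fallback branch.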
Can you actually build a reliable customer-facing chatbot on top of LLMs? With the "jailbreaking", and not knowing whether it's actually using the data you're supplying it or other data it was trained on, and so on.
in short, and as mentioned in the README.md, this is absolutely vulnerable to prompt injection. I think this is not a fully solved issue, but some interesting community research has been done to help address these things in production
I'm not sure it solves the problem of restricting the information it uses though. For example, as a proof of concept for a customer, I tried providing information from a vector database as context, but GPT would still answer questions that were not provided in that context. It would base its answers on information that was already crawled from the customer website and in the model. That is concerning because the website might get updated but you can't update the model yourself (among other reasons).
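One common (and imperfect) mitigation for this is to wrap the retrieved chunks in a prompt that instructs the model to answer only from the supplied context and to refuse otherwise. A minimal, hypothetical prompt builder, nothing here is a LangChain API, and the refusal instruction reduces but does not eliminate answers drawn from stale training data:

```python
# Build a prompt that constrains the model to the retrieved context.
def build_grounded_prompt(question: str, context_chunks: list[str]) -> str:
    # Join retrieved chunks with a visible separator so chunk boundaries
    # stay distinguishable inside the prompt.
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, reply exactly: "
        '"I don\'t have that information."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of purchase."],
)
print(prompt)
```

Even with this, models sometimes blend in pretrained knowledge, which is exactly the crawled-website problem described above, so the instruction is a guardrail, not a guarantee.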
i agree. perhaps you're confused about the intent. the only flag being planted is for folks using rasa looking for a reference implementation, just like i was a week ago. not sure if you're being intentionally cynical, but trying is a good thing. why? bc most ppl don't try. you make 0 of the shots you never take. and of course, if you're not intentionally being cynical -- gucci. if you are, i encourage you to make your next comment substantial or encouraging :)
only Google Bard could tell me (ChatGPT 3.5's knowledge cutoff date is too old). They are apparently trying to match the capabilities of ChatGPT plugins
>what is Langchain?
LangChain is a framework for developing applications powered by large language models (LLMs). It provides a standard interface for connecting LLMs to other sources of data, as well as for persisting state between calls of a chain/agent. LangChain also provides a collection of memory implementations, and examples of chains/agents that use memory.
Here are some of the features of LangChain:
Data-aware: LangChain allows you to connect your LLM to other sources of data, such as databases, APIs, and file systems. This makes it possible to build applications that are more powerful and differentiated than those that simply call out to an LLM via an API.
Agentic: LangChain allows you to build agents that use LLMs to make decisions and take actions. This makes it possible to build applications that can interact with their environment in a more sophisticated way.
Chains: LangChain allows you to chain together multiple calls to an LLM, or to other utilities. This makes it possible to build more complex applications that can perform multiple tasks.
LangChain is a powerful tool for building applications powered by LLMs. It is easy to use and provides a wide range of features. If you are interested in building an application that uses an LLM, I encourage you to check out LangChain.
>is langchain trying to extend the LLM with plugins, similar to what is possible with ChatGPT?
Yes, LangChain is trying to extend the LLM with plugins, similar to what is possible with ChatGPT. This will allow users to add new functionality to their LLMs without having to modify the underlying code.
For example, a user could create a plugin that adds the ability to translate text, or to generate different creative text formats. This would allow the user to use their LLM for a wider range of tasks.
LangChain is still under development, but the team is working on making it easy for users to create and deploy plugins. They are also working on adding new features to the plugin ecosystem, such as the ability to share plugins and to rate plugins.
I believe that LangChain has the potential to be a powerful tool for developers who want to build applications powered by LLMs. The ability to extend the LLM with plugins will make it even more powerful and versatile.
it solves how to integrate LLMs (via Langchain) into an application API pipeline with Rasa... something I could not find an out-of-the-box public example of on GitHub. and so here we are :)
It worked just fine for the Stable Diffusion community, where automatic1111 gives a ton of credit to 4chan for the development of Stable Diffusion tooling