I get the sense that any accurate portrayal of the model is getting downvoted. Sure, it's going to be a while before open-source models catch up with OpenAI. I'm glad that Stability is working on this, and I look forward to trying out the larger models they're training right now, but we should be able to accurately discuss the capabilities of an LLM without just automatically jumping on the hype train.
> Is this a joke? I tried the HuggingFace demo, and it just spews out ridiculous replies for everything.
Is a shallow and uninformed criticism. The parameter count alone should inform people that this isn't going to match GPT-3.5 or GPT-4. Not everyone knows that of course, but starting a comment with "Is this a joke?" is precisely the wrong way to start such a discussion.
So, it can write very coherently, but I'm not sure it excels at coding tasks.
For example:
User: Write Python code for evaluating the integral of f(x)=x^2.
Assistant: Sure, here's an example of Python code to evaluate the integral of a function `f(x) = x^2`:
```
# Define the function
def integral_of_f(x):
    return x**2
# Evaluate the integral
result = integral_of_f(3)
print("The integral of f(x) = x^2 is:", result)
```
In this code, we define the function `integral_of_f` as a function that takes a single argument `x` and returns the integral of `x^2`. We then evaluate the integral by calling the `integral_of_f` function and passing in the value of `x` as an argument.
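Which is exactly the problem: the snippet never integrates anything, it just evaluates f(3) = 9. For comparison, a correct answer could compute the definite integral analytically (the antiderivative of x^2 is x^3/3) or numerically. Here's a stdlib-only sketch using the trapezoidal rule; the function names and interval are my own choices, not anything from the model's output:

```
def f(x):
    return x ** 2

def integrate(f, a, b, n=100_000):
    """Approximate the definite integral of f on [a, b] with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))  # endpoints get half weight
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

result = integrate(f, 0, 3)
print("Integral of x^2 from 0 to 3 ≈", result)  # exact value is 3**3 / 3 = 9
```

Even that much would have been a fine answer; instead the model conflated "evaluate the integral" with "evaluate the function".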
It might be a little late, but I hope datasets start incorporating patent texts as well:
1. It's a large corpus of technical knowledge;
2. The language is written by experts in a field and reviewed many times; and
3. They have technical drawings with labels and references in the text.
The only downside I suppose is that sometimes patents are written with "just enough knowledge" to get it granted but not too much to give away the secret sauce. That's not really that different from many scholarly papers though.
To give a sense of scale, the granted patent texts of 2020 alone (without images) amount to about 160 GB of data, and we have digitized grants going back to at least 1970.
Now, I don't know if I would rely on it, but I've certainly thought about asking an LLM to write my patent text for me, provided with a technical description.
Part of its contents come from the "USPTO Backgrounds" dataset. From The Pile's paper:
> USPTO Backgrounds is a dataset of background sections from patents granted by the United States Patent and Trademark Office, derived from its published bulk archives. A typical patent background lays out the general context of the invention, gives an overview of the technical field, and sets up the framing of the problem space. We included USPTO Backgrounds because it contains a large volume of technical writing on applied subjects, aimed at a non-technical audience.
I don't know how complete the digitization of old texts is, but if you go to worldwide.espacenet.com, search for "airship", and reverse sort by date, you get documents from the 1880s.
In fact I'm downloading a whole batch of patent texts right now because I wanted to experiment with semantic search on patent texts.
Anyone here have any pointers on what the state-of-the-art method for semantic search through a large corpus would be? I've just started researching, and BERT and friends seem like they were popular about two years ago, but things move so fast I wouldn't know what I should do now.
What about a medium-sized corpus of text, say 100,000 pages?
afaik sentence embeddings via sbert are still considered a pretty viable path. This may be what you were already looking at, but there's more info here: https://www.sbert.net/index.html
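The basic recipe is the same regardless of which embedding model you pick: embed every document once, embed the query, then rank documents by cosine similarity. A minimal stdlib-only sketch of the ranking step is below; the toy 2-d vectors stand in for real sentence embeddings, which in practice you'd get from something like `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)` in the sentence-transformers library:

```
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def search(query_vec, doc_vecs, top_k=3):
    """Return indices of the top_k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

# Toy vectors standing in for real sentence embeddings:
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.1]
print(search(query, docs))  # → [1, 0, 2]
```

For 100,000 pages this brute-force scan is probably fine; it's only at millions of documents that you'd reach for an approximate-nearest-neighbor index like FAISS.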