franzypants's comments (Hacker News)

I get the sense that any accurate portrayal of the model is getting downvoted. Sure, it's going to be a while before open-source models catch up with OpenAI. I'm glad that Stability is working on this, and I look forward to trying out their larger models that they're training right now, but we should be able to discuss accurately the capabilities of an LLM without just automatically jumping on the hype train.


That's fair, but this:

> Is this a joke? I tried the HuggingFace demo, and it just spews out ridiculous replies for everything.

Is a shallow and uninformed criticism. The parameter count alone should inform people that this isn't going to match GPT-3.5 or GPT-4. Not everyone knows that of course, but starting a comment with "Is this a joke?" is precisely the wrong way to start such a discussion.


Yes, go try asking GPT-3 Babbage these same questions and let me know if it is even remotely coherent! That is the fair comparison.


I dunno if OP tried raw 3B, but finetuned llama 7B is much better than that.


So, it can write very coherently, but I'm not sure it excels at coding tasks.

For example:

User: Write Python code for evaluating the integral of f(x)=x^2.

Assistant: Sure, here's an example of Python code to evaluate the integral of a function `f(x) = x^2`:

```
# Define the function
def integral_of_f(x):
    return x**2

# Evaluate the integral
result = integral_of_f(3)

print("The integral of f(x) = x^2 is:", result)
```

In this code, we define the function `integral_of_f` as a function that takes a single argument `x` and returns the integral of `x^2`. We then evaluate the integral by calling the `integral_of_f` function and passing in the value of `x` as an argument.
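Note that the code above never integrates anything: it just evaluates f(3) = 9. For contrast, actually computing the definite integral needs a numerical or symbolic method. Here's a minimal sketch using the composite trapezoidal rule; the limits 0 and 3 are my own assumption, since the prompt never specified any:

```python
def integrate(f, a, b, n=100_000):
    # Composite trapezoidal rule: average the endpoints,
    # sum the interior samples, and scale by the step size.
    h = (b - a) / n
    total = (f(a) + f(b)) / 2 + sum(f(a + i * h) for i in range(1, n))
    return total * h

result = integrate(lambda x: x**2, 0, 3)
print("Integral of x^2 from 0 to 3:", result)  # ~9.0, since x^3/3 evaluated at 3 is 27/3
```

The same thing could be done exactly with sympy's `integrate`, but the point is just that the model's reply didn't attempt either.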


It might be a little late, but I hope datasets start incorporating patent texts as well:

1. It's a large corpus of technical knowledge;

2. The language is written by experts in a field and reviewed many times; and

3. They have technical drawings with labels and references in the text.

The only downside, I suppose, is that patents are sometimes written with "just enough" detail to get granted but not so much as to give away the secret sauce. That's not really so different from many scholarly papers, though.

To give a sense of scale: the granted patent texts from 2020 alone (without images) come to about 160 GB, and we have digitized grants going back to at least 1970.


You wouldn't want chatbots to answer you with the kind of language used in patent texts.


LLMs are actually pretty good at translating information from one form into another.


Now, I don't know if I would rely on it, but I've certainly thought about asking an LLM to write my patent text for me, given a technical description.


The Pile already does!

Part of its contents come from the "USPTO Backgrounds" dataset. From The Pile's paper:

> USPTO Backgrounds is a dataset of background sections from patents granted by the United States Patent and Trademark Office, derived from its published bulk archives. A typical patent background lays out the general context of the invention, gives an overview of the technical field, and sets up the framing of the problem space. We included USPTO Backgrounds because it contains a large volume of technical writing on applied subjects, aimed at a non-technical audience.

More details in the paper: https://arxiv.org/pdf/2101.00027.pdf

The Pile: https://pile.eleuther.ai/


I don't know how complete the digitization of old texts is, but if you go to worldwide.espacenet.com, search for "airship", and reverse-sort by date, you get documents from the 1880s.

In fact I'm downloading a whole batch of patent texts right now because I wanted to experiment with semantic search on patent texts.

Does anyone here have pointers on the state-of-the-art method for semantic search through a large corpus? I've just started researching; BERT and friends seem to have been popular about two years ago, but things move so fast that I don't know what I should do now.

What about a medium-sized corpus, say 100,000 pages of text?


afaik sentence embeddings via sbert are still considered a pretty viable path. This may be what you were already looking at, but there's more info here: https://www.sbert.net/index.html
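To make the retrieval side concrete, here's a toy sketch of the cosine-similarity ranking that sits under embedding-based search. The 3-dimensional vectors are made up for illustration; real sentence embeddings (e.g. from an SBERT model's `encode`) would have hundreds of dimensions, and at 100,000 pages you'd likely want an approximate-nearest-neighbor index rather than a linear scan:

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def search(query_vec, doc_vecs, top_k=3):
    # Rank document indices by similarity to the query, best first.
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

# Toy embeddings for three "documents" and one query.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(search(query, docs, top_k=2))  # → [0, 1]
```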

