Hacker News | dimitry12's comments

I believe this is a valid point: HF's replication indeed uses a larger off-the-shelf model as the verifier.

In contrast, in the original paper, the verifier is a fine-tune of the exact same base model that is used to sample step-by-step solutions (= the "solver").


Using a different 1B model as the verifier makes sense, yes. Using a Llama 8B finetune as the verifier, and then comparing 1B inference-time scaling against an 8B model, makes little sense to me.

Using a 3B model with an 8B verifier against a 70B model would make sense too. That said, their performance barely crossed the 70B line at 256 samples. That is 256*(8+3)/70 ≈ 40 times more computationally expensive than running the 70B model as is.
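A back-of-envelope sketch of that arithmetic, assuming per-sample cost is simply proportional to parameter count and that the verifier scores every sample (a rough upper bound; in practice verifier scoring is cheaper than generation):

```python
# Back-of-envelope cost comparison, costs in "billions of parameters per sample".
solver_b = 3       # 3B solver
verifier_b = 8     # 8B process reward model (verifier)
baseline_b = 70    # 70B model run once, 0-shot
n_samples = 256    # candidate solutions drawn and scored per problem

search_cost = n_samples * (solver_b + verifier_b)  # 256 * 11 = 2816
baseline_cost = baseline_b                         # one forward pass
print(search_cost / baseline_cost)                 # ~40.2x
```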


"1B solver + 8B verifier + search" beating 0-shot 70B is nice, agree.

"1B solver + 8B verifier + search" beating 1B-0-shot or 1B-majority as baselines isn't illustrative imo. In other words, by using larger verifier, HF's replication fails to establish a "fair" baseline. Still an awesome blog and release/repository from HF's group - I love it!


Where did you see that? I thought they used an 8B model for their reward model?

> To guide our search strategies, we used RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B reward model that has been trained using process supervision


"Solver" is `meta-llama/Llama-3.2-1B-Instruct` (1B model, and they use 3B for another experiment), and verifier is `RLHFlow/Llama3.1-8B-PRM-Deepseek-Data`.

See https://github.com/huggingface/search-and-learn/blob/b3375f8... and https://github.com/huggingface/search-and-learn/blob/b3375f8...

In the original paper, they use PaLM 2-S* as the "solver" and a fine-tune of it as the "verifier".


In this paper and in HF's replication, the model used to produce solutions to MATH problems is off-the-shelf. It is induced to produce step-by-step CoT-style solutions by few-shot ICL prompts or by instructions.

Yes, the search process (beam-search or best-of-N) does produce verbose traces, because there is branching involved when sampling "thoughts" from the base model. These branched traces (including incomplete, "abandoned" branches) can be shown to the user or hidden if the approach is deployed as is.
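A minimal best-of-N sketch of the branching described above; `sample_solution` and `verifier_score` are hypothetical placeholders standing in for the 1B solver and the 8B reward model, not the actual HF code:

```python
import random

def sample_solution(problem, temperature=0.8):
    # Placeholder: sample one step-by-step CoT trace from the small solver.
    return f"trace-{random.random():.4f}"

def verifier_score(problem, solution):
    # Placeholder: the process reward model's score for a full trace.
    return random.random()

def best_of_n(problem, n=16):
    # Draw N independent traces, score all of them, surface only the best.
    traces = [sample_solution(problem) for _ in range(n)]
    scores = [verifier_score(problem, t) for t in traces]
    best = max(range(n), key=lambda i: scores[i])
    return traces[best], traces  # winner, plus all (possibly hidden) branches

answer, all_traces = best_of_n("What is 2+2?", n=8)
print(len(all_traces))  # 8 sampled branches; only one is shown to the user
```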


The verifier is trained with soft values of reward-to-go for each solution prefix, obtained from Monte-Carlo rollouts of step-by-step solutions sampled from the "base" model.

In other words: 1) sample step-by-step solutions from the "base" model; 2) do it at non-zero temperature so that you can get multiple continuations from each solution prefix; 3) use the MATH labels to decide whether a full solution (a leaf/terminal node in the MC rollout) gets reward `1` or `0`; 4) roll up these rewards to calculate the reward-to-go for each intermediate step.
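The roll-up in step 4 can be sketched like this, with toy rollouts standing in for real sampled solutions: the Monte-Carlo value of a prefix is just the mean leaf reward over rollouts passing through it.

```python
from collections import defaultdict

# Each rollout is (list of steps, leaf reward). Leaf reward is 1 if the final
# answer matches the MATH label, else 0. Toy data, not real model samples.
rollouts = [
    (["step-A", "step-B1"], 1),
    (["step-A", "step-B2"], 0),
    (["step-A", "step-B1"], 1),
]

# Reward-to-go of a prefix = mean leaf reward over rollouts through it.
totals, counts = defaultdict(float), defaultdict(int)
for steps, reward in rollouts:
    for i in range(1, len(steps) + 1):
        prefix = tuple(steps[:i])
        totals[prefix] += reward
        counts[prefix] += 1

values = {p: totals[p] / counts[p] for p in totals}
print(values[("step-A",)])            # 0.666...: two of three rollouts succeed
print(values[("step-A", "step-B1")])  # 1.0
```

These per-prefix soft values become the training targets for the verifier.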

Yes, a verifier trained in this manner can be used to score solution prefixes (as a process verifier) or a full solution (as an outcome verifier).

In the original paper (https://arxiv.org/abs/2408.03314) they fine-tune a fresh verifier. HF's replication uses an off-the-shelf verifier based on another paper: https://arxiv.org/abs/2312.08935


Curious about that too. There are plenty of forks left, for example: https://github.com/plastic-labs/llama3_interpretability_sae (no affiliation)


Looking at https://github.com/modelcontextprotocol/python-sdk?tab=readm... it's clear that there must be a decision connecting, for example, `tools` returned by the MCP server and `call_tool` executed by the host.

In the case of the Claude Desktop App, I assume the decision of which MCP server's tool to use, based on the end-user's query, is made by the Claude LLM using something like a ReAct loop. Are the prompts and LLM-generated tokens involved in the "Protocol Handshake" phase available for review?
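A sketch of the kind of loop I'm imagining; `mcp_list_tools`, `mcp_call_tool`, and `llm_complete` are hypothetical stand-ins, not the actual MCP SDK or Claude internals:

```python
import json

def mcp_list_tools():
    # Hypothetical stand-in for the MCP server's tool listing.
    return [{"name": "get_weather", "description": "Get weather for a city"}]

def mcp_call_tool(name, args):
    # Hypothetical stand-in for dispatching a tool call to the MCP server.
    return {"temp_c": 21} if name == "get_weather" else None

def llm_complete(prompt):
    # Hypothetical stand-in for the host's LLM; a real host would let the
    # model emit a structured tool call after seeing the tool descriptions.
    return json.dumps({"tool": "get_weather", "args": {"city": "Berlin"}})

# ReAct-style single step: show tools, let the model pick one, execute, observe.
tools = mcp_list_tools()
prompt = (
    "Tools:\n"
    + "\n".join(f"- {t['name']}: {t['description']}" for t in tools)
    + "\nUser: what's the weather in Berlin?"
)
choice = json.loads(llm_complete(prompt))
observation = mcp_call_tool(choice["tool"], choice["args"])
print(observation)  # {'temp_c': 21}
```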


Looks great as a self-hosted alternative if/when you make self-hosting feasible.


When*

Before the beta is over, it'll be easy to self-host.

But since it's free for now, there is no need for self-hosting.

Will make it easy to export/import your events between instances.


> since it's free for now, there is no need for self-hosting

It's awesome that you're offering this for free for now, but I don't think that means there's no need to self-host. There are many reasons people might not want to hand over data to a third party.


Yup, my main use case for these things is bypassing all the approvals necessary to send stuff to a third party.


True, I guess I need to prioritize this a bit


Awesome!


Can you please expand on the topic of "learn the marketing side if only by doing it semi-professionally for a client"?

I mean, one end of this spectrum is doing affiliate marketing or direct sales/MLM. Another point on the spectrum might be for an engineer to get hired as a social media "manager" (there are lots of "jobs" like this on Upwork).

What possibilities do you have in mind?


Can anyone who has the "Upgrade plan" option visible share a link to it? I wonder if it's only disabled in the UI and we can still upgrade.


Lambda Labs has a (slow, low-IOPS) cloud filesystem to persist data between instances. Attached storage does not persist, but it is high-bandwidth and high-IOPS, which is a necessity when training small-to-medium-sized models.


Seconding this. PML is high quality, active, and well documented.


PML is a great collection of implementations, but not the best framework. Also, you can use PML with Quaterion: https://github.com/qdrant/quaterion/blob/master/examples/tra...

