Hacker News | dimitry12's comments

I believe this is a valid point: HF's replication indeed uses a larger off-the-shelf model as the verifier.

In contrast, in the original paper, the verifier is a fine-tune of the exact same base model that is used to sample step-by-step solutions (= the "solver").


Using a different 1B model as the verifier makes sense, yes. Using a Llama 8B finetune as the verifier, and then comparing 1B inference-time scaling against an 8B model, makes little sense to me.

Using a 3B model with an 8B verifier against a 70B model would make sense too. That said, their performance barely crossed the 70B line at 256 samples. That is 256*(8+3)/70 ≈ 40 times more computationally expensive than running the 70B model as is.
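A back-of-envelope sketch of that arithmetic, assuming per-sample cost is simply proportional to parameter count and that the verifier scores every sample (a rough upper bound; in practice verifier scoring is cheaper than generation):

```python
# Back-of-envelope cost comparison, costs in "billions of parameters per sample".
solver_b = 3       # 3B solver
verifier_b = 8     # 8B process reward model (verifier)
baseline_b = 70    # 70B model run once, 0-shot
n_samples = 256    # candidate solutions drawn and scored per problem

search_cost = n_samples * (solver_b + verifier_b)  # 256 * 11 = 2816
baseline_cost = baseline_b                         # one forward pass
print(search_cost / baseline_cost)                 # ~40.2x
```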


"1B solver + 8B verifier + search" beating 0-shot 70B is nice, agree.

"1B solver + 8B verifier + search" beating 1B-0-shot or 1B-majority as baselines isn't illustrative imo. In other words, by using larger verifier, HF's replication fails to establish a "fair" baseline. Still an awesome blog and release/repository from HF's group - I love it!


Where did you see that? I thought they used an 8B model for their reward model?

> To guide our search strategies, we used RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B reward model that has been trained using process supervision


"Solver" is `meta-llama/Llama-3.2-1B-Instruct` (1B model, and they use 3B for another experiment), and verifier is `RLHFlow/Llama3.1-8B-PRM-Deepseek-Data`.

See https://github.com/huggingface/search-and-learn/blob/b3375f8... and https://github.com/huggingface/search-and-learn/blob/b3375f8...

In the original paper, they use PaLM 2-S* as the "solver" and a fine-tune of it as the "verifier".


In this paper and in HF's replication, the model used to produce solutions to MATH problems is off-the-shelf. It is induced to produce step-by-step CoT-style solutions by few-shot ICL prompts or by instructions.

Yes, the search process (beam-search or best-of-N) does produce verbose traces, because there is branching involved when sampling "thoughts" from the base model. These branched traces (including incomplete, "abandoned" branches) can be shown to the user or hidden if the approach is deployed as is.
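A minimal best-of-N sketch of the branching described above; `sample_solution` and `verifier_score` are hypothetical placeholders standing in for the 1B solver and the 8B reward model, not the actual HF code:

```python
import random

def sample_solution(problem, temperature=0.8):
    # Placeholder: sample one step-by-step CoT trace from the small solver.
    return f"trace-{random.random():.4f}"

def verifier_score(problem, solution):
    # Placeholder: the process reward model's score for a full trace.
    return random.random()

def best_of_n(problem, n=16):
    # Draw N independent traces, score all of them, surface only the best.
    traces = [sample_solution(problem) for _ in range(n)]
    scores = [verifier_score(problem, t) for t in traces]
    best = max(range(n), key=lambda i: scores[i])
    return traces[best], traces  # winner, plus all (possibly hidden) branches

answer, all_traces = best_of_n("What is 2+2?", n=8)
print(len(all_traces))  # 8 sampled branches; only one is shown to the user
```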


The verifier is trained with soft values of reward-to-go for each solution prefix, obtained from Monte-Carlo rollouts of step-by-step solutions sampled from the "base" model.

In other words: 1) sample step-by-step solutions from the "base" model; 2) do it at non-zero temperature so that you can get multiple continuations from each solution prefix; 3) use the MATH labels to decide whether a full solution (a leaf/terminal node in the MC rollout) gets reward `1` or `0`; 4) roll up these rewards to calculate the reward-to-go for each intermediate step.
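The roll-up in step 4 can be sketched like this, with toy rollouts standing in for real sampled solutions: the Monte-Carlo value of a prefix is just the mean leaf reward over rollouts passing through it.

```python
from collections import defaultdict

# Each rollout is (list of steps, leaf reward). Leaf reward is 1 if the final
# answer matches the MATH label, else 0. Toy data, not real model samples.
rollouts = [
    (["step-A", "step-B1"], 1),
    (["step-A", "step-B2"], 0),
    (["step-A", "step-B1"], 1),
]

# Reward-to-go of a prefix = mean leaf reward over rollouts through it.
totals, counts = defaultdict(float), defaultdict(int)
for steps, reward in rollouts:
    for i in range(1, len(steps) + 1):
        prefix = tuple(steps[:i])
        totals[prefix] += reward
        counts[prefix] += 1

values = {p: totals[p] / counts[p] for p in totals}
print(values[("step-A",)])            # 0.666...: two of three rollouts succeed
print(values[("step-A", "step-B1")])  # 1.0
```

These per-prefix soft values become the training targets for the verifier.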

Yes, a verifier trained in this manner can be used to score solution prefixes (as a process verifier) or a full solution (as an outcome verifier).

In the original paper (https://arxiv.org/abs/2408.03314) they fine-tune a fresh verifier. HF's replication uses an off-the-shelf verifier based on another paper: https://arxiv.org/abs/2312.08935


Curious about that too. There are plenty of forks left, for example: https://github.com/plastic-labs/llama3_interpretability_sae (no affiliation)


Looking at https://github.com/modelcontextprotocol/python-sdk?tab=readm... it's clear that there must be a decision connecting, for example, `tools` returned by the MCP server and `call_tool` executed by the host.

In the case of the Claude Desktop App, I assume the decision of which MCP server's tool to use, based on the end-user's query, is made by the Claude LLM using something like a ReAct loop. Are the prompts and LLM-generated tokens involved in the "Protocol Handshake" phase available for review?
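A sketch of the kind of loop I'm imagining; `mcp_list_tools`, `mcp_call_tool`, and `llm_complete` are hypothetical stand-ins, not the actual MCP SDK or Claude internals:

```python
import json

def mcp_list_tools():
    # Hypothetical stand-in for the MCP server's tool listing.
    return [{"name": "get_weather", "description": "Get weather for a city"}]

def mcp_call_tool(name, args):
    # Hypothetical stand-in for dispatching a tool call to the MCP server.
    return {"temp_c": 21} if name == "get_weather" else None

def llm_complete(prompt):
    # Hypothetical stand-in for the host's LLM; a real host would let the
    # model emit a structured tool call after seeing the tool descriptions.
    return json.dumps({"tool": "get_weather", "args": {"city": "Berlin"}})

# ReAct-style single step: show tools, let the model pick one, execute, observe.
tools = mcp_list_tools()
prompt = (
    "Tools:\n"
    + "\n".join(f"- {t['name']}: {t['description']}" for t in tools)
    + "\nUser: what's the weather in Berlin?"
)
choice = json.loads(llm_complete(prompt))
observation = mcp_call_tool(choice["tool"], choice["args"])
print(observation)  # {'temp_c': 21}
```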


Looks great as a self-hosted alternative if/when you make self-hosting feasible.


When*

Before the beta is over, it'll be easy to self-host.

But since it's free for now, there is no need for self-hosting.

Will make it easy to export/import your events between instances.


> since it's free for now, there is no need for self-hosting

It's awesome that you're offering this for free for now, but I don't think that means there's no need to self-host. There are many reasons people might not want to hand over data to a third party.


Yup, my main use case for these things is bypassing all the approvals necessary to send stuff to a third party.


True, I guess I need to prioritize this a bit


Awesome!


Can you please expand on the topic of "learn the marketing side if only by doing it semi-professionally for a client"?

I mean, one end of this spectrum is doing affiliate marketing or direct sales/MLM. Another point on the spectrum might be for an engineer to get hired as a social media "manager" (there are lots of "jobs" like this on Upwork).

What possibilities do you have in mind?


Can anyone who has the "Upgrade plan" option visible share a link to it? I wonder if it's only disabled in the UI and we can still upgrade.


Lambda Labs has a (slow, low-IOPS) cloud filesystem to persist data between instances. Attached storage does not persist, but it is high-bandwidth and high-IOPS, which is a necessity when training small-to-medium-sized models.


Seconding this. PML is high quality, active, and well documented.


PML is a great collection of implementations, but not the best framework. Also, you can use PML with Quaterion: https://github.com/qdrant/quaterion/blob/master/examples/tra...

