You're correct. My position was that in Linux, threads weigh just as much as processes: the forking speed is identical, and extremely fast with COW (copy-on-write). That's in contrast to other OSes, where processes and threads are completely different beasts.
Even with NPTL, both processes and threads have task_structs.
vfork() and clone() can almost give you what you want, but Linux threading goes through NPTL nowadays (which is itself built on clone() underneath).
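If you want to see the "both are tasks" point from user space, here is a rough sketch, assuming Linux and Python 3.8+ (for threading.get_native_id()). The helper names thread_ids and fork_ids are made up for illustration. It only shows that a thread and a forked child each get their own kernel task ID; it says nothing about relative creation cost.

```python
import os
import threading


def thread_ids():
    """Return (pid, kernel task id) as seen from inside a new thread."""
    seen = {}

    def worker():
        seen["pid"] = os.getpid()                 # shared with the whole process
        seen["tid"] = threading.get_native_id()   # this thread's own kernel task id

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return seen["pid"], seen["tid"]


def fork_ids():
    """Return (pid, kernel task id) as seen from inside a forked child."""
    r, w = os.pipe()
    child = os.fork()
    if child == 0:
        # Child process: report our IDs through the pipe, then exit immediately.
        os.close(r)
        os.write(w, f"{os.getpid()} {threading.get_native_id()}".encode())
        os._exit(0)
    # Parent process: read the child's report and reap it.
    os.close(w)
    data = os.read(r, 64).decode()
    os.close(r)
    os.waitpid(child, 0)
    pid, tid = (int(x) for x in data.split())
    return pid, tid


if __name__ == "__main__":
    print("main  pid/tid:", os.getpid(), threading.get_native_id())
    print("thread pid/tid:", *thread_ids())
    print("fork   pid/tid:", *fork_ids())
```

On Linux the new thread should report the same pid as the main thread but a different task ID, while the forked child reports a new pid that matches its own task ID. Either way the kernel handed out a fresh task.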