On a machine with a large number of registers, the largest cost is storing the outgoing set of registers (program state) and loading the new set of registers (program state), since this usually requires (slow) memory access. On a machine with a small set of registers, the largest cost may be changing the processor's memory management structures (e.g., the page table).
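The two costs named above can be pictured with a toy model. This is only a sketch: the `PCB` and `CPU` classes, the `context_switch` function, and the register count of 32 are all hypothetical names and values chosen for illustration, not a description of any real kernel or processor.

```python
from dataclasses import dataclass, field

NUM_REGISTERS = 32  # hypothetical register count, chosen only for illustration


@dataclass
class PCB:
    """Hypothetical process control block holding saved program state."""
    pid: int
    saved_registers: list = field(default_factory=lambda: [0] * NUM_REGISTERS)
    page_table_base: int = 0


@dataclass
class CPU:
    """Toy CPU: a register file plus a pointer to the active page table."""
    registers: list = field(default_factory=lambda: [0] * NUM_REGISTERS)
    page_table_base: int = 0
    memory_accesses: int = 0  # counts register-sized stores and loads


def context_switch(cpu: CPU, outgoing: PCB, incoming: PCB) -> None:
    # Store the outgoing program state: one memory write per register.
    outgoing.saved_registers = list(cpu.registers)
    cpu.memory_accesses += NUM_REGISTERS
    # Load the incoming program state: one memory read per register.
    cpu.registers = list(incoming.saved_registers)
    cpu.memory_accesses += NUM_REGISTERS
    # Switch the memory management structures (here, a single base pointer).
    cpu.page_table_base = incoming.page_table_base
```

In this model the number of memory accesses grows with the register count, which is why the register save and restore dominates on a machine with many registers, while the page table change is the remaining cost on a machine with few.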
The processor could store multiple sets of registers in hardware, so that saving and restoring registers is not needed. The processor can also handle saving and restoring registers in hardware, allowing it to happen both more quickly and overlapped with other computation. Finally, the processor can make it fast to change memory management information (page tables). Of these, only storing multiple sets of registers in hardware requires support from the operating system, which can assign the most frequently running processes to the hardware register sets.
A new process needs a new PCB, a new address space (including page table entries for code, data, a new stack and heap), and copies of all OS bookkeeping entries (such as file handles).
A thread needs just a new program counter, register set, and stack. No new memory structures or operating system resources are needed.
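The difference can be observed from user code. The following is a minimal sketch, assuming a Unix-like system where `os.fork` is available; the helper names `run_in_thread` and `run_in_child_process` are made up for this example. A new thread appends to the same list object as its creator, while a forked child process appends only to its own copy of the address space.

```python
import os
import threading


def run_in_thread(shared: list) -> None:
    """Append from a new thread; it sees the same heap object as the caller."""
    worker = threading.Thread(target=shared.append, args=("thread",))
    worker.start()
    worker.join()


def run_in_child_process(shared: list) -> None:
    """Append from a forked child; the child works on its own copy."""
    pid = os.fork()  # Unix-only
    if pid == 0:
        shared.append("child")  # modifies the child's copy only
        os._exit(0)  # leave the child immediately, skipping cleanup handlers
    os.waitpid(pid, 0)  # parent waits for the child to finish


data = []
run_in_thread(data)
run_in_child_process(data)
print(data)  # ['thread']
```

The thread's change is visible to the parent because no new memory structures were created for it; the child's change is lost because the process received its own address space.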
Kernel threads allow scheduling at the thread level, and they prevent all threads in a process from blocking when a single thread blocks (with user-level threads, a blocking thread blocks the whole process). The kernel can also schedule threads on different processors of a multiprocessor.
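The blocking behavior can be checked with a short experiment. This sketch assumes CPython on a mainstream platform, where `threading.Thread` is backed by an operating system thread; `blocking_call` and `run_concurrently` are hypothetical helper names, and `time.sleep` merely stands in for a blocking system call.

```python
import threading
import time


def blocking_call(seconds: float) -> None:
    # time.sleep stands in for a blocking system call such as a disk read.
    time.sleep(seconds)


def run_concurrently(count: int, seconds: float) -> float:
    """Start `count` threads that each block for `seconds`; return elapsed time."""
    threads = [
        threading.Thread(target=blocking_call, args=(seconds,))
        for _ in range(count)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

Calling `run_concurrently(4, 0.2)` should take roughly 0.2 seconds rather than 0.8, because each blocked thread is descheduled on its own while the others continue to run.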