AMD has announced a new set of extensions to the x86-64 ISA that it may incorporate into future processors. Called Light-Weight Profiling (LWP), the new technology is the first in a series of initiatives that AMD is calling "Hardware Extensions for Software Parallelism." The general point of the newly announced LWP technology, and of the parallelism-related announcements that will follow it as AMD unveils more of its plans, is to make it easier for programmers to extract performance from multicore processors. LWP contributes to this goal by giving running processes a set of low-overhead profiling tools that enable them to get a better look at themselves and each other in real time, so that they can see what they're doing and adjust their behavior accordingly.
In theory, the feedback that LWP gives to processes and to the OS will be used to improve software parallelism and memory allocation on the fly, thereby increasing overall performance. Notably, LWP will apparently add very little overhead of its own while gathering this feedback, and its benefits aren't strictly limited to multicore scenarios. Single-core products could also apparently benefit from the technology, though such products will make up an increasingly small share of AMD's sales as time progresses. AMD cites Java and Microsoft .NET as two runtime environments that could conceivably benefit from such technology.
According to the hardware specification (PDF) that AMD has posted, LWP proposes additional registers, memory structures, and instructions that operate in both legacy and long modes. A new model-specific register (MSR), populated by the OS, controls what types of events, if any, a process is allowed to monitor using LWP. When LWP is enabled for a process, the processor's profiling hardware checks a special LWP control block (LWPCB) that's stored in the process's memory space (and possibly cached in a special set of registers for quick access) to see what types of events it should be monitoring. It then monitors those events (cache misses, instructions and branches retired, instructions executed, etc.) using a set of counters and event records that are kept in memory and can be accessed by the process.
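To make that flow concrete, here is a toy software model of it. This is only a sketch of the idea, not AMD's interface: the names `Event`, `ControlBlock`, and `Profiler` are hypothetical, the event list is abbreviated, and nothing here reflects the real register layouts or instruction encodings in the specification. It assumes the hardware monitors only those events that the process requests in its control block and that the OS-populated MSR also permits.

```python
from dataclasses import dataclass, field
from enum import Flag, auto


class Event(Flag):
    """Hypothetical event types; not the real LWP encoding."""
    NONE = 0
    CACHE_MISS = auto()
    INSTRUCTION_RETIRED = auto()
    BRANCH_RETIRED = auto()


@dataclass
class ControlBlock:
    """Stand-in for the per-process LWPCB kept in the process's memory."""
    requested: Event                               # events the process asks for
    counters: dict = field(default_factory=dict)   # per-event counts
    records: list = field(default_factory=list)    # (event, address) records


class Profiler:
    """Toy model of the profiling hardware."""

    def __init__(self, os_allowed: Event):
        # Plays the role of the OS-populated MSR: events a process may monitor.
        self.os_allowed = os_allowed
        self.active = Event.NONE
        self.block = None

    def enable(self, block: ControlBlock) -> Event:
        # Assumption: monitor only events both requested and permitted.
        self.block = block
        self.active = block.requested & self.os_allowed
        return self.active

    def observe(self, event: Event, address: int) -> None:
        # Ignore events while disabled, or events that are not monitored.
        if self.block is None or not (event & self.active):
            return
        self.block.counters[event] = self.block.counters.get(event, 0) + 1
        self.block.records.append((event, address))


# The OS permits cache-miss and branch monitoring only.
profiler = Profiler(os_allowed=Event.CACHE_MISS | Event.BRANCH_RETIRED)

# The process asks for cache misses and retired instructions.
block = ControlBlock(requested=Event.CACHE_MISS | Event.INSTRUCTION_RETIRED)
profiler.enable(block)

profiler.observe(Event.CACHE_MISS, 0x1000)           # counted and recorded
profiler.observe(Event.INSTRUCTION_RETIRED, 0x1004)  # dropped: not permitted
```

Because the counters and records live in the control block, which belongs to the process, the process can read them directly. In this model that is what lets a program inspect its own behavior without a round trip through the OS.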