When joining two tables that are both partitioned on the same column and partition function, SQL Server does not take advantage of the fact that each partition could be processed independently and the results concatenated or aggregated at the end.
In such cases, a query plan that allocates one thread to each partition and performs a MERGE JOIN within that partition--with that thread not touching data from other partitions at all--would likely be a highly efficient way of parallelizing the join such that performance scales linearly with the number of CPU cores used (so long as there are enough partitions involved).
The attached test file builds a data set and includes test queries and statistics that highlight the current inefficiency of SQL Server in this particular case. The test file can be run in any environment, but it is likely that the most meaningful results will be seen on machines with 8+ cores.
For reference, here is an Oracle blog entry describing functionality similar to this request in Oracle:
In addition, here is a relevant StackExchange question on the topic: