I am a graduate CS student (data mining and machine learning) and have good exposure to core Java (>4 years). I have also read a fair amount about Hadoop and MapReduce.
I would now like to do a project in this area (in my free time, of course) to get a better understanding.
Any good project ideas would be really appreciated. I just want to do this to learn, so I don't really mind re-inventing the wheel. Also, anything related to data mining/machine learning would be a bonus (it fits with my research), but it is absolutely not necessary.
You haven't written anything about your interests. I know that graph mining algorithms have been implemented on top of the Hadoop framework. This software, http://www.cs.cmu.edu/~pegasus/, and the paper "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations" may give you a starting point.
Further, this link discusses something similar to your question: http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/ (though the example is in Python). There is also a very good paper co-authored by Andrew Ng, "Map-Reduce for Machine Learning on Multicore".
There was a NIPS 2009 workshop on a similar topic, "Large-Scale Machine Learning: Parallelism and Massive Datasets". You can browse some of the papers to get ideas.
Edit: There is also Apache Mahout (http://mahout.apache.org/): "Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm."
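If you want a feel for what "implemented on top of Hadoop using the map/reduce paradigm" looks like in code, here is the classic word-count job in Java. This is the canonical minimal Hadoop MapReduce example, not Mahout code, just the paradigm Mahout builds on:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Mahout's clustering and classification jobs follow the same mapper/reducer structure, just with vectors and model state instead of words and counts.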
See http://www.quora.com/Machine-Learning/What-are-some-good-class-projects-for-machine-learning-using-MapReduce
And here are some good toy projects to start with: http://www.quora.com/Programming-Challenges-1/What-are-some-good-toy-problems-in-data-science
Why don't you contribute to Apache Hadoop/Mahout by helping them implement additional algorithms?
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
That page lists a number of algorithms marked as "open". To my understanding, they could use help implementing these, and there are hundreds of algorithms not even on the list.
In any case, since you want to do something with Hadoop, why not ask them what they need instead of asking on some random internet site?
Trying to come up with an efficient way to implement Hierarchical Agglomerative Clustering (HAC) on Hadoop would be a nice project. It involves not only algorithmic work but also optimizations tied to the Hadoop core framework.
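As a concrete starting point, here is a minimal sketch of one building block a naive HAC approach needs: a Hadoop job in Java that scans precomputed pairwise distances and returns the globally closest pair, i.e. the next pair of clusters to merge. The class names and the tab-separated "idA, idB, distance" input format are assumptions for illustration, not an established implementation:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClosestPairJob {

  // Each mapper keeps only its local minimum-distance line and emits it once
  // in cleanup(), so very little data crosses the network.
  public static class MinDistanceMapper
      extends Mapper<Object, Text, NullWritable, Text> {
    private double bestDistance = Double.POSITIVE_INFINITY;
    private String bestLine = null;

    @Override
    protected void map(Object key, Text value, Context context) {
      String[] parts = value.toString().split("\t");
      if (parts.length != 3) {
        return; // skip malformed lines
      }
      double d = Double.parseDouble(parts[2]);
      if (d < bestDistance) {
        bestDistance = d;
        bestLine = value.toString();
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      if (bestLine != null) {
        context.write(NullWritable.get(), new Text(bestLine));
      }
    }
  }

  // A single reducer picks the global minimum among the per-mapper minima.
  public static class MinDistanceReducer
      extends Reducer<NullWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double bestDistance = Double.POSITIVE_INFINITY;
      String bestLine = null;
      for (Text value : values) {
        double d = Double.parseDouble(value.toString().split("\t")[2]);
        if (d < bestDistance) {
          bestDistance = d;
          bestLine = value.toString();
        }
      }
      if (bestLine != null) {
        context.write(new Text(bestLine), NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "closest pair");
    job.setJarByClass(ClosestPairJob.class);
    job.setMapperClass(MinDistanceMapper.class);
    job.setReducerClass(MinDistanceReducer.class);
    job.setNumReduceTasks(1);
    job.setMapOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The interesting part of the project is everything this sketch leaves out: maintaining and updating the distance matrix across merges without recomputing it from scratch, and avoiding one full MapReduce job per merge step.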