These days I have been spending more time studying statistics and probablity than the papers about new neural network architectures that pop up everyday. to gaining a deeper understanding

To address the issue they developed their own algorithms to detect duplicate question. On top of that, a while ago Quora published their first public dataset of question pairs publicly for machine learning (ML) engineers to see if anyone can come up with a better algorithm to detect duplicate questions, and they created a competition on Kaggle.

Here is how the competition works: ML engineers and ML engineer wannabes -cough-me-cough- who have too much time on their hand (or are not properly supervised at work), download the competition’s training set (which looks like fig. 1) and develop machine learning algorithms that learn by going through the examples in the training set. The training set is usually manually labeled. In this case, 1 or 0 in the is_duplicate column indicates whether the questions are identical. Here the training set contains ~420K question pairs.

id question1 question2 is_duplicate
1 Why my answers are collapsed? Why is my answer collapsed at once? 0
2 How do I post a question in Quora? How do I ask a question in Quora? 1
3 Can I fit my booboos in a 65ml jar? Is 1 baba worth 55 booboo (おっぱい) ☃? 0
Fig.1