Problem Statement

As a data scientist for the marketing department at Reddit, I need to find the most predictive keywords and/or phrases to accurately classify the dating advice and relationship advice subreddit pages, so that we can use them to determine which advertisements should populate each page. Because this is a classification problem, I'll use Logistic Regression and Bayes models. Misclassifications in this case would be fairly harmless, so I will use the accuracy score, with a baseline of 63.3%, to measure success. Using TfidfVectorizer, I'll get the feature importance to determine which words have the highest predictive power for the target variable. If successful, this model could also be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the relationship-advice-scrape and dating-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.

Data Cleaning and EDA

  • Dropped rows with a null selftext column because those rows are useless to me.
  • Combined the title and selftext columns into one new all_text column.
  • Examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
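The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration on a toy DataFrame, not the notebook code: the `subreddit` column name, the sample rows, and the word-count column names are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the scraped posts (hypothetical rows and column names;
# the real data lives in the CSVs in the dataset folder).
posts = pd.DataFrame(
    {
        "subreddit": ["dating_advice", "relationship_advice", "dating_advice"],
        "title": ["First date tips", "Need help with my partner", "Texting question"],
        "selftext": [
            "Where should we go for dinner?",
            None,
            "How long should I wait to reply?",
        ],
    }
)

# Drop rows whose selftext is null.
cleaned = posts.dropna(subset=["selftext"]).copy()

# Combine title and selftext into a single all_text column.
cleaned["all_text"] = cleaned["title"] + " " + cleaned["selftext"]

# Word counts per post, used to compare the two subreddits.
cleaned["title_word_count"] = cleaned["title"].str.split().str.len()
cleaned["selftext_word_count"] = cleaned["selftext"].str.split().str.len()

# Average word counts per subreddit.
summary = cleaned.groupby("subreddit")[
    ["title_word_count", "selftext_word_count"]
].mean()
print(summary)
```

Plotting histograms of the two word-count columns per subreddit would be the natural next step for the distribution comparison.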

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always select the value that occurs most often, I will be right 63.3% of the time.
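As a small illustration of how a majority-class baseline like this is computed, here is a sketch with made-up labels. The 19/11 split and the choice of which subreddit is the majority class are assumptions picked only to reproduce a ratio of about 0.633.

```python
from collections import Counter

# Hypothetical labels: 1 = relationship advice, 0 = dating advice.
# The counts are invented so that the majority share is roughly 63.3%.
labels = [1] * 19 + [0] * 11

# The baseline is the accuracy of always predicting the most common class.
counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class, round(baseline_accuracy, 3))
```

Any model worth keeping needs a test accuracy clearly above this number.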

First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99% | test: 75% | cross val: 74%.

Second attempt: tried CountVectorizer with stemming preprocessing on the first set of scrapes. Pretty bad score with high variance. Train: 99% | test: 72%.

  • Tried decreasing max features and the score got worse.
  • Tried lemmatizer preprocessing instead and the test score went up to 74%.

Simply increasing the amount of data and stratifying y in my train/test split increased my cvec test score to 81% and cross val to 80%. Adding two parameters to my CountVectorizer helped a lot: a min_df of 3 and an ngram_range of (1,2) increased my test score to 83.2% and cross val to 82.3%. However, these scores disappeared.
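A minimal sketch of this setup with scikit-learn is shown below. The toy sentences, the 25% test size, `random_state`, and `max_iter` are assumptions for illustration; only `min_df=3`, `ngram_range=(1, 2)`, and the stratified split come from the description above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the all_text column.
dating = [
    "how do i ask her on a first date",
    "what should i text after the first date",
    "nervous about a first date this weekend",
    "is it ok to text first after a date",
]
relationship = [
    "my boyfriend and i keep arguing about money",
    "my girlfriend wants to move in together",
    "my husband forgot our anniversary again",
    "my wife and i keep arguing about chores",
]
texts = dating * 5 + relationship * 5
labels = [0] * 20 + [1] * 20  # 0 = dating advice, 1 = relationship advice

# Stratify on the labels so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# CountVectorizer with the two parameters mentioned above, then logistic regression.
model = make_pipeline(
    CountVectorizer(min_df=3, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
print(train_score, test_score, cv_score)
```

Putting the vectorizer inside the pipeline means it is refit on each cross-validation fold, so the cross val score is not inflated by vocabulary learned from held-out posts.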

I think Tfidf worked the best to decrease my overfitting (high variance) problem because I customized the stop words to take out the ones that were really too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them a bit more to improve all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My original list of features had plenty of gibberish words and typos. Requiring a word to show up at least 2 times helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped to get rid of oversaturated words as well. Lastly, setting max features to 5000 cut my columns down to about a quarter of what they were, to focus on just the most frequently used words of what was left.

Summary and Recommendations

Even though I would like to have higher train and test scores, I was able to successfully reduce the variance, and there are definitely several words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see whether different stop words make a difference.


Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.
