methodology
This project can be broken down into three parts: gathering data, processing data, and analyzing data.
gathering data
I started by using Semrush’s Open.Trends service to find the top websites in each country across all industries. While this can be done manually, I automated the process with the Python libraries BeautifulSoup and Selenium (the Requests library would also work here, but I already had Selenium imported). Here’s some pseudo-code to give you an idea of how it was done:
```python
import pandas
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()

# get the list of countries Open.Trends has listed on their site
# (getCountries and getTableData are helper functions defined elsewhere)
countries = getCountries()

# initialize a dictionary to store the information
d = {'country': [], 'website': [], 'visits': []}

# iterate through that list
for country in countries:
    # follow Semrush's URL formatting and plug in the country using an f-string
    url = f'https://www.semrush.com/trending-websites/{country}/all'
    # navigate to the URL using Selenium WebDriver
    driver.get(url)
    # feed the page source into BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # extract the table data using BeautifulSoup
    results = getTableData(soup)
    # append (not overwrite) this country's rows
    d['country'].extend([country] * len(results['website']))
    d['website'].extend(results['website'])
    d['visits'].extend(results['visits'])

# save this into a CSV file
df = pandas.DataFrame(d)
df.to_csv('popular_websites.csv', index=False)
```
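For completeness, here is one way `getTableData` might look. The markup below is a stand-in: the real Semrush page uses its own class names and structure, so treat the plain `<table>`/`<td>` selectors as assumptions you'd adjust after inspecting the actual page source. The country column can be attached by the caller, since the page itself is already country-specific.

```python
# Hypothetical sketch of getTableData(); the table structure is assumed,
# not taken from the real Semrush markup.
from bs4 import BeautifulSoup

def getTableData(soup):
    """Pull (website, visits) pairs out of a parsed rankings page."""
    results = {'website': [], 'visits': []}
    for row in soup.select('table tbody tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if len(cells) >= 2:
            results['website'].append(cells[0])
            results['visits'].append(cells[1])
    return results

# quick check against a toy page
html = """
<table><tbody>
  <tr><td>example.com</td><td>1.2B</td></tr>
  <tr><td>example.org</td><td>900M</td></tr>
</tbody></table>
"""
data = getTableData(BeautifulSoup(html, 'html.parser'))
```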
NOTE: the quality of this data depends on the accuracy of Semrush’s methodology. I didn’t look too deeply into that, because their rankings were comparable to those from similar services.
You should now have a table of the most popular websites in each country. A lot of those websites will be porn, malware, or both. Let’s try to filter some of those out using the Cyren URL Lookup API, a service that uses “machine learning, heuristics, and human analysis” to categorize websites.
Here’s more pseudocode:
```python
# iterate through all the websites we found
for i in range(len(df['website'])):
    # select the website
    url = df.loc[i, 'website']
    # call the API on the website
    category = getCategory(url)
    # save the results
    df.loc[i, 'category'] = category

# filter out all the undesirable categories
undesirable = [...]
df = df.loc[~df['category'].isin(undesirable)]

# save this dataframe to avoid needing to do this all over again
df.to_csv('popular_websites_filtered.csv', index=False)
```
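The filtering step relies on pandas’ `Series.isin`: a plain `df['category'] in undesirable` isn’t valid pandas, and negating `isin` with `~` is what keeps only the rows whose category is *not* on the blocklist. A minimal demonstration with made-up categories:

```python
import pandas as pd

# toy data; the category labels are made up for illustration
df = pd.DataFrame({
    'website':  ['news.example', 'adult.example', 'bad.example'],
    'category': ['News', 'Pornography', 'Malware'],
})

undesirable = ['Pornography', 'Malware']

# ~isin(...) keeps only rows whose category is NOT on the blocklist
filtered = df.loc[~df['category'].isin(undesirable)].reset_index(drop=True)
```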
NOTE: Cyren URL Lookup API has 1,000 free queries per month per user.
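With only 1,000 free queries a month, it’s worth caching every lookup so re-runs don’t burn quota. A minimal sketch of that idea (the cache file name and the `lookup` callable are placeholders for however you actually call the API):

```python
import json
import os

CACHE_PATH = 'category_cache.json'  # hypothetical cache file name

def load_cache(path=CACHE_PATH):
    """Load previously fetched categories, or start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    """Persist the cache between runs."""
    with open(path, 'w') as f:
        json.dump(cache, f)

def cached_category(url, lookup, cache):
    """Only hit the API (via `lookup`) for URLs we haven't seen before."""
    if url not in cache:
        cache[url] = lookup(url)
    return cache[url]

# demo with a stub lookup that counts how often it's actually called
calls = []
def fake_lookup(url):
    calls.append(url)
    return 'News'

cache = load_cache(path='nonexistent.json')  # starts empty
first = cached_category('example.com', fake_lookup, cache)
second = cached_category('example.com', fake_lookup, cache)
```

The second call comes straight from the cache, so the stub is only invoked once.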