{"id":515,"date":"2015-04-03T17:00:04","date_gmt":"2015-04-03T23:00:04","guid":{"rendered":"http:\/\/mattdturner.com\/wordpress\/?p=515"},"modified":"2018-08-03T12:17:17","modified_gmt":"2018-08-03T18:17:17","slug":"scrape-keywords-from-indeed-com-job-postings-2","status":"publish","type":"post","link":"http:\/\/mattdturner.com\/wordpress\/2015\/04\/scrape-keywords-from-indeed-com-job-postings-2\/","title":{"rendered":"Scrape Keywords from Indeed.com Job Postings"},"content":{"rendered":"<div id=\"notebook\" class=\"border-box-sizing\" tabindex=\"-1\">\n<div id=\"notebook-container\" class=\"container\">\n<div class=\"cell border-box-sizing text_cell rendered\"><\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>This is code that will pull each job posting for a specific job title in a specific location (or Nationally) and return \/ plot the percentage of the postings that have certain keywords. The code is set up to search for all words except stopwords, and other user-defined words (there is probably a much more efficient way of doing this, but I had no need to change this once I had the code running). This allows the user to see common technical skills, as well as common soft skills that should be included on a resume.<\/p>\n<p>NOTE: I got this idea from <a href=\"https:\/\/jessesw.com\/Data-Science-Skills\/\">https:\/\/jessesw.com\/Data-Science-Skills\/<\/a>. Obviously, just using his code would be of no real benefit to me, as I wanted to use the idea to help better my skills with scraping data from HTML files. So, I used his idea and developed my own code from scratch. I also modified the overall process a bit to better fit my needs.<\/p>\n<p>NOTE2: This code will not be able to identify multiple-word skills. So, for example, &#8216;machine learning&#8217; will show up as either &#8216;machine&#8217; or &#8216;learning&#8217;. 
However, &#8216;machine&#8217; could also show up for phrases other than &#8216;machine learning&#8217;.<\/p>\n<p>To run the code, change the city, state, and job title to whatever you wish. After generating the plot, you might need to add unwanted &#8216;keywords&#8217; to the additional_stop_words list so that they are no longer included.<br \/>\n<!--more--><\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[114]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"kn\">from<\/span> <span class=\"nn\">bs4<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">BeautifulSoup<\/span>\r\n<span class=\"kn\">import<\/span> <span class=\"nn\">urllib<\/span>\r\n<span class=\"kn\">import<\/span> <span class=\"nn\">re<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">time<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">sleep<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">collections<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">Counter<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">nltk.corpus<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">stopwords<\/span>\r\n<span class=\"kn\">import<\/span> <span class=\"nn\">pandas<\/span> <span class=\"kn\">as<\/span> <span class=\"nn\">pd<\/span>\r\n<span class=\"o\">%<\/span><span class=\"k\">matplotlib<\/span> inline\r\n<span class=\"kn\">import<\/span> <span class=\"nn\">matplotlib.pylab<\/span> <span class=\"kn\">as<\/span> <span class=\"nn\">plt<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">matplotlib.backends.backend_pdf<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">PdfPages<\/span>\r\n<span class=\"n\">plt<\/span><span class=\"o\">.<\/span><span class=\"n\">rcParams<\/span><span class=\"p\">[<\/span><span class=\"s\">'figure.figsize'<\/span><span 
class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"p\">(<\/span><span class=\"mf\">10.0<\/span><span class=\"p\">,<\/span> <span class=\"mf\">8.0<\/span><span class=\"p\">)<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Define the city, state, and job title.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[115]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">city<\/span> <span class=\"o\">=<\/span> <span class=\"s\">'Seattle'<\/span>\r\n<span class=\"n\">state<\/span> <span class=\"o\">=<\/span> <span class=\"s\">'WA'<\/span>\r\n<span class=\"n\">job_title<\/span> <span class=\"o\">=<\/span> <span class=\"s\">'Data Scientist'<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Define a function that will take the url and pull out the text of the main body as a list of strings. 
Remove common words such as &#8216;the&#8217;, &#8216;or&#8217;, &#8216;and&#8217;, etc.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[168]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"k\">def<\/span> <span class=\"nf\">clean_the_html<\/span><span class=\"p\">(<\/span><span class=\"n\">url<\/span><span class=\"p\">):<\/span>\r\n    <span class=\"c\"># First try to download the html file<\/span>\r\n    <span class=\"k\">try<\/span><span class=\"p\">:<\/span>\r\n        <span class=\"n\">html<\/span> <span class=\"o\">=<\/span> <span class=\"n\">urllib<\/span><span class=\"o\">.<\/span><span class=\"n\">urlopen<\/span><span class=\"p\">(<\/span><span class=\"n\">url<\/span><span class=\"p\">)<\/span>\r\n    <span class=\"k\">except<\/span><span class=\"p\">:<\/span>\r\n        <span class=\"k\">return<\/span>\r\n    \r\n    <span class=\"c\">#print url<\/span>\r\n    \r\n    <span class=\"c\"># Open html in BeautifulSoup<\/span>\r\n    <span class=\"n\">soup<\/span> <span class=\"o\">=<\/span> <span class=\"n\">BeautifulSoup<\/span><span class=\"p\">(<\/span><span class=\"n\">html<\/span><span class=\"p\">)<\/span>\r\n        \r\n    <span class=\"c\"># Extract all of the text within the &lt;body&gt; tag<\/span>\r\n    <span class=\"n\">text<\/span> <span class=\"o\">=<\/span> <span class=\"n\">soup<\/span><span class=\"o\">.<\/span><span class=\"n\">findAll<\/span><span class=\"p\">(<\/span><span class=\"s\">'body'<\/span><span class=\"p\">)<\/span>\r\n    <span class=\"n\">word_list<\/span> <span class=\"o\">=<\/span> <span class=\"s\">''<\/span>\r\n    <span class=\"k\">for<\/span> <span class=\"n\">line<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">text<\/span><span class=\"p\">:<\/span>\r\n        <span class=\"n\">word_list<\/span> <span class=\"o\">=<\/span> <span 
class=\"s\">' '<\/span><span class=\"o\">.<\/span><span class=\"n\">join<\/span><span class=\"p\">([<\/span><span class=\"n\">word_list<\/span><span class=\"p\">,<\/span><span class=\"n\">line<\/span><span class=\"o\">.<\/span><span class=\"n\">get_text<\/span><span class=\"p\">(<\/span><span class=\"s\">' '<\/span><span class=\"p\">,<\/span><span class=\"n\">strip<\/span><span class=\"o\">=<\/span><span class=\"bp\">True<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">lower<\/span><span class=\"p\">()])<\/span>\r\n    \r\n    <span class=\"c\"># Remove non text characters from list<\/span>\r\n    <span class=\"n\">word_list<\/span> <span class=\"o\">=<\/span> <span class=\"n\">re<\/span><span class=\"o\">.<\/span><span class=\"n\">sub<\/span><span class=\"p\">(<\/span><span class=\"s\">'[^a-zA-Z+3]'<\/span><span class=\"p\">,<\/span><span class=\"s\">' '<\/span><span class=\"p\">,<\/span> <span class=\"n\">word_list<\/span><span class=\"p\">)<\/span>\r\n\r\n    <span class=\"n\">list_of_words<\/span> <span class=\"o\">=<\/span> <span class=\"n\">word_list<\/span><span class=\"o\">.<\/span><span class=\"n\">encode<\/span><span class=\"p\">(<\/span><span class=\"s\">'ascii'<\/span><span class=\"p\">,<\/span><span class=\"s\">'ignore'<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">split<\/span><span class=\"p\">()<\/span>\r\n                   \r\n    <span class=\"n\">stop_words<\/span> <span class=\"o\">=<\/span> <span class=\"nb\">set<\/span><span class=\"p\">(<\/span><span class=\"n\">stopwords<\/span><span class=\"o\">.<\/span><span class=\"n\">words<\/span><span class=\"p\">(<\/span><span class=\"s\">\"english\"<\/span><span class=\"p\">))<\/span>\r\n    <span class=\"n\">additional_stop_words<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"s\">'webfont'<\/span><span class=\"p\">,<\/span><span class=\"s\">'limited'<\/span><span class=\"p\">,<\/span><span 
class=\"s\">'saved'<\/span><span class=\"p\">,<\/span><span class=\"s\">'disability'<\/span><span class=\"p\">,<\/span>\\\r\n                             <span class=\"s\">'desirable'<\/span><span class=\"p\">,<\/span><span class=\"s\">'nreum'<\/span><span class=\"p\">,<\/span><span class=\"s\">'skills'<\/span><span class=\"p\">,<\/span><span class=\"s\">'net'<\/span><span class=\"p\">,<\/span><span class=\"s\">'+'<\/span><span class=\"p\">,<\/span><span class=\"s\">'k'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'above'<\/span><span class=\"p\">,<\/span><span class=\"s\">'it'<\/span><span class=\"p\">,<\/span><span class=\"s\">'end'<\/span><span class=\"p\">,<\/span><span class=\"s\">'excellent'<\/span><span class=\"p\">,<\/span><span class=\"s\">'join'<\/span><span class=\"p\">,<\/span><span class=\"s\">'want'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'how'<\/span><span class=\"p\">,<\/span><span class=\"s\">'well'<\/span><span class=\"p\">,<\/span><span class=\"s\">'sets'<\/span><span class=\"p\">,<\/span><span class=\"s\">'like'<\/span><span class=\"p\">,<\/span><span class=\"s\">'page'<\/span><span class=\"p\">,<\/span><span class=\"s\">'home'<\/span><span class=\"p\">,<\/span><span class=\"s\">'demonstrated'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'other'<\/span><span class=\"p\">,<\/span><span class=\"s\">'re'<\/span><span class=\"p\">,<\/span><span class=\"s\">'size'<\/span><span class=\"p\">,<\/span><span class=\"s\">'etc'<\/span><span class=\"p\">,<\/span><span class=\"s\">'gettime'<\/span><span class=\"p\">,<\/span><span class=\"s\">'work'<\/span><span class=\"p\">,<\/span><span class=\"s\">'ms'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'zqdxyrmad'<\/span><span class=\"p\">,<\/span><span class=\"s\">'description'<\/span><span class=\"p\">,<\/span><span 
class=\"s\">'value'<\/span><span class=\"p\">,<\/span><span class=\"s\">'re'<\/span><span class=\"p\">,<\/span><span class=\"s\">'transactionname'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'education'<\/span><span class=\"p\">,<\/span><span class=\"s\">'daylight'<\/span><span class=\"p\">,<\/span><span class=\"s\">'highly'<\/span><span class=\"p\">,<\/span><span class=\"s\">'bodyrendered'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'amazon'<\/span><span class=\"p\">,<\/span><span class=\"s\">'new'<\/span><span class=\"p\">,<\/span><span class=\"s\">'bam'<\/span><span class=\"p\">,<\/span><span class=\"s\">'techniques'<\/span><span class=\"p\">,<\/span><span class=\"s\">'com'<\/span><span class=\"p\">,<\/span><span class=\"n\">city<\/span><span class=\"o\">.<\/span><span class=\"n\">lower<\/span><span class=\"p\">(),<\/span>\\\r\n                            <span class=\"n\">state<\/span><span class=\"o\">.<\/span><span class=\"n\">lower<\/span><span class=\"p\">(),<\/span><span class=\"s\">'min'<\/span><span class=\"p\">,<\/span><span class=\"s\">'need'<\/span><span class=\"p\">,<\/span><span class=\"s\">'email'<\/span><span class=\"p\">,<\/span><span class=\"s\">'job'<\/span><span class=\"p\">,<\/span><span class=\"s\">'content'<\/span><span class=\"p\">,<\/span><span class=\"s\">'features'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'service'<\/span><span class=\"p\">,<\/span><span class=\"s\">'wa'<\/span><span class=\"p\">,<\/span><span class=\"s\">'id'<\/span><span class=\"p\">,<\/span><span class=\"s\">'modern'<\/span><span class=\"p\">,<\/span><span class=\"s\">'looking'<\/span><span class=\"p\">,<\/span><span class=\"s\">'eastern'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'qualifications'<\/span><span class=\"p\">,<\/span><span class=\"s\">'teams'<\/span><span class=\"p\">,<\/span><span 
class=\"s\">'based'<\/span><span class=\"p\">,<\/span><span class=\"s\">'false'<\/span><span class=\"p\">,<\/span><span class=\"s\">'times'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'software'<\/span><span class=\"p\">,<\/span><span class=\"s\">'career'<\/span><span class=\"p\">,<\/span><span class=\"s\">'ability'<\/span><span class=\"p\">,<\/span><span class=\"s\">'platform'<\/span><span class=\"p\">,<\/span><span class=\"s\">'years'<\/span><span class=\"p\">,<\/span><span class=\"s\">'data'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'date'<\/span><span class=\"p\">,<\/span><span class=\"s\">'product'<\/span><span class=\"p\">,<\/span><span class=\"s\">'team'<\/span><span class=\"p\">,<\/span><span class=\"s\">'time'<\/span><span class=\"p\">,<\/span><span class=\"s\">'agent'<\/span><span class=\"p\">,<\/span><span class=\"s\">'information'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'methods'<\/span><span class=\"p\">,<\/span><span class=\"s\">'candidate'<\/span><span class=\"p\">,<\/span><span class=\"s\">'customers'<\/span><span class=\"p\">,<\/span><span class=\"s\">'back'<\/span><span class=\"p\">,<\/span><span class=\"s\">'info'<\/span><span class=\"p\">,<\/span><span class=\"s\">'scientist'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'experience'<\/span><span class=\"p\">,<\/span><span class=\"s\">'apply'<\/span><span class=\"p\">,<\/span><span class=\"s\">'us'<\/span><span class=\"p\">,<\/span><span class=\"s\">'engineering'<\/span><span class=\"p\">,<\/span><span class=\"s\">'learning'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'strong'<\/span><span class=\"p\">,<\/span><span class=\"s\">'business'<\/span><span class=\"p\">,<\/span><span class=\"s\">'design'<\/span><span class=\"p\">,<\/span><span class=\"s\">'title'<\/span><span 
class=\"p\">,<\/span><span class=\"s\">'large'<\/span><span class=\"p\">,<\/span><span class=\"s\">'e'<\/span><span class=\"p\">,<\/span><span class=\"s\">'document'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'science'<\/span><span class=\"p\">,<\/span><span class=\"s\">'company'<\/span><span class=\"p\">,<\/span><span class=\"s\">'location'<\/span><span class=\"p\">,<\/span><span class=\"s\">'field'<\/span><span class=\"p\">,<\/span><span class=\"s\">'communication'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'customer'<\/span><span class=\"p\">,<\/span><span class=\"s\">'tools'<\/span><span class=\"p\">,<\/span><span class=\"s\">'used'<\/span><span class=\"p\">,<\/span><span class=\"s\">'research'<\/span><span class=\"p\">,<\/span><span class=\"s\">'model'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'opportunity'<\/span><span class=\"p\">,<\/span><span class=\"s\">'online'<\/span><span class=\"p\">,<\/span><span class=\"s\">'including'<\/span><span class=\"p\">,<\/span><span class=\"s\">'degree'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'preferred'<\/span><span class=\"p\">,<\/span><span class=\"s\">'across'<\/span><span class=\"p\">,<\/span><span class=\"s\">'beacon'<\/span><span class=\"p\">,<\/span><span class=\"s\">'using'<\/span><span class=\"p\">,<\/span><span class=\"s\">'friend'<\/span><span class=\"p\">,<\/span><span class=\"s\">'function'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'position'<\/span><span class=\"p\">,<\/span><span class=\"s\">'window'<\/span><span class=\"p\">,<\/span><span class=\"s\">'role'<\/span><span class=\"p\">,<\/span><span class=\"s\">'3'<\/span><span class=\"p\">,<\/span><span class=\"s\">'written'<\/span><span class=\"p\">,<\/span><span class=\"s\">'build'<\/span><span class=\"p\">,<\/span>\\\r\n                     
       <span class=\"s\">'presentation'<\/span><span class=\"p\">,<\/span><span class=\"s\">'getelementbyid'<\/span><span class=\"p\">,<\/span><span class=\"s\">'technical'<\/span><span class=\"p\">,<\/span><span class=\"s\">'posted'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'newrelic'<\/span><span class=\"p\">,<\/span><span class=\"s\">'decision'<\/span><span class=\"p\">,<\/span><span class=\"s\">'log'<\/span><span class=\"p\">,<\/span><span class=\"s\">'errorbeacon'<\/span><span class=\"p\">,<\/span><span class=\"s\">'solutions'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'applicationtime'<\/span><span class=\"p\">,<\/span><span class=\"s\">'enable'<\/span><span class=\"p\">,<\/span><span class=\"s\">'responsibilities'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'models'<\/span><span class=\"p\">,<\/span><span class=\"s\">'applicationid'<\/span><span class=\"p\">,<\/span><span class=\"s\">'complex'<\/span><span class=\"p\">,<\/span><span class=\"s\">'licensekey'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'high'<\/span><span class=\"p\">,<\/span><span class=\"s\">'browser'<\/span><span class=\"p\">,<\/span><span class=\"s\">'d'<\/span><span class=\"p\">,<\/span><span class=\"s\">'nr'<\/span><span class=\"p\">,<\/span><span class=\"s\">'develop'<\/span><span class=\"p\">,<\/span><span class=\"s\">'please'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'selection'<\/span><span class=\"p\">,<\/span><span class=\"s\">'queuetime'<\/span><span class=\"p\">,<\/span><span class=\"s\">'cookies'<\/span><span class=\"p\">,<\/span><span class=\"s\">'icimsaddonload'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'computer'<\/span><span class=\"p\">,<\/span><span class=\"s\">'icims'<\/span><span class=\"p\">,<\/span><span 
class=\"s\">'scientists'<\/span><span class=\"p\">,<\/span><span class=\"s\">'great'<\/span><span class=\"p\">,<\/span><span class=\"s\">'returning'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'systems'<\/span><span class=\"p\">,<\/span><span class=\"s\">'writing'<\/span><span class=\"p\">,<\/span><span class=\"s\">'united'<\/span><span class=\"p\">,<\/span><span class=\"s\">'working'<\/span><span class=\"p\">,<\/span><span class=\"s\">'iframe'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'analyses'<\/span><span class=\"p\">,<\/span><span class=\"s\">'applications'<\/span><span class=\"p\">,<\/span><span class=\"s\">'try'<\/span><span class=\"p\">,<\/span><span class=\"s\">'related'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'states'<\/span><span class=\"p\">,<\/span><span class=\"s\">'languages'<\/span><span class=\"p\">,<\/span><span class=\"s\">'yghvbe'<\/span><span class=\"p\">,<\/span><span class=\"s\">'language'<\/span><span class=\"p\">,<\/span><span class=\"s\">'one'<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'site'<\/span><span class=\"p\">,<\/span><span class=\"s\">'llc,'<\/span><span class=\"p\">,<\/span><span class=\"s\">'category'<\/span><span class=\"p\">,<\/span><span class=\"s\">'personalized'<\/span><span class=\"p\">,<\/span><span class=\"s\">'knowledge'<\/span><span class=\"p\">]<\/span>\r\n    \r\n    <span class=\"c\"># Remove words from list<\/span>\r\n    <span class=\"n\">truncated_list<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"n\">w<\/span> <span class=\"k\">for<\/span> <span class=\"n\">w<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">list_of_words<\/span> <span class=\"k\">if<\/span> <span class=\"ow\">not<\/span> <span class=\"p\">(<\/span><span class=\"n\">w<\/span> <span class=\"ow\">in<\/span> <span 
class=\"n\">stop_words<\/span> <span class=\"ow\">or<\/span> \\\r\n                      <span class=\"n\">w<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">additional_stop_words<\/span><span class=\"p\">)]<\/span>\r\n    \r\n    <span class=\"n\">truncated_set<\/span> <span class=\"o\">=<\/span> <span class=\"nb\">set<\/span><span class=\"p\">(<\/span><span class=\"n\">truncated_list<\/span><span class=\"p\">)<\/span>\r\n    <span class=\"n\">truncated_list<\/span> <span class=\"o\">=<\/span> <span class=\"nb\">list<\/span><span class=\"p\">(<\/span><span class=\"n\">truncated_set<\/span><span class=\"p\">)<\/span>\r\n        \r\n    <span class=\"k\">return<\/span> <span class=\"n\">truncated_list<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Define a function to generate a list of URLs for a given search (e.g., &#8216;Data Scientist&#8217;). Each search result page has 10 non-sponsored links. 
Search the first url for &#8216;Jobs # to # of ###&#8217; in order to determine how many iterations to perform.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[167]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"k\">def<\/span> <span class=\"nf\">gen_url_list<\/span><span class=\"p\">(<\/span><span class=\"n\">city<\/span><span class=\"p\">,<\/span><span class=\"n\">state<\/span><span class=\"p\">,<\/span><span class=\"n\">job_name<\/span><span class=\"p\">):<\/span>\r\n    <span class=\"n\">base_url<\/span> <span class=\"o\">=<\/span> <span class=\"s\">'http:\/\/www.indeed.com\/'<\/span>\r\n    \r\n    <span class=\"n\">job_term<\/span> <span class=\"o\">=<\/span> <span class=\"n\">re<\/span><span class=\"o\">.<\/span><span class=\"n\">sub<\/span><span class=\"p\">(<\/span><span class=\"s\">' '<\/span><span class=\"p\">,<\/span><span class=\"s\">'+'<\/span><span class=\"p\">,<\/span><span class=\"n\">job_name<\/span><span class=\"o\">.<\/span><span class=\"n\">lower<\/span><span class=\"p\">())<\/span>\r\n    \r\n    <span class=\"n\">search_url<\/span> <span class=\"o\">=<\/span> <span class=\"s\">''<\/span><span class=\"o\">.<\/span><span class=\"n\">join<\/span><span class=\"p\">([<\/span><span class=\"n\">base_url<\/span><span class=\"p\">,<\/span><span class=\"s\">'jobs?q='<\/span><span class=\"p\">,<\/span><span class=\"n\">job_term<\/span><span class=\"p\">,<\/span><span class=\"s\">'&amp;l='<\/span><span class=\"p\">,<\/span><span class=\"n\">city<\/span><span class=\"p\">,<\/span><span class=\"s\">'%2C+'<\/span><span class=\"p\">,<\/span><span class=\"n\">state<\/span><span class=\"p\">])<\/span>\r\n    \r\n    <span class=\"k\">try<\/span><span class=\"p\">:<\/span>\r\n        <span class=\"n\">html<\/span> <span class=\"o\">=<\/span> <span 
class=\"n\">urllib<\/span><span class=\"o\">.<\/span><span class=\"n\">urlopen<\/span><span class=\"p\">(<\/span><span class=\"n\">search_url<\/span><span class=\"p\">)<\/span>\r\n    <span class=\"k\">except<\/span><span class=\"p\">:<\/span>\r\n        <span class=\"k\">return<\/span>\r\n    \r\n    <span class=\"n\">soup<\/span> <span class=\"o\">=<\/span> <span class=\"n\">BeautifulSoup<\/span><span class=\"p\">(<\/span><span class=\"n\">html<\/span><span class=\"p\">)<\/span>\r\n    \r\n    <span class=\"n\">total_jobs<\/span> <span class=\"o\">=<\/span> <span class=\"n\">soup<\/span><span class=\"o\">.<\/span><span class=\"n\">find<\/span><span class=\"p\">(<\/span><span class=\"nb\">id<\/span> <span class=\"o\">=<\/span> <span class=\"s\">'searchCount'<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">string<\/span><span class=\"o\">.<\/span><span class=\"n\">encode<\/span><span class=\"p\">(<\/span><span class=\"s\">'utf-8'<\/span><span class=\"p\">)<\/span>\r\n    <span class=\"n\">job_nums<\/span> <span class=\"o\">=<\/span> <span class=\"nb\">int<\/span><span class=\"p\">([<\/span><span class=\"nb\">int<\/span><span class=\"p\">(<\/span><span class=\"n\">s<\/span><span class=\"p\">)<\/span> <span class=\"k\">for<\/span> <span class=\"n\">s<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">total_jobs<\/span><span class=\"o\">.<\/span><span class=\"n\">split<\/span><span class=\"p\">()<\/span> <span class=\"k\">if<\/span> <span class=\"n\">s<\/span><span class=\"o\">.<\/span><span class=\"n\">isdigit<\/span><span class=\"p\">()][<\/span><span class=\"o\">-<\/span><span class=\"mi\">1<\/span><span class=\"p\">]<\/span><span class=\"o\">\/<\/span><span class=\"mi\">10<\/span><span class=\"p\">)<\/span>\r\n    <span class=\"k\">print<\/span> <span class=\"n\">total_jobs<\/span>\r\n\r\n    <span class=\"n\">job_URLS<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[]<\/span>\r\n    <span class=\"k\">for<\/span> <span 
class=\"n\">i<\/span> <span class=\"ow\">in<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"n\">job_nums<\/span><span class=\"o\">+<\/span><span class=\"mi\">1<\/span><span class=\"p\">):<\/span>\r\n        <span class=\"k\">if<\/span> <span class=\"n\">i<\/span> <span class=\"o\">%<\/span> <span class=\"mi\">10<\/span> <span class=\"o\">==<\/span> <span class=\"mi\">0<\/span><span class=\"p\">:<\/span>\r\n            <span class=\"k\">print<\/span> <span class=\"n\">i<\/span>\r\n        <span class=\"c\"># Build the URL for results page i; 'start' is the offset of the first result<\/span>\r\n        <span class=\"n\">page_url<\/span> <span class=\"o\">=<\/span> <span class=\"s\">''<\/span><span class=\"o\">.<\/span><span class=\"n\">join<\/span><span class=\"p\">([<\/span><span class=\"n\">base_url<\/span><span class=\"p\">,<\/span><span class=\"s\">'jobs?q='<\/span><span class=\"p\">,<\/span><span class=\"n\">job_term<\/span><span class=\"p\">,<\/span><span class=\"s\">'&amp;l='<\/span><span class=\"p\">,<\/span><span class=\"n\">city<\/span><span class=\"p\">,<\/span><span class=\"s\">'%2C+'<\/span><span class=\"p\">,<\/span><span class=\"n\">state<\/span><span class=\"p\">,<\/span>\\\r\n                            <span class=\"s\">'&amp;start='<\/span><span class=\"p\">,<\/span><span class=\"nb\">str<\/span><span class=\"p\">(<\/span><span class=\"n\">i<\/span><span class=\"o\">*<\/span><span class=\"mi\">10<\/span><span class=\"p\">)])<\/span>\r\n        <span class=\"n\">html<\/span> <span class=\"o\">=<\/span> <span class=\"n\">urllib<\/span><span class=\"o\">.<\/span><span class=\"n\">urlopen<\/span><span class=\"p\">(<\/span><span class=\"n\">page_url<\/span><span class=\"p\">)<\/span>\r\n        \r\n        <span class=\"n\">soup<\/span> <span class=\"o\">=<\/span> <span class=\"n\">BeautifulSoup<\/span><span class=\"p\">(<\/span><span class=\"n\">html<\/span><span class=\"p\">)<\/span>\r\n        \r\n        <span class=\"n\">job_link_area<\/span> <span class=\"o\">=<\/span> <span class=\"n\">soup<\/span><span 
class=\"o\">.<\/span><span class=\"n\">findAll<\/span><span class=\"p\">(<\/span><span class=\"s\">'h2'<\/span><span class=\"p\">,{<\/span><span class=\"s\">'class'<\/span><span class=\"p\">:<\/span><span class=\"s\">'jobtitle'<\/span><span class=\"p\">})<\/span>\r\n\r\n        <span class=\"k\">for<\/span> <span class=\"n\">link<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">job_link_area<\/span><span class=\"p\">:<\/span>\r\n            <span class=\"n\">match_href<\/span> <span class=\"o\">=<\/span> <span class=\"n\">re<\/span><span class=\"o\">.<\/span><span class=\"n\">search<\/span><span class=\"p\">(<\/span><span class=\"s\">'&lt;a\\shref=\"(.+?)\"'<\/span><span class=\"p\">,<\/span><span class=\"nb\">str<\/span><span class=\"p\">(<\/span><span class=\"n\">link<\/span><span class=\"p\">))<\/span>\r\n            <span class=\"k\">if<\/span> <span class=\"n\">match_href<\/span><span class=\"p\">:<\/span>\r\n                <span class=\"n\">job_URLS<\/span><span class=\"o\">.<\/span><span class=\"n\">append<\/span><span class=\"p\">([<\/span><span class=\"n\">base_url<\/span> <span class=\"o\">+<\/span> <span class=\"n\">match_href<\/span><span class=\"o\">.<\/span><span class=\"n\">group<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">)])<\/span>\r\n\r\n    <span class=\"k\">return<\/span> <span class=\"n\">job_URLS<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Now that we have a list of all of the URLs of job postings, pull the information from each site, clean the data, and populate the keyword list.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[164]:<\/div>\n<div class=\"inner_cell\">\n<div 
class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"k\">def<\/span> <span class=\"nf\">job_posting_analysis<\/span><span class=\"p\">(<\/span><span class=\"n\">url_list<\/span><span class=\"p\">):<\/span>\r\n    <span class=\"n\">job_skills<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[]<\/span>\r\n    <span class=\"n\">count<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">0<\/span>\r\n    <span class=\"k\">for<\/span> <span class=\"n\">url<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">url_list<\/span><span class=\"p\">:<\/span>\r\n        <span class=\"n\">count<\/span> <span class=\"o\">+=<\/span> <span class=\"mi\">1<\/span>\r\n        <span class=\"k\">if<\/span> <span class=\"n\">count<\/span> <span class=\"o\">%<\/span> <span class=\"mi\">10<\/span> <span class=\"o\">==<\/span> <span class=\"mi\">1<\/span><span class=\"p\">:<\/span>\r\n            <span class=\"k\">print<\/span> <span class=\"n\">count<\/span>\r\n        \r\n        <span class=\"n\">posting_keywords<\/span> <span class=\"o\">=<\/span> <span class=\"n\">clean_the_html<\/span><span class=\"p\">(<\/span><span class=\"n\">url<\/span><span class=\"p\">[<\/span><span class=\"mi\">0<\/span><span class=\"p\">])<\/span>\r\n        <span class=\"k\">if<\/span> <span class=\"n\">posting_keywords<\/span><span class=\"p\">:<\/span>\r\n            <span class=\"n\">job_skills<\/span><span class=\"o\">.<\/span><span class=\"n\">append<\/span><span class=\"p\">(<\/span><span class=\"n\">posting_keywords<\/span><span class=\"p\">)<\/span>\r\n        <span class=\"n\">sleep<\/span><span class=\"p\">(<\/span><span class=\"mf\">0.5<\/span><span class=\"p\">)<\/span>\r\n        \r\n    <span class=\"k\">return<\/span> <span class=\"n\">job_skills<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div 
class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Now that the various functions are defined, run the code.<\/p>\n<p>First: run gen_url_list for the specified city, state, and job title to generate the list of job posting links.<\/p>\n<p>Second: run job_posting_analysis to pull out the keywords (job_skills) listed in each job posting.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[169]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"k\">print<\/span> <span class=\"s\">'Crawl indeed.com for '<\/span> <span class=\"o\">+<\/span> <span class=\"n\">city<\/span> <span class=\"o\">+<\/span> <span class=\"s\">', '<\/span> <span class=\"o\">+<\/span> <span class=\"n\">state<\/span> <span class=\"o\">+<\/span> <span class=\"s\">' '<\/span> <span class=\"o\">+<\/span> <span class=\"n\">job_title<\/span> <span class=\"o\">+<\/span> \\\r\n<span class=\"s\">' postings and generate a list of all of the job posting links'<\/span>\r\n\r\n<span class=\"n\">url_list<\/span> <span class=\"o\">=<\/span> <span class=\"n\">gen_url_list<\/span><span class=\"p\">(<\/span><span class=\"n\">city<\/span><span class=\"p\">,<\/span><span class=\"n\">state<\/span><span class=\"p\">,<\/span><span class=\"n\">job_title<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"k\">print<\/span> <span class=\"s\">\"Given the job posting links, pull out the keywords for each posting\"<\/span>\r\n\r\n<span class=\"n\">job_skills<\/span> <span class=\"o\">=<\/span> <span class=\"n\">job_posting_analysis<\/span><span class=\"p\">(<\/span><span class=\"n\">url_list<\/span><span class=\"p\">)<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt\"><\/div>\n<div class=\"output_subarea output_stream 
output_stdout output_text\">\n<pre>Crawl indeed.com for Seattle, WA Data Scientist postings and generate a list of all of the job posting links\r\nJobs 1 to 10 of 725\r\n0\r\n10\r\n20\r\n30\r\n40\r\n50\r\n60\r\n70\r\nGiven the job posting links, pull out the keywords for each posting that is found in the provided keywords_input variable\r\n1\r\n11\r\n<strong>truncated for readability<\/strong>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Now that we have the list of keywords for each job posting, count the number of postings in which each keyword appears, convert the counts to percentages of all postings, and plot the most common keywords on a bar graph.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[171]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">skill_frequency<\/span> <span class=\"o\">=<\/span> <span class=\"n\">Counter<\/span><span class=\"p\">()<\/span> <span class=\"c\"># This will create a full counter of our terms. 
<\/span>\r\n<span class=\"k\">for<\/span> <span class=\"n\">item<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">job_skills<\/span><span class=\"p\">:<\/span> <span class=\"c\"># add the keywords of each posting to the counter<\/span>\r\n    <span class=\"n\">skill_frequency<\/span><span class=\"o\">.<\/span><span class=\"n\">update<\/span><span class=\"p\">(<\/span><span class=\"n\">item<\/span><span class=\"p\">)<\/span>\r\n<span class=\"k\">print<\/span> <span class=\"n\">skill_frequency<\/span><span class=\"o\">.<\/span><span class=\"n\">items<\/span><span class=\"p\">()<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt\"><\/div>\n<div class=\"output_subarea output_stream output_stdout output_text\">\n<pre>[('addedtojobcart', 73), ('applicationstatusdetail', 73), ('auc', 73), ('matlab', 365), ('worth', 73), ('merchant', 73), ('collaborate', 219), ('every', 146), ('tagging', 73), ('skillz', 73), ('companies', 219), ('vector', 73), ('clicktracks', 73), ('enhance', 146), ('enjoy', 73), ('leaders', 146), ('direct', 73), ('rigorous', 73), ('machines', 73), ('even', 73), ('hide', 73), ('selected', 73), ('children', 73), ('designing', 73), ('supplies', 73), ('centric', 73), ('behavior', 73), ('men', 73), ('createde', 73), ('hundreds', 73), ('employees', 146), ('economics', 146), ('reports', 73), <strong>truncated for readability <\/strong>]\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[172]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">data_to_plot<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"o\">.<\/span><span class=\"n\">DataFrame<\/span><span class=\"p\">(<\/span><span class=\"n\">skill_frequency<\/span><span 
class=\"o\">.<\/span><span class=\"n\">items<\/span><span class=\"p\">(),<\/span><span class=\"n\">columns<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"s\">'Skill'<\/span><span class=\"p\">,<\/span><span class=\"s\">'Occurances'<\/span><span class=\"p\">])<\/span>\r\n\r\n<span class=\"n\">data_to_plot<\/span><span class=\"o\">.<\/span><span class=\"n\">Occurances<\/span> <span class=\"o\">=<\/span> <span class=\"p\">(<\/span><span class=\"n\">data_to_plot<\/span><span class=\"o\">.<\/span><span class=\"n\">Occurances<\/span><span class=\"p\">)<\/span><span class=\"o\">*<\/span><span class=\"mi\">100<\/span><span class=\"o\">\/<\/span><span class=\"nb\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">job_skills<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"n\">data_to_plot<\/span><span class=\"o\">.<\/span><span class=\"n\">sort<\/span><span class=\"p\">(<\/span><span class=\"n\">columns<\/span> <span class=\"o\">=<\/span> <span class=\"s\">'Occurances'<\/span><span class=\"p\">,<\/span><span class=\"n\">ascending<\/span> <span class=\"o\">=<\/span> <span class=\"bp\">False<\/span><span class=\"p\">,<\/span><span class=\"n\">inplace<\/span> <span class=\"o\">=<\/span> <span class=\"bp\">True<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"n\">test_data<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data_to_plot<\/span><span class=\"o\">.<\/span><span class=\"n\">head<\/span><span class=\"p\">(<\/span><span class=\"mi\">15<\/span><span class=\"p\">)<\/span>  <span class=\"c\"># plot only top 15 skills<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[173]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"k\">print<\/span> <span class=\"n\">data_to_plot<\/span><span class=\"o\">.<\/span><span 
class=\"n\">head<\/span><span class=\"p\">(<\/span><span class=\"mi\">20<\/span><span class=\"p\">)<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt\"><\/div>\n<div class=\"output_subarea output_stream output_stdout output_text\">\n<pre>            Skill  Occurances\r\n1148       python   80.109739\r\n66        machine   80.109739\r\n500             r   80.109739\r\n329           sql   80.109739\r\n468    statistics   70.096022\r\n431   statistical   70.096022\r\n544            js   60.082305\r\n476   programming   60.082305\r\n1382          pig   60.082305\r\n1066         hive   60.082305\r\n934        hadoop   60.082305\r\n77     algorithms   50.068587\r\n187     analytics   50.068587\r\n55      scripting   50.068587\r\n1242   predictive   50.068587\r\n161            d3   50.068587\r\n3          matlab   50.068587\r\n943      analysis   50.068587\r\n610         world   40.054870\r\n341           llc   40.054870\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[174]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">frame<\/span> <span class=\"o\">=<\/span> <span class=\"n\">test_data<\/span><span class=\"o\">.<\/span><span class=\"n\">plot<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span><span class=\"o\">=<\/span><span class=\"s\">'Skill'<\/span><span class=\"p\">,<\/span><span class=\"n\">kind<\/span><span class=\"o\">=<\/span><span class=\"s\">'bar'<\/span><span class=\"p\">,<\/span><span class=\"n\">legend<\/span><span class=\"o\">=<\/span><span class=\"bp\">None<\/span><span class=\"p\">,<\/span>\\\r\n                  <span class=\"n\">title<\/span><span class=\"o\">=<\/span><span class=\"s\">'Percentage of Data Scientist Job 
Postings with each Skill, ('<\/span>\\\r\n                  <span class=\"o\">+<\/span> <span class=\"n\">city<\/span> <span class=\"o\">+<\/span> <span class=\"s\">', '<\/span> <span class=\"o\">+<\/span> <span class=\"n\">state<\/span> <span class=\"o\">+<\/span> <span class=\"s\">')'<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"c\">#plt.ylim([40,90])<\/span>\r\n\r\n<span class=\"n\">fig<\/span> <span class=\"o\">=<\/span> <span class=\"n\">frame<\/span><span class=\"o\">.<\/span><span class=\"n\">get_figure<\/span><span class=\"p\">()<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"prompt\"><\/div>\n<div class=\"output_png output_subarea \"><img decoding=\"async\" src=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2015\/04\/Seattle_WA_skills.jpg\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In\u00a0[175]:<\/div>\n<div class=\"inner_cell\">\n<div class=\"input_area\">\n<div class=\" highlight hl-ipython2\">\n<pre><span class=\"n\">filename<\/span> <span class=\"o\">=<\/span> <span class=\"s\">'_'<\/span><span class=\"o\">.<\/span><span class=\"n\">join<\/span><span class=\"p\">([<\/span><span class=\"n\">city<\/span><span class=\"p\">,<\/span><span class=\"n\">state<\/span><span class=\"p\">,<\/span><span class=\"s\">'skills'<\/span><span class=\"p\">])<\/span>\r\n<span class=\"n\">filename<\/span> <span class=\"o\">=<\/span> <span class=\"s\">''<\/span><span class=\"o\">.<\/span><span class=\"n\">join<\/span><span class=\"p\">([<\/span><span class=\"n\">filename<\/span><span class=\"p\">,<\/span><span class=\"s\">'.pdf'<\/span><span class=\"p\">])<\/span>\r\n<span class=\"n\">pp<\/span> <span class=\"o\">=<\/span> <span class=\"n\">PdfPages<\/span><span class=\"p\">(<\/span><span class=\"n\">filename<\/span><span 
class=\"p\">)<\/span>\r\n<span class=\"n\">pp<\/span><span class=\"o\">.<\/span><span class=\"n\">savefig<\/span><span class=\"p\">(<\/span><span class=\"n\">fig<\/span><span class=\"p\">)<\/span>\r\n<span class=\"n\">pp<\/span><span class=\"o\">.<\/span><span class=\"n\">close<\/span><span class=\"p\">()<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>This is code that will pull each job posting for a specific job title in a specific location (or Nationally) and return \/ plot the percentage of the postings that have certain keywords. The code is set up to search for all words except stopwords, and other user-defined words (there is probably a much more 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":519,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[3,235,234,64,11,233,20],"tags":[247,30,241,240,236,246,239,238,80,237,245,244],"class_list":["post-515","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-command-line","category-data-science","category-html-scraping","category-linux-2","category-mac","category-python","category-scripting","tag-beautifulsoup","tag-command-line-2","tag-data","tag-data-science","tag-html","tag-indeed-com","tag-matplotlib","tag-pandas","tag-python","tag-scraping","tag-stopwords","tag-urllib"],"_links":{"self":[{"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/posts\/515"}],"collection":[{"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/comments?post=515"}],"version-history":[{"count":4,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/posts\/515\/revisions"}],"predecessor-version":[{"id":6145,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/posts\/515\/revisions\/6145"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/media\/519"}],"wp:attachment":[{"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/media?parent=515"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/categories?post=515"},{"taxonomy":"post_tag","embeddable":true,"h
ref":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/tags?post=515"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}