Category Archives: Linux

Scrape Keywords from Indeed.com Job Postings

Job Posting Crawler

This code pulls every job posting for a given job title in a given location (or nationally) and returns and plots the percentage of postings that contain each keyword. It searches for all words except stopwords and other user-defined words (there is probably a much more efficient way of doing this, but I had no need to change it once the code was running). This lets the user see the common technical skills, as well as the common soft skills, that should be included on a resume.

NOTE: I got this idea from https://jessesw.com/Data-Science-Skills/. Obviously, just using his code would be of no real benefit to me, as I wanted to use the idea to help better my skills with scraping data from HTML files. So, I used his idea and developed my own code from scratch. I also modified the overall process a bit to better fit my needs.

NOTE2: This code cannot identify multiple-word skills. For example, ‘machine learning’ will show up as either ‘machine’ or ‘learning’, and ‘machine’ could also show up because of phrases other than ‘machine learning’.

To run the code, change the city, state, and job title to whichever you wish. After generating the plot, you might need to add ‘keywords’ to the attitional_stop_words list if you do not want them to be included.
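
The counting step described above can be sketched in shell. This is not the author's code, only a minimal illustration under stated assumptions: the postings have already been saved as plain-text files (one `.txt` file per posting), and the function name `keyword_percent`, the directory layout, and the stop-word file are all hypothetical.

```shell
# keyword_percent DIR STOPWORD_FILE
# Prints "<percent> <word>" for every word that is not a stop word, where
# <percent> is the share of postings in DIR that contain the word at least once.
# Assumes one posting per DIR/*.txt file and one stop word per line in STOPWORD_FILE.
keyword_percent() {
    dir=$1
    stopwords=$2
    set -- "$dir"/*.txt
    [ -f "$1" ] || { echo "no postings found in $dir" >&2; return 1; }
    total=$#
    for f in "$@"; do
        # Lowercase, split on anything that is not a letter, and keep each word
        # once per posting so repeated words are not counted twice.
        tr '[:upper:]' '[:lower:]' < "$f" | tr -cs '[:alpha:]' '\n' | sort -u
    done |
        grep -v -x -F -f "$stopwords" |
        awk -v total="$total" 'NF { count[$1]++ }
            END { for (w in count) printf "%d %s\n", 100 * count[w] / total, w }' |
        sort -k1,1nr -k2,2
}

# Example call (paths are placeholders):
#   keyword_percent ./postings ./stopwords.list
```

Because the text is split into single words, ‘machine learning’ is counted as ‘machine’ and ‘learning’ separately, which matches the limitation noted in NOTE2.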

Backup with rsync using SSH Tunneling

For those of you who read my blog often, you know that I admin the cluster that our research group uses here at CU Boulder.  Because of this, I get a lot of questions from users who don’t want to take the time to solve their own problems.  Fairly recently, our RAID-6 crashed (a fourth drive died and we had to rebuild the array).  Normally this wouldn’t be much of a problem, as most of the files saved on our storage drive are just input files that we can re-download from a separate server, or so I thought.  Personally, I keep all my source code in my home folder, backed up on our data server and again on my personal laptop.  For researchers in our group who are developing code, not having a backup of source code can mean many months of lost work.  As it turns out, many of the people in our group had their source code on our data server (the one that crashed), with no backup anywhere, so months of work were lost.  Since the rebuild, I have gotten many questions on how to set up an SSH tunnel so that users can back up from our cluster, through the front end, to their home computers.
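
One common way to do this, sketched here under assumptions rather than taken from the full post, is to forward a local port through the front end with `ssh -L` and then point rsync at that port. Every host name, user name, port, and path below is a hypothetical placeholder.

```shell
# Hypothetical placeholders: replace with your own account, hosts, and paths.
FRONT_END="jdoe@frontend.example.edu"   # publicly reachable front end
DATA_HOST="dataserver"                  # internal host name, as seen from the front end
CLUSTER_USER="jdoe"                     # your account on the internal host
LOCAL_PORT=2222                         # any free port on the home computer
REMOTE_DIR="/data/jdoe/src/"            # directory to back up
LOCAL_DIR="$HOME/cluster_backup/"       # destination on the home computer

open_tunnel() {
    # -f: go to the background after authentication, -N: run no remote command,
    # -L: forward localhost:$LOCAL_PORT to port 22 on $DATA_HOST via the front end.
    ssh -f -N -L "$LOCAL_PORT:$DATA_HOST:22" "$FRONT_END"
}

pull_backup() {
    # -a: archive mode, -v: verbose, -z: compress in transit,
    # -e: have rsync use ssh on the forwarded local port.
    rsync -avz -e "ssh -p $LOCAL_PORT" "$CLUSTER_USER@localhost:$REMOTE_DIR" "$LOCAL_DIR"
}

# Run from the home computer:
#   open_tunnel && pull_backup
```

The tunnel stays open in the background after `open_tunnel` returns, so repeated `pull_backup` calls can reuse it until the ssh process is killed.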


Add Infiniband interface to ifconfig

We recently had an issue where we had to rebuild our RAID-6 array.  After rebuilding the array, our cluster did not automatically locate and mount our high-capacity storage array.  To fix this problem, we had to add an interface configuration file for the InfiniBand device (ib0), following the steps below:

1. As root on the server that will be connected to the high-capacity storage server, create the configuration file:

vi /etc/sysconfig/network-scripts/ifcfg-ib0

Add the following to the file:

DEVICE=ib0
TYPE=Infiniband
BOOTPROTO=static
BROADCAST=172.30.255.255
IPADDR=172.30.1.11
NETMASK=255.255.0.0
ONBOOT=yes

Of course your BROADCAST, IPADDR, and NETMASK will be different from those set here.
Some Notes:

  • The filename is ifcfg-ib0, the configuration file for device ib0 (note that this is a zero, not the letter o).
  • BROADCAST is the broadcast IP address.
  • NETMASK is the netmask value.
  • BOOTPROTO is the boot protocol, where the value is one of the following: (a) none – no boot-time protocol should be used, (b) dhcp – the DHCP protocol should be used, (c) static – the IP address is hard-set statically.
  • IPADDR is the IP address.
  • ONBOOT specifies whether the interface should be activated at boot (values: yes or no).
  • TYPE is the interface type.

	

Frustrations of a System Admin

There are times when I absolutely love being the admin for our group’s high performance computer.  But there are also times when I would rather clean toilets all day.  This post will hopefully explain a few of the things I hate about being an admin.

  1. Debug Support:  I’m not your personal debugger.  I know nothing of the code that you are writing, and therefore I shouldn’t be expected to help you debug your model.  That said, I will usually help where I can.  But don’t just send me an email saying “My model won’t compile.  I need help”.  If you really want help, send me detailed information about the problem you are having.  Better yet, send me a copy of your source code.  If you give me little to no information, then expect little to no support.

How to: Passwordless SSH

As some of you know, I prefer to set up passwordless logins to all of my accounts on remote machines. I recently made a post describing how to enable passwordless SSH to compute nodes. But what if you are attempting to enable passwordless logins to remote machines?

If you are on a Linux machine, or have a copy of the “ssh-copy-id” script on your system, then the process is fairly simple.  You must first create the private/public key pair.  For passwordless SSH, just accept the defaults for each option.

ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/cmaqadj/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/cmaqadj/.ssh/id_rsa.
Your public key has been saved in /home/cmaqadj/.ssh/id_rsa.pub.
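
The next step with the “ssh-copy-id” script mentioned above can be sketched as follows. This is an illustration, not the remainder of the original post: the function name `install_public_key`, the `COPY_ID_CMD` variable, and the host name in the comment are hypothetical, and the key path assumes the default location accepted in the prompts above.

```shell
# Command used to install the key; overridable so the function can be
# exercised without contacting a real host.
COPY_ID_CMD=${COPY_ID_CMD:-ssh-copy-id}

# install_public_key USER@HOST
# Appends the default RSA public key to ~/.ssh/authorized_keys on the remote
# machine. You are asked for the remote password one last time.
install_public_key() {
    "$COPY_ID_CMD" -i "$HOME/.ssh/id_rsa.pub" "$1"
}

# Example (placeholder host):
#   install_public_key jdoe@remote.example.com
#   ssh jdoe@remote.example.com    # should now log in without a password
```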
