Entry-Level Job Classifier using NLP

View on GitHub

Project Overview

This project scrapes job postings from LinkedIn, filters them to identify entry-level roles, and stores the valid postings in a CSV file for further use. It is built in Python, using BeautifulSoup for HTML parsing, the csv module for storage, and a custom machine learning module, gem.py, to classify the jobs.

Key Features

- Scrapes LinkedIn job postings with requests and BeautifulSoup
- Filters postings to keep only entry-level roles
- Stores valid postings in a CSV file (jobPost.csv)
- Classifies job descriptions with a TF-IDF and Complement Naive Bayes model

Code Snippets

1. Removing HTML Tags from Job Descriptions

import re

def remove_common_chars(string):
    # Strip HTML tags (e.g. <strong>, </u>, </li>) left in the scraped
    # description. A regex handles whole tags at once; replacing "<" and
    # ">" first would prevent full-tag replacements from ever matching.
    return re.sub(r"</?[^>]+>", "", string)
            
2. Fetching Job Postings from LinkedIn

import csv
import time

import requests
from bs4 import BeautifulSoup

def get_links(url, num, big_num):
    delay = 1  # seconds between requests, to avoid hammering LinkedIn
    # Iterate in steps of 25 (LinkedIn's page size) instead of recursing:
    # ~1,800 recursive calls would exceed Python's default recursion limit,
    # and appending "&start=" to url on every call corrupted the URL.
    while num <= 45000:
        print(f"Fetching {num} job links")
        page_url = url if num == 0 else f"{url}&start={num}"

        page = requests.get(page_url)
        soup = BeautifulSoup(page.content, 'html.parser')
        hrefs = [tag.get('href') for tag in soup.find_all('a') if tag.get('href')]

        with open('jobPost.csv', mode='a', newline='') as file:
            writer = csv.writer(file)
            if num == 0:
                writer.writerow(['Link', 'Description'])

            for link in hrefs:
                if link.startswith("https://www.linkedin.com/jobs/"):
                    process_job_link(link)  # defined elsewhere in the project
                time.sleep(delay)
        num += 25
            
3. Machine Learning Model to Classify Entry-Level Jobs

The goal of this section is to train a machine learning model that can automatically classify job descriptions as "entry-level" or "not entry-level." To achieve this, we use a Natural Language Processing (NLP) approach. Below is a detailed breakdown of the process:

Step 1: Vectorization with TfidfVectorizer

We use TfidfVectorizer to convert job descriptions into numerical features. TF-IDF (Term Frequency-Inverse Document Frequency) measures the importance of each word in a document relative to the entire corpus. This transformation is crucial because machine learning models cannot work directly with text data; they need numerical input.

Step 2: Classification with ComplementNB

We use ComplementNB, a variant of the Naive Bayes classifier, which is particularly effective for imbalanced datasets (such as job descriptions where true entry-level jobs may be fewer in number). The classifier predicts the likelihood of each description being entry-level based on the vectorized features.
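A hedged sketch of this step on a deliberately imbalanced toy set (one entry-level example against four senior ones; all strings are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Imbalanced toy data: label 1 = entry-level, 0 = not entry-level
texts = [
    "entry level junior analyst no experience required",
    "senior architect fifteen years experience",
    "principal engineer staff level leadership",
    "director of engineering executive role",
    "senior manager experienced team lead",
]
labels = [1, 0, 0, 0, 0]

# ComplementNB estimates each class's parameters from the *complement*
# of that class, which makes it less biased toward the majority class
clf = make_pipeline(TfidfVectorizer(stop_words='english'), ComplementNB())
clf.fit(texts, labels)

print(clf.predict(["junior entry level role, no experience needed"]))
```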

Step 3: Training and Testing

The dataset is split into training and testing sets using train_test_split. The model is trained on 80% of the data, and its performance is evaluated on the remaining 20%. This ensures that the model generalizes well to unseen job descriptions.


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

def job(s):
    # Load the labeled corpus; 'data' must provide 'job_description'
    # and 'label' columns, with label 1 marking entry-level postings
    df = pd.DataFrame(data)

    # Split the data into training (80%) and testing (20%) sets
    X_train, X_test, y_train, y_test = train_test_split(
        df['job_description'], df['label'], test_size=0.2, random_state=42)

    # Pipeline: TF-IDF vectorization followed by Complement Naive Bayes
    model = make_pipeline(TfidfVectorizer(stop_words='english'), ComplementNB())

    # Train the model on the training data
    model.fit(X_train, y_train)

    # Classify the input job description
    y_pred = model.predict([s])

    # Return True if classified as entry-level, otherwise False
    return y_pred[0] == 1

Step 4: Making Predictions

After training, the model can be used to predict whether a new job description is entry-level. The input description is vectorized and passed to the trained model, which outputs a prediction. A value of True means the job is classified as entry-level, while False indicates it is not.
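Since job() as written retrains the model on every call, one practical refinement is to train the pipeline once, persist it, and reload it for predictions. A sketch using joblib (the file name, helper name, and two-example toy corpus below are assumptions for illustration, not part of the project):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the project's labeled corpus
texts = [
    "entry level junior developer",
    "senior staff engineer ten years experience",
]
labels = [1, 0]  # 1 = entry-level

model = make_pipeline(TfidfVectorizer(stop_words='english'), ComplementNB())
model.fit(texts, labels)

# Persist the fitted pipeline once, then reload instead of retraining
joblib.dump(model, "entry_level_model.joblib")
loaded = joblib.load("entry_level_model.joblib")

def is_entry_level(description):
    # True means the description is classified as entry-level
    return loaded.predict([description])[0] == 1

print(is_entry_level("junior entry level developer"))
```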

Conclusion

This project is a robust job scraper that uses machine learning to filter out irrelevant job postings, focusing on those suitable for recent graduates or entry-level candidates. By leveraging NLP and a Naive Bayes classifier, we efficiently identify job descriptions that match the entry-level criteria, saving time for users who are looking for relevant positions.