This project scrapes job postings from LinkedIn and filters them against specific criteria, focusing on identifying entry-level roles. Valid postings are stored in a CSV file for further use. The project relies on Python libraries such as requests, BeautifulSoup, and the standard-library csv module, along with a custom machine learning module, gem.py, to classify the jobs.
import csv
import time
import requests
from bs4 import BeautifulSoup

def remove_common_chars(string):
    # Remove the known closing tags first, then any leftover angle brackets.
    # (Stripping "<" and ">" before the full tags would make those tag
    # replacements impossible to match.)
    for tag in ("</strong>", "</u>", "</li>"):
        string = string.replace(tag, "")
    return string.replace("<", "").replace(">", "")
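For example, the closing tags are removed first and any stray brackets after (the input fragment is a hypothetical scraped snippet):

print(remove_common_chars("Requirements</u> 0-2 years of experience</li>"))
# -> "Requirements 0-2 years of experience"

For arbitrary markup, parsing the fragment with BeautifulSoup and calling get_text() would be a more robust alternative.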
def get_links(url, num, big_num):
    # Page through LinkedIn search results, 25 postings per page.
    # big_num is kept for the caller's signature but is unused here.
    delay = 1
    with open('jobPost.csv', mode='a', newline='') as file:
        writer = csv.writer(file)
        if num == 0:
            writer.writerow(['Link', 'Description'])
    while num <= 45000:
        print(f"Fetching {num} job links")
        # Append the pagination offset without mutating the base URL
        page_url = url if num == 0 else f"{url}&start={num}"
        page = requests.get(page_url)
        soup = BeautifulSoup(page.content, 'html.parser')
        hrefs = [tag.get('href') for tag in soup.find_all('a') if tag.get('href')]
        for link in hrefs:
            if link.startswith("https://www.linkedin.com/jobs/"):
                process_job_link(link)
        time.sleep(delay)
        num += 25
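A typical invocation starts at offset 0. The search URL below is a hypothetical example of a LinkedIn jobs query, and process_job_link is assumed to be defined elsewhere in the project:

# Hypothetical search URL; pagination offsets are appended page by page
search_url = "https://www.linkedin.com/jobs/search?keywords=software%20engineer"
get_links(search_url, 0, 45000)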
The goal of this section is to train a machine learning model that can automatically classify job descriptions as "entry-level" or "not entry-level." To achieve this, we use a Natural Language Processing (NLP) approach. Below is a detailed breakdown of the process:
TfidfVectorizer
We use TfidfVectorizer to convert job descriptions into numerical features. TF-IDF (Term Frequency-Inverse Document Frequency) measures the importance of each word in a document relative to the entire corpus. This transformation is crucial because machine learning models cannot work directly with text data; they need numerical input.
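As a minimal sketch (the two toy descriptions are invented for illustration), the vectorizer turns raw strings into a sparse matrix with one weighted column per term:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Entry level software engineer, no experience required",
    "Senior software engineer, 10+ years of experience",
]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)         # sparse matrix: one row per document
print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.shape)                             # (2, number_of_terms)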
ComplementNB
We use ComplementNB, a variant of the Naive Bayes classifier that is particularly effective for imbalanced datasets (such as job descriptions, where true entry-level jobs may be fewer in number). The classifier predicts the likelihood of each description being entry-level based on the vectorized features.
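The snippet below shows the classifier in isolation on a tiny invented dataset where the entry-level class is the minority; it is a sketch, not project code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

texts = [
    "entry level analyst",           # label 1 = entry-level (minority class)
    "senior manager 8 years",        # label 0
    "staff engineer 10 years",       # label 0
    "principal architect 12 years",  # label 0
]
labels = [1, 0, 0, 0]
X = TfidfVectorizer().fit_transform(texts)
clf = ComplementNB().fit(X, labels)
print(clf.predict(X[:1]))  # predicted label for the first toy description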
The dataset is split into training and testing sets using train_test_split. The model is trained on 80% of the data, and its performance is evaluated on the remaining 20%. This gives an estimate of how well the model generalizes to unseen job descriptions.
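The function below omits that evaluation step; a minimal sketch of it, assuming the model, X_test, and y_test names from that function, would be:

from sklearn.metrics import accuracy_score

# Score the trained pipeline on the held-out 20%
print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))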
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

def job(s):
    # Load the labeled data (assumed to be defined elsewhere) into a DataFrame
    df = pd.DataFrame(data)
    # Split the data into training (80%) and testing (20%) sets
    X_train, X_test, y_train, y_test = train_test_split(
        df['job_description'], df['label'], test_size=0.2, random_state=42)
    # Convert text data to numerical features using TfidfVectorizer
    vectorizer = TfidfVectorizer(stop_words='english')
    # Chain the vectorizer and classifier into a single pipeline
    model = make_pipeline(vectorizer, ComplementNB())
    # Train the model on the training data
    model.fit(X_train, y_train)
    # Predict the label of the input job description
    y_pred = model.predict([s])
    # Return True if classified as entry-level (label 1), otherwise False
    return y_pred[0] == 1
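Usage then reduces to one call per scraped description (the sample strings are invented). Note that job() re-trains the pipeline on every call; fitting the model once and reusing it for all predictions would be considerably cheaper:

print(job("Junior developer, 0-1 years of experience, mentorship provided"))
print(job("Director of Engineering, 15+ years leading large teams"))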
After training, the model can be used to predict whether a new job description is entry-level. The input description is vectorized and passed to the trained model, which outputs a prediction. A return value of True means the job is classified as entry-level, while False indicates it is not.
This project is a robust job scraper that uses machine learning to filter out irrelevant job postings, focusing on those suitable for recent graduates or entry-level candidates. By leveraging NLP and a Naive Bayes classifier, we efficiently identify job descriptions that match the entry-level criteria, saving time for users who are looking for relevant positions.