I have written some code to help me retrieve data from a CSV file:
import re
keywords = {"metal", "energy", "team", "sheet", "solar", "financial", "transportation", "electrical", "scientists",
            "electronic", "workers"}  # all your keywords
keyre = re.compile("energy", re.IGNORECASE)
with open("2006-data-8-8-2016.csv") as infile:
    with open("new_data.csv", "w") as outfile:
        outfile.write(infile.readline())  # Save the header
        for line in infile:
            if len(keyre.findall(line)) > 0:
                outfile.write(line)
I need it to look for each keyword in two columns, "position" and "Job description", and then write every row that contains any of these words to the new file. Any ideas on how this can be done in the simplest way?
Try this: loop over the rows of a DataFrame, collect the matches, and write them to a new CSV file.
import pandas as pd
keywords = {"metal", "energy", "team", "sheet", "solar", "financial",
            "transportation", "electrical", "scientists",
            "electronic", "workers"}  # all your keywords
df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")
listMatchPosition = []
listMatchDescription = []
for i in range(len(df.index)):
    if any(x in df['position'][i] or x in df['Job description'][i] for x in keywords):
        listMatchPosition.append(df['position'][i])
        listMatchDescription.append(df['Job description'][i])
output = pd.DataFrame({'position': listMatchPosition, 'Job description': listMatchDescription})
output.to_csv("new_data.csv", index=False)
EDIT:
If you have many columns to keep, the following modified code will do the job.
df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")
output = pd.DataFrame(columns=df.columns)
for i in range(len(df.index)):
    if any(x in df['position'][i] or x in df['Job description'][i] for x in keywords):
        # append the whole matching row to the output frame
        output.loc[len(output)] = [df[j][i] for j in df.columns]
output.to_csv("new_data.csv", index=False)
You can do this using pandas as follows, if you are looking for rows where the cell value is exactly equal to one of the keywords:
import pandas as pd
keywords = ["metal", "energy", "team", "sheet", "solar", "financial", "transportation", "electrical", "scientists",
            "electronic", "workers"]
# read the csv data into a dataframe
# change "," to the data separator in your csv file
df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")
# filter the data: keep only the rows whose position or
# Job description cell is equal to one of the keywords
df = df[df["position"].isin(keywords) | df["Job description"].isin(keywords)]
# write the data back to a csv file
df.to_csv("new_data.csv", sep=",", index=False)
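To see what an exact match means with isin, here is a minimal sketch on a small in-memory frame. The rows are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the question.
sample = pd.DataFrame({
    "position": ["energy", "energy engineer", "clerk"],
    "Job description": ["manages contracts", "models demand", "team"],
})
keywords = ["metal", "energy", "team"]

# isin compares whole cell values, so "energy engineer" is not a match.
mask = sample["position"].isin(keywords) | sample["Job description"].isin(keywords)
exact_matches = sample[mask]
print(exact_matches)
```

Only the first and third rows are kept: their cells are exactly "energy" and "team". The second row is dropped even though its position contains the word "energy".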
If you are looking for substrings in the rows (e.g. looking for "financial" in "financial engineering"), then you can do the following:
import pandas as pd
keywords = ["metal", "energy", "team", "sheet", "solar", "financial", "transportation", "electrical", "scientists",
            "electronic", "workers"]
# build a regex that matches any one of the keywords
searched_keywords = '|'.join(keywords)
# read the csv data into a dataframe
# change "," to the data separator in your csv file
df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")
# filter the data: keep only the rows that contain one of the keywords
# in the position or the Job description columns
df = df[df["position"].str.contains(searched_keywords) | df["Job description"].str.contains(searched_keywords)]
# write the data back to a csv file
df.to_csv("new_data.csv", sep=",", index=False)
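Note that str.contains is case-sensitive by default and returns missing values for empty cells, which can make the boolean filter fail. A hedged sketch of a more forgiving variant follows. The helper name filter_rows and the sample rows are hypothetical; it assumes you want case-insensitive matching, as the re.IGNORECASE in the question suggests.

```python
import re
import pandas as pd

def filter_rows(df, keywords, columns):
    """Return the rows of df where any keyword occurs in any of the given columns."""
    # Escape each keyword so characters such as "+" or "." are matched literally.
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = pd.Series(False, index=df.index)
    for col in columns:
        # case=False ignores case; na=False treats empty cells as non-matches.
        mask |= df[col].str.contains(pattern, case=False, na=False)
    return df[mask]

# Made-up rows for illustration, including an empty "position" cell.
sample = pd.DataFrame({
    "position": ["Energy analyst", None, "clerk"],
    "Job description": ["models demand", "leads the Solar team", "files paperwork"],
})
result = filter_rows(sample, ["energy", "solar"], ["position", "Job description"])
print(result)
```

For the real file, the same helper could be applied to the frame returned by pd.read_csv and the result written out with to_csv(..., index=False), as in the code above.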