The Implementation Studies (IS) for Learning Paths (LP) and Towards Data Stewardship (DM) propose a web scraping solution that searches job ads for descriptions of knowledge, skills and abilities (KSAs). The current solution searches for job ads on indeed.com and saves them as text (csv, html), then collects the sentences of each job ad. The report presented here searches those sentences for keywords identified around KSAs.
This is an R Markdown Notebook. The code is displayed within the notebook, and the results appear beneath the code. If you prefer to see only the results, click "Code" and then "Hide All Code".
The R packages used for the search are tidyverse and here:
library(tidyverse)
library(here)
Using the script webScrapping.R from the scripts folder, we collated some job ads as text files in results > data > pages. Each job ad was then split into sentences, and the results were saved in text form in results > data > sentences. Starting from that dataset of sentences, this report shows the search for KSAs.
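The rendered report does not show how the sentence files are read back in, so the following is only a minimal sketch of one way to build the allsentences data frame used further down. It assumes each file in results > data > sentences is a CSV with a sentence column and a file name containing the year. The helper read_sentence_files, the temporary demo folder, and the file names are hypothetical and exist only for illustration.

```r
library(purrr)
library(readr)
library(dplyr)

# Hypothetical helper: read every CSV in `path` whose file name matches `pattern`
read_sentence_files <- function(pattern, path = ".") {
  files <- list.files(path = path, pattern = pattern, full.names = TRUE)
  map(files, read_csv, col_types = cols(.default = col_character()))
}

# Temporary folder standing in for results > data > sentences (assumption)
demo_dir <- file.path(tempdir(), "sentences_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(sentence = c("Ability to work in a team.", "Apply now.")),
          file.path(demo_dir, "2018-12-01_ad1.csv"))
write_csv(data.frame(sentence = "Knowledge of R is required."),
          file.path(demo_dir, "2018-12-02_ad2.csv"))

# safely() returns a list with `result` and `error` instead of stopping on failure
safe_read <- safely(read_sentence_files)
check <- safe_read(pattern = "2018", path = demo_dir)

# Stack the per-file data frames into a single data frame
allsentences <- bind_rows(check$result)
nrow(allsentences)
```

Wrapping the reader in safely() means a malformed file surfaces in check$error rather than aborting the whole report.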
This report is meant to be an example; it was last rendered on 2018-12-05.
The keyword searches below are case-insensitive. A keyword prefixed with ^ matches only at the start of a sentence; any other keyword is matched anywhere in the sentence.
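As a small illustration of the difference between anchored and unanchored keywords, the sketch below runs grepl() with ignore.case = TRUE on a few toy sentences. The sentences are invented for this example and do not come from the scraped job ads.

```r
# Toy sentences, invented for illustration only
sentences <- c(
  "Knowledge of SQL is required.",
  "Working knowledge of SQL is a plus.",
  "ABILITY to work independently.",
  "You have the ability to learn quickly."
)

# Anchored with ^: matches only when the sentence begins with the keyword
starts_with_knowledge <- grepl("^knowledge", sentences, ignore.case = TRUE)

# Unanchored: matches the keyword anywhere in the sentence
contains_knowledge <- grepl("knowledge", sentences, ignore.case = TRUE)

# ignore.case = TRUE lets "^ability" match the upper-case "ABILITY"
starts_with_ability <- grepl("^ability", sentences, ignore.case = TRUE)

print(data.frame(sentences, starts_with_knowledge, contains_knowledge,
                 starts_with_ability))
```

Only the first sentence satisfies the anchored knowledge pattern, while the first two satisfy the unanchored one.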
mysentences <-
    allsentences %>%
    distinct(sentence, .keep_all = FALSE) %>%
    filter(stringi::stri_count(sentence, regex = "\\w+") > 2) %>% # keep only sentences with more than 2 words
    mutate_at(vars(sentence), ~str_replace_all(., "\\. \\.", "\\.")) %>% # collapse a stray ". ." into a single "."
    mutate_at(vars(sentence), ~str_squish(.)) %>% # collapse repeated white space and trim the ends
    arrange(sentence)
cat("Number of unique sentences analysed", nrow(mysentences))
Number of unique sentences analysed 244
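To make the effect of each cleaning step concrete, here is a minimal sketch that applies the same sequence of steps to a toy data frame. The toy sentences are invented, and the sketch uses mutate() and stringr::str_count() in place of mutate_at() and stringi::stri_count(), which should behave equivalently here.

```r
library(dplyr)
library(stringr)

# Toy input, invented for illustration only
toy <- tibble(sentence = c(
  "Apply now.",                        # only 2 words, so it is dropped
  "Strong   communication  skills.",   # repeated spaces
  "Strong   communication  skills.",   # exact duplicate
  "Experience with R is required. ."   # stray ". ." at the end
))

clean <- toy %>%
  distinct(sentence) %>%                                    # drop exact duplicates
  filter(str_count(sentence, "\\w+") > 2) %>%               # keep sentences with more than 2 words
  mutate(sentence = str_replace_all(sentence, "\\. \\.", "."),  # ". ." becomes "."
         sentence = str_squish(sentence)) %>%               # collapse repeated white space
  arrange(sentence)

clean$sentence
```

Because distinct() runs before str_squish(), two sentences that differ only in spacing would both survive in this ordering; moving distinct() to the end would be one way to avoid that.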
Analysis of text - KSA
Knowledge
mysentences %>%
    filter(grepl(sentence, pattern = "^knowledge", ignore.case = TRUE) |
           grepl(sentence, pattern = "experience|degree|expertise", ignore.case = TRUE)) %>%
    arrange(sentence)
Skills
mysentences %>%
    filter(grepl(sentence, pattern = "skills|proficiency|familiarity", ignore.case = TRUE) |
           grepl(sentence, pattern = "^highly", ignore.case = TRUE)) %>%
    arrange(sentence)
Abilities
mysentences %>%
    filter(grepl(sentence, pattern = "^ability", ignore.case = TRUE) |
           grepl(sentence, pattern = "^able to", ignore.case = TRUE)) %>%
    arrange(sentence)