Apr 15 2018

Developer-friendly job search. The tooling

It is quite interesting to me to see how the software job market works in the Netherlands from an active candidate’s point of view. Unfortunately one great journey has almost ended and there is a firm incentive to find something at least as good. Having found my current job through StackOverflow Jobs, I intended to keep using what had proved to work: marking the profile active, entering python -django -sysadmin into the search field, adding relevant filters and subscribing to the search. That, as such, works, but it looks like it is not the most popular software talent recruitment service among Dutch companies. And many companies are still quite happy to pay 20% of the gross yearly salary per new hire to the volume-oriented keyword-matching salesmen, colloquially referred to as recruiters.

On the other side of the price spectrum there are free (as in beer… at a degustation) services like Indeed and Glassdoor [1] where employers can place their vacancies. The two seem to me the most relevant and complete, but because the latter also provides valuable insight into a company’s state, salary range and interview process, I will focus on it.

StackOverflow Jobs is in between. It’s definitely not free for employers. For our small company it cost around €2k for half-year access to the active candidate database and one job posting. Most probably it requires a company to have a full-time HR employee to manage the interactions with candidates. But hiring 4-5 developers a year already justifies the investment, so Stack Exchange has growth potential here.

Glassdoor has poor search features: term-inclusion-only search plus a location filter, which makes it a chore to examine the proposals manually.

Problem

StackOverflow Jobs search features include [2]:

  1. tag inclusion and exclusion
  2. full-text search on body, title and company
  3. enumeration filters by contract type, industry and seniority
  4. salary range
  5. flag filters by remote option, visa and relocation sponsorship
  6. posted time filtering

My job search criteria were the following:

  • Python (backend software development) job preferably in Amsterdam
  • The company develops its own product, i.e. it’s not a consultancy
  • The job is available directly, i.e. it’s not an agency or any other kind of intermediary

At first glance it looks simple. What subset of the above feature list is sufficient to cover these?

  1. Location data should be good enough for equality comparison; the country is small and well connected anyway, so geospatial search is not needed.
  2. A Python job is a match against [python] minus some other tags (though only if carefully tagged), but Glassdoor doesn’t have tags and implementing NLP to recognise them was not feasible. Searching for just Python will not only match all sorts of jobs that list it as a nice-to-have, but also jobs that use Python extensively without being about software development as such, e.g. devops, (data) science, test automation, screen scraping and probably others. Also don’t forget full-stack and junior/intern positions (there is no dedicated seniority field on Glassdoor either).
  3. Filtering out consultancies and intermediaries should be possible with a blacklist, but it needs a full-text index on the company name to deal with optionally present suffixes like “B.V.” and “BV”, geographical suffixes and the like. Using NLP to classify jobs into direct and indirect would be interesting research, but again it was not feasible.

Items 1 and 3 look clear. Item 2 needs more attention; it should also be possible to solve with smarter blacklisting. But let’s look at a sample of job titles that match a Python query:

  • Lead Data Scientist
  • DevOps Engineer (Linux, AWS)
  • Software Engineer Embedded Smart Systems (C/C++; C#; Python)
  • Agile Tester
  • Backend Software Engineer
  • Experienced Python/Django Developer in Zaandam (NL/ENG)
  • Java Developer
  • Junior Full Stack Engineer
  • Medior Front-end Developer
  • Bioinformatic Postdoctoral Fellow
  • C++ Software engineer Artificial intelligence DELFT 40k-70k
  • Portfolio Manager Quant Allocation
  • PhD Student “Modelling causal interactions between marine microbes”

Yes, all sorts of jobs, as promised. The good thing is that indexing title and body separately can help a lot with blacklisting. Say, if the title includes Django it’s definitely a no-go, but if Django is merely mentioned in the body it depends and may be tolerable. Other terms, like Java or Embedded, can be handled the same way.

Storage

Thus on the level of storage/search it’s necessary to have:

  1. primary key, the original one if exposed, or a hash of the whole record
  2. title, full-text indexed
  3. employer, full-text indexed
  4. body, full-text indexed
  5. JSON field or just blob for other properties
  6. status, just integer for own convenience

What embedded database has full-text search support, document storage features and an easy Python interface? The document storage part refers to the JSON-field support that PostgreSQL and MySQL have acquired. SQLite!

The version that I have from Ubuntu Xenial’s repository, libsqlite3-0 3.11.0-1ubuntu1 at the time of writing, supports both:

$ echo "PRAGMA compile_options;" | sqlite3 | grep -E "FTS|JSON"
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS4
ENABLE_JSON1

And the Python standard library’s sqlite3 module [3] uses it! Now briefly about the extensions.
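The same check can be done from within Python to make sure the standard library module is linked against the same build:

import sqlite3

conn = sqlite3.connect(':memory:')
options = [row[0] for row in conn.execute('PRAGMA compile_options')]
print(sqlite3.sqlite_version)                             # 3.11.0 here
print([o for o in options if 'FTS' in o or 'JSON' in o])  # the four options above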

FTS3 and FTS4

The documentation says [4]:

FTS3 and FTS4 are SQLite virtual table modules that allow users to perform full-text searches on a set of documents… FTS3 and FTS4 are nearly identical… FTS4 is an enhancement to FTS3. FTS3 has been available since SQLite version 3.5.0 (2007-09-04). The enhancements for FTS4 were added with SQLite version 3.7.4 (2010-12-07).

A virtual table [5] in SQLite is:

…an object that is registered with an open SQLite database connection. From the perspective of an SQL statement, the virtual table object looks like any other table or view. But behind the scenes, queries and updates on a virtual table invoke callback methods of the virtual table object instead of reading and writing on the database file.

In case of FTS4 it looks like a view that behind the scenes operates 5 shadow tables [6] which are stored within the database file. And a note about ENABLE_FTS3_PARENTHESIS and query syntax:

New applications should also define the SQLITE_ENABLE_FTS3_PARENTHESIS macro to enable the enhanced query syntax [7].

It means that python -django won’t work. python AND NOT django (or python NOT django because AND is the implicit operator) should be used instead.
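A quick way to see the syntax in action, assuming the build above and a throwaway in-memory table:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE VIRTUAL TABLE doc USING fts4(body)')
conn.executemany('INSERT INTO doc(body) VALUES(?)', [
  ('Python and Django shop',),
  ('Python backend, no framework',),
])
# enhanced query syntax: AND is implicit, NOT is a binary operator
for row in conn.execute("SELECT body FROM doc WHERE doc MATCH 'python NOT django'"):
  print(row)  # only the second document comes back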

To better understand how full-text search is going to contribute to solving the problem, here is simplified code that builds the data structure [6], an inverted index, used by the SQLite extension.

from collections import defaultdict


docs = [
  (100, 'alpha beta beta gamma gamma gamma'),
  (101, 'gamma delta'),
]
invertedIndex = defaultdict(list)
for docId, text in docs:
  docIndex = defaultdict(list)
  for offset, term in enumerate(text.lower().split()):
    docIndex[term].append(offset)
  for term, offsetList in docIndex.items():
    invertedIndex[term].append((docId, offsetList))

The resulting invertedIndex looks like:

{
  'alpha' : [(100, [0])],
  'beta'  : [(100, [1, 2])],
  'delta' : [(101, [1])],
  'gamma' : [(100, [3, 4, 5]), (101, [0])]
}
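Purely for illustration, here is a rough sketch (my own, not the extension’s actual algorithm) of how a query like gamma NOT delta could be answered from this structure:

def matchNot(index, include, exclude):
  # documents containing the included term minus documents containing the excluded one
  included = {docId for docId, _ in index.get(include, [])}
  excluded = {docId for docId, _ in index.get(exclude, [])}
  return included - excluded

print(matchNot(invertedIndex, 'gamma', 'delta'))  # {100}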

Using this structure FTS3 and FTS4 serve the queries. str.lower and str.split in the snippet represent tokenisation of the text. No stemming is needed here because most technical terms are present intact, so I’ll use unicode61 for the virtual table.

The “unicode61” tokeniser is available beginning with SQLite version 3.7.13 (2012-06-11). Unicode61 works very much like “simple” except that it does simple unicode case folding according to rules in Unicode Version 6.1 and it recognises unicode space and punctuation characters and uses those to separate tokens. The simple tokeniser only does case folding of ASCII characters and only recognises ASCII space and punctuation characters as token separators… By default, “unicode61” also removes all diacritics from Latin script characters.

One thing to point out is that “C++” and “C#” will be indistinguishable from “C” because non-alphanumeric characters are treated as separators. SQLite doesn’t impose a minimal term length or have a built-in stop-word list. So tokenchars is enough to solve this:

…”tokenchars=” option may be used to specify one or more extra characters that should be treated as part of tokens instead of as separator characters.

I also want to exclude “.NET”, and adding “.” to the tokenchars will cause issues with words at the end of a sentence, but that’s negligible for the use case. Alternatively LIKE can be safely used as a post-filter condition, or, more generally, it’s possible to implement tokenisers in Python with sqlitefts [12].
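A quick sanity check of tokenchars on a throwaway table (the same tokeniser options are used in the implementation below):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
  CREATE VIRTUAL TABLE doc USING fts4(body, tokenize=unicode61 "tokenchars=.+#")
''')
conn.execute('INSERT INTO doc(body) VALUES(?)', ('C++ and C# experience required',))
# with the extra token characters "C", "C++" and "C#" are distinct terms
print(conn.execute("SELECT count(*) FROM doc WHERE doc MATCH 'C'").fetchone())   # (0,)
print(conn.execute("SELECT count(*) FROM doc WHERE doc MATCH 'C#'").fetchone())  # (1,)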

JSON1

The documentation says [8]:

The json1 extension is a loadable extension that implements fifteen application-defined SQL functions and two table-valued functions that are useful for managing JSON content stored in an SQLite database… The json1 extension (currently) stores JSON as ordinary text… The json1 extension uses the sqlite3_value_subtype() and sqlite3_result_subtype() interfaces that were introduced with SQLite version 3.9.0 (2015-10-14).

Out of the 15 functions I’ll use only json_extract [9]. Even though at the storage level of JSON1 the JSON values are just text blobs, at the logical level the given set of functions extends the relational model [10] of SQLite with a document storage model [11]. This provides great flexibility in data modelling, where one can take the best out of the two models and still use a declarative query language, SQL.
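A minimal illustration of json_extract on a throwaway table (“ACME” is a made-up employer):

import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE item(properties)')  # JSON1 stores JSON values as ordinary text
conn.execute('INSERT INTO item VALUES(?)', (json.dumps({'employer': 'ACME', 'rating': 4.1}),))
sql = '''
  SELECT json_extract(properties, '$.employer')
  FROM item
  WHERE json_extract(properties, '$.rating') > 4
'''
print(conn.execute(sql).fetchone())  # ('ACME',)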

Implementation

import json
import sqlite3


class JobStorage:

  _conn = None


  def __init__(self):
    self._conn = sqlite3.connect('db.sqlite', isolation_level = None)
    self._setup()

  def _setup(self):
    sql = '''
      CREATE VIRTUAL TABLE IF NOT EXISTS job USING fts4(
        properties, status, employer, title, body,
        notindexed=properties, notindexed=status,
        tokenize=unicode61 "tokenchars=.+#")
    '''
    self._conn.execute(sql)

  def save(self, job):
    sql = '''
      INSERT OR IGNORE INTO job(docid, properties, status, employer, title, body)
      VALUES(:docid, :properties, :status, :employer, :title, :body)
    '''

    title = job.pop('title')
    body = job.pop('body')
    self._conn.execute(sql, {
      'docid'      : job['dataset']['id'],
      'employer'   : job['employer'],
      'properties' : json.dumps(job, default = str),
      'status'     : 1,
      'title'      : title,
      'body'       : body
    })

isolation_level = None re-enables auto-commit (for simplicity, and generally to let transactions be an explicit measure, because explicit is better than implicit). job['dataset'] will be described in the next section.
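A hypothetical usage sketch (the real job dictionaries are produced by the parser described below); it also shows how docid deduplicates jobs on repeated runs:

storage = JobStorage()
job = {
  'dataset'  : {'id': 123},  # made-up identifier, normally taken from the listing
  'employer' : 'ACME',
  'title'    : 'Backend Python Developer',
  'body'     : 'We are looking for…'
}
storage.save(dict(job))  # inserts the row with docid 123
storage.save(dict(job))  # a repeated run is a no-op thanks to INSERT OR IGNORE on the same docid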

Retrieval

Data retrieval in the context of the web, or screen scraping as it’s called in less neutral terms, is usually implemented with an HTTP client and an HTML parser capable of either XPath queries or CSS selectors. This time I wanted to try something new, and the time was right. These days we see a new wave of headless browsers [13]. I’m not talking about PhantomJS or the like, but rather about ubiquitous desktop browsers that have gained official automation APIs. For instance, Chromium since version 59 supports the DevTools Protocol (a.k.a. Remote Debugging Protocol) [14] on Linux.

The DevTools Protocol is a WebSocket-based protocol, and Google maintains an official client for it, puppeteer [15], written in JavaScript. What’s better, there’s a port to Python 3.6+ called pyppeteer [16].

Pyppeteer

Most things that you can do manually in the browser can be done using Puppeteer!

As it should be with Pyppeteer. Basically the protocol provides WebSocket-based RPC to BOM, DOM and user objects, and the implementation provides a more-or-less transparent proxy object API. In the case of Pyppeteer it’s an asyncio API.

Here is a demonstration of the proxy behaviour. It uses pyppeteer.element_handle.ElementHandle [18], which represents a DOM element. Here’s a JavaScript snippet that one can run in Chromium’s Developer Tools.

var a = document.querySelector('div div.flexbox div a');
var hrefProperty = a.href;
var href = hrefProperty.valueOf();
var titleProperty = a.textContent;
var title = titleProperty.valueOf();
console.log(href, title);

And the same thing with Pyppeteer:

a = await page.querySelector('div div.flexbox div a')
hrefProperty = await a.getProperty('href')
href = await hrefProperty.jsonValue()
titleProperty = await a.getProperty('textContent')
title = await titleProperty.jsonValue()
print(href, title)

There are two must-know things about Pyppeteer. The first is that every attribute and value access needs to be await‘ed. The second is that pyppeteer.page.Page.click and pyppeteer.page.Page.waitForNavigation have a race condition and should be awaited simultaneously [19], like:

await asyncio.gather(page.click('#HeroSearchButton'), page.waitForNavigation())

Towards the implementation, the idea for the job retrieval API is to have an asynchronous iterator object implemented as an asynchronous generator (one that mixes yield and await) as defined in PEP 525 [17]. It will encapsulate browser preparation, pagination, additional page interactions and throttling sleeps, while the calling code stays just an async for loop; a minimal sketch of the pattern follows the version list below. I’ll be using:

  • Python 3.6.5
  • Pyppeteer 0.0.16
  • Chromium 64.0.3282.167
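To make the pattern concrete, here is a minimal sketch of a PEP 525 asynchronous generator, unrelated to the scraping itself, together with the async for loop that consumes it:

import asyncio


async def countdown(n):
  # an asynchronous generator: it both awaits and yields
  while n:
    await asyncio.sleep(0.1)  # stands in for a page interaction or a throttling sleep
    yield n
    n -= 1

async def main():
  async for i in countdown(3):  # the calling code is just an async for loop
    print(i)

asyncio.get_event_loop().run_until_complete(main())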

Iteration

import asyncio
import logging


class JobIter:

  _page = None

  _query    = None
  _location = None
  _days     = None

  _prepared = False


  def __init__(self, page, query, location, days):
    self._page = page

    self._query    = query
    self._location = location
    self._days     = days

  async def _prepare(self):
    logging.debug('Opening main page')
    await self._page.goto('https://www.glassdoor.nl/index.htm')

    logging.debug('Entering query')
    await self._page.type('#KeywordSearch', self._query)
    await self._page.evaluate(
      'function(id, loc) {document.getElementById(id).value = loc}',
      'LocationSearch', self._location)

    logging.debug('Opening result page')
    await asyncio.gather(
      self._page.click('#HeroSearchButton'),
      self._page.waitForNavigation()
    )

    logging.debug('Selecting posted date range')
    datePosted = '#DKFilters > div > div > div:nth-child(2)'
    await self._page.click(datePosted)

    daysAgo = '#DKFilters .filter.expanded ul li[value="{}"]'.format(self._days)
    await self._page.waitForSelector(daysAgo)
    await self._page.click(daysAgo)

    logging.debug('Waiting for list to update')
    await self._page.waitForSelector('.jlGrid.updating')
    await self._page.waitForSelector('.jlGrid.updating', hidden = True)

  async def _dismiss(self):
    button = await self._page.querySelector('button.mfp-close')
    if button:
      await button.click()

  async def __aiter__(self):
    if not self._prepared:
      await self._prepare()
      self._prepared = True

    while True:
      await self._dismiss()

      button = await self._page.querySelector('#FooterPageNav .page.current')
      p = await button.getProperty('textContent')
      current = (await p.jsonValue()).strip()
      logging.debug('Processing page %s', current)

      jobList = await self._page.querySelectorAll('ul.jlGrid > li')
      for j in jobList:
        parser = JobParser(j, self._page)
        job = await parser.parse()
        logging.debug('Processed job %s', job['title'])
        yield job
        await asyncio.sleep(0.5)

      button = await self._page.querySelector('#FooterPageNav .next')
      disabled = await button.querySelector('.disabled')
      if disabled:
        logging.debug('Last page has been reached')
        return

      await asyncio.gather(button.click(), self._page.waitForNavigation())
      await asyncio.sleep(2)

Parsing

The root element of a job listing is an HTML list item tag which looks like:

<li class="jl selected"
  data-id="2618878823" data-emp-id="262109" data-is-organic-job="false"
  data-sgoc-id="-1" data-purchase-ad-order-id="0" data-is-easy-apply="false"
  data-njslv="false"/>

There are several noteworthy data attributes:

  1. id is the internal Glassdoor identifier of a job, which will help deduplicate jobs on subsequent runs,
  2. empId is an employer identifier; it’s somewhat useful,
  3. isOrganicJob, which I initially thought was the flag to filter out intermediaries, but unfortunately I didn’t grasp how exactly it is set.

The attributes can be accessed as a dictionary under the dataset property, see _getDataset below.

import datetime
from decimal import Decimal


class JobParser:

  _elementHandle = None
  _page = None


  def __init__(self, eh, page):
    self._elementHandle = eh
    self._page = page

  async def _getDataset(self):
    p = await self._elementHandle.getProperty('dataset')
    d = await p.getProperties()
    dataset = {}
    for k, jsh in d.items():
      dataset[k] = json.loads(await jsh.jsonValue())
    return dataset

  async def _getRating(self):
    span = await self._elementHandle.querySelector('.compactStars')
    if span:
      p = await span.getProperty('textContent')
      return Decimal((await p.jsonValue()).replace(',', '.'))

  async def _getTitleLink(self):
    a = await self._elementHandle.querySelector('div div.flexbox div a')
    p = await a.getProperty('href')
    href = await p.jsonValue()
    p = await a.getProperty('textContent')
    title = (await p.jsonValue()).strip()

    return title, href

  async def _getEmployer(self):
    div = await self._elementHandle.querySelector('.empLoc div:nth-child(1)')
    firstNode = await div.getProperty('firstChild')
    p = await firstNode.getProperty('nodeValue')
    return (await p.jsonValue()).strip().strip('–').strip()

  async def _getCity(self):
    span = await self._elementHandle.querySelector('.subtle.loc')
    p = await span.getProperty('textContent')
    return (await p.jsonValue()).strip()

  async def _getDate(self):
    div = await self._elementHandle.querySelector('.hotListing')
    ts = datetime.datetime.now()
    if div:
      return ts

    span = await self._elementHandle.querySelector('.showHH span')
    p = await span.getProperty('textContent')
    value, unit = (await p.jsonValue()).strip().split(maxsplit = 1)
    value = int(value)
    if unit == 'd':
      ts -= datetime.timedelta(days = value)
    elif unit == 'u':
      ts -= datetime.timedelta(hours = value)
    elif unit == 'min':
      ts -= datetime.timedelta(minutes = value)
    else:
      raise ValueError('Unsupported unit {}'.format(unit))

    return ts

  async def _getBody(self):
    # clicking the list item loads its description into the details pane
    await self._elementHandle.click()
    spinner = await self._page.querySelector('.jobDetails .ajaxSpinner')
    if spinner:
      await self._page.waitForSelector('.jobDetails .ajaxSpinner', hidden = True)

    div = await self._page.querySelector('.jobDetails .jobDescriptionContent')
    p = await div.getProperty('textContent')
    return (await p.jsonValue()).strip()

  async def parse(self):
    title, url = await self._getTitleLink()
    dataset, rating, employer, city, date = await asyncio.gather(
      self._getDataset(), self._getRating(), self._getEmployer(), self._getCity(), self._getDate())
    body = await self._getBody()

    return {
      'dataset'  : dataset,
      'rating'   : rating,
      'url'      : url,
      'employer' : employer,
      'city'     : city,
      'date'     : date,
      'title'    : title,
      'body'     : body
    }

Glue

By default, on first run Pyppeteer downloads a Chromium build that is guaranteed to work. But because I already had a compatible version pre-installed, I explicitly provided executablePath, which disables that behaviour. headless = False is useful for debugging.

from pyppeteer import launch


async def main(query, location, days):
  browser = await launch(executablePath = '/usr/bin/chromium-browser', headless = True)
  page = await browser.newPage()
  storage = JobStorage()
  async for job in JobIter(page, query, location, days):
    storage.save(job)
  await browser.close()


if __name__ == '__main__':
  logging.basicConfig(
    level = logging.DEBUG, format = '%(asctime)s %(levelname)s %(name)s %(message)s')
  for l in ('websockets', 'pyppeteer'):
    logging.getLogger(l).setLevel(logging.WARNING)

  asyncio.get_event_loop().run_until_complete(main('Python', 'Nederland', 7))

Listing relevant jobs

And finally, after executing the above code, in several minutes there’s db.sqlite. I used sqliteman [20] (available from Ubuntu Xenial’s repository) to query and explore it, because it works correctly with JSON attributes and the clipboard (surprisingly, SQLiteStudio, which I usually open SQLite databases with, has issues with both).

A sample record’s properties value looks like:

{
  "dataset": {
    "id"                : 2622677334,
    "empId"             : 823453,
    "isOrganicJob"      : true,
    "sgocId"            : -1,
    "purchaseAdOrderId" : 0,
    "isEasyApply"       : false,
    "njslv"             : false
  },
  "rating"   : "2.9",
  "url"      : "https://www.glassdoor.nl/partner/jobListing.htm?...",
  "employer" : "Takeaway.com",
  "city"     : "Amsterdam",
  "date"     : "2018-03-23 16:10:59.413456"
}

Here follows the query I used. Yours most probably will be different.

SELECT
  title,
  status,
  round(julianday('now') - julianday(json_extract(properties, '$.date'))) ago,
  json_extract(properties, '$.employer') employer,
  json_extract(properties, '$.dataset.empId') eid,
  json_extract(properties, '$.rating') rating,
  json_extract(properties, '$.city') city,
  json_extract(properties, '$.url') url
FROM job
WHERE job MATCH '
  python
  NOT (django OR php OR java OR ruby OR perl OR C# OR .NET)
  NOT (title:devops OR title:analyst OR title:analist OR title:support OR title:graduation OR
    title:director OR title:phd OR title:researcher OR title:biologist OR
    title:test OR title:tester OR title: "quality assurance" OR title:hacker OR
    title:intern OR title:internship OR title:junior OR title:young OR
    title: "full stack" OR title:fullstack OR title: "front end" OR
    title:embedded OR title:firmware OR title:fpga OR title:C++ OR
    title:science OR title:scientist OR title:consultant OR title:coach OR
    title:marketing OR title:network OR title:operations OR title:student OR title:linux OR
    title:architect OR title:android OR title:automation OR title: "ops engineer" OR
    title:verification OR title:validation OR title:manager OR title:expert OR
    title:azure OR title:vmware OR title: "net developer" OR title:trader)
  NOT (employer:CareerValue OR employer: "Bright Cubes" OR employer:Deloitte OR
    employer:Cegeka OR employer: "Star Apple" OR employer:HUMAN-CAPITAL OR employer:Bonque OR
    employer:Sogeti OR employer:Yacht OR employer:iSense OR employer: "Talent Relations" OR
    employer:Place-IT OR employer:Professionals OR employer:ISAAC OR
    employer:Trinamics OR employer:CINQ OR employer:Teqoia OR employer:JouwICTvacature OR
    employer:Gazelle OR employer:Consulting OR employer:Topicus OR employer: "Blue Lynx" OR
    employer:Nobru OR employer:Qualogy OR employer:YER OR employer: "IT Human Resources" OR
    employer:Montash OR employer:CodeGuild OR employer:recruitment OR employer: "hot item" OR
    employer:MobGen OR employer:Mobiquity OR employer: "Orange Quarter" OR
    employer: "You Get" OR employer:ITHR)
'
ORDER BY employer, date(json_extract(properties, '$.date')) DESC

Yes, it’s pretty restrictive and thus gives relevant results, although I’m not very happy with the constant manual blacklisting (there is really a whole lot of intermediaries). Maybe next time I’m looking for a job I’ll employ some NLP or other machine learning. But for now this is it.

Wrap-up

  1. SQLite is a Swiss army knife of ad-hoc storage and search,
  2. Ubiquitous browsers have become automatable out of the box,
  3. Even though some vacancy search platforms provide decent search features and others try to automate job collection, like Indeed with its XML job feed format [21], the effort is not enough and the penetration of developer-friendly automated job search tools can be significantly improved,
  4. Python for the win!

P.S. The new journey has already been scheduled :-)


[1]https://www.glassdoor.nl/
[2]https://stackoverflow.com/help/advanced-search-parameters-jobs
[3]https://docs.python.org/3/library/sqlite3.html
[4]https://sqlite.org/fts3.html
[5]https://sqlite.org/vtab.html
[6]https://sqlite.org/fts3.html#data_structures
[7]https://sqlite.org/fts3.html#_set_operations_using_the_enhanced_query_syntax
[8]https://sqlite.org/json1.html
[9]https://sqlite.org/json1.html#jex
[10]https://en.wikipedia.org/wiki/Relational_database
[11]https://en.wikipedia.org/wiki/Document-oriented_database
[12]https://pypi.org/project/sqlitefts/
[13]https://en.wikipedia.org/wiki/Headless_browser
[14]https://github.com/ChromeDevTools/devtools-protocol
[15]https://github.com/GoogleChrome/puppeteer
[16]https://pypi.org/project/pyppeteer/
[17]https://www.python.org/dev/peps/pep-0525/
[18]https://miyakogi.github.io/pyppeteer/reference.html#elementhandle-class
[19]https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.click
[20]https://github.com/pvanek/sqliteman
[21]https://www.indeed.com/intl/en/xmlinfo.html