3. Methodology
3.1. Sample and primary data sources
We built the sample as follows. First, we obtained the list of all Belgian start-ups listed in Crunchbase and retrieved their founders. Second, we collected information on the founders' work and education histories from their LinkedIn profiles. This step was done manually by a team of data collectors hired specifically for the task. The resulting raw data from Crunchbase and LinkedIn is stored in table R01_be_founders_source.
3.2. Construction of the dataset
We used the data in table R01_be_founders_source as the starting point to build the database, and proceeded as follows:
Extracting columns with the names, work experience and education history of the entrepreneurs from table R01_be_founders_source.
Combining all columns that refer to the same field into a single one. E.g., there are about 30 columns named exp<xx>_org (where <xx> is a number) with the names of each entrepreneur's past employers, but columns like featured_job_organization_name contain employers' names as well.
Creating raw tables R02_edu_source and R03_exp_source with raw data limited to education and work history, respectively.
Harmonizing fields: canonicalizing, cleaning and parsing strings (see Cleaning raw strings for more details).
Creating index variables for the parsed fields.
Disambiguating the parsed (string) fields, i.e., reducing all the strings that refer to the same entity to a single, uniform label, by using NLP techniques to match them to relevant external databases (see String disambiguation for more details).
We explain these steps in detail in the next sections. Table 3.1 below shows the correspondences between the raw fields of F01_founders_info, R02_edu_source and R03_exp_source and parsed fields in the main tables.
Table 3.1: Correspondence between raw fields, parsed fields and the tables they link to.

| Raw field name | Parsed name | Parsed id | Linked to table(s) |
|---|---|---|---|
| name_src | name | ind_id | F01_founders_info, R03_exp_source, E01_edu_main_parsed, W01_exp_main_parsed, U01_exp_flat, U02_edu_flat |
| exp_org | exp_org_parsed | org_id | W01_exp_main_parsed, U01_exp_flat |
| exp_jt | jt_parsed | jt_parsed_id | W02_job_titles_raw_parsed |
| edu_org | edu_org_parsed | org_id | E01_edu_main_parsed, U02_edu_flat |
| edu_prg | edu_prg_parsed | jt_parsed_id | E01_edu_main_parsed, W02_job_titles_raw_parsed, U02_edu_flat |
Warning
Parsing of job titles: There is a unique id for each raw job title in exp_jt, stored as variable jt_raw_id. Because one raw job title often includes several roles (e.g., "CEO, founder"), we parsed job titles in a way that assigns each of these roles to a separate parsed job title. Hence, each jt_raw_id may correspond to multiple jt_parsed_id. Table W02_job_titles_raw_parsed contains the correspondence between the ids of raw and parsed job titles.
3.2.1. Cleaning raw strings
We pre-processed and harmonized all raw string fields by canonicalizing them:
Removing punctuation marks and non-alphanumeric characters;
Replacing accented, special and non-Latin characters with their closest equivalent on a US keyboard (e.g., é -> e, á -> a, ž -> z), using unidecode;
Replacing Roman numerals with Arabic numerals;
Turning strings to uppercase (in the experience fields) or lowercase (in the education fields);
Removing multiple, leading and trailing white spaces;
Standardizing firm type tokens (limited -> ltd, company -> co, international -> intl, etc.).
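For illustration, a minimal sketch of these cleaning steps in Python; the function name and the abbreviated firm-type list are ours, not the pipeline's, and the Roman-to-Arabic numeral step is omitted:

```python
import re

from unidecode import unidecode  # replaces accented / non-Latin characters (é -> e, ž -> z)

# Abbreviated, illustrative firm-type replacements (the full list is longer)
FIRM_TYPE_TOKENS = {"limited": "ltd", "company": "co", "international": "intl"}

def canonicalize(raw: str, uppercase: bool = True) -> str:
    """Clean one raw string; uppercase=True for experience fields, False for education fields."""
    s = unidecode(raw)
    s = re.sub(r"[^0-9A-Za-z ]+", " ", s)      # drop punctuation and non-alphanumeric characters
    tokens = [FIRM_TYPE_TOKENS.get(t, t) for t in s.lower().split()]  # also collapses whitespace
    s = " ".join(tokens)                       # (Roman-to-Arabic numeral replacement omitted here)
    return s.upper() if uppercase else s

# canonicalize("Société Générale Limited") -> "SOCIETE GENERALE LTD"
```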
3.2.2. String disambiguation
We matched the harmonized strings containing organization names (of firms and universities/educational institutions) and job titles against dictionaries and lists of terms compiled from relevant external databases. This facilitated the disambiguation, and also allowed us to link our data directly with the external databases from which those dictionaries and lists come. Moreover, we matched some fields to several different databases. E.g., we matched university names to Orbis company data, but also to the ETER and Carnegie databases. Similarly, we matched firm names to Orbis, Compustat and CRSP.
3.2.2.1. Matching organization names
3.2.2.1.1. Matching against business register datasets
First, we pooled all harmonized organization names (exp_org, featured_job_organization_name, edu_org) and matched them to BvD's Bel-first and Orbis databases using their batch upload tools. These tools take a company name (i.e., one of our harmonized firm names) and look for its closest match in a business directory of millions of statutory organization names (i.e., the names under which organizations are recorded in national business registers). Both databases also record alternative and prior names of the organizations.
Bel-first covers organizations registered in Belgium, whereas Orbis has worldwide coverage; hence, Bel-first data is a subset of Orbis. Because the organizations in our database are predominantly Belgian, we first matched the harmonized names against Bel-first. We retained the successful matches and took the names that remained unmatched to the Orbis tool. Finally, we replaced the harmonized names in our list with their respective successful matches from Bel-first and Orbis.
Note
Both Bel-first and Orbis provide an indication of the quality of each match: a ranking from excellent to poor using the letters A-E. We kept only A and B matches, i.e., excellent and good.
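A stylized sketch of this cascade, assuming the batch-upload results are exported to files with a harmonized_name and a match_grade column (the file names and column names are hypothetical, not the actual BvD export format):

```python
import pandas as pd

# Hypothetical exports of the BvD batch-upload results
belfirst = pd.read_csv("belfirst_batch_results.csv")  # harmonized_name, matched_name, match_grade
orbis = pd.read_csv("orbis_batch_results.csv")

GOOD_GRADES = {"A", "B"}  # keep only excellent and good matches

belfirst_kept = belfirst[belfirst["match_grade"].isin(GOOD_GRADES)]
still_unmatched = set(belfirst["harmonized_name"]) - set(belfirst_kept["harmonized_name"])

orbis_kept = orbis[orbis["harmonized_name"].isin(still_unmatched)
                   & orbis["match_grade"].isin(GOOD_GRADES)]

matched_names = pd.concat([belfirst_kept, orbis_kept], ignore_index=True)
```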
Second, we took the list that resulted from the previous step and matched the names against all firm/organization names in each of the following datasets:
Compustat (downloaded on 24th September, 2021),
CRSP (downloaded on 25th September, 2021),
the Crunchbase 2013 data dump and a partial 2015 export (see Links),
AnaCredit's list of international organizations (see Links).
Prior to matching, we pre-processed the company names with the steps listed in Cleaning raw strings. In this step we used process.extractOne from the fuzzywuzzy library. As input, extractOne takes a focal string (one of our organization names) and a list of candidate names (e.g., all Compustat firm names); as output, it returns the candidate that is closest to the focal string, according to the Levenshtein Distance (LD).
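A minimal sketch of this matching loop, assuming the harmonized names sit in plain Python lists (variable names, toy strings and the choice of scorer are illustrative):

```python
from fuzzywuzzy import fuzz, process

focal_names = ["humin", "acme consulting"]                # our harmonized organization names
candidate_names = ["humanovation", "acme consulting co"]  # e.g., all Compustat firm names

matches = []
for name in focal_names:
    # extractOne returns the closest candidate and its similarity score (0-100)
    best, score = process.extractOne(name, candidate_names, scorer=fuzz.token_sort_ratio)
    matches.append({"focal": name, "candidate": best, "score": score})
```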
Third, we further refined the matching results from step two and retained only {focal, candidate} pairs fulfilling one of the following conditions:
Adjusted token sort ratio >= 95
OR
Adjusted token sort ratio >70 AND token set ratio >= 95
Note
Adjusted token sort ratio and token set ratio are two LD-based metrics in the fuzzywuzzy.fuzz module. The former computes the LD between a pair of strings after tokenizing each string and sorting its tokens alphabetically; the resulting score is adjusted using the inverse string length. The latter computes the LD after tokenizing and performing a set operation to remove repeated tokens.
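As an illustration, the retention rule above could be written as follows; token_sort_ratio and token_set_ratio are real fuzzywuzzy.fuzz functions, but the inverse-length adjustment is specific to our pipeline, so adjusted_token_sort_ratio below is only a placeholder for it:

```python
from fuzzywuzzy import fuzz

def adjusted_token_sort_ratio(focal: str, candidate: str) -> float:
    """Placeholder: the inverse-string-length adjustment described above is not reproduced here."""
    return fuzz.token_sort_ratio(focal, candidate)

def keep_pair(focal: str, candidate: str) -> bool:
    """Retain a {focal, candidate} pair if it satisfies one of the two conditions above."""
    adj = adjusted_token_sort_ratio(focal, candidate)
    return adj >= 95 or (adj > 70 and fuzz.token_set_ratio(focal, candidate) >= 95)
```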
We ran the steps explained in this sub-section twice:
once with the complete harmonized strings, and
once after removing the firm type tokens (e.g., co, ltd, inc).
We examined the entities that could not be matched and found that, oftentimes, the public and statutory names of a company differ, so the self-reported company name in LinkedIn would not match any company in any database. E.g., in our raw data, people report having worked for Humin, whose statutory name is Humanovation. Therefore, as a fourth step, we analyzed the organization names that were still unmatched and manually looked for their matching firms in the Orbis database. To do this, we looked up firms using information other than the name, such as the physical address, website, email, phone number and/or founders' names.
Finally, the quality of the matching process was evaluated by two external raters.
Matching results
There were 5,602 distinct organization names in the raw data (including firms as well as educational institutions). The harmonization and disambiguation steps described above reduced the number of distinct organizations to 4,244. Of this total, 3,766 (88.7%) were matched to an external database and 3,748 (88.3%) were matched to BvD (Bel-first and/or Orbis).
3.2.2.1.2. Matching against datasets of university names
We matched the list of organization names to three databases containing names of educational institutions:
ETER (dataset downloaded on 26th July, 2021)
Carnegie Classification (dataset downloaded on 1st September, 2021)
Webometrics Ranking (we scraped the complete website on 19th July, 2021)
To do so, we pre-processed the names of the educational institutions in these datasets according to the steps listed in Cleaning raw strings, and matched our organization names against them, following the procedure described in Matching against business register datasets; i.e., in this step the candidates were the pre-processed ETER, Carnegie and Webometrics Ranking names.
Concordance table
Many organizations in our database were successfully matched against multiple external sources (e.g., KU Leuven is matched to Orbis, ETER and Webometrics). To facilitate the future use of the matched data we built table T01_org_concordance, which links org_id’s with the id’s of multiple external databases.
3.2.2.2. Matching job titles
The pre-processing of the (raw) job title strings posed an additional challenge: a large share of them contain more than one role (e.g., "Co-founder and CEO", "Technical Sales and Software Engineer", "COO, Board member"). We dealt with this issue in several steps, sketched in code below. First, we split the raw string on the following characters/terms: {',', ' & ', ' - ', ' and '}. Second, we applied the steps in section Cleaning raw strings to the split strings. After parsing, one job title in its raw form (in field exp_jt, identified by jt_raw_id) may correspond to more than one job title in its parsed form (in field jt_parsed, identified by jt_parsed_id).
Note
The correspondence between raw and parsed job titles is in table W02_job_titles_raw_parsed
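A minimal sketch of the splitting step (the function names are ours, and the clean helper is a simplified stand-in for the canonicalization described in Cleaning raw strings):

```python
import re

# Split raw job titles on ',', ' & ', ' - ' and ' and '
SPLIT_PATTERN = re.compile(r",| & | - | and ")

def clean(role: str) -> str:
    """Simplified version of the canonicalization in 'Cleaning raw strings' (experience fields)."""
    role = re.sub(r"[^0-9A-Za-z ]+", " ", role)   # drop punctuation
    return " ".join(role.upper().split())          # uppercase, collapse whitespace

def split_job_title(raw_title: str) -> list[str]:
    """Split a raw job title into individual roles, then clean each one."""
    return [clean(r) for r in SPLIT_PATTERN.split(raw_title) if r.strip()]

# split_job_title("Co-founder and CEO") -> ["CO FOUNDER", "CEO"]
# split_job_title("COO, Board member")  -> ["COO", "BOARD MEMBER"]
```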
Third, we disambiguated the parsed job titles (i.e., split and pre-processed) following the procedure in Matching against business register datasets, matching them against a dictionary composed of all the job titles in the O*NET Database. More specifically, we used the alternate job titles in O*NET (identified by the field Alternate Title in the Alternate Titles table). Note that the alternate titles were also pre-processed according to Cleaning raw strings.
Fourth, and finally, we performed extensive manual revision and correction of the parsed job title strings that could not be matched to O*NET.
Warning
PhDs, courses and other education
Many individuals in our database list their PhDs and other education (such as courses and training, presumably with job-market relevance) under work experience. Our parsing and classification steps allowed us to spot such entries, and we reclassified them as education.
From raw job title to O*NET SOC code
Merge W02_job_titles_raw_parsed with W03_job_titles_parsed_onet on jt_parsed_id to obtain the O*NET SOC code(s) matched to each raw job title.
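For example, a minimal pandas sketch of this merge (the file names and formats are assumptions; the join key follows the text above):

```python
import pandas as pd

# Correspondence tables (file names/formats are illustrative)
raw_to_parsed = pd.read_csv("W02_job_titles_raw_parsed.csv")    # jt_raw_id <-> jt_parsed_id
parsed_to_onet = pd.read_csv("W03_job_titles_parsed_onet.csv")  # jt_parsed_id <-> O*NET SOC code

# Because one jt_raw_id may map to several jt_parsed_id, a raw title can receive several SOC codes
raw_to_soc = raw_to_parsed.merge(parsed_to_onet, on="jt_parsed_id", how="left")
```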
3.2.3. Creating categorical variables from job titles and study programs
3.2.3.1. Job titles
We followed the methodology in Chen and Thompson [CT16] to classify the parsed job titles into:
Job ranks: top management, management, sub-management and non-management
Functional areas: Business and Management, Production, R&D and Engineering, Personnel, Sales and Marketing, Accounting and Finance
Other indicators (e.g., engineering role, medical role, etc.)
The cited methodology uses sets of keywords linked to each of the classes above, and assigns a job title string to a class if two conditions are fulfilled:
the string contains one or more keywords linked to a class,
the string does not contain a keyword in a set of exclusions, particular to that class.
For instance, we classify job title strings containing terms such as sales, key account or marketer into Sales and Marketing. However, we abstain from doing so if the string also contains a term such as bond (e.g., in bond sales).
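A stylized sketch of this keyword/exclusion rule (the keyword and exclusion sets below are tiny illustrative excerpts, not the dictionary we actually used):

```python
# Tiny illustrative excerpts of the keyword and exclusion sets
CLASS_KEYWORDS = {"Sales and Marketing": {"sales", "key account", "marketer"}}
CLASS_EXCLUSIONS = {"Sales and Marketing": {"bond"}}

def classify(job_title: str) -> list[str]:
    """Return every class whose keywords appear in the title and whose exclusions do not."""
    title = job_title.lower()
    labels = []
    for cls, keywords in CLASS_KEYWORDS.items():
        has_keyword = any(k in title for k in keywords)
        has_exclusion = any(e in title for e in CLASS_EXCLUSIONS.get(cls, set()))
        if has_keyword and not has_exclusion:
            labels.append(cls)
    return labels

# classify("Head of Sales")     -> ["Sales and Marketing"]
# classify("Bond Sales Trader") -> []
```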
Finally, note that we grouped the job rank, functional area and other classifications at the level of the raw job title, not the parsed job titles. Hence, tables W05_job_titles_job_rank, W06_job_titles_functional_areas and W07_job_titles_other_indicators link each jt_raw_id to the classifications assigned following the procedure in this section.
Note
We used the dictionary provided by Chen and Thompson [CT16] and expanded it with new terms after careful examination of the job title strings in our database. A copy of the dictionary we used in this step is available in './03_params/parse_jt_job_ranks_funct_areas_curr.xlsx'.
3.2.3.2. Study programs
We parsed the edu_prg raw string in order to identify
the level of education of each study program (primary school, secondary school, bachelor, etc.), and
the field of each study program (Arts, humanities, ICT, etc.).
We proceeded as follows.
Cleaning. We applied the methodology in section Cleaning raw strings to the edu_prg field. Additionally, we cleaned/harmonized the parts of the string that refer to the level of the study program (e.g., ‘master’ -> ‘msc’, ‘Ph. d.’ -> ‘phd’). The harmonized strings are stored as edu_prg_parsed in table E03_edu_programs.
Levels of education. We adapted the methodology in Chen and Thompson [CT16] and created lists of words that relate to the ISCED levels of education. ISCED 2011 has nine levels of education, from level 0 to level 8. We collapsed them into six categories according to Table 3.2.
Table 3.2: Correspondence between study levels in our database and ISCED 2011 levels.

| Study level in our database | Corresponds to ISCED level(s) |
|---|---|
| Primary school | 0, 1 |
| Secondary school | 2, 3 |
| Post secondary (tertiary) | 4, 5 |
| Bachelor's degree | 6 |
| Master's degree | 7 |
| Doctoral degree | 8 |
We used the word lists to classify edu_prg_parsed into study levels. Specifically, we assigned a study program string to a study level if two conditions are fulfilled:
the string contains one or more keywords linked to a level,
the string does not contain a keyword in a set of exclusions, particular to that level.
Some study programs could not be assigned a study level in this way; we coded them as 'other'.
The following example illustrates the approach. Our algorithm parsed the raw string "bachelor in anthropology" into "bsc in anthropology" and classified it as 'Bachelor's degree' because it contains the term bsc.
Warning
We reclassified all post-doctoral experience as work experience. Hence, post-docs are in the experience-related tables, and not in the education-related ones.
Note
A copy of the dictionary we used in this step is available in ‘./03_params/parse_edu_isced_levels_curr.xlsx’.
Fields of education. We classified the study programs into the fields of study defined in the ISCED-F 2013 taxonomy (see table E06_isced_fields_defs). As before, we adapted the methodology in Chen and Thompson [CT16] and created lists of words and tokens that relate to the study fields. The starting point of our word lists was Appendix II: Numerical Code List of the ISCED-F 2013 report, which provides about 1,200 study programs belonging to the fields of study in the taxonomy. We expanded these lists with additional keywords as we explored the data.
We used the word lists referred above to classify edu_prg_parsed into study fields. Specifically, we assigned a study program string to a study field if two conditions are fulfilled:
the string contains one or more keywords linked to a field,
the string does not contain a keyword in a set of exclusions, particular to that field.
Warning
Some entrepreneurs who attended an accelerator program list it as work experience, while others list it as education. To harmonize, we reclassified all accelerator-related entries as work experience.
Note
A copy of the dictionary we used in this step is available in ‘./03_params/parse_edu_isced_fields_curr.xlsx’.
3.2.4. Other variables
3.2.4.1. Age
We used the disambiguated schooling data to estimate entrepreneurs' birth years and derive their ages at the time of each work experience, education and cofounding event. In this step we combined information on the study level and the starting or ending year of each entry in the education table. First, for each education entry with an ISCED level and at least a start or end year, we used the mapping rule in Table 3.3 to estimate the birth year of the entrepreneur (see the sketch after the table). For example, if Jane started secondary school in 2005, we estimate she was 12 at the time and impute a birth year of 1993. Second, if an entrepreneur has several entries in the education table, this step may produce different estimates of the birth year; in such cases, we took the minimum of all estimated birth years. Third, we combined the (minimum) estimated birth year and the cofounding year to estimate the age of the entrepreneur at the moment of cofounding a firm.
Table 3.3: Assumed start and end ages for each study level.

| Study level | Start age | End age |
|---|---|---|
| Primary school | 6 | 11 |
| Secondary school | 12 | 18 |
| Post secondary (tertiary) | 19 | 21 |
| Bachelor's degree | 19 | 21 |
| Master's degree | 22 | 23 |
| Doctoral degree | 25 | N/A |
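A minimal sketch of the birth-year imputation, assuming education entries with study_level, start_year and end_year fields (the field and function names are illustrative):

```python
# Table 3.3 as a lookup: study level -> (assumed start age, assumed end age)
AGE_AT_LEVEL = {
    "Primary school": (6, 11),
    "Secondary school": (12, 18),
    "Post secondary (tertiary)": (19, 21),
    "Bachelor's degree": (19, 21),
    "Master's degree": (22, 23),
    "Doctoral degree": (25, None),
}

def estimate_birth_year(education_entries: list[dict]):
    """Return the minimum birth-year estimate across an entrepreneur's education entries."""
    estimates = []
    for entry in education_entries:
        start_age, end_age = AGE_AT_LEVEL.get(entry["study_level"], (None, None))
        if entry.get("start_year") and start_age is not None:
            estimates.append(entry["start_year"] - start_age)
        elif entry.get("end_year") and end_age is not None:
            estimates.append(entry["end_year"] - end_age)
    return min(estimates) if estimates else None

# estimate_birth_year([{"study_level": "Secondary school", "start_year": 2005}]) -> 1993
# The age at cofounding is then the cofounding year minus the estimated birth year.
```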