Commit db29245

Merge pull request #3 from Senzing/butcher.initial-content
initial commit
2 parents d4a7278 + d2908df commit db29245

4 files changed

Lines changed: 1055 additions & 1 deletion

README.md

Lines changed: 109 additions & 1 deletion
@@ -1 +1,109 @@
1-
# mapper-dnb-ubo
1+
# mapper-dnb
2+
3+
## Overview
4+
5+
The [dnb_mapper.py](dnb_mapper.py) Python script converts Dun & Bradstreet (DNB) files to JSON files ready to load into Senzing. It supports the following formats ...
6+
- Companies and their principals **(CMPCVF)** json format
7+
- Global contacts **(GCA)** tab delimited csv format
8+
- Ultimate beneficial owners **(UBO)** tab delimited csv format
9+
10+
Normally these are provided by DNB on request and placed on an FTP server for you to download.
11+
12+
*Warning: the [dnb_formats.json](dnb_formats.json) file contains the exact structure of these files. You may need to send these formats to DNB so they know exactly how to create them!*
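
As an illustration of how the format definitions can be inspected before contacting DNB, here is a minimal Python sketch. It assumes only the layout of dnb_formats.json shown in this commit (a list of entries with `formatCode`, `fileType`, `encoding`, and either `columns` or `jsonStructure`). The helper names `load_format` and `describe_format` are hypothetical and are not part of dnb_mapper.py.

```python
import json


def load_format(formats_path, format_code):
    """Return the entry in the formats file whose formatCode matches."""
    with open(formats_path, encoding="utf-8") as f:
        formats = json.load(f)
    for entry in formats:
        if entry["formatCode"] == format_code.upper():
            return entry
    raise KeyError(f"unknown DNB format: {format_code}")


def describe_format(entry):
    """Summarize one format entry as a single line of text."""
    if "columns" in entry:
        detail = f"{len(entry['columns'])} columns"
    else:
        # The CMPCVF entry points at a JSON structure instead of a column list.
        detail = entry.get("jsonStructure", "no structure listed")
    return f"{entry['formatCode']}: {entry['fileType']}, {entry['encoding']}, {detail}"
```

Calling `describe_format(load_format("dnb_formats.json", "UBO"))` gives a quick summary that can be compared against the files DNB delivers.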
13+
14+
Loading DNB data into Senzing requires additional features and configurations. These are contained in the
15+
[dnb_config_updates.json](dnb_config_updates.json) file.
16+
17+
Usage:
18+
```console
19+
python3 dnb_mapper.py --help
20+
usage: dnb_mapper.py [-h] [-f DNB_FORMAT] [-i INPUT_SPEC] [-o OUTPUT_PATH]
21+
[-l LOG_FILE]
22+
23+
optional arguments:
24+
-h, --help show this help message and exit
25+
-f DNB_FORMAT, --dnb_format DNB_FORMAT
26+
choose CMPCVF, UBO, or GCA
27+
-i INPUT_SPEC, --input_spec INPUT_SPEC
28+
the name of one or more DNB files to map (place in
29+
quotes if you use wild cards)
30+
-o OUTPUT_PATH, --output_path OUTPUT_PATH
31+
output directory or file name for mapped json records
32+
-l LOG_FILE, --log_file LOG_FILE
33+
optional statistics filename (json format).
34+
```
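
The help text above could be produced by an argument parser along the following lines. This is a sketch reconstructed from the usage message only, not the actual dnb_mapper.py source; the `choices` restriction and the `expand_input_spec` helper are assumptions. It also shows why a wildcard input spec should be quoted: the quoted pattern reaches the script as one argument and is expanded there, instead of being expanded by the shell into several arguments.

```python
import argparse
import glob


def build_parser():
    """Build a parser that accepts the four options listed in the help text."""
    parser = argparse.ArgumentParser(prog="dnb_mapper.py")
    parser.add_argument("-f", "--dnb_format", choices=["CMPCVF", "UBO", "GCA"],
                        help="choose CMPCVF, UBO, or GCA")
    parser.add_argument("-i", "--input_spec",
                        help="the name of one or more DNB files to map")
    parser.add_argument("-o", "--output_path",
                        help="output directory or file name for mapped json records")
    parser.add_argument("-l", "--log_file",
                        help="optional statistics filename (json format)")
    return parser


def expand_input_spec(input_spec):
    """Expand a (possibly wildcarded) input spec into a sorted list of paths."""
    return sorted(glob.glob(input_spec))
```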
35+
36+
## Contents
37+
38+
1. [Prerequisites](#Prerequisites)
39+
2. [Installation](#Installation)
40+
3. [Configuring Senzing](#Configuring-Senzing)
41+
4. [Running the mapper](#Running-the-mapper)
42+
5. [Loading into Senzing](#Loading-into-Senzing)
43+
44+
### Prerequisites
45+
- python 3.6 or higher
46+
- Senzing API version 1.7 or higher
47+
- https://github.com/Senzing/mapper-base
48+
49+
### Installation
50+
51+
Place the following files in a directory of your choice ...
52+
- [dnb_mapper.py](dnb_mapper.py)
53+
- [dnb_config_updates.json](dnb_config_updates.json)
54+
- [dnb_formats.json](dnb_formats.json)
55+
56+
*Note: Since the mapper-base project referenced above is required by this mapper, it is necessary to place them in a common directory structure like so ...*
57+
```Console
58+
/senzing/mappers/mapper-base
59+
/senzing/mappers/mapper-dnb <--
60+
```
61+
You will also need to set the PYTHONPATH to where the base mapper is as follows ... (assuming the directory structure above)
62+
```Console
63+
export PYTHONPATH=$PYTHONPATH:/senzing/mappers/mapper-base
64+
```
65+
66+
### Configuring Senzing
67+
68+
*Note:* This only needs to be performed one time! In fact you may want to add these configuration updates to a master configuration file for all your data sources.
69+
70+
From the /opt/senzing/g2/python directory ...
71+
```console
72+
python3 G2ConfigTool.py <path-to-file>/dnb_config_updates.json
73+
```
74+
This will step you through the process of adding the data sources, entity types, features, attributes and other settings needed to load this DNB data into Senzing. After each command you will see a status message saying "success" or "already exists". For instance, if you run the script twice, every command will report "already exists" the second time, which is OK.
75+
76+
Configuration updates include:
77+
- addDataSource **DNB-COMPANY** used when mapping companies from CMPCVF json files
78+
- addDataSource **DNB-PRINCIPLE** used when mapping principals from CMPCVF json files
79+
- addDataSource **DNB-OWNER** used when mapping owners from UBO csv files
80+
- addDataSource **DNB-CONTACT** used when mapping contacts from GCA csv files
81+
- addEntityType **PERSON**
82+
- addEntityType **ORGANIZATION**
83+
- add features and attributes for ...
84+
- **DNB_OWNER_ID** This is used to help prevent owners from resolving to each other and so that you can search on it.
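
To review what the configuration file will do before running it, the commands can be listed with a short Python sketch like the one below. It assumes the layout of dnb_config_updates.json shown in this commit: one command per line, optionally followed by a plain name or a JSON object. The function name `parse_config_commands` is hypothetical and not part of G2ConfigTool.

```python
import json


def parse_config_commands(text):
    """Split config-update text into (command, argument) pairs.

    JSON object arguments are decoded into dicts; plain arguments stay strings.
    Commands without an argument, such as "save", get an empty string.
    """
    commands = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        name, _, argument = line.partition(" ")
        argument = argument.strip()
        if argument.startswith("{"):
            argument = json.loads(argument)
        commands.append((name, argument))
    return commands
```

For example, filtering the result for `addDataSource` entries yields the data source names that the loader will expect to exist.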
85+
86+
### Running the mapper
87+
88+
First, download the DNB files you want to load from the DNB FTP site. Since the data files are so large, these are normally split into multiple files.
89+
90+
Second, run the mapper. Example usage:
91+
```console
92+
python3 dnb_mapper.py -f CMPCVF -i "./input/CMPCVF*.txt" -o ./output -l cmpcvf_stats.json
93+
94+
python3 dnb_mapper.py -f GCA -i "./input/GCA*.txt" -o ./output -l gca_stats.json
95+
96+
python3 dnb_mapper.py -f UBO -i "./input/UBO*.txt" -o ./output -l ubo_stats.json
97+
```
98+
The output file defaults to the same name and location as the input file and a .json extension is added.
99+
100+
*It is critical that the -f file format match the input files exactly!*
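
One way to catch a format mismatch early is to check the field count of every row against the column list in dnb_formats.json before mapping. The sketch below is illustrative only and does not reproduce the mapper's own checks. It assumes the tab delimited files have no header row and no quoted fields, and uses the `latin1` encoding given in dnb_formats.json; `read_tab_rows` is a hypothetical helper name.

```python
import csv


def read_tab_rows(path, columns, encoding="latin1"):
    """Yield one dict per data row, keyed by the expected column names.

    Raises ValueError on the first row whose field count differs from the
    number of expected columns, which usually means the wrong format was chosen.
    """
    with open(path, encoding=encoding, newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_number, fields in enumerate(reader, start=1):
            if not fields:
                continue  # ignore completely empty lines
            if len(fields) != len(columns):
                raise ValueError(
                    f"{path} line {line_number}: expected {len(columns)} "
                    f"fields, found {len(fields)}"
                )
            yield dict(zip(columns, fields))
```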
101+
102+
### Loading into Senzing
103+
104+
If you use the G2Loader program to load your data, it's best to list the mapped json files you want to load in a project file. There is an example of one in your Senzing installation here: /opt/senzing/g2/python/demo/sample/project.csv. Then from the /opt/senzing/g2/python directory ...
105+
```console
106+
python3 G2Loader.py -p <name of project file>
107+
```
108+
109+
If you use the API directly, then you just need to perform a process() or addRecord() call for each line of each mapped file.
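
The per-line loop for direct API use might look like the following Python sketch. It assumes each mapped file holds one JSON record per line, encoded as UTF-8, with `DATA_SOURCE` and `RECORD_ID` keys; none of this is confirmed by this commit. The `add_record` argument is a hypothetical callback that you would wrap around whichever Senzing call you use, so no particular API signature is implied here.

```python
import json


def load_mapped_file(path, add_record):
    """Call add_record(data_source, record_id, json_line) for each record.

    Returns the number of records passed to the callback.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # RECORD_ID is read with .get() because its presence is an assumption.
            add_record(record["DATA_SOURCE"], record.get("RECORD_ID"), line)
            count += 1
    return count
```

Passing a callback keeps the file handling separate from the engine call, which also makes the loop easy to test without a Senzing installation.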

dnb_config_updates.json

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
addDataSource DNB-OWNER
2+
addDataSource DNB-CONTACT
3+
addDataSource DNB-COMPANY
4+
addDataSource DNB-PRINCIPLE
5+
addEntityType PERSON
6+
addEntityType ORGANIZATION
7+
8+
addFeature {"feature": "DNB_OWNER_ID", "class": "ISSUED_ID", "behavior": "F1E", "anonymize": "No", "candidates": "Yes", "standardize": "PARSE_ID", "expression": "EXPRESS_ID", "comparison": "EXACT_COMP", "elementList": [{"element": "ID_NUM", "expressed": "No", "compared": "Yes"}, {"element": "ID_NUM_STD", "expressed": "Yes", "compared": "No"}, {"element": "ID_LAST4", "expressed": "No", "compared": "No"}]}
9+
addAttribute {"attribute": "DNB_OWNER_ID", "class": "IDENTIFIER", "feature": "DNB_OWNER_ID", "element": "ID_NUM", "required": "Yes", "default": "", "advanced": "No", "internal": "No"}
10+
11+
save
12+
updateDatabase

dnb_formats.json

Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
1+
[
2+
{
3+
"formatCode": "CMPCVF",
4+
"fileType": "json",
5+
"encoding": "latin1",
6+
"jsonStructure": "https://directplus.documentation.dnb.com/html/resources/JSONSample_CMP.html#cmpcvf"
7+
},
8+
{
9+
"formatCode": "UBO",
10+
"fileType": "tab",
11+
"encoding": "latin1",
12+
"columns": [
13+
"SUBJ_DUNS",
14+
"EXT_SUBJ_REF_ID",
15+
"SUBJ_NME",
16+
"SUBJ_ADR_LN1",
17+
"SUBJ_ADR_LN2",
18+
"SUBJ_ADR_LN3",
19+
"SUBJ_PRIM_TOWN",
20+
"SUBJ_CNTY",
21+
"SUBJ_POST_CD",
22+
"SUBJ_PROV_OR_ST",
23+
"SUBJ_CTRY_CD",
24+
"SUBJ_CTRY_NME",
25+
"SUBJ_LGL_FORM_CD",
26+
"SUBJ_LGL_FORM_DESC",
27+
"SIC_CD",
28+
"SIC_CD_DESC",
29+
"SUBJ_OOB",
30+
"PRNT_DUNS",
31+
"PRNT_NME",
32+
"DOM_ULT_DUNS",
33+
"DOM_ULT_NME",
34+
"GLBL_ULT_DUNS",
35+
"GLBL_ULT_NME",
36+
"STAT_CD",
37+
"STAT_MSG",
38+
"BENF_NME",
39+
"BENF_DUNS",
40+
"PERS_ID",
41+
"BENF_TYP_CD",
42+
"BENF_TYP_DESC",
43+
"BENF_LGL_FORM_CD",
44+
"BENF_LGL_FORM_DESC",
45+
"BENF_ADR_LN1",
46+
"BENF_ADR_LN2",
47+
"BENF_ADR_LN3",
48+
"BENF_PRIM_TOWN",
49+
"BENF_CNTY",
50+
"BENF_POST_CD",
51+
"BENF_PROV_OR_ST",
52+
"BENF_CTRY_CD",
53+
"BENF_CTRY_NME",
54+
"NATY",
55+
"DT_OF_BRTH",
56+
"DIRC_OWRP_PCTG",
57+
"IDIR_OWRP_PCTG",
58+
"BENF_OWRP_PCTG",
59+
"BENF_INDC",
60+
"OWRP_UNAV_REAS",
61+
"MINY_SHRH",
62+
"BENF_OOB",
63+
"DEPTH",
64+
"BENF_UDSC",
65+
"BENF_OWRP_CMTRY",
66+
"BENF_ID",
67+
"SUBJ_CTRL_TYP_CD",
68+
"SUBJ_CTRL_TYP_DESC",
69+
"SUBJ_CTRL_TYP_CFDC_CD",
70+
"SUBJ_CTRL_TYP_CFDC_DESC",
71+
"BENF_CTRL_TYP_CD",
72+
"BENF_CTRL_TYP_DESC",
73+
"BENF_CTRL_TYP_CFDC_CD",
74+
"BENF_CTRL_TYP_CFDC_DESC",
75+
"SUBJ_OWRP_UNAV_REAS"
76+
]
77+
},
78+
{
79+
"formatCode": "GCA",
80+
"fileType": "tab",
81+
"encoding": "latin1",
82+
"columns": [
83+
"ROWNUM",
84+
"CONTACT_ID",
85+
"EMAIL",
86+
"EMAILDOMAIN ",
87+
"FIRSTNAME",
88+
"MIDDLENAME",
89+
"LASTNAME",
90+
"NAMEPREFIX",
91+
"NAMESUFFIX",
92+
"PRIMARYPHONE",
93+
"PRIMARYPHONEREGIONFORMAT",
94+
"PRIMARYPHONEEXTENSION",
95+
"SECONDARYPHONE",
96+
"SECONDARYPHONEREGIONFORMAT",
97+
"SECONDARYPHONEEXTENSION",
98+
"PRIMARYPHONETYPE",
99+
"SECONDARYPHONETYPE",
100+
"JOBTITLE",
101+
"GCA_VANITYTITLE",
102+
"GCA_JOBTITLEFUNCTION_IDS",
103+
"GCA_PRIMARYJOBFUNCTION_ID",
104+
"GCA_JOBTITLELEVEL_IDS",
105+
"GCA_JOBTITLEFUNCTIONNAMES",
106+
"GCA_PRIMARYJOBFUNCTIONNAME",
107+
"GCA_JOBTITLELEVELNAMES",
108+
"GCA_FULLPOSTALADDRESS",
109+
"GCA_STREETADDRESS1",
110+
"GCA_STREETADDRESS2",
111+
"GCA_CITYNAME",
112+
"GCA_STATEPROVINCECODE",
113+
"GCA_STATEPROVINCENAME",
114+
"GCA_USZIP",
115+
"GCA_USZIP4",
116+
"GCA_POSTALCODE",
117+
"GCA_COUNTY",
118+
"GCA_COUNTRYCODE",
119+
"GCA_LATITUDE",
120+
"GCA_LONGITUDE",
121+
"GCA_OWNER_ID",
122+
"GCA_ORIGIN_ID",
123+
"GCA_VALIDITY",
124+
"GCA_CONFIDENCE",
125+
"GCA_CONFIDENCEDATE",
126+
"GCA_PHONEACCURACYDATE",
127+
"GCA_PHONEACCURACYSCORE",
128+
"GCA_EMAILACCURACYDATE",
129+
"GCA_EMAILDELIVERABILITY",
130+
"GCA_BUSINESSNAME",
131+
"GCA_PREMIUMPRODUCTCODE",
132+
"GCA_GENDER",
133+
"GCA_NICKNAME",
134+
"GCA_TITLEACCURACYSCORE",
135+
"GCA_DATAFRESHNESSSCORE",
136+
"GCA_SOCIALVERIFIEDFLAG",
137+
"GCA_TITLEMATCHFLAG",
138+
"TWITTERPROFILEURL",
139+
"FACEBOOKPROFILEURL",
140+
"LINKEDINPROFILEURL",
141+
"ROLE_IDS",
142+
"INDIVIDUAL_ID",
143+
"DUNS_ID",
144+
"DUNS",
145+
"TRUSTkrEDDUNS",
146+
"SMSACODE",
147+
"PRIMARYMRCCODE",
148+
"MRCCODES",
149+
"CEOINDICATOR",
150+
"ENTITYRESOLUTIONCONFIDENCECODE",
151+
"ENTITYRESOLUTIONMATCHGRADE",
152+
"ENTITYRESOLUTIONMATCHDATAPROFILE",
153+
"CHANGERECORDALERTTIME",
154+
"GCA_SMARTSORTSCORE",
155+
"GCA_DAYSSINCEPUBLISH"
156+
]
157+
}
158+
]
