- Reworked the interpolation phase into (i) the interpolation phase and (ii) the model phase (see
hnyc-spatial-linkage/interpolation_notebooks/fall20_notebooks/PROCESS_INTERPOLATION.ipynb
andPROCESS_MODEL_training_ward10.ipynb
). - Extended the interpolation phase to all wards (except ward 12 and 19 which do not have enough known dwellings). The model phase is still only on ward 10. Cross validation is implemented for the model phase.
- Initiated a new sequence similarity feature. There is still an issue with this. See the end of
HNYC_Project/Projects/spatial_linkage/Spatial Linkage & Interpolation: Summer 2020.ppt
for more details. - All work is done on 1850 Manhattan. (see notebookes in
interpolation_notebooks/fall20_notebooks/
.) - Improvised elastic search match process to include more records for both 1880 and 1850 (see
ES_matching/src
folder - Elastic_Search_1850_mn v02.ipynb) - Disambiguation process on new matches (see
disambiguation_1850
folder) - Resolved dwelling conflicts on the new 1850 disambiguated output (that includes new matches) (
interpolation_notebooks/Concepts_and_Development/Dwelling_Addresses_Fill_In_and_Conflict_Resolution_Development_v2.ipynb
) - Updated 'dwelling adress fill in' process to identify unique dwelling ids and then choose dwelling address through bipartite matching both for summer and fall 1850 disambiguated run (
interpolation_notebooks/Concepts_and_Development/Unique_Dwelling_Addresses_Fill_In_and_Conflict_Resolution_Development.ipynb
)
potential for improvement
- Explore rules for elastic search match
- Identify other weight combinations for confidence score tuning
- Explore methods for creating unique dwelling ids for interpolation process
- Make comparisons in the output of Summer and Fall 2020 (with extra matches) disambiguated and address fill in processes
- Developed confidence score tuning process (see
disambiguation_1880
folder) - Developed interpolation process (see
interpolation
folder for code andinterpolation_notebooks
folder for jupyter notebooks that document process) - For overview of process and next steps see google drive
HNYC_Project/Projects/spatial_linkage/Spatial Linkage & Interpolation: Summer 2020.ppt
andHNYC_Project/Projects/spatial_linkage/Spatial Linkage and Interpolation Workflow.doc
Accomplished:
- Updated confidence score to include census conflicts (see
disambiguation_2.ipynb
) - Merged lat lng data to matched data -> produced
matched.csv
- Experimented with methods to add spatial weights, including graph-based and cluster-based approaches on a subset of the data (see
spatial-disambiguation.ipynb
) - Outlined overall workflow for disambiguation using bipartite graph matching algorithm (
linkage-disambiguation.ipynb
). - Ran algorithms on full dataset & obtained initial performance metrics
- Functionalized process using disambiguation module
- Tuned algorithms and compared metrics + recommendations
- Updated ES matching process to reflect metaphone matching See Google Drive Documentation > disambiguation > Spatial Linkage: Spring 2020 for slides
Two sources:
- City Directory: name (first, last, initial), address, ID, occupation, ED, ward, block number (constructed)
- Census: address (hidden during match process), ID (different from CD), occupation, ED, ward, age, gender, dwelling, household characteristics
Steps:
0. Preprocessing: changing names to their phonetic index
- Initial entity link (to generate possible matches): criteria = same ED + JW dist < 2 on indexed name
- Disambiguation (to choose between non-unique matches):
a. Generate a confidence score based on occupation (having occupation), age > 12 (not implemented), JW dist of name, relative probability (number of non-unique matches)
Similar, but no ED and address data, only ward data in the census.
-
/ES_matching
contains all work related to elastic searchreadme.md
gives guidelines on how to implement the processfall_2019_analysis.md
describes the work done up to fall 2019/doc
: relevant documentation of the process/src
: source code for running elastic search on our dataexplore_output.ipynb
quick validation of the ES output (targeted at understanding whether metaphone matching was implemented in ES process)
-
/disambiguation_1850
contains all work related to disambiguation of 1880 datadisambiguation_1850_v1.ipynb
runs disambiguation process on 1850 ES output
-
/disambiguation_1880
contains all work related to disambiguation of 1880 dataConfidence_Score_Tuning.ipynb
: Documents confidence score tuning processConfidence_Score_Tuning_v02.ipynb
: Confidence score tuning results used for 1850 v02 disambiguation run (10/2020), uses old version of 1880 data because of data issues with new 1880Confidence_Score_Tuning_new_1880_data_draft.ipynb
: Attempt to tune confidence score with new 1880 data, revealed that there was an issue with the data/_archived
: archived scripts/confidence_score
preprocess.ipynb
: preprocessing of data including generation of metaphonesget_confidence_score.ipynb
: outlines process for generating confidence score, including calculation of jaro-wrinklerdisambiguation_analysis.ipynb
: EDA on confidence scores -- generatesmatch_results_confidence_score.csv
andfall_2019_disambiguation_report.md
/linkage_eda
spatial-disambiguation.ipynb
: documentation of different spatial weight algorithmslinkage-disambiguation.ipynb
: outline record linkage approach (conceptually valid but code is outdated)linkage_full_run_v1.ipynb
: applies basic algorithm to the whole dataset + initial performance analysislinkage_eda.ipynb
: applies various iterations of algorithm to the whole dataset + conclusionslinkage_eda_v2.ipynb
: applies updated geocodes on best 2 algorithms + improved benchmarking
run_link_records.ipynb
: implemented record matching using pysparkconfidence_score_latlng.ipynb
: adding of census conflicts to confidence score, merging of lat lng data (contains most updated confidence score formula)linkage_full_run_SPRING_LATEST.ipynb
: informed by linkage EDA (see archive), generates latest disambiguated output from ES matching (with metaphone issue fixed)
-
/interpolation_notebooks
Process_Documentation
Disambiguation_Analysis_v01.ipynb
: Resolves dwelling conflicts, calculates statistics,explores distance based sequences, and interpolation between known dwellingsInterpolation_v01.ipynb
: Runs through current version of predicting unknown recordsDisambiguation_Analysis_v02.ipynb
: Same information as v01, for new data (10/2020)Interpolation_v02.ipynb
: Same information as v01, for new data (10/2020)
Concepts_and_Development
:Block and Centroid Prediction with Analysis.ipynb
: Walks through approaches to predicting block numbers directly, and then clusters (tests different clustering algorithms)Block Centroids and What They Represent, 1850.ipynb
: Creates block centroids and illustrates them with visualizationsDwelling Addresses Fill In and Conflict Resolution Development.ipynb
: Development of conflict resolution within dwelling processDeveloping Distance Based Sequences.ipynb
: Process of developing distance based sequencesModel Comparison.ipynb
: Tests a few different model options (no in depth tuning)Sequences Exploration.ipynb
: Tests different iterations of sequence identificationModel Exploration.ipynb
: Brief experimentation with using neural networks, incomplete because of preprocessing necessary
Archived
1880_1850_for_Interpolation.ipynb
: Explores 1880 and 1850 census datasetsFeature_Exploration.ipynb
: Explores some of the columns in 1880 and 1850 datasets in order to determine what they represents and if they can be used for modellingInterpolation Pilots.ipynb
: Working notebook for starting explorations of options for interpolation (often moved into a separate notebook when they seem worth looking at in more depth)Linear_Model.ipynb
: Creates and tests linear models for house number interpolationModeling Comparison.ipynb
: Tests different modeling approaches for house numbers (currently linear model and gradient boosting) -- includes haversine sequences and block numbers as featuresBlock_Numbers Early Exploration.ipynb
: Explore block numbers distributions/data analysis and try using them as feature to predict house numberStreet_Dictionaries.ipynb
: Tried out looking at street dictionaries for dwellings in betweenBlock Number Prediction.ipynb
: Initial experiment with predicting block numbersInterpolation between known address development.ipynb
: Process of looking at values between known dwellings
fall20_notebooks/
EDA_INTERPOLATION_[].ipynb
: the eda files for the interpolation phaseEDA_MODEL_[].ipynb
: the eda files for the model phase.PROCESS_INTERPOLATION.ipynb
: the process file for the interpolation phase. It reads in a file frominterpolation_notebooks/Process_Documentation/Disambiguation_Analysis_v02.ipynb
. This is an extension ofinterpolation_notebooks/Process_Documentation/Interpolation_v02.ipynb
.PROCESS_MODEL_training_ward10.ipynb
: the process file for the model phase. It reads in the file fromPROCESS_INTERPOLATION.ipynb
. It is currently applied only on ward 10.OUTDATED_model inspection.ipynb
: old code. It inspects different setting of a mdoel.
-
/interpolation
See read me within this folder for details -
/disambiguation
is a python module containing wrapper functions needed in the disambiguation processinit.py
contains a Disambiguator object, when instantiated can be used to run entire disambiguation process, calling functions from below (seelinkage_eda.ipynb
for example on usage)preprocess.py
contains functions needed before applying disambiguation algorithms, including confidence score generationdisambiguation.py
contains functions needed for disambiguationanalysis.py
wrapper functions to produce performance metricsconfidence_score_tuning.py
contains functions needed for the confidence tuning processbenchmarking.py
contains Benchmark objects, to run benchmarking process in confidence tuning for 1880
-
/matching_viz
visualization web app to understand disambiguation output, see readme in folder for guidance on how to run
Data is available in the HNYC Spatial Linkage Google Drive HNYC_Project/Projects/spatial_linkage/Data
- 1850 disambiguated output:
1850_disambiguated.csv
- 1850 disambiguated output 10/2020 (current):
1850_mn_match_v02.csv
- 1850 ES matches:
es-1850-22-9-2020.csv
- 1850 Interpolated output 12/2020 (current):
census_interpolated_1850_mn_20201219.csv
- Matches with confidence score (raw input for 1880 disambiguation processes):
matches.csv
- this is based off Fall 2019's Spark matching output
- Latest ES matches:
es-1880-21-5-2020.csv
- Latest Disambiguated Output:
disambiguated-21-5-2020.csv