WBL Success Rates Fuzzy Data Matching Sample Code

Note: This page is from an older version of the Data Dictionary and may therefore contain information that is now out of date. It is included here for reference only.

1 The following SPSS code is to illustrate the WBL Success Rates Methodology - Fuzzy Data Matching.

Match 1 – Using National Insurance Numbers

2 This code illustrates one between year fuzzy match. This process is repeated for every between year gap.

*{START}***Match 01/02 and 02/03 using L01 L26 A15**************************************.

MATCH FILES /FILE='QSR_WBL_0203_prematch - resort.sav'
/FILE='QSR_WBL_0102_prematch - resort.sav'
/BY l01 L26 A15.

SELECT IF (sfl_0102 = 1 AND sfl_0203 = 1).
SELECT IF (l03a <> l03b).
SELECT IF (l26 <> '').
SORT CASES BY l01 l03a.

SAVE OUTFILE='0102 - 0203 Match 1.sav'
/KEEP l01 l03a l03b a15 a26a a_aola a26b a_aolb /COMPRESSED.

*{END}***Match 01/02 and 02/03 using L01 L26 A15**************************************.

Match 2 – Using Date of Birth

3 The process here is similar to Match 1, the only differences being the variables used to do the fuzzy matching and the additional criteria applied about start dates. The pre-match files have also been resorted to the different order required for this fuzzy match.

*{START}***Match 01/02 and 02/03 using L01 L11 A15 Matchv - with additional start quarter criteria**************************.

MATCH FILES /FILE='QSR _WBL_0203_prematch - resort.sav'
/FILE='QSR_WBL_0102_prematch - resort.sav'
/BY L01 L11 A15 Matchv.

SELECT IF (sfl_0102 = 1 AND sfl_0203 = 1).
SELECT IF (l03a <> l03b).
SELECT IF (XDATE.QUARTER(a27a) = XDATE.QUARTER(a27b)).
SELECT IF (XDATE.YEAR(a27a) = XDATE.YEAR(a27b)).
SELECT IF (a27a < DATE.DMY(01,08,2002) AND a27b < DATE.DMY(01,08,2002)).
SORT CASES BY l01 l03a.

SAVE OUTFILE='0102 - 0203 Match 2.sav'
/KEEP l01 l03a l03b a15 a26a a_aola a26b a_aolb /COMPRESSED.

*{END}***Match 01/02 and 02/03 using L01 L11 A15 Matchv - with additional start quarter criteria***********************.

Merge Matches 1 and 2

4 The match files created above are then merged together so that there is one fuzzy match between each year. The data from match 1 is preferred to that from match 2.

*****************************************************************.
*Combine files from matches 1 and 2, using match 1 where both have given matches.
*****************************************************************.
*{START}***Combine matches 1 and 2 for 01/02 and 02/03***********.

GET FILE='0102 - 0203 Match 1.sav'.
COMPUTE Match = 1.
ADD FILES /FILE=*
/FILE='0102 - 0203 Match 2.sav'.
IF (Match <> 1) Match = 2.
SORT CASES BY l01 l03a Match.
COMPUTE dupe = 0.
IF (l01 = lag(l01,1) AND l03a = lag(l03a,1)) dupe = 1.
EXE.
SELECT IF dupe = 0.
SAVE OUTFILE='0102 - 0203 Match.sav'
/COMPRESSED.

*{END}***Combine matches 1 and 2 for 01/02 and 02/03************************************.

Remove Clean Matches

5 Learners that match cleanly between years are then removed as we know that they have not changed their L03. This leaves us with a lookup of all non-clean fuzzy matches we have made.

*{START}***Create l01 l03 lookup for non clean matches for 01/02 to 02/03***************.

MATCH FILES /FILE='QSR_WBL_0203_prematch - resort.sav'
/FILE='QSR_WBL_0102_prematch - resort.sav'
/BY L01 L03 A15 Matchv.
SELECT IF (sfl_0102 = 1 AND sfl_0203 = 1).
COMPUTE Clean_Match = 1.
SAVE OUTFILE='0102 - 0203 Clean Match.sav'
/KEEP l01 l03a Clean_Match /COMPRESSED.
MATCH FILES /FILE='0102 - 0203 Clean Match.sav'
/FILE='0102 - 0203 Match.sav'
/BY L01 L03a.
RECODE Clean_Match (SYSMIS = 0) (else = copy).
SELECT IF NOT (Clean_Match = 1).
EXE.
SELECT IF(NOT(l01 = lag(l01,1) AND l03a = lag(l03a,1))).
SORT CASES BY l01 l03b Match.
SELECT IF(NOT(l01 = lag(l01,1) AND l03b = lag(l03b,1))).
SORT CASES BY l01 l03a.
SAVE OUTFILE ='0102 - 0203 Match final.sav'
/DROP Clean_Match Match /COMPRESSED.

*{END}***Create l01 l03 lookup for non clean matches for 01/02 to 02/03*****************.

Final Fuzzy Match Lookup Production

6 This is the process of combining the final match files into a single lookup that can then be used to update the L03 values in the original data.

*******************************************************.
*Combine the seperate lookup tables into a single table for all years that can be applied to the prematch files.
*******************************************************.
*{START}***Combine the seperate lookup tables into a single file - where an L03 changes more than once this is shown on a single record*************.

GET FILE = '0102 - 0203 Match final.sav'.
SORT CASES BY l01 l03b.
MATCH FILES /FILE=*
/FILE='0203 - 0304 Match final.sav'
/BY L01 L03b.
SORT CASES BY l01 l03c.
MATCH FILES /FILE=*
/FILE='0304 - 0405 Match final.sav'
/BY L01 L03c.
SORT CASES BY l01 l03d.
MATCH FILES /FILE=*
/FILE='0405 - 0506 Match final.sav'
/BY L01 L03d.

*{END}***Combine the seperate lookup tables into a single file - where an L03 changes more than once this is shown on a single record*********************************.

*{START}***Define l03_match as the latest l03 value*********************************.

STRING l03_match (A12).
DO IF (l03e <> '').
COMPUTE l03_match = l03e.
ELSE IF (l03d <> '').
COMPUTE l03_match = l03d.
ELSE IF (l03c <> '').
COMPUTE l03_match = l03c.
ELSE IF (l03b <> '').
COMPUTE l03_match = l03b.
END IF.

*{END}***Define l03_match as the latest l03 value*********************************.

*{START}***Define l03 as the earliest l03 where it is matched across years*********************************.

STRING l03 (A12).
IF (l03d <> '' AND l03e <> '') l03 = l03e.
IF (l03c <> '' AND l03d <> '') l03 = l03c.
IF (l03b <> '' AND l03c <> '') l03 = l03b.
IF (l03a <> '' AND l03b <> '') l03 = l03a.
SAVE OUTFILE ='0102 - 0506 Match final.sav'
/KEEP l01 l03 l03_match l03a l03b l03c l03d l03e /COMPRESSED.

*{END}***Define l03 as the earliest l03 where it is matched across years*********************************.

*{START}***Where l03 changes between 01/02-02/03 and 02/03-03/04 create a new record for the 02/03-03/04 match*********************************.

SELECT IF (l03b <> '' AND l03c <> '' and l03 <> l03b).
COMPUTE l03 = l03b.
ADD FILES /FILE = *
/FILE = '0102 - 0506 Match final.sav'.
SAVE OUTFILE ='0102 - 0506 Match final.sav'.

*{END}***Where l03 changes between 01/02-02/03 and 02/03-03/04 create a new record for the 02/03-03/04 match*********************************.

*{START}***Where l03 changes between 02/03-03/04 and 03/04-04/05 create a new record for the 03/04-04/05 match*********************************.

SELECT IF (l03c <> '' AND l03d <> '' and l03 <> l03c).
COMPUTE l03 = l03c.
ADD FILES /FILE = *
/FILE = '0102 - 0506 Match final.sav'.
SAVE OUTFILE ='0102 - 0506 Match final.sav'.

*{END}***Where l03 changes between 01/02-02/03 and 02/03-03/04 create a new record for the 02/03-03/04 match*********************************.

*{START}***Where l03 changes between 03/04-04/05 and 04/05-05/06 create a new record for the 04/05-05/06 match*********************************.

SELECT IF (l03d <> '' AND l03e <> '' and l03 <> l03d).
COMPUTE l03 = l03d.
ADD FILES /FILE = *
/FILE = '0102 - 0506 Match final.sav'.
SAVE OUTFILE ='0102 - 0506 Match final.sav'
/KEEP l01 l03 l03_match l03a l03b l03c l03d l03e.

*{END}***Where l03 changes between 01/02-02/03 and 02/03-03/04 create a new record for the 02/03-03/04 match*********************************.

*{START}***Remove duplicates*********************************.

GET FILE = '0102 - 0506 Match final.sav'.
SORT CASES BY l01 l03.
COMPUTE dupe = 0.
IF (l01 = lag(l01,1) and l03 = lag(l03,1)) dupe = 1.
EXE.
SELECT IF dupe = 0.
SAVE OUTFILE ='0102 - 0506 Match final.sav'
/KEEP l01 l03 l03_match l03a l03b l03c l03d l03e.

*{END}***Remove duplicates*********************************.

Update Pre-Match Files Using Lookup Table

7 For each year’s pre-match data, we update the L03 values to the most recent value we found through the fuzzy matching process. This ensures the L03 values will be consistent between years.

**********************************************************.
*Update pre-match files using lookup table.
**********************************************************.

MATCH FILES /FILE='QSR_WBL_0102_prematch.sav'
/TABLE='0102 - 0506 Match final.sav'
/BY L01 L03.
IF (l03a <> l03_match AND l03_match <> '') l03 = l03_match.
SORT CASES BY l01 l03 a15 matchv.
SAVE OUTFILE ='QSR_WBL_0102_fuzzyl03.sav'
/DROP l03_match l03b l03c l03d l03e /COMPRESSED.

Creator

Analysis and MI Team

Date issued

31 March 2006

Date created

24 February 2006

Document ref.

\\records.lsc.local\NAT\23 LrngSkillsPolicyInfrastr\23-07 DataCollectAlysis\23-07-03 LrnrDataAlysisDiss\nat-wblsuccessratesfuzzydatamatchingsamplecode-report-31mar2006.doc

LSC office

Learning and Skills Council
Cheylesmore House Quinton Road Coventry CV1 2WT
T 0845 019 4170 F 024 7682 3675 www.lsc.gov.uk/

Last Modified: 31 Mar 06