Introduction
I’ve lost count of the number of times I’ve needed to generate a representative dataset of users. Of course I have access to many production datasets, but for many reasons they can’t be used. Finding datasets I’ve previously generated always seems to take longer than it should, so the most recent time I had to generate a fictitious list of users with Australian addresses, I documented how I went about it, along with the source data I used and the script that creates it.
Source Data
As the basis for my dataset I wanted representative Australian data for both people’s names and locations. After a few quick searches I found:
- Given names: Data South Australia publishes lists of baby names for both male and female babies in SA. I downloaded the 2017 lists as CSVs.
- Surnames: also from Data South Australia, I borrowed the 19th Century Arrivals list, split the Fullname column on “,” and then used Excel’s Remove Duplicates feature. I deleted all other columns, which left me with just over 13,000 surnames in a CSV file.
- Postcodes: Matthew Proctor’s list of Australian Postcodes as a CSV. This provides Postcode, Suburb and State.
- Street names: Brisbane City Council (Australia’s largest council) has a dataset of all bus locations, available as a CSV, that includes street names. As I did for the surnames, I used Excel’s Remove Duplicates feature, removed the blanks and the other columns, and ended up with just over 1,600 street names.
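The surname clean-up described above could also be done in PowerShell rather than Excel. The following is a minimal sketch, not the process I actually used. It assumes the surname is the part of Fullname before the comma, and the inline rows are stand-ins built from names in the sample output further down; with the real file you would start from Import-Csv instead (the path shown in the comment is hypothetical).

```powershell
# Stand-in rows for the arrivals CSV; only the Fullname column matters here.
# With the real download you would use something like:
#   $rows = Import-Csv -Path '.\arrivals.csv'   # hypothetical file name
$rows = @'
Fullname
"Burne, Thomas"
"Nevin, Mary"
"Burne, Ann"
'@ | ConvertFrom-Csv

# Split on the comma, keep the first part (assumed to be the surname),
# drop blanks, then sort and de-duplicate.
$surnames = $rows |
    ForEach-Object { ($_.Fullname -split ',')[0].Trim() } |
    Where-Object { $_ } |
    Sort-Object -Unique

$surnames   # Burne, Nevin

# Write a single-column CSV (hypothetical output name):
# $surnames | ForEach-Object { [pscustomobject]@{ Surname = $_ } } |
#     Export-Csv -Path '.\surnames.csv' -NoTypeInformation
```

Sort-Object -Unique compares case-insensitively by default, which is usually what you want for a list of names.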
The Script
The script is pretty simple. It imports each of the CSVs listed above and, for each file, generates a random number bounded by the number of records in that file to select a record.
The GitHub repo contains the PowerShell script along with the source files. Change line 3 to the location where you store the CSV files and change line 66 to the number of users to generate. I’ve left the end of the script empty; that is where I insert either the API call to create the users or the PowerShell cmdlet that does the creation, depending on where I’m creating the users.
Generate-Random-Users/Generate Random Users.ps1 at master · darrenjrobinson/Generate-Random-Users: using real data, randomise it to create realistic users with Australian addresses.
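As a rough illustration of the approach (this is a sketch, not the script from the repo), the example below picks a random index into each list and assembles a user from the selected records. The inline rows stand in for the imported CSV files and reuse values from the sample output; the column names GivenName, Surname and Street, the helper Get-RandomRow, and the house number range of 1 to 999 are assumptions made for the example.

```powershell
# Stand-ins for the imported CSVs. With real files this would be Import-Csv.
$givenNames = @'
GivenName
Miro
Ariana
'@ | ConvertFrom-Csv

$surnames = @'
Surname
Burne
Nevin
'@ | ConvertFrom-Csv

$streets = @'
Street
Miskin St
Orchard St
'@ | ConvertFrom-Csv

$localities = @'
Postcode,Suburb,State
3451,WOODBROOK,VIC
2640,THURGOONA,NSW
'@ | ConvertFrom-Csv

function Get-RandomRow {
    param([Parameter(Mandatory)][object[]]$Rows)
    # Get-Random's -Maximum is exclusive, so this is a valid index 0..Count-1.
    $Rows[(Get-Random -Minimum 0 -Maximum $Rows.Count)]
}

$numberOfUsers = 3

$users = @(1..$numberOfUsers | ForEach-Object {
    # Take suburb, postcode and state from the same row so they stay consistent.
    $place = Get-RandomRow -Rows $localities
    $houseNumber = Get-Random -Minimum 1 -Maximum 1000   # assumed range 1..999

    [pscustomobject]@{
        Street    = '{0} {1}' -f $houseNumber, (Get-RandomRow -Rows $streets).Street
        Surname   = (Get-RandomRow -Rows $surnames).Surname
        Suburb    = $place.Suburb
        Postcode  = $place.Postcode
        State     = $place.State
        GivenName = (Get-RandomRow -Rows $givenNames).GivenName
    }
})

$users | ConvertTo-Json
```

Picking the suburb, postcode and state as one record is what keeps each address plausible; drawing them independently would pair suburbs with the wrong state.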
The Output
Here is a sample output in JSON format.
{
    "Street": "370 Miskin St",
    "Surname": "Burne",
    "Suburb": "WOODBROOK",
    "Postcode": "3451",
    "State": "VIC",
    "GivenName": "Miro"
}
{
    "Street": "293 Preston Rd",
    "Surname": "Partingale",
    "Suburb": "MARRARA",
    "Postcode": "812",
    "State": "NT",
    "GivenName": "Daniella"
}
{
    "Street": "409 Orchard St",
    "Surname": "Liaseyer",
    "Suburb": "THURGOONA",
    "Postcode": "2640",
    "State": "NSW",
    "GivenName": "Ariana"
}
{
    "Street": "775 Station Rd",
    "Surname": "Nevin",
    "Suburb": "AVON DOWNS",
    "Postcode": "862",
    "State": "NT",
    "GivenName": "Naria"
}
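The two NT postcodes in the sample appear as three digits (812 and 862). Australian postcodes are four digits and NT ones start with 08, so the leading zero has most likely been dropped somewhere between the source data and the output. If the target system needs four-digit postcodes, a small helper can pad them back. Format-Postcode is a hypothetical name used for this sketch and is not part of the script in the repo.

```powershell
function Format-Postcode {
    param([Parameter(Mandatory)][string]$Postcode)
    # Left-pad with zeros to four characters; longer values are left unchanged.
    $Postcode.Trim().PadLeft(4, '0')
}

Format-Postcode -Postcode '812'    # 0812
Format-Postcode -Postcode '2640'   # 2640
```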
Summary
Using publicly available data and PowerShell, it is possible to quickly generate a dataset of representative users and addresses. Generating other attributes is as easy as extrapolating from the existing data or supplementing it with additional source data files.
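As one example of extrapolating from the existing data, the sketch below derives a display name, username and email address from GivenName and Surname. The attribute names, the firstname.lastname format and the example.com domain are assumptions chosen for illustration; adjust them to whatever the target system expects.

```powershell
# A user taken from the sample output above.
$user = [pscustomobject]@{ GivenName = 'Ariana'; Surname = 'Liaseyer' }

$domain = 'example.com'   # placeholder domain

# Add calculated properties alongside the existing ones.
$withExtras = $user | Select-Object *,
    @{ Name = 'DisplayName'; Expression = { '{0} {1}' -f $_.GivenName, $_.Surname } },
    @{ Name = 'UserName';    Expression = { ('{0}.{1}' -f $_.GivenName, $_.Surname).ToLower() } },
    @{ Name = 'Email';       Expression = { ('{0}.{1}@{2}' -f $_.GivenName, $_.Surname, $domain).ToLower() } }

$withExtras.DisplayName   # Ariana Liaseyer
$withExtras.UserName      # ariana.liaseyer
$withExtras.Email         # ariana.liaseyer@example.com
```

With a large generated list, two users can end up with the same given name and surname, so a real run would need some way of making usernames unique, such as appending a number.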