Splitting a vote or other piece of data into many pieces using =SPLIT
Previously I've explained how you can use Excel's text to columns feature to convert a long address or name into its consistuent parts (forename, surname, etc.)
This feature doesn't exist in Google Spreadsheets, but there is a formula which does the same thing just as easily. It's the =SPLIT formula - here's how it works:
=SPLIT(what you want to split, "where you want to split it")
Here's the concrete example:
I'm working here with some voting data from Toronto's open data portal...
(As it happens the data is not particularly open: the data is actually contained within a database which only gives results when you select from the drop-down menus, and each result has the same URL, so you cannot link directly to the data or point a computer script at it.
(In addition, you cannot get results for all politicians together - you have to download each individual's voting records separately (there's also a voting record drop-down menu hidden away in the top-right corner). Finally, the result is presented as 'Download CSV' but it's not a CSV file - it's an Excel file. It seems that someone went to File > Save As... in Excel and just added .csv on the end of the file name. This results in a file that ends .csv.xls (tip: if you're saving as CSV, select the option from the file format dropdown menu when saving))
Anyway, gripe over with. The fact that this data is so problematic actually gives us a chance to explore some data cleaning skills.
If you can open it up in Excel and save as a CSV file, then do that (You will notice that all the data is contained in one column). Alternatively, you can just delete the part of the file name that ends .xls so that it does end .csv.
Now if you open it in Google Refine or Google Docs you should have columns for each heading, rather than it all be bunched together in one column.
But one column actually contains three pieces of data: this is the one headed Result. In it are:
- The result of the vote (carried or lost)
- The votes for, and
- The votes against
If we want to treat these bits of data separately, we can use the =SPLIT formula to do so, as follows:
=SPLIT(G2, ", ")
The above is typed into cell H2. It looks at what's in G2 (the first part of the formula), and splits it wherever a comma appears (the second part of the formula, called a delimiter).
Now, H2 will contain 'carried' (or 'lost'), and I2 will contain '45-0', or whatever the vote was.
But we also want to split the vote into those for and against. See if you can work out how to adapt the formula to do that...
OK, in J2 we can type the following
=split(I2, "-")
This time it looks in I2 and splits the contents wherever a hyphen appears. J2 will now contain '45', and K2 contains '0'.
Copy the formula to the end of the spreadsheet. Here's one I prepared earlier: watch out for problems - in the very last row you can see that the vote 8-1 has been convertedinto a date, and so has any other row where the result resembles a date. This results in a #VALUE error in the next column. If you have hundreds of rows you might miss this, so use search to look for 'VALUE' and correct it manually.