Skip to content

Prepare the data

Data source

The data for the genus and species tables comes from the spreadsheet in the sidebar. Download and open the spreadsheet and you will see that it contains many columns which are not required in either table. You will therefore need to remove them, and you will also have to produce two different versions of the data, one for each table.

When making changes to data, you should always try to keep the original version unchanged in case you need to go back to it at any time. For this exercise, you can use a separate copy of the spreadsheet for each table, or you can copy the data from the first tab in the spreadsheet onto another.

Genus

The data required for genus is found in only two columns headed Family and Genus. You therefore need a copy of the data which contains only these two columns. To delete columns, click their headers, then right-click and choose Delete from the pop-up menu as shown below.

Deleting columns

The next task is to remove the duplicate values from the two remaining columns. With 13583 rows, this is not a task that you want to do by hand. Instead you can get Excel to remove the duplicates for you. Start this operation by selecting the two columns using their headers, and then choosing the Advanced filter option from the Data tab. You may be prompted to confirm that the first row contains labels as shown below, and you should answer OK.

Advanced filter

A dialog is then displayed which allows you to specify the parameters for the filter operation. You need to copy the results to a different location in the spreadsheet and keep only unique records. The graphic below shows how the dialog should look - Sheet1!\(D\)1 is cell D1 on the tab called Sheet1.

Filter options

Once the filtered results appear, you no longer need the original columns and you can delete them. The data now only takes up 1385 rows which is a tenth of the original, but it still needs some cleaning. In particular, there are rows with null values in the Genus column which need to be removed. This is most easily done if the rows all occur together in a block, so select the entire sheet by clicking the top left-hand corner and click the Sort icon on the Data tab. In the sort dialog, make sure the My data has headers checkbox is checked as shown in the graphic below, and then sort by Genus.

Sorting rows

Once the rows are sorted, those with no values in the Genus column will appear at the bottom, and can be deleted in a single operation. The data is now ready to be transformed into a set of insert statements.

Country

Before moving on, there is one other piece of preparation that needs to be done. Country data comes from the second spreadsheet in the sidebar; however, if you open it you will see that there are many blank cells. These have been left for readability, but you will need to copy the values for a country from the first row in a block to all of the other rows representing regions in that country. This can be done very quickly using standard Excel functions.

The first thing to do is to select the range of cells that you want to change. In this case, than means cells A2 to B3949. Do not select the columns A and B using the column headers - if you do, you will end up with data in those columns for the maximum number of rows the spreadsheet can hold.

In Excel, you can use a simple formula to make one cell display the same value as another. In the next couple of steps, you will fill all of the blank cells with a formula which makes them display the same value as the cell immediately above.

Select all the blank cells in the range by typing CTRL+G for Go to... and clicking Special... (1). You can select the blanks option in the dialog box as shown below (2).

Blanks

When you click OK you will see the selection change to include only the blank cells with the first one (A3) being the active cell. When you start typing the input will be placed into the active cell, so you should type =A2

This simple formula makes cell A3 display the same value as cell A2. To copy the same formula to all the currently selected cells, press CTRL+ENTER. When copying functions, Excel updates any cell references, so each of the previously blank cells now contains a unique formula that refers to the cell immediately above it. You can check this by clicking on one.

The final step is to replace the formulae with real values. This will avoid any problems that might arise when cells are re-ordered.

Select the columns containing the formulae by clicking on their column headers, and then press CTRL+C to copy them. Click on cell A1 (ie the first one in the selected range) to activate it, then right-click in it and choose Paste Values from the pop-up menu as shown below.

Special paste

The appearance of the page will not change, but now if you click in one of the cells which was previously blank you will see that the formula has been replaced by an ordinary value.

Save this version of the spreadsheet for use in the next section.