
Data

If research is about collecting data and analysing it in order to draw conclusions, it is important to have a clear understanding of what data means. Equipped with this understanding, you can then address some questions that are more specific to your project, such as:

Questions about data

What data do I need to analyse in order to address my aim and/or answer my research questions?

Where does that data come from? i.e. What data sources are available to me?

What is an appropriate (and feasible) method that I can use to collect the data?

What would be an appropriate way to process the data?

What form should my results take?

Quantitative data

This is perhaps the easiest type of data to understand. It is numerical and can be collected by counting or measuring things. Numbers can be analysed using mathematical or statistical methods and while some of those may be difficult to understand, they are often easy to apply using appropriate software tools and libraries. The link at the bottom of the page provides a very accessible introduction to a range of statistical tests that can help you to decide which tests you need in a particular situation. Excel is often a good choice for analysing quantitative data since it provides a wide range of formulae and has integrated charting functions.

Python is another good option for processing quantitative data. Although it requires more effort than Excel, it affords greater control, provides a wider range of mathematical methods through the use of libraries such as pandas, scipy and numpy, and integrates well with many libraries for creating interactive charts such as matplotlib and plotly. A further advantage of Python is that the data collection and analysis functions can be built into your application code.
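As a minimal sketch of what this looks like in practice, the following example computes a few descriptive statistics for a small set of measurements. It uses only Python's built-in statistics module so that it runs without installing anything; libraries such as pandas and numpy provide equivalent functions for larger datasets. The variable name load_times_ms and the values in it are made up purely for illustration.

```python
import statistics

# Hypothetical measurements: page load times in milliseconds (invented values).
load_times_ms = [120, 135, 128, 150, 142, 138, 131, 127]

# Collect the usual descriptive statistics in one place.
summary = {
    "n": len(load_times_ms),
    "mean": statistics.mean(load_times_ms),
    "median": statistics.median(load_times_ms),
    "stdev": statistics.stdev(load_times_ms),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The same calculations could be done in Excel; the benefit of a script is that the analysis can be rerun unchanged whenever the data changes.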

Qualitative data

Sometimes, the most appropriate source of data in a project is people. They can be system users, organisational staff, security professionals or any other group of interest. They can provide you with opinions, judgements, feedback, etc. but this data is more difficult to process effectively because it comes in the form of responses to questions. It is difficult to extract usable data from raw material in written form, but there are established techniques that help to ensure your data is as reliable as possible.

Qualitative data is typically collected by survey, interview or focus group. Like so many other aspects of academic research, these terms have much more precise meanings than they do in everyday language. For one thing, each of them refers to a class of methods, and to use one you would need to be specific about which variation you are using. Simply asking a series of questions does not necessarily constitute a semi-structured interview, for example. There are many steps that need to be carefully designed and managed in order to have confidence in the data that is being collected.

Sampling

One of the things you need to consider is how representative your final results will be. Usually, it is not possible to test absolutely every variation of a case to reach a conclusive result. In that situation, your results will still be valid, but only for a restricted set of conditions - those which match the conditions of your project. For example, you might train a machine learning model on a certain set of data and report on its prediction accuracy. This does not enable you to definitively say that the algorithm will work with equal accuracy on all datasets.

When you carry out your evaluation on a restricted set of conditions, you are working with a sample of cases. The composition of your sample should be discussed, and ideally you should take steps to ensure that your sample is as representative as possible. This will enable you to make the broadest claims in your conclusions.

The concept of sampling is most clearly explained with reference to survey methods. Take, for example, a project which aims to establish project managers' opinions on the relative importance of contextual factors in relation to project success. Ideally, you would like to present results from the total population of project managers. However, it is not possible to ask them all, and so you need to select a sample. You might restrict your sample to project managers from public-sector organisations, in which case you cannot extrapolate any of your findings to the private sector. On the other hand, you might ensure that your sample includes project managers from a wide range of different types of project. This would allow you to claim that your sample is representative of more than just one type of project.
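One way to make sure a sample covers a range of project types is to draw the same number of participants at random from each type, which is often called stratified sampling. The sketch below illustrates the idea under simple assumptions: the function stratified_sample, the list frame and the three project types are all hypothetical, and a real study would need a properly justified sampling frame and sample size.

```python
import random

def stratified_sample(population, key, per_group, seed=0):
    """Draw up to `per_group` members at random from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    groups = {}
    for member in population:
        groups.setdefault(key(member), []).append(member)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Hypothetical sampling frame: (manager id, project type) pairs.
project_types = ["software", "construction", "research"] * 10
frame = list(enumerate(project_types))

sample = stratified_sample(frame, key=lambda m: m[1], per_group=3)
for manager_id, project_type in sample:
    print(manager_id, project_type)
```

Recording the seed and the selection procedure makes it easier to describe the composition of the sample when you write up your method.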

Sampling is also relevant in quantitative studies that take an experimental approach. To cope with statistical variation in the results of a single experiment, it is common practice to run the experiment many times and to take an average of the results. The average is deemed more representative of all cases than a single result would be.
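The practice of repeating an experiment and averaging the results can be sketched as follows. The function run_experiment is a stand-in that simply returns a noisy number; in a real project it would be replaced by the actual measurement. The number of repetitions (30) is an arbitrary choice for the example, not a recommendation.

```python
import random
import statistics

def run_experiment(rng):
    """Stand-in for one experimental run: returns a noisy measurement."""
    return 10.0 + rng.gauss(0, 1.0)

rng = random.Random(42)  # fixed seed so the example is reproducible
results = [run_experiment(rng) for _ in range(30)]

mean = statistics.mean(results)
stdev = statistics.stdev(results)   # spread of the individual runs
sem = stdev / len(results) ** 0.5   # standard error of the mean

print(f"runs: {len(results)}")
print(f"mean: {mean:.3f}")
print(f"standard deviation: {stdev:.3f}")
print(f"standard error of the mean: {sem:.3f}")
```

Reporting the spread alongside the average shows the Reader how much individual runs varied, which a single averaged figure would hide.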

What data is required?

The data you need should be determined by the aim of the project and by your research questions if you have them. You need to show a logical connection, which could take the form of a conditional statement:

"If I find this data, then I can conclude that..."

A common error when designing a project is to overlook the connection between the aim and the data, and simply to use data that is convenient. This mistake can appear in undergraduate projects, for example, when the aim of the project might be to evaluate the effectiveness of a geographical localisation algorithm, but the evaluation focuses on whether the users enjoyed the experience of using the app. Although it is easy to collect user experience data, it is not relevant to the aim of the project in this particular case.

It is in answering the fundamental question of what data you need that you find out whether you will be dealing with quantitative or qualitative data.

Data sources

The next logical step after deciding what data is needed is to consider where that data might come from. There may be multiple possible sources, and you should discuss their suitability for the current project. It may be, for example, that the ideal data is not accessible to you and you therefore need to find an alternative.

The sources of quantitative data are usually fairly clear. Complications can arise, however, if you cannot get direct access to the data, and you have to use a proxy - that is, an alternative metric that can stand in for the ideal data. If you want to investigate the energy cost of a particular algorithm, for example, it is not possible to fully isolate the energy used by that specific algorithm from any other processing that the computer is doing at the time. In that case, you would need to identify a metric that you can use instead such as the algorithm's execution time. In presenting the metric, you should clearly explain why it should be accepted in place of the actual data.
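If execution time is chosen as the proxy, it could be measured along the following lines. This is only a sketch: the helper time_algorithm is a hypothetical name, Python's built-in sorted function stands in for the algorithm under study, and the input size and number of repeats are arbitrary. Measured times will vary with the machine and with whatever else it is doing, which is why several repeats are taken.

```python
import statistics
import time

def time_algorithm(func, *args, repeats=5):
    """Run func(*args) several times and return the elapsed time of each run in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in workload: sorting a reversed list of 50,000 integers.
data = list(range(50_000, 0, -1))
timings = time_algorithm(sorted, data, repeats=5)

print(f"median time: {statistics.median(timings):.6f} s")
print(f"fastest run: {min(timings):.6f} s")
```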

Data collection methods

This is where the decisions get slightly easier. Having decided on the data you need and where it comes from, your choice of data collection method is narrowed down for you. As always, it is worth discussing your options and presenting your reasons for selecting a particular method. You might also select more than one source of data and a corresponding method of collection so that you can check for consistency. This is referred to as triangulation.

There is a great deal of information about data collection methods in the research methods literature. Most of it focuses on qualitative data because of the challenges related to ensuring that it is valid and reliable. Whatever data collection method(s) you decide to use, it is vital that you locate some reference material to guide you in carrying them out correctly. It is not sufficient to say that you are using a particular method and then to make the process up as you go along. In an academic project, you are expected to apply established methods in a competent way. Ignoring recognised procedures and going your own way is of no academic value.

What would be an appropriate way to process the data?

Another common misconception among students who are unfamiliar with data analysis is that it simply follows from the data collection. This problem might occur, for example, in a survey-based project where the questionnaire contents may have been carefully developed, but no thought has been given to the analysis of the results once they come in. The problems that arise might be conceptual - that is, something may have been overlooked in the question design so that a vital piece of data is not available for the required analysis. The problems might also be practical in that the analysis process turns out to be much harder and more time-consuming than had been assumed.

To avoid these issues, and to address the question of academic competence mentioned in the previous section, careful thought needs to be given to the selection of appropriate and feasible methods of data analysis. A very good approach is to test your method of analysis in advance using dummy data. This will reveal any gaps or practical issues with your plan. One specific case where this is important is where your main data collection method is a questionnaire. Before sending it out to your participants, it is good practice to carry out a pilot study to test the wording and sequence of the questions.
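Testing an analysis with dummy data might look like the sketch below. It assumes a questionnaire with three questions answered on a 1 to 5 scale; the question labels, the functions make_dummy_responses and analyse, and the choice of median and frequency counts are all illustrative assumptions rather than a prescribed method. The responses are random, so the output says nothing about real opinions; the point is only to check that the analysis runs and produces results in the form you expect.

```python
import random
import statistics
from collections import Counter

QUESTIONS = ["Q1", "Q2", "Q3"]  # hypothetical question labels

def make_dummy_responses(n, seed=1):
    """Generate n fake questionnaire responses on a 1-5 scale."""
    rng = random.Random(seed)
    return [{q: rng.randint(1, 5) for q in QUESTIONS} for _ in range(n)]

def analyse(responses):
    """Summarise each question by its median score and the count of each answer."""
    report = {}
    for q in QUESTIONS:
        scores = [r[q] for r in responses]
        report[q] = {
            "median": statistics.median(scores),
            "counts": Counter(scores),
        }
    return report

report = analyse(make_dummy_responses(20))
for q, result in report.items():
    counts = ", ".join(f"{score}: {result['counts'][score]}" for score in range(1, 6))
    print(f"{q} median={result['median']} counts=({counts})")
```

If a question that the analysis depends on turns out to be missing from the dummy responses, that gap is discovered before any real participants are involved.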

What form should my results take?

Bearing in mind that the most important criterion is the effectiveness of your communication with the Reader, you need to select a method of presentation that suits the type of data you are dealing with, and communicates its significance clearly. You should consider how the results can be summarised so that the main message stands out; you should not offer the Reader a mass of raw figures without interpretation.

There are two main options for the presentation of summary data - tables and charts. In both cases, you must also show the Reader how they were created from the raw data. One way to do that is to include the raw data in an appendix. When choosing the appropriate presentation, a handy source of inspiration is the set of references you have covered in the literature review. Many of them will contain results that have similar characteristics to yours, and so provide good examples for you to emulate.
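To illustrate how a summary table can be traced back to the raw data, the sketch below derives each table row directly from a list of raw measurements. The condition names and values are invented for the example, and the function summarise is a hypothetical helper; the columns shown (count, mean, standard deviation) are one common choice, not the only one.

```python
import statistics

# Invented raw data: accuracy scores from four runs of two hypothetical algorithms.
raw = {
    "Algorithm A": [0.82, 0.79, 0.85, 0.81],
    "Algorithm B": [0.74, 0.77, 0.72, 0.76],
}

def summarise(raw_data):
    """Build one summary row (condition, n, mean, standard deviation) per condition."""
    rows = []
    for condition, values in raw_data.items():
        rows.append((
            condition,
            len(values),
            statistics.mean(values),
            statistics.stdev(values),
        ))
    return rows

rows = summarise(raw)

# Print the summary as a plain-text table.
print(f"{'Condition':<12} {'n':>3} {'Mean':>6} {'SD':>6}")
for condition, n, mean, sd in rows:
    print(f"{condition:<12} {n:>3} {mean:>6.3f} {sd:>6.3f}")
```

Keeping the raw values and the code that summarises them together means the Reader can check any figure in the table against its source.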

Some types of data have established presentation conventions, and if that is the case, you should find out what those conventions are and follow them. The main example is tables of statistical test results. This blog post does a good job of explaining the main things you need to be aware of.

Further reading

Sage Methods: quantitative data analysis

Sage Methods: sampling

University of Michigan: experiments

Statistics without maths for psychology

Sources of open data

Kaggle

Twine.net