High Dimensional Data
Copyright Continuum Labs - 2023
In this video transcript, the speaker discusses big data and the challenges of analysing and understanding high-dimensional datasets.
The speaker explains that data can be considered "big" in two ways:
by having a large number of samples (rows)
by having a large number of measurements (columns) for each sample
The speaker illustrates the first case with the example of measuring the length of people's thumbs. If the dataset covers a very large number of people, such as the entire population of the United States or the world, it becomes large and difficult to handle even though it holds only one measurement per person.
Next, the speaker introduces the concept of high-dimensional datasets, where each sample (row) has multiple measurements (columns).
For example, instead of just measuring thumb length, one could measure the length of each finger, weight, height, shoe size, cholesterol level, and so on. As the number of measurements (m) increases, the dataset becomes high-dimensional, and the challenges associated with visualising, analysing, and understanding the data grow.
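To make the two senses of "big" concrete, here is a minimal Python sketch. The measurement names and values are made up for illustration and do not come from the video, and `shape` is a hypothetical helper rather than part of any library.

```python
# Illustrative values only: these are not real measurements.

# "Tall" data: many samples, one measurement each (thumb length in cm).
thumb_only = [[6.1], [5.8], [6.5], [6.0], [5.9]]

# "Wide" data: each sample carries several measurements.
column_names = ["thumb_cm", "height_cm", "weight_kg", "shoe_size"]
wide = [
    [6.1, 172.0, 68.5, 42.0],
    [5.8, 160.5, 55.0, 38.0],
]


def shape(matrix):
    """Return (n, m): the number of samples and measurements per sample."""
    n = len(matrix)
    m = len(matrix[0]) if n else 0
    # Every row must have the same length for the data to form a matrix.
    if any(len(row) != m for row in matrix):
        raise ValueError("every sample needs the same number of measurements")
    return n, m


for name, matrix in [("thumb_only", thumb_only), ("wide", wide)]:
    n, m = shape(matrix)
    print(f"{name}: n = {n} samples, m = {m} measurements")
```

The first dataset grows by adding rows, the second by adding columns; a dataset can of course grow in both directions at once.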
The complexities of high-dimensional datasets arise from the difficulty in visualising and interpreting the data.
While it is relatively easy to visualise datasets with one or two dimensions using scatter plots, it becomes increasingly difficult to visualise datasets with more than three dimensions.
The speaker mentions that one can use animations or other coding tricks to visualise up to four dimensions, but beyond that, it becomes extremely challenging.
Another complexity of high-dimensional datasets is the generalisation of analytical techniques.
The speaker mentions that some techniques, such as multivariate regression, can be generalised to handle high-dimensional data.
However, the goal is to find analytical techniques that can effectively handle datasets with a large number of samples (n) and a large number of measurements (m) for each sample.
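As one possible reading of a technique that generalises to many columns, the sketch below fits ordinary least squares with several measurements as predictors. This is an assumption about what the speaker has in mind, not a method shown in the video. The function names are hypothetical, and the data is synthetic, generated from a known linear rule so the result can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    # Augment the matrix so row operations update both sides together.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, size))
        x[row] = (aug[row][size] - tail) / aug[row][row]
    return x


def fit_least_squares(features, targets):
    """Return [intercept, w_1, ..., w_m] minimising the squared error."""
    # Prepend a constant 1 to each row so the first weight is the intercept.
    design = [[1.0] + list(row) for row in features]
    width = len(design[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(width)]
           for i in range(width)]
    xty = [sum(r[i] * y for r, y in zip(design, targets))
           for i in range(width)]
    return solve_linear_system(xtx, xty)


# Synthetic data: target = 1 + 2 * x1 + 3 * x2, with no noise.
features = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
targets = [1 + 2 * x1 + 3 * x2 for x1, x2 in features]
weights = fit_least_squares(features, targets)
print([round(w, 6) for w in weights])
```

The same code accepts any number of predictor columns, which is the sense in which the technique generalises. Once there are more unknown weights than samples, however, the normal equations no longer have a unique solution, which is one reason the plain approach stops being enough for very wide data.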
To deal with high-dimensional datasets, the speaker emphasises the importance of understanding the data, which involves two main aspects:
Pattern analysis: This involves finding patterns in the data where certain factors may co-occur. In high-dimensional datasets, performing pairwise correlations between all variables becomes tedious and inaccurate. Instead, the goal is to look at the entire dataset as a whole and identify high-dimensional patterns that emerge.
Prediction: This involves using the wealth of data available to make informed predictions about new samples. Given a new person's subset of symptoms and characteristics, can we predict what we might expect to see for the other measurements based on the patterns observed in the existing data?
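The two aspects above can be sketched in a few lines of Python. For pattern analysis, the sketch counts how many column pairs a pairwise approach would have to examine and computes a Pearson correlation for each pair. For prediction, it uses a nearest-neighbour average, which is a stand-in chosen for simplicity and not a technique named by the speaker. All function names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_count(m):
    """Number of distinct column pairs among m measurements."""
    return m * (m - 1) // 2


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(data):
    """Map each column pair (i, j) with i < j to its correlation."""
    columns = list(zip(*data))  # transpose rows into columns
    return {(i, j): pearson(columns[i], columns[j])
            for i, j in combinations(range(len(columns)), 2)}


def predict_missing(data, known, k=2):
    """Estimate the unobserved columns of a new sample.

    data:  complete rows (samples) already collected.
    known: {column index: value} for the measurements we do have.
    Averages the k rows closest to the new sample on the known columns.
    Columns are not rescaled here, so large-valued columns dominate
    the distance; real data would usually be standardised first.
    """
    if not known:
        raise ValueError("need at least one known measurement")

    def distance(row):
        return sqrt(sum((row[c] - v) ** 2 for c, v in known.items()))

    nearest = sorted(data, key=distance)[:k]
    return {c: sum(row[c] for row in nearest) / len(nearest)
            for c in range(len(data[0])) if c not in known}


# Illustrative values only.
data = [[1, 10, 100], [2, 20, 200], [9, 90, 900]]
print(pair_count(1000))                      # pairs to inspect for m = 1000
print(pairwise_correlations(data))
print(predict_missing(data, {0: 1.5}, k=2))  # estimate columns 1 and 2
```

The pair count grows quadratically with the number of measurements, which is why checking every pair quickly becomes impractical and whole-dataset methods are preferred.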
The speaker concludes by stating that there are many techniques available for dealing with high-dimensional datasets. Throughout the discussion, the speaker emphasises the importance of keeping in mind the structure of the data matrix, with n rows representing samples and m columns representing measurements, and how this structure allows for slicing and analysing the data in various ways.
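As a rough illustration of slicing the data matrix, the sketch below stores the matrix as a list of rows and pulls out one sample, one measurement, or a sub-block. The helper names and the values are hypothetical and chosen only for the example.

```python
# Illustrative matrix: n = 3 samples (rows), m = 4 measurements (columns).
data = [
    [6.1, 172.0, 68.5, 42.0],
    [5.8, 160.5, 55.0, 38.0],
    [6.5, 181.0, 80.2, 44.0],
]


def get_row(matrix, i):
    """All m measurements for sample i."""
    return list(matrix[i])


def get_column(matrix, j):
    """Measurement j across all n samples."""
    return [row[j] for row in matrix]


def select(matrix, rows, cols):
    """Sub-matrix restricted to the given row and column indices."""
    return [[matrix[i][j] for j in cols] for i in rows]


print(get_row(data, 0))              # one person, every measurement
print(get_column(data, 1))           # one measurement, every person
print(select(data, [0, 2], [0, 3]))  # two people, two measurements
```

Slicing by row answers questions about an individual sample, while slicing by column answers questions about how one measurement varies across the population.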
In summary, the main challenges of high-dimensional datasets include visualisation, generalisation of analytical techniques, and understanding the data through pattern analysis and prediction.
Addressing these challenges is crucial for extracting valuable insights from big data in fields such as public health and personalised medicine.