
I tested whether rock types and their associations can be differentiated from the patterns of words that occur around them in large archives of geological reports. Using a text embeddings model trained by unsupervised machine learning on thousands of geological survey reports, approximately 2,000 rock type names were compared to each other. The embeddings were then reduced in dimensionality with t-SNE for plotting.
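To make the comparison step concrete, here is a minimal sketch of how rock type names can be compared through their embedding vectors using cosine similarity. The vectors and the names `EMBEDDINGS`, `cosine_similarity` and `nearest_neighbours` are hypothetical, made up for illustration. They are not the model or values behind the plot, and the t-SNE projection step is not shown.

```python
import math

# Hypothetical toy vectors standing in for learned embeddings. A real model
# would have far more dimensions and roughly 2,000 rock type names.
EMBEDDINGS = {
    "halite":    [0.9, 0.1, 0.0],
    "anhydrite": [0.8, 0.2, 0.1],
    "basalt":    [0.1, 0.9, 0.2],
    "andesite":  [0.2, 0.8, 0.3],
    "sandstone": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(name, embeddings, k=2):
    """Return the k rock types most similar to `name`, most similar first."""
    query = embeddings[name]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != name
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for rock in ("halite", "basalt"):
        print(rock, nearest_neighbours(rock, EMBEDDINGS, k=2))
```

In this toy data the two salts sit closest to each other, as do the two volcanics, which is the kind of grouping the plot then makes visible across all rock types at once.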
Some rock types such as salts, volcanics, organics and glacial deposits appear to be clearly differentiated by the similarity of the words surrounding their names, as are igneous intrusions and metamorphic classifications, with some overlap between these two in places. Carbonates seem well differentiated, with a split of some types closely associated with organics. Extra-terrestrial rocks appear split, perhaps with micrometeorites differentiated, which requires further investigation.
Clastics seem split into two. At first glance this may be differentiating superficial deposits, unconsolidated sediments and gravels (left-hand group) from the larger clastic group on the right. Mudrocks (inc. fine-grained clastics) appear split into four, with one group closely associated with organics, one with carbonates and two with the two clastic groups mentioned above.
Many rock types overlap or are poorly differentiated in the middle of the plot, and there are plenty of intriguing ‘outliers’ to investigate and try to understand.
I don’t have a very specific use case, question or problem in mind per se. I’m just inductively exploring the data, seeing what may be of interest and thinking about what use cases, if any, might emerge. Each data point is a specific rock type classification (e.g. peridotite, schist, sandstone, marl, tuff); I have not labelled them on the chart, for readability.
One very early, poorly formed idea might be to have a geological embedding ‘baseline’ against which to revisit old reports or even compare newly described rocks in reports. If these new occurrences plot well outside existing similarity tolerances built from vast collections, it might point to something worth examining, such as a potential misclassification, novel associations or something else that may be significant for some aspect of re-interpretation.
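As a rough illustration of the ‘baseline’ idea, the sketch below builds a tolerance for one rock type from existing occurrences and flags a new occurrence that falls outside it. Everything here is an assumption made for illustration: the toy vectors, the function names, and the choice of tolerance (mean cosine distance from the centroid plus three standard deviations). I have not tested this on real reports, and other distance measures or thresholds may suit better.

```python
import math
import statistics


def cosine_distance(a, b):
    """1 minus cosine similarity: 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_baseline(vectors, n_sigma=3.0):
    """Return (centre, tolerance) for one rock type.

    Assumed rule: tolerance = mean distance to the centre plus n_sigma
    population standard deviations of those distances.
    """
    centre = centroid(vectors)
    distances = [cosine_distance(v, centre) for v in vectors]
    tolerance = statistics.mean(distances) + n_sigma * statistics.pstdev(distances)
    return centre, tolerance


def is_outlier(vector, baseline):
    """True if the vector lies further from the centre than the tolerance."""
    centre, tolerance = baseline
    return cosine_distance(vector, centre) > tolerance


# Hypothetical embeddings of 'sandstone' as described in four existing reports.
SANDSTONE_BASELINE = build_baseline([
    [0.10, 0.20, 0.90],
    [0.15, 0.25, 0.85],
    [0.05, 0.20, 0.95],
    [0.12, 0.18, 0.90],
])

# Two hypothetical new occurrences labelled 'sandstone' in new reports.
TYPICAL_OCCURRENCE = [0.11, 0.21, 0.88]
ODD_OCCURRENCE = [0.90, 0.10, 0.00]

if __name__ == "__main__":
    print("typical flagged:", is_outlier(TYPICAL_OCCURRENCE, SANDSTONE_BASELINE))
    print("odd flagged:", is_outlier(ODD_OCCURRENCE, SANDSTONE_BASELINE))
```

A flagged occurrence would only be a prompt to look closer, not evidence of a misclassification in itself.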
Some considerations. Would a single generalisable embeddings model ‘work’ globally, or, perhaps more likely, would several regional ones be needed? Another element may be the changing nature of language across different authors and evolving science, and the impact of different languages and nationalities. Many other aspects of course!
#geology #lithology #rock #geoscience #earthscience #machinelearning #artificialintelligence #ai #textembeddings #naturallanguageprocessing #analytics #datascience #bigdata #datadiscovery #mining #oilandgas #geothermal #hydrogeology #geotechnical #geologicalengineering #planetarygeology #geohazards #ccs #research #datamanagement #datainnovation #datamining #subsurface