
METHODS & CODE BLOCKS

Python


Initial Processing

1. Import the contents of the transcripts downloaded from the Chinese court documents archive: 120 court case transcripts.

(40 for each of the following groups: Tibet with a female defendant, Tibet with a male defendant, Shanghai with a female defendant, and Shanghai with a male defendant.)

2. Filter the content of the documents, removing all punctuation and tags.
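
The two steps above can be sketched roughly as follows. This is a minimal sketch, not the code used in the project: it assumes the transcripts are stored as UTF-8 `.txt` files in a single folder and that "tags" means HTML-style markup. The names `load_transcripts` and `clean_text` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: "tags" means HTML/XML-style markup such as <p> or <br/>.
TAG_PATTERN = re.compile(r"<[^>]+>")
# In Python 3, \w matches Chinese characters, letters and digits, so this
# pattern matches anything that is neither a word character nor whitespace.
PUNCT_PATTERN = re.compile(r"[^\w\s]")


def load_transcripts(folder):
    """Read every .txt file in `folder` into a {filename: text} dict."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.txt"))
    }


def clean_text(raw):
    """Strip markup tags and punctuation, then collapse whitespace."""
    no_tags = TAG_PATTERN.sub(" ", raw)
    no_punct = PUNCT_PATTERN.sub(" ", no_tags)
    # The underscore counts as a word character, so drop it explicitly.
    no_punct = no_punct.replace("_", " ")
    return " ".join(no_punct.split())


sample = "<p>原告与被告于2010年登记结婚。</p>"
print(clean_text(sample))
```

Replacing tags and punctuation with a space, rather than deleting them, keeps neighbouring words from being fused together before tokenization.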

Tokenization

1. Import a word segmentation library such as jieba

(Using its built-in dictionary, it recognizes and separates continuous Chinese characters into words.)

2. Use separate stop-word lists to identify and remove common stop words
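
As an illustration of these two steps, here is a hedged sketch. It assumes the stop words are kept in a plain-text file with one word per line; the helper names `tokenize` and `load_stop_words` and the `cut` parameter are hypothetical. `jieba.lcut` is the jieba call that returns a list of words, and it is imported lazily so the function can also be tried with a different segmenter.

```python
from typing import Callable, Iterable, List, Optional, Set


def load_stop_words(path: str) -> Set[str]:
    """Read a stop-word list with one word per line (assumed UTF-8)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def tokenize(text: str,
             stop_words: Iterable[str],
             cut: Optional[Callable[[str], List[str]]] = None) -> List[str]:
    """Segment `text` into words, dropping stop words and blank tokens."""
    if cut is None:
        # Third-party package, imported only when no segmenter is supplied.
        import jieba
        cut = jieba.lcut
    stop = set(stop_words)
    return [tok for tok in cut(text) if tok.strip() and tok not in stop]


# Typical use (the file name is a placeholder, not from the project):
#   stop_words = load_stop_words("stopwords.txt")
#   tokens = tokenize(cleaned_text, stop_words)
```

Converting the stop words to a set keeps the membership test fast when each transcript contains thousands of tokens.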


Visualisation

1. Recombination

Tokenized documents are recombined into a single string for further analysis.

2. The data is processed into visual output in the form of word clouds

The specifics of the visual output are adjusted.
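
A possible shape for the recombination and word cloud steps is sketched below, assuming the third-party `wordcloud` package. The function names, the image size, the colour and the word limit are illustrative choices, not the settings used for the project. A font that contains Chinese glyphs has to be supplied through `font_path`, otherwise the characters may render as empty boxes.

```python
from typing import Iterable, List


def recombine(tokenized_docs: Iterable[List[str]]) -> str:
    """Join the tokens of all documents into one space-separated string."""
    return " ".join(token for doc in tokenized_docs for token in doc)


def save_word_cloud(text: str, font_path: str, out_path: str) -> None:
    """Render `text` as a word cloud image (requires the wordcloud package)."""
    from wordcloud import WordCloud  # third-party, imported lazily

    cloud = WordCloud(
        font_path=font_path,       # font with Chinese glyphs
        width=800,
        height=600,
        background_color="white",
        max_words=100,
        collocations=False,        # do not merge adjacent words into pairs
    )
    cloud.generate(text)
    cloud.to_file(out_path)


# Typical use (paths are placeholders):
#   text = recombine(all_tokenized_documents)
#   save_word_cloud(text, "path/to/chinese_font.ttf", "wordcloud.png")
```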

TF-IDF Analysis

1. Translation

The keywords of the processed documents are translated into English to facilitate analysis.

2. TF-IDF

TF-IDF analysis is performed to weigh the frequency of previously defined keywords, in order to examine the relationship between the causes of a divorce trial and the geographical location and gender of the defendant in the case brought to court.
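
To make the calculation concrete, here is a small pure-Python sketch. It is an assumption-laden illustration rather than the project's code: the keyword dictionary and the toy token lists are invented, each group is treated as one document, and the weighting uses term frequency times a smoothed inverse document frequency, ln((1 + N) / (1 + df)) + 1, so that a keyword present in every group still receives a nonzero score. The TF-IDF variant actually used may differ.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical keywords with English labels; not taken from the study.
KEYWORDS_EN = {
    "家暴": "domestic violence",
    "赌博": "gambling",
    "分居": "separation",
}


def tf_idf(groups: Dict[str, List[str]],
           keywords: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Return {group: {english_keyword: tf-idf score}} for the keywords."""
    n_groups = len(groups)
    counts = {name: Counter(tokens) for name, tokens in groups.items()}
    scores: Dict[str, Dict[str, float]] = {name: {} for name in groups}
    for zh, en in keywords.items():
        # Document frequency: number of groups in which the keyword occurs.
        df = sum(1 for c in counts.values() if c[zh] > 0)
        idf = math.log((1 + n_groups) / (1 + df)) + 1  # smoothed IDF
        for name, tokens in groups.items():
            # Term frequency: share of the group's tokens that are this keyword.
            tf = counts[name][zh] / len(tokens) if tokens else 0.0
            scores[name][en] = tf * idf
    return scores


# Toy input for illustration only.
toy_groups = {
    "group_a": ["家暴", "分居", "分居", "财产"],
    "group_b": ["赌博", "分居", "财产", "财产"],
}
for group, row in tf_idf(toy_groups, KEYWORDS_EN).items():
    print(group, {k: round(v, 3) for k, v in row.items()})
```

Restricting the scores to a predefined keyword dictionary keeps the output small and lets the English labels be attached at the same time.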


Heat Maps

After TF-IDF results are obtained, heat maps are generated.

Keywords are replaced with their English translations to generate the visual output. Each heat map displays the frequency of every keyword within each group.
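
One way to draw such a heat map is sketched below with matplotlib. This is an assumption, since the plotting library used for the project is not stated here. The input format `{group: {keyword: score}}`, the function names and the colour map are illustrative.

```python
from typing import Dict, List, Tuple


def to_matrix(scores: Dict[str, Dict[str, float]]
              ) -> Tuple[List[str], List[str], List[List[float]]]:
    """Turn {group: {keyword: score}} into row labels, column labels, matrix."""
    groups = list(scores)
    keywords = sorted({kw for row in scores.values() for kw in row})
    # Keywords missing from a group are shown as 0.0.
    matrix = [[scores[g].get(kw, 0.0) for kw in keywords] for g in groups]
    return groups, keywords, matrix


def save_heat_map(scores: Dict[str, Dict[str, float]], out_path: str) -> None:
    """Draw the score matrix as a heat map (requires matplotlib)."""
    import matplotlib.pyplot as plt  # third-party, imported lazily

    groups, keywords, matrix = to_matrix(scores)
    fig, ax = plt.subplots(figsize=(8, 4))
    image = ax.imshow(matrix, cmap="YlOrRd", aspect="auto")
    ax.set_xticks(range(len(keywords)))
    ax.set_xticklabels(keywords, rotation=45, ha="right")
    ax.set_yticks(range(len(groups)))
    ax.set_yticklabels(groups)
    fig.colorbar(image, ax=ax, label="TF-IDF score")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

Using the English keyword labels on the axes avoids the need for a Chinese-capable font in the figure.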


© 2024 by Xinyue Xu

