Creating multiple nodes with the same name using child nodes

啊啊啊啊啊吖

2019-03-12 Views: 854


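I have a text file that I parse in Python with the xml.etree.cElementTree library. The input is a paragraph containing sentences <s>, and each sentence contains words <w>. Here is what the text file looks like: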

This

first

sentence.

This

second

sentence.

As output, I would like the following XML file:

<s>

<w>first</w>

<w>sentence</w>

</s>

<s>

<w>second</w>

<w>sentence</w>

</s>

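I wrote the following Python code. It gives me the paragraph tag and the word tags, but I do not know how to produce multiple <s> tags. A sentence starts with an uppercase letter and ends with a period. My Python code: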

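    import re
    import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

    # Assumed setup (not shown in the original post): a <p> root element and its tree.
    p = ET.Element("p")
    tree = ET.ElementTree(p)

    with open("file.txt", "r") as source_file:
        for line in source_file:
            # catch punctuation: . , ! ( )  (note: ? is not in the pattern)
            if re.match(r"(\(|\)|\.|\,|\!)", line):
                ET.SubElement(p, "pc").text = line
            else:
                ET.SubElement(p, "w").text = line

    tree.write("my_file.xml", encoding="UTF-8", xml_declaration=True)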

It produces the following XML output:

<?xml version="1.0" encoding="UTF-8"?>

<w>first</w>

<w>sentence</w>

<w>second</w>

<w>sentence</w>

The problem I face is that I cannot create a new <s> tag for each new sentence. Is there a way to do this in Python with the XML library?

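Solution: basically, you need some logic that recognizes the start of a new sentence. Leaving out the obvious parts, something like this: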

    eos = False  # True once a punctuation token has been seen in the current sentence
    s = ET.SubElement(p, "s")
    for line in source_file:
        line = line.rstrip("\r\n")  # remove the newline at the end of each line
        # catch punctuation: . , ! ( )
        # re.match only matches at the start of the string, so this will not
        # match 'sentence.'; you will need to verify
        if re.match(r"(\(|\)|\.|\,|\!)", line):
            ET.SubElement(s, "pc").text = line
            eos = True
        else:
            # open a new <s> when the previous sentence has ended and this word is capitalized
            if eos and line.strip() and line[0].isupper():
                s = ET.SubElement(p, "s")
                eos = False
            ET.SubElement(s, "w").text = line
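The loop above can be tried without any files. The following is only a minimal sketch under stated assumptions: the token list `lines`, the helper name `build_paragraph`, and the `<p>` root are hypothetical stand-ins for the parts the post leaves out, and each punctuation mark is assumed to sit on its own input line. One deliberate difference from the snippet above: only `.`, `!` and `?` close a sentence, so a comma followed by a capitalized word does not start a new `<s>`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: one token per line, punctuation on its own line.
lines = ["This", "first", "sentence", ".", "This", "second", "sentence", "."]

PUNCT = re.compile(r"[().,!?]")      # token starts with a punctuation mark
SENTENCE_END = {".", "!", "?"}       # assumption: only these close a sentence


def build_paragraph(tokens):
    """Group tokens into <s> elements under a single <p> element."""
    p = ET.Element("p")
    s = ET.SubElement(p, "s")
    eos = False  # True once the current sentence has been closed
    for token in tokens:
        token = token.strip()
        if not token:
            continue  # skip blank lines
        if PUNCT.match(token):
            ET.SubElement(s, "pc").text = token
            if token in SENTENCE_END:
                eos = True
        else:
            # A capitalized word after a closed sentence opens a new <s>.
            if eos and token[0].isupper():
                s = ET.SubElement(p, "s")
                eos = False
            ET.SubElement(s, "w").text = token
    return p


root = build_paragraph(lines)
print(ET.tostring(root, encoding="unicode"))
```

Calling `ET.SubElement(p, "s")` repeatedly is what creates several sibling nodes with the same name; rebinding `s` simply redirects where the following `<w>` and `<pc>` children are attached.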

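Also, your regular expression probably needs fixing: re.match only matches at the start of the string, so a token such as 'sentence.' is not recognized as punctuation.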
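To illustrate the regex issue, here is a small sketch of one possible fix, not necessarily what the answer's author had in mind. `TOKEN_PATTERN` and `split_token` are hypothetical names introduced for this example. It assumes each token is a single line and only separates punctuation at the end of a token, so a leading parenthesis stays attached to the word.

```python
import re

# The pattern from the question, written as a raw string.
QUESTION_PATTERN = re.compile(r"(\(|\)|\.|\,|\!)")

# Hypothetical replacement: a word followed by optional trailing punctuation.
TOKEN_PATTERN = re.compile(r"^(?P<word>.*?)(?P<punct>[().,!?]*)$")


def split_token(token):
    """Split a single-line token such as 'sentence.' into (word, punctuation)."""
    m = TOKEN_PATTERN.match(token)
    return m.group("word"), m.group("punct")


# re.match anchors at position 0, so the trailing period is never seen.
print(QUESTION_PATTERN.match("sentence.") is None)  # True
# A token that is only punctuation does match.
print(QUESTION_PATTERN.match(".") is not None)      # True
print(split_token("sentence."))                     # ('sentence', '.')
```

With a helper like this, the word part can go into a `<w>` element and a non-empty punctuation part into a `<pc>` element, and a punctuation part containing a period can mark the end of the sentence.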
